Data Management

Learn how to manage your data to ensure it stays available and readable, how to write a data management plan, and how to comply with funding agency data mandates.
When packaging your data, think long-term and long-distance.  Your goal as an author is to make your data usable by as many people as possible, whether they are far away geographically or living in the future, without extra input or instructions.  Here are some tips to consider when formatting and preparing your data for preservation:
 
  • Think long-term!  Package your data in open, non-proprietary formats that will remain accessible as technology evolves, as opposed to proprietary formats whose use is restricted to particular programs or people.
    • Some proprietary formats are more ubiquitous than others--it's unlikely Microsoft products will disappear anytime soon.  Still, it's best to package data in formats that are not tied to a single program and are more universally accessible.
    • Consider:
      • XML instead of databases.
      • TIFF for images.  Most web browsers do not display TIFF files natively, but the format is well suited to preserving image data without loss.
      • PDF for text files.  TXT, RTF, and HTML files are also acceptable.
      • CSV (rather than Excel's .xlsx) for raw data.
  • Recognize that in some cases, you may need to save your data in two different formats:  one you can manipulate for your work, and one for long-term preservation.
     
  • Create good metadata for your research data!  Identify any relevant standards for data and metadata content and format, and follow them to make sure the data can be used by others.
     
  • Know whether any laws or regulations will affect how you package your data.  For example, HIPAA regulations and NIH policies will affect how you treat personally identifiable information while packaging data that is meant to be widely shared.
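
To illustrate the two-format advice above, here is a minimal Python sketch that serializes working data to plain CSV text for a preservation copy, using only the standard library.  The function names, column names, and sample rows are hypothetical and chosen only for illustration.

```python
import csv
import io


def rows_to_csv_text(rows, fieldnames):
    """Serialize a list of dict rows to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def csv_text_to_rows(text):
    """Read CSV text back into a list of dicts (every value returns as a string)."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical sample data standing in for a working dataset.
sample_rows = [
    {"site": "A", "temp_c": 12.5},
    {"site": "B", "temp_c": 14.1},
]
csv_text = rows_to_csv_text(sample_rows, ["site", "temp_c"])
```

Writing the resulting text to a file opened with encoding="utf-8" and newline="" gives a copy that text editors and spreadsheet programs can read.  CSV does not record data types, so values read back are strings; note the intended types and units in your documentation.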

Listed below are links to information about suggested data standards for making research data more universally findable, searchable, and citable.

Metadata is, at its simplest, data about data.  It usually includes information about the content, context, and/or accessibility of a data set.  Descriptive metadata may be a required consideration in data management plans and is vital in making published data sets more findable, accessible, interoperable, and reusable (FAIR).

Metadata can exist in multiple formats, including as a separate text or HTML document that accompanies a data set, an XML document linked to the data files, or as information embedded in an XML data file.  (XML is often used for metadata records because it can be easily integrated into many different systems.)
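
As a rough sketch of the second option above (an XML document linked to the data files), the following Python example builds a small metadata record with the standard library's xml.etree.ElementTree module.  The element names, the file name survey_2020.csv, and the field values are hypothetical; they do not follow any published metadata standard.

```python
import xml.etree.ElementTree as ET


def build_metadata_xml(data_file, fields):
    """Build a small XML metadata record that points at a data file.

    Element names are illustrative only and do not follow any
    published metadata standard.
    """
    root = ET.Element("metadata")
    # Link the record to the data file it describes.
    ET.SubElement(root, "dataFile").text = data_file
    for name, value in fields.items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical file name and field values.
record_xml = build_metadata_xml(
    "survey_2020.csv",
    {"title": "Example survey", "creator": "Example Lab"},
)
```

In practice, the element names would come from whichever metadata standard your discipline or repository uses.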

Various metadata standards specify what pieces of information to include and how to express them when describing a data set. Each metadata standard is composed of various elements or fields: individual pieces of information that make it easier to search for similar items through shared terminology and construction.  There are three main types of metadata elements:

  • Descriptive: describes the content and context of an object, e.g., title, author/creator, subject
  • Technical/structural: describes the format, process, and interrelatedness of an object, e.g., file format, size, dimensions (for images), set (if part of a series)
  • Administrative: describes the information needed to manage or use an object, e.g., permissions, creation date, required software, provenance
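
One way to picture the three element types is to sort a flat metadata record into them.  The sketch below does this in Python; the field names are adapted from the examples in the list above, and the lookup table is an illustrative assumption rather than part of any standard.

```python
# Field names adapted from the examples in the list above; the grouping
# table is an illustrative assumption, not a published standard.
ELEMENT_TYPES = {
    "descriptive": {"title", "creator", "subject"},
    "technical": {"file_format", "size", "dimensions", "set"},
    "administrative": {
        "permissions", "creation_date", "required_software", "provenance",
    },
}


def group_by_type(record):
    """Sort a flat metadata record into the three element types."""
    grouped = {kind: {} for kind in ELEMENT_TYPES}
    grouped["unclassified"] = {}
    for name, value in record.items():
        for kind, names in ELEMENT_TYPES.items():
            if name in names:
                grouped[kind][name] = value
                break
        else:
            # No element type claimed this field.
            grouped["unclassified"][name] = value
    return grouped
```

Fields that land under "unclassified" are a prompt to check whether the chosen standard has a matching element.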

Some examples of metadata standards are linked below, along with a description of whether they are best used within a specific discipline or across many subject areas.

Preparing Yourself

  1. Prepare yourself!  Creating good metadata begins with preparation and organization.  Gather all of your information together, especially if it is distributed among multiple people.  Then you can plan what you need to do.
     
  2. Use existing information whenever possible.  The information will often already be written by the time you need it for a metadata record.  Reuse text from your funding proposals, such as the abstract, purpose, location, etc.  You can also create a data dictionary during the data collection and analysis stages of your research and reference that in your metadata.
     
  3. Choose keywords and other descriptive tags wisely.  Consider all the interpretations of your vocabulary choices, and use a thesaurus to come up with alternate terms you may not otherwise have thought of. 
     
  4. Review your metadata to make sure it is complete and accurate.  Include as many details as possible so users know what to expect from your data before they begin working through it.  Make sure your descriptions are clear and do not omit any important information.
     
  5. If possible, include unique identifiers like an ORCID (Open Researcher and Contributor ID) for the authors of and contributors to the research.
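
ORCID iDs end in a check character computed with the ISO 7064 MOD 11-2 algorithm, so many typing errors can be caught before an iD goes into a metadata record.  The following Python sketch shows one way to do that; the function names are hypothetical.  It checks only that an iD is well formed, not that it is registered or belongs to a particular person.

```python
import re

# Four groups of four characters; only the last character may be "X".
ORCID_PATTERN = re.compile(r"^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$")


def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed_orcid(orcid):
    """Return True if the iD has the right shape and a matching check character."""
    if not ORCID_PATTERN.match(orcid):
        return False
    compact = orcid.replace("-", "")
    return orcid_check_character(compact[:15]) == compact[15]
```

For example, is_well_formed_orcid("0000-0002-1825-0097") returns True, while changing the final digit to 8 makes it return False.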

Preparing Your Metadata

At the dataset level, good metadata includes information about:

  • The context of the data: project history, objectives, hypotheses.
  • Data collection methods: protocols, sampling, instruments, data scale, resolution, temporal coverage, geographic coverage, hardware/software and other equipment.
  • Structure of the data: data files, relationships between files.
  • Sources of data used
  • Data checking, validation, proofing
  • Modifications to the data since their creation
  • Identification of different versions of the data
  • Information on access and use conditions, confidentiality, etc., where necessary

This information may be contained in a separate document that accompanies the data files.
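
The checklist above can be turned into a simple completeness check for such an accompanying document.  The Python sketch below is one possible approach; the section keys are hypothetical shorthand for the items in the list, and access conditions are treated like any other section even though they apply only where necessary.

```python
# Hypothetical shorthand keys for the dataset-level items listed above.
DATASET_LEVEL_SECTIONS = [
    "context",
    "collection_methods",
    "structure",
    "sources",
    "validation",
    "modifications",
    "versions",
    "access_conditions",
]


def _text(record, section):
    """Return the stripped text for a section, or "" if absent or empty."""
    value = record.get(section)
    return "" if value is None else str(value).strip()


def missing_sections(record):
    """List the sections that have not been filled in yet."""
    return [s for s in DATASET_LEVEL_SECTIONS if not _text(record, s)]


def render_readme(record):
    """Render the record as plain text, flagging undocumented sections."""
    blocks = []
    for section in DATASET_LEVEL_SECTIONS:
        heading = section.replace("_", " ").upper()
        body = _text(record, section) or "(not yet documented)"
        blocks.append(f"{heading}\n{body}\n")
    return "\n".join(blocks)
```

Running missing_sections on a draft record before deposit gives a quick list of what still needs to be written.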

At the individual data level, good metadata includes information about:

  • Names, labels, and descriptions for variables, records, and their values
  • Explanation of codes and classification schema
  • Explanation and reasons for missing values
  • Data derived after collection, with information on how they were created (code, algorithms, or command files)

This information may be embedded within a dataset itself or contained within a separate document that accompanies the data files.
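
A data dictionary is one common way to record this variable-level information.  The Python sketch below shows a small, hypothetical dictionary with labels, code meanings, and missing-value codes, plus a helper that uses it to explain a raw value.  The variable names and codes are invented for illustration.

```python
# Hypothetical data dictionary: variable names, labels, and codes are
# invented for illustration only.
DATA_DICTIONARY = {
    "sex": {
        "label": "Sex of respondent",
        "codes": {"1": "female", "2": "male"},
        "missing": {"-9": "not asked", "-8": "refused"},
    },
    "age": {
        "label": "Age in years at interview",
        "codes": {},
        "missing": {"-9": "not asked"},
    },
}


def describe_value(variable, raw_value):
    """Explain a raw value using the variable's dictionary entry."""
    entry = DATA_DICTIONARY[variable]
    raw = str(raw_value)
    if raw in entry["missing"]:
        reason = entry["missing"][raw]
        return f"{entry['label']}: missing ({reason})"
    # Uncoded variables (such as age) fall back to the raw value.
    meaning = entry["codes"].get(raw, raw)
    return f"{entry['label']}: {meaning}"
```

Here describe_value("sex", "-8") returns "Sex of respondent: missing (refused)".  Keeping the dictionary in a separate plain-text file alongside the data lets others interpret the codes without access to your code.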