LibGuides: Data Management: How to Package Your Data

When packaging your data, think long-term and long-distance. Your goal as an author is to make your data usable by as many people as possible, whether they are far away geographically or living in the future, without extra input or instructions. Here are some tips to consider when formatting and preparing your data for preservation:

Think long-term! Package your data in open, non-proprietary formats that will remain accessible as technology evolves, rather than in proprietary formats whose use is restricted to particular programs or people.
- Some proprietary formats are more ubiquitous than others; it's unlikely that Microsoft products will disappear anytime soon. Still, it's best to package data in formats that can be read by more than one program and are more universally accessible.
- Consider:
  - XML instead of databases.
  - TIFF for images. TIFF files generally will not display in web browsers, but they are the best choice for preserving image data without loss.
  - PDF for text files. TXT, RTF, and HTML files are also acceptable.
  - CSV (rather than Excel's .xlsx) for raw data.
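
As a minimal sketch of the CSV recommendation above, the following Python snippet serializes a few records to plain CSV text using only the standard library. The column names, values, and the `to_csv_text` helper are hypothetical and stand in for whatever your working file actually contains.

```python
import csv
import io

# Hypothetical example records; in practice these would come from your working file.
rows = [
    {"site_id": "A01", "date": "2023-05-01", "temperature_c": 14.2},
    {"site_id": "A02", "date": "2023-05-01", "temperature_c": 15.8},
]


def to_csv_text(records):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(records[0].keys()),
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv_text(rows)
print(csv_text)
```

The resulting text can be saved as a UTF-8 `.csv` file alongside the working copy, so the preservation copy does not depend on any one program.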

Recognize that in some cases, you may need to save your data in two different formats: one you can manipulate for your work, and one for long-term preservation.
Create good metadata for your research data! Identify any relevant standards for data and metadata content and format, and follow them to make sure the data can be used by others.
Know whether any laws or regulations will affect how you package your data. For example, HIPAA regulations and NIH policies will affect how you treat personally identifiable information while packaging data that is meant to be widely shared.

Listed below are links to information about suggested data standards for making research data more universally findable, searchable, and citable.

Library of Congress Recommended Formats Statement: Provides recommendations on formatting and metadata for a variety of types of content, both physical and digital.
FAIRsharing.org: A curated, informative, and educational resource on data and metadata standards, inter-related to databases and data policies. Guides consumers to discover, select, and use these resources with confidence. Helps producers make their resources more visible, more widely adopted, and more widely cited. Provides humans and tools with access to trustworthy content to enable data management tasks.
Research Resource Identifiers (RRIDs): ID numbers assigned to help researchers cite key resources (antibodies, model organisms, and software projects) in the biomedical literature, improving the transparency of research methods.
Digital Curation Centre: How to Cite Datasets and Link to Publications: This guide will help you create links between your academic publications and the underlying datasets. It provides a working knowledge of the issues and challenges involved, and of how current approaches seek to address them.
Digital Curation Centre's List of General Research Data Standards and Tools
Guide to the NIH CDE Repository: Common Data Elements (CDEs) are standardized, precisely defined questions paired with a set of specific allowable responses, used systematically across different sites, studies, or clinical trials to ensure consistent data collection. Because the elements are shared, CDEs let researchers work together and give them larger sets of reusable data.
Sustainability of Digital Formats (Library of Congress): The Library of Congress offers advice on file format preservation. LC lists sustainability factors that include documentation; human readability; embedded metadata; dependence on particular hardware, operating system, or software; costs; encryption; and other technical mechanisms that may prevent format migration.
Eleven quick tips for properly handling tabular data - PLOS

Metadata is, at its simplest, data about data. It usually includes information about the content, context, and/or accessibility of a data set. Descriptive metadata may be a required consideration in data management plans and is vital in making published data sets more findable, accessible, interoperable, and reusable (FAIR).

Metadata can exist in multiple formats, including as a separate text or HTML document that accompanies a data set, an XML document linked to the data files, or as information embedded in an XML data file. (XML is often used for metadata records because it can be easily integrated into many different systems.)
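
To illustrate the "separate XML document" option, here is a minimal sketch that builds a small metadata record with Python's standard `xml.etree.ElementTree` module. It uses four elements from the Dublin Core element set (title, creator, date, format); the `metadata` wrapper element, the `build_record` helper, and all field values are assumptions made for illustration, not a required layout.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_record(fields):
    """Build an XML element tree with one dc:* child per field."""
    # "metadata" is a hypothetical wrapper element chosen for this sketch.
    root = ET.Element("metadata")
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


record = build_record({
    "title": "Stream temperature measurements (hypothetical)",
    "creator": "Doe, Jane",
    "date": "2023-05-01",
    "format": "text/csv",
})

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

A repository or discipline may require a specific schema and wrapper, so treat this only as a starting point and check the relevant standard before depositing.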

Various metadata standards specify what pieces of information to include and how to express them when describing a data set. Each metadata standard is composed of various elements or fields: individual pieces of information that make it easier to search for similar items through shared terminology and construction. There are three main types of metadata elements:

Descriptive: describes the content and context of an object. e.g. title, author/creator, subject
Technical/structural: describes the format, process, and interrelatedness of an object. e.g. file format, size, dimensions (for images), set (if part of a series)
Administrative: describes the information needed to manage or use an object. e.g. permissions, creation date, required software, provenance
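
The three element types can be sketched as a simple grouped record. In the Python example below, the technical values are read from the file itself, while the descriptive and administrative values are supplied by the author. The `describe_file` helper, the field names, and the sample values (including the license string) are hypothetical and do not follow any particular standard.

```python
import json
import os
import tempfile


def describe_file(path, title, creator, permissions):
    """Group metadata for one file under the three element types."""
    extension = os.path.splitext(path)[1].lstrip(".")
    return {
        "descriptive": {"title": title, "creator": creator},
        "technical": {
            "file_name": os.path.basename(path),
            "file_format": extension.upper(),
            "size_bytes": os.path.getsize(path),
        },
        "administrative": {"permissions": permissions},
    }


# Create a small throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    data_path = os.path.join(folder, "measurements.csv")
    with open(data_path, "wb") as f:
        f.write(b"site_id,temperature_c\nA01,14.2\n")
    record = describe_file(
        data_path,
        title="Stream temperature measurements (hypothetical)",
        creator="Doe, Jane",
        permissions="CC BY 4.0",
    )

print(json.dumps(record, indent=2))
```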

Some examples of metadata standards are linked below, along with a description of whether they are best used within a specific discipline or across many subject areas.

Digital Curation Centre list of metadata standards: A comprehensive list of metadata standards across many disciplines. Many relate to the sciences, but there is coverage of the humanities and social sciences as well.
Dublin Core (General): Widely used in disciplinary and institutional repositories.
Darwin Core (Life Sciences, specifically biodiversity)
DDI (Social and behavioral sciences)
Ecological Metadata Language (EML)
Seeing Standards: A Visualization of the Metadata Universe (Humanities): A visualization of 105 metadata standards for different disciplines in the humanities, especially relating to cultural heritage.
Text Encoding Initiative (TEI) (Humanities): A widely used standard for the representation of texts in digital form.

Preparing Yourself

Prepare yourself! Creating good metadata begins with preparation and organization. Gather all of your information together, especially if it is distributed among multiple people. Then you can plan what you need to do.
Use existing information whenever possible. The information will often already be written by the time you need it for a metadata record. Reuse text from your funding proposals, such as the abstract, purpose, location, etc. You can also create a data dictionary during the data collection and analysis stages of your research and reference that in your metadata.
Choose keywords and other descriptive tags wisely. Consider all the interpretations of your vocabulary choices, and use a thesaurus to come up with alternate terms you may not otherwise have thought of.
Review your metadata to make sure it is complete and accurate. Include as many details as possible so users know what to expect from your data before they begin going through it. Make sure your descriptions are clear and do not omit any important information.
If possible, include unique identifiers like an ORCID (Open Researcher and Contributor ID) for the authors of and contributors to the research.

Preparing Your Metadata

At the dataset level, good metadata includes information about:

The context of the data: project history, objectives, hypotheses.
Data collection methods: protocols, sampling, instruments, data scale, resolution, temporal coverage, geographic coverage, hardware/software and other equipment.
Structure of the data: data files, relationships between files.
Sources of data used
Data checking, validation, proofing
Modifications to the data since their creation
Identification of different versions of the data
Information on access and use conditions, confidentiality, etc., where necessary

This information may be contained in a separate document that accompanies the data files.

At the individual data level, good metadata includes information about:

Names and labels for variables, descriptions, records, and values
Explanation of codes and classification schema
Explanation and reasons for missing values
Data derived after collection, with information on how they were created (code, algorithms, or command files)

This information may be embedded within a dataset itself or contained within a separate document that accompanies the data files.
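
One common way to record this variable-level information is a data dictionary. The sketch below, in Python, keeps a small dictionary as a list of entries and checks that every column in a data file has an entry. The variable names, codes, missing-value convention, and the `undocumented_columns` helper are all hypothetical.

```python
import csv
import io

# Hypothetical data dictionary: one entry per variable in the data file.
DATA_DICTIONARY = [
    {"variable": "site_id", "label": "Sampling site identifier",
     "codes": "", "missing_value": ""},
    {"variable": "sex", "label": "Sex of specimen",
     "codes": "1=female; 2=male", "missing_value": "-9 (not recorded)"},
]


def undocumented_columns(csv_text, dictionary):
    """Return data file columns that have no entry in the data dictionary."""
    header = next(csv.reader(io.StringIO(csv_text)))
    documented = {entry["variable"] for entry in dictionary}
    return [name for name in header if name not in documented]


# Hypothetical data file contents; mass_g is deliberately left undocumented.
sample = "site_id,sex,mass_g\nA01,1,12.5\nA02,-9,11.0\n"
print(undocumented_columns(sample, DATA_DICTIONARY))  # ['mass_g']
```

Running a check like this before deposit helps catch variables that were added during analysis but never described.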

USGS System-level Metadata Record Creation Guidance: A set of best practices for creating metadata for large data systems and/or describing "collections" of datasets.