Metadata Guidelines#

Metadata are contextual data about your experimental data. Metadata are the who, what, when, where, and why of these data. Metadata put these data into context. For 'omics studies, metadata include information about the sample: when it was collected, where it was collected from, what kind of sample it is, and what were the properties of the environment or experimental condition from which the sample was taken. Information about sample processing is also metadata: methods used to extract and purify molecules (e.g., DNA) from the sample, type of DNA sequencing or other ’omics analyses done, and where the raw experimental data are located.

For an 'omics study, metadata exist at multiple stages along the path from samples to analysis. For example, contextual information about when and where the sample was collected could include descriptors like date, time, and geospatial coordinates. Processing the sample in the lab for analysis requires different protocols and instrumentation. Once these data become electronic and are analyzed, the results should be accompanied by the versioned software used.

Metadata can also refer to information about the study itself and where the data can be found (both online and offline), so that potential users can discover, access, evaluate, and use the data. Metadata of this type are required by the NOAA Data Management Handbook and should be hosted by the National Centers for Environmental Information (NCEI).

Categories of metadata#

Sample metadata#

Information about the primary sample: when it was collected (e.g., date and time), where it was collected from (e.g., latitude, longitude, elevation/depth, site name, country, etc.), what kind of sample it was (e.g., soil, seawater, feces), and the properties of the environment during collection (e.g., temperature, salinity, pH) or experimental condition (e.g., experimental or control, disease state) from which the sample was taken. Sample metadata are not dependent on how the sample was processed; that is called preparation metadata (see next item).

Most of this information would not be considered metadata in the strict (non-'omics) sense. It is data in its own right, but here it serves as contextual data for the 'omics data.

Experimental metadata#

  • Preparation metadata: how samples were extracted and how they were sequenced; these fields are specified by the Minimum Information about any (x) Sequence (MIxS) standard.
  • Data processing/analysis metadata: information about processing from raw sequences to desired outputs (e.g., sequence properties, software versions, processing parameters).
  • Feature metadata: ID number, assembled sequences, assigned taxonomy/function/genome location.

Recording metadata#

Planning from the start of every project how and which metadata will be recorded in a standardized way saves time when submitting your data to repositories and improves overall research reproducibility. Typically, each individual sample is recorded as a row and each metadata attribute is recorded as a column. The specific metadata attributes will vary depending on your 'omics method, but in general the sample metadata should contain the minimum information that still comprehensively describes the physical and chemical conditions of the sample, from collection through sequencing. A (non-comprehensive) list could include sampling environment, molecular lab procedure names, sequencing platform and model (required by NCBI SRA), data processing steps, and external identifiers from additional processing steps (e.g., sample IDs from a sequencing organization).

Metadata should be collated from primary sources (e.g., paper notes, emails, external collaborators, lab notebooks) and associated with sample IDs as soon as possible after they are generated. Primary sources of metadata should be backed up in case they are needed as a reference later. During this stage, the metadata should be evaluated for potential errors (e.g., incorrect GPS coordinates, missing data), and any problems should be followed up with the original data collector.
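
The error check described above can be partly automated. The following is a minimal sketch, assuming the metadata are stored as a CSV file with one row per sample; the example rows and the `find_problems` helper are hypothetical and only illustrate the idea of flagging empty cells and out-of-range coordinates for follow-up.

```python
import csv
import io

# Hypothetical example table: one row per sample, one column per attribute.
EXAMPLE_CSV = """sample_name,collection_date,decimalLatitude,decimalLongitude,depth
S1.surface,2008-01-23,18.580,-89.122,0
S2.deep,2008-01-23,118.580,-89.122,10
S3.surface,,18.581,-89.120,0
"""


def find_problems(rows):
    """Return (sample_name, message) pairs for rows that need follow-up."""
    problems = []
    for row in rows:
        name = row.get("sample_name", "")
        # Flag empty cells so they can be traced back to the primary source.
        for column, value in row.items():
            if value is None or value.strip() == "":
                problems.append((name, f"missing value in {column}"))
        # Flag coordinates outside the valid decimal-degree ranges.
        for column, limit in (("decimalLatitude", 90.0), ("decimalLongitude", 180.0)):
            value = (row.get(column) or "").strip()
            if not value:
                continue
            try:
                number = float(value)
            except ValueError:
                problems.append((name, f"{column} is not a number: {value}"))
                continue
            if abs(number) > limit:
                problems.append((name, f"{column} out of range: {value}"))
    return problems


rows = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
for sample, message in find_problems(rows):
    print(sample, message)
```

A range check only catches impossible coordinates. A plausible but wrong position (for example, a swapped sign) still has to be confirmed with the person who collected the sample.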

Metadata formats and custom fields#

While templates from NCBI provide some information on formatting and support the minimum metadata required for submission, we provide additional formatting details and recommended custom metadata fields below. Most of these fields are included in the NOAA Omics study data templates:

Table 3

| Field name | Format | Description | Custom |
| --- | --- | --- | --- |
| sample_name | {text} (restricted) | Identifies a sample and should be descriptive. It is the primary key and must be unique. Allowed characters are alphabetic [A-Za-z], numeric [0-9], and periods (.). Disallowed characters include space, _, -, #, %, /, &, and *. | FALSE |
| taxid | {integer} | NCBI taxon ID for the sample. Should indicate the metagenome being investigated. Examples: 410658 for soil metagenome, 749906 for gut metagenome, 412755 for marine sediment metagenome, 408172 for marine metagenome (seawater), or 449393 for freshwater metagenome. If unspecified, use 408169. | TRUE |
| organism | {text} | Common name for the provided NCBI taxon ID (must match taxid above). Examples: soil metagenome, gut metagenome, marine sediment metagenome, marine metagenome. | FALSE |
| host_subject_id | {text} | An identifier for the ‘host’. Should be specific to a host, and can be a one-to-many relationship with samples. If this is not a host-associated study, this can be blank. | TRUE |
| description | {text} | Description of the sample. | TRUE |
| physical_specimen_location | {text} | Where you would go to find the physical sample or DNA, regardless of whether it is still available or not. | TRUE |
| physical_specimen_remaining | {boolean} | Is there still physical sample (e.g., soil, not DNA) available? True or False. | TRUE |
| collection_date | yyyy-mm-ddThh:mm | The time of sampling, either as an instance (single point in time) or interval. In case no exact time is available, the date/time can be right truncated. Must be in UTC for NCBI. Examples: 2008-01-23T19:23:10+00:00 (T optional), 2008-01-23T19:23:10 (T optional), 2008-01-23, 2008-01, and 2008; all are ISO8601 compliant except 2008-01 and 2008. A date range may also be specified: 2007-2008 or 02/2011-04/2011. | FALSE |
| collection_date_local | yyyy-mm-ddThh:mm-hh:mm | The date on which the sample was collected in local time in ISO format. Date/time ranges are supported by providing two dates from among the supported value formats, delimited by a forward-slash character. Collection times are supported by adding "T", then the hour and minute, after the date. | TRUE |
| decimalLatitude | {decimal degrees} | Latitude where the sample was collected, in Darwin Core format. Positive if north of the equator, negative if south of the equator. Examples: 18.580 and -89.122. | TRUE |
| decimalLongitude | {decimal degrees} | Longitude where the sample was collected, in Darwin Core format. Positive if east of the prime meridian, negative if west of the prime meridian. Examples: 40.743 and -10.530. | TRUE |
| project_description | {text} | Provide the reason for conducting molecular analyses. Example: Single species qPCR assay to survey North Pacific Hake. | TRUE |
| depth | {meters} | Depth of sample collection, where 0 is the surface. Examples: 0 and 10. | FALSE |
| cruise_id | {text} | Identifier for the cruise, with year in parentheses. | TRUE |
| station | {text} | Spatially distinct sampling locations within a site (i.e., spatial replicates). Examples: Station_1:Cape_Elizabeth or Station_2:Miami_Harbor. | TRUE |
| biological_replicates | {text} | Corresponding replicate sample units collected as close as possible to the same point in space and time, stored in separate containers, and analyzed independently. Separated by a pipe. Example: field_sample_1 \| field_sample_2 | TRUE |
| protocol | {text} | Link to a detailed description of the sampling equipment used to collect the data; include a link to a permanent and stable archive of the associated protocol. Protocols should include 1) equipment used for sample collection in the field, 2) field sample processing, including filter type, pore size, diameter, manufacturer product number, number of filters used, volume/weight of sample collected, etc., 3) preservation method, including preservative and storage time, and 4) laboratory protocols used to generate the resulting data (see Better Biomolecular Ocean Practices (BeBOP) for examples of detailed protocols). The work should be fully reproducible from the description, which should include details on negative controls (field blanks, extraction blanks, PCR blanks, non-template controls, etc.) and positive controls (gBlocks, exogenous DNA, tissue, etc.), covering both spike-in controls and standards. Examples: NCOG protocols. | TRUE |
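
Several of the formats in Table 3 can be checked programmatically before submission. The sketch below is one possible approach, not an official NOAA or NCBI validator; all function names are hypothetical, and the UTC conversion assumes that `collection_date_local` carries an explicit UTC offset.

```python
import re
from datetime import datetime, timezone

# sample_name: letters, digits, and periods only, as listed in Table 3.
SAMPLE_NAME_RE = re.compile(r"[A-Za-z0-9.]+")


def is_valid_sample_name(name):
    """True if the name uses only the characters allowed for sample_name."""
    return SAMPLE_NAME_RE.fullmatch(name) is not None


def is_valid_coordinate(latitude, longitude):
    """True if both values fall inside the decimal-degree ranges."""
    return -90.0 <= latitude <= 90.0 and -180.0 <= longitude <= 180.0


def local_to_utc(collection_date_local):
    """Convert a local timestamp with a UTC offset to a UTC yyyy-mm-ddThh:mm string."""
    local = datetime.fromisoformat(collection_date_local)
    if local.tzinfo is None:
        raise ValueError("collection_date_local needs a UTC offset, e.g. -05:00")
    return local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M")


def split_replicates(value):
    """Split a pipe-separated biological_replicates value into sample names."""
    return [item.strip() for item in value.split("|") if item.strip()]


print(is_valid_sample_name("Station1.rep2"))
print(is_valid_sample_name("Station_1-rep2"))
print(local_to_utc("2008-01-23T14:23-05:00"))
print(split_replicates("field_sample_1 | field_sample_2"))
```

Deriving `collection_date` from `collection_date_local` in code, rather than typing both by hand, keeps the two columns consistent with each other.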

Refining metadata to a standard#

Standardizing the format of your metadata can both facilitate sharing of your results with others and improve identification of errors within your own metadata. Having a method-specific template Google Sheet or Excel file for metadata that you use across all similar studies can be very helpful. This template should include a second sheet or file with a "data dictionary" defining the desired attribute columns and formats. Refining your metadata to a standard has benefits for internal use, publication, and repository submission. Refined metadata should have the following characteristics:

  1. Standardized: The data in each column should be written in a consistent manner, i.e. in the same format.
  2. Well-defined attributes: The names of the metadata parameters of each sample should be obvious in definition, with units included if applicable.
  3. Comprehensive but concise: Collected metadata should be comprehensive, without extraneous or unnecessary attributes.

A general workflow for transforming and refining metadata:

  1. Fill in missing metadata: consult cruise/ship notes and other potential sources of unconsolidated information about the samples. For missing data that cannot be filled in, we recommend following the INSDC standardized missing value reporting language.
  2. Identify the columns that are present and compare them with those in the data dictionary, which defines the minimum information that we hope to submit with the dataset that we make publicly available.
  3. Standardize the values in each column to the format that the data dictionary specifies for that column.
  4. Optional: load the columns into a relational database.
  5. Publish the data to NCEI and to any other funding-specific sites.
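
Steps 1 and 2 of this workflow can be sketched in a few lines. The example below is illustrative only: the data dictionary contents and helper names are hypothetical, and the use of "not provided" as the placeholder is an assumption. Choose whichever term from the INSDC missing value vocabulary fits the reason the value is absent.

```python
# Hypothetical data dictionary: required column name -> expected format.
DATA_DICTIONARY = {
    "sample_name": "{text}",
    "collection_date": "yyyy-mm-ddThh:mm",
    "decimalLatitude": "{decimal degrees}",
    "decimalLongitude": "{decimal degrees}",
    "depth": "{meters}",
}

# Assumption: "not provided" is used for every gap in this sketch.
MISSING_VALUE = "not provided"


def compare_columns(present, dictionary):
    """Return (columns missing from the table, columns not in the dictionary)."""
    present = set(present)
    expected = set(dictionary)
    return sorted(expected - present), sorted(present - expected)


def fill_missing(rows, dictionary, placeholder=MISSING_VALUE):
    """Return copies of rows with empty or absent dictionary columns filled in."""
    filled = []
    for row in rows:
        new_row = dict(row)  # leave the original row untouched
        for column in dictionary:
            value = new_row.get(column)
            if value is None or str(value).strip() == "":
                new_row[column] = placeholder
        filled.append(new_row)
    return filled


rows = [{"sample_name": "S1", "collection_date": "", "depth": "0", "notes": "x"}]
missing, extra = compare_columns(rows[0].keys(), DATA_DICTIONARY)
print("missing columns:", missing)
print("columns not in dictionary:", extra)
print(fill_missing(rows, DATA_DICTIONARY))
```

Filling gaps should come only after the primary sources have been consulted, since a placeholder hides the fact that a value might still be recoverable.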

Submitting metadata and environmental data to repositories#

NOAA ’Omics projects should make their metadata and non-’omics data (non-“big data”) publicly accessible by submitting them to the appropriate repositories (also see Table 1). Please note that, although NOAA's Coral Reef Information System (CoRIS) is the preferred venue for archiving NOAA-funded coral reef data, all CoRIS submissions are handled by NCEI.

| Project type | Metadata repositories |
| --- | --- |
| Environmental survey | OBIS/GBIF or directly to NCEI; NCBI |
| Mesocosm experiment | NCEI & NCBI |
| Laboratory experiment | NCBI |
| Genome/reference sequencing | NCBI |
| Non-DNA/RNA data (e.g., metabolomics, proteomics) | Specialized repository and/or NCEI (if environmental), NCBI |
| Coral reef data | CoRIS (via NCEI), with cross-linking to relevant data repository |

Use GCMD keywords in project metadata#

Whenever you submit metadata about a project to a repository, if there is a field for keywords we recommend using a controlled vocabulary such as the Omics terms in the NASA Global Change Master Directory (GCMD). GCMD keywords are a hierarchical set of controlled Earth Science vocabularies that help ensure Earth science data, services, and variables are described in a consistent and comprehensive manner and allow for the precise searching of metadata and subsequent retrieval of data, services, and variables.

NCEI#

Send metadata and environmental data to the National Centers for Environmental Information (NCEI), generally excluding ’omics data, as these large datasets should be stored in the relevant long-term data repository and linked from NCEI to those records with persistent identifiers. Although NCEI can curate ’omics datasets smaller than 20 GB (i.e., most proteomic and metabolomic datasets), it does not permit the critically important interactive querying feature that is integral to all ’omics-tailored data repositories and so should not be the lone repository for ’omics data. We highly recommend submitting metabarcoding ASV tables to OBIS/GBIF (see next section).

Submitting to biodiversity repositories (OBIS/GBIF)#

GBIF and OBIS are global repositories of biodiversity data, and are actively interested in expanding access to eDNA observations. OBIS has the benefit of archiving data to NCEI for you.

NOAA Omics has developed a Python workflow for preparing data for submission to OBIS/GBIF, called edna2obis. This workflow requires some familiarity with Jupyter Notebooks and Python.

GBIF also has a non-coding prototype GUI tool to prepare eDNA data for GBIF/OBIS.

For general guidance on preparing data for OBIS/GBIF, check out the guide entitled "Publishing DNA-derived data through biodiversity data platforms."

OBIS also has a free self-paced course for preparing and submitting data, including a module on DNA-derived data.

Domain-specific databases#

Metadata standards#

Environmental metadata should be formatted according to one or more of the following standards:
  • ISO 19115-2
  • Water samples (NOAA Omics standard): Study Data template - Water
  • General ’omics projects: Genomics Standards Consortium (GSC) Minimum Information about any (x) Sequence (MIxS)
  • OAP-funded projects: Ocean Acidification Data Stewardship (OADS) metadata guidelines
  • Additional guidance from ENA (European Nucleotide Archive), including "The ENA Metadata Model" and "Reporting Missing Values" in the ENA Training Modules documentation

Timing of submission#

The suggested deadline for data to be published and accessible in NCEI is one year after the end date of the project for NOAA intramural PIs, two years after the end date of the project for extramural PIs, or before a paper is published using these data (whichever is sooner). This schedule is based on the OAP Data Management Agreement.
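
The rule above amounts to a small date calculation. The sketch below is only an illustration of that arithmetic, with hypothetical function names; it treats "before a paper is published" as a deadline equal to the publication date, which is an assumption about how to interpret the guidance.

```python
from datetime import date


def add_years(day, years):
    """Return the same calendar day a number of years later."""
    try:
        return day.replace(year=day.year + years)
    except ValueError:
        # Feb 29 in a non-leap target year: fall back to Feb 28.
        return day.replace(year=day.year + years, day=28)


def submission_deadline(project_end, intramural, publication_date=None):
    """Suggested latest date for data to be accessible in NCEI."""
    # One year for intramural PIs, two years for extramural PIs.
    deadline = add_years(project_end, 1 if intramural else 2)
    # A paper published earlier than that moves the deadline forward.
    if publication_date is not None:
        deadline = min(deadline, publication_date)
    return deadline


print(submission_deadline(date(2024, 9, 30), intramural=True))
print(submission_deadline(date(2024, 9, 30), intramural=False,
                          publication_date=date(2025, 3, 1)))
```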

Point of contact for submissions: The PI is responsible for working with NCEI to publish the data in a timely manner.

Metadata guides
  • National Microbiome Data Collaborative (NMDC), which covers additional metadata
  • Earth Microbiome Project (EMP)

Metadata standards
  • GSC defined terms
  • Biosample Attributes
  • BeBOP-OBON Protocol Collection Template
  • BeBOP-OBON Minimum Information about an Omics Protocol
  • GBWG - Sustainable DarwinCore MIxS Interoperability - TDWG
  • GenBank templates
  • NIH tool to format all types of metadata