Metadata Guidelines#

Metadata are contextual data about your experimental data. Metadata are the who, what, when, where, and why of these data. Metadata put these data into context. For 'omics studies, metadata include information about the sample: when it was collected, where it was collected from, what kind of sample it is, and what were the properties of the environment or experimental condition from which the sample was taken. Information about sample processing is also metadata: methods used to extract and purify molecules (e.g., DNA) from the sample, type of DNA sequencing or other ’omics analyses done, and where the raw experimental data are located.

For an 'omics study, metadata exist at multiple stages along the path from samples to analysis. For example, contextual information about when and where the sample was collected could include descriptors like date, time, and geospatial coordinates. Processing the sample in the lab for analysis requires different protocols and instrumentation. Once these data become electronic and are analyzed, the results should be accompanied by the versioned software used.

Metadata can also refer to information about the study itself and where the data can be found (both online and offline) in order for potential users to discover, access, evaluate, and use the data. Metadata of this type is required by the NOAA Data Management Handbook and should be hosted by NCEI.

Categories of metadata#

Sample metadata#

Information about the primary sample: when it was collected (e.g., date and time), where it was collected from (e.g., latitude, longitude, elevation/depth, site name, country, etc.), what kind of sample it was (e.g., soil, seawater, feces), and the properties of the environment during collection (e.g., temperature, salinity, pH) or experimental condition (e.g., experimental or control, disease state) from which the sample was taken. Sample metadata are not dependent on how the sample was processed; that is called preparation metadata (see next item).

Most of this would not be considered metadata in the strict (non-omics) sense. It is data, but it is used as contextual data for the omics data.

Experimental metadata#

  • Preparation metadata: how samples were extracted and sequenced; these fields are specified by the MIxS standard
  • Data processing/analysis metadata: data about processing from raw sequences to desired outputs (e.g., sequence properties, software versions, processing parameters)
  • Feature metadata: ID number, assembled sequences, assigned taxonomy/function/genome location

Recording metadata#

Planning what metadata to record, and how to record them in a standardized way, from the start of every project saves time when submitting your data to data repositories and improves overall research reproducibility. Using standardized templates

Metadata should be collated from primary sources (e.g., paper notes, emails, external collaborators, lab notebooks) and associated with sample IDs as soon as possible after it is generated. Primary sources of metadata should be backed up in case they are needed as a reference later. During this stage, the metadata should be evaluated for potential errors (e.g., incorrect GPS coordinates, missing data) and followed up on with the original data collector.
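The error check described above can be partly automated. The following is a minimal sketch, assuming each sample is held as a dictionary and that coordinates are stored in decimal degrees under the hypothetical field names `latitude` and `longitude`; adapt the names to your own template.

```python
def find_metadata_problems(rows):
    """Return (sample_id, problem) tuples for a list of row dictionaries."""
    problems = []
    for row in rows:
        sample_id = row.get("sample_id", "<no id>")
        # Flag empty cells so they can be followed up with the data collector.
        for field, value in row.items():
            if value is None or str(value).strip() == "":
                problems.append((sample_id, f"{field} is empty"))
        # Flag coordinates outside the valid decimal-degree ranges.
        for field, limit in (("latitude", 90.0), ("longitude", 180.0)):
            raw = row.get(field)
            if raw is None or str(raw).strip() == "":
                continue  # already reported as empty above
            try:
                number = float(raw)
            except ValueError:
                problems.append((sample_id, f"{field} is not numeric: {raw!r}"))
                continue
            if abs(number) > limit:
                problems.append((sample_id, f"{field} out of range: {number}"))
    return problems


# Hypothetical example rows; S2 has a mistyped latitude and no longitude.
rows = [
    {"sample_id": "S1", "latitude": "47.6", "longitude": "-122.3"},
    {"sample_id": "S2", "latitude": "147.6", "longitude": ""},
]
problems = find_metadata_problems(rows)
```

A range check like this only catches impossible values; a coordinate that is valid but in the wrong place still needs a human review against the primary source.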

Metadata formats and custom fields#

While templates from NCBI provide some information on formatting and support the minimum metadata required for submission, we highly recommend providing additional metadata following relevant standards such as Darwin Core for biodiversity observations and FAIRe for eDNA data. Check out the Study Data Templates section for templates to help use these terms.

FAIRe eDNA Metadata Standards#

The FAIRe (Findable, Accessible, Interoperable, Reusable) eDNA initiative is an international, multi-organizational collaboration that has developed comprehensive metadata standards specifically for eDNA data. The FAIRe metadata checklist includes 337 data terms organized into workflow sections such as sample collection, PCR, and bioinformatics, with terms classified as:

  • 38 mandatory terms — required for all submissions
  • 51 highly recommended terms — strongly encouraged for data quality
  • 128 recommended terms — improve interoperability
  • 120 optional terms — for specialized applications

The FAIRe standard draws from established data sources including MIxS (Minimum Information about any Sequence), Darwin Core (for biodiversity data), MIQE guidelines (for quantitative PCR), MIEM guidelines (for eDNA metabarcoding), and 158 new terms developed specifically for eDNA procedures. This comprehensive approach ensures eDNA datasets are consistently documented, discoverable, and reusable across the scientific community, supporting data-driven biodiversity management at broad scales and enabling cross-discipline reuse.

Generating FAIRe Templates with FAIReSheets#

FAIReSheets is NOAA's Python-based tool for generating standardized eDNA metadata templates directly in Google Sheets, based on the FAIRe NOAA checklist (data dictionary).

Key features:

  • Custom User-Defined Terms — Add domain-specific fields to the checklist before template generation; these automatically appear in your Google Sheets
  • Controlled Vocabularies — Pre-defined vocabularies for many fields ensure consistent data entry and units of measure across the eDNA community
  • Ready for Submission — Generated templates are formatted for immediate submission to the Ocean DNA Explorer, and can also be adapted for OBIS and GBIF using the edna2obis tool

To access FAIReSheets: This tool is available upon request. Contact bayden.willms@noaa.gov for access and guidance.

The templates generated include metadata for:

  • Project metadata: overarching study information
  • Sample metadata: collection details and sample characteristics
  • Experiment run metadata: PCR and preparation procedures
  • Analysis metadata: bioinformatics processing and results

Filling in FAIRe Metadata Templates#

Once you have generated your templates through FAIReSheets, the next step is to populate them with your project data. This is a critical step to ensure your data is standardized, interoperable, and discoverable.

Best practices for completing templates:

  1. Link your records — Use consistent project IDs, sample IDs, and analysis run names across all metadata files to establish relationships between data layers
  2. Distinguish project vs. assay-specific data — Some fields in project metadata apply to all analyses (marked in the project_level column), while others are assay-specific (e.g., "ssu16sv4v5-emp" or "ssu18sv9-emp")
  3. Leverage controlled vocabularies — Use the predefined terms and units provided in the checklist to maintain consistency
  4. Document everything — Even fields that seem obvious now may be unclear to future data users or collaborators
  5. Validate before submission — Check for inconsistencies, missing required fields, and formatting errors before uploading

Handling Missing Data (Dead Values)#

Data absence is common and occurs for many legitimate reasons — location information may be intentionally obscured to protect endangered species or culturally sensitive sites, certain measurements may not be applicable to a sample type, or data collection may have been missed due to circumstances beyond your control.

For required fields that lack data, you must specify why the information is unavailable using the INSDC missing value controlled vocabulary. This practice is also recommended for optional fields. Rather than leaving cells empty, select the most appropriate "dead value" from the following table:

Value to Enter | When to Use
true or 1 | Boolean field is true
false or 0 | Boolean field is false
not applicable: control sample | Field does not apply (sample is a control)
not applicable: sample group | Field does not apply (part of a sample group)
not applicable | Field does not apply to this sample type
missing: not collected: synthetic construct | Data not collected (synthetic/lab construct)
missing: not collected: lab stock | Data not collected (lab stock material)
missing: not collected: third party data | Data not collected (from third party)
missing: not collected | Data not collected (unspecified reason)
missing: not provided: data agreement established pre-2023 | Data exists but unavailable (pre-2023 agreement)
missing: not provided | Data exists but was not provided
missing: restricted access: endangered species | Data cannot be shared (species protection)
missing: restricted access: human-identifiable | Data cannot be shared (privacy concerns)
missing: restricted access | Data cannot be shared (access restrictions)

For additional details on missing value reporting, refer to the ENA documentation on missing values.
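As an illustration of replacing empty cells with an explicit reason, here is a minimal sketch. The vocabulary is copied from the table in this section; the function names and the example field names are hypothetical and not part of FAIReSheets.

```python
# Missing value terms copied from the table in this section.
INSDC_MISSING_VALUES = frozenset({
    "not applicable",
    "not applicable: control sample",
    "not applicable: sample group",
    "missing: not collected",
    "missing: not collected: synthetic construct",
    "missing: not collected: lab stock",
    "missing: not collected: third party data",
    "missing: not provided",
    "missing: not provided: data agreement established pre-2023",
    "missing: restricted access",
    "missing: restricted access: endangered species",
    "missing: restricted access: human-identifiable",
})


def empty_fields(row):
    """Return the names of fields whose value is blank or None."""
    return [f for f, v in row.items() if v is None or not str(v).strip()]


def apply_missing_reasons(row, reasons):
    """Return a copy of row with blank cells replaced by an explicit reason.

    `reasons` maps a field name to the term chosen by the data provider.
    A blank field without a valid term raises ValueError, so no cell is
    filled with a guessed reason.
    """
    filled = dict(row)
    for field in empty_fields(row):
        reason = reasons.get(field)
        if reason not in INSDC_MISSING_VALUES:
            raise ValueError(f"{field}: choose a term from the missing value vocabulary")
        filled[field] = reason
    return filled


# Hypothetical sample whose location is withheld to protect a species.
row = {"sample_id": "S1", "decimalLatitude": "", "temp": "12.5"}
filled = apply_missing_reasons(
    row, {"decimalLatitude": "missing: restricted access: endangered species"}
)
```

Requiring the caller to supply the reason, rather than filling in a default, keeps the choice of term with the person who knows why the value is absent.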

User-Defined Terms#

If your project requires data fields not present in the FAIRe NOAA checklist, you can add them as User-Defined Terms. These custom fields can be:

  • Added before template generation — Modify the FAIRe NOAA checklist Excel file to include your custom terms before running FAIReSheets; they will automatically populate in your generated Google Sheets
  • Added manually after generation — Insert new columns directly into your Google Sheet with your custom field names and metadata

User-Defined Terms should follow the same naming and documentation conventions as standard FAIRe terms to maintain data consistency and usability.

Project Structure and Data Relationships#

Understanding how your metadata files relate to each other is essential for proper data organization and submission.

Metadata File Type | Purpose | Key Identifiers | Can Submit Independently?
Project Metadata | Overarching study information, PCR targets, assay details | project_id, assay_type, assay_name | No (requires analysis file(s))
Sample Metadata | Collection details, sample characteristics, environmental conditions | project_id, sample_id | No (requires project metadata)
Experiment Run Metadata | PCR procedures, primers, thermocycler conditions | project_id, experiment_run_id | No (requires project metadata)
Analysis Metadata | Bioinformatics pipeline, version numbers, processing parameters | project_id, analysis_run_name | Yes (if project already exists in Ocean DNA Explorer)

Critical linking fields:

  • project_id — Must be identical across all metadata files to link sample, experiment, and analysis data to the same project
  • analysis_run_name — Must be unique for each analysis and correctly specified in all analysis metadata files
  • sample_id — Links samples to their collection metadata and experimental processing
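A check of the linking fields might look like the sketch below. It assumes each metadata file has already been parsed into a list of row dictionaries that carry a `project_id` key, and that the analysis run names have been gathered into a list; the function name and this data layout are illustrative assumptions, not part of the Ocean DNA Explorer.

```python
from collections import Counter


def check_linking_fields(rows_by_file, analysis_run_names):
    """Return a list of human-readable problems with the linking fields."""
    issues = []
    # project_id must be identical (character for character) in every file.
    project_ids = {
        str(row.get("project_id", ""))
        for rows in rows_by_file.values()
        for row in rows
    }
    if len(project_ids) != 1 or "" in project_ids:
        issues.append(f"project_id is not identical across files: {sorted(project_ids)}")
    # analysis_run_name must be unique for each analysis.
    duplicates = sorted(
        name for name, count in Counter(analysis_run_names).items() if count > 1
    )
    if duplicates:
        issues.append(f"analysis_run_name is not unique: {duplicates}")
    return issues


# Hypothetical parsed files: the experiment run file has a mistyped project_id.
rows_by_file = {
    "sample": [{"project_id": "P1", "sample_id": "S1"}],
    "experiment_run": [{"project_id": "P01", "sample_id": "S1"}],
}
issues = check_linking_fields(rows_by_file, ["run_a", "run_b", "run_a"])
```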

Required Fields for Submission#

The following fields are mandatory for each submission type. All files must be in TSV (Tab-Separated Values) format and follow the FAIRe template structure exactly.

Project Metadata Requirements#

Required Field | Description
project_id | Unique identifier for the project; must match across all related files
project_contact | Name and contact information for the project lead
assay_type | Target marker or marker group (e.g., "16S rRNA", "18S rRNA", "COI")
assay_name | Descriptive name for the specific assay (e.g., "ssu16sv4v5-emp")
checkls_ver | Version of the FAIRe checklist used
pcr_0_1 | PCR cycle information or PCR primer information
targetTaxonomicAssay | Taxonomic target of the assay (e.g., bacteria, fungi, eukaryotes)
pcr_primer_forward | Forward primer sequence or reference
pcr_primer_reverse | Reverse primer sequence or reference

Analysis Metadata Requirements#

Required Field | Description
project_id | Must match the project_id in the project metadata file
assay_name | Must match the assay_name from project metadata
analysis_run_name | Unique name for this specific analysis run; used to distinguish multiple analyses of the same project
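The required fields listed above can be checked before upload. This is a minimal sketch that assumes the metadata have already been parsed into a mapping from field name to value; how that mapping is built depends on the layout of your template and is not shown here. The function name is hypothetical.

```python
# Required fields copied from the two tables above.
PROJECT_REQUIRED = [
    "project_id", "project_contact", "assay_type", "assay_name",
    "checkls_ver", "pcr_0_1", "targetTaxonomicAssay",
    "pcr_primer_forward", "pcr_primer_reverse",
]
ANALYSIS_REQUIRED = ["project_id", "assay_name", "analysis_run_name"]


def missing_required(record, required):
    """Return the required fields that are absent or blank in record."""
    missing = []
    for field in required:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Hypothetical analysis record with no run name filled in yet.
analysis = {"project_id": "P1", "assay_name": "ssu16sv4v5-emp", "analysis_run_name": " "}
missing = missing_required(analysis, ANALYSIS_REQUIRED)
```

A field filled with a missing value term counts as populated in this sketch, since the term itself documents why the value is absent.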

Submitting Metadata to the Ocean DNA Explorer#

Once you have completed your metadata templates:

  1. Export as TSV — Download each sheet from your Google Sheets template as a TSV file
  2. Validate structure — Verify all required fields are present and populated
  3. Check identifiers — Ensure project_id, sample_id, and analysis_run_name are consistent across files
  4. Submit files — Upload your TSV files to the Ocean DNA Explorer through the submission portal

Detailed submission instructions are available on the Ocean DNA Explorer documentation.
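Part of the structure check in step 2 can be done with the Python standard library once a sheet has been exported. The sketch below only tests that the file parses as tab-separated text with a consistent column count; it is an illustrative helper, not the validation performed by the Ocean DNA Explorer.

```python
import csv
import io


def tsv_structure_problems(tsv_text):
    """Return a list of structural problems found in TSV text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    problems = []
    if len(set(header)) != len(header):
        problems.append("duplicate column names in header")
    # Data rows start on line 2; each must have as many cells as the header.
    for line_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(
                f"line {line_number}: expected {len(header)} columns, found {len(row)}"
            )
    return problems


# Hypothetical export in which the last row lost a cell.
tsv_text = "project_id\tsample_id\nP1\tS1\nP1\n"
problems = tsv_structure_problems(tsv_text)
```

To check a file on disk, read it with `open(path, newline="", encoding="utf-8")` and pass the text to the function.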

Submitting to OBIS and GBIF#

If you plan to submit your eDNA data to OBIS (Ocean Biodiversity Information System) or GBIF (Global Biodiversity Information Facility), NOAA Omics has developed the edna2obis Python workflow to convert Ocean DNA Explorer input files to the format required by these repositories.

Because edna2obis uses the same input file structure as the Ocean DNA Explorer, if your data is properly formatted for Ocean DNA Explorer submission, it can be easily adapted for OBIS and GBIF submission as well. This allows you to maximize the reach and impact of your eDNA data across multiple biodiversity platforms.

For more information on preparing eDNA data for OBIS/GBIF, consult the edna2obis GitHub repository and the GBIF guide to publishing DNA-derived data.

Refining metadata to a standard#

Standardizing the format of your metadata can both facilitate sharing of your results with others and improve identification of errors within your own metadata. Having a method-specific template Google Sheet or Excel file for metadata that you use across all similar studies can be very helpful. This template should include a second sheet or file with a "data dictionary" defining the desired attribute columns and formats. Refining your metadata to a standard has benefits for internal use, publication, and repository submission. Refined metadata should have the following characteristics:

  1. Standardized: The data in each column should be written in a consistent manner, i.e., in the same format.
  2. Well-defined attributes: The name of each metadata attribute should be self-explanatory, with units included if applicable.
  3. Comprehensive: The metadata should cover everything relevant to the samples, without extraneous or unnecessary attributes.

A general workflow for transforming and refining metadata:

  1. Fill in missing metadata: consult cruise/ship notes and other potential sources of unconsolidated information about the samples. For missing data that cannot be filled in, we recommend following the INSDC standardized missing value reporting language.
  2. Identify the columns that are present and compare them with those in the data dictionary, which defines the minimum information we hope to submit with the publicly available dataset.
  3. Standardize the values in each column to the format that the data dictionary defines for that column.
  4. Optional: load the columns into a relational database.
  5. Publish the data to NCEI and to any other repositories required by the funder.
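Step 2 of this workflow is a simple set comparison. A minimal sketch, assuming the data dictionary has been reduced to a list of expected column names (the column names in the example are hypothetical):

```python
def compare_to_dictionary(columns, dictionary_columns):
    """Compare the columns in a metadata sheet with the data dictionary."""
    present = set(columns)
    expected = set(dictionary_columns)
    return {
        # Attributes the dictionary asks for that the sheet lacks.
        "missing": sorted(expected - present),
        # Columns in the sheet that the dictionary does not define.
        "unexpected": sorted(present - expected),
    }


# Hypothetical sheet header and data dictionary.
sheet_columns = ["sample_id", "date", "lat", "notes"]
dictionary_columns = ["sample_id", "date", "lat", "lon", "depth_m"]
report = compare_to_dictionary(sheet_columns, dictionary_columns)
```

Columns reported as unexpected are not necessarily wrong; they may be candidates for renaming to a dictionary term or for addition to the dictionary.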

Submitting metadata and environmental data to repositories#

NOAA ’Omics projects should make their metadata and non-’omics data (non-“big data”) publicly accessible through the appropriate repositories (also see Table 1). Please note that, although NOAA's Coral Reef Information System (CoRIS) is the preferred venue for archiving NOAA-funded coral reef data, all CoRIS submissions are handled by NCEI.

Project type | Metadata repositories
Environmental survey | OBIS/GBIF or directly to NCEI; NCBI
Mesocosm experiment | NCEI & NCBI
Laboratory experiment | NCBI
Genome/reference sequencing | NCBI
Non-DNA/RNA data (e.g., metabolomics, proteomics) | Specialized repository and/or NCEI (if environmental), NCBI
Coral reef data | CoRIS (via NCEI), with cross-linking to relevant data repository

Use GCMD keywords in project metadata#

Whenever you submit metadata about a project to a repository, if there is a field for keywords we recommend using a controlled vocabulary such as the Omics terms in the NASA Global Change Master Directory (GCMD). GCMD keywords are a hierarchical set of controlled Earth Science vocabularies that help ensure Earth science data, services, and variables are described in a consistent and comprehensive manner and allow for the precise searching of metadata and subsequent retrieval of data, services, and variables.

NCEI#

Send metadata and environmental data to the National Centers for Environmental Information (NCEI), generally excluding ’omics data, as these large datasets should be stored in the relevant long-term data repository and linked from NCEI to those records with persistent identifiers. Although NCEI can curate ’omics datasets smaller than 20 GB (i.e., most proteomic and metabolomic datasets), it does not permit the critically important interactive querying feature that is integral to all ’omics-tailored data repositories and so should not be the lone repository for ’omics data. We highly recommend submitting metabarcoding ASV tables to OBIS/GBIF (see next section).

Submitting to biodiversity repositories (OBIS/GBIF)#

GBIF and OBIS are global repositories of biodiversity data, and are actively interested in expanding access to eDNA observations. OBIS has the benefit of archiving data to NCEI for you.

NOAA Omics has developed a Python workflow for preparing data for submission to OBIS/GBIF, called edna2obis. This workflow requires some familiarity with Jupyter Notebooks and Python.

GBIF also has a non-coding prototype GUI tool to prepare eDNA data for GBIF/OBIS.

For general guidance on preparing data for OBIS/GBIF, check out the guide entitled "Publishing DNA-derived data through biodiversity data platforms."

OBIS also has a free self-paced course for preparing and submitting data, including a module on DNA-derived data.

Domain-specific databases#

Metadata standards#

Environmental metadata should be formatted according to one or more of the following standards:

  • ISO 19115-2
  • Water samples (NOAA Omics standard): Study Data template - Water
  • General ’omics projects: Genomics Standards Consortium (GSC) Minimum Information about any (x) Sequence (MIxS)
  • OAP-funded projects: Ocean Acidification Data Stewardship (OADS) metadata guidelines
  • Additional guidance from ENA (European Nucleotide Archive), including "The ENA Metadata Model" and "Reporting Missing Values" in the ENA Training Modules documentation

Timing of submission#

The suggested deadline for data to be published and accessible in NCEI is one year after the end date of the project for NOAA intramural PIs, two years after the end date of the project for extramural PIs, or before a paper is published using these data (whichever is sooner). This schedule is based on the OAP Data Management Agreement.

Point of contact for submissions: The PI is responsible for working with NCEI to publish the data in a timely manner.
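The deadline rule in this section can be written as a small date calculation. This is a sketch of one reading of the rule: it treats "before a paper is published" as "no later than the publication date", and the function names are hypothetical.

```python
from datetime import date


def add_years(d, years):
    """Return d shifted by a whole number of years."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # 29 February with a non-leap target year: fall back to 28 February.
        return d.replace(year=d.year + years, day=28)


def suggested_deadline(project_end, intramural, publication_date=None):
    """Latest suggested date for the data to be accessible in NCEI."""
    # One year after project end for intramural PIs, two for extramural PIs.
    deadline = add_years(project_end, 1 if intramural else 2)
    # A planned publication can only move the deadline earlier.
    if publication_date is not None:
        deadline = min(deadline, publication_date)
    return deadline


# Hypothetical extramural project with a paper planned before the two-year mark.
deadline = suggested_deadline(date(2024, 9, 30), intramural=False,
                              publication_date=date(2025, 3, 1))
```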

Metadata guides

  • National Microbiome Data Collaborative (NMDC): covers additional metadata
  • Earth Microbiome Project (EMP)

Metadata standards

  • GSC defined terms
  • Biosample Attributes
  • BeBOP-OBON Protocol Collection Template
  • BeBOP-OBON Minimum Information about an Omics Protocol
  • GBWG - Sustainable DarwinCore MIxS Interoperability - TDWG
  • GenBank templates
  • NIH tool to format all types of metadata