Study Data Templates#

Use the Table of Contents on the left to navigate to relevant sections for your 'omics data types!

NOAA Omics Study Data Templates#

Documentation site here!

A new NOAA Omics study data template was developed based on feedback from NOAA partners at OAR and the NOAA Omics Data and Bioinformatics Supergroup. This template incorporates data standards from MIxS, Darwin Core, and custom recommended NOAA fields to facilitate data management of eDNA survey samples, from project initiation through data submission. For guidance on using the template, check out the template's README page or the documentation wiki. Additional templates are in development to cover other data types and environments. If you are interested in developing a NOAA Omics template for your data/environment type, please reach out to katherine.silliman@noaa.gov!

Other templates for DNA/RNA sequence data#

While the templates below provide some information on metadata formatting and support the minimum metadata required for submission to NCBI, we provide additional formatting guidance and recommended custom metadata fields on the Metadata Guidelines page.

Sample metadata templates#

Genomic Standards Consortium (GSC) Minimal Information about any (x) Sequence (MIxS) templates are the standard for sample metadata, which includes information about the primary sample: when it was collected (e.g., date and time), where it was collected from (e.g., latitude, longitude, elevation/depth, site name, country, etc.), what kind of sample it was (e.g., soil, seawater, feces), and the properties of the environment during collection (e.g., temperature, salinity, pH) or experimental condition (e.g., experimental or control, disease state) from which the sample was taken.
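
As a rough illustration of the when/where/what information described above, the sketch below stores one sample as a Python dictionary keyed by MIxS-style term names and reports any missing fields. The sample identifier, all values, and the `ASSUMED_REQUIRED` list are assumptions made for this example; consult the MIxS package for your data type for the actual mandatory fields and value formats.

```python
# Illustrative sample metadata record using MIxS-style term names.
# All values are made up, and the environment labels are placeholders
# rather than verified ontology terms.
sample = {
    "samp_name": "EX_station01_5m",             # hypothetical sample identifier
    "collection_date": "2023-06-14T09:30:00Z",  # when it was collected
    "lat_lon": "25.0 N 80.5 W",                 # where: decimal degrees
    "depth": "5 m",
    "geo_loc_name": "USA: Florida",
    "env_broad_scale": "marine biome",
    "env_local_scale": "coastal water",
    "env_medium": "sea water",                  # what kind of sample it was
    "temp": "24.1 degree Celsius",              # environmental properties
    "salinity": "35.2 psu",
    "ph": "8.1",
}

# Assumption for this sketch only, not the official mandatory list.
ASSUMED_REQUIRED = (
    "samp_name",
    "collection_date",
    "lat_lon",
    "geo_loc_name",
    "env_broad_scale",
    "env_local_scale",
    "env_medium",
)


def missing_fields(record, required=ASSUMED_REQUIRED):
    """Return the required field names that are absent or blank in record."""
    missing = []
    for name in required:
        value = record.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


print(missing_fields(sample))
```

A check like this can catch blank cells before a spreadsheet is uploaded, but it does not replace the validation performed by the repository itself.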

Metadata input templates:

  • NCBI provides a useful link to download MIxS sample metadata templates based on your sequence data type and sample environment (known as 'packages'). These templates will be appropriate for the majority of NOAA 'Omics projects that generate DNA/RNA sequence data, and can be used to generate NCBI BioSamples. The NOAA Omics study data template includes a `sample_data` sheet that can be used for submission to NCBI BioSample.
  • The National Microbiome Data Collaborative (NMDC) maintains the NMDC Submission Portal, which allows inputting metadata with real-time validation. The submission portal supports several different community standards, such as the MIxS standard from GSC, the PROV standard for provenance metadata, the Proteomics Standards Initiative (PSI) standards for metaproteomics, and the Metabolomics Standards Initiative (MSI) standards for metabolomics.

A guide to choosing the right metadata package given your 'omics data type is below:

Table 1. Suggested MIxS templates for common environmental omics datatypes.

| Data type | Description | Metadata package |
| --- | --- | --- |
| amplicon survey | Use for any type of marker gene sequences, e.g., 16S, 18S, 23S, 28S rRNA or COI, obtained directly from the environment, without culturing or identification of the organisms. | MIMARKS Survey |
| metagenome | Use for environmental and metagenome sequences. | MIMS Environmental/Metagenome |
| metagenome-assembled genome | Use for metagenome-assembled genome sequences produced using computational binning tools that group sequences into individual organism genome assemblies starting from metagenomic data sets. | MIMAG Metagenome-assembled Genome |
| single amplified genome | Use for single amplified genome sequences produced by isolating individual cells, amplifying the genome of each cell using whole genome amplification, and then sequencing the amplified DNA. | MISAG Single Amplified Genome |
| uncultivated virus genome | Use for uncultivated virus genomes identified in metagenome and metatranscriptome datasets. | MIUVIG Uncultivated Virus Genome |
| amplicon specimen | Use for any type of marker gene sequences, e.g., 16S, 18S, 23S, 28S rRNA or COI, obtained from cultured or voucher-identifiable specimens. | MIMARKS Specimen |
| cultured bacteria or archaea | Use for cultured bacterial or archaeal genomic sequences. | MIGS Cultured Bacterial/Archaeal |
| viral genome | Use for virus genomic sequences. | MIGS Viral |
| eukaryotic genome | Use for eukaryotic genomic sequences. | MIGS Eukaryotic |
| qPCR or ddPCR or rt-PCR | Use for any type of real-time PCR, quantitative PCR (qPCR), or digital PCR. | MIQE, RDML, & dMIQE |

For most NOAA 'Omics projects, the water or sediment environmental packages will be appropriate.
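
The mapping in Table 1 can also be kept as a small lookup, for example when a script needs to record which package a dataset should use. The sketch below simply restates the table as a Python dictionary; the names `MIXS_PACKAGE_BY_DATA_TYPE` and `suggest_package` are hypothetical helpers for this example and are not part of any NCBI or GSC tool.

```python
# Table 1 restated as a lookup from data type to suggested metadata package.
MIXS_PACKAGE_BY_DATA_TYPE = {
    "amplicon survey": "MIMARKS Survey",
    "metagenome": "MIMS Environmental/Metagenome",
    "metagenome-assembled genome": "MIMAG Metagenome-assembled Genome",
    "single amplified genome": "MISAG Single Amplified Genome",
    "uncultivated virus genome": "MIUVIG Uncultivated Virus Genome",
    "amplicon specimen": "MIMARKS Specimen",
    "cultured bacteria or archaea": "MIGS Cultured Bacterial/Archaeal",
    "viral genome": "MIGS Viral",
    "eukaryotic genome": "MIGS Eukaryotic",
    "qpcr or ddpcr or rt-pcr": "MIQE, RDML, & dMIQE",
}


def suggest_package(data_type):
    """Return the suggested package for a Table 1 data type (case-insensitive)."""
    key = data_type.strip().lower()
    if key not in MIXS_PACKAGE_BY_DATA_TYPE:
        raise ValueError(f"No suggested package for data type: {data_type!r}")
    return MIXS_PACKAGE_BY_DATA_TYPE[key]


print(suggest_package("amplicon survey"))
```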

Preparation metadata templates#

Preparation metadata is directly related to the preparation of the biomaterial undergoing the 'omics assay and the process of performing the assay. A primary sample could be split (aliquoted) and processed through multiple preparation methods; therefore, there could be multiple sets of preparation metadata for a single set of samples.
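
To make the one-to-many relationship concrete, the sketch below links several preparation records to each sample through a shared sample name. The column names (`samp_name`, `lib_id`, `target_gene`) and all values are assumptions chosen for illustration; they are not necessarily the columns used in the NOAA Omics template or the NCBI SRA template.

```python
from collections import defaultdict

# One row per primary sample (hypothetical identifiers).
samples = [{"samp_name": "S1"}, {"samp_name": "S2"}]

# One row per preparation; a single sample can appear more than once.
preps = [
    {"samp_name": "S1", "lib_id": "S1_16S", "target_gene": "16S rRNA"},
    {"samp_name": "S1", "lib_id": "S1_18S", "target_gene": "18S rRNA"},
    {"samp_name": "S2", "lib_id": "S2_16S", "target_gene": "16S rRNA"},
]


def group_preps_by_sample(samples, preps):
    """Map each sample name to its library IDs, rejecting unknown samples."""
    known = {row["samp_name"] for row in samples}
    grouped = defaultdict(list)
    for prep in preps:
        name = prep["samp_name"]
        if name not in known:
            raise ValueError(f"Preparation {prep['lib_id']!r} refers to unknown sample {name!r}")
        grouped[name].append(prep["lib_id"])
    return dict(grouped)


print(group_preps_by_sample(samples, preps))
```

Keeping sample and preparation records in separate tables joined by the sample name avoids repeating the sample-level fields for every library.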

NCBI repositories (e.g., SRA, GenBank) provide some templates for the minimum required preparation metadata, while in other cases they require interactive user input. We recommend submitting your sample metadata and generating BioSample accession IDs first, although you can do both steps at the same time. The NOAA Omics study data template includes a `prep_data` sheet that can be used for submission to NCBI SRA.

High-throughput sequencing data (SRA)

Projects using high-throughput sequencing data (e.g., amplicon, metagenomic, RNASeq, RAD-Seq) can use the NCBI SRA template.

Sanger sequencing

Sequencing projects that do not use high-throughput sequencing (e.g., single-gene Sanger sequencing) can use the NCBI GenBank template.

Other omics data types#

For NOAA Omics projects that generate biological data other than DNA/RNA sequencing:

Targeted quantitative surveys (qPCR, ddPCR, rt-PCR)#

Projects that generate data with real-time PCR, qPCR, or dPCR can use the Minimum Information for Publication of Quantitative Real-Time PCR Experiments (MIQE) Real-time PCR Data Markup Language (RDML) template.

Additional resources for best practices:

1. Environmental Microbiology Minimum Information (EMMI) Guidelines, Borchardt et al. 2021
2. The MIQE guidelines: minimum information for publication of quantitative real-time PCR experiments, Bustin et al. 2009
3. Guidance on the Use of Targeted Environmental DNA (eDNA) Analysis for the Management of Aquatic Invasive Species and Species at Risk, from the Canadian Science Advisory Secretariat, Abbott et al. 2021
4. Best Practices in qPCR and dPCR Validation in Regulated Bioanalytical Laboratories, from the American Association of Pharmaceutical Scientists Workshop, Hays et al. 2022
5. Sanders et al. 2018
6. Langlois et al. 2021

Proteomics#

| Sample Data | Required? | Definition or Example | Recommended Format | Repository |
| --- | --- | --- | --- | --- |
| MS data | Y | Original proprietary files provided by the instruments used in the study (e.g., Thermo RAW) | mzML; controlled vocabulary: MS ontology; file formatting details: PRIDE | PRIDE |
| Sequencing data | N | Amino acid sequences, whole genome sequences, RNA-Seq, whole exome sequences | FASTA, FASTQ | MassIVE, PRIDE (as optional data), NCBI SRA |

Other options for repositories, as well as general data submission guidelines, can be found on the ProteomeXchange website.

Metabolomics#

| Sample Data | Required? | Definition or Example | Recommended Format | Repository |
| --- | --- | --- | --- | --- |
| Raw NMR or MS data | Y | NMR: can be free induction decay (FID) or Fourier transformed (FT); should also include instrument and software versions. | Open-source formats (mzML, mzXML, CDF) | Metabolomics Workbench |
| Sequencing data | N | Whole genome, amplicon, transcriptome | FASTA, FASTQ | NCBI SRA |

Formats for processed omics data#

If your 'omics data is processed using bioinformatics, the resulting file(s) from those analyses should also be archived. Below are suggested formats and destination repositories for common environmental 'omics datasets.

Table 2. Suggested formats and destination repositories for common environmental omics datasets. Please note that, although NOAA's Coral Reef Information System (CoRIS) is the preferred venue for archiving NOAA-funded coral reef data, all CoRIS submissions are handled by NCEI.

| Data type | Data formats (non-exhaustive) | Repository |
| --- | --- | --- |
| DNA reference sequences | GenBank format | NCBI GenBank |
| DNA sequence data (amplicon, metagenomic, RAD-Seq) | Raw FASTQ | NCBI SRA |
| Amplicon Sequence Variants | Reference FASTA | GBIF/OBIS, or directly to [NCEI](https://www.ncei.noaa.gov/archive) |
| RNA sequence data (RNA-Seq) | Raw FASTQ | NCBI SRA |
| Functional genomics data (quantitative gene expression, ChIP-Seq, HiC-seq, methylation seq) | Metadata, processed data (e.g., raw read counts), SRA accessions | NCBI GEO |
| RNA transcript assemblies | FASTA or SQN file | NCBI TSA |
| Genome assemblies | FASTA or SQN file, optional AGP file to orient scaffolds | NCBI WGS |
| Quantitative PCR data | Tab-delimited text | NCEI |
| Mass spectrometry data (metabolomics, proteomics) | Raw mass spectra, mzML, mzID | ProteomeXchange, Metabolomics Workbench |
| Coral reef data | Tab-delimited text, HDF, or netCDF (less preferable) | CoRIS (via NCEI) |
| Feature observation tables and feature metadata | BIOM (HDF5) format (feature observation tables), tab-delimited text (feature metadata) | [GBIF/OBIS](https://github.com/aomlomics/edna2obis) or directly to [NCEI](https://www.ncei.noaa.gov/archive) (size permitting), Zenodo, or Figshare |
| Reference database | FASTA (sequences) and TSV (taxonomy) | Zenodo or Figshare or Dryad |
| Analysis code | Commented code and Jupyter notebooks | GitHub (optionally archived on Zenodo or Figshare or Dryad) |
| Figure code | Commented code for recreating figures (R, etc.) | GitHub (optionally archived on Zenodo or Figshare or Dryad) |
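
Several rows above call for tab-delimited text. As a minimal sketch of producing that format with Python's standard `csv` module, the example below serializes a few quantitative PCR results. The column names (`samp_name`, `target`, `cq`) and the values are hypothetical; they are not a required NCEI schema.

```python
import csv
import io

# Hypothetical qPCR results; column names and values are illustrative only.
rows = [
    {"samp_name": "S1", "target": "assay_A", "cq": "28.4"},
    {"samp_name": "S2", "target": "assay_A", "cq": "31.9"},
]


def to_tsv(rows, fieldnames):
    """Serialize a list of dicts as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


tsv_text = to_tsv(rows, ["samp_name", "target", "cq"])
print(tsv_text)
```

To write a file instead of a string, the same writer can be pointed at a file opened with `open(path, "w", newline="")`.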