Study Data Templates#

Use the Table of Contents on the left to navigate to relevant sections for your 'omics data types!

NOAA Omics Study Data Templates#

A new NOAA Omics study data template was developed based on feedback from NOAA partners at OAR and the NOAA Omics Data and Bioinformatics Supergroup. This template incorporates data standards from MIxS, Darwin Core, and custom recommended NOAA fields to facilitate data management of eDNA survey samples, from project initiation through data submission. For guidance on using the template, check out the template's README page or the documentation wiki. Additional templates are in development to cover other data types and environments. If you are interested in developing a NOAA Omics template for your data/environment type, please reach out to katherine.silliman@noaa.gov!

NOAA_MIMARKS.survey.water.template: use for amplicon and/or metagenomic data from water environmental samples
Filled out example NOAA_MIMARKS template
NOAA_MIMARKS.survey.host-associated.template: use for amplicon and/or metagenomic data from host-associated samples
NOAA_MIMARKS.survey.sediment.template: use for amplicon and/or metagenomic data from sediment samples

Other templates for DNA/RNA sequence data#

While the templates below provide some information on metadata formatting and support the minimum metadata required for submission to NCBI, we provide additional formatting guidance and recommended custom metadata fields on the Metadata Guidelines page.

Sample metadata templates#

Genomic Standards Consortium (GSC) Minimal Information about any (x) Sequence (MIxS) templates are the standard for sample metadata, which includes information about the primary sample: when it was collected (e.g., date and time), where it was collected from (e.g., latitude, longitude, elevation/depth, site name, country, etc.), what kind of sample it was (e.g., soil, seawater, feces), and the properties of the environment during collection (e.g., temperature, salinity, pH) or experimental condition (e.g., experimental or control, disease state) from which the sample was taken.
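As a rough illustration of the when/where/what information described above, the sketch below checks a single sample record before it is entered into a template. It is a minimal example, not part of any NOAA or NCBI tool: `samp_name`, `collection_date`, and `env_medium` are MIxS term names and `decimalLatitude`/`decimalLongitude` are Darwin Core term names, but the choice of required fields and the `check_sample` helper itself are assumptions made for illustration.

```python
from datetime import datetime

# Assumed set of required fields, for illustration only. Check the template
# or package you are using for its actual required fields.
REQUIRED_FIELDS = ("samp_name", "collection_date", "decimalLatitude",
                   "decimalLongitude", "env_medium")


def check_sample(record):
    """Return a list of problems found in one sample metadata record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not str(record.get(field, "")).strip():
            problems.append(f"missing {field}")

    # Dates are expected in ISO 8601 form, e.g. 2021-06-15 or 2021-06-15T14:30:00.
    date = str(record.get("collection_date", "")).strip()
    if date:
        try:
            datetime.fromisoformat(date)
        except ValueError:
            problems.append(f"collection_date is not ISO 8601: {date}")

    # Coordinates are expected in decimal degrees within valid ranges.
    for field, limit in (("decimalLatitude", 90.0), ("decimalLongitude", 180.0)):
        value = str(record.get(field, "")).strip()
        if not value:
            continue
        try:
            number = float(value)
        except ValueError:
            problems.append(f"{field} is not a number: {value}")
            continue
        if abs(number) > limit:
            problems.append(f"{field} out of range: {value}")
    return problems


# Hypothetical example record.
sample = {
    "samp_name": "station1_surface",
    "collection_date": "2021-06-15T14:30:00",
    "decimalLatitude": "25.7",
    "decimalLongitude": "-80.2",
    "env_medium": "sea water",
}
print(check_sample(sample))
```

Running a check like this on every row before submission can catch formatting problems early, although it does not replace the validation performed by the repository itself.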

Metadata input templates:

NCBI provides a useful link to download MIxS sample metadata templates based on your sequence data type and sample environment (known as 'packages'). These templates will be appropriate for the majority of NOAA 'Omics projects that generate DNA/RNA sequence data, and can be used to generate NCBI BioSamples. The NOAA Omics study data template includes a 'sample_data' sheet that can be used for submission to NCBI BioSample.
The National Microbiome Data Collaborative (NMDC) maintains the NMDC Submission Portal, which allows inputting metadata with real-time validation. The submission portal supports several different community standards, such as the MIxS standard from GSC, the PROV standard for provenance metadata, the Proteomics Standards Initiative (PSI) standards for metaproteomics, and the Metabolomics Standards Initiative (MSI) standards for metabolomics.

A guide to choosing the right metadata package given your 'omics data type is below:

Table 1. Suggested MIxS templates for common environmental omics datatypes.

Data type	Description	Metadata package
amplicon survey	Use for any type of marker gene sequences, e.g., 16S, 18S, 23S, 28S rRNA or COI, obtained directly from the environment, without culturing or identification of the organisms.	MIMARKS Survey
metagenome	Use for environmental and metagenome sequences.	MIMS Environmental/Metagenome
metagenome-assembled genome	Use for metagenome-assembled genome sequences produced using computational binning tools that group sequences into individual organism genome assemblies starting from metagenomic data sets.	MIMAG Metagenome-assembled Genome
single amplified genome	Use for single amplified genome sequences produced by isolating individual cells, amplifying the genome of each cell using whole genome amplification, and then sequencing the amplified DNA.	MISAG Single Amplified Genome
uncultivated virus genome	Use for uncultivated virus genomes identified in metagenome and metatranscriptome datasets.	MIUVIG Uncultivated Virus Genome
amplicon specimen	Use for any type of marker gene sequences, e.g., 16S, 18S, 23S, 28S rRNA or COI, obtained from cultured or voucher-identifiable specimens.	MIMARKS Specimen
cultured bacteria or archaea	Use for cultured bacterial or archaeal genomic sequences.	MIGS Cultured Bacterial/Archaeal
viral genome	Use for virus genomic sequences.	MIGS Viral
eukaryotic genome	Use for eukaryotic genomic sequences.	MIGS Eukaryotic
qPCR or ddPCR or rt-PCR	Use for any type of real time PCR, quantitative PCR (qPCR), or digital PCR.	MIQE, RDML, & dMIQE

For most NOAA 'Omics projects, the water or sediment environmental packages will be appropriate.
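The suggestions in Table 1 can be sketched as a simple lookup, which may be handy when sorting many datasets into packages. This is only an illustration: the dictionary copies the sequence-based rows of Table 1, and the `suggest_package` helper is a hypothetical name, not part of any NCBI or NOAA tool.

```python
# Data type -> suggested metadata package, copied from Table 1
# (sequence-based rows only).
MIXS_PACKAGE_BY_DATA_TYPE = {
    "amplicon survey": "MIMARKS Survey",
    "metagenome": "MIMS Environmental/Metagenome",
    "metagenome-assembled genome": "MIMAG Metagenome-assembled Genome",
    "single amplified genome": "MISAG Single Amplified Genome",
    "uncultivated virus genome": "MIUVIG Uncultivated Virus Genome",
    "amplicon specimen": "MIMARKS Specimen",
    "cultured bacteria or archaea": "MIGS Cultured Bacterial/Archaeal",
    "viral genome": "MIGS Viral",
    "eukaryotic genome": "MIGS Eukaryotic",
}


def suggest_package(data_type):
    """Return the Table 1 package for a data type, ignoring case and padding."""
    key = data_type.strip().lower()
    if key not in MIXS_PACKAGE_BY_DATA_TYPE:
        known = ", ".join(sorted(MIXS_PACKAGE_BY_DATA_TYPE))
        raise ValueError(f"unknown data type {data_type!r}; expected one of: {known}")
    return MIXS_PACKAGE_BY_DATA_TYPE[key]


print(suggest_package("Amplicon survey"))
```

The lookup only picks the package type; the environmental package (for example, water or sediment) still has to be chosen separately.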

Preparation metadata templates#

Preparation metadata is directly related to the preparation of the biomaterial undergoing the 'omics assay and the process of performing the assay. A primary sample could be split (aliquoted) and processed through multiple preparation methods; therefore, there could be multiple sets of preparation metadata for a single set of samples.
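The one-to-many relationship between samples and preparations can be sketched as two small tables linked by a shared sample name. The rows below are hypothetical, and the column names (`samp_name`, `library_id`, `target_gene`) are chosen for illustration; they are assumptions and may not match the columns of any particular template.

```python
from collections import defaultdict

# Hypothetical sample rows: one row per primary sample.
samples = [
    {"samp_name": "station1_surface", "env_medium": "sea water"},
    {"samp_name": "station2_surface", "env_medium": "sea water"},
]

# Hypothetical preparation rows: one row per assay run on a sample aliquot.
preps = [
    {"samp_name": "station1_surface", "library_id": "st1_16S", "target_gene": "16S rRNA"},
    {"samp_name": "station1_surface", "library_id": "st1_18S", "target_gene": "18S rRNA"},
    {"samp_name": "station2_surface", "library_id": "st2_16S", "target_gene": "16S rRNA"},
]


def group_preps_by_sample(sample_rows, prep_rows):
    """Map each sample name to the library IDs prepared from it."""
    known = {row["samp_name"] for row in sample_rows}
    grouped = defaultdict(list)
    for prep in prep_rows:
        name = prep["samp_name"]
        if name not in known:
            # A preparation row must always point at an existing sample.
            raise ValueError(f"prep {prep['library_id']} refers to unknown sample {name}")
        grouped[name].append(prep["library_id"])
    return dict(grouped)


for name, libraries in group_preps_by_sample(samples, preps).items():
    print(name, libraries)
```

Keeping sample and preparation metadata in separate tables avoids repeating the sample's collection details for every assay performed on it.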

NCBI repositories (e.g., SRA, GenBank) provide some templates for the minimum required preparation metadata, while in other cases they require interactive user input. We recommend submitting your sample metadata and generating BioSample accession IDs first, although you can do both steps at the same time. The NOAA Omics study data template includes a 'prep_data' sheet that can be used for submission to NCBI SRA.

High-throughput sequencing data (SRA)

Projects using high-throughput sequencing data (e.g., amplicon, metagenomic, RNA-Seq, RAD-Seq) can use the NCBI SRA template.

Sanger sequencing

Projects that generate sequences without high-throughput sequencing (e.g., single-gene Sanger sequencing) can use the NCBI GenBank template.

Other omics data types#

For NOAA Omics projects that generate biological data other than DNA/RNA sequencing:

Targeted quantitative surveys (qPCR, ddPCR, rt-PCR)#

Projects that generate data with real-time PCR, qPCR, or dPCR can use the Minimum Information for Publication of Quantitative Real-Time PCR Experiments (MIQE) Real-time PCR Data Markup Language (RDML) template.

Additional resources for best practices:

1. Environmental Microbiology Minimum Information (EMMI) Guidelines (Borchardt et al. 2021)
2. The MIQE guidelines: minimum information for publication of quantitative real-time PCR experiments (Bustin et al. 2009)
3. Guidance on the Use of Targeted Environmental DNA (eDNA) Analysis for the Management of Aquatic Invasive Species and Species at Risk, from the Canadian Science Advisory Secretariat (Abbott et al. 2021)
4. Best Practices in qPCR and dPCR Validation in Regulated Bioanalytical Laboratories, from the American Association of Pharmaceutical Scientists Workshop (Hays et al. 2022)
5. Sanders et al. 2018
6. Langlois et al. 2021

Proteomics#

Sample Data	Required?	Definition or Example	Recommended Format	Repository
MS data	Y	Original proprietary files provided by the instruments used in the study (e.g., Thermo RAW)	mzML; Controlled vocabulary: MS ontology; File formatting details: PRIDE	PRIDE
Sequencing data	N	Amino acid sequences, Whole genome sequences, RNA seq, Whole Exome Sequences	FASTA, FASTQ	MassIVE, PRIDE (as optional data), NCBI SRA

Other options for repositories, as well as general data submission guidelines, can be found on the ProteomeXchange website.

Metabolomics#

Sample Data	Required?	Definition or Example	Recommended Format	Repository
Raw NMR or MS data	Y	NMR: can be free induction decay (FID) or Fourier-transformed (FT); should also include instrument and software versions.	Open Source Formats (mzML, mzXML, CDF)	Metabolomics Workbench
Sequencing Data	N	Whole genome, Amplicon, Transcriptome	FASTA, FASTQ	NCBI SRA

Formats for processed omics data#

If your 'omics data is processed using bioinformatics, the resulting file(s) from those analyses should also be archived. Below are suggested formats and destination repositories for common environmental 'omics datasets.

Table 2. Suggested formats and destination repositories for common environmental omics datasets. Please note that, although NOAA's Coral Reef Information System (CoRIS) is the preferred venue for archiving NOAA-funded coral reef data, all CoRIS submissions are handled by NCEI.

Data type	Data formats (non-exhaustive)	Repository
DNA reference sequences	GenBank format	NCBI GenBank
DNA sequence data (amplicon, metagenomic, RAD-Seq)	Raw FASTQ	NCBI SRA
Amplicon Sequence Variants	Reference FASTA	GBIF/OBIS, or directly to NCEI (https://www.ncei.noaa.gov/archive)
RNA sequence data (RNA-Seq)	Raw FASTQ	NCBI SRA
Functional genomics data (quantitative gene expression, ChIP-Seq, HiC-seq, methylation seq)	Metadata, processed data (e.g., raw read counts), SRA accessions	NCBI GEO
RNA transcript assemblies	FASTA or SQN file	NCBI TSA
Genome assemblies	FASTA or SQN file, optional AGP file to orient scaffolds	NCBI WGS
Quantitative PCR data	Tab-delimited text	NCEI
Mass spectrometry data (metabolomics, proteomics)	Raw mass spectra, mzML, mzIdentML (.mzid)	ProteomeXchange, Metabolomics Workbench
Coral reef data	Tab-delimited text, HDF, or netCDF (less preferable)	CoRIS (via NCEI)
Feature observation tables and feature metadata	BIOM (HDF5) format (feature observation tables), tab-delimited text (feature metadata)	GBIF/OBIS (https://github.com/aomlomics/edna2obis) or directly to NCEI (https://www.ncei.noaa.gov/archive) (size permitting), Zenodo, or Figshare
Reference database	FASTA (sequences) and TSV (taxonomy)	Zenodo, Figshare, or Dryad
Analysis code	Commented code and Jupyter notebooks	GitHub (optionally archived on Zenodo, Figshare, or Dryad)
Figure code	Commented code for recreating figures (R, etc.)	GitHub (optionally archived on Zenodo, Figshare, or Dryad)
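Two of the formats suggested in Table 2, reference FASTA for amplicon sequence variants and tab-delimited text for feature metadata, can be written with the Python standard library alone. The sketch below is a minimal illustration under assumed inputs: the ASV names, sequences, and column names (`featureid`, `taxonomy`) are hypothetical, and the helper functions are not part of any repository's tooling.

```python
import csv
import io


def write_fasta(sequences, handle, width=60):
    """Write a {name: sequence} mapping as FASTA, wrapping lines at `width`."""
    for name, seq in sequences.items():
        handle.write(f">{name}\n")
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")


def write_feature_metadata(rows, handle):
    """Write a list of dicts as tab-delimited text with a header row."""
    fieldnames = list(rows[0])  # column order taken from the first row
    writer = csv.DictWriter(handle, fieldnames=fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical ASVs and their metadata.
asvs = {"ASV_1": "ACGT" * 20, "ASV_2": "GGCATTAC"}
metadata = [
    {"featureid": "ASV_1", "taxonomy": "Bacteria"},
    {"featureid": "ASV_2", "taxonomy": "Archaea"},
]

# In-memory buffers stand in for files opened with open(path, "w", newline="").
fasta_out = io.StringIO()
tsv_out = io.StringIO()
write_fasta(asvs, fasta_out)
write_feature_metadata(metadata, tsv_out)
print(fasta_out.getvalue())
print(tsv_out.getvalue())
```

Using the same feature identifiers in the FASTA headers and in the metadata table keeps the two files linkable after they are archived.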