Study Data Templates#
Use the Table of Contents on the left to navigate to relevant sections for your 'omics data types!
NOAA Omics Study Data Templates#
A new NOAA Omics study data template was developed based on feedback from NOAA partners at OAR and the NOAA Omics Data and Bioinformatics Supergroup. This template incorporates data standards from MIxS, Darwin Core, and custom NOAA-recommended fields to facilitate data management of eDNA survey samples, from project initiation through data submission. For guidance on using the template, check out the template's README page or the documentation wiki. Additional templates are in development to cover other data types and environments. If you are interested in developing a NOAA Omics template for your data/environment type, please reach out to katherine.silliman@noaa.gov!
- NOAA_MIMARKS.survey.water.template: use for amplicon and/or metagenomic data from water environmental samples
- Filled out example NOAA_MIMARKS template
- NOAA_MIMARKS.survey.host-associated.template: use for amplicon and/or metagenomic data from host-associated samples
- NOAA_MIMARKS.survey.sediment.template: use for amplicon and/or metagenomic data from sediment samples
Other templates for DNA/RNA sequence data#
While the templates below provide some information on metadata formatting and support the minimum metadata required for submission to NCBI, we provide additional formatting guidance and recommended custom metadata fields on the Metadata Guidelines page.
Sample metadata templates#
Genomic Standards Consortium (GSC) Minimal Information about any (x) Sequence (MIxS) templates are the standard for sample metadata, which includes information about the primary sample: when it was collected (e.g., date and time), where it was collected from (e.g., latitude, longitude, elevation/depth, site name, country, etc.), what kind of sample it was (e.g., soil, seawater, feces), and the properties of the environment during collection (e.g., temperature, salinity, pH) or experimental condition (e.g., experimental or control, disease state) from which the sample was taken.
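As a rough illustration of the "when, where, and what" fields described above, the sketch below checks one sample record before it goes into a template. The column names (`sample_name`, `collection_date`, `lat_lon`, `geo_loc_name`, `env_medium`) follow MIxS-style terms, but the required set, the `check_sample` helper, and the example values are assumptions made for this example; the template or package you choose defines the actual requirements, and repositories may accept more date and coordinate formats than this sketch does.

```python
import re
from datetime import date

# Illustrative required columns; the real set depends on the chosen template.
REQUIRED_FIELDS = ("sample_name", "collection_date", "lat_lon", "geo_loc_name", "env_medium")

# Matches strings such as "25.73 N 80.16 W" (decimal degrees plus hemisphere).
LAT_LON = re.compile(r"^(\d+(?:\.\d+)?) ([NS]) (\d+(?:\.\d+)?) ([EW])$")


def check_sample(row):
    """Return a list of problems found in one sample metadata record."""
    problems = []

    # Every required field must be present and non-blank.
    for field in REQUIRED_FIELDS:
        if not str(row.get(field, "")).strip():
            problems.append(f"missing value for {field}")

    # This sketch only accepts ISO 8601 calendar dates such as 2023-06-15.
    collected = str(row.get("collection_date", "")).strip()
    if collected:
        try:
            date.fromisoformat(collected)
        except ValueError:
            problems.append(f"unparseable collection_date: {collected}")

    # Coordinates must match the pattern and fall within valid ranges.
    lat_lon = str(row.get("lat_lon", "")).strip()
    if lat_lon:
        match = LAT_LON.match(lat_lon)
        if match is None:
            problems.append(f"lat_lon not in 'DD.DD N DD.DD W' form: {lat_lon}")
        elif float(match.group(1)) > 90 or float(match.group(3)) > 180:
            problems.append(f"lat_lon out of range: {lat_lon}")

    return problems


# Hypothetical sample record used only to exercise the checks.
sample = {
    "sample_name": "station01_surface",
    "collection_date": "2023-06-15",
    "lat_lon": "25.73 N 80.16 W",
    "geo_loc_name": "USA: Florida",
    "env_medium": "sea water",
}
print(check_sample(sample))  # an empty list means no problems were found
```

Running a check like this over every row before upload can catch blank cells and malformed dates early, but it does not replace the validation performed by the repository itself.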
Metadata input templates:
- NCBI provides a useful link to download MIxS sample metadata templates based on your sequence data type and sample environment (known as 'packages'). These templates will be appropriate for the majority of NOAA 'Omics projects that generate DNA/RNA sequence data, and can be used to generate NCBI BioSamples. The NOAA Omics study data template includes a `sample_data` sheet that can be used for submission to NCBI BioSample.
- The National Microbiome Data Collaborative (NMDC) maintains the NMDC Submission Portal, which allows metadata to be entered with real-time validation. The submission portal supports several community standards, such as the MIxS standard from GSC, the PROV standard for provenance metadata, the Proteomics Standards Initiative (PSI) standards for metaproteomics, and the Metabolomics Standards Initiative (MSI) standards for metabolomics.
A guide to choosing the right metadata package given your 'omics data type is below:
Table 1. Suggested MIxS templates for common environmental omics datatypes.
Data type | Description | Metadata package |
---|---|---|
amplicon survey | Use for any type of marker gene sequences (e.g., 16S, 18S, 23S, 28S rRNA, or COI) obtained directly from the environment, without culturing or identification of the organisms. | MIMARKS Survey |
metagenome | Use for environmental and metagenome sequences. | MIMS Environmental/Metagenome |
metagenome-assembled genome | Use for metagenome-assembled genome sequences produced using computational binning tools that group sequences into individual organism genome assemblies starting from metagenomic data sets. | MIMAG Metagenome-assembled Genome |
single amplified genome | Use for single amplified genome sequences produced by isolating individual cells, amplifying the genome of each cell using whole genome amplification, and then sequencing the amplified DNA. | MISAG Single Amplified Genome |
uncultivated virus genome | Use for uncultivated virus genomes identified in metagenome and metatranscriptome datasets. | MIUVIG Uncultivated Virus Genome |
amplicon specimen | Use for any type of marker gene sequences (e.g., 16S, 18S, 23S, 28S rRNA, or COI) obtained from cultured or voucher-identifiable specimens. | MIMARKS Specimen |
cultured bacteria or archaea | Use for cultured bacterial or archaeal genomic sequences. | MIGS Cultured Bacterial/Archaeal |
viral genome | Use for virus genomic sequences. | MIGS Viral |
eukaryotic genome | Use for eukaryotic genomic sequences. | MIGS Eukaryotic |
qPCR or ddPCR or rt-PCR | Use for any type of real time PCR, quantitative PCR (qPCR), or digital PCR. | MIQE, RDML, & dMIQE |
For most NOAA 'Omics projects, the water or sediment environmental packages will be appropriate.
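The sequence-data rows of Table 1 can be captured as a small lookup, which is sometimes handy when a script needs to pick a template automatically. This is a minimal sketch: the `suggest_package` helper and the default `"water"` environment are assumptions for illustration, and the returned strings are the labels from Table 1 rather than official package identifiers.

```python
# Data type -> metadata package, copied from the sequence-data rows of Table 1.
PACKAGE_BY_DATA_TYPE = {
    "amplicon survey": "MIMARKS Survey",
    "metagenome": "MIMS Environmental/Metagenome",
    "metagenome-assembled genome": "MIMAG Metagenome-assembled Genome",
    "single amplified genome": "MISAG Single Amplified Genome",
    "uncultivated virus genome": "MIUVIG Uncultivated Virus Genome",
    "amplicon specimen": "MIMARKS Specimen",
    "cultured bacteria or archaea": "MIGS Cultured Bacterial/Archaeal",
    "viral genome": "MIGS Viral",
    "eukaryotic genome": "MIGS Eukaryotic",
}


def suggest_package(data_type, environment="water"):
    """Return (metadata package, environmental package) for a data type.

    The environment defaults to "water" only because water and sediment
    are the most common choices for NOAA 'Omics projects.
    """
    key = data_type.strip().lower()
    if key not in PACKAGE_BY_DATA_TYPE:
        raise KeyError(f"no suggested package for data type: {data_type!r}")
    return PACKAGE_BY_DATA_TYPE[key], environment


print(suggest_package("amplicon survey"))
print(suggest_package("metagenome", environment="sediment"))
```

Data types outside the table (for example, qPCR) deliberately raise an error here, since they follow different standards and are covered in their own sections.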
Preparation metadata templates#
Preparation metadata is directly related to the preparation of the biomaterial undergoing the 'omics assay and the process of performing the assay. A primary sample could be split (aliquoted) and processed through multiple preparation methods; therefore, there could be multiple sets of preparation metadata for a single set of samples.
NCBI repositories (e.g., SRA, GenBank) provide some templates for the minimum required preparation metadata, while in other cases they require interactive user input. We recommend submitting your sample metadata and generating BioSample accession IDs first, although you can do both steps at the same time. The NOAA Omics study data template includes a `prep_data` sheet that can be used for submission to NCBI SRA.
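Because one primary sample can feed several preparations, sample and preparation metadata form a one-to-many relationship. The sketch below shows one way to check that relationship before submission; the column names (`sample_name`, `library_id`, `target_gene`), the example records, and the `group_preps_by_sample` helper are hypothetical and are not taken from any specific template.

```python
# Hypothetical sample records (one row per primary sample).
samples = [
    {"sample_name": "S1", "env_medium": "sea water"},
    {"sample_name": "S2", "env_medium": "sea water"},
]

# Hypothetical preparation records (possibly several rows per sample).
preps = [
    {"library_id": "S1_16S", "sample_name": "S1", "target_gene": "16S rRNA"},
    {"library_id": "S1_18S", "sample_name": "S1", "target_gene": "18S rRNA"},
    {"library_id": "S2_16S", "sample_name": "S2", "target_gene": "16S rRNA"},
    {"library_id": "S9_16S", "sample_name": "S9", "target_gene": "16S rRNA"},
]


def group_preps_by_sample(samples, preps):
    """Group library IDs under their sample and report unmatched preps."""
    grouped = {s["sample_name"]: [] for s in samples}
    orphans = []
    for prep in preps:
        name = prep["sample_name"]
        if name in grouped:
            grouped[name].append(prep["library_id"])
        else:
            # The prep row points at a sample that has no sample metadata.
            orphans.append(prep["library_id"])
    return grouped, orphans


grouped, orphans = group_preps_by_sample(samples, preps)
print(grouped)
print(orphans)
```

Any library listed in `orphans` refers to a sample that is missing from the sample sheet, which would need to be resolved before the two sheets can be linked through BioSample accessions.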
High-throughput sequencing data (SRA)
Projects using high-throughput sequencing data (e.g., amplicon, metagenomic, RNA-Seq, RAD-Seq) can use the NCBI SRA template.
Sanger sequencing
Projects that generate sequences without high-throughput sequencing (e.g., single-gene Sanger sequencing) can use the NCBI GenBank template.
Other omics data types#
For NOAA Omics projects that generate biological data other than DNA/RNA sequencing:
Targeted quantitative surveys (qPCR, ddPCR, rt-PCR)#
Projects that generate data with real-time PCR, qPCR, or dPCR can use the Minimum Information for Publication of Quantitative Real-Time PCR Experiments (MIQE) Real-time PCR Data Markup Language (RDML) template.
Additional resources for best practices:
1. Environmental Microbiology Minimum Information (EMMI) Guidelines (Borchardt et al. 2021)
2. The MIQE guidelines: minimum information for publication of quantitative real-time PCR experiments (Bustin et al. 2009)
3. Guidance on the Use of Targeted Environmental DNA (eDNA) Analysis for the Management of Aquatic Invasive Species and Species at Risk, from the Canadian Science Advisory Secretariat (Abbott et al. 2021)
4. Best Practices in qPCR and dPCR Validation in Regulated Bioanalytical Laboratories, from the American Association of Pharmaceutical Scientists Workshop (Hays et al. 2022)
5. Sanders et al. 2018
6. Langlois et al. 2021
Proteomics#
Sample Data | Required? | Definition or Example | Recommended Format | Repository |
---|---|---|---|---|
MS data | Y | Original proprietary files provided by the instruments used in the study (e.g., Thermo RAW) | mzML; Controlled vocabulary: MS ontology; File formatting details: PRIDE | PRIDE |
Sequencing data | N | Amino acid sequences, Whole genome sequences, RNA seq, Whole Exome Sequences | FASTA, FASTQ | MassIVE, PRIDE (as optional data), NCBI SRA |
Other options for repositories, as well as general data submission guidelines, can be found on the ProteomeXchange website.
Metabolomics#
Sample Data | Required? | Definition or Example | Recommended Format | Repository |
---|---|---|---|---|
Raw NMR or MS data | Y | NMR: can be free induction decay (FID) or Fourier transformed (FT); should also include instrument and software versions. | Open Source Formats (mzML, mzXML, CDF) | Metabolomics Workbench |
Sequencing Data | N | Whole genome, Amplicon, Transcriptome | FASTA, FASTQ | NCBI SRA |
Formats for processed omics data#
If your 'omics data is processed using bioinformatics, the resulting file(s) from those analyses should also be archived. Below are suggested formats and destination repositories for common environmental 'omics datasets.
Table 2. Suggested formats and destination repositories for common environmental omics datasets. Please note that, although NOAA's Coral Reef Information System (CoRIS) is the preferred venue for archiving NOAA-funded coral reef data, all CoRIS submissions are handled by NCEI.
Data type | Data formats (non-exhaustive) | Repository |
---|---|---|
DNA reference sequences | GenBank format | NCBI GenBank |
DNA sequence data (amplicon, metagenomic, RAD-Seq) | Raw FASTQ | NCBI SRA |
Amplicon Sequence Variants | Reference FASTA | GBIF/OBIS, or directly to [NCEI](https://www.ncei.noaa.gov/archive) |
RNA sequence data (RNA-Seq) | Raw FASTQ | NCBI SRA |
Functional genomics data (quantitative gene expression, ChIP-Seq, HiC-seq, methylation seq) | Metadata, processed data (e.g., raw read counts), SRA accessions | NCBI GEO |
RNA transcript assemblies | FASTA or SQN file | NCBI TSA |
Genome assemblies | FASTA or SQN file, optional AGP file to orient scaffolds | NCBI WGS |
Quantitative PCR data | Tab-delimited text | NCEI |
Mass spectrometry data (metabolomics, proteomics) | Raw mass spectra, mzML, mzIdentML | ProteomeXchange, Metabolomics Workbench |
Coral reef data | Tab-delimited text, HDF, or netCDF (less preferable) | CoRIS (via NCEI) |
Feature observation tables and feature metadata | BIOM (HDF5) format (feature observation tables), tab-delimited text (feature metadata) | [GBIF/OBIS](https://github.com/aomlomics/edna2obis) or directly to [NCEI](https://www.ncei.noaa.gov/archive) (size permitting), Zenodo, or Figshare |
Reference database | FASTA (sequences) and TSV (taxonomy) | Zenodo, Figshare, or Dryad |
Analysis code | Commented code and Jupyter notebooks | GitHub (optionally archived on Zenodo, Figshare, or Dryad) |
Figure code | Commented code for recreating figures (R, etc.) | GitHub (optionally archived on Zenodo, Figshare, or Dryad) |
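When archiving a reference database as a FASTA file plus a TSV taxonomy file, as suggested in Table 2, it can help to confirm that the two files describe the same set of sequences. The sketch below is one way to do that with the Python standard library. It assumes the sequence ID is the first word of each FASTA header and the first column of a header-less TSV; the helper names and the example records are hypothetical.

```python
import csv
import io


def fasta_ids(handle):
    """Collect sequence IDs (first word after '>') from a FASTA stream."""
    ids = []
    for line in handle:
        if line.startswith(">") and len(line.strip()) > 1:
            ids.append(line[1:].split()[0])
    return ids


def taxonomy_ids(handle):
    """Collect IDs from the first column of a header-less TSV stream."""
    reader = csv.reader(handle, delimiter="\t")
    return [row[0] for row in reader if row]


def compare_ids(fasta_handle, taxonomy_handle):
    """Return (IDs lacking taxonomy, taxonomy rows lacking a sequence)."""
    seq_ids = set(fasta_ids(fasta_handle))
    tax_ids = set(taxonomy_ids(taxonomy_handle))
    return sorted(seq_ids - tax_ids), sorted(tax_ids - seq_ids)


# In-memory stand-ins for the two files, with made-up records.
fasta_text = ">ref001 example record\nACGTACGT\n>ref002\nTTGACCA\n"
taxonomy_text = "ref001\tEukaryota;Chordata\nref003\tBacteria;Proteobacteria\n"

missing_taxonomy, missing_sequence = compare_ids(
    io.StringIO(fasta_text), io.StringIO(taxonomy_text)
)
print(missing_taxonomy)
print(missing_sequence)
```

With real files, the `io.StringIO` objects would be replaced by handles from `open(...)`; both lists being empty indicates that every sequence has a taxonomy row and vice versa.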