Omics Data Guidelines#
Omics datasets should be submitted to the relevant long-term data repository in accordance with the publication requirements of the field of study. While NCEI can be used to curate some small omics datasets (< ~20 GB), it is not easily searchable and does not support critically important interactive querying (e.g., BLAST), so it should not be the sole repository for omics data if other options apply. Raw data (e.g., FASTQ files from the sequencing center) should be submitted to a repository for proper archiving. Data analysis products (e.g., MAG/genome assemblies) are useful to the scientific community and should also be submitted to relevant repositories. We offer guidelines for archival locations for omics datasets below and in Table 1.
Repositories#
Recommended destinations for different data types
Table 1 - Data repositories#
Suggested formats and destination repositories for common environmental omics datasets.
Data type | Data formats (non-exhaustive) | Repository |
---|---|---|
DNA reference sequences | GenBank format | NCBI GenBank |
DNA sequence data (amplicon, metagenomic, RAD-Seq) | Raw FASTQ | NCBI SRA |
Amplicon Sequence Variants | Reference FASTA | OBIS/GBIF, NCBI |
RNA sequence data (RNA-Seq) | Raw FASTQ | NCBI SRA |
Functional genomics data (quantitative gene expression, ChIP-Seq, HiC-seq, methylation seq) | Metadata, processed data (e.g., raw read counts), raw FASTQ | NCBI GEO (raw data submitted to NCBI SRA for you) |
RNA transcript assemblies | FASTA or SQN file | NCBI TSA |
Genome assemblies | FASTA or SQN file, optional AGP file to orient scaffolds | NCBI WGS |
Quantitative PCR data | Tab-delimited text | NCEI |
Mass spectrometry data (metabolomics, proteomics) | Raw mass spectra, mzML, mzID | ProteomeXchange, Metabolomics Workbench |
Feature observation tables and feature metadata | BIOM (HDF5) format (feature observation tables), tab-delimited text (feature metadata) | NCEI, Zenodo, or Figshare |
Reference database | FASTA (sequences) and TSV (taxonomy) | Custom public server with DOIs, or repositories such as Zenodo, Figshare, or Dryad |
Notes on data formats:
- Tab-delimited text files should contain a single row at the top containing column headers. Column headers should be written in camelCase or with_underscores (no spaces) with units included in the column headers.
- Tab-delimited text files can be created as Microsoft Excel or Google Sheets files, edited, and then saved as tab-delimited text (.txt, Excel) or tab-separated values (.tsv, Sheets), which are the same except for the extension.
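The header conventions in the notes above can be checked before submission. The following is a minimal sketch, not an official validator: the function name `check_headers` and the regular expression are illustrative assumptions, and the check cannot confirm that units are actually included in a header, which still needs a manual review.

```python
import csv
import io
import re

# Accepts camelCase (e.g. temperatureC) or with_underscores (e.g. depth_m).
# This pattern is an illustrative assumption, not an official NCEI rule.
HEADER_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9]*(_[A-Za-z0-9]+)*$")


def check_headers(tsv_text):
    """Return a list of problems found in the header row of tab-delimited text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    headers = next(reader, [])  # the first row is expected to hold column headers
    problems = []
    if not headers:
        problems.append("file has no header row")
    for name in headers:
        if not HEADER_PATTERN.match(name):
            problems.append(f"invalid column header: {name!r}")
    if len(set(headers)) != len(headers):
        problems.append("duplicate column headers")
    return problems


good = "sample_id\tdepth_m\ttemperatureC\nS1\t5\t12.3\n"
bad = "sample id\tdepth (m)\nS1\t5\n"
print(check_headers(good))  # no problems expected
print(check_headers(bad))   # both headers contain spaces
```

An empty list means the header row follows the naming convention; any other result lists the headers to rename.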
Storage & Backup#
Best practices for storage and backup of 'omics data should ensure that the data files are associated with their metadata and backed up securely.
Raw data#
Upon receipt of raw sequencing data, FASTQ files should be immediately stored in two locations (e.g., a local drive for analyses and the NOAA Google Drive). The location of these files and their metadata (e.g., file name, associated project, sequence submission date) should be recorded in a Google Sheets spreadsheet. This spreadsheet should be backed up on a regular basis by downloading it as a tab-delimited file and saving it to an external drive.
Storage locations of raw sequence data should have a regularly scheduled backup plan (e.g., RAID on server drives, external backup of Google Drive).
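As one way to make the two-location rule auditable, the sketch below copies a FASTQ file to each destination, verifies every copy with a SHA-256 checksum, and appends a row to a tab-delimited manifest. It assumes both destinations are reachable as local folders (for example, a locally synced Drive folder); the function names, manifest columns, and the use of the recording date are hypothetical choices, not part of the guidelines.

```python
import csv
import hashlib
import shutil
from datetime import date
from pathlib import Path


def sha256sum(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_and_record(fastq, destinations, manifest, project):
    """Copy a FASTQ file to each destination, verify it, and log it in a manifest."""
    fastq = Path(fastq)
    checksum = sha256sum(fastq)
    copies = []
    for dest in destinations:
        dest = Path(dest)
        dest.mkdir(parents=True, exist_ok=True)
        target = dest / fastq.name
        shutil.copy2(fastq, target)
        # Re-hash the copy so a corrupted transfer is caught immediately.
        if sha256sum(target) != checksum:
            raise IOError(f"checksum mismatch for {target}")
        copies.append(str(target))

    manifest = Path(manifest)
    is_new = not manifest.exists()
    with open(manifest, "a", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        if is_new:
            # Hypothetical column names, following the with_underscores convention.
            writer.writerow(["file_name", "project", "date_recorded", "sha256", "locations"])
        writer.writerow(
            [fastq.name, project, date.today().isoformat(), checksum, ";".join(copies)]
        )
    return checksum
```

Calling `store_and_record("run1_R1.fastq.gz", ["/data/raw", "/mnt/drive/raw"], "manifest.tsv", "my_project")` (all paths hypothetical) would leave two verified copies and one manifest row. The stored checksum can later be used to confirm that a backup is still intact.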
Intermediate and processed data files#
Depending on your 'omics method, you will generate various types of intermediate and final processed files. For files produced by analyses that are computationally intensive or time-consuming (e.g., trimmed FASTQ files, ASVs, assembled RADseq loci), it is a good idea to back them up in a second location until the project is completed or the files are uploaded to a repository such as Dryad, Zenodo, or Figshare.
Archiving and cross-linking#
All NOAA 'omics projects that are eligible for NCEI should include a project submission to the National Centers for Environmental Information (NCEI) and provide a README file listing where all products of that project have been submitted. This file should contain a description of the data and, for each product, a link to a persistent digital object identifier (DOI) or NCBI accession number. It should include links to all raw data, metadata, data analysis products, and code used for the 'omics project, including a self-referential link to the NCEI project submission where the README file is found.
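The README contents described above can be assembled programmatically so that no category of link is forgotten. This is only a sketch under stated assumptions: the layout, the function name `build_readme`, and the category labels are hypothetical, and the angle-bracket values in the usage note are placeholders to be replaced with real DOIs and accession numbers.

```python
# Categories of links the README is expected to cover, taken from the text above.
REQUIRED_CATEGORIES = ("raw data", "metadata", "data analysis products", "code")


def build_readme(title, description, ncei_link, resources):
    """Build README text from (category, label, link) tuples.

    Raises ValueError if any required category has no entry.
    """
    present = {category for category, _, _ in resources}
    missing = [c for c in REQUIRED_CATEGORIES if c not in present]
    if missing:
        raise ValueError("README is missing links for: " + ", ".join(missing))

    lines = [title, "", description, ""]
    # Self-referential link to the NCEI submission that holds this README.
    lines.append(f"NCEI project submission: {ncei_link}")
    for category in REQUIRED_CATEGORIES:
        lines.append("")
        lines.append(category.capitalize() + ":")
        for cat, label, link in resources:
            if cat == category:
                lines.append(f"- {label}: {link}")
    return "\n".join(lines) + "\n"
```

For example, passing `resources=[("raw data", "Amplicon reads", "<NCBI SRA accession>"), ("metadata", "Sample metadata", "<DOI>"), ("data analysis products", "ASV table", "<DOI>"), ("code", "Analysis scripts", "<DOI>")]` yields a plain-text README with one section per category; omitting any category raises an error instead of silently producing an incomplete file.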
The NCEI accession number can be cross-listed in other repositories containing the project's data. For example, you can provide a link to the NCEI project in the "Related Resources" section of an NCBI BioProject.
An additional organized and accessible data archive can further support reproducibility of a project. This archive can be hosted on Figshare, Dryad, or GitHub and archived with Zenodo. It should include cross-listed repository links or files for all raw data, metadata, data analysis products, code, and any research products (manuscripts, figures, etc.). Optionally, it can also include details on lab or field protocols.
The NOAA Omics Technical Portal GitHub organization contains links to repositories and datasets generated by NOAA Omics.