NCBI Datasets SARS-CoV-2 Data Package

Sequences, annotation and metadata for a set of SARS-CoV-2 GenBank genomes or proteins

The NCBI Datasets SARS-CoV-2 Data Package contains sequences and metadata for a set of requested SARS-CoV-2 GenBank genomes or proteins. The data package may include genome, coding sequence (CDS) and protein sequences in FASTA format, annotation in GBFF and GPFF formats, protein structures in PDB format, and a data report containing metadata in JSON Lines format.

Package Content

NCBI Datasets SARS-CoV-2 Genome Data Package

sars-cov-2/
|-- README.md
`-- ncbi_dataset
    `-- data
        |-- cds.fna
        |-- data_report.jsonl
        |-- dataset_catalog.json
        |-- genomic.fna
        |-- pdb
        |   `-- *.pdb
        |-- protein.faa
        |-- protein.gpff
        `-- virus_dataset.md

NCBI Datasets SARS-CoV-2 Protein Data Package

(Note: this package does not contain SARS-CoV-2 genome sequences.)

spike-protein/
|-- README.md
`-- ncbi_dataset
    `-- data
        |-- cds.fna
        |-- data_report.jsonl
        |-- dataset_catalog.json
        |-- pdb
        |   `-- *.pdb
        |-- protein.faa
        |-- protein.gpff
        `-- virus_dataset.md
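
After extracting a package, it can be useful to check which of the files shown in the trees above are actually present, since a package may include only some of them. The following is a minimal sketch, assuming the package was extracted to a local directory; the `inventory` helper and the `sars-cov-2` directory name in the usage line are illustrative and not part of any NCBI tool.

```python
from pathlib import Path

# File names copied from the directory trees above.
EXPECTED_FILES = [
    "cds.fna",
    "data_report.jsonl",
    "dataset_catalog.json",
    "genomic.fna",
    "protein.faa",
    "protein.gpff",
    "virus_dataset.md",
]


def inventory(package_root):
    """Map each expected file name to True if it exists under ncbi_dataset/data."""
    data_dir = Path(package_root) / "ncbi_dataset" / "data"
    found = {name: (data_dir / name).is_file() for name in EXPECTED_FILES}
    # The pdb directory is optional; report whether it holds any *.pdb file.
    pdb_dir = data_dir / "pdb"
    found["pdb/*.pdb"] = pdb_dir.is_dir() and any(pdb_dir.glob("*.pdb"))
    return found


if __name__ == "__main__":
    # Hypothetical extraction directory; adjust to the local path.
    for name, present in inventory("sars-cov-2").items():
        print(f"{name}\t{'present' if present else 'missing'}")
```

In a protein data package, `genomic.fna` would be expected to show as missing.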

Virus Data Report

The virus data report contains metadata describing the genomes and proteins in the data package. The file is in JSON Lines format, where each line holds the metadata for one genome or one protein. Use the dataformat tool to convert selected fields to a tabular format.
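
Because each line of a JSON Lines file is an independent JSON object, the report can also be read with a few lines of Python. The sketch below is not the dataformat tool; it only illustrates the line-per-record layout. The `jsonl_to_tsv` helper and the field names in the demo records are hypothetical and do not reflect the actual report schema.

```python
import csv
import io
import json


def jsonl_to_tsv(lines, fields):
    """Convert JSON Lines records to TSV text, keeping only the given top-level fields."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Missing fields become empty cells.
        writer.writerow([record.get(field, "") for field in fields])
    return out.getvalue()


# Hypothetical records; real field names come from data_report.jsonl itself.
demo = [
    '{"id": "record-1", "note": "first"}',
    '{"id": "record-2"}',
]
print(jsonl_to_tsv(demo, ["id", "note"]))
```

To run it on a real package, pass an open file handle for `ncbi_dataset/data/data_report.jsonl` in place of `demo`, together with field names found in that file.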

FASTA Sequence Files

Genomic FASTA

Nucleotide sequence of the viral GenBank genome.

Example FASTA Defline:

>MW583405.1 Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/TX-CDC-9N37-8996/2021, complete genome

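A FASTA file pairs each defline (the line starting with `>`) with the sequence lines that follow it. As a minimal sketch of reading such a file without third-party libraries, the hypothetical `read_fasta` helper below yields one (defline, sequence) pair per record. The sequence in the demo is a short placeholder, not real genome data.

```python
def read_fasta(lines):
    """Yield (defline, sequence) pairs from an iterable of FASTA lines."""
    defline, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if defline is not None:
                yield defline, "".join(chunks)
            defline, chunks = line[1:], []
        elif line and defline is not None:
            chunks.append(line)
    if defline is not None:
        yield defline, "".join(chunks)


# Defline from the example above, with a placeholder sequence split over two lines.
demo = [
    ">MW583405.1 Severe acute respiratory syndrome coronavirus 2 isolate "
    "SARS-CoV-2/human/USA/TX-CDC-9N37-8996/2021, complete genome",
    "ACGT",
    "ACGT",
]
for defline, sequence in read_fasta(demo):
    print(defline.split()[0], len(sequence))
```

The same function accepts an open file handle, for example one opened on `ncbi_dataset/data/genomic.fna`.
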
CDS FASTA

Nucleotide sequence for the coding sequence of each protein and mature peptide.

Example FASTA Defline:

>NC_045512.2:21563-25384 surface glycoprotein [organism=Severe acute respiratory syndrome coronavirus 2] [isolate=Wuhan-Hu-1]

Protein FASTA

Protein sequences for each protein and mature peptide.

Example FASTA Defline:

>QMT27626.1:1-180 leader protein [polyprotein=ORF1ab polyprotein] [organism=Severe acute respiratory syndrome coronavirus 2] [isolate=SARS-CoV-2/human/USA/WA-S1488/2020]

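The example deflines above share a common shape: a sequence identifier, optionally followed by `:start-end`, then a title, then zero or more `[key=value]` modifiers. The sketch below splits a defline into those parts. It assumes only the forms shown in the examples; the `parse_defline` helper is hypothetical, and identifiers with other location syntax are returned with `coords` set to `None`.

```python
import re


def parse_defline(defline):
    """Split a defline into accession, optional (start, end), title, and [key=value] modifiers."""
    text = defline.lstrip(">").strip()
    seq_id, _, rest = text.partition(" ")
    accession, _, span = seq_id.partition(":")
    coords = None
    match = re.fullmatch(r"(\d+)-(\d+)", span)
    if match:
        coords = (int(match.group(1)), int(match.group(2)))
    # Collect bracketed modifiers such as [organism=...] into a dict.
    modifiers = dict(re.findall(r"\[([^=\]]+)=([^\]]*)\]", rest))
    # The title is whatever remains once the bracketed modifiers are removed.
    title = re.sub(r"\s*\[[^\]]*\]", "", rest).strip()
    return {
        "accession": accession,
        "coords": coords,
        "title": title,
        "modifiers": modifiers,
    }


cds = parse_defline(
    ">NC_045512.2:21563-25384 surface glycoprotein "
    "[organism=Severe acute respiratory syndrome coronavirus 2] [isolate=Wuhan-Hu-1]"
)
print(cds["accession"], cds["coords"], cds["title"], cds["modifiers"]["isolate"])
```

Keeping the modifiers in a dictionary makes it straightforward to group CDS or protein records by isolate or by parent polyprotein.
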
Sequence Annotation Files

NCBI GenBank Flat File (GBFF)

The GBFF file contains the genomic sequence of each genome and describes the genes and other features annotated on it.

NCBI GenPept Format (GPFF)

The GPFF file contains the sequence of each protein and describes the features annotated on it.

Additional Files

PDB protein structures

Files in PDB format, each describing a protein structure.

Virus README

The virus README describes the available SARS-CoV-2 data packages, their content and options for querying.

  • Path: ncbi_dataset/data/virus_dataset.md

README.md

The README contains a general project description common to all data packages.

  • Path: README.md

Dataset catalog

The dataset catalog lists each data file contained within or referenced by this package. Each data file is associated with a content type and location.

  • Path: ncbi_dataset/data/dataset_catalog.json

Generated November 19, 2021