NCBI Datasets Genome Package

Sequences, annotation and metadata for a set of requested genome assemblies

The NCBI Datasets Genome Data Package contains sequences, annotation and metadata for a set of requested genomes. The data package may include genome, transcript and protein sequences in FASTA format, annotation in GFF3, GTF, and GBFF formats, data reports containing metadata in JSON Lines format, and a subset of metadata in tabular format.

Package Content

This example of GRCh38 (GCF_000001405.39) illustrates a typical genome data package.

GRCh38
|-- README.md
`-- ncbi_dataset
    `-- data
        |-- GCF_000001405.39
        |   |-- genomic.gff
        |   |-- chr*.fna
        |   |-- protein.faa
        |   |-- rna.fna
        |   |-- sequence_report.jsonl
        |   `-- unplaced.scaf.fna
        |-- assembly_data_report.jsonl
        `-- dataset_catalog.json

Genome data report

The genome data report contains metadata describing the genomes in the data package. The file is in JSON Lines format, where each line is the metadata for one genome. Use the dataformat tool to convert selected fields to a tabular format.

  • Path: ncbi_dataset/data/assembly_data_report.jsonl
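
Because each line of the report is a standalone JSON object, it can be read one record at a time. The following is a minimal sketch using only the Python standard library; the helper name read_jsonl is hypothetical, and the sketch makes no assumption about which fields a record contains.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)
```

Calling read_jsonl("ncbi_dataset/data/assembly_data_report.jsonl") on an unpacked package should yield one dictionary per genome. Printing sorted(record) for the first record is a quick way to see which top-level fields are present in a given download.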

Genome specific files

Each genome is placed in its own subdirectory inside the data package. The directory name is the assembly accession, for example GCF_000001405.39.
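
Since every genome has its own subdirectory, the assemblies in an unpacked package can be found by listing the directories under ncbi_dataset/data. This is a small sketch under that assumption; list_genome_dirs is a hypothetical helper name, and package_root is wherever the package was extracted.

```python
from pathlib import Path
from typing import List


def list_genome_dirs(package_root: str) -> List[str]:
    """Return the names of the per-genome subdirectories, sorted."""
    data_dir = Path(package_root) / "ncbi_dataset" / "data"
    # Regular files such as the data report live alongside the
    # genome directories, so keep directories only.
    return sorted(p.name for p in data_dir.iterdir() if p.is_dir())
```

For the example package above, this would be expected to return a list containing GCF_000001405.39.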

Sequence data report

The sequence data report describes all nucleotide sequences that make up the genome assembly. The file is in JSON Lines format, where each line describes one nucleotide sequence. Use the dataformat tool to convert selected fields to a tabular format.

  • Path: ncbi_dataset/data/<assembly_accession>/sequence_report.jsonl
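
To illustrate what converting selected fields to a table involves, the sketch below flattens nested JSON records into dotted column names and writes the chosen columns as tab-separated text. It is not the dataformat tool and does not reproduce its options or column names; flatten and write_table are hypothetical helpers.

```python
import csv
import sys
from typing import Iterable, List


def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested objects into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def write_table(records: Iterable[dict], fields: List[str], out=sys.stdout) -> None:
    """Write the chosen flattened fields as tab-separated rows."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    for record in records:
        flat = flatten(record)
        # Fields absent from a record are written as empty cells.
        writer.writerow([flat.get(field, "") for field in fields])
```

The records argument can be any iterable of dictionaries parsed from the report, one per line. Lists inside a record are left as they are, so this approach suits simple scalar fields best.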

Genome data table

The genome data table is a tabular representation of a subset of the metadata in the genome data report. It is provided only through the NCBI Datasets Genomes website. Each row of the data table represents one genome in the data package.

The columns of the data table are Organism Scientific Name, Organism Common Name, Organism Qualifier, Taxonomy id, Assembly Name, Assembly Accession, Source, Annotation, Level, Contig N50, Size, Submission Date, Gene Count, BioProject and BioSample.

  • Path: ncbi_dataset/data/data_summary.tsv
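
A tab-separated table like this one can be loaded with the Python standard library. The sketch below assumes the first line of the file is a header row; whether the header labels match the column names listed above exactly is also an assumption, so inspect the keys of the first row before relying on them. read_summary is a hypothetical helper name.

```python
import csv
from typing import Dict, List


def read_summary(path: str) -> List[Dict[str, str]]:
    """Return the rows of a tab-separated file as dictionaries keyed by header."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

Each returned dictionary corresponds to one genome, with every value kept as a string.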

FASTA sequence Files

Genomic FASTA

Assembled chromosomes, unlocalized sequences for which the chromosome is known, and unplaced sequences for which the chromosome is unknown, are contained in separate FASTA files.

Example FASTA Defline:

>NC_000023.11 Homo sapiens chromosome X, GRCh38.p13 Primary Assembly

Transcript FASTA

Example FASTA Defline:

>NM_000014.6 Homo sapiens alpha-2-macroglobulin (A2M), transcript variant 1, mRNA

Protein FASTA
  • Path: ncbi_dataset/data/<assembly_accession>/protein.faa
  • Schema: Protein FASTA

Example FASTA Defline:

>NP_000005.3 alpha-2-macroglobulin isoform a precursor [Homo sapiens]
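
In each of the example deflines, the text up to the first space is the sequence accession and the rest is a free-text description. A minimal sketch of splitting a defline on that basis follows; Defline and parse_defline are hypothetical names, and the sketch does not try to interpret the description.

```python
from typing import NamedTuple


class Defline(NamedTuple):
    accession: str
    description: str


def parse_defline(line: str) -> Defline:
    """Split a FASTA defline into its accession and description."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA defline: " + line[:20])
    # Drop the leading '>' and surrounding whitespace, then split
    # on the first space only.
    accession, _, description = line[1:].strip().partition(" ")
    return Defline(accession, description.strip())
```

Applied to the protein example above, this gives the accession NP_000005.3 and the description "alpha-2-macroglobulin isoform a precursor [Homo sapiens]".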

Annotation Files

Genome GFF3

The genome annotation file in GFF3 format describes genes and other features annotated on each genome.

  • Path: ncbi_dataset/data/<assembly_accession>/genomic.gff
  • Schema: Genome GFF3
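
GFF3 is a nine-column, tab-separated format in which the last column holds key=value attributes separated by semicolons. The sketch below parses a single feature line on that basis. Feature and parse_gff3_line are hypothetical names, and details such as multi-valued attributes and embedded FASTA sections are deliberately left out.

```python
from typing import Dict, NamedTuple, Optional
from urllib.parse import unquote


class Feature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Optional[Feature]:
    """Parse one GFF3 feature line; return None for comments and blanks."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    attributes = {}
    for pair in cols[8].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 percent-encodes reserved characters in attributes.
            attributes[unquote(key)] = unquote(value)
    return Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                   cols[5], cols[6], cols[7], attributes)
```

Score, strand and phase are kept as strings because GFF3 uses "." for undefined values in those columns.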

Genome GBFF

The genome sequence and annotation file in GBFF format includes genomic sequence and describes genes and other features annotated on each genome.

Genome GTF

The genome annotation file in GTF format describes genes and other features annotated on each genome.

  • Path: ncbi_dataset/data/<assembly_accession>/genomic.gtf
  • Schema: Genome GTF

README.md

The README contains a general project description common to all data packages.

  • Path: README.md

Dataset catalog

The dataset catalog lists each data file contained within or referenced by the package. Each data file is associated with a content type and location.

  • Path: ncbi_dataset/data/dataset_catalog.json
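
Since the catalog associates each file with a content type and a location, it can be used to enumerate the files in a package. The sketch below walks a parsed catalog and collects those pairs. The key names filePath and fileType are assumptions that this page does not confirm, so check them against your own dataset_catalog.json; iter_files is a hypothetical helper.

```python
from typing import Iterator, Tuple

# Assumed key names; confirm against your own dataset_catalog.json.
PATH_KEY = "filePath"
TYPE_KEY = "fileType"


def iter_files(node) -> Iterator[Tuple[str, str]]:
    """Yield (location, content type) for every file entry in the catalog."""
    if isinstance(node, dict):
        if PATH_KEY in node:
            yield node[PATH_KEY], node.get(TYPE_KEY, "")
        # Keep descending, since file entries may be nested at any depth.
        for value in node.values():
            yield from iter_files(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_files(item)
```

The function takes the object returned by json.load on the catalog file. Walking the whole structure avoids depending on how the file entries are grouped.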

Generated October 22, 2021