NCBI Datasets Gene Package

Sequences and metadata for a set of requested genes

The NCBI Datasets Gene Data Package contains sequences and metadata for a set of requested genes. The data package may include gene, transcript and protein sequences in FASTA format, data reports containing metadata in JSON Lines format, and a subset of metadata in tabular format. There are two types of gene data package: eukaryotic and prokaryotic. The differences between the two types are described below.

Package content

NCBI Datasets Eukaryotic Gene Data Package

This example of Human BRCA1 (GeneID: 672) illustrates a typical eukaryotic gene data package.

human-brca1
|-- README.md
`-- ncbi_dataset
    `-- data
        |-- data_report.jsonl
        |-- data_table.tsv
        |-- dataset_catalog.json
        |-- gene.fna
        |-- protein.faa
        `-- rna.fna

NCBI Datasets Prokaryotic Gene Data Package

This example of E. coli restriction endonuclease (WP_000769114.1) illustrates a typical prokaryotic gene data package.

endonuclease
|-- README.md
`-- ncbi_dataset
    `-- data
        |-- annotation_report.jsonl
        |-- data_report.jsonl
        |-- dataset_catalog.json
        |-- gene.fna
        `-- protein.faa
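The two listings above differ mainly in which report files are present. As a rough sketch, the package type can be guessed from those files, relying on the statements later in this page that the annotation report is only provided for prokaryotic genes and the data table only for eukaryotic genes. The function name `guess_package_type` is hypothetical, and the code assumes the package has already been extracted to a directory laid out as shown above.

```python
import tempfile
from pathlib import Path


def guess_package_type(package_root) -> str:
    """Guess the package type from the files under ncbi_dataset/data."""
    data_dir = Path(package_root) / "ncbi_dataset" / "data"
    names = {p.name for p in data_dir.iterdir() if p.is_file()}
    # The annotation report is only provided for prokaryotic genes.
    if "annotation_report.jsonl" in names:
        return "prokaryotic"
    # The data table is only provided for eukaryotic genes.
    if "data_table.tsv" in names:
        return "eukaryotic"
    return "unknown"


# Demo on an empty stand-in for the "endonuclease" example package.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "endonuclease"
    data_dir = root / "ncbi_dataset" / "data"
    data_dir.mkdir(parents=True)
    for name in ("annotation_report.jsonl", "data_report.jsonl",
                 "dataset_catalog.json"):
        (data_dir / name).touch()
    demo_type = guess_package_type(root)
    print(demo_type)
```

This is a heuristic based only on the example layouts; a package that omits both files is reported as "unknown".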

Gene data report

The gene data report contains metadata describing the genes in the data package. The file is in JSON Lines format, where each line is the metadata for one gene. Use the dataformat tool to convert selected fields to a tabular format. The content of the gene data report differs between the eukaryotic and prokaryotic data packages. For details, see the schemas below.
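Because each line of the report is a standalone JSON document, it can be read one gene at a time without loading the whole file. The following is a minimal sketch, not the dataformat tool itself. The helper name `iter_gene_records` and the field names `geneId` and `symbol` in the sample are hypothetical placeholders; consult the schemas for the actual field names.

```python
import json
from typing import Iterable, Iterator


def iter_gene_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSON Lines row."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# Made-up sample standing in for data_report.jsonl; the keys are
# placeholders, not the documented schema.
sample = [
    '{"geneId": "672", "symbol": "BRCA1"}',
    '',
    '{"geneId": "59067", "symbol": "IL21"}',
]
records = list(iter_gene_records(sample))
print(len(records))
```

With an extracted package, the same function accepts an open file handle, for example `open("ncbi_dataset/data/data_report.jsonl")`, since iterating a text file yields its lines.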

Eukaryotic gene data report (Gene Report)

Prokaryotic gene data report (Prokaryotic gene report)

Gene annotation report

The gene annotation report contains metadata describing the annotated locations of the genes in the data package and is only provided for prokaryotic genes. The file is in JSON Lines format, where each line is the metadata for one gene. Use the dataformat tool for easy conversion to a tabular format of selected fields.

Gene data table

The gene data table is a tabular representation of a subset of the metadata in the gene data report and is only provided for eukaryotic genes. Each row of the data table represents one transcript of a gene in the data package.

The columns of the data table are Gene ID, Symbol, Gene name, Gene type, Scientific name, Transcripts, and Query.
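Since the table has one row per transcript, rows sharing a Gene ID can be grouped to list the transcripts of each gene. The sketch below assumes the file is tab-separated with a header row that uses exactly the column labels listed above; the function name `transcripts_by_gene` is hypothetical, and the sample rows (including the `NM_EXAMPLE.1` accession and the gene type value) are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def transcripts_by_gene(tsv_file) -> dict:
    """Group the Transcripts column by Gene ID (one row per transcript)."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["Gene ID"]].append(row["Transcripts"])
    return dict(grouped)


# Made-up sample standing in for data_table.tsv.
sample_tsv = io.StringIO(
    "Gene ID\tSymbol\tGene name\tGene type\tScientific name\tTranscripts\tQuery\n"
    "59067\tIL21\tinterleukin 21\tprotein-coding\tHomo sapiens\tNM_021803.4\t59067\n"
    "59067\tIL21\tinterleukin 21\tprotein-coding\tHomo sapiens\tNM_EXAMPLE.1\t59067\n"
)
grouped = transcripts_by_gene(sample_tsv)
print(grouped)
```

If the header labels in a real package differ from those assumed here, the dictionary keys in the function need to be adjusted to match.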

FASTA sequence files

You can request up to three FASTA sequence files: gene, transcript, and protein.

Gene FASTA

Example FASTA Defline:

>NC_000004.12:c122621066-122610108 IL21 [organism=Homo sapiens] [GeneID=59067] [chromosome=4]

Transcript FASTA

Example FASTA Defline:

>NM_021803.4 IL21 [organism=Homo sapiens] [GeneID=59067] [transcript=1]

Protein FASTA

Example FASTA Defline:

>NP_001193935.1 IL21 [organism=Homo sapiens] [GeneID=59067] [isoform=2 precursor]
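The three example deflines share a pattern: a sequence identifier, a gene symbol, and a series of bracketed `key=value` attributes. As an illustration, the sketch below splits a defline into those parts. It is inferred only from the examples on this page, not from a formal specification, and the function name `parse_defline` is hypothetical.

```python
import re

# Matches bracketed attributes such as [organism=Homo sapiens].
ATTR_RE = re.compile(r"\[([^=\]]+)=([^\]]*)\]")


def parse_defline(defline: str) -> dict:
    """Split a defline into sequence ID, gene symbol, and attributes."""
    text = defline.lstrip(">").strip()
    # Everything before the first '[' holds the identifier and symbol.
    head, _, _ = text.partition("[")
    parts = head.split()
    if not parts:
        raise ValueError("defline has no sequence identifier")
    return {
        "id": parts[0],
        "symbol": parts[1] if len(parts) > 1 else None,
        "attributes": dict(ATTR_RE.findall(text)),
    }


gene = parse_defline(
    ">NC_000004.12:c122621066-122610108 IL21 "
    "[organism=Homo sapiens] [GeneID=59067] [chromosome=4]"
)
protein = parse_defline(
    ">NP_001193935.1 IL21 [organism=Homo sapiens] "
    "[GeneID=59067] [isoform=2 precursor]"
)
print(gene["id"], protein["attributes"]["isoform"])
```

Attribute values are kept as strings, so values containing spaces, such as `2 precursor`, are preserved intact.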

README.md

The README contains a general project description common to all data packages.

  • Path: README.md

Dataset catalog

The dataset catalog lists each data file contained within or referenced by the package. Each data file is associated with a content type and location.

  • Path: ncbi_dataset/data/dataset_catalog.json

Retrieve a gene package using one of these tools:

Generated November 19, 2021