Gene report

Gene record metadata

The downloaded gene package contains a gene data report in JSON Lines format in the file:

ncbi_dataset/data/data_report.jsonl

Each line of the gene data report file is a hierarchical JSON object that represents a single gene record. The tables below define the schema of the gene record: each row describes either a single field in the report or a sub-structure, which is a collection of fields. The outermost structure of the report is GeneDescriptor.
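
Because each line is a self-contained JSON object, the report can be read one record at a time. The following is a minimal sketch, not an official client: the function name `iter_gene_records` is hypothetical, and the two abbreviated in-memory lines stand in for the contents of a real report.

```python
import json
from typing import Iterable, Iterator


def iter_gene_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. a trailing newline
            yield json.loads(line)


# Two abbreviated records standing in for the real file's contents.
sample_lines = [
    '{"geneId": "1", "symbol": "A1BG", "taxId": "9606"}',
    '{"geneId": "2778", "symbol": "GNAS", "taxId": "9606"}',
]
symbols = [record["symbol"] for record in iter_gene_records(sample_lines)]
print(symbols)

# With a downloaded package, pass the open file instead:
# with open("ncbi_dataset/data/data_report.jsonl", encoding="utf-8") as fh:
#     for record in iter_gene_records(fh):
#         ...
```

Streaming line by line keeps memory use flat even for reports that contain many gene records.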

Fields that list a Table Field Mnemonic can be passed to the dataformat command-line tool's --fields option. See the dataformat CLI tool reference for how to use the tool to transform gene data reports from JSON Lines to tabular formats.

Sample report

{
  "annotations": [
    {
      "annotationName": "GCF_000001405.40-RS_2023_10",
      "annotationReleaseDate": "2023-10-02",
      "assemblyAccession": "GCF_000001405.40",
      "assemblyName": "GRCh38.p14",
      "genomicLocations": [
        {
          "genomicAccessionVersion": "NC_000019.10",
          "genomicRange": {
            "begin": "58345183",
            "end": "58353492",
            "orientation": "minus"
          },
          "sequenceName": "19"
        }
      ]
    },
    {
      "annotationName": "GCF_009914755.1-RS_2023_10",
      "annotationReleaseDate": "2023-10-02",
      "assemblyAccession": "GCF_009914755.1",
      "assemblyName": "T2T-CHM13v2.0",
      "genomicLocations": [
        {
          "genomicAccessionVersion": "NC_060943.1",
          "genomicRange": {
            "begin": "61441599",
            "end": "61449907",
            "orientation": "minus"
          },
          "sequenceName": "19"
        }
      ]
    }
  ],
  "chromosomes": [
    "19"
  ],
  "commonName": "human",
  "description": "alpha-1-B glycoprotein",
  "ensemblGeneIds": [
    "ENSG00000121410"
  ],
  "geneGroups": [
    {
      "id": "1",
      "method": "NCBI Ortholog"
    }
  ],
  "geneId": "1",
  "nomenclatureAuthority": {
    "authority": "HGNC",
    "identifier": "HGNC:5"
  },
  "omimIds": [
    "138670"
  ],
  "orientation": "minus",
  "proteinCount": 1,
  "swissProtAccessions": [
    "P04217"
  ],
  "symbol": "A1BG",
  "synonyms": [
    "A1B",
    "ABG",
    "GAB",
    "HYST2477"
  ],
  "taxId": "9606",
  "taxname": "Homo sapiens",
  "transcriptCount": 1,
  "transcriptTypeCounts": [
    {
      "count": 1,
      "type": "PROTEIN_CODING"
    }
  ],
  "type": "PROTEIN_CODING"
}
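
The nesting in the sample (annotations, then genomicLocations, then genomicRange) can be walked as in the sketch below. The record here is a trimmed copy of the sample report, and `gene_locations` is a hypothetical helper. Note that in the sample the `uint64` values (`geneId`, `begin`, `end`) appear as JSON strings while the `uint32` counts appear as JSON numbers, so the coordinates are converted with `int()`. The sketch assumes every genomic location carries a `genomicRange`.

```python
import json

# Trimmed copy of the sample report: one annotation, one genomic location.
record = json.loads("""
{
  "symbol": "A1BG",
  "annotations": [
    {
      "assemblyName": "GRCh38.p14",
      "genomicLocations": [
        {
          "genomicAccessionVersion": "NC_000019.10",
          "genomicRange": {
            "begin": "58345183",
            "end": "58353492",
            "orientation": "minus"
          },
          "sequenceName": "19"
        }
      ]
    }
  ]
}
""")


def gene_locations(record):
    """Return (assembly, accession, begin, end, orientation) tuples."""
    rows = []
    for annotation in record.get("annotations", []):
        for location in annotation.get("genomicLocations", []):
            rng = location["genomicRange"]  # assumed present on each location
            rows.append((
                annotation["assemblyName"],
                location["genomicAccessionVersion"],
                int(rng["begin"]),  # coordinates are JSON strings in the sample
                int(rng["end"]),
                rng["orientation"],
            ))
    return rows


locations = gene_locations(record)
print(locations)
```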

GeneDescriptor Structure

Field	Table Field Mnemonic	Table Column Name	Type	Description	Examples
`geneId`	`gene-id`	NCBI GeneID	`uint64`	NCBI Gene ID	`2778`
`symbol`	`symbol`	Symbol	`string`	Gene symbol	`GNAS`
`description`	`description`	Description	`string`	Gene name	`GNAS complex locus`
`taxId`	`tax-id`	Taxonomic ID	`uint64`	NCBI Taxonomy ID for the organism	`9606`
`taxname`	`tax-name`	Taxonomic Name	`string`	Taxonomic name of the organism	`Homo sapiens`
`commonName`	`common-name`	Common Name	`string`	Common name of the organism	`human`
`type`	`gene-type`	Gene Type	`GeneType`
`rnaType`	`rna-type`	RNA Type	`RnaType`
`orientation`	`orientation`	Orientation	`Orientation`
`referenceStandards repeated`	`ref-standard-`	Reference Standard	`GenomicRegion`	Clinical reference standard NG
`genomicRegions repeated`	`genomic-region-`	Genomic Region	`GenomicRegion`	Pseudogene, non-genic regulatory element and other genomic region NG
`chromosomes repeated`	`chromosomes`	Chromosomes	`string`		`1` `X,Y`
`nomenclatureAuthority`	`name-`	Nomenclature	`NomenclatureAuthority`
`swissProtAccessions repeated`	`swissprot-accessions`	SwissProt Accessions	`string`
`ensemblGeneIds repeated`	`ensembl-geneids`	Ensembl GeneIDs	`string`
`omimIds repeated`	`omim-ids`	OMIM IDs	`string`
`synonyms repeated`	`synonyms`	Synonyms	`string`
`replacedGeneId`	`replaced-gene-id`	Replaced NCBI GeneID	`uint64`	The NCBI Gene ID for the gene that was merged into the current gene record
`annotations repeated`	`annotation-`	Annotation	`Annotation`
`transcriptCount`	`transcript-count`	Transcripts	`uint32`
`proteinCount`	`protein-count`	Proteins	`uint32`
`transcriptTypeCounts repeated`			`TranscriptTypeCount`
`geneGroups repeated`	`group-`	Gene Group	`GeneGroup`

Annotation Structure

Field	Table Field Mnemonic	Table Column Name	Type
`assemblyAccession`	`assembly-accession`	Assembly Accession	`string`
`assemblyName`	`assembly-name`	Assembly Name	`string`
`annotationName`	`release-name`	Release Name	`string`
`annotationReleaseDate`	`release-date`	Release Date	`string`
`genomicLocations repeated`	`genomic-range-`	Genomic Range	`GenomicLocation`

GeneGroup Structure

Field	Table Field Mnemonic	Table Column Name	Type	Description	Examples
`id`	`id`	Identifier	`string`
`method`	`method`	Method	`string`

GenomicLocation Structure

Field	Table Field Mnemonic	Table Column Name	Type
`genomicAccessionVersion`	`accession`	Accession	`string`
`sequenceName`	`seq-name`	Seq Name	`string`
`genomicRange`	`range-`		`Range`
`exons repeated`	`exon-`	Exons	`Range`

GenomicRegion Structure

Field	Table Field Mnemonic	Table Column Name	Type	Description	Examples
`geneRange`	`gene-range-`	Gene Range	`SeqRangeSet`	The range of this Gene record on this genomic region.
`type`	`genomic-region-type`	Genomic Region Type	`GenomicRegion.GenomicRegionType`

NomenclatureAuthority Structure

Field	Table Field Mnemonic	Table Column Name	Type	Description	Examples
`authority`	`authority`	Authority	`string`	The nomenclature authority for this gene record	`HGNC`
`identifier`	`id`	ID	`string`	The nomenclature authority identifier for this gene record	`HGNC:4392`

Range Structure

A 1-based range on a sequence record.

Field	Table Field Mnemonic	Table Column Name	Type	Description
`begin`	`start`	Start	`uint64`
`end`	`stop`	Stop	`uint64`
`orientation`	`orientation`	Orientation	`Orientation`
`order`	`order`	Order	`uint32`
`ribosomalSlippage`	coming soon	coming soon	`int32`	Where ribosomal slippage applies, the slippage amount between this range and the previous range.
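
As a small worked example of the coordinate convention, the sketch below computes the length of a Range. It rests on two assumptions that the table does not state: that `end` is inclusive, and that `begin` is never greater than `end` whatever the orientation (as in the sample report, where a minus-strand gene has `begin` below `end`). The helper name `range_length` is hypothetical.

```python
def range_length(rng: dict) -> int:
    """Length in bases of a Range, assuming 1-based coordinates with inclusive end."""
    begin, end = int(rng["begin"]), int(rng["end"])
    if end < begin:
        # Assumption: begin <= end regardless of orientation.
        raise ValueError(f"end ({end}) precedes begin ({begin})")
    return end - begin + 1


# The two genomicRange objects from the sample report.
grch38 = {"begin": "58345183", "end": "58353492", "orientation": "minus"}
t2t = {"begin": "61441599", "end": "61449907", "orientation": "minus"}
lengths = (range_length(grch38), range_length(t2t))
print(lengths)
```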

SeqRangeSet Structure

Field	Table Field Mnemonic	Table Column Name	Type	Description	Examples
`accessionVersion`	`accession`	Sequence Accession	`string`	NCBI Accession.version of the sequence
`range repeated`	`range-`		`Range`	Series of intervals on the sequence identified by `accessionVersion`

TranscriptTypeCount Structure

Field	Table Field Mnemonic	Table Column Name	Type	Description	Examples
`type`			`Transcript.TranscriptType`
`count`	coming soon	coming soon	`uint32`

GeneType Enumeration

NB: GeneType values match Entrez Gene

Name	Number	Description
`UNKNOWN`	`0`
`tRNA`	`1`
`rRNA`	`2`
`snRNA`	`3`
`scRNA`	`4`
`snoRNA`	`5`
`PROTEIN_CODING`	`6`
`PSEUDO`	`7`	these will have NG or NR
`TRANSPOSON`	`8`
`miscRNA`	`9`
`ncRNA`	`10`
`BIOLOGICAL_REGION`	`11`	these will have NG
`OTHER`	`255`
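
In the sample report the gene type is serialized by name (`"type": "PROTEIN_CODING"`), so client code may want to map names to the numbers in the table above. The following sketch does that with a Python `IntEnum`. Treating a missing `type` field as `UNKNOWN` is an assumption, and `parse_gene_type` is a hypothetical helper.

```python
from enum import IntEnum

# Names and numbers copied from the GeneType table.
GeneType = IntEnum("GeneType", {
    "UNKNOWN": 0,
    "tRNA": 1,
    "rRNA": 2,
    "snRNA": 3,
    "scRNA": 4,
    "snoRNA": 5,
    "PROTEIN_CODING": 6,
    "PSEUDO": 7,
    "TRANSPOSON": 8,
    "miscRNA": 9,
    "ncRNA": 10,
    "BIOLOGICAL_REGION": 11,
    "OTHER": 255,
})


def parse_gene_type(record: dict) -> GeneType:
    """Look up the record's `type` string by enum name.

    Assumption: a record without a `type` field is treated as UNKNOWN.
    """
    return GeneType[record.get("type", "UNKNOWN")]


gene_type = parse_gene_type({"symbol": "A1BG", "type": "PROTEIN_CODING"})
print(gene_type.name, int(gene_type))
```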

GenomicRegion.GenomicRegionType Enumeration

Name	Number	Description
`UNKNOWN`	`0`
`REFSEQ_GENE`	`1`
`PSEUDOGENE`	`2`
`BIOLOGICAL_REGION`	`3`
`OTHER`	`4`

Orientation Enumeration

Name	Number	Description
`none`	`0`
`plus`	`1`
`minus`	`2`

RnaType Enumeration

Name	Number	Description
`rna_UNKNOWN`	`0`
`premsg`	`1`
`tmRna`	`2`

Transcript.TranscriptType Enumeration

Name	Number	Description
`UNKNOWN`	`0`
`PROTEIN_CODING`	`1`
`NON_CODING`	`2`
`PROTEIN_CODING_MODEL`	`3`
`NON_CODING_MODEL`	`4`

Scalar Value Types

Protocol buffers type	Notes	C++	Python	Java	Go
`double`		`double`	`float`	`double`	`float64`
`float`		`float`	`float`	`float`	`float32`
`int32`	Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead.	`int32`	`int`	`int`	`int32`
`int64`	Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead.	`int64`	`int/long`	`long`	`int64`
`uint32`	Uses variable-length encoding.	`uint32`	`int/long`	`int`	`uint32`
`uint64`	Uses variable-length encoding.	`uint64`	`int/long`	`long`	`uint64`
`sint32`	Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s.	`int32`	`int`	`int`	`int32`
`sint64`	Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s.	`int64`	`int/long`	`long`	`int64`
`fixed32`	Always four bytes. More efficient than uint32 if values are often greater than 2^28.	`uint32`	`int`	`int`	`uint32`
`fixed64`	Always eight bytes. More efficient than uint64 if values are often greater than 2^56.	`uint64`	`int/long`	`long`	`uint64`
`sfixed32`	Always four bytes.	`int32`	`int`	`int`	`int32`
`sfixed64`	Always eight bytes.	`int64`	`int/long`	`long`	`int64`
`bool`		`bool`	`bool`	`boolean`	`bool`
`string`	A string must always contain UTF-8 encoded or 7-bit ASCII text.	`string`	`str/unicode`	`String`	`string`
`bytes`	May contain any arbitrary sequence of bytes.	`string`	`str`	`ByteString`	`[]byte`

Generated May 21, 2024