Genome assembly report

Genome record accession, organism, assembly statistics, and annotation info

The downloaded genome package contains a genome assembly data report in JSON Lines format in the file:

ncbi_dataset/data/data_report.jsonl

Each line of the genome assembly data report file is a hierarchical JSON object that represents a single genome assembly record. The schema of the genome assembly record is defined in the tables below where each row describes a single field in the report or a sub-structure, which is a collection of fields. The outermost structure of the report is AssemblyDataReport.
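Because every line is an independent JSON object, the report can be streamed one record at a time. The following is a minimal sketch, assuming the package has been unzipped into the current directory; the helper names `parse_report_lines` and `read_data_report` are hypothetical and not part of any NCBI tool.

```python
import json
from typing import Iterable, Iterator


def parse_report_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one assembly record (a dict) per non-empty JSON Lines row."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines such as a trailing newline
            yield json.loads(line)


def read_data_report(path: str = "ncbi_dataset/data/data_report.jsonl") -> Iterator[dict]:
    """Stream records from the data report of an unpacked genome package."""
    with open(path, encoding="utf-8") as handle:
        yield from parse_report_lines(handle)


if __name__ == "__main__":
    # Two abbreviated, illustrative records standing in for a real file.
    demo = [
        '{"organismName": "Homo sapiens", "taxId": 9606}',
        '{"organismName": "Canis lupus familiaris"}',
    ]
    for record in parse_report_lines(demo):
        # taxId is read with .get() because a record may omit fields.
        print(record["organismName"], record.get("taxId"))
```

Streaming with a generator keeps memory use flat even when a package holds many assemblies.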

Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option. Refer to the dataformat CLI tool reference to see how to use this tool to transform assembly data reports from JSON Lines to tabular formats.

Sample report

{
  "annotationInfo": {
    "busco": {
      "buscoLineage": "primates_odb10",
      "buscoVer": "4.0.2",
      "complete": 0.9920174,
      "duplicated": 0.0066763423,
      "fragmented": 0.0017416546,
      "missing": 0.006240929,
      "singleCopy": 0.9853411,
      "totalCount": "13780"
    },
    "name": "NCBI Annotation Release 109.20210514",
    "releaseDate": "2021-05-14",
    "reportUrl": "https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/109.20210514/",
    "source": "NCBI",
    "stats": {
      "geneCounts": {
        "nonCoding": 18201,
        "other": 411,
        "proteinCoding": 19527,
        "pseudogene": 16446,
        "total": 54585
      }
    }
  },
  "assemblyInfo": {
    "assemblyAccession": "GCF_000001405.39",
    "assemblyLevel": "Chromosome",
    "assemblyName": "GRCh38.p13",
    "assemblyType": "haploid-with-alt-loci",
    "bioprojectLineage": [
      {
        "bioprojects": [
          {
            "accession": "PRJNA31257",
            "title": "The Human Genome Project, currently maintained by the Genome Reference Consortium (GRC)"
          }
        ]
      }
    ],
    "blastUrl": "https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&PROG_DEF=blastn&BLAST_SPEC=GDH_GCF_000001405.39",
    "description": "Genome Reference Consortium Human Build 38 patch release 13 (GRCh38.p13)",
    "genbankAssmAccession": "GCA_000001405.28",
    "pairedAssemblyAccession": "GCA_000001405.28",
    "refseqAssmAccession": "GCF_000001405.39",
    "refseqCategory": "reference genome",
    "submissionDate": "2019-02-28",
    "submitter": "Genome Reference Consortium"
  },
  "assemblyStats": {
    "contigL50": 18,
    "contigN50": 57879411,
    "gapsBetweenScaffoldsCount": 349,
    "gcCount": "1853979372",
    "numberOfComponentSequences": 35613,
    "numberOfContigs": 998,
    "numberOfScaffolds": 472,
    "scaffoldL50": 16,
    "scaffoldN50": 67794873,
    "totalNumberOfChromosomes": 24,
    "totalSequenceLength": "3099706404",
    "totalUngappedLength": "2948583725"
  },
  "commonName": "human",
  "organelleInfo": [
    {
      "assemblyName": "GRCh38.p13",
      "description": "Mitochondrion",
      "submitter": "Genome Reference Consortium"
    }
  ],
  "organismName": "Homo sapiens",
  "taxId": 9606
}
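Fields that do not apply to a record can be absent, as with wgsInfo in the sample above, so nested lookups are safer with a default value. A minimal sketch follows; `get_path` is a hypothetical helper, and the record is an excerpt of the sample report.

```python
from typing import Any


def get_path(record: dict, dotted: str, default: Any = None) -> Any:
    """Follow a dotted key path through nested dicts, returning default if any key is missing."""
    node: Any = record
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Excerpt of the sample report shown above.
record = {
    "annotationInfo": {"busco": {"buscoLineage": "primates_odb10"}},
    "assemblyInfo": {"assemblyAccession": "GCF_000001405.39"},
    "organismName": "Homo sapiens",
}

print(get_path(record, "assemblyInfo.assemblyAccession"))      # GCF_000001405.39
print(get_path(record, "wgsInfo.wgsProjectAccession", "n/a"))  # n/a
```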

AssemblyDataReport Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| commonName | common-name | Common name | string | Vernacular name associated with a particular taxon | Human; zebrafish; pacific white shrimp |
| organismName | organism-name | Organism name | string | Scientific name of the species or subspecies | Homo sapiens; Arabidopsis thaliana; Canis lupus familiaris |
| breed | breed | Breed | string | A homogenous group of animals within a domesticated species | Hereford; boxer |
| cultivar | cultivar | Cultivar | string | A variety of plant within a species produced and maintained by cultivation | B73 |
| ecotype | ecotype | Ecotype | string | A population or subspecies occupying a distinct habitat | Alpine |
| isolate | isolate | Isolate | string | The individual isolate from which the sequences in the genome assembly were derived | L1 Dominette 01449 registration number 42190680; Pmale09 |
| sex | sex | Sex | string | Male or female | female |
| strain | strain | Strain | string | A genetic variant, subtype or culture within a species | |
| taxId | tax-id | Taxonomic ID | uint32 | The NCBI Taxonomy identifier for the organism from which the genome assembly was derived. | |
| assemblyInfo | assminfo- | Assembly | AssemblyInfo | Metadata for the genome assembly submission | |
| assemblyStats | assmstats- | Assembly Stats | AssemblyStats | Global statistics for the genome assembly | |
| organelleInfo (repeated) | organelle- | Organelle | OrganelleInfo | Metadata for all associated organelle genomes | |
| annotationInfo | annotinfo- | Annotation Info | AnnotationInfo | Metadata and statistics for the genome assembly annotation, when available | |
| wgsInfo | wgs- | WGS | WGSInfo | Metadata pertaining to the Whole Genome Shotgun (WGS) record for the genome assemblies that are complete genomes. Those that are clone-based do not have WGS-master records. | |

AnnotationInfo Structure

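| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| name | name | Name | string | | |
| source | source | Source | string | | |
| releaseDate | release-date | Release Date | string | | |
| reportUrl | report-url | Report URL | string | | |
| stats | featcount- | Count | FeatureCounts | | |
| busco | busco- | BUSCO | BuscoStat | | |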

AssemblyInfo Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| assemblyAccession | accession | Accession | string | The GenColl assembly accession | GCF_000001405.39 |
| pairedAssemblyAccession | paired_accession | Paired Accession | string | The GenBank or RefSeq assembly accession paired with this assembly | GCA_000001405.28 |
| assemblyLevel | level | Level | string | The level at which a genome has been assembled | chromosome; scaffold; contig |
| assemblyName | name | Name | string | The assembly submitter's name for the genome assembly, when provided. Otherwise, a default name in the form ASM#####v# is assigned | GRCh38.p13; ASM985889v3 |
| assemblyType | type | Type | string | Chromosome content of the submitted genome assembly | haploid-with-alt-loci; haploid |
| bioprojectLineage (repeated) | bioproject- | BioProject | BioProjectLineage | The lineage of BioProject accessions. The specific BioProject which produced the sequences in the genome assembly is listed first, followed in order by its antecedents. | |
| submissionDate | submission-date | Submission Date | string | Date the assembly was submitted to NCBI | |
| description | description | Description | string | Long description for this genome | |
| genbankAssmAccession | genbank-assm-accession | GenBank Accession | string | The GenBank assembly accession is the unique identifier for the set of sequences in this particular version of the genome assembly. | GCA_000001405.28 |
| submitter | submitter | Submitter | string | The submitting consortium or organization. Full submitter information is available in the BioProject | |
| refseqCategory | refseq-category | RefSeq Category | string | The RefSeq Category is either reference or representative genome and indicates the RefSeq project classification | reference genome; representative genome |
| refseqAssmAccession | refseq-assm-accession | RefSeq Accession | string | The RefSeq assembly accession is the unique identifier for the set of sequences in this particular version of the genome assembly. | GCF_000001405.39 |
| ucscAssmName | ucsc-assm-name | UCSC Assembly Name | string | Genome name ascribed to this assembly by the UC Santa Cruz genome browser | hg38 |
| linkedAssembly | linked-assm | Linked Assembly | string | The accession.version and designation (principal or alternate pseudohaplotype) of a paired genome assembly derived from the same diploid individual | |
| sequencingTech | sequencing-tech | Sequencing Tech | string | Sequencing technology used to sequence this genome | |
| biosampleAccession | biosample-accession | BioSample Accession | string | NCBI BioSample Accession for the BioSample from which the sequences in the genome assembly were obtained. | SAMN03145444 |
| blastUrl | blast-url | Blast URL | string | URL to blast page for this assembly | |
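The accession examples above follow a prefix_number.version pattern, with GCA used for GenBank and GCF for RefSeq assemblies. As a hedged sketch, an accession can be split into its parts as follows; the `AssemblyAccession` type, the `parse_accession` function, and the regular expression are illustrative assumptions, not an official NCBI definition.

```python
import re
from typing import NamedTuple


class AssemblyAccession(NamedTuple):
    prefix: str   # "GCA" (GenBank) or "GCF" (RefSeq)
    number: str   # digits kept as text to preserve leading zeros
    version: int

    @property
    def source(self) -> str:
        return "RefSeq" if self.prefix == "GCF" else "GenBank"


# Assumed pattern, inferred from the examples in the table above.
_ACCESSION_RE = re.compile(r"^(GC[AF])_(\d+)\.(\d+)$")


def parse_accession(text: str) -> AssemblyAccession:
    """Split an accession.version string such as GCF_000001405.39 into parts."""
    match = _ACCESSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not an assembly accession.version: {text!r}")
    prefix, number, version = match.groups()
    return AssemblyAccession(prefix, number, int(version))


refseq = parse_accession("GCF_000001405.39")
genbank = parse_accession("GCA_000001405.28")
print(refseq.source, refseq.version)    # RefSeq 39
print(genbank.source, genbank.version)  # GenBank 28
# In the sample report the paired accessions share the numeric part.
print(refseq.number == genbank.number)  # True
```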

AssemblyStats Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| totalNumberOfChromosomes | total-number-of-chromosomes | Total Number of Chromosomes | uint32 | Count of nuclear chromosomes, organelles and plasmids in a submitted genome assembly | |
| totalSequenceLength | total-sequence-len | Total Sequence Length | uint64 | Total sequence length of the nuclear genome including unplaced and unlocalized sequences | |
| totalUngappedLength | total-ungapped-len | Total Ungapped Length | uint64 | Total length of all top-level sequences ignoring gaps. Any stretch of 10 or more Ns in a sequence is treated like a gap | |
| numberOfContigs | number-of-contigs | Number of Contigs | uint32 | Total number of sequence contigs in the assembly. Any stretch of 10 or more Ns in a sequence is treated as a gap between two contigs in a scaffold when counting contigs and calculating contig N50 & L50 values | |
| contigN50 | contig-n50 | Contig N50 | uint32 | Length such that sequence contigs of this length or longer include half the bases of the assembly | |
| contigL50 | contig-l50 | Contig L50 | uint32 | Number of sequence contigs that are longer than, or equal to, the N50 length and therefore include half the bases of the assembly | |
| numberOfScaffolds | number-of-scaffolds | Number of Scaffolds | uint32 | Number of scaffolds including placed, unlocalized, unplaced, alternate loci and patch scaffolds | |
| scaffoldN50 | scaffold-n50 | Scaffold N50 | uint32 | Length such that scaffolds of this length or longer include half the bases of the assembly | |
| scaffoldL50 | scaffold-l50 | Scaffold L50 | uint32 | Number of scaffolds that are longer than, or equal to, the N50 length and therefore include half the bases of the assembly | |
| gapsBetweenScaffoldsCount | gaps-between-scaffolds-count | Gaps Between Scaffolds Count | uint32 | Number of unspanned gaps between scaffolds | |
| numberOfComponentSequences | number-of-component-sequences | Number of Component Sequences | uint32 | Total number of component WGS or clone sequences in the assembly | |
| gcCount | gc-count | GC Count | uint64 | The number of GC base-pairs in the assembly | |
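The N50 and L50 definitions above can be illustrated with a short calculation. This is a sketch of the standard definition, not NCBI's own implementation; the function name `n50_l50` is hypothetical, and the lengths are toy values, since the report contains only the summary statistics and not the individual sequence lengths.

```python
from typing import Iterable, Tuple


def n50_l50(lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50, L50) for a collection of contig or scaffold lengths.

    Sequences are visited from longest to shortest. N50 is the length of
    the sequence at which the running total first reaches half of all
    bases; L50 is how many sequences were needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("need at least one sequence length")


# Toy lengths totalling 300 bases: 80 + 70 = 150 reaches the halfway point.
print(n50_l50([10, 20, 30, 40, 50, 70, 80]))  # (70, 2)
```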

BuscoStat Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| buscoLineage | lineage | Lineage | string | BUSCO Lineage | |
| buscoVer | ver | Version | string | BUSCO Version | |
| complete | complete | Complete | float | BUSCO score: Complete | |
| singleCopy | singlecopy | Single Copy | float | BUSCO score: Single Copy | |
| duplicated | duplicated | Duplicated | float | BUSCO score: Duplicated | |
| fragmented | fragmented | Fragmented | float | BUSCO score: Fragmented | |
| missing | missing | Missing | float | BUSCO score: Missing | |
| totalCount | totalcount | Total Count | uint64 | BUSCO score: Total Count | |
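In the sample report the BUSCO scores appear to be fractions of totalCount: complete, fragmented and missing sum to roughly 1, and complete is roughly singleCopy plus duplicated. Under that assumption, the sketch below formats a BuscoStat object in the one-line notation that BUSCO itself uses for summaries; the function name `busco_summary` is hypothetical.

```python
def busco_summary(busco: dict) -> str:
    """Format BuscoStat fractions as a BUSCO-style one-line summary.

    Assumes the score fields are fractions between 0 and 1, as in the
    sample report, and that totalCount may arrive as a string.
    """
    def pct(key: str) -> str:
        return f"{100 * busco[key]:.1f}%"

    return (
        f"C:{pct('complete')}[S:{pct('singleCopy')},D:{pct('duplicated')}],"
        f"F:{pct('fragmented')},M:{pct('missing')},n:{int(busco['totalCount'])}"
    )


# Values copied from the sample report.
sample_busco = {
    "buscoLineage": "primates_odb10",
    "buscoVer": "4.0.2",
    "complete": 0.9920174,
    "duplicated": 0.0066763423,
    "fragmented": 0.0017416546,
    "missing": 0.006240929,
    "singleCopy": 0.9853411,
    "totalCount": "13780",
}

print(busco_summary(sample_busco))
# C:99.2%[S:98.5%,D:0.7%],F:0.2%,M:0.6%,n:13780
```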

FeatureCounts Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| geneCounts | gene- | Gene | GeneCounts | Counts of gene types | |

GeneCounts Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| total | total | Total | uint32 | Total number of annotated genes | |
| proteinCoding | protein-coding | Protein-coding | uint32 | Count of annotated genes that encode a protein | |
| nonCoding | non-coding | Non-coding | uint32 | Count of transcribed non-coding genes (e.g. lncRNAs, miRNAs, rRNAs); excludes transcribed pseudogenes | |
| pseudogene | pseudogene | Pseudogene | uint32 | Count of transcribed and non-transcribed pseudogenes | |
| other | other | Other | uint32 | Count of genic region GeneIDs and non-genic regulatory GeneIDs | |
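In the sample report the four category counts add up to total. Whether that holds for every annotation is an assumption rather than something this schema states, so a consumer may want to check it before deriving proportions. A small sketch with hypothetical helper names:

```python
CATEGORY_KEYS = ("proteinCoding", "nonCoding", "pseudogene", "other")


def category_sum(gene_counts: dict) -> int:
    """Add up the per-category gene counts, treating absent categories as 0."""
    return sum(gene_counts.get(key, 0) for key in CATEGORY_KEYS)


def gene_fractions(gene_counts: dict) -> dict:
    """Return each category as a fraction of the summed categories."""
    denominator = category_sum(gene_counts)
    if denominator == 0:
        raise ValueError("no gene counts to compare")
    return {key: gene_counts.get(key, 0) / denominator for key in CATEGORY_KEYS}


# Values copied from the sample report.
sample_counts = {
    "nonCoding": 18201,
    "other": 411,
    "proteinCoding": 19527,
    "pseudogene": 16446,
    "total": 54585,
}

print(category_sum(sample_counts) == sample_counts["total"])  # True
print(round(gene_fractions(sample_counts)["proteinCoding"], 3))
```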

OrganelleInfo Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| assemblyName | assembly-name | Assembly Name | string | Name of associated nuclear assembly | |
| infraspecificName | infraspecific-name | Infraspecific Name | string | The strain, breed, cultivar or ecotype of the organism from which the sequences in the assembly were derived | |
| bioproject (repeated) | bioproject-accessions | BioProject Accessions | string | The associated BioProject accession, when available | |
| description | description | Description | string | Long description of the organelle genome | |
| totalSeqLength | total-seq-length | Total Seq Length | uint64 | Sequence length of the organelle genome | |
| submitter | submitter | Submitter | string | Name of submitter | |

WGSInfo Structure

Whole Genome Shotgun (WGS) projects are genome assemblies of incomplete genomes or incomplete chromosomes of prokaryotes or eukaryotes that are generally being sequenced by a whole genome shotgun strategy.

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| wgsProjectAccession | project-accession | project accession | string | | AAEX03; CABHLF01 |
| masterWgsUrl | url | URL | string | | https://www.ncbi.nlm.nih.gov/nuccore/AAEX00000000.3 |
| wgsContigsUrl | contigs-url | contigs URL | string | | https://www.ncbi.nlm.nih.gov/Traces/wgs/AAEX03 |

BioProject Structure

A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium. A BioProject record gives users a single place to find links to the diverse data types generated for that project. The record can be retrieved from NCBI BioProject.

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| accession | accession | Accession | string | BioProject accession | PRJEB35387 |
| title | title | Title | string | Title of the BioProject provided by the submitter | Sciurus carolinensis (grey squirrel) genome assembly, mSciCar1 |
| parentAccessions (repeated) | parent-accessions | Parent Accessions | string | BioProject accession containing multiple children BioProjects | ["PRJNA489243","PRJEB33226","PRJEB40665"] |

BioProjectLineage Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| bioprojects (repeated) | lineage- | Lineage | BioProject | A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium | |
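Since bioprojectLineage is a repeated structure whose entries each hold a repeated bioprojects list, collecting the accessions takes two levels of iteration. The sketch below assumes the nesting shown in the sample report; `lineage_accessions` is a hypothetical helper name.

```python
from typing import List


def lineage_accessions(assembly_info: dict) -> List[List[str]]:
    """Return BioProject accessions grouped by lineage, in report order.

    Missing keys are treated as empty lists so that records without
    BioProject information yield an empty result instead of an error.
    """
    return [
        [project["accession"] for project in lineage.get("bioprojects", [])]
        for lineage in assembly_info.get("bioprojectLineage", [])
    ]


# Excerpt of the assemblyInfo object from the sample report.
sample_info = {
    "assemblyAccession": "GCF_000001405.39",
    "bioprojectLineage": [
        {"bioprojects": [{"accession": "PRJNA31257"}]},
    ],
}

print(lineage_accessions(sample_info))  # [['PRJNA31257']]
```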

Scalar Value Types

| Protocol buffers type | Notes | C++ | Python | Java | Go |
| --- | --- | --- | --- | --- | --- |
| double | | double | float | double | float64 |
| float | | float | float | float | float32 |
| int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 |
| int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64 |
| uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32 |
| uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64 |
| sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 |
| sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64 |
| fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 |
| fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64 |
| sfixed32 | Always four bytes. | int32 | int | int | int32 |
| sfixed64 | Always eight bytes. | int64 | int/long | long | int64 |
| bool | | bool | boolean | boolean | bool |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string |
| bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte |
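One practical consequence of these types: in the sample report the uint64 fields (totalSequenceLength, totalUngappedLength, gcCount, totalCount) arrive as quoted strings, while uint32 fields are plain JSON numbers. This matches the protocol buffers JSON mapping, which writes 64-bit integers as strings. A minimal sketch of converting them before doing arithmetic follows; the helper name `coerce_uint64` and the key list are assumptions made for illustration.

```python
from typing import Iterable

# uint64 fields of AssemblyStats, per the table in this document.
UINT64_STATS = ("totalSequenceLength", "totalUngappedLength", "gcCount")


def coerce_uint64(stats: dict, keys: Iterable[str] = UINT64_STATS) -> dict:
    """Return a copy of stats with the listed string-encoded integers parsed."""
    converted = dict(stats)
    for key in keys:
        if key in converted:
            converted[key] = int(converted[key])
    return converted


# Excerpt of the assemblyStats object from the sample report.
sample_stats = {
    "contigL50": 18,
    "gcCount": "1853979372",
    "totalSequenceLength": "3099706404",
    "totalUngappedLength": "2948583725",
}

stats = coerce_uint64(sample_stats)
# After conversion the values support arithmetic, for example the
# difference between total and ungapped length.
print(stats["totalSequenceLength"] - stats["totalUngappedLength"])  # 151122679
```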
Generated October 18, 2021