Genome assembly report

Genome record accession, organism, assembly statistics, and annotation info

The downloaded genome package contains a genome assembly data report in JSON Lines format in the file:

ncbi_dataset/data/assembly_data_report.jsonl

Each line of the genome assembly data report file is a hierarchical JSON object that represents a single genome assembly record. The schema of the genome assembly record is defined in the tables below where each row describes a single field in the report or a sub-structure, which is a collection of fields. The outermost structure of the report is AssemblyDataReport.

Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option. Refer to the dataformat CLI tool reference to see how you can use this tool to transform assembly data reports from JSON Lines to tabular formats.
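
Because each line of the report is a complete JSON object, the file can be read one record at a time. The following is a minimal Python sketch of that idea; the function name `iter_reports` and the two shortened records are illustrative assumptions, not part of the NCBI tooling.

```python
import io
import json


def iter_reports(lines):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two shortened, illustrative records standing in for the real file.
sample = io.StringIO(
    '{"accession": "GCF_000001405.40", "organism": {"taxId": 9606}}\n'
    '{"accession": "GCA_000001405.29", "organism": {"taxId": 9606}}\n'
)

accessions = [report["accession"] for report in iter_reports(sample)]
print(accessions)
```

To read a downloaded package, pass an open file object for ncbi_dataset/data/assembly_data_report.jsonl to the function in place of the in-memory sample.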

Sample report

{
  "accession": "GCF_000001405.40",
  "annotationInfo": {
    "busco": {
      "buscoLineage": "primates_odb10",
      "buscoVer": "4.1.4",
      "complete": 0.99187225,
      "duplicated": 0.007256894,
      "fragmented": 0.0015239477,
      "missing": 0.0066037737,
      "singleCopy": 0.9846154,
      "totalCount": "13780"
    },
    "method": "Best-placed RefSeq; Gnomon; RefSeqFE; cmsearch; tRNAscan-SE",
    "name": "GCF_000001405.40-RS_2023_10",
    "pipeline": "NCBI eukaryotic genome annotation pipeline",
    "provider": "NCBI RefSeq",
    "releaseDate": "2023-10-02",
    "reportUrl": "https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/GCF_000001405.40-RS_2023_10.html",
    "softwareVersion": "10.2",
    "stats": {
      "geneCounts": {
        "nonCoding": 22158,
        "other": 413,
        "proteinCoding": 20080,
        "pseudogene": 17001,
        "total": 59652
      }
    },
    "status": "Updated annotation"
  },
  "assemblyInfo": {
    "assemblyLevel": "Chromosome",
    "assemblyName": "GRCh38.p14",
    "assemblyStatus": "current",
    "assemblyType": "haploid-with-alt-loci",
    "bioprojectAccession": "PRJNA31257",
    "bioprojectLineage": [
      {
        "bioprojects": [
          {
            "accession": "PRJNA31257",
            "title": "The Human Genome Project, currently maintained by the Genome Reference Consortium (GRC)"
          }
        ]
      }
    ],
    "blastUrl": "https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&PROG_DEF=blastn&BLAST_SPEC=GDH_GCF_000001405.40",
    "description": "Genome Reference Consortium Human Build 38 patch release 14 (GRCh38.p14)",
    "pairedAssembly": {
      "accession": "GCA_000001405.29",
      "onlyGenbank": "4 unlocalized and unplaced scaffolds.",
      "status": "current"
    },
    "refseqCategory": "reference genome",
    "releaseDate": "2022-02-03",
    "submitter": "Genome Reference Consortium",
    "synonym": "hg38"
  },
  "assemblyStats": {
    "contigL50": 18,
    "contigN50": 57879411,
    "gapsBetweenScaffoldsCount": 349,
    "gcCount": "1374283647",
    "gcPercent": 41.0,
    "numberOfComponentSequences": 35611,
    "numberOfContigs": 996,
    "numberOfOrganelles": 1,
    "numberOfScaffolds": 470,
    "scaffoldL50": 16,
    "scaffoldN50": 67794873,
    "totalNumberOfChromosomes": 24,
    "totalSequenceLength": "3099441038",
    "totalUngappedLength": "2948318359"
  },
  "currentAccession": "GCF_000001405.40",
  "organelleInfo": [
    {
      "description": "Mitochondrion",
      "submitter": "Genome Reference Consortium",
      "totalSeqLength": "16569"
    }
  ],
  "organism": {
    "commonName": "human",
    "organismName": "Homo sapiens",
    "taxId": 9606
  },
  "pairedAccession": "GCA_000001405.29",
  "sourceDatabase": "SOURCE_DATABASE_REFSEQ"
}
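
In the sample above, 64-bit integer fields such as totalSequenceLength appear as JSON strings, while 32-bit fields such as contigN50 appear as JSON numbers, which is how the protocol buffers JSON mapping encodes 64-bit integers. The sketch below shows one way to handle this in Python; the record is a shortened copy of the sample, and the variable names are illustrative.

```python
import json

# Shortened copy of the sample report above.
record = json.loads("""
{
  "accession": "GCF_000001405.40",
  "assemblyStats": {
    "contigN50": 57879411,
    "totalSequenceLength": "3099441038",
    "totalUngappedLength": "2948318359"
  },
  "organism": {"organismName": "Homo sapiens", "taxId": 9606}
}
""")

stats = record["assemblyStats"]

# uint64 fields arrive as strings and need an explicit conversion.
total_length = int(stats["totalSequenceLength"])
ungapped_length = int(stats["totalUngappedLength"])
gap_bases = total_length - ungapped_length

# Sub-structures such as wgsInfo may be absent, so use .get() for them.
wgs_info = record.get("wgsInfo")

print(gap_bases, wgs_info)
```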

AssemblyDataReport Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| accession | accession | Assembly Accession | string | The GenColl assembly accession | GCF_000001405.40 |
| currentAccession | current-accession | Current Accession | string | The latest GenColl assembly accession for this revision chain | GCF_000001405.40 |
| sourceDatabase | source_database | Source Database | SourceDatabase | Source of the accession. The paired accession, if it exists, is from the other database. | REFSEQ; GENBANK |
| organism | organism- | Organism | Organism | | |
| assemblyInfo | assminfo- | Assembly | AssemblyInfo | Metadata for the genome assembly submission | |
| assemblyStats | assmstats- | Assembly Stats | AssemblyStats | Global statistics for the genome assembly | |
| organelleInfo repeated | organelle- | Organelle | OrganelleInfo | Metadata for all associated organelle genomes | |
| annotationInfo | annotinfo- | Annotation | AnnotationInfo | Metadata and statistics for the genome assembly annotation, when available | |
| wgsInfo | wgs- | WGS | WGSInfo | Metadata pertaining to the Whole Genome Shotgun (WGS) record for the genome assemblies that are complete genomes. Those that are clone-based do not have WGS-master records. | |
| typeMaterial | type_material- | Type Material | TypeMaterial | | |
| checkmInfo | checkm- | CheckM | CheckM | Metadata on the completeness and contamination of this assembly | |
| averageNucleotideIdentity | ani- | ANI | AverageNucleotideIdentity | | |
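
The table above gives each sub-structure a mnemonic prefix ending in a hyphen (for example assminfo- for assemblyInfo). The sketch below assumes that a full field mnemonic is formed by joining that prefix with the mnemonic of a field inside the sub-structure, such as assminfo-name, and resolves a few such mnemonics to JSON paths. The `FIELD_PATHS` mapping and both helper functions are hypothetical; this is not the dataformat tool, and it writes the mnemonics themselves as the header row for simplicity.

```python
import csv
import io

# Hypothetical mapping from a composed mnemonic to its path in the JSON record.
FIELD_PATHS = {
    "accession": ("accession",),
    "assminfo-name": ("assemblyInfo", "assemblyName"),
    "assmstats-contig-n50": ("assemblyStats", "contigN50"),
}


def lookup(record, path):
    """Follow a path of keys; return an empty string if any key is missing."""
    value = record
    for key in path:
        if not isinstance(value, dict) or key not in value:
            return ""
        value = value[key]
    return value


def to_tsv(records, fields):
    """Render the chosen fields of each record as tab-separated text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    for record in records:
        writer.writerow([lookup(record, FIELD_PATHS[name]) for name in fields])
    return out.getvalue()


record = {
    "accession": "GCF_000001405.40",
    "assemblyInfo": {"assemblyName": "GRCh38.p14"},
    "assemblyStats": {"contigN50": 57879411},
}
table = to_tsv([record], ["accession", "assminfo-name", "assmstats-contig-n50"])
print(table, end="")
```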

ANIMatch Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| assembly | assembly | Assembly | string | | GCA_010191885.1 |
| organismName | organism | Organism | string | | Salmonella enterica subsp. enterica serovar Typhimurium |
| category | category | Type Category | ANITypeCategory | | Type material |
| ani | ani | ANI | float | | 98.5 |
| assemblyCoverage | assembly_coverage | Assembly Coverage | float | AKA qcoverage | 90.75 |
| typeAssemblyCoverage | type_assembly_coverage | Type Assembly Coverage | float | AKA scoverage | 89.60 |

AnnotationInfo Structure

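| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| name | name | Name | string | | |
| provider | provider | Provider | string | | |
| releaseDate | release-date | Release Date | string | | |
| reportUrl | report-url | Report URL | string | | |
| stats | featcount- | Count | FeatureCounts | | |
| busco | busco- | BUSCO | BuscoStat | | |
| method | method | Method | string | | |
| pipeline | pipeline | Pipeline | string | | |
| softwareVersion | software-version | Software Version | string | | |
| status | status | Status | string | | |
| releaseVersion | release-version | Release Version | string | | |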

AssemblyInfo Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| assemblyLevel | level | Level | string | The level at which a genome has been assembled | chromosome; scaffold; contig |
| assemblyStatus | status | Status | AssemblyStatus | The GenColl assembly status | current |
| pairedAssembly | paired-assm- | Paired Assembly | PairedAssembly | Metadata from the GenBank or RefSeq assembly paired with this one | |
| assemblyName | name | Name | string | The assembly submitter’s name for the genome assembly, when provided. Otherwise, a default name in the form ASM#####v# is assigned | GRCh38.p14; ASM985889v3 |
| assemblyType | type | Type | string | Chromosome content of the submitted genome assembly | haploid-with-alt-loci; haploid |
| bioprojectLineage repeated | bioproject- | BioProject | BioProjectLineage | The lineage of BioProject accessions. The specific BioProject which produced the sequences in the genome assembly is listed first, followed in order by its antecedents. | |
| bioprojectAccession | bioproject | BioProject Accession | string | | |
| releaseDate | release-date | Release Date | string | Date the assembly was made available by NCBI. This field is not returned by versions of the datasets command-line interface (CLI) program earlier than 15. | |
| description | description | Description | string | Long description for this genome | |
| submitter | submitter | Submitter | string | The submitting consortium or organization. Full submitter information is available in the BioProject | |
| refseqCategory | refseq-category | Refseq Category | string | The RefSeq Category is either reference or representative genome and indicates the RefSeq project classification | reference genome; representative genome |
| synonym | synonym | Synonym | string | Genome name ascribed to this assembly by the UC Santa Cruz genome browser | hg38 |
| linkedAssemblies repeated | linked-assm- | Linked Assembly | LinkedAssembly | Genome assemblies derived from the same diploid individual | |
| atypical | atypical | Atypical | AtypicalInfo | Information on atypical genomes - genomes that have assembly issues or are otherwise atypical | |
| genomeNotes repeated | notes | Notes | string | All the RefSeq messages associated with this assembly | |
| sequencingTech | sequencing-tech | Sequencing Tech | string | Sequencing technology used to sequence this genome | |
| assemblyMethod | assembly-method | Assembly Method | string | Genome assembly method | |
| biosample | biosample- | BioSample | BioSampleDescriptor | NCBI BioSample from which the sequences in the genome assembly were obtained. | |
| blastUrl | blast-url | Blast URL | string | URL to blast page for this assembly | |
| comments | coming soon | coming soon | string | Freeform comments | |
| suppressionReason | suppression-reason | Suppression Reason | string | The reason the assembly is suppressed, for suppressed assemblies | |
| diploidRole | | | LinkedAssemblyType | | |

AssemblyStats Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| totalNumberOfChromosomes | total-number-of-chromosomes | Total Number of Chromosomes | uint32 | Count of nuclear chromosomes, organelles and plasmids in a submitted genome assembly | |
| totalSequenceLength | total-sequence-len | Total Sequence Length | uint64 | Total sequence length of the nuclear genome including unplaced and unlocalized sequences | |
| totalUngappedLength | total-ungapped-len | Total Ungapped Length | uint64 | Total length of all top-level sequences ignoring gaps. Any stretch of 10 or more Ns in a sequence is treated like a gap | |
| numberOfContigs | number-of-contigs | Number of Contigs | uint32 | Total number of sequence contigs in the assembly. Any stretch of 10 or more Ns in a sequence is treated as a gap between two contigs in a scaffold when counting contigs and calculating contig N50 & L50 values | |
| contigN50 | contig-n50 | Contig N50 | uint32 | Length such that sequence contigs of this length or longer include half the bases of the assembly | |
| contigL50 | contig-l50 | Contig L50 | uint32 | Number of sequence contigs that are longer than, or equal to, the N50 length and therefore include half the bases of the assembly | |
| numberOfScaffolds | number-of-scaffolds | Number of Scaffolds | uint32 | Number of scaffolds including placed, unlocalized, unplaced, alternate loci and patch scaffolds | |
| scaffoldN50 | scaffold-n50 | Scaffold N50 | uint32 | Length such that scaffolds of this length or longer include half the bases of the assembly | |
| scaffoldL50 | scaffold-l50 | Scaffold L50 | uint32 | Number of scaffolds that are longer than, or equal to, the N50 length and therefore include half the bases of the assembly | |
| gapsBetweenScaffoldsCount | gaps-between-scaffolds-count | Gaps Between Scaffolds Count | uint32 | Number of unspanned gaps between scaffolds | |
| numberOfComponentSequences | number-of-component-sequences | Number of Component Sequences | uint32 | Total number of component WGS or clone sequences in the assembly | |
| gcPercent | gc-percent | GC Percent | float | The percentage of GC base-pairs in the assembly | |
| genomeCoverage | genome-coverage | Genome Coverage | string | Genome assembly coverage | |
| numberOfOrganelles | number-of-organelles | Number of Organelles | uint32 | Number of organelles | |
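
The N50 and L50 definitions in this table can be made concrete with a short calculation. The sketch below follows the common approach of sorting sequences from longest to shortest and accumulating lengths until half of the total is reached; the function name and the toy lengths are illustrative, and this is not necessarily how NCBI computes the reported values.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # `length` is the N50; `count` sequences were needed to reach it.
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy assembly of seven contigs totalling 300 bases.
n50, l50 = n50_l50([80, 70, 50, 40, 30, 20, 10])
print(n50, l50)
```

In the toy example the two longest contigs (80 and 70) together cover 150 of 300 bases, so N50 is 70 and L50 is 2.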

AtypicalInfo Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| isAtypical | is-atypical | Is Atypical | bool | If true there are assembly issues or the assembly is in some way non-standard | |
| warnings repeated | warnings | Warnings | string | The reasons that the assembly is considered atypical | |

AverageNucleotideIdentity Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| taxonomyCheckStatus | check-status | Check status | AverageNucleotideIdentity.TaxonomyCheckStatus | | ok; failed; inconclusive |
| matchStatus | best-match-status | Best match status | AverageNucleotideIdentity.MatchStatus | | derived-species-match |
| submittedOrganism | submitted-organism | Submitted organism | string | Column 5 of ANI Report | Salmonella enterica subsp. enterica serovar Tennessee str. CDC07-0191 |
| submittedSpecies | submitted-species | Submitted species | string | Column 6 of ANI Report | Salmonella enterica |
| category | category | Category | ANITypeCategory | | syntype |
| submittedAniMatch | submitted-ani-match- | Declared ANI match | ANIMatch | | |
| bestAniMatch | best-ani-match- | Best ANI match | ANIMatch | | |
| comment | comment | Comment | string | | |

BioProject Structure

A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium. A BioProject record provides users a single place to find links to the diverse data types generated for that project. The record can be retrieved from NCBI BioProject.

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| accession | accession | Accession | string | BioProject accession | PRJEB35387 |
| title | title | Title | string | Title of the BioProject provided by the submitter | Sciurus carolinensis (grey squirrel) genome assembly, mSciCar1 |
| parentAccessions repeated | parent-accessions | Parent Accessions | string | BioProject accession containing multiple children BioProjects | ["PRJNA489243","PRJEB33226","PRJEB40665"] |

BioProjectLineage Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| bioprojects repeated | lineage- | Lineage | BioProject | A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium | |

BioSampleAttribute Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| name | name | Name | string | | |
| value | value | Value | string | | |

BioSampleContact Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| lab | lab | Lab | string | Submitter lab name. | |

BioSampleDescription Structure

Description of the BioSample object

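| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| title | title | Title | string | | |
| organism | organism- | Organism | Organism | | |
| comment | comment | Comment | string | | |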

BioSampleDescriptor Structure

TODO: We may be able to delete but not sure if other things are relying on it…

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| accession | accession | Accession | string | | SAMN20055006 |
| lastUpdated | last-updated | Last updated | string | | |
| publicationDate | publication-date | Publication date | string | | |
| submissionDate | submission-date | Submission date | string | | |
| sampleIds repeated | ids- | Sample Identifiers | BioSampleId | | |
| description | description- | Description | BioSampleDescription | | |
| owner | owner- | Owner | BioSampleOwner | | |
| models repeated | models | Models | string | | |
| bioprojects repeated | bioproject- | BioProject | BioProject | | |
| package | package | Package | string | | MIGS.ba.air.4.0 |
| attributes repeated | attribute- | Attribute | BioSampleAttribute | | |
| status | status- | Status | BioSampleStatus | | |

BioSampleId Structure

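| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| db | db | Database | string | | Wellcome Sanger Institute |
| label | label | Label | string | | Sample name |
| value | value | Value | string | | COG-UK/ALDP-17A6A8C |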

BioSampleOwner Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| name | name | Name | string | | |
| contacts repeated | contact- | Contact | BioSampleContact | | |

BioSampleStatus Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| status | status | Status | string | | live |
| when | when | When | string | | |

BuscoStat Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| buscoLineage | lineage | Lineage | string | BUSCO Lineage | |
| buscoVer | ver | Version | string | BUSCO Version | |
| complete | complete | Complete | float | BUSCO score: Complete | |
| singleCopy | singlecopy | Single Copy | float | BUSCO score: Single Copy | |
| duplicated | duplicated | Duplicated | float | BUSCO score: Duplicated | |
| fragmented | fragmented | Fragmented | float | BUSCO score: Fragmented | |
| missing | missing | Missing | float | BUSCO score: Missing | |
| totalCount | totalcount | Total Count | uint64 | BUSCO score: Total Count | |
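
In BUSCO's scoring, complete genes are the sum of single-copy and duplicated genes, and the complete, fragmented and missing fractions together account for every gene in the lineage set. The sketch below checks both relationships against the busco values from the sample report; the function name and the tolerance are assumptions chosen for illustration, since the reported floats are rounded.

```python
import math

# BUSCO values copied from the sample report.
busco = {
    "complete": 0.99187225,
    "singleCopy": 0.9846154,
    "duplicated": 0.007256894,
    "fragmented": 0.0015239477,
    "missing": 0.0066037737,
    "totalCount": "13780",  # uint64, so encoded as a string in JSON
}


def busco_is_consistent(b, tol=1e-4):
    """Check complete = singleCopy + duplicated and that the fractions sum to 1."""
    complete_ok = math.isclose(
        b["complete"], b["singleCopy"] + b["duplicated"], abs_tol=tol
    )
    total_ok = math.isclose(
        b["complete"] + b["fragmented"] + b["missing"], 1.0, abs_tol=tol
    )
    return complete_ok and total_ok


# Convert the complete fraction back into an approximate gene count.
complete_genes = round(busco["complete"] * int(busco["totalCount"]))
print(busco_is_consistent(busco), complete_genes)
```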

CheckM Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| checkmMarkerSet | marker-set | marker set | string | The taxonomic group used as the basis for comparison with this assembly with regard to CheckM values | Mycobacterium avium |
| checkmSpeciesTaxId | species-tax-id | species tax id | uint32 | The species-level taxid for this assembly's CheckM dataset | 1764 |
| checkmMarkerSetRank | marker-set-rank | marker set rank | string | CheckM taxonomic rank of checkm_marker_set | species; genus |
| checkmVersion | version | version | string | CheckM software version | v1.2.0 |
| completeness | completeness | completeness | float | The percent completeness of this assembly | 86.83 |
| contamination | contamination | contamination | float | The contamination percentage for this assembly | 5.18 |
| completenessPercentile | completeness-percentile | completeness percentile | float | The percent of assemblies under the taxonomic grouping ‘checkm_marker_set’ that this assembly is as-or-more complete than. | 79 |

FeatureCounts Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| geneCounts | gene- | Gene | GeneCounts | Counts of gene types | |

GeneCounts Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| total | total | Total | uint32 | Total number of annotated genes | |
| proteinCoding | protein-coding | Protein-coding | uint32 | Count of annotated genes that encode a protein | |
| nonCoding | non-coding | Non-coding | uint32 | Count of transcribed non-coding genes (e.g. lncRNAs, miRNAs, rRNAs, etc.); excludes transcribed pseudogenes | |
| pseudogene | pseudogene | Pseudogene | uint32 | Count of transcribed and non-transcribed pseudogenes | |
| other | other | Other | uint32 | Count of genic region GeneIDs and non-genic regulatory GeneIDs | |
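
In the sample report, the four category counts add up to the total. The sketch below verifies that for the sample values and derives each category's share; whether the identity holds for every annotation is an assumption not stated in this schema, so treat the check as a sanity test rather than a guarantee.

```python
# geneCounts values copied from the sample report.
gene_counts = {
    "nonCoding": 22158,
    "other": 413,
    "proteinCoding": 20080,
    "pseudogene": 17001,
    "total": 59652,
}

categories = ("proteinCoding", "nonCoding", "pseudogene", "other")

# Sum of the four categories, to compare against the reported total.
category_sum = sum(gene_counts[name] for name in categories)

# Fraction of all annotated genes that falls in each category.
shares = {name: gene_counts[name] / gene_counts["total"] for name in categories}

print(category_sum == gene_counts["total"])
for name in categories:
    print(f"{name}: {shares[name]:.1%}")
```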

InfraspecificNames Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| breed | breed | Breed | string | A homogenous group of animals within a domesticated species | Hereford; boxer |
| cultivar | cultivar | Cultivar | string | A variety of plant within a species produced and maintained by cultivation | B73 |
| ecotype | ecotype | Ecotype | string | A population or subspecies occupying a distinct habitat | Alpine |
| isolate | isolate | Isolate | string | The individual isolate from which the sequences in the genome assembly were derived | L1 Dominette 01449 registration number 42190680; Pmale09 |
| sex | sex | Sex | string | Male or female | female |
| strain | strain | Strain | string | A genetic variant, subtype or culture within a species | SE11 |

LinkedAssembly Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| linkedAssembly | accession | Accession | string | The linked assembly accession | GCA_000212995.1 |
| assemblyType | type | Type | LinkedAssemblyType | The linked assembly type | |

OrganelleInfo Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| infraspecificName | infraspecific-name | Infraspecific Name | string | The strain, breed, cultivar or ecotype of the organism from which the sequences in the assembly were derived | |
| bioproject repeated | bioproject-accessions | BioProject Accessions | string | The associated BioProject accession, when available | |
| description | description | Description | string | Long description of the organelle genome | |
| totalSeqLength | total-seq-length | Total Seq Length | uint64 | Sequence length of the organelle genome | |
| submitter | submitter | Submitter | string | Name of submitter | |

PairedAssembly Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| accession | accession | Accession | string | The GenColl assembly accession of the GenBank or RefSeq assembly paired with this one | GCF_000001405.40 |
| status | status | Status | AssemblyStatus | GenColl Assembly status from paired record | current |
| annotationName | name | Name | string | Annotation name from paired record | |
| onlyGenbank | only-genbank | Only Genbank | string | Sequences that are only included in the GenBank assembly | |
| onlyRefseq | only-refseq | Only RefSeq | string | Sequences that are only included in the RefSeq assembly | |
| changed | changed | Changed | string | Sequences present on both the GenBank and the RefSeq assemblies that have been changed, e.g., contaminated sequence in the GenBank assembly has been replaced with a gap | |
| manualDiff | manual-diff | Manual Diff | string | Additional details about sequence differences between the GenBank and RefSeq assemblies | |

TypeMaterial Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| typeLabel | label | Label | string | | |
| typeDisplayText | display_text | Display Text | string | | |

WGSInfo Structure

Whole Genome Shotgun (WGS) projects are genome assemblies of incomplete genomes or incomplete chromosomes of prokaryotes or eukaryotes that are generally being sequenced by a whole genome shotgun strategy.

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
| --- | --- | --- | --- | --- | --- |
| wgsProjectAccession | project-accession | project accession | string | | AAEX03; CABHLF01 |
| masterWgsUrl | url | URL | string | | https://www.ncbi.nlm.nih.gov/nuccore/AAEX00000000.3 |
| wgsContigsUrl | contigs-url | contigs URL | string | | https://www.ncbi.nlm.nih.gov/Traces/wgs/AAEX03 |

ANITypeCategory Enumeration

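| Name | Number | Description |
| --- | --- | --- |
| ANI_CATEGORY_UNKNOWN | 0 | |
| claderef | 1 | |
| category_na | 2 | |
| neotype | 3 | |
| no_type | 4 | |
| pathovar | 5 | |
| reftype | 6 | |
| suspected_type | 7 | |
| syntype | 8 | |
| type | 9 | |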

AssemblyStatus Enumeration

| Name | Number | Description |
| --- | --- | --- |
| ASSEMBLY_STATUS_UNKNOWN | 0 | |
| current | 1 | |
| previous | 2 | |
| suppressed | 3 | |
| retired | 4 | This is deprecated - should no longer be seen in the data |

AverageNucleotideIdentity.MatchStatus Enumeration

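| Name | Number | Description |
| --- | --- | --- |
| BEST_MATCH_STATUS_UNKNOWN | 0 | |
| approved_mismatch | 1 | |
| below_threshold_match | 2 | |
| below_threshold_mismatch | 3 | |
| best_match_status | 4 | |
| derived_species_match | 5 | |
| genus_match | 6 | |
| low_coverage | 7 | |
| mismatch | 8 | |
| status_na | 9 | |
| species_match | 10 | |
| subspecies_match | 11 | |
| synonym_match | 12 | |
| lineage_match | 13 | |
| below_threshold_lineage_match | 14 | |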

AverageNucleotideIdentity.TaxonomyCheckStatus Enumeration

| Name | Number | Description |
| --- | --- | --- |
| TAXONOMY_CHECK_STATUS_UNKNOWN | 0 | |
| OK | 1 | |
| Failed | 2 | |
| Inconclusive | 3 | |

LinkedAssemblyType Enumeration

| Name | Number | Description |
| --- | --- | --- |
| LINKED_ASSEMBLY_TYPE_UNKNOWN | 0 | |
| alternate_pseudohaplotype_of_diploid | 1 | SEQUI-5245 |
| principal_pseudohaplotype_of_diploid | 2 | |
| maternal_haplotype_of_diploid | 3 | |
| paternal_haplotype_of_diploid | 4 | |
| haplotype_1 | 6 | |
| haplotype_2 | 7 | |
| haplotype_3 | 8 | |
| haplotype_4 | 9 | |
| haploid | 10 | Catch all for any value that is not explicitly listed above |

SourceDatabase Enumeration

| Name | Number | Description |
| --- | --- | --- |
| SOURCE_DATABASE_UNSPECIFIED | 0 | |
| SOURCE_DATABASE_GENBANK | 1 | |
| SOURCE_DATABASE_REFSEQ | 2 | |
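
The sample report carries the full enumeration name (SOURCE_DATABASE_REFSEQ), whereas the Examples column of the AssemblyDataReport table shows the short forms REFSEQ and GENBANK. The sketch below accepts either spelling and maps it to a Python enum that mirrors the numbers in this table; the class and function names are illustrative and not part of any NCBI library.

```python
from enum import Enum


class SourceDatabase(Enum):
    """Mirror of the SourceDatabase enumeration, without the common prefix."""
    UNSPECIFIED = 0
    GENBANK = 1
    REFSEQ = 2


def parse_source_database(value):
    """Accept 'SOURCE_DATABASE_REFSEQ' or 'REFSEQ' and return the enum member."""
    prefix = "SOURCE_DATABASE_"
    name = value[len(prefix):] if value.startswith(prefix) else value
    return SourceDatabase[name]


source = parse_source_database("SOURCE_DATABASE_REFSEQ")
print(source.name, source.value)
```

An unrecognized name raises KeyError, which makes unexpected values visible instead of silently mapping them to UNSPECIFIED.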

Scalar Value Types

| Protocol buffers type | Notes | C++ | Python | Java | Go |
| --- | --- | --- | --- | --- | --- |
| double | | double | float | double | float64 |
| float | | float | float | float | float32 |
| int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 |
| int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64 |
| uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32 |
| uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64 |
| sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 |
| sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64 |
| fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 |
| fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64 |
| sfixed32 | Always four bytes. | int32 | int | int | int32 |
| sfixed64 | Always eight bytes. | int64 | int/long | long | int64 |
| bool | | bool | boolean | boolean | bool |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string |
| bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte |
Generated May 21, 2024