NCBI News





350 KB Sequence Length Limit Removed by Sequence Database Collaboration

In 1995, the International Nucleotide Sequence Database Collaborators (GenBank, DDBJ, and EMBL) agreed to a 350 KB limit on the size of database sequence records in order to maintain compatibility with existing molecular biology software that was not able to work with large sequences.

At that time, a new GenBank division called "CON" (for contig) was created. The records in the CON division contain the instructions for assembling full-length contigs from the sequence data of multiple GenBank records. Although CON division records contain no sequence data, the assembly information they provide makes it possible for NCBI's Entrez search and retrieval system to show complete genomic sequences by dynamically assembling the data for display. NCBI also regularly uses the information in CON division records to create FTP files for download that contain megabase-scale genomic sequences as single FASTA files.
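
The assembly step described above can be illustrated with a minimal Python sketch. It assumes the instructions take the form of a join(...) statement listing accession.version:start..end spans, optional complement(...) wrappers, and gap(n) elements, as in the CONTIG line of GenBank flat files. The function name, the accessions, and the part sequences are hypothetical, and this is not a description of how Entrez performs the assembly.

```python
import re

# Translation table used to reverse-complement a DNA segment.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
# Matches "ACC.V:start..end", optionally wrapped in "complement(...)".
_SPAN = re.compile(r"^(complement\()?([A-Za-z0-9_]+\.\d+):(\d+)\.\.(\d+)\)?$")
# Matches a gap of known length, e.g. "gap(100)".
_GAP = re.compile(r"^gap\((\d+)\)$")


def assemble_contig(join_statement, parts):
    """Build one sequence from a join(...) statement and a dict of part sequences.

    `parts` maps accession.version strings to sequences (hypothetical data here;
    real parts would come from the individual GenBank records).
    """
    text = "".join(join_statement.split())  # remove whitespace and line wraps
    if not (text.startswith("join(") and text.endswith(")")):
        raise ValueError("expected a join(...) statement")
    pieces = []
    for item in text[len("join("):-1].split(","):
        gap = _GAP.match(item)
        if gap:
            pieces.append("N" * int(gap.group(1)))  # represent gaps as runs of N
            continue
        span = _SPAN.match(item)
        if span is None:
            raise ValueError(f"unsupported element: {item!r}")
        is_complement, accession, start, end = span.groups()
        # Coordinates are 1-based and inclusive.
        segment = parts[accession][int(start) - 1:int(end)]
        if is_complement:
            segment = segment.translate(_COMPLEMENT)[::-1]
        pieces.append(segment)
    return "".join(pieces)


# Hypothetical two-part example.
example_parts = {"X00001.1": "ACGTACGTAC", "X00002.1": "TTTTGGGGCC"}
example_join = "join(X00001.1:1..4, gap(3), complement(X00002.1:5..8))"
print(assemble_contig(example_join, example_parts))
```

The sketch handles only the three element types named above and raises an error on anything else, so unsupported instructions are reported rather than silently skipped.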

By 1998, GenBank, DDBJ, and EMBL were routinely accepting submissions of draft sequences from large-scale sequencing projects, such as phase 1 and phase 2 high-throughput genomic sequences (HTGS), that were longer than 350 KB. To avoid breaking a huge amount of draft sequence into 350 KB chunks, the database collaborators agreed to relax the 350 KB limit in these cases. The 350 KB limit was also relaxed for assemblies of Whole Genome Shotgun (WGS) project data and for large eukaryotic genes.

Removal of the 350 KB Limit

In 2003, the Database Collaborators agreed to remove the 350 KB limit for all sequences as of June 2004, since the increased ability of molecular biology software to analyze long sequences quickly has rendered the limit on sequence length unnecessary. To help software developers prepare for the change, some sample records with large sequences have been made available for testing.

An example of the effect of removing the 350 KB limit on GenBank records is accession U00096, the Escherichia coli K-12 MG1655 complete genome sequence. Under the 350 KB limit, this accession number referred to a contig record listing the short sequences that can be assembled to create the complete genome. With the limit removed, the accession now refers to the complete contiguous sequence for Escherichia coli K-12. The accessions for all 400 parts will appear as secondary accessions. The CON division will remain as a GenBank division to represent sequences that are by their nature assemblies, for example genome scaffold records.

The effect of the changes on the NCBI GenBank FTP files and the BLAST database files available for download is expected to be minimal. As sequences become secondary to primary records, the overall size of the databases should not change drastically. However, the number of megabase-sized records will increase, so NCBI recommends that software be tested with the example large sequence records mentioned above.
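
As one way to approach the recommended testing, the following Python sketch scans FASTA-formatted input line by line and reports records longer than the old limit, without loading any whole sequence into memory. The function names are hypothetical, and treating 350 KB as 350,000 bases is an assumption made for illustration.

```python
# Assumption: the old 350 KB limit is taken to mean 350,000 bases.
OLD_LIMIT = 350_000


def fasta_lengths(lines):
    """Yield (header, length) for each FASTA record in an iterable of lines.

    Only a running count is kept per record, so megabase-sized sequences
    do not need to fit in memory.
    """
    header, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, length
            header, length = line[1:], 0
        elif header is not None:
            length += len(line)
    if header is not None:
        yield header, length


def records_over_old_limit(lines, limit=OLD_LIMIT):
    """Return the (header, length) pairs whose sequence exceeds `limit` bases."""
    return [(h, n) for h, n in fasta_lengths(lines) if n > limit]


# Small in-memory example with made-up records.
sample = [">small", "ACGT", "ACGT", ">big", "A" * 200_000, "C" * 200_000]
print(records_over_old_limit(sample))
```

Because the functions accept any iterable of lines, an open file handle can be passed in directly, which makes it straightforward to run the same check against a downloaded FASTA file.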


—SM

