NCBI News: Spring 2004|BLASTLab

Using BLASTClust to Make Non-redundant Sequence Sets

BLASTClust is a program within the standalone BLAST package used to cluster either protein or nucleotide sequences. The program begins with pairwise matches and places a sequence in a cluster if the sequence matches at least one sequence already in the cluster. For proteins, the blastp algorithm is used to compute the pairwise matches; for nucleotide sequences, the Megablast algorithm is used.

In the simplest case, BLASTClust takes as input a file containing concatenated FASTA-format sequences, each with a unique identifier at the start of the definition line. BLASTClust formats the input sequences to produce a temporary BLAST database, performs the clustering, and removes the database at completion. Hence, there is no need to run formatdb before using BLASTClust. The output of BLASTClust is a file listing one cluster per line, each as a set of sequence identifiers separated by spaces. The clusters are sorted from the largest cluster to the smallest.
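The clustering rule described above amounts to single-linkage clustering: any chain of pairwise matches joins sequences into one cluster, even when the two ends of the chain do not match each other directly. The following Python sketch illustrates that rule with a union-find structure and prints the clusters in the layout described above (one cluster per line, identifiers separated by spaces, largest cluster first). It is an illustration only, not BLASTClust's actual implementation: the function name `single_linkage_clusters` and the example identifiers are hypothetical, and the pairwise matches are supplied by hand rather than computed with blastp or Megablast.

```python
from collections import defaultdict


def single_linkage_clusters(ids, matches):
    """Group ids so that any chain of pairwise matches ends up in one cluster."""
    parent = {seq_id: seq_id for seq_id in ids}

    def find(x):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of every matching pair.
    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for seq_id in ids:  # input order is kept inside each cluster
        groups[find(seq_id)].append(seq_id)

    # Largest cluster first; the sort is stable, so ties keep input order.
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical identifiers and hand-written matches, for illustration only.
ids = ["seq1", "seq2", "seq3", "seq4", "seq5"]
matches = [("seq1", "seq2"), ("seq2", "seq3")]

clusters = single_linkage_clusters(ids, matches)
for cluster in clusters:
    print(" ".join(cluster))
```

In this example seq1 and seq3 share a cluster only because both match seq2, which is the behavior to keep in mind when choosing clustering thresholds.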
BLASTClust accepts a number of parameters that control the stringency of clustering, including thresholds for score density, percent identity, and alignment length. The program has a number of applications, the simplest of which is to create a non-redundant set of sequences from a source database. As an example, one might have a library of a few thousand short nucleotide sequence reads and wish to replace these with a non-redundant set. To produce the non-redundant set, one might use:

    blastclust -i infile -o outfile -p F -L .9 -b T -S 95

The sequences in "infile" will be clustered and the results will be written to "outfile". The input sequences are identified as nucleotide (-p F); "-p T", or protein, is the default. To register a pairwise match, two sequences must be 95% identical (-S 95) over a region covering 90% of the length (-L .9) of each sequence (-b T). Using "-b F" instead of "-b T" would enforce the alignment length threshold on only one member of a sequence pair.

The parameter "S", used here to specify the percent identity, can instead be used to specify a "score density", which is the BLAST score divided by the alignment length. If "S" is given as a number between 0 and 3, it is interpreted as a score density threshold; otherwise it is interpreted as a percent identity threshold.

To create a stringent non-redundant protein sequence set, use the following command line:

    blastclust -i infile -o outfile -p T -L 1 -b T -S 100

In this case, only sequences that are identical will be clustered together. The "blastclust.txt" file in the standalone BLAST package details the full range of BLASTClust parameters.

—DW
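A BLASTClust output file lists clusters of identifiers rather than sequences, so producing the non-redundant set still requires choosing one member per cluster. A minimal Python sketch of that step follows. The function names `read_clusters` and `pick_representatives` are hypothetical, and keeping the first identifier on each line is an arbitrary choice made here for illustration, not something BLASTClust prescribes. The sample text is made up and stands in for the contents of "outfile".

```python
import io


def read_clusters(handle):
    """Parse BLASTClust-style output: one cluster per line, ids separated by spaces."""
    clusters = []
    for line in handle:
        ids = line.split()
        if ids:  # ignore blank lines
            clusters.append(ids)
    return clusters


def pick_representatives(clusters):
    """Keep one identifier per cluster (assumption: simply the first one listed)."""
    return [cluster[0] for cluster in clusters]


# Made-up stand-in for an output file; with a real run, pass open("outfile") instead.
sample = io.StringIO("read7 read2 read9\nread4 read5\nread1\n")

clusters = read_clusters(sample)
representatives = pick_representatives(clusters)
print(representatives)
```

The sequences named in `representatives` can then be extracted from the original FASTA file to form the non-redundant set.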