BLAST 2.0 RELEASE NOTES (revised December 18, 1998)

* Introduction
* Blast Family of Programs
* Gaps in Blast
* Blast Query Format
* Blast Report
* Blast Statistics and Scores
* Stand-Alone Blast
* Compiling Blast
* Database Format
* PSI-Blast
* PHI-Blast
* References
* Release History

Introduction

BLAST is a service of the National Center for Biotechnology Information (NCBI). A nucleotide or protein sequence sent to the BLAST server is compared against databases at the NCBI and a summary of matches is returned to the user.

The www BLAST server can be accessed through the home page of the NCBI at www.ncbi.nlm.nih.gov. Stand-alone BLAST binaries can be obtained from the NCBI FTP site. See the Stand-Alone Blast section for details.

The BLAST 2.0 release has significant differences from the BLAST 1.4 release. These include significant performance enhancements, the addition of 'gapping' routines, position-specific-iterated BLAST (see the PSI-Blast section) as well as extensive changes to the text report (see below), and the format of the databases (see the Stand-Alone Blast section). The options available and their command-line appearance have also changed substantially.

The BLAST 2.0 programs are described in a Nucleic Acids Research article. Please cite this reference if you publish the results of your BLAST query.

Blast Family of Programs

The BLAST family of programs allows all combinations of DNA or protein query sequences with searches against DNA or protein databases:

blastp
    compares an amino acid query sequence against a protein sequence database.

blastn
    compares a nucleotide query sequence against a nucleotide sequence database.

blastx
    compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.

tblastn
    compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).
tblastx
    compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

The default matrix for all protein-protein comparisons is BLOSUM62.

Gaps in Blast

Version 2.0 of BLAST allows the introduction of gaps (deletions and insertions) into alignments. With a gapped alignment tool, homologous domains do not have to be broken into several segments. Also, the scoring of gapped results tends to be more biologically meaningful than that of ungapped results.

The programs blastn and blastp offer fully gapped alignments. blastx and tblastn have 'in-frame' gapped alignments and use sum statistics to link alignments from different frames. tblastx provides only ungapped alignments.

Blast Query Format

The sequence sent to the BLAST server should be in FASTA format, described in http://www.ncbi.nlm.nih.gov/BLAST/fasta.html. A number of databases are also available. They are described in http://www.ncbi.nlm.nih.gov/BLAST/blast_databases.html.

Blast Report

The BLAST report consists of a number of sections. The descriptions below are for a blastp comparison, but the format for the other programs is analogous. The BLAST report is not intended to be a parseable document. It is subject to change with little or no notice.

The BLAST report starts with some header information that lists the type of program (here blastp), the version (here 2.0.1), and a release date. Also listed are a reference to the BLAST program, the query definition line, and a summary of the database used.

BLASTP 2.0.1 [Aug-20-1997]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
Query= gi|129295|sp|P01013|OVAX_CHICK gene X protein - chicken (fragment)
         (232 letters)

Database: Non-redundant SwissProt sequences
           59,576 sequences; 21,219,450 total letters

One-line descriptions of the database matches found are presented next. These include a database sequence identifier, the corresponding definition line, as well as the score (in bits) and the statistical significance ('E value') for this match (please see the section on statistics for an explanation of bits and significance). Consider the output below, from a gapped blastp comparison of SwissProt accession P01013 against the SwissProt database.

                                                                    High    E
Sequences producing significant alignments:                        Score  Value

sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)              442  e-124
sp|P01014|OVAY_CHICK GENE Y PROTEIN (OVALBUMIN-RELATED)              353  9e-98
sp|P01012|OVAL_CHICK OVALBUMIN (PLAKALBUMIN) (ALLERGEN GAL D II)     278  5e-75
sp|P19104|OVAL_COTJA OVALBUMIN                                       268  5e-72
sp|P48595|BOMA_HUMAN BOMAPIN (PROTEASE INHIBITOR 10)                 199  2e-51
sp|P29508|SCC1_HUMAN SQUAMOUS CELL CARCINOMA ANTIGEN 1 (SCCA-1) ...  198  5e-51
sp|P80229|ILEU_PIG LEUKOCYTE ELASTASE INHIBITOR (LEI) (LEUCOCYTE...  197  1e-50
sp|P48594|SCC2_HUMAN SQUAMOUS CELL CARCINOMA ANTIGEN 2 (SCCA-2) ...  196  2e-50
sp|P50453|PTI9_HUMAN CYTOPLASMIC ANTIPROTEINASE 3 (CAP3) (PROTEA...  195  6e-50
sp|P05619|ILEU_HORSE LEUKOCYTE ELASTASE INHIBITOR (LEI)              193  2e-49

The first match, in this case, is the actual query sequence. The identifiers shown here are all from SwissProt, so they all have 'sp' in the first field, followed by the accession, and then a Locus name. The syntax of these identifiers is discussed in more detail in the appendices of ftp://ncbi.nlm.nih.gov/blast/db/README

The definition lines are taken from the definition line in the database, with the ellipsis (e.g., P29508) indicating that the definition line was too long for the space available.
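As a small illustration of the identifier layout described above, the following Python sketch splits a SwissProt-style identifier into its three fields. The function name and the dictionary keys are hypothetical and chosen only for this example; the authoritative syntax is the one given in the README cited above, and the report as a whole is still not meant to be parsed.

```python
def split_swissprot_id(identifier):
    """Split an identifier such as 'sp|P01013|OVAX_CHICK' into its fields.

    Illustrative helper only (not part of BLAST): it assumes exactly
    three '|'-separated fields with 'sp' in the first one.
    """
    fields = identifier.split("|")
    if len(fields) != 3 or fields[0] != "sp":
        raise ValueError("not a three-field 'sp' identifier: %r" % identifier)
    return {"database": fields[0], "accession": fields[1], "locus": fields[2]}


print(split_swissprot_id("sp|P01013|OVAX_CHICK"))
```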
Ungapped alignments and results from blastx and tblastn will have an additional column ('N'), displaying the number of different segment pairs used to produce the alignment, according to the Karlin-Altschul statistics.

Each alignment is preceded by the sequence identifier, the full definition line and the length of the database sequence. Next come the score (in bits as well as the raw score) and the statistical significance of the match, followed by the number of identities and positive matches according to the scoring system (e.g., BLOSUM62) and, if applicable, the number of gaps in the alignment. Finally the actual alignment is shown, with the query on top and the database match labeled as 'Sbjct'. Between the two sequences, the residue is shown if it is conserved and a '+' is shown if there is a positive match. One or more dashes, '-', indicate insertions or deletions. The example below is the third sequence listed in the one-line descriptions above.

>sp|P01012|OVAL_CHICK OVALBUMIN (PLAKALBUMIN) (ALLERGEN GAL D II)
          Length = 386

 Score = 278 bits (744), Expect = 5e-75
 Identities = 149/231 (64%), Positives = 182/231 (78%), Gaps = 2/231 (0%)

Query   2 IKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNS 61
          I+++L  SS D  T +VLVNAI FKG+W+ AF  EDT+ MPF VT+QESKPVQMM
Sbjct 158 IRNVLQPSSVDSQTAMVLVNAIVFKGLWEKAFKDEDTQAMPFRVTEQESKPVQMMYQIGL 217

Query  62 FNVATLPAEKMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKR 121
          F VA++ +EKMKILELPFASG +SMLVLLPDEVS LE++E  INFEKLTEWT+ N ME+R
Sbjct 218 FRVASMASEKMKILELPFASGTMSMLVLLPDEVSGLEQLESIINFEKLTEWTSSNVMEER 277

Query 122 RVKVYLPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSE 181
          ++KVYLP+MK+EEKYNLTSVLMA+G+TD+F  SANL+GISSAESLKISQAVH A  E++E
Sbjct 278 KIKVYLPRMKMEEKYNLTSVLMAMGITDVFSSSANLSGISSAESLKISQAVHAAHAEINE 337

Query 182 DGIEMAGSTGVIEDIKHSPESEQFRADHPFLFLIKHNPTNTIVYFGRYWSP 232
           G E+ GS      +  +  SE+FRADHPFLF IKH  TN +++FGR  SP
Sbjct 338 AGREVVGSAEA--GVDAASVSEEFRADHPFLFCIKHIATNAVLFFGRCVSP 386

The last section lists specifics about the
database searched as well as statistical and search parameters used:

Database: Non-redundant SwissProt sequences
  Posted date: Aug 14, 1997 9:52 AM
  Number of letters in database: 21,219,450
  Number of sequences in database: 59,576

Lambda     K        H
   0.317    0.132    0.377

Gapped
Lambda     K        H
   0.255   0.0350    0.190

Matrix: BLOSUM62
Gap Penalties: Existence: 10, Extension: 1
Number of Hits to DB: 8938654
Number of Sequences: 59576
Number of extensions: 335248
Number of successful extensions: 1188
Number of sequences better than 10: 116
Number of HSP's better than 10.0 without gapping: 106
Number of HSP's successfully gapped in prelim test: 10
Number of HSP's that attempted gapping in prelim test: 868
Number of HSP's gapped (non-prelim): 120
length of query: 232
length of database: 21219450
effective HSP length: 52
effective length of query: 180
effective length of database: 18121498
effective search space: -1033097656
T: 11
A: 40
X1: 16 ( 7.3 bits)
X2: 40 (14.7 bits)
X3: 67 (24.6 bits)
S1: 41 (21.7 bits)
S2: 64 (28.4 bits)

Blast Statistics and Scores

One may judge the results of a BLAST search by two numbers. One is the 'bit' score, which is defined as:

    S' (bits) = [lambda * S (raw) - ln K] / ln 2

where lambda and K are Karlin-Altschul parameters. The expression of the score in terms of bits makes it independent of the scoring system used (i.e., which matrix).

The other is the Expect value, which estimates the statistical significance of the match by specifying the number of matches with a given score that are expected purely by chance in a search of a database of this size. An Expect value of two, for a given score, indicates that two matches with this score are expected purely by chance. The Expect value changes with the size of the database (in a larger database more chance matches with a given score are expected) and is the most intuitive way to rank results or compare the results of one query run against two different databases.
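The bit-score formula above can be checked against the sample report with a few lines of Python. This is only a sketch under stated assumptions: it uses the gapped Lambda and K from the report (because the alignment shown is gapped), and it takes the Expect value to be the effective search space multiplied by 2^(-S'), which is the usual Karlin-Altschul form but is not spelled out in these notes. The function names are illustrative.

```python
import math


def bit_score(raw_score, lam, k):
    """Bit score S' = (lambda * S - ln K) / ln 2, as defined above."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, search_space):
    """Assumed form E = search_space * 2**(-S'); not stated in these notes."""
    return search_space * 2.0 ** (-bits)


# Raw score 744 and the gapped Lambda/K come from the sample report.
bits = bit_score(744, 0.255, 0.0350)

# Effective query and database lengths from the sample report.
space = 180 * 18121498

print("%.1f bits" % bits)
print("E = %.0e" % expect_value(bits, space))

# Observation (an inference, not a statement from the notes): the product
# above is larger than a signed 32-bit integer can hold, and wrapping it
# reproduces the negative "effective search space" in the sample report.
print(space, space - 2 ** 32)
```

Under these assumptions the result comes out near 278.5 bits with an Expect value of roughly 5e-75, in line with the "278 bits (744), Expect = 5e-75" shown for the ovalbumin alignment.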
Stand-Alone Blast

This section is only applicable if a user wishes to run stand-alone BLAST at their own institution. One reason to do so might be the wish to use private databases not available at the NCBI. Users of www or network BLAST do not need to read these sections.

BLAST binaries are provided for IRIX6.2, Solaris2.6, DEC OSF1 (ver. 4), LINUX, and Win32 systems. We will attempt to produce binaries for other platforms upon request. Stand-alone binaries are available from ftp://ncbi.nlm.nih.gov/blast/executables. The source code for BLAST 2.0 is part of the NCBI toolkit. See Compiling Blast for help in compiling BLAST. Please remember to FTP in binary mode.

Formatdb

Formatdb should be used to format FASTA databases, both protein and DNA, for BLAST 2.0. This must be done before blastall or blastpgp can be run locally. The format of the databases has been changed substantially from the BLAST 1.4 release. A major improvement in this format over the old one is that ambiguity information for DNA sequences is now retrieved from the files produced by formatdb, rather than from the original FASTA file. The original FASTA file is no longer needed for the BLAST runs.

Formatdb may be obtained with the other BLAST binaries from the executables directory (see above).

The input for formatdb may be either ASN.1 or FASTA. Use of ASN.1 is advantageous for those sites that might also wish to format the ASN.1 in different ways, such as a GenBank report. Usage of formatdb may be obtained by executing 'formatdb -' (note the dash):

formatdb arguments:

    -t  Title for database file [String]  Optional
    -i  Input file for formatting (this parameter must be set) [File In]
    -l  Logfile name: [File Out]  Optional
        default = formatdb.log
    -p  Type of file
            T - protein
            F - nucleotide
        [T/F]  Optional
        default = T
    -o  Parse options
            T - True: Parse SeqId and create indexes.
            F - False: Do not parse SeqId. Do not create indexes.
        [T/F]  Optional
        default = F
    -a  Input file is database in ASN.1 format (otherwise FASTA is expected)
            T - True, F - False.
        [T/F]  Optional
        default = F
    -b  ASN.1 database in binary mode
            T - binary, F - text mode.
        [T/F]  Optional
        default = F
    -e  Input is a Seq-entry [T/F]  Optional
        default = F

The "-p" option has two different meanings depending on whether the input database is in FASTA or ASN.1 format. In the case of FASTA, "-p" specifies the type of the input database. In the case of ASN.1, the option specifies the type of sequence to be indexed for BLAST.

If the "-o" option is TRUE (and the input database is in FASTA format), then the database identifiers in the FASTA definition line must follow the convention described in the appendices of ftp://ncbi.nlm.nih.gov/blast/db/README

It is always advantageous to use the '-o' option if the database identifiers are in the format specified above. If the database identifiers are in the parseable format, formatdb produces additional indices allowing retrieval from the databases by identifier. The databases on the NCBI FTP site contain parseable identifiers. It is sufficient if the first word on the FASTA definition line is a unique identifier (e.g., ">3091 Alcoho de..."). It is necessary to use parseable identifiers in the following cases:

1.) ASN.1 is to be produced from blastall or blastpgp; then "-o" must be TRUE.

2.) Master-slave alignments are desired (i.e., the '-m' option with a non-zero value is used).

3.) The gi's are desired as part of the output (i.e., '-I' is used).

4.) fastacmd is used to fetch sequences from the database by accession or gi.

An input ASN.1 database may be represented in two formats: ASCII text and binary. The "-b" option, if TRUE, specifies that the input ASN.1 database is in binary format. The option is ignored in the case of a FASTA input database.

An input ASN.1 database (either ASCII text or binary) may contain a Bioseq-set or just one Bioseq. In the latter case the "-e" switch should be set to TRUE.
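For sites that script their database builds, the options above can be assembled programmatically. The Python sketch below only builds and prints a formatdb command line from flags listed in this section (-i, -p, -o, -t); the helper name, the file name mydb.fasta and the title are hypothetical, and nothing is executed.

```python
import shlex


def formatdb_command(fasta_path, protein=True, parse_ids=True, title=None):
    """Build a formatdb argument list from flags documented above.

    Illustrative helper only; it is not part of the BLAST distribution.
    """
    args = ["formatdb",
            "-i", fasta_path,                 # input file (must be set)
            "-p", "T" if protein else "F",    # T = protein, F = nucleotide
            "-o", "T" if parse_ids else "F"]  # T = parse SeqId, build indexes
    if title is not None:
        args += ["-t", title]                 # optional database title
    return args


# Hypothetical nucleotide FASTA file with parseable identifiers.
cmd = formatdb_command("mydb.fasta", protein=False, title="My nucleotide set")
print(shlex.join(cmd))
```

To actually run the command, the list could be passed to subprocess.run(cmd, check=True), assuming the formatdb binary is on the PATH. Passing a list rather than a single string avoids shell quoting problems with titles that contain spaces.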
Blastall

Blastall may be used to perform all five flavors of BLAST comparison. One may obtain the blastall options by executing 'blastall -' (note the dash). A typical blastall command to perform a blastn search (nucl. vs. nucl.) of a file called QUERY would be:

    blastall -p blastn -d nr -i QUERY -o out.QUERY

The output is placed into the output file out.QUERY and the search is performed against the 'nr' database. If a protein vs. protein search is desired, then 'blastn' should be replaced with 'blastp', etc.

Some of the most commonly used blastall options are:

blastall arguments:

    -p  Program Name [String]
        Input should be one of "blastp", "blastn", "blastx", "tblastn", or "tblastx".
    -d  Database [String]
        default = nr
        Version 2.0.4 and higher will accept multiple database names (enclosed in quotation marks). An example would be
            -d "nr est"
        which will search both the nr and est databases, presenting the results as if one 'virtual' database consisting of all the entries from both were searched. The statistics are based on the 'virtual' database.
    -i  Query File [File In]
        default = stdin
        The query should be in FASTA format. If multiple FASTA entries are in the input file, all queries will be searched.
    -e  Expectation value (E) [Real]
        default = 10.0
    -o  BLAST report Output File [File Out]  Optional
        default = stdout
    -F  Filter query sequence (DUST with blastn, SEG with others) [T/F]
        default = T
        See the "Low-complexity Filters" section below for details.

Blastpgp

Blastpgp performs gapped blastp searches and can be used to perform iterative searches in psi-blast and phi-blast mode. See the PSI-Blast and PHI-BLAST sections for a description of this binary. The options may be obtained by executing 'blastpgp -'.

Fastacmd

Fastacmd retrieves FASTA formatted sequences from a BLAST database, if it was formatted using the '-o' option.
An example fastacmd call would be:

    fastacmd -d nr -s p38398

The fastacmd options are:

fastacmd arguments:

    -d  Database [String]
        default = nr
    -s  Search string: GIs, accessions and locuses may be used (delimited by comma or space) [String]  Optional
    -i  Input file with GIs/accessions/locuses for batch retrieval [String]  Optional
    -a  Retrieve duplicated accessions [T/F]  Optional
        default = F
    -l  Line length for sequence [Integer]  Optional
        default = 80

Software requirements

Blast 2.0 uses threads to perform multi-processing searches. OS requirements on SGI's are IRIX 6 (with relevant threads patches, see below), any Solaris version, or a version of DEC UNIX. IRIX 5 may be used if multi-processing is not enabled.

SGI recommends the following threads patches on IRIX6 systems:

    For 6.2 systems, install SG0001404, SG0001645, SG0002000, SG0002420 and SG0002458 (in that order)
    For 6.3 systems, install SG0001645, SG0002420 and SG0002458 (in that order)
    For 6.4 systems, install SG0002194, SG0002420 and SG0002458 (in that order)

These patches can be obtained by calling SGI customer service or from the web: http://support.sgi.com/

System recommendations

BLAST uses memory-mapped files (on UNIX and NT systems), so it runs best if it can read the entire BLAST database into memory, then keep on using it there. Resources consumed reading a database into memory can easily outweigh the cost of a BLAST search, so that the memory of a machine is normally more important than the CPU speed. This means that one should have sufficient memory for the largest BLAST database one will use, then run all the searches against this database in serial, then run queries against another database in serial. This guarantees that the database will be read into memory only once. As of Aug. 1997 the EST FASTA file is about 500 Meg, which translates to about 170-200 Meg of BLAST database. At least another 100-200 Meg should be allowed for memory consumed by the actual BLAST program.
All of the FASTA databases together are about 1.5 Gig; the BLAST databases produced from them will probably be about another Gig or so. 4 Gig of disk space, to make room for software and output, is probably a pretty good bet.

Setup

BLAST needs to know where the NCBI data directory and BLAST databases are. This is specified by the main configuration file for the NCBI toolkit (".ncbirc" on UNIX systems, ncbi.ini on Windows, analogous names on other platforms). If BLAST is the ONLY NCBI application that will be used, it is sufficient to have the following simple configuration file:

    [NCBI]
    Data=/am/ncbiapdata/data

    [BLAST]
    BLASTDB=/usr/ncbi/db/disk.blast/blast2

BLAST looks for resource files in the "Data" directory (e.g., "/am/ncbiapdata/data/"). A directory different from "/am/ncbiapdata/data" can be used if this is desired. The resource files can be found in the data directory of the toolbox (i.e., ncbi/data).

The .ncbirc should be either in the directory from which BLAST is called, the user's home directory, or in the directory set by the environment variable "NCBI". Alternatively, an environment variable may be set under UNIX. If BLAST is run from the same directory as the database files, the BLASTDB line is unnecessary.

Database and matrix directories

On UNIX systems environment variables can be set (with setenv) to specify the directory of the database (BLASTDB) and matrices (BLASTMAT).

Low-complexity Filters

BLAST 2.0 uses the dust low-complexity filter for blastn and seg for the other programs. Both 'dust' and 'seg' are integral parts of the NCBI toolkit and are accessed automatically.

Access to filtering options. If one uses "-F T" then normal filtering by seg or dust (for blastn) occurs (likewise "-F F" means no filtering whatsoever). The seg options can be changed by using:

    -F "S 10 1.0 1.5"

which specifies a window of 10, locut of 1.0 and hicut of 1.5.

A coiled-coil filter, based on the work of Lupas et al. (Science, vol 252, pp.
1162-4 (1991)) and written by John Kuzio (Wilson et al., J Gen Virol, vol. 76, pp. 2923-32 (1995)), may be invoked by specifying:

    -F "C"

There are three parameters for this: window, cutoff (probability of a coiled coil), and linker (distance between two coiled-coil regions that should be linked together). These are now set to

    window: 22
    cutoff: 40.0
    linker: 32

One may also change the coiled-coil parameters in a manner analogous to that of seg:

    -F "C 28 40.0 32"

will change the window to 28. One may also run both seg and the coiled-coil filter together by using a ";":

    -F "C;S"

BLAST databases

The FASTA files used by the NCBI to produce BLAST databases are available on the NCBI FTP site in ftp://ncbi.nlm.nih.gov/blast/db/. Please see the README for details.

Compiling Blast

BLAST is part of the NCBI toolkit and it is necessary to compile the toolkit to compile BLAST. For DOS or UNIX it is recommended to read either readme.dos or readme.unx. These documents may be found in the 'make' directory of the NCBI toolkit. They describe how to use scripts to easily perform the compile.

NCBI toolkit archives may be obtained from ftp://ncbi.nlm.nih.gov/toolbox/ncbi_tools.

Please send questions to toolbox@ncbi.nlm.nih.gov. When emailing toolbox, please include information on your operating system (use 'uname -a' under UNIX), what steps you have taken, and any error messages or output from your make that you may have.

Database Format

The format of the BLAST databases has changed for the 2.0 release and is not compatible with the databases used in the 1.4 release. The change was made to eliminate an unpleasant feature of the 1.4 databases: ambiguity information for nucleotide sequences was not stored in the compressed file, but rather the original FASTA file had to be accessed for this information. This led to significant slow-downs in BLAST comparisons for databases, such as dbest, that contain a large number of ambiguity characters.
PSI-Blast

The blastpgp program can do an iterative search in which sequences found in one round of searching are used to build a score model for the next round of searching. In this usage, the program is called Position-Specific Iterated BLAST, or PSI-BLAST. As explained in the accompanying paper, the BLAST algorithm is not tied to a specific score matrix. Traditionally, it has been implemented using an AxA substitution matrix where A is the alphabet size. PSI-BLAST instead uses a QxA matrix, where Q is the length of the query sequence; at each position the cost of a letter depends on the position w.r.t. the query and the letter in the subject sequence.

The position-specific matrix for round i+1 is built from a constrained multiple alignment among the query and the sequences found with sufficiently low e-value in round i. The top part of the output for each round distinguishes the sequences into: sequences found previously and used in the score model, and sequences not used in the score model. The output currently includes lots of diagnostics requested by users at NCBI. To skip quickly from the output of one round to the next, search for the string "producing", which is part of the header for each round and likely does not appear elsewhere in the output. PSI-BLAST "converges" and stops if all sequences found at round i+1 below the e-value threshold were already in the model at the beginning of the round.

There are several blastpgp parameters specifically for PSI-BLAST:

    -j  is the maximum number of rounds (default 1; i.e., regular BLAST)
    -e  is the e-value threshold for including sequences in the score matrix model (default 0.01)
    -c  is the "constant" used in the pseudocount formula specified in the paper (default 10)

The -C and -R flags provide a "checkpointing" facility whereby a score model can be stored and later reused.

    -C  stores the query and frequency count ratio matrix in a file
    -R  restarts from a file stored previously.
When using -R, it is required that the query specified on the command line match exactly the query in the restart file.

The checkpoint files are stored in a byte-encoded (not human readable) format, so as to prevent roundoff error between writing and reading the checkpoint.

Users who also develop their own sequence analysis software may wish to develop their own scoring systems. For this purpose the code in posit.c that writes out the checkpoint can be easily adapted to write out scoring systems derived by other algorithms in such a way that PSI-BLAST can read the files in later. The checkpoint structure is general in the sense that it can handle any position-specific matrix that fits in the Karlin-Altschul statistical framework for BLAST scoring.

PHI-Blast

PHI-BLAST (Pattern-Hit Initiated BLAST) is a search program that combines matching of regular expressions with local alignments surrounding the match. The most important features of the program have been incorporated into the BLAST software framework partly for user convenience and partly so that PHI-BLAST may be combined seamlessly with PSI-BLAST. Other features that do not fit into the BLAST framework will be released later as a separate program and/or separate Web page query options.

One very restrictive way to identify protein motifs is by regular expressions that must contain each instance of the motif. The PROSITE database is a compilation of restricted regular expressions that describe protein motifs. Given a protein sequence S and a regular expression pattern P occurring in S, PHI-BLAST helps answer the question: What other protein sequences both contain an occurrence of P and are homologous to S in the vicinity of the pattern occurrences? PHI-BLAST may be preferable to just searching for pattern occurrences because it filters out those cases where the pattern occurrence is probably random and not indicative of homology.
PHI-BLAST may be preferable to other flavors of BLAST because it is faster and because it allows the user to express a rigid pattern occurrence requirement.

The pattern search methods in PHI-BLAST are based on the algorithms in:

    R. Baeza-Yates and G. Gonnet, Communications of the ACM 35(1992), pp. 74-82.
    S. Wu and U. Manber, Communications of the ACM 35(1992), pp. 83-91.

The calculation of local alignments is done using a method very similar to (and much of the same code as) gapped BLAST. However, the method of evaluating statistical significance is different, and is described below.

In the stand-alone mode the typical PHI-BLAST usage looks like:

    blastpgp -i -k -p patseedp

where -i is followed by the file containing the query in FASTA format, where -k is followed by the file containing the pattern in a syntax given below, and where "patseedp" indicates the mode of usage, not representing any file. The syntax for the query sequence is FASTA format as for all other BLAST queries. The syntax for patterns follows the rules of PROSITE and is documented in detail below. The specified pattern is not required to be in the PROSITE list.

Most of the other BLAST flags can be used with PHI-BLAST. One important exception is that PHI-BLAST requires gapped alignments (i.e., forbids -g F in the flags) because ungapped alignments do not make sense for almost all patterns in PROSITE.

There is a second mode of PHI-BLAST usage that is important when the specified pattern occurs more than 1 time in the query. In this case, the user may be interested in restricting the search for local alignments to a subset of the pattern occurrences. This can be done with a search that looks like:

    blastpgp -i -k -p seedp

in which case the use of the "seedp" option requires the user to specify the location(s) of the interesting pattern occurrence(s) in the pattern file. The syntax for how to specify pattern occurrences is below.
When there are multiple pattern occurrences in the query it may be important to decide how many are of interest because the E-value for matches is effectively multiplied by the number of interesting pattern occurrences.

The PHI-BLAST Web page supports only the "patseedp" option.

PHI-BLAST is integrated with PSI-BLAST. In the command-line mode, PSI-BLAST can be invoked by using the -j option, as usual. When this is done as:

    blastpgp -i -k -p patseedp -j

then the first round of searching uses PHI-BLAST and all subsequent rounds use PSI-BLAST. In the Web page setting, the user must explicitly invoke one round at a time, and the PHI-BLAST Web page provides the option to initiate a PSI-BLAST round with the PHI-BLAST results. To describe a combined usage, use the term "PHI-PSI-BLAST" (Pattern-Hit Initiated, Position-Specific Iterated BLAST).

Determining statistical significance.

When a query sequence Q matches a database sequence D in PHI-BLAST, it is useful to subdivide Q and D into 3 disjoint pieces

    Qleft  Qpattern  Qright
    Dleft  Dpattern  Dright

The substrings Qpattern and Dpattern contain the pattern specified in the pattern file. The pieces Qpattern and Dpattern are aligned and that alignment is displayed as part of the PHI-BLAST output, but the score for that alignment is mostly ignored. The "reduced" score r of an alignment is the sum of the scores obtained by aligning Qleft with Dleft and by aligning Qright with Dright.

The expected number of alignments with a reduced score >= x is given by:

    C * N * (Lambda*x + 1) * e^(-Lambda*x)

where:

    C and Lambda are "constants" depending on the score matrix and the gap costs.
    N is (number of occurrences of pattern in database) * (number of occurrences of pattern in Q).
    e is the base of the natural logarithm.

It is important to understand that this method of computing the statistical significance of a PHI-BLAST alignment is mathematically different from the method used for BLAST and PSI-BLAST alignments.
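To make the formula above concrete, here is a minimal Python sketch that evaluates it. The values used for C, Lambda and the occurrence counts are hypothetical placeholders chosen only to show the shape of the function; the real constants depend on the score matrix and gap costs and are not given in these notes.

```python
import math


def phi_blast_expect(x, c, lam, db_occurrences, query_occurrences):
    """Expected number of alignments with reduced score >= x.

    Implements C * N * (Lambda*x + 1) * e^(-Lambda*x), where N is the
    product of the pattern occurrence counts in the database and in Q.
    """
    n = db_occurrences * query_occurrences
    return c * n * (lam * x + 1.0) * math.exp(-lam * x)


# Hypothetical inputs: C = 0.5, Lambda = 0.3, 1000 pattern occurrences in
# the database, 2 occurrences of interest in the query.
for x in (20, 40, 60):
    print(x, phi_blast_expect(x, 0.5, 0.3, 1000, 2))
```

Because N is a product, doubling the number of pattern occurrences of interest in the query doubles the expected count for every score, which is the multiplication effect mentioned earlier for queries with several pattern occurrences.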
Although the two methods differ, both provide E-values, so the E-values are displayed with a similar output syntax.

Rules for pattern syntax for PHI-BLAST.

The syntax for patterns in PHI-BLAST follows the conventions of PROSITE. When using the stand-alone program, it is permissible to have multiple patterns in a file separated by a blank line between patterns. When using the Web page only one pattern is allowed per query.

Valid protein characters for PHI-BLAST patterns:

    ABCDEFGHIKLMNPQRSTVWXYZU

Valid DNA characters for PHI-BLAST patterns:

    ACGT

Other useful delimiters:

    [ ]     means any one of the characters enclosed in the brackets; e.g., [LFYT] means one occurrence of L or F or Y or T
    -       means nothing (this is a spacer character used by PROSITE)
    x       with nothing following means any residue
    x(5)    means 5 positions in which any residue is allowed (and similarly for any other single number in parentheses after x)
    x(2,4)  means 2 to 4 positions where any residue is allowed, and similarly for any other two numbers separated by a comma; the first number should be < the second number.
    >       can occur only at the end of a pattern and means nothing; it may occur before a period (another spacer used by PROSITE)
    .       may be used at the end of the pattern and means nothing

When using the stand-alone program, the pattern should be in a file, with the first line starting:

    ID

followed by 2 spaces and a text string giving the pattern a name. There should also be a line starting

    PA

followed by 2 spaces followed by the pattern description. All other PROSITE codes in the first two columns are allowed, but only the HI code, described below, is relevant to PHI-BLAST.

Here is an example from PROSITE.

ID  CNMP_BINDING_2; PATTERN.
AC  PS00889;
DT  OCT-1993 (CREATED); OCT-1993 (DATA UPDATE); NOV-1995 (INFO UPDATE).
DE  Cyclic nucleotide-binding domain signature 2.
PA  [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].
NR  /RELEASE=32,49340;
NR  /TOTAL=57(36); /POSITIVE=57(36); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR  /FALSE_NEG=1; /PARTIAL=1;
CC  /TAXO-RANGE=??EP?; /MAX-REPEAT=2;

The line starting ID gives the pattern a name. The lines starting AC, DT, DE, NR, and CC are relevant to PROSITE users, but irrelevant to PHI-BLAST. These lines are tolerated, but ignored by PHI-BLAST.

The line starting PA describes the pattern as:

    one of LIVMF followed by
    G followed by
    E followed by
    any single character followed by
    one of GAS followed by
    one of LIVM followed by
    any 5 to 11 characters followed by
    R followed by
    one of STAQ followed by
    A followed by
    any single character followed by
    one of LIVMA followed by
    any single character followed by
    one of STACV

In this case the pattern ends with a period. It can end with nothing after the last specifying symbol or any number of > signs or periods or combination thereof.

Here is another example, illustrating the use of an HI line.

ID  ER_TARGET; PATTERN.
PA  [KRHQSA]-[DENQ]-E-L>.
HI  (19 22)
HI  (201 204)

In this example, the HI lines specify that the pattern occurs twice, once from positions 19 through 22 in the sequence and once from positions 201 through 204 in the sequence. These specifications are relevant when stand-alone PHI-BLAST is used with the seedp option, in which the interesting occurrences of the pattern in the sequence are specified. In this case the HI lines specify which occurrence(s) of the pattern should be used to find good alignments.

In general, the seedp option is more useful than the standard patseedp option ONLY when the pattern occurs K > 1 times in the sequence AND the user is interested in matching to J < K of those occurrences. Then using the HI lines enables the user to specify which occurrences are of interest.

Additional functionality related to PHI-BLAST.
PHI-BLAST takes as input both a query sequence and a pattern that occurs in
that sequence, and searches a sequence database for other sequences
containing the same pattern and having a good alignment. One may be
interested in asking two related, simpler questions:

  1. Given a sequence and a database of patterns, which patterns occur in
     the sequence and where?
  2. Given a pattern and a sequence database, which sequences contain the
     pattern and where?

These queries can be answered with software closely related to PHI-BLAST,
but they do not fit into the output framework of BLAST because the answers
are simple lists without alignments and with no notion of statistical
significance.

The NCBI toolbox includes another program, currently called seedtop, to
answer the two queries above.

Query 1 can be asked with:
  seedtop -i -k -p patmatchp
Query 2 can be asked with:
  seedtop -d -k -p patternp

The -k argument is used similarly in all queries and the file format is
always the same. The standard pattern database is PROSITE, but others (or a
subset) can be used.

There are plans to offer the patmatchp query (number 1) on or near the
PHI-BLAST web page, but this would be restricted to having PROSITE as the
pattern database.

References

Zhang, Zheng, Alejandro A. Schäffer, Webb Miller, Thomas L. Madden,
David J. Lipman, Eugene V. Koonin, and Stephen F. Altschul (1998),
"Protein sequence similarity searches using patterns as seeds", Nucleic
Acids Res. 26:3986-3990.

Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.

Karlin, Samuel and Stephen F. Altschul (1990). Methods for assessing the
statistical significance of molecular sequence features by using general
scoring schemes. Proc. Natl. Acad. Sci. USA 87:2264-68.

Karlin, Samuel and Stephen F. Altschul (1993).
Applications and statistics for multiple high-scoring segments in molecular
sequences. Proc. Natl. Acad. Sci. USA 90:5873-7.

Release History

Notes for 2.0.7 release:

Bug fixes:

  1.) BLAST now multi-threads properly under LINUX.
  2.) A problem with very redundant databases and PSI-BLAST was fixed.
  3.) A problem with the formatting of the number of identities and
      positives was fixed. This affected results on the minus strand only
      and did not affect the expect value or scores.
  4.) A problem that caused tblastn to core-dump very occasionally was
      corrected.
  5.) A problem with multiple patterns in PHI-BLAST was fixed.
  6.) A limit on the number of HSPs that were saved (100) was removed.

Notes for 2.0.6 release:

Enhancements:

  1.) PHI-BLAST is included in this release. Please see notes on PHI-BLAST
      for details.
  2.) SEG has become an integral part of the NCBI toolkit and it is no
      longer necessary to install it separately. It is also now supported
      under non-UNIX platforms.
  3.) Access to filtering options. If one uses "-F T" then normal filtering
      by seg or dust (for blastn) occurs (likewise "-F F" means no
      filtering whatsoever). The seg options can be changed by using:

        -F "S 10 1.0 1.5"

      which specifies a window of 10, locut of 1.0 and hicut of 1.5. One
      may also specify coiled-coil filtering by specifying:

        -F "C"

      There are three parameters for this: window, cutoff (probability of
      a coiled coil), and linker (distance between two coiled-coil regions
      that should be linked together). These are now set to:

        window: 22
        cutoff: 40.0
        linker: 32

      One may also change the coiled-coil parameters in a manner analogous
      to that of seg:

        -F "C 28 40.0 32"

      will change the window to 28. One may also run both seg and
      coiled-coil filtering together by using a ";":

        -F "C;S"

  4.) BLAST has been changed to reduce the number of redundant hits that a
      user may see.
      This is achieved by keeping track of the number of hits completely
      contained in a certain region and eliminating those lower scoring
      hits that are redundant with others. This behavior may be controlled
      with the -K and -L options:

        -K  Number of best hits from a region to keep [Integer]
            default = 50
        -L  Length of region used to judge hits [Integer]
            default = 20

      Setting -K to zero turns off this feature. This is the default only
      on blastall.

Bug fixes:

  1.) There was a problem with the procedure that called the external
      utility seg; this showed up under LINUX. The need to fix this was
      obviated by the integration of seg into the toolkit.
  2.) There was a memory problem with formatdb that has been fixed. This
      showed up mostly under NT and LINUX.
  3.) A problem with running in multi-processing mode under IRIX6.5 (as a
      non-root user) was fixed.

Notes for 2.0.5 release:

Enhancements:

  1.) The BLAST version is printed by formatdb in its log file.
  2.) Multi-database searches no longer require that the -o option be used
      when preparing the databases (i.e., with formatdb).

Bugs fixed:

  1.) A serious bug with multi-database iterative searches was fixed
      (thanks to Steve Brenner for providing an example).
  2.) 'lcl' is not formatted in the BLAST report when the sequence
      identifier is a local identifier or does not contain a bar ("|").
  3.) A large memory leak in formatdb was fixed.
  4.) An unnecessary cast that caused formatdb to fail on Solaris 2.5
      machines if the binary was made under 2.6 was fixed.
  5.) Better error checking was added to protect against core-dumps.
  6.) Some problems with the sum statistics treatment of the blastx and
      tblastn programs reported by D. Rozenbaum were fixed. The number of
      alignments involved in a sum group was misrepresented. Also the
      incorrect length for the database sequence was used, sometimes
      causing a slight change in the value reported.
  7.)
      A problem with blastpgp was fixed that reported incorrect values for
      matrices other than BLOSUM62 during iterative searches.

Notes for 2.0.4 release:

Enhancements:

  1.) multiple database searches: Version 2.0.4 will accept multiple
      database names (bracketed by quotations). An example would be

        -d "nr est"

      which will search both the nr and est databases, presenting the
      results as if one 'virtual' database consisting of all the entries
      from both were searched. The statistics are based on the 'virtual'
      database.
  2.) new options:

        -W  Word size, default if zero [Integer]
            default = 0
        -z  Effective length of the database (use zero for the real size)
            [Integer]
            default = 0

  3.) The number of identities, positives, and gaps are now printed out
      before the alignments for gapped blastx, tblastn, and tblastx.
      Additionally this feature is now also enabled for ungapped BLAST.
  4.) Formatdb now accepts ASN.1, as well as FASTA, as input.

Bugs fixed:

  1.) In blastx, tblastn, and tblastx a codon was incorrectly formatted as
      a start codon in some cases.
  2.) The last alignment of the last sequence being presented was
      incorrectly dropped in some cases. This change could affect the
      statistical significance of the last database sequence if the
      dropped alignment had a lower e-value than any other alignments from
      the same database sequence.