BlockSearch User Manual ----------------------- Rainer Fuchs EMBL Data Library European Molecular Biology laboratory Postfach 10.2209 69012 Heidelberg Germany E-mail: Fuchs@EMBL-Heidelberg.DE 1. Introduction --------------- As automated sequencing techniques improve and large-scale sequencing projects gain pace, scientists more and more frequently encounter new protein-coding sequences whose function is not known a priori. A first step in sequence analysis is commonly a sequence comparison against the general nucleotide and protein sequence databases to find similarities to already known sequences that can provide clues to the function of the newly determined one. While the steady growth of these databases increases the chances for finding similar sequences, the concurrent growth of redundancy also adds noise and complicates the detection of more distantly related proteins. Consequently, several alternative methods for the elucidation of unknown protein functions have been developed over the years. Most popular perhaps is the search for protein motifs or patterns. These patterns are descriptions of short conserved sequence regions characteristic for a protein family. If some part of a new sequence matches one of these patterns, then this strongly suggests that the new protein is a member of the family identified by the pattern. Several collections of protein patterns have been established recently (Bairoch, 1992; Ogiwara et al., 1992), and a variety of computer tools is available for their utilization (e.g., Fuchs, 1991; Staden, 1991; Sternberg, 1991; for a comprehensive overview see the User Manual of the PROSITE pattern database, Bairoch, 1992). Another, more quantitative approach is the profile analysis (Gribskov et al., 1987; Gribskov et al., 1990; Henikoff et al., 1990). A profile is essentially a frequency matrix derived from an alignment of related sequences in which each column of the matrix reflects the frequency of occurrence of each amino acid at the corresponding position of the alignment. High similarity of some region of a new sequence to a profile indicates a possible biological relationship of this sequence and the family characterized by the profile. Compared to pattern searches, profile searches provide increased sensitivity although calculation is more complex and computing-intensive and the interpretation of results is usually more complicated because clear cutoff similarity scores that indicate significance can most often not be defined. Whereas profiles are mainly based on global alignments and allow for gaps (Gribskov et al., 1990; Smith et al., 1990), it has been demonstrated that even short ungapped alignments, so-called blocks, can yield very effective scoring matrices (Henikoff et al., 1990; Henikoff and Henikoff, 1991). Subsequently, a large database of such blocks (BLOCKS) has been established recently (Henikoff and Henikoff, 1991). BlockSearch is a program developed for Digital's VAX and Alpha computer systems that utilizes the BLOCKS database for the functional analysis of new protein sequences. BlockSearch is fast and user-friendly, and the code is easily portable to other platforms. 2. Legal issues --------------- The BlockSearch and BlockShow programs and source code are copyright protected [Copyright (c) 1992,1993 by the European Molecular Biology Laboratory and the EMBL Data Library]. Permission is herewith granted to freely use and redistribute the programs and source code for non-commercial purposes. You are kindly requested to cite the author if you publish results obtained with the help of these programs: Fuchs, R. (1993) Block searches on VAX and Alpha computer systems. Comput. Appl. Biosci. 9, in press. 3. Installation --------------- Edit the config.h file to your needs. Most importantly, change DEFAULT_BLOCKSDB to point to the location of the blocks.dat database file on your system, unless you prefer to use the -d command line option. For example, if your blocks.dat file is in $databases:[blocks], change config.h to #define DEFAULT_BLOCKSDB "$databases:[blocks]blocks.dat" Under VAX/VMS execute the make.com command procedure: @make.com Ignore any warning messages that might appear during the compilation of the source code. You should now have two new files called blocksearch.exe and blockshow.exe in your default directory. Under Alpha/OpenVMS execute the make_alpha.com procedure: @make_alpha Again, ignore any compilation warnings. VMS and OpenVMS require the installation of BlockSearch and BlockShow as foreign commands. Add the following lines to the login.com file you find in your home directory: $ blocksearch :== $ disk:[directory]blocksearch.exe $ blockshow :== $ disk:[directory]blockshow.exe Replace "disk" and "directory" by your local values. Example: if your disk is called $U and your directory [JOEBIOL.BLOCKSEARCH], then you would include the lines $ blocksearch :== $ $U:[JOEBIOL.BLOCKSEARCH]blocksearch.exe $ blockshow :== $ $U:[JOEBIOL.BLOCKSEARCH]blockshow.exe (note the two consecutive dollar characters!) in your login.com file. Now logout and login again or type @sys$login:login If you want to use BlockSearch under any other operating system, follow the local instructions for compiling C programs. Make sure that an ANSI C compiler is available to you. Under ULTRIX, for instance, you can compile BlockSearch by using the command cc -o blocksearch main.c sequence.c search.c hash.c Successful compilation has been demonstrated on VAX/VMS, Alpha/OpenVMS, DEC/Ultrix, Silicon Graphics/IRIX, and Apple Macintosh Think C. Other operating systems may require modifications, in particular SUN systems that do not support the ANSI standard. The distribution package contains a file main_alt.c which adds an interactive user interface to BlockSearch, based on modifications made by Bruce Roe, University of Oklahoma. To use it, simply rename it to main.c and run make.com (or make_alpha.com). Make sure that you edit config.h, otherwise the frequency table files cannot be found. Please, be aware that main_alt.c is not supported and only provided "as is". 4. Requirements --------------- No special hardware or software is required. However, you need a copy of S. Henikoff's BLOCKS database. Copies of BLOCKS can be obtained from several sources: 1) on the EMBL Data Library Molecular Biology Databases CD-ROM set 2) by anonymous FTP from FTP.EMBL-Heidelberg.DE in /pub/databases/blocks or from NCBI.NLM.NIH.GOV in /repository/blocks 3) by electronic mail. Send a mail message to NetServ@EMBL-Heidelberg.DE and include the line "HELP BLOCKS" for more information. 5. Command line --------------- BlockSearch is completely command-line driven. Usage: blocksearch [-gx] [-i infile] [-o outfile] [-f freqfile] [-d blocksdb] [-n #hits] [-s sensitivity] infile = name of sequence data file [stdin] outfile = name of output file [stdout] freqfile = name of frequency file [freq_file] blocksdb = name of BLOCKS database [blocks_db] #hits = no of top scores (1..200) [50] sensitivity = search sensitivity/speed (>= 0) [5] If no parameters are passed to BlockSearch, it will read sequences from the standard input stream, i.e. the keyboard or a command file, and will print results to the standard output stream, i.e. the screen. It will assume default values for all other parameters. The command line options have the following meaning: -i infile Gives the name of a sequence data file. See Section 6 for details on the sequence formats recognised by BlockSearch. -g Input file is GCG format. -o outfile Specifies an output file name to hold the results. -f freqfile Specifies the name of a file with amino acid frequencies which BlockSearch will use for adjusting the scoring matrices it calculates. See Section 7 fo details. The default name is freq_file. Logical names or aliases for freq_file can be used. You can change the default name by editing config.h before compilation. -d blocksdb The location of the BLOCKS database. By default, it looks in your current directory for a file called blocks_db. Logical names or aliases for blocks_db can be used. You can change the default name by editing config.h before compilation. -n #hits A number that specifies the number of high-scoring matches (per input sequence) that BlockSearch will display in the output. The default value is 50. -x Exhaustive search. Without the -x option BlockSearch will run in "fast" mode, but may miss some remote similarities. The -x option is equivalent to -s 0. -s sensitivity You can increase the speed of BlockSearch by reducing its sensitivity, ie, you gain speed for the modest price of missing some high-scoring segments. See below for more details. The default value is 5 and can be set to any positive value, although values much greater than 10 will not make sense. A value of 0 is equivalent to using the -x option, so BlockSearch will run in the most-sensitive (but slowest) mode. Examples: blocksearch -i test.seq -o test.out -f swissfreq.dat This command tells BlockSearch to read sequences from test.seq, to use the default blocks database and a frequency table called swissfreq.dat, and to write the 200 best matches into the file test.out. blocksearch -d $databases:[blocks]blocks.dat -i test.seq -n 5 BlockSearch will read sequences from test.seq, will use the database in the $databases:[blocks] directory and the default frequency table, and will display the 5 best matches on your terminal screen. 6. Input sequence formats ------------------------- BlockSearch understands four sequence formats: a) EMBL/SWISS-PROT format ID 104K_THEPA STANDARD; PRT; 924 AA. [Other line types deleted; not required and ignored by BlockSearch] TTVELAPEPK ASRIVVDDEG TEADDEETHP PEERQKTEVR RRRPPKKPSK SPRPSKPKKP KKPDSAYIPS ILAILVVSLI VGIL // b) PIR/NBRF format >P1;104K_THEPA - Human 104 KD Microneme-rhoptry antigen TTVELAPEPK ASRIVVDDEG TEADDEETHP PEERQKTEVR RRRPPKKPSK SPRPSKPKKP KKPDSAYIPS ILAILVVSLI VGIL* c) FASTA format >P1;104K_THEPA - Human 104 KD Microneme-rhoptry antigen TTVELAPEPK ASRIVVDDEG TEADDEETHP PEERQKTEVR RRRPPKKPSK SPRPSKPKKP KKPDSAYIPS ILAILVVSLI VGIL* a) - c) Each input sequence file can have multiple sequences. These are analysed consecutively, and results are listed independently in the output file. d) Standard GCG format Everything up to a line that ends with two periods is treated as comment, everything else as sequence. Only one sequence per file is allowed. 7. Frequency tables ------------------- To compensate for bias in the protein database, all entries in the scoring matrices which BlockSearch creates are weighted by using an externally supplied frequency table. See Section 9 for details on the algorithm. The following frequency tables are supplied: swissfreq.dat - Derived from all proteins in SWISS-PROT Release 24 ecolifreq.dat - Derived by S. Henikoff from E. coli proteins and used for the calibration of the BLOCKS database simplefreq.dat - All amino acids have same values. No weighting. The file format of a frequency table is very simple. Lines starting with a '#' or '!' character are ignored as well as everything that follows these characters in a line. There must be 26 consecutive lines with frequency values, one for each amino acid A through Z according to IUPAC one-letter representation, in alphabetical order. If no special table is specified on the command line with the -f option, BlockSearch looks for a file called freq_file in the default directory. As supplied, this file is a copy of the ecolifreq.dat file. Under (Open)VMS you may use a logical name FREQ_FILE pointing to the actual frequency table. Under UNIX you can use aliases. 8. Output --------- Here is an example output file produced by BlockSearch: BlockSearch v2.0.1 - R. Fuchs, EMBL Data Library Sequence file: test.seq Database file: blocks_db Frequency table: freq_file Sensitivity: 5 ---------------------------------------- Sequence MAS_HUMAN (325 residues): Score Strength Block 1310 1563 BL00237E G_PROTEIN_RECEPTOR 272: NSSANPFIYfFVGSSkK 1047 1867 BL00237A G_PROTEIN_RECEPTOR 48: VeNGILLWFLCFRMRRNPfTVYITHLSIADI 1036 2676 BL00702B GM_CSF 283: vgsskkKrFKESLKvvLTrafkDemqPrrq 1034 1542 BL00758E ARGE_DAPE_CPG2_1 90: DYALdyElSS - 997 2529 BL00594B AROMATIC_AA_PERMEASE 229: vimvtiiifLifAmpMrLLYlLyYeYwStfGn - 995 2414 BL00307A LECTIN_LEGUME_BETA 24: SVGnAhrQiPIvhWvimSiSpvgFVeNGIL - 989 3265 BL00115J RNA_POL_II_REPEAT 28: aHRqiPivhwvimSispvGFVENgiLlwflcfr - 970 2555 BL00557C FMN_HYDROXY_ACID_DH 143: HRpkyQSAlvcALlwalscLVtTmeyvmcidREeesHsRN - 961 3434 BL00446A RNA_POL_30KD 16: nIstgrnaSVgNAhRqIpIvhwvImsIspVgFvENgilLwflcfrmRrnpftvyi - 954 1510 BL00614E IGPS 192: aIlSFLVfTpLMLv 2302 block(s) searched in 1 sequence(s). 571 blocks searched in slow mode. Time ellapsed: 0:12 min After some general information, there is a list of high-scoring sequence segments, sorted by score. For each match, there are two lines. The first line shows the match score, the strength of the matching block, the BLOCKS entry's accession number and its entry name. If the score is greater than the strength, the score value is preceeded by a '+' sign. Scores less than 1,000 are indicated by a '-' sign in the first column. The second line of each match shows the offset of the matching sequence fragment from the beginning of the sequence (numbering starts with 1) and the matching sequence fragment itself. A lower case character indicates that this residue is not found at this position in any of the database sequences comprising the block. If the input file contains multiple sequences, then results are shown for each sequence individually. At the bottom of the result file there are some statistics, incl. how many sequences were analysed and how long it took. 9. Algorithm ------------ The BlockSearch program is an implementation of a profile or scoring-matrix search algorithm (Gribskov et al., 1987). It uses the BLOCKS database as a data source for deriving site-specific scoring matrices. This database and its format are described in detail in Henikoff and Henikoff, 1991. In brief, all protein families defined by the PROSITE pattern database (Bairoch, 1992) are analyzed for common motifs which are then used as focal points for local alignments. These alignments are converted to blocks - short ungapped regions of aligned protein sequence fragments. The best set of non-overlapping blocks is determined for each family and calibrated by searching against the SWISS-PROT protein sequence database (Bairoch and Boeckmann, 1992). This collection of calibrated blocks constitutes the BLOCKS database. BlockSearch converts BLOCKS database entries to site-specific scoring matrices essentially as described by Henikoff et al., 1990. Basically, in such a matrix each column is filled with values that represent the frequency of occurrence of each amino acid at the corresponding position of the block. However, BlockSearch also considers the arrangement of protein sequence fragments within a BLOCKS entry into subgroups of very closely related proteins. Frequency values are averaged for each subgroup prior to the calculation of a scoring matrix column to reduce the contribution of multiply represented protein subfamilies. Matrix values can then be optionally weighted to compensate for database bias, i.e., uneven occurrences of amino acids in the protein database, by using a user-supplied amino acid frequency table. Each matrix entry is divided by the frequency of that amino acid. The values in each matrix column are finally normalised to 100%. The resulting matrix is used to assign a score to each fragment of the query sequence(s) of the width of the block by adding the matrix values for each residue at each position. For every block the highest score and the position of the matching query sequence fragment is remembered. Because blocks are of unequal length, raw scores have to be adjusted to allow the comparison of results obtained with different blocks (Wallace and Henikoff, 1992). Otherwise, wider blocks would yield higher scores simply because they were wider. To normalize scores, BlockSearch uses the "99.5%" value stored in each BLOCKS entry, which is the 99.5th percentile of scores of true negative sequences detected by this block during calibration (Henikoff and Henikoff, 1991). Raw scores are divided by this value and multiplied by 1,000. A corrected score of less than 1,000, therefore, indicates that a match is probably not significant. This adjustment procedure is valid because the method used by BlockSearch for calculating a scoring matrix from a block of aligned protein sequences is compatible with the method used for the original calibration of the BLOCKS database. 10. Speed and sensitivity ------------------------- Version 2.0 of BlockSearch offers a new search method that greatly improves its speed. This "fast" mode is faster than the normal mode by a factor of 2 to 10. The price you pay is a somewhat reduced sensitivity, i.e., some high-scoring segments may not be detected. In fast mode, the standard algorithm described above is altered in the following way: When BlockSearch calculates a scoring matrix, it looks for positions in the matrix which are absolutely conserved in all members of a protein family. In other words, in such columns all values are 0 except for one amino acid whose score is 100. If it finds at least one such position, BlockSearch uses this information to "jump" along the sequence using a hashing technique and looking for occurences of this residue. Identical residues may appear at a given position just by chance. Since the "reliability" of a constant position increases as the number of members in a family grows, BlockSearch only uses this modified algorithm with families that have at least as many members as specified by the user using the -s command line option. With the modified algorithm, the number of necessary comparisons can be reduced drastically. On the other hand, you will miss those high-scoring regions in which the conserved residue is not contained in the query sequence. In practice, various test have shown that this loss of information does not seem to be a problem at all and that the increase in speed is well worth the slightly decreased sensitivity. In fact, it can be seen as a sort of filtering mechanism, as in all tests performed only non-significant hits had been affected. The -s command line option is used to adjust the sensitivity and speed of BlockSearch. Setting -s to 0 turns off "fast" mode and forces BlockSearch to fall back to the standard block search algorithm. The -x option may be used as a shortcut for -s 0. Positive values of -s result in improved speed. In general, the lower the value is, the quicker BlockSearch performs and the lower the sensitivity is. A value of 1, for instance, gives the highest speed improvements, while values greater than 100 are essentially the same as using -s 0 or -x. In respect to the algorithm used, the value of -s (with the exception of -s 0) determines the minimum number of sequences that have to exist in a family before the new algorithm is applied to this family. The proper choice of the -s value depends on the question at hand. If sensitivity matters, for example if you want to analyse a new sequence thoroughly, "fast" mode should be turned off, while for the rapid analysis of a large number of cDNA clones a value of 1 seems more appropriate. Various practical tests indicate that for routine applications a value between 5 and 10 is most suited, which yields a significant speed improvement without a great danger of missing important hits. The default value is 5. In all tests performed no significant hit was lost. The influence of -s on speed depends largely on your hardware/software platform and on the sequence length. On a VAX 9000-420 search times for average-sized query sequences (about 300 aa) could be reduced by 50% using the -s 1 option. For larger sequences improvements of a factor 4 could be achieved. The effect is less drastic if you are I/O bound such as on an Alpha machine. On an ULTRIX DECstation speed improvements were similar to those on a VAX. 11. Restrictions and known bugs ------------------------------- Please note that the maximum input sequence length is 100,000 residues. The number of high-scoring matches that BlockSearch will show per sequence is limited to 200. Both parameters, however, can be changed in the config.h file. There are no bugs in BlockSearch; only some unexpected features... 12. References -------------- Bairoch,A. (1992) PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Res., 20, 2013-2018. Bairoch,A. and Boeckmann,B. (1992) The SWISS-PROT protein sequence data bank. Nucleic Acids Res., 20, 2019-2022. Doolittle,R.F. (1987) Of URFs and ORFs: a primer on how to analyze derived amino acid sequences. University Science Books, Mill Valley. Fuchs,R. (1991) MacPattern: protein pattern searching on the Apple Macintosh. Comput. Appl. Biosci., 7, 105-106. Gribskov,M., McLachlan,A.D. and Eisenberg,D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. USA, 84, 4355-4358. Gribskov,M., Luthy,R. and Eisenberg,D. (1990) Profile analysis. Methods Enzymol., 183, 146-159. Henikoff,S., Wallace,J.C. and Brown,J.P. (1990) Finding protein similarities with nucleotide sequence databases. Methods Enzymol., 183, 111-132. Henikoff,S. and Henikoff,J.G. (1991) Automated assembly of protein blocks for database searching. Nucleic Acids Res., 19, 6565-6572. Ogiwara,A., Uchiyama,I., Seto,Y. and Kanehisa,M. (1992) Construction of a dictionary of sequence motifs that characterize groups of unrelated proteins. Protein Eng., 5, 479-488. Smith,H.O., Annau,T.M. and Chandrasegaran,S. (1990) Finding sequence motifs in groups of functionally related proteins. Proc. Natl. Acad. Sci. USA, 87, 826-830. Staden,R. (1990) Searching for patterns in protein and nucleic acid sequences. Methods Enzymol., 183, 193-211. Staden,R. (1991) Screening protein and nucleic acid sequences against libraries of patterns. DNA Sequence, 1, 369-374. Sternberg,M.J.E. (1991) PROMOT: a FORTRAN program to scan protein sequences against a library of known motifs. Comput. Appl. Biosci., 7, 257-260. Sternberg,M.J.E. and Islam,S.A. (1991) Protein sequences - homologies and motifs. TIBTECH, 9, 300-302. Wallace,J.C. and Henikoff,S. (1992) PATMAT: a searching and extraction program for sequence, pattern and block queries and databases. Comput. Appl. Biosci., 8, 249-254. 13. Version history ------------------- v1.0 (December 1992): First release v1.0.1 (January 1993): GCG format support v2.0 (May 1993): Added "fast" mode v2.0.1(June 1993): High scoring sequence fragments are displayed in mixed case. A lower case character indicates that this residue does not occur in any sequence of the block. 14. BlockShow ------------- A simple utility, BlockShow, is provided that facilitates access to individual BLOCKS database entries. Under (Open)VMS, BlockShow is compiled as part of the installation procedure. On UNIX systems, use the command "cc -o blockshow blockshow.c" to obtain the executable. Usage: blockshow [-d database] acc# ... Enter one or a list of accession numbers separated by blanks (note that the program is case-sensitive!). The corresponding database entries are displayed on the standard output stream. By default, BlockShow looks for a database file called blocks_db. If the database is centrally installed at your site, use the -d option to specify its location: blockshow -d $databases:[blocks]blocks.dat BL00011 BL00012 BL00013 Or simply edit config.h and change DEFAULT_BLOCKSDB before you compile blockshow.c.