BlockSearch User Manual
-----------------------


Rainer Fuchs
EMBL Data Library
European Molecular Biology laboratory
Postfach 10.2209
69012 Heidelberg
Germany
E-mail: Fuchs@EMBL-Heidelberg.DE


1. Introduction
---------------

As automated sequencing techniques improve and large-scale sequencing projects
gain pace, scientists more and more frequently encounter new protein-coding
sequences whose function is not known a priori. A first step in sequence
analysis is commonly a sequence comparison against the general nucleotide and
protein sequence databases to find similarities to already known sequences that
can provide clues to the function of the newly determined one. While the steady
growth of these databases increases the chances for finding similar sequences,
the concurrent growth of redundancy also adds noise and complicates the
detection of more distantly related proteins.

Consequently, several alternative methods for the elucidation of unknown protein
functions have been developed over the years. Most popular perhaps is the search
for protein motifs or patterns. These patterns are descriptions of short
conserved sequence regions characteristic for a protein family. If some part of
a new sequence matches one of these patterns, then this strongly suggests that
the new protein is a member of the family identified by the pattern. Several
collections of protein patterns have been established recently (Bairoch, 1992;
Ogiwara et al., 1992), and a variety of computer tools is available for their
utilization (e.g., Fuchs, 1991; Staden, 1991; Sternberg, 1991; for a
comprehensive overview see the User Manual of the PROSITE pattern database,
Bairoch, 1992).

Another, more quantitative approach is the profile analysis (Gribskov et al.,
1987; Gribskov et al., 1990; Henikoff et al., 1990). A profile is essentially a
frequency matrix derived from an alignment of related sequences in which each
column of the matrix reflects the frequency of occurrence of each amino acid at
the corresponding position of the alignment. High similarity of some region of a
new sequence to a profile indicates a possible biological relationship of this
sequence and the family characterized by the profile. Compared to pattern
searches, profile searches provide increased sensitivity although calculation is
more complex and computing-intensive and the interpretation of results is
usually more complicated because clear cutoff similarity scores that indicate
significance can most often not be defined. Whereas profiles are mainly based on
global alignments and allow for gaps (Gribskov et al., 1990; Smith et al.,
1990), it has been demonstrated that even short ungapped alignments, so-called
blocks, can yield very effective scoring matrices (Henikoff et al., 1990;
Henikoff and Henikoff, 1991). Subsequently, a large database of such blocks
(BLOCKS) has been established recently (Henikoff and Henikoff, 1991). 

BlockSearch is a program developed for Digital's VAX and Alpha computer systems
that utilizes the BLOCKS database for the functional analysis of new protein
sequences. BlockSearch is fast and user-friendly, and the code is easily
portable to other platforms.


2. Legal issues
---------------

The BlockSearch and BlockShow programs and source code are copyright protected
[Copyright (c) 1992,1993 by the European Molecular Biology Laboratory and the
EMBL Data Library]. Permission is herewith granted to freely use and
redistribute the programs and source code for non-commercial purposes. You are
kindly requested to cite the author if you publish results obtained with the
help of these programs:

        Fuchs, R. (1993) Block searches on VAX and Alpha computer systems.
        Comput. Appl. Biosci. 9, in press.


3. Installation
---------------

Edit the config.h file to your needs. Most importantly, change DEFAULT_BLOCKSDB
to point to the location of the blocks.dat database file on your system, unless
you prefer to use the -d command line option. For example, if your blocks.dat
file is in $databases:[blocks], change config.h to

#define DEFAULT_BLOCKSDB "$databases:[blocks]blocks.dat"

Under VAX/VMS execute the make.com command procedure:

@make.com

Ignore any warning messages that might appear during the compilation of the
source code. You should now have two new files called blocksearch.exe and
blockshow.exe in your default directory.

Under Alpha/OpenVMS execute the make_alpha.com procedure:

@make_alpha

Again, ignore any compilation warnings.

VMS and OpenVMS require the installation of BlockSearch and BlockShow as foreign
commands. Add the following lines to the login.com file you find in your home
directory:

$ blocksearch :== $ disk:[directory]blocksearch.exe
$ blockshow   :== $ disk:[directory]blockshow.exe
                 
Replace "disk" and "directory" by your local values. Example: if your disk is
called $U and your directory [JOEBIOL.BLOCKSEARCH], then you would include the
lines

$ blocksearch :== $ $U:[JOEBIOL.BLOCKSEARCH]blocksearch.exe
$ blockshow   :== $ $U:[JOEBIOL.BLOCKSEARCH]blockshow.exe

(note the two consecutive dollar characters!)

in your login.com file. Now logout and login again or type

@sys$login:login


If you want to use BlockSearch under any other operating system, follow the
local instructions for compiling C programs. Make sure that an ANSI C compiler
is available to you. Under ULTRIX, for instance, you can compile BlockSearch by
using the command

cc -o blocksearch main.c sequence.c search.c hash.c

Successful compilation has been demonstrated on VAX/VMS, Alpha/OpenVMS,
DEC/Ultrix, Silicon Graphics/IRIX, and Apple Macintosh Think C. Other operating
systems may require modifications, in particular SUN systems that do not
support the ANSI standard.


The distribution package contains a file main_alt.c which adds an interactive
user interface to BlockSearch, based on modifications made by Bruce Roe,
University of Oklahoma. To use it, simply rename it to main.c and run make.com
(or make_alpha.com). Make sure that you edit config.h, otherwise the frequency
table files cannot be found.

Please, be aware that main_alt.c is not supported and only provided "as is". 


4. Requirements
---------------

No special hardware or software is required. However, you need a copy of
S. Henikoff's BLOCKS database.

Copies of BLOCKS can be obtained from several sources:

1) on the EMBL Data Library Molecular Biology Databases CD-ROM set

2) by anonymous FTP from FTP.EMBL-Heidelberg.DE in /pub/databases/blocks
   or from NCBI.NLM.NIH.GOV in /repository/blocks

3) by electronic mail. Send a mail message to NetServ@EMBL-Heidelberg.DE and
   include the line "HELP BLOCKS" for more information.


5. Command line
---------------

BlockSearch is completely command-line driven.

Usage:
blocksearch [-gx] [-i infile] [-o outfile] [-f freqfile] [-d blocksdb]
            [-n #hits] [-s sensitivity]
        infile   = name of sequence data file [stdin]
        outfile  = name of output file [stdout]
        freqfile = name of frequency file [freq_file]
        blocksdb = name of BLOCKS database [blocks_db]
        #hits    = no of top scores (1..200) [50]
        sensitivity = search sensitivity/speed (>= 0) [5]

If no parameters are passed to BlockSearch, it will read sequences from the
standard input stream, i.e. the keyboard or a command file, and will print
results to the standard output stream, i.e. the screen. It will assume default
values for all other parameters.

The command line options have the following meaning:

-i infile     Gives the name of a sequence data file. See Section 6 for
              details on the sequence formats recognised by BlockSearch.

-g            Input file is GCG format.

-o outfile    Specifies an output file name to hold the results.

-f freqfile   Specifies the name of a file with amino acid frequencies which
              BlockSearch will use for adjusting the scoring matrices it 
              calculates. See Section 7 fo details. The default name is
              freq_file. Logical names or aliases for freq_file can be used.
              You can change the default name by editing config.h before
              compilation.

-d blocksdb   The location of the BLOCKS database. By default, it looks in your
              current directory for a file called blocks_db. Logical names or
              aliases for blocks_db can be used. You can change the default name
              by editing config.h before compilation.

-n #hits      A number that specifies the number of high-scoring matches (per
              input sequence) that BlockSearch will display in the output. The
              default value is 50.

-x            Exhaustive search. Without the -x option BlockSearch will run
              in "fast" mode, but may miss some remote similarities. The -x
              option is equivalent to -s 0.

-s sensitivity  You can increase the speed of BlockSearch by reducing its
              sensitivity, ie, you gain speed for the modest price of missing
              some high-scoring segments. See below for more details.
              The default value is 5 and can be set to any positive value,
              although values much greater than 10 will not make sense. A value
              of 0 is equivalent to using the -x option, so BlockSearch will
              run in the most-sensitive (but slowest) mode.

Examples:

blocksearch -i test.seq -o test.out -f swissfreq.dat
     This command tells BlockSearch to read sequences from test.seq, to use the
     default blocks database and a frequency table called swissfreq.dat, and 
     to write the 200 best matches into the file test.out. 

blocksearch -d $databases:[blocks]blocks.dat -i test.seq -n 5
     BlockSearch will read sequences from test.seq, will use the database in
     the $databases:[blocks] directory and the default frequency table,
     and will display the 5 best matches on your terminal screen.


6. Input sequence formats
-------------------------

BlockSearch understands four sequence formats:

a) EMBL/SWISS-PROT format

ID   104K_THEPA     STANDARD;      PRT;   924 AA.
[Other line types deleted; not required and ignored by BlockSearch]
     TTVELAPEPK ASRIVVDDEG TEADDEETHP PEERQKTEVR RRRPPKKPSK SPRPSKPKKP
     KKPDSAYIPS ILAILVVSLI VGIL
//

b) PIR/NBRF format

>P1;104K_THEPA - Human
104 KD Microneme-rhoptry antigen
TTVELAPEPK ASRIVVDDEG TEADDEETHP PEERQKTEVR RRRPPKKPSK SPRPSKPKKP
KKPDSAYIPS ILAILVVSLI VGIL*

c) FASTA format

>P1;104K_THEPA - Human 104 KD Microneme-rhoptry antigen
TTVELAPEPK ASRIVVDDEG TEADDEETHP PEERQKTEVR RRRPPKKPSK SPRPSKPKKP
KKPDSAYIPS ILAILVVSLI VGIL*

a) - c)
Each input sequence file can have multiple sequences. These are analysed
consecutively, and results are listed independently in the output file.

d) Standard GCG format

Everything up to a line that ends with two periods is treated as comment,
everything else as sequence.

Only one sequence per file is allowed.


7. Frequency tables
-------------------

To compensate for bias in the protein database, all entries in the scoring
matrices which BlockSearch creates are weighted by using an externally supplied
frequency table. See Section 9 for details on the algorithm.

The following frequency tables are supplied:

swissfreq.dat     -     Derived from all proteins in SWISS-PROT Release 24
ecolifreq.dat     -     Derived by S. Henikoff from E. coli proteins and used
                        for the calibration of the BLOCKS database
simplefreq.dat    -     All amino acids have same values. No weighting.

The file format of a frequency table is very simple. Lines starting with a '#'
or '!' character are ignored as well as everything that follows these
characters in a line. There must be 26 consecutive lines with frequency
values, one for each amino acid A through Z according to IUPAC one-letter
representation, in alphabetical order.

If no special table is specified on the command line with the -f option,
BlockSearch looks for a file called freq_file in the default directory. As
supplied, this file is a copy of the ecolifreq.dat file. Under (Open)VMS you may
use a logical name FREQ_FILE pointing to the actual frequency table. Under UNIX
you can use aliases.


8. Output
---------

Here is an example output file produced by BlockSearch:

BlockSearch v2.0.1 - R. Fuchs, EMBL Data Library
Sequence file:   test.seq
Database file:  blocks_db
Frequency table: freq_file
Sensitivity:     5

----------------------------------------
Sequence MAS_HUMAN (325 residues):

Score  Strength  Block
 1310  1563      BL00237E G_PROTEIN_RECEPTOR   
            272: NSSANPFIYfFVGSSkK
 1047  1867      BL00237A G_PROTEIN_RECEPTOR   
             48: VeNGILLWFLCFRMRRNPfTVYITHLSIADI
 1036  2676      BL00702B GM_CSF               
            283: vgsskkKrFKESLKvvLTrafkDemqPrrq
 1034  1542      BL00758E ARGE_DAPE_CPG2_1     
             90: DYALdyElSS
- 997  2529      BL00594B AROMATIC_AA_PERMEASE 
            229: vimvtiiifLifAmpMrLLYlLyYeYwStfGn
- 995  2414      BL00307A LECTIN_LEGUME_BETA   
             24: SVGnAhrQiPIvhWvimSiSpvgFVeNGIL
- 989  3265      BL00115J RNA_POL_II_REPEAT    
             28: aHRqiPivhwvimSispvGFVENgiLlwflcfr
- 970  2555      BL00557C FMN_HYDROXY_ACID_DH  
            143: HRpkyQSAlvcALlwalscLVtTmeyvmcidREeesHsRN
- 961  3434      BL00446A RNA_POL_30KD         
             16: nIstgrnaSVgNAhRqIpIvhwvImsIspVgFvENgilLwflcfrmRrnpftvyi
- 954  1510      BL00614E IGPS                 
            192: aIlSFLVfTpLMLv

2302 block(s) searched in 1 sequence(s).
571 blocks searched in slow mode.
Time ellapsed: 0:12 min


After some general information, there is a list of high-scoring sequence
segments, sorted by score. For each match, there are two lines.

The first line shows the match score, the strength of the matching block, the
BLOCKS entry's accession number and its entry name. If the score is greater
than the strength, the score value is preceeded by a '+' sign. Scores less than
1,000 are indicated by a '-' sign in the first column.

The second line of each match shows the offset of the matching sequence
fragment from the beginning of the sequence (numbering starts with 1) and the
matching sequence fragment itself. A lower case character indicates that this
residue is not found at this position in any of the database sequences
comprising the block.

If the input file contains multiple sequences, then results are shown for each
sequence individually.

At the bottom of the result file there are some statistics, incl. how many
sequences were analysed and how long it took.


9. Algorithm
------------

The BlockSearch program is an implementation of a profile or scoring-matrix
search algorithm (Gribskov et al., 1987). It uses the BLOCKS database as a data
source for deriving site-specific scoring matrices. This database and its format
are described in detail in Henikoff and Henikoff, 1991. In brief, all protein
families defined by the PROSITE pattern database (Bairoch, 1992) are analyzed
for common motifs which are then used as focal points for local alignments.
These alignments are converted to blocks - short ungapped regions of aligned
protein sequence fragments. The best set of non-overlapping blocks is determined
for each family and calibrated by searching against the SWISS-PROT protein
sequence database (Bairoch and Boeckmann, 1992). This collection of calibrated
blocks constitutes the BLOCKS database.

BlockSearch converts BLOCKS database entries to site-specific scoring matrices
essentially as described by Henikoff et al., 1990. Basically, in such a matrix
each column is filled with values that represent the frequency of occurrence of
each amino acid at the corresponding position of the block. However, BlockSearch
also considers the arrangement of protein sequence fragments within a BLOCKS
entry into subgroups of very closely related proteins. Frequency values are
averaged for each subgroup prior to the calculation of a scoring matrix column
to reduce the contribution of multiply represented protein subfamilies. Matrix
values can then be optionally weighted to compensate for database bias, i.e.,
uneven occurrences of amino acids in the protein database, by using a
user-supplied amino acid frequency table. Each matrix entry is divided by the
frequency of that amino acid. The values in each matrix column are finally
normalised to 100%.

The resulting matrix is used to assign a score to each fragment of the query
sequence(s) of the width of the block by adding the matrix values for each
residue at each position. For every block the highest score and the position of
the matching query sequence fragment is remembered. Because blocks are of
unequal length, raw scores have to be adjusted to allow the comparison of
results obtained with different blocks (Wallace and Henikoff, 1992). Otherwise,
wider blocks would yield higher scores simply because they were wider. To
normalize scores, BlockSearch uses the "99.5%" value stored in each BLOCKS
entry, which is the 99.5th percentile of scores of true negative sequences
detected by this block during calibration (Henikoff and Henikoff, 1991). Raw
scores are divided by this value and multiplied by 1,000. A corrected score of
less than 1,000, therefore, indicates that a match is probably not significant.
This adjustment procedure is valid because the method used by BlockSearch for
calculating a scoring matrix from a block of aligned protein sequences is
compatible with the method used for the original calibration of the BLOCKS
database.


10. Speed and sensitivity
-------------------------

Version 2.0 of BlockSearch offers a new search method that greatly improves
its speed. This "fast" mode is faster than the normal mode by a factor of 2 to
10. The price you pay is a somewhat reduced sensitivity, i.e., some high-scoring
segments may not be detected.

In fast mode, the standard algorithm described above is altered in the following
way: When BlockSearch calculates a scoring matrix, it looks for positions in the
matrix which are absolutely conserved in all members of a protein family. In
other words, in such columns all values are 0 except for one amino acid whose
score is 100. If it finds at least one such position, BlockSearch uses this
information to "jump" along the sequence using a hashing technique and looking
for occurences of this residue.

Identical residues may appear at a given position just by chance. Since the
"reliability" of a constant position increases as the number of members in a
family grows, BlockSearch only uses this modified algorithm with families that
have at least as many members as specified by the user using the -s command line
option.

With the modified algorithm, the number of necessary comparisons can be reduced
drastically. On the other hand, you will miss those high-scoring regions in
which the conserved residue is not contained in the query sequence.

In practice, various test have shown that this loss of information does not seem
to be a problem at all and that the increase in speed is well worth the slightly
decreased sensitivity. In fact, it can be seen as a sort of filtering mechanism,
as in all tests performed only non-significant hits had been affected.

The -s command line option is used to adjust the sensitivity and speed of
BlockSearch. Setting -s to 0 turns off "fast" mode and forces BlockSearch to
fall back to the standard block search algorithm. The -x option may be used as a
shortcut for -s 0. Positive values of -s result in improved speed. In general,
the lower the value is, the quicker BlockSearch performs and the lower the
sensitivity is. A value of 1, for instance, gives the highest speed
improvements, while values greater than 100 are essentially the same as using -s
0 or -x. In respect to the algorithm used, the value of -s (with the exception
of -s 0) determines the minimum number of sequences that have to exist in a
family before the new algorithm is applied to this family.

The proper choice of the -s value depends on the question at hand. If
sensitivity matters, for example if you want to analyse a new sequence
thoroughly, "fast" mode should be turned off, while for the rapid analysis of a
large number of cDNA clones a value of 1 seems more appropriate. Various
practical tests indicate that for routine applications a value between 5 and 10
is most suited, which yields a significant speed improvement without a great
danger of missing important hits. The default value is 5. In all tests performed
no significant hit was lost.

The influence of -s on speed depends largely on your hardware/software platform
and on the sequence length. On a VAX 9000-420 search times for average-sized
query sequences (about 300 aa) could be reduced by 50% using the -s 1 option.
For larger sequences improvements of a factor 4 could be achieved. The effect is
less drastic if you are I/O bound such as on an Alpha machine. On an ULTRIX
DECstation speed improvements were similar to those on a VAX.


11. Restrictions and known bugs
-------------------------------

Please note that the maximum input sequence length is 100,000 residues. The
number of high-scoring matches that BlockSearch will show per sequence is
limited to 200. Both parameters, however, can be changed in the config.h file.

There are no bugs in BlockSearch; only some unexpected features...


12. References
--------------

Bairoch,A. (1992) PROSITE: a dictionary of sites and patterns in proteins.
Nucleic Acids Res., 20, 2013-2018.

Bairoch,A. and Boeckmann,B. (1992) The SWISS-PROT protein sequence data bank.
Nucleic Acids Res., 20, 2019-2022.

Doolittle,R.F. (1987) Of URFs and ORFs: a primer on how to analyze derived amino
acid sequences. University Science Books, Mill Valley.

Fuchs,R. (1991) MacPattern: protein pattern searching on the Apple Macintosh.
Comput. Appl. Biosci., 7, 105-106.

Gribskov,M., McLachlan,A.D. and Eisenberg,D. (1987) Profile analysis: detection
of distantly related proteins. Proc. Natl. Acad. Sci. USA, 84, 4355-4358.

Gribskov,M., Luthy,R. and Eisenberg,D. (1990) Profile analysis. Methods
Enzymol., 183, 146-159.

Henikoff,S., Wallace,J.C. and Brown,J.P. (1990) Finding protein similarities
with nucleotide sequence databases. Methods Enzymol., 183, 111-132.

Henikoff,S. and Henikoff,J.G. (1991) Automated assembly of protein blocks for
database searching. Nucleic Acids Res., 19, 6565-6572.

Ogiwara,A., Uchiyama,I., Seto,Y. and Kanehisa,M. (1992) Construction of a
dictionary of sequence motifs that characterize groups of unrelated proteins.
Protein Eng., 5, 479-488.

Smith,H.O., Annau,T.M. and Chandrasegaran,S. (1990) Finding sequence motifs in
groups of functionally related proteins. Proc. Natl. Acad. Sci. USA, 87,
826-830.

Staden,R. (1990) Searching for patterns in protein and nucleic acid sequences.
Methods Enzymol., 183, 193-211.

Staden,R. (1991) Screening protein and nucleic acid sequences against libraries
of patterns. DNA Sequence, 1, 369-374.

Sternberg,M.J.E. (1991) PROMOT: a FORTRAN program to scan protein sequences
against a library of known motifs. Comput. Appl. Biosci., 7, 257-260.

Sternberg,M.J.E. and Islam,S.A. (1991) Protein sequences - homologies and
motifs. TIBTECH, 9, 300-302.

Wallace,J.C. and Henikoff,S. (1992) PATMAT: a searching and extraction program
for sequence, pattern and block queries and databases. Comput. Appl. Biosci., 8,
249-254.


13. Version history
-------------------

v1.0 (December 1992): First release
v1.0.1 (January 1993): GCG format support
v2.0 (May 1993): Added "fast" mode
v2.0.1(June 1993): High scoring sequence fragments are displayed in mixed case. A
        lower case character indicates that this residue does not occur in any sequence
        of the block.


14. BlockShow
-------------

A simple utility, BlockShow, is provided that facilitates access to individual
BLOCKS database entries.

Under (Open)VMS, BlockShow is compiled as part of the installation procedure.
On UNIX systems, use the command "cc -o blockshow blockshow.c" to obtain the
executable.

Usage: blockshow [-d database] acc# ...

Enter one or a list of accession numbers separated by blanks (note that the
program is case-sensitive!). The corresponding database entries are displayed
on the standard output stream.

By default, BlockShow looks for a database file called blocks_db. If the
database is centrally installed at your site, use the -d option to specify its
location:

blockshow -d $databases:[blocks]blocks.dat BL00011  BL00012  BL00013

Or simply edit config.h and change DEFAULT_BLOCKSDB before you compile
blockshow.c.