                                  User's Guide
                                    for the
                            Alignment Score Program
 
 
 
                         AAA   L      III    GGG   N   N
                        A   A  L       I    G   G  N   N
                        A   A  L       I    G      NN  N
                        A   A  L       I    G      N N N
                        AAAAA  L       I    G GGG  N  NN
                        A   A  L       I    G   G  N   N
                        A   A  LLLLL  III    GGG   N   N
 
 
 
 
 
 
 
                            NBR Report  820501-08710
                            Document Date  15-May-84
                             Program Version    1.1
 
 
 
 
 
 
 
 
 
 
                   B.C. Orcutt, M.O. Dayhoff, and W.C. Barker
 
 
 
 
                     Protein Identification Resource (PIR)
 
                    National Biomedical Research Foundation
                      Georgetown University Medical Center
                           Washington, D.C. 20007 USA
 
 
        ALIGN - Alignment Score Program                           Page 2
 
 
                              GENERAL DESCRIPTION
 
        Program ALIGN determines a best alignment of two protein or two
        nucleic acid sequences by computing the maximum match score
        using a version of the Needleman and Wunsch algorithm (1). An
        Alignment Score in standard deviation units is calculated by
        taking the difference between this maximum match score and the
        average maximum match score for random permutations of the two
        sequences and dividing by the standard deviation of the random
        scores.
 
        The best alignment between any pair of sequences is based on a
        computed numerical value. The contribution of each match of a
        residue or gap in one sequence with a residue or gap in the
        other sequence is accumulated to give a total score for the
        alignment. A scoring matrix defines the contribution assigned to
        each of the possible pairings between residues. A break penalty
        parameter, NPEN, can be assigned. A string of one or more
        consecutive residues in one sequence matched with a string of
        consecutive gaps (a break) within the other sequence contributes
        a score of -NPEN (independent of the length of the string).
        Residues matching a string of gaps at either end of a sequence
        are assigned a score of 0. A gap cannot be matched with a gap.
        Considering all possible alignments of the residues and any
        number of gaps, the basic algorithm of program ALIGN determines
        the maximum score possible and an alignment with that score.
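
         The scoring rules above can be illustrated with a short Python
         sketch. It only scores an alignment that has already been
         written out with gap characters; it does not search for the
         best alignment, and it is not the ALIGN source code. The names
         alignment_score and identity_score, the "-" gap character, and
         the toy scoring function are assumptions made for illustration.

```python
def identity_score(a, b):
    """Toy scoring function (assumption): 1 for identical residues, else 0."""
    return 1 if a == b else 0


def alignment_score(row1, row2, pair_score, npen, gap="-"):
    """Score two equal-length aligned rows under the rules described above."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    breaks = []  # each entry: [gapped_row, first_column, last_column]
    for col, (a, b) in enumerate(zip(row1, row2)):
        if a == gap and b == gap:
            raise ValueError("a gap cannot be matched with a gap")
        if a != gap and b != gap:
            total += pair_score(a, b)  # residue matched with residue
            continue
        row = 1 if a == gap else 2
        if breaks and breaks[-1][0] == row and breaks[-1][2] == col - 1:
            breaks[-1][2] = col  # extend the current break
        else:
            breaks.append([row, col, col])  # start a new break
    last_col = len(row1) - 1
    for _, first, last in breaks:
        # A break that touches either end of the alignment scores 0;
        # any other break costs NPEN once, whatever its length.
        if first > 0 and last < last_col:
            total -= npen
    return total


print(alignment_score("ACD-EF", "ACDGEF", identity_score, 2))  # internal break
print(alignment_score("--CDEF", "ABCDEF", identity_score, 2))  # end gap
```

         In this sketch a break of one gap and a break of two gaps cost
         the same, which mirrors the length-independent penalty.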
 
        The scoring matrix is constructed from an input matrix and a
        matrix bias parameter, B, that is added to all terms of the
        input matrix. The net effect of adding B is that the score for
        any given alignment is increased by B times the number of
         positions where a residue matches another residue. Positions
         where a residue is opposite a gap gain nothing, so increasing B
         will often produce maximum-scoring alignments with shorter
         overall lengths.
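
         The effect of the bias can be checked on a toy case. The Python
         sketch below is illustrative only: the two-letter alphabet, the
         dictionary representation of the matrix, and the function names
         add_bias and ungapped_score are assumptions, not part of ALIGN.

```python
def add_bias(input_matrix, bias):
    """Return a new scoring matrix with `bias` added to every term."""
    return {pair: value + bias for pair, value in input_matrix.items()}


def ungapped_score(seq1, seq2, matrix):
    """Sum the matrix terms for residues paired position by position."""
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


# Toy input matrix over the residues A and C (assumption for illustration).
toy = {(a, b): (1 if a == b else 0) for a in "AC" for b in "AC"}
biased = add_bias(toy, 2)

base = ungapped_score("ACCA", "ACAA", toy)
shifted = ungapped_score("ACCA", "ACAA", biased)
# Four residue-residue positions, so the score rises by B * 4.
print(shifted - base)
```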
 
        The maximum score, R, that is achieved by an alignment of a pair
        of real sequences is compared with the distribution of maximum
        scores for a large number (usually 100) of random permutations
        of the two sequences. The mean and standard deviation of this
        approximately normal distribution are M and D. The alignment
        score, A, is the number of standard deviations by which the
        maximum score for the real sequences exceeds the average maximum
        score for the random permutations:
 
                          A = (R-M)/D   (in SD units)
 
        The probability that a score as high as that from the real
        sequences could have been obtained in a comparison of randomized
        sequences can be determined from a table for the cumulative
        standardized normal distribution.
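
         As a worked illustration of the formula and the table lookup,
         the following Python sketch computes A from a real score and a
         list of random-run scores, then evaluates the upper tail of the
         standard normal distribution. The function names are
         hypothetical, the scores are made-up numbers, and the use of
         the sample standard deviation is an assumption; this guide does
         not say which estimator ALIGN uses.

```python
import math
import statistics


def alignment_score_sd(real_score, random_scores):
    """A = (R - M) / D, in standard deviation units."""
    m = statistics.mean(random_scores)
    d = statistics.stdev(random_scores)  # sample SD (assumption)
    return (real_score - m) / d


def upper_tail_probability(a):
    """P(Z >= a) for a standard normal variable Z."""
    return 0.5 * math.erfc(a / math.sqrt(2))


random_scores = [10, 12, 14]          # made-up values for illustration
a = alignment_score_sd(18, random_scores)
print(a, upper_tail_probability(a))
```

         The probability is only as good as the normal approximation,
         which the text above describes as approximate.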
 
        We have found the mutation data matrix (MD) to be the most
        satisfactory scoring matrix for detecting distant relationships
        between protein sequences (2,3). This matrix is based on amino
         acid replacements between present-day sequences and those
         inferred as common ancestors on evolutionary trees. The residues
        that did not change and the relative exposure of the sequences
        to mutational change were taken into account. A bias between 2
        and 20 is added depending on the evolutionary distance of the
        pair of proteins, so that the alignment approximates the length
        adjustments that actually occurred during evolution. Judicious
        choice of the break penalty parameter produces alignments with a
        reasonable number of breaks. A value of 2 is appropriate for
        very distant pairs where there have been many changes in length,
        whereas 12 or more is appropriate for comparison of a short
        segment that closely matches a portion of a longer sequence. The
        values of these two parameters also affect the average maximum
        score and the standard deviation obtained from the randomized
        sequences. When these parameters are varied, the alignment score
        exhibits a maximum, presumably near a position where the gaps
        reflect the actual genetic events.
 
        Two other scoring matrices are distributed with this program for
        use with protein sequences. The simplest of these, the unitary
        matrix (UP), assigns a value of 1 to identical residues and 0 to
        nonidentical ones. A slightly more complicated scoring system
        reflects the maximum possible number of identities in the
        nucleotides of the genes coding for the proteins. Identical
        amino acids obtain a score of 3; those for which two nucleotides
        could be identical, 2; one nucleotide, 1; and 0 if no
        nucleotides are ever shared in the codons for the amino acids.
        We refer to this as the genetic code matrix (GC).
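
         The genetic code scoring rule can be sketched in Python as the
         largest number of identical nucleotide positions over all pairs
         of codons for the two amino acids. The sketch assumes the
         standard genetic code and one-letter amino acid codes; the name
         genetic_code_score is hypothetical, and the GC matrix file
         distributed with ALIGN may treat special symbols differently.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = {}  # amino acid -> list of its codons (stop codons skipped)
for codon, aa in zip(("".join(p) for p in product(BASES, repeat=3)), AMINO):
    if aa != "*":
        CODONS.setdefault(aa, []).append(codon)


def genetic_code_score(aa1, aa2):
    """Maximum number of identical codon positions for two amino acids."""
    return max(
        sum(x == y for x, y in zip(c1, c2))
        for c1 in CODONS[aa1]
        for c2 in CODONS[aa2]
    )


for pair in ("MM", "FL", "MW", "MY"):
    print(pair, genetic_code_score(*pair))
```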
 
        A unitary matrix for use with nucleic acids (UN) has also been
        supplied. This matrix assigns a score of 1 to identical
        nucleotides, to the pair U/T, and to any pairs of ambiguous
        nucleotides that could possibly be identical. The scoring
        matrices are further described in a separate document (4).
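
         The rule for the nucleic acid unitary matrix can be sketched as
         a set-intersection test. The Python below assumes the IUPAC
         ambiguity codes, which may not be the codes used in the
         Nucleic Acid Sequence Database; the table and the name
         unitary_nucleic_score are illustrative assumptions.

```python
# Each symbol maps to the set of bases it can stand for (IUPAC, assumed).
# U is treated as T, so the pair U/T scores 1.
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def unitary_nucleic_score(x, y):
    """1 if the two symbols could denote the same nucleotide, else 0."""
    shared = set(AMBIGUITY[x.upper()]) & set(AMBIGUITY[y.upper()])
    return 1 if shared else 0


for pair in ("AA", "UT", "RG", "RY", "AC"):
    print(pair, unitary_nucleic_score(*pair))
```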
 
        The sequences to be compared can be obtained from those stored
        in the Protein Sequence Database, in the Nucleic Acid Sequence
        Database, or in user-created files. Separate documents (5,6)
        describe the formats for user-created sequence files. Coding
        regions of nucleic acid sequences can be translated and compared
         as proteins. When sequences are compared as nucleic acids, their
         amino acid translations can also be printed on the alignment.
 
 
 
                                 RUNNING ALIGN
 
        ALIGN is designed to be run in batch mode on a VAX/VMS system. A
        command procedure, ALIGN.COM, is distributed with the program to
        facilitate execution of the program. To submit an ALIGN job from
        the interactive mode, invoke the procedure by entering the
        following command line:
 
                 $ @ALIGN input-file
 
        where input-file is the file specification for the ALIGN input
         file. If the file type is omitted, it is assumed to be .DAT. The
        disposition of the output from ALIGN is determined by an
        assignment in the command procedure.
 
 
        The ALIGN input file must contain the following items in the
        order shown below.
 
             1.  Run Options (optional)
 
             2.  Run Title
 
             3.  Matrix Specification
 
             4.  Break Penalty
 
             5.  Number of Comparisons, Number of Random Runs
 
             6.  Sequence Specifications (A sequence specification must
                 be included for each sequence to be used in the
                 pairwise comparisons in the run. A maximum of 10
                 sequences can be compared in one run.)
 
 
 
         Multiple runs may be included in the input file. For each run
         after the first, insert a single blank line and then repeat
         items 1 to 6 (or items 2 to 6 if the options line is omitted).
         An option set on an options line remains in effect for all
         subsequent runs that omit the options line. When a later
         options line is processed, any option not listed on it reverts
         to its default value.
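
         The item order and the blank-line separator can be made
         concrete with a small generator script. This Python sketch is
         not distributed with ALIGN; the function names format_run and
         format_input_file and their parameters are hypothetical, and
         the sketch always writes the comma on the counts line, as the
         sample input files later in this guide do.

```python
def format_run(title, matrix, break_penalty, sequences,
               ncomp=None, nruns=0, options=()):
    """Return the lines of one ALIGN run in the required order."""
    if title.startswith("!"):
        raise ValueError("the run title must not begin with '!'")
    if len(sequences) > 10:
        raise ValueError("at most 10 sequences can be compared in one run")
    lines = []
    if options:                               # 1. run options (optional)
        lines.append("!" + ",".join(options))
    lines.append(title[:72])                  # 2. run title
    lines.append(matrix)                      # 3. matrix specification
    lines.append(str(break_penalty))          # 4. break penalty
    ncomp_text = "" if ncomp is None else str(ncomp)
    lines.append(f"{ncomp_text},{nruns}")     # 5. NCOMP,NRUNS
    lines.extend(sequences)                   # 6. sequence specifications
    return "\n".join(lines)


def format_input_file(runs):
    """Join runs with a single blank line between consecutive runs."""
    return "\n\n".join(runs) + "\n"


run = format_run("Align MS2 and Q-Beta coat proteins", "MD+2", 6,
                 ["GBPM2=PRO.SEQ", "GBPQB=PRO.SEQ"], nruns=100)
print(format_input_file([run]), end="")
```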
 
 
 
                               INPUT FILE FORMAT
 
        1. Run Options (optional)
 
        The first character on this line must be an exclamation mark (!)
        to indicate an options line. The list of desired options, in any
        order and separated by commas, must follow the mark. The options
        available are:
 
            SHORT       Short form of the output.
            MATRIX      Print the Maximum Score Matrix.
            BREAKS      Compute average number of breaks for the
                        random comparisons.
            ALIGN       Print the alignment for a random comparison.
            SCORES      Print the match scores for the random
                        comparisons. Only the first 100 scores
                        can be displayed.
            PROTEIN     Set the database to the Protein Sequence
                        Database (this is the default).
            NUCLEIC     Set the database to the Nucleic Acid
                        Sequence Database.
 
        2. Run Title
 
        Title for the ALIGN run. This line contains text that is used as
        the page heading in the output. The first character must not be
        an exclamation mark. Only the first 72 characters of the line
        are used for the run title.
 
        3. Matrix Specification
 
        The matrix specification is the file specification for a file
        that contains the scoring matrix. The file specification can be
        followed by an optional positive or negative integer constant,
        the matrix bias. The matrix bias is added to each element of the
        scoring matrix read from the file. The default file type is
        .MAT. The format for a matrix file is described in reference 4.
 
        4. Break Penalty
 
        An integer number: NPEN
 
        Each break (string of one or more consecutive gaps) in an
        aligned sequence is assigned a score of -NPEN, except when the
        break occurs at the beginning or at the end of the aligned
        sequence; in this case, the score assigned is zero.
 
        5. Number of Comparisons, Number of Random Runs
 
        Two integer numbers separated by a comma: NCOMP,NRUNS
 
        NCOMP is the number of pairwise comparisons between the input
        sequences. The first sequence is compared with each subsequent
        sequence. Then the second sequence is compared with each
         subsequent sequence, and so on, until NCOMP comparisons have
         been made or until all possible comparisons have been made. If
        NCOMP is zero or is omitted, all possible comparisons are made.
 
        NRUNS is the number of times that the two sequences are to be
         randomly permuted and matched. If NRUNS is zero or is omitted,
         no random runs are performed.
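
         The comparison order described for NCOMP matches the order in
         which pairs are generated by Python's itertools.combinations.
         The sketch below is an illustration under that reading; the
         function name comparison_pairs and the sequence codes are
         hypothetical.

```python
from itertools import combinations, islice


def comparison_pairs(codes, ncomp=0):
    """List the pairs compared, in order; ncomp=0 means all pairs."""
    pairs = combinations(codes, 2)  # first with each later one, then second...
    if ncomp:
        pairs = islice(pairs, ncomp)  # stop after NCOMP comparisons
    return list(pairs)


print(comparison_pairs(["SEQ1", "SEQ2", "SEQ3", "SEQ4"], ncomp=4))
```

         With four sequences and NCOMP set to 4, the last pair listed is
         the second sequence against the third.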
 
        6. Sequence Specification
 
        A sequence specification is a single line that contains one to
        four fields that must occur in the following order:
 
        A. CODE
           This is a four- to six-character code that identifies a
           sequence.
 
        B. RESIDUES (optional)
           If this field is omitted, the entire sequence is used in the
           analysis. Otherwise, this field consists of a pair of numbers
           enclosed in parentheses and separated by a dash; for example,
           (26-343). These numbers specify the first and the last
           residues of the fragment to be extracted from the sequence
           and used in the analysis. Either number may be omitted,
           causing its value to be assigned by default. The default
           value for the first number is 1; for example, (-343) means
           residues 1 to 343. The default value for the second number is
           the length of the sequence; for example, (26-) means residues
           26 to the end of the sequence.
 
        C. FILE (optional)
           If this field is omitted, the sequence is retrieved from the
           database files. Otherwise, this field contains the file
           specification of a user-created file in which the sequence to
           be input is stored. The file specification must be separated
           from the CODE and RESIDUES fields by an equals sign. For
           example, TRYP(37-170)=TRYPSIN.SEQ
           The default file type is .SEQ. The format for a sequence file
           is described in references 5 and 6.
 
        D. OPTIONS (optional)
           The ALIGN program recognizes two options that can be
           specified on the sequence specification line:
           /P   This option instructs the program to translate the
                sequence, assumed to be a nucleic acid sequence, to a
                protein sequence and to use the protein sequence in the
                analysis.
           /M   This option can be used when comparing nucleic acid
                sequences. It instructs the program to print the
                translated protein sequences on the alignment of the
                nucleic acid sequences. If /M is specified, then either
                /1, /2, or /3 must also be added to indicate the reading
                frame for the protein.
 
           For example, the sequence specification GBPM2(1678-1902)/P
            causes sequence GBPM2 to be read from the nucleic acid
            database, the 225-residue fragment extending from 1678 to
           1902 to be extracted and translated to a protein sequence of
           75 amino acids.
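
         The field layout of a sequence specification can be sketched
         as a regular expression. This Python parser is an illustration
         only: the pattern, the assumption that a CODE consists of
         letters and digits, and the name parse_sequence_spec are not
         taken from ALIGN, which may accept or reject lines differently.

```python
import re

SPEC = re.compile(
    r"^(?P<code>[A-Za-z0-9]{4,6})"             # A. CODE (charset assumed)
    r"(?:\((?P<first>\d*)-(?P<last>\d*)\))?"   # B. RESIDUES, e.g. (26-343)
    r"(?:=(?P<file>[^/\s]+))?"                 # C. FILE, after an equals sign
    r"(?P<options>(?:/[A-Za-z0-9]+)*)$"        # D. OPTIONS, e.g. /M/1
)


def parse_sequence_spec(line):
    """Split one sequence specification line into its four fields."""
    m = SPEC.match(line.strip())
    if m is None:
        raise ValueError(f"not a sequence specification: {line!r}")
    options = [opt.upper() for opt in m["options"].split("/") if opt]
    if "M" in options and not {"1", "2", "3"} & set(options):
        raise ValueError("/M requires a reading frame: /1, /2, or /3")
    return {
        "code": m["code"],
        "first": int(m["first"]) if m["first"] else 1,     # default is 1
        "last": int(m["last"]) if m["last"] else None,     # None: to the end
        "file": m["file"],                                 # None: database
        "options": options,
    }


print(parse_sequence_spec("GBPM2(1678-1902)/P"))
print(parse_sequence_spec("TRYP(37-170)=TRYPSIN.SEQ"))
```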
 
        ALIGN can read in any sequence up to 60,000 residues long.
        However, the fragment extracted from the sequence for use in the
        analysis must not exceed 400 residues. When the /P option is
        used, the 400-residue limit applies to the protein translation,
        not the nucleotide sequence from which it is derived. Further,
        the /P option requires the !NUCLEIC option together with a
        protein scoring matrix.
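
         The size limits can be checked before a job is submitted. The
         Python sketch below is a hedged illustration: the constant and
         function names are hypothetical, and the use of integer
         division for a nucleotide fragment whose length is not a
         multiple of three is an assumption, since this guide does not
         say how ALIGN treats leftover nucleotides.

```python
MAX_FRAGMENT = 400  # residues used in the analysis, from the text above


def fragment_length(first, last, translate=False):
    """Residues entering the analysis for the fragment first..last."""
    n = last - first + 1
    # With /P the limit applies to the protein translation (3 bases each).
    return n // 3 if translate else n


def fragment_ok(first, last, translate=False):
    """True if the fragment fits within the 400-residue limit."""
    return fragment_length(first, last, translate) <= MAX_FRAGMENT


print(fragment_length(1678, 1902), fragment_length(1678, 1902, translate=True))
print(fragment_ok(1, 1200), fragment_ok(1, 1200, translate=True))
```

         The first line reproduces the GBPM2(1678-1902)/P example: 225
         nucleotides translated to 75 amino acids.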
 
 
 
                               SAMPLE INPUT FILES
 
        The numbers on the left are line numbers; they are not a part of
        the file.
 
 
        ALIGN1.DAT
 
          1   !NUCLEIC
          2   Align MS2 and Q-Beta coat protein genes
          3   UN+2
          4   2
          5   ,100
          6   GBPM2(1335-1724)=NUC.SEQ/M/1
          7   GBPQB(99-338)=NUC.SEQ/M/1
 
 
 
 
 
        ALIGN2.DAT
 
          1   Align MS2 and Q-Beta coat proteins
          2   MD+2
          3   6
          4   ,100
          5   GBPM2=PRO.SEQ
          6   GBPQB=PRO.SEQ
 
 
 
                                   REFERENCES
 
        1. Needleman, S.B., and Wunsch, C.D., "A general method
           applicable to the search for similarities in the amino acid
           sequence of two proteins," J. Mol. Biol. 48, 443-453, 1970.
        2. Dayhoff, M.O., Schwartz, R.M., and Orcutt, B.C., "A model of
           evolutionary change in proteins," in Atlas of Protein
            Sequence and Structure, vol.5, suppl.3, pp.345-352 (M.O.
           Dayhoff, ed.). National Biomedical Research Foundation,
           Washington, D.C., 1979.
        3. Schwartz, R.M., and Dayhoff, M.O., "Matrices for detecting
           distant relationships," in Atlas of Protein Sequence and
           Structure, vol.5, suppl.3, pp.353-358 (M.O. Dayhoff, ed.).
           National Biomedical Research Foundation, Washington, D.C.,
           1979.
        4. Orcutt, B.C., and Dayhoff, M.O., Scoring Matrices. NBR Report
           820541-08710. National Biomedical Research Foundation,
           Washington, D.C., 1982.
        5. Orcutt, B.C., and Dayhoff, M.O., Nucleic Acid Sequence
           Database: Sequence File Format. NBR Report 820530-08710.
           National Biomedical Research Foundation, Washington, D.C.,
           1982.
        6. Orcutt, B.C., and Dayhoff, M.O., Protein Sequence Database:
           Sequence File Format. NBR Report 820535-08710. National
           Biomedical Research Foundation, Washington, D.C., 1982.