From:     Jnet%"BISANCE@FRCITI51" 14-FEB-1990 09:07:43.18
To:       BRUNIE@FRPOLY52
CC:
Subj:

Received: From FRCITI51(BISANCE) by FRPOLY52 with Jnet id 0048 for BRUNIE@FRPOLY52; Wed, 14 Feb 90 09:07 GMT
Date:     14 FEB 90 09:05:54.82-GMT
From:     BISANCE@FRCITI51
Subject:
To:       BRUNIE@FRPOLY52

User's Guide for the Alignment Score Program

     AAA   L      III   GGG   N   N
    A   A  L       I   G   G  N   N
    A   A  L       I   G      NN  N
    A   A  L       I   G      N N N
    AAAAA  L       I   G GGG  N  NN
    A   A  L       I   G   G  N   N
    A   A  LLLLL  III   GGG   N   N

NBR Report 820501-08710
Document Date 15-May-84
Program Version 1.1

B.C. Orcutt, M.O. Dayhoff, and W.C. Barker

Protein Identification Resource (PIR)
National Biomedical Research Foundation
Georgetown University Medical Center
Washington, D.C. 20007 USA

ALIGN - Alignment Score Program

GENERAL DESCRIPTION

Program ALIGN determines a best alignment of two protein or two nucleic acid sequences by computing the maximum match score using a version of the Needleman and Wunsch algorithm (1). An Alignment Score in standard deviation units is calculated by taking the difference between this maximum match score and the average maximum match score for random permutations of the two sequences and dividing by the standard deviation of the random scores.

The best alignment between any pair of sequences is based on a computed numerical value. The contribution of each match of a residue or gap in one sequence with a residue or gap in the other sequence is accumulated to give a total score for the alignment. A scoring matrix defines the contribution assigned to each of the possible pairings between residues. A break penalty parameter, NPEN, can be assigned. A string of one or more consecutive residues in one sequence matched with a string of consecutive gaps (a break) within the other sequence contributes a score of -NPEN (independent of the length of the string). Residues matching a string of gaps at either end of a sequence are assigned a score of 0. A gap cannot be matched with a gap.
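The scoring rules above (pair scores taken from a matrix, a length-independent penalty of NPEN for each internal break, and free breaks at either end of a sequence) can be illustrated with a small dynamic-programming sketch. This is not the program's own code: the function name max_match_score, the three-table formulation, and the unitary score function are assumptions chosen only for illustration.

```python
NEG_INF = float("-inf")


def max_match_score(a, b, score, npen):
    """Maximum alignment score of sequences a and b.

    score(x, y) is the matrix entry for residues x and y; npen is the
    break penalty. Hypothetical helper, not part of ALIGN itself.
    """
    n, m = len(a), len(b)
    # M: a[i-1] is paired with b[j-1]
    # X: a[i-1] is opposite a gap in b
    # Y: b[j-1] is opposite a gap in a
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:
                # A break in b is free before b's first residue or
                # after its last one; otherwise opening it costs npen.
                cost = 0 if j in (0, m) else npen
                X[i][j] = max(M[i - 1][j] - cost,
                              Y[i - 1][j] - cost,
                              X[i - 1][j])  # extending a break is free
            if j > 0:
                cost = 0 if i in (0, n) else npen
                Y[i][j] = max(M[i][j - 1] - cost,
                              X[i][j - 1] - cost,
                              Y[i][j - 1])
            if i > 0 and j > 0:
                M[i][j] = score(a[i - 1], b[j - 1]) + max(
                    M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
    return max(M[n][m], X[n][m], Y[n][m])


def unitary(x, y):
    """Unitary scoring: 1 for identical residues, 0 otherwise."""
    return 1 if x == y else 0
```

For example, max_match_score("AACC", "AAGGCC", unitary, 1) considers the alignment AA--CC / AAGGCC, whose four identities are offset by one internal break. A gap is never matched with a gap because every step of the recurrence consumes a residue from at least one sequence.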
Considering all possible alignments of the residues and any number of gaps, the basic algorithm of program ALIGN determines the maximum score possible and an alignment with that score.

The scoring matrix is constructed from an input matrix and a matrix bias parameter, B, that is added to all terms of the input matrix. The net effect of adding B is that the score for any given alignment is increased by B times the number of positions where a residue matches another residue. Increasing B will often produce maximum scoring alignments with shorter overall lengths.

The maximum score, R, that is achieved by an alignment of a pair of real sequences is compared with the distribution of maximum scores for a large number (usually 100) of random permutations of the two sequences. The mean and standard deviation of this approximately normal distribution are M and D. The alignment score, A, is the number of standard deviations by which the maximum score for the real sequences exceeds the average maximum score for the random permutations:

     A = (R-M)/D   (in SD units)

The probability that a score as high as that from the real sequences could have been obtained in a comparison of randomized sequences can be determined from a table for the cumulative standardized normal distribution.

We have found the mutation data matrix (MD) to be the most satisfactory scoring matrix for detecting distant relationships between protein sequences (2,3). This matrix is based on amino acid replacements between present-day sequences and those inferred as common ancestors on evolutionary trees. The residues that did not change and the relative exposure of the sequences to mutational change were taken into account. A bias between 2 and 20 is added depending on the evolutionary distance of the pair of proteins, so that the alignment approximates the length adjustments that actually occurred during evolution.
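The alignment score A = (R-M)/D defined above can be sketched as follows. This is a minimal illustration under stated assumptions, not the program's code: the names alignment_score, normal_upper_tail, and identities are hypothetical, the population form of the standard deviation is an assumption (the guide does not say which form is used), and identities is only a stand-in for a real maximum-score routine.

```python
import math
import random
import statistics


def alignment_score(seq_a, seq_b, max_score, nruns=100, seed=None):
    """Return (A, R, M, D), where A = (R - M) / D.

    max_score(x, y) must return the maximum match score of two
    sequences; nruns must be at least 1.
    """
    rng = random.Random(seed)
    r = max_score(seq_a, seq_b)
    random_scores = []
    for _ in range(nruns):
        # Permute both sequences: composition is kept, order is lost.
        perm_a = rng.sample(list(seq_a), len(seq_a))
        perm_b = rng.sample(list(seq_b), len(seq_b))
        random_scores.append(max_score(perm_a, perm_b))
    m = statistics.mean(random_scores)
    d = statistics.pstdev(random_scores)  # assumption: population form
    if d == 0:
        raise ValueError("random scores have zero spread; A is undefined")
    return (r - m) / d, r, m, d


def normal_upper_tail(a):
    """P(Z >= a) for a standard normal variable Z."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))


def identities(x, y):
    """Stand-in scorer: identical residues at equal positions, no gaps."""
    return sum(p == q for p, q in zip(x, y))
```

Calling alignment_score(a, b, identities, nruns=100, seed=1) returns the score together with R, M, and D. Passing the resulting A to normal_upper_tail gives the tail probability that the guide reads from a normal table, which is meaningful only to the extent that the random scores are approximately normal.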
Judicious choice of the break penalty parameter produces alignments with a reasonable number of breaks. A value of 2 is appropriate for very distant pairs where there have been many changes in length, whereas 12 or more is appropriate for comparison of a short segment that closely matches a portion of a longer sequence. The values of these two parameters also affect the average maximum score and the standard deviation obtained from the randomized sequences. When these parameters are varied, the alignment score exhibits a maximum, presumably near a position where the gaps reflect the actual genetic events.

Two other scoring matrices are distributed with this program for use with protein sequences. The simplest of these, the unitary matrix (UP), assigns a value of 1 to identical residues and 0 to nonidentical ones. A slightly more complicated scoring system reflects the maximum possible number of identities in the nucleotides of the genes coding for the proteins. Identical amino acids obtain a score of 3; those for which two nucleotides could be identical, 2; one nucleotide, 1; and 0 if no nucleotides are ever shared in the codons for the amino acids. We refer to this as the genetic code matrix (GC).

A unitary matrix for use with nucleic acids (UN) has also been supplied. This matrix assigns a score of 1 to identical nucleotides, to the pair U/T, and to any pairs of ambiguous nucleotides that could possibly be identical. The scoring matrices are further described in a separate document (4).

The sequences to be compared can be obtained from those stored in the Protein Sequence Database, in the Nucleic Acid Sequence Database, or in user-created files. Separate documents (5,6) describe the formats for user-created sequence files. Coding regions of nucleic acid sequences can be translated and compared as proteins. When compared as nucleic acids, the amino acid translations can also appear on the alignment.
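The rule behind the genetic code matrix (GC) described above can be reproduced from the standard genetic code. The sketch below is an assumption-laden illustration, not the matrix file distributed with the program: the names CODONS and gc_score are hypothetical, the standard codon table is assumed, stop codons are left out, and the distributed matrix may treat special cases differently.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid (one-letter code) to the list of its codons.
CODONS = {}
for (b1, b2, b3), aa in zip(product(BASES, repeat=3), AMINO):
    if aa != "*":  # stop codons do not code for a residue
        CODONS.setdefault(aa, []).append(b1 + b2 + b3)


def gc_score(x, y):
    """Largest number of identical codon positions for residues x, y."""
    return max(
        sum(p == q for p, q in zip(c1, c2))
        for c1 in CODONS[x]
        for c2 in CODONS[y]
    )
```

Under these assumptions gc_score("M", "M") is 3, gc_score("F", "L") is 2 (TTT against TTA), and gc_score("M", "Y") is 0 because ATG shares no position with TAT or TAC.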
RUNNING ALIGN

ALIGN is designed to be run in batch mode on a VAX/VMS system. A command procedure, ALIGN.COM, is distributed with the program to facilitate execution of the program. To submit an ALIGN job from the interactive mode, invoke the procedure by entering the following command line:

     $ @ALIGN input-file

where input-file is the file specification for the ALIGN input file. If the file type is omitted it is assumed to be .DAT. The disposition of the output from ALIGN is determined by an assignment in the command procedure.

The ALIGN input file must contain the following items in the order shown below.

     1. Run Options (optional)
     2. Run Title
     3. Matrix Specification
     4. Break Penalty
     5. Number of Comparisons, Number of Random Runs
     6. Sequence Specifications (A sequence specification must be included for each sequence to be used in the pairwise comparisons in the run. A maximum of 10 sequences can be compared in one run.)

Multiple runs may be included in the input file. For each run after the first, insert a single blank line and then repeat items 1 to 6 (or 2 to 6). An option set on an options line remains in effect on all subsequent runs in which the options line is omitted or until another options line is processed. If the option is not listed on an options line, it reverts to its default value.

INPUT FILE FORMAT

1. Run Options (optional)

   The first character on this line must be an exclamation mark (!) to indicate an options line. The list of desired options, in any order and separated by commas, must follow the mark. The options available are:

   SHORT    Short form of the output.
   MATRIX   Print the Maximum Score Matrix.
   BREAKS   Compute average number of breaks for the random comparisons.
   ALIGN    Print the alignment for a random comparison.
   SCORES   Print the match scores for the random comparisons. Only the first 100 scores can be displayed.
   PROTEIN  Set the database to the Protein Sequence Database (this is the default).
   NUCLEIC  Set the database to the Nucleic Acid Sequence Database.

2. Run Title

   Title for the ALIGN run. This line contains text that is used as the page heading in the output. The first character must not be an exclamation mark. Only the first 72 characters of the line are used for the run title.

3. Matrix Specification

   The matrix specification is the file specification for a file that contains the scoring matrix. The file specification can be followed by an optional positive or negative integer constant, the matrix bias. The matrix bias is added to each element of the scoring matrix read from the file. The default file type is .MAT. The format for a matrix file is described in reference 4.

4. Break Penalty

   An integer number:

     NPEN

   Each break (string of one or more consecutive gaps) in an aligned sequence is assigned a score of -NPEN, except when the break occurs at the beginning or at the end of the aligned sequence; in this case, the score assigned is zero.

5. Number of Comparisons, Number of Random Runs

   Two integer numbers separated by a comma:

     NCOMP,NRUNS

   NCOMP is the number of pairwise comparisons between the input sequences. The first sequence is compared with each subsequent sequence. Then the second sequence is compared with each subsequent sequence, and so on, until NCOMP comparisons have been made or until all possible comparisons have been made. If NCOMP is zero or is omitted, all possible comparisons are made.

   NRUNS is the number of times that the two sequences are to be randomly permuted and matched. If NRUNS is zero or is omitted no random runs are performed.

6. Sequence Specification

   A sequence specification is a single line that contains one to four fields that must occur in the following order:

   A. CODE

      This is a four- to six-character code that identifies a sequence.

   B. RESIDUES (optional)

      If this field is omitted, the entire sequence is used in the analysis. Otherwise, this field consists of a pair of numbers enclosed in parentheses and separated by a dash; for example, (26-343). These numbers specify the first and the last residues of the fragment to be extracted from the sequence and used in the analysis. Either number may be omitted, causing its value to be assigned by default. The default value for the first number is 1; for example, (-343) means residues 1 to 343. The default value for the second number is the length of the sequence; for example, (26-) means residues 26 to the end of the sequence.

   C. FILE (optional)

      If this field is omitted, the sequence is retrieved from the database files. Otherwise, this field contains the file specification of a user-created file in which the sequence to be input is stored. The file specification must be separated from the CODE and RESIDUES fields by an equals sign. For example,

        TRYP(37-170)=TRYPSIN.SEQ

      The default file type is .SEQ. The format for a sequence file is described in references 5 and 6.

   D. OPTIONS (optional)

      The ALIGN program recognizes two options that can be specified on the sequence specification line:

      /P   This option instructs the program to translate the sequence, assumed to be a nucleic acid sequence, to a protein sequence and to use the protein sequence in the analysis.

      /M   This option can be used when comparing nucleic acid sequences. It instructs the program to print the translated protein sequences on the alignment of the nucleic acid sequences. If /M is specified, then either /1, /2, or /3 must also be added to indicate the reading frame for the protein.

   For example, the sequence specification

        GBPM2(1678-1902)/P

   causes sequence GBPM2 to be read from the nucleic acid database, the 225-residue fragment extending from 1678 to 1902 to be extracted and translated to a protein sequence of 75 amino acids.
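The four fields of a sequence specification can be picked apart with a short parser. The sketch below is illustrative only and does not reflect how ALIGN reads its input: the name parse_sequence_spec, the characters accepted in a CODE, and the way the default .SEQ file type is applied are all assumptions.

```python
import re

SPEC_RE = re.compile(
    r"^(?P<code>[A-Za-z0-9$_]{4,6})"          # CODE (character set assumed)
    r"(?:\((?P<first>\d*)-(?P<last>\d*)\))?"  # optional (first-last)
    r"(?:=(?P<file>[^/\s]+))?"                # optional =file
    r"(?P<opts>(?:/[A-Za-z0-9]+)*)$"          # optional /P, /M, /1, /2, /3
)


def parse_sequence_spec(line):
    """Split one sequence specification line into its fields."""
    match = SPEC_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sequence specification: {line!r}")
    first, last, file_spec = match["first"], match["last"], match["file"]
    options = match["opts"].upper().split("/")[1:]
    if "M" in options and not {"1", "2", "3"} & set(options):
        raise ValueError("/M needs a reading frame: /1, /2, or /3")
    # Apply the default file type when the file name has no type.
    if file_spec and "." not in file_spec.rsplit("]", 1)[-1]:
        file_spec += ".SEQ"
    return {
        "code": match["code"],
        "first": int(first) if first else 1,   # default: residue 1
        "last": int(last) if last else None,   # None: end of sequence
        "file": file_spec,                     # None: use the database
        "options": options,
    }
```

For the example above, parse_sequence_spec("GBPM2(1678-1902)/P") yields first = 1678 and last = 1902, so the fragment holds 1902 - 1678 + 1 = 225 nucleotides, or 75 codons.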
ALIGN can read in any sequence up to 60,000 residues long. However, the fragment extracted from the sequence for use in the analysis must not exceed 400 residues. When the /P option is used, the 400-residue limit applies to the protein translation, not the nucleotide sequence from which it is derived. Further, the /P option requires the !NUCLEIC option together with a protein scoring matrix.

SAMPLE INPUT FILES

The numbers on the left are line numbers; they are not a part of the file.

ALIGN1.DAT

     1  !NUCLEIC
     2  Align MS2 and Q-Beta coat protein genes
     3  UN+2
     4  2
     5  ,100
     6  GBPM2(1335-1724)=NUC.SEQ/M/1
     7  GBPQB(99-338)=NUC.SEQ/M/1

ALIGN2.DAT

     1  Align MS2 and Q-Beta coat proteins
     2  MD+2
     3  6
     4  ,100
     5  GBPM2=PRO.SEQ
     6  GBPQB=PRO.SEQ

REFERENCES

1. Needleman, S.B., and Wunsch, C.D., "A general method applicable to the search for similarities in the amino acid sequence of two proteins," J. Mol. Biol. 48, 443-453, 1970.

2. Dayhoff, M.O., Schwartz, R.M., and Orcutt, B.C., "A model of evolutionary change in proteins," in Atlas of Protein Sequence and Structure, vol.5, suppl.3, pp.345-352 (M.O. Dayhoff, ed.). National Biomedical Research Foundation, Washington, D.C., 1979.

3. Schwartz, R.M., and Dayhoff, M.O., "Matrices for detecting distant relationships," in Atlas of Protein Sequence and Structure, vol.5, suppl.3, pp.353-358 (M.O. Dayhoff, ed.). National Biomedical Research Foundation, Washington, D.C., 1979.

4. Orcutt, B.C., and Dayhoff, M.O., Scoring Matrices. NBR Report 820541-08710. National Biomedical Research Foundation, Washington, D.C., 1982.

5. Orcutt, B.C., and Dayhoff, M.O., Nucleic Acid Sequence Database: Sequence File Format. NBR Report 820530-08710. National Biomedical Research Foundation, Washington, D.C., 1982.

6. Orcutt, B.C., and Dayhoff, M.O., Protein Sequence Database: Sequence File Format. NBR Report 820535-08710. National Biomedical Research Foundation, Washington, D.C., 1982.