Doc CLUSTAL.HLP                                                     1/9/89

Des Higgins
Genetics Department,
Trinity College,
Dublin 2
Ireland.

dHiggins@vax1.tcd.ie
(irl.) 1-772941 ext. 1969

Higgins, D.G. and Sharp, P.M. (1988) Clustal: a package for performing
multiple sequence alignment on a microcomputer. Gene, vol. 73, pp. 237-244.

Higgins, D.G. and Sharp, P.M. (1989) Fast and sensitive multiple sequence
alignments on a microcomputer. CABIOS, vol. 5, pp. 151-153.

>>HELP<< 1           General Help ..... from main menu

CLUSTAL is a program for performing multiple alignments of up to 100 DNA or
protein sequences of up to 5000 residues (including gaps in the final
alignment).

By now, you should already have specified a sequence data set to be used.
If not, go to item 1 on the menu and give the name of a file in the correct
format. Help is available there to explain what that format should be.
Then, if you go to item 2 on the menu, the complete multiple alignment
process will be applied and the alignment will be sent to a file.

The alignments are carried out in 3 stages:
1) all pairwise similarity scores between all the sequences are calculated;
2) the similarity matrix is used to cluster the sequences using UPGMA
   cluster analysis;
3) the final multiple alignment is performed by gradually aligning groups
   of sequences, according to the branching order in the dendrogram.

Two files are usually produced during the alignment process:
1) a file containing a description of the dendrogram;
2) a file for the multiple alignment.

You can use item 3 on the menu to specify that only the dendrogram is to be
produced. This is useful for large data sets (many sequences) where the
dendrogram takes a long time to produce. This dendrogram file can later be
used as input (item 4 on the menu). This is useful because you only have to
produce a dendrogram file once if you wish to experiment with different GAP
parameters in the multiple alignment.

Item 5 on the main menu is useful for aligning 2 sequences from a dataset.
The sequences are numbered from 1 to N (N is the number of sequences) and
you are asked to specify 2 of them.

There are 2 lots of parameters that control the speed/sensitivity/
"gappiness" of the alignments. Go to item 6 on the menu to see what these
are or to change them. Help is available under item 6 to explain what these
parameters do.

Use item 7 to view text files, for example output files from previous
alignments.

Details of CLUSTAL have been published in:

Higgins, D.G. and Sharp, P.M. (1988) Clustal: a package for performing
multiple sequence alignment on a microcomputer. Gene, vol. 73, pp. 237-244.

Higgins, D.G. and Sharp, P.M. (1989) Fast and sensitive multiple sequence
alignments on a microcomputer. CABIOS, vol. 5, pp. 151-153.

Des Higgins,
Genetics Department,
Trinity College,
Dublin 2,
Ireland

e-mail: dHiggins@vax1.tcd.ie

>>HELP<< 2           Help for the alignment parameters

There are 2 groups of parameters used in the alignments:
1) the pairwise similarity score parameters;
2) the multiple alignment parameters.

          PAIRWISE SIMILARITY SCORE PARAMETERS

All pairs of sequences are taken and a similarity score is calculated for
each pair. The score is approximately the number of identical residues
between the 2 sequences, minus a fixed penalty for each gap needed to align
them. The method used is that of Wilbur and Lipman (P.N.A.S., 1983). The
method works by finding a fast approximate global alignment between 2
sequences. This is done using 2 techniques:
1) only exactly matching fragments (k-tuples) are considered;
2) using an imaginary dot-matrix plot, only those diagonals with a high
   number of matches, and a "window" around these, are considered.

If sequences are very dissimilar, you may get similarity scores of zero.
This is due to the approximate nature of the alignments. It can be fixed by
using slower (more sensitive) parameters but, unless there are very many
zero scores, this will not matter.

1) K-TUPLE SIZE
This is the size of the exactly matching fragment that is used.
The larger this is set to (max= 2 for proteins; max= 4 for DNA), the faster
but more approximate the alignment will be. For short sequences (e.g. 300
residues or less) or for small numbers of sequences (less than 20) a value
of 1 will be fine; for longer sequences (especially DNA) larger values
might be used.

2) GAP PENALTY
This parameter controls the frequency of gaps in the pairwise alignments.
It will not have much effect on the scores. The higher the gap penalty, the
less likely gaps are. The penalty specifies the number of exactly matching
residues that must be found by introducing a gap for the gap to be
worthwhile.

3) FILTERING LEVEL
Consider an imaginary dot-matrix plot between 2 sequences. The number of
k-tuple matches along each diagonal is counted. Then the mean and standard
deviation of the matches on the diagonals are calculated. Then, only those
diagonals whose number of matches is at least "filtering level" standard
deviations above the mean number of matches are considered. These diagonals
are the ones with most matches and are hence the most interesting for an
alignment. Increasing this parameter will speed up the alignments but will
result in more scores of zero.

4) WINDOW SIZE
After the diagonals with most matches are found, this parameter specifies a
window around each that will be used in the alignment. Decreasing this
parameter will speed up the alignments.

5) SIMILARITY SCORES = PERCENTAGE or ABSOLUTE
If you choose percentage, then the scores are calculated as percentage
matches between 2 sequences (approximately) i.e.
(score/shorter length) * 100; if you choose absolute scores then the scores
are simply the number of matches. Percentage scores are advisable if the
lengths differ greatly. Absolute scores are better otherwise.

          MULTIPLE ALIGNMENT PARAMETERS

These parameters control the final multiple alignment. There are 2 gap
penalty parameters and 1 for whether transitions are weighted in DNA
alignments.
The basic algorithm attempts to minimise the distance between groups of
sequences. By default, identical DNA residues have a distance of 0 and
different ones have a distance of 10; if transitions are weighted then
transitions have a distance of 5. For amino acids, a Dayhoff PAM matrix is
used.

6) GAP PENALTY (Fixed)
This parameter is a penalty for every gap that is introduced, regardless of
the length of the gap. Therefore, decreasing this parameter will encourage
gaps of all sizes. BEWARE: if you make this too small (approx. 5 or so)
then the program may prefer to align each sequence opposite a long gap.

7) GAP PENALTY (Varying)
This parameter is a penalty for each item in each gap. Therefore, this is a
penalty for longer gaps. Increase this and gaps will get shorter. BEWARE:
if you make this too small (approx. 5 or so) then the program may prefer to
align each sequence opposite a long gap.

8) TRANSITIONS = WEIGHTED or UNWEIGHTED
If transitions are unweighted, then all nucleic acid mismatches have the
same weight. If transitions are weighted then transitions (C vs T; A vs G)
have an intermediate score between exact matches and other mismatches.

>>HELP<< 3           Help on the sequence input format

Two input formats are allowed:
1) each sequence is in a separate file, each with a TITLE in line one and
   the sequence in free format on line 2 onwards; the input is a file of
   file names. Don't forget the title in line 1 of each sequence file.
2) all sequences are in a single file. Each sequence is delimited by a >
   character in column 1 (same as FASTP format). Each line beginning with
   a > is treated as a TITLE for the following sequence.

Sequences are entered in free format (gaps and punctuation marks are
ignored) with a maximum of 120 residues per line. Maximum allowed sequence
length is 5000 residues (including gaps in the final alignment). Upper or
lower case may be used. The 1 letter code is used for amino acids; for DNA
U = T. No ambiguity codes are used.
Residues are either a valid amino acid or nucleotide or unknown. The
program will attempt to determine (from the sequence composition) whether
the sequences are DNA/RNA or protein.

EXAMPLE:

> Mouse nose drying factor
ACGTAGCTAGCACTTAGCTAGCTGCTACGTAGCTAGCTAGCTACGCT
TAGTAGCTACGCGTAGTGTAGCATCGATGCTAGCTGATCGATGCTAGCTGACT
GATGCTATCGACTGATCG
> Rabbit Guinness receptor
CACGTCGCAGCTGCCTAGCCGGGGGATGATACGCTGTATATCGGATTATATGCGCGCGATGCTA
AGGATGCTACGCGTCAGTCGCTAGGCGCAGTAGCGCTAG
> Hamster interferon alpha
GACGCATGCATTTACGCTCGATCAGFTCGACATCAGCAGAGAT
GCATCAGATCAGCATCAGCATACGACAGCAGATACGA
etc. etc.

>>HELP<< 4           Help for the dendrogram file

The intermediate dendrograms used by CLUSTAL are stored in files. The
default file name is derived from the sequence input file, with the
extension .DND added on. The dendrogram is a description of the relatedness
of the sequences in the data set. You can use a dendrogram file immediately
for a multiple alignment or use it again at a later date, to save having to
generate it again. Also, you can edit the file to change the branching
order of the sequences.

Example dendrogram for 7 sequences:

  2  122.00  0  0  1200000     each row represents 1 cluster. There are
  3  102.00  1  0  1120000     N-1 clusters for N sequences.
  2   96.00  0  0  0001200
  5   84.83  2  3  1112200
  6   77.60  4  0  1111120
  7   31.50  5  0  1111112

There are 5 columns of numbers. The first column states how many sequences
are clustered at any stage e.g. in row 1, two sequences are joined. In row
6, all 7 sequences are joined. The second column of figures (with the
decimal points) gives the similarity scores at which the sequences in that
cluster join e.g. in row 1, the two sequences join at a level of 122; in
the final row, the 7 sequences join at a level of 31.5. The final block of
0s, 1s and 2s represents the sequences joining together in each cluster.
Each column is a sequence; each row is a cluster. In each cluster, the
sequences marked with a 1 join with those marked with a 2 e.g.
in the first cluster, sequences 1 and 2 join; in the fourth cluster
sequences 1, 2 and 3 join with sequences 4 and 5. The 2 remaining columns
of figures (columns 3 and 4) are pointers to the rows that the two groups
of sequences, joined at this level, come from e.g. in cluster 4 (row 4) the
2 groups of sequences come from rows 2 and 3.

>>HELP<< 5           Help for the alignment file

The final multiple alignment is sent to a file whose name is derived from
the sequence input file with the addition of the ending: .ALN . The output
is self-explanatory. Positions where all residues are identical are marked
with a star ( * ) and, for proteins, positions where all residues are
"similar" are marked with a dot ( . ). Residues count as "similar" if they
have a score of 8 or greater in a Dayhoff PAM matrix of amino-acid
similarity.

EXAMPLE:

 *  :=>  match across all seqs.
 .  :=>  conservative substitutions

Human DSHUCZ   M-ATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTE-GLHGFHVHEFGDNTAG-
Bovine DSBOCZ  --ATKAVCVLKGDGPVQGTIHFEAKG--DTVVVTGSITGLTE-GDHGFHVHQFGDNTQG-
Swordfish SODL V-L-KAVCVLRGAGETTGTVYFEQEGNANAVGKGIILKGLTP-GEHGFHVHGFGDNTNG-
Drosophila DSF V-V-KAVCVING-D-AKGTVFFEQESSGTPVKVSGEVCGLAK-GLHGFHVHEFGDNTNG-
Maize SDMZ     M-V-KAVAVLAGTD-VKGTIFFSQEGDG-PTTVTGSISGLKP-GLHGFHVHALGDTTNG-
Yeast DSBYC    V---QAVAVLKGDAGVSGVVKFEQASESEPTTVSYEIAGNSPNAERGFHIHEFGDATNG-
Photobacter DS QDLTVKMTDLQTGKPV-GTIELSQNKYG--VVFTPELADLTP-GMHGFHIHQNGSCASSE
               . . . . *.. ... . . . . .***.* *. . .
etc. etc.

>>HELP<< 6           Produce a dendrogram file only

This option allows you to calculate all the pairwise similarity scores and
produce a dendrogram, without doing the final multiple alignment. The
dendrogram will be sent to a file and can be used again at a later date (by
specifying item 4 on the main menu).

Example dendrogram for 7 sequences:

  2  122.00  0  0  1200000     each row represents 1 cluster. There are
  3  102.00  1  0  1120000     N-1 clusters for N sequences.
  2   96.00  0  0  0001200
  5   84.83  2  3  1112200
  6   77.60  4  0  1111120
  7   31.50  5  0  1111112

There are 5 columns of numbers. The first column states how many sequences
are clustered at any stage e.g. in row 1, two sequences are joined. In row
6, all 7 sequences are joined. The second column of figures (with the
decimal points) gives the similarity scores at which the sequences in that
cluster join e.g. in row 1, the two sequences join at a level of 122; in
the final row, the 7 sequences join at a level of 31.5. The final block of
0s, 1s and 2s represents the sequences joining together in each cluster.
Each column is a sequence; each row is a cluster. In each cluster, the
sequences marked with a 1 join with those marked with a 2 e.g. in the first
cluster, sequences 1 and 2 join; in the fourth cluster sequences 1, 2 and 3
join with sequences 4 and 5. The 2 remaining columns of figures (columns 3
and 4) are pointers to the rows that the two groups of sequences, joined at
this level, come from e.g. in cluster 4 (row 4) the 2 groups of sequences
come from rows 2 and 3.

>>HELP<< 7           Use an old dendrogram file

This option allows you to use a dendrogram file that was produced during an
earlier multiple alignment. This is useful because some dendrograms are
very time consuming to produce. The format of the dendrogram is
complicated; therefore you should only use a file produced by this program
or one that was edited CAREFULLY. The number of sequences in the dendrogram
file MUST be the same as the number of sequences in the current sequence
data set. The number of rows in the file will be equal to the number of
clusters, which is the number of sequences - 1. Every time you do a
complete multiple alignment (option 2 from the main menu) a dendrogram file
is automatically produced.
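
If you do edit a dendrogram file by hand, the rules described above can be
checked before the file is given back to the program. The following is a
minimal sketch in Python, not part of CLUSTAL itself. It assumes that the
five columns are separated by white space, that a pointer of 0 means "a
single sequence, no earlier row" and that pointers only refer to earlier
rows; these three points are inferred from the 7-sequence example and may
not hold for every file. The names Cluster, parse_dendrogram and
check_dendrogram are hypothetical.

```python
from typing import List, NamedTuple

# The 7-sequence example from the help text (annotations removed).
EXAMPLE = """\
2 122.00 0 0 1200000
3 102.00 1 0 1120000
2  96.00 0 0 0001200
5  84.83 2 3 1112200
6  77.60 4 0 1111120
7  31.50 5 0 1111112
"""


class Cluster(NamedTuple):
    size: int        # column 1: number of sequences joined in this cluster
    score: float     # column 2: similarity score at which they join
    pointer_a: int   # column 3: row that one group of sequences comes from
    pointer_b: int   # column 4: row that the other group comes from
    members: str     # column 5: one character (0, 1 or 2) per sequence


def parse_dendrogram(text: str) -> List[Cluster]:
    """Parse dendrogram rows, assuming white-space separated columns."""
    clusters = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 5:
            raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
        size, score, ptr_a, ptr_b, members = fields
        clusters.append(
            Cluster(int(size), float(score), int(ptr_a), int(ptr_b), members)
        )
    return clusters


def check_dendrogram(clusters: List[Cluster], n_sequences: int) -> None:
    """Raise ValueError if the rows cannot belong to n_sequences sequences."""
    # There must be N-1 clusters (rows) for N sequences.
    if len(clusters) != n_sequences - 1:
        raise ValueError(
            f"{len(clusters)} rows found, expected {n_sequences - 1}"
        )
    for row, c in enumerate(clusters, start=1):
        # The final block has one column per sequence, using only 0, 1, 2.
        if len(c.members) != n_sequences or set(c.members) - set("012"):
            raise ValueError(f"row {row}: bad final block {c.members!r}")
        # Column 1 counts the sequences marked 1 or 2 in the final block.
        joined = len(c.members) - c.members.count("0")
        if joined != c.size:
            raise ValueError(f"row {row}: column 1 is {c.size}, block has {joined}")
        # Assumption from the example: 0 = no earlier row; otherwise the
        # pointer must name a row above the current one.
        for ptr in (c.pointer_a, c.pointer_b):
            if not 0 <= ptr < row:
                raise ValueError(f"row {row}: pointer {ptr} out of range")


clusters = parse_dendrogram(EXAMPLE)
check_dendrogram(clusters, n_sequences=7)   # passes silently
```

Calling check_dendrogram with the number of sequences in the current data
set raises an error when the row count, the final block or the pointers do
not match, which are the kinds of mistake that careless editing tends to
introduce.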