Doc CLUSTAL.HLP   1/9/89
Des Higgins  Genetics Department, Trinity College, Dublin 2
             Ireland.
 dHiggins@vax1.tcd.ie
 (irl.) 1-772941 ext. 1969
     
Higgins, D.G. and Sharp, P.M. (1988)
   Clustal: a package for performing multiple sequence alignment on a
   microcomputer.  Gene, vol. 73, pp. 237-244.
     
Higgins, D.G. and Sharp, P.M. (1989)
   Fast and sensitive multiple sequence alignments on a microcomputer.
   CABIOS, vol. 5, pp. 151-153.
     
>>HELP<< 1     General Help ..... from main menu
     
CLUSTAL is a program for performing multiple alignments of up to 100 DNA or
protein sequences of up to 5000 residues (including gaps in the final
alignment).
     
By now, you should already have specified a sequence data set to be used.  If
not, go to item 1 on the menu and give the name of a file in the correct
format.  Help is available there to explain what that format should be.  Then,
if you go to item 2 on the menu, the complete multiple alignment process will
be applied and the alignment will be sent to a file.
     
The alignments are carried out in 3 stages: 1) all pairwise similarity scores
between all the sequences are calculated; 2) the similarity matrix is used to
cluster the sequences using UPGMA cluster analysis; 3) the final multiple
alignment is performed by gradually aligning groups of sequences, according to
the branching order in the dendrogram.
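
As a rough illustration of stage 2, the Python sketch below clusters a small
similarity matrix by repeatedly joining the two most similar groups and
averaging their scores (UPGMA).  It is not CLUSTAL's own code; the function
name upgma_order and the 4 x 4 matrix are made up for the example.

```python
def upgma_order(sim):
    """Return the joins made by UPGMA on a symmetric similarity matrix.

    sim[i][j] is the similarity of sequences i and j (0-based).
    Each join is reported as (members_a, members_b, similarity).
    """
    # Every sequence starts as its own cluster.
    clusters = [(i,) for i in range(len(sim))]
    joins = []

    def average(a, b):
        # UPGMA: unweighted mean of all pairwise scores between two clusters.
        total = sum(sim[i][j] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the highest average similarity.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = average(clusters[x], clusters[y])
                if best is None or score > best[0]:
                    best = (score, x, y)
        score, x, y = best
        a, b = clusters[x], clusters[y]
        joins.append((a, b, score))
        # Replace the two joined clusters by their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(tuple(sorted(a + b)))
    return joins


# Hypothetical similarity scores for 4 sequences.
example = [
    [0, 90, 40, 30],
    [90, 0, 50, 20],
    [40, 50, 0, 60],
    [30, 20, 60, 0],
]
joins = upgma_order(example)
```

For N sequences the list of joins has N-1 entries, one per cluster, and its
order is the branching order that stage 3 would follow.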
     
Two files are usually produced during the alignment process: 1) a file
containing a description of the dendrogram; 2) a file for the multiple align-
ment.  You can use item 3 on the menu to specify that only the dendrogram is
to be produced.  This is useful for large data sets (many sequences) where the
dendrogram takes a long time to produce.  This dendrogram file can later
be used as input (item 4 on the menu).  This is useful because you only have
to produce a dendrogram file once if you wish to experiment with different
gap parameters in the multiple alignment.
     
Item 5 on the main menu is useful for aligning 2 sequences from a dataset.
The sequences are numbered from 1 to N (N is the number of sequences) and
you are asked to specify 2 of them.
     
There are 2 groups of parameters that control the speed/sensitivity/"gappiness"
of the alignments.  Go to item 6 on the menu to see what these are or to
change them.  Help is available under item 6 to explain what these parameters
do.
     
Use item 7 to view text files, for example output files from previous
alignments.
     
Details of CLUSTAL have been published in:
     
Higgins, D.G. and Sharp, P.M. (1988)
   Clustal: a package for performing multiple sequence alignment on a
   microcomputer.  Gene, vol. 73, pp. 237-244.
     
Higgins, D.G. and Sharp, P.M. (1989)
   Fast and sensitive multiple sequence alignments on a microcomputer.
   CABIOS, vol. 5, pp. 151-153.
     
     
Des Higgins, Genetics Department, Trinity College, Dublin 2, Ireland
             e-mail:     dHiggins@vax1.tcd.ie
     
>>HELP<< 2     Help for the alignment parameters
     
There are 2 groups of parameters used in the alignments: 1) the pairwise sim-
ilarity score parameters; 2) the multiple alignment parameters.
     
PAIRWISE SIMILARITY SCORE PARAMETERS
     
All pairs of sequences are taken and a similarity score is calculated, which is
approximately the number of identical residues between the 2 sequences, minus
a fixed penalty for each gap needed to align them.  The method used is that of
Wilbur and Lipman (P.N.A.S., 1983). The method works by finding a fast approx-
imate global alignment between 2 sequences.   This is done using 2 techniques:
1) only exactly matching fragments (k-tuples) are considered; 2) using an
imaginary dot-matrix plot, only those diagonals with a high number of matches
and a "window" around these, are considered.
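
The idea of counting k-tuple matches along the diagonals of an imaginary
dot-matrix plot can be sketched in Python as follows.  This is only an
illustration of the first technique, not the program's own routine; the
function name diagonal_matches and the two short sequences are invented.

```python
from collections import defaultdict


def diagonal_matches(seq_a, seq_b, k):
    """Count exactly matching k-tuples on each diagonal of a dot-matrix plot.

    A diagonal is identified by the offset i - j, where i and j are the
    0-based start positions of the k-tuple in seq_a and seq_b.
    """
    # Index every k-tuple of seq_b by its start positions.
    positions = defaultdict(list)
    for j in range(len(seq_b) - k + 1):
        positions[seq_b[j:j + k]].append(j)

    # Look up each k-tuple of seq_a and record the diagonal of every match.
    counts = defaultdict(int)
    for i in range(len(seq_a) - k + 1):
        for j in positions.get(seq_a[i:i + k], ()):
            counts[i - j] += 1
    return dict(counts)


counts = diagonal_matches("ACGTACGT", "TACGTA", 2)
```

Only diagonals with at least one match appear in the result; a diagonal with
many matches suggests a region where the two sequences align without gaps.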
     
If sequences are very dissimilar, you may get similarity scores of zero.  This
is due to the approximate nature of the alignments.  It can be fixed by using
slower (more sensitive) parameters but, unless there are very many zero scores,
this will not matter.
     
1) K-TUPLE SIZE  This is the size of the exactly matching fragments used.
The larger this is set to (max= 2 for proteins; max= 4 for DNA), the faster
but more approximate will be the alignment.   For short sequences (e.g. 300
residues, or less) or for small numbers of sequences (less than 20) a value
of 1 will be fine;  for longer sequences (especially DNA) larger values
might be used.
     
2) GAP PENALTY   This parameter controls the frequency of gaps in the pair-
wise alignments.  It will not have much effect on the scores.  The higher
the gap penalty, the less likely gaps are.  The penalty specifies the number
of exactly matching residues that must be gained for a gap to be introduced.
     
3) FILTERING LEVEL  Consider an imaginary dot-matrix plot between 2 sequences.
The number of k-tuple matches along each diagonal is counted.  Then the mean
and standard deviation of the matches on the diagonals is calculated.  Then,
only those diagonals that have "filtering level" number of standard deviations
of matches above the mean number of matches are considered.  These diagonals
are the ones with most matches and are hence the most interesting for an
alignment.  Increasing this parameter will speed up the alignments but will
result in more scores of zero.
     
4) WINDOW SIZE   After the diagonals with most matches are found, this
parameter specifies a window around each that will be used in the alignment.
Decreasing this parameter will speed up the alignments.
     
5) SIMILARITY SCORES = PERCENTAGE or ABSOLUTE   If you choose percentage, then
the scores are calculated as percentage matches between 2 sequences (approx-
imately) i.e. (score/shorter length) * 100;  if you choose absolute scores
then the scores are simply the number of matches.  Percentage scores are
advisable if the lengths differ greatly.  Absolute scores are better otherwise.
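
The FILTERING LEVEL rule (item 3 above) can be sketched in Python as follows.
Several details are assumptions made for the example and are not taken from
the program: the population standard deviation is used, a diagonal must be
strictly above the cutoff, and the match counts are invented.

```python
from statistics import mean, pstdev


def filter_diagonals(counts, level):
    """Keep diagonals whose match count exceeds mean + level * std. dev.

    counts maps a diagonal (offset) to its number of k-tuple matches.
    Assumption: population standard deviation and a strict comparison.
    """
    values = list(counts.values())
    cutoff = mean(values) + level * pstdev(values)
    return sorted(d for d, n in counts.items() if n > cutoff)


# Hypothetical match counts for five diagonals.
example_counts = {-2: 0, -1: 1, 0: 9, 1: 4, 2: 1}

kept_at_0 = filter_diagonals(example_counts, 0)
kept_at_1 = filter_diagonals(example_counts, 1)
kept_at_2 = filter_diagonals(example_counts, 2)
```

With these counts, raising the level from 0 to 1 to 2 leaves two, one and
then no diagonals.  This is why a high filtering level is faster but gives
more scores of zero.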
     
     
MULTIPLE ALIGNMENT PARAMETERS
     
These parameters control the final multiple alignment.   There are 2 gap
penalty parameters and 1 for whether transitions are weighted in DNA align-
ments.  The basic algorithm used attempts to minimise the distance between
groups of sequences.  By default,  identical DNA residues have a distance of
0 and different ones have a distance of 10; if transitions are weighted then
transitions have a distance of 5.  For amino acids, a Dayhoff PAM matrix is
used.
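
The default DNA distances described above can be written as a small Python
function.  This is a sketch for illustration only; the function name
dna_distance and the treatment of U as T at this point are assumptions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_distance(x, y, weight_transitions=False):
    """Distance between two nucleotides: 0 if identical, otherwise 10,
    or 5 for a transition when transitions are weighted."""
    # Upper or lower case may be used; U is treated as T.
    x = x.upper().replace("U", "T")
    y = y.upper().replace("U", "T")
    if x == y:
        return 0
    pair = {x, y}
    # A transition stays within the purines (A, G) or the pyrimidines (C, T).
    is_transition = pair <= PURINES or pair <= PYRIMIDINES
    if weight_transitions and is_transition:
        return 5
    return 10
```

For example, dna_distance("A", "G") is 10 with unweighted transitions and 5
with weight_transitions=True, while dna_distance("A", "C") is 10 either way.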
     
6) GAP PENALTY (Fixed)      This parameter is a penalty for every gap that is
introduced, regardless of length of gap.  Therefore, decreasing this parameter
will encourage gaps of all sizes.   BEWARE: if you make this too small (approx.
5 or so) then the program may prefer to align each sequence opposite a long
gap.
     
7) GAP PENALTY (Varying)    This parameter is a penalty for each item in each
gap.  Therefore, this is a penalty for longer gaps.  Increase this and gaps
will get shorter.   BEWARE: if you make this too small (approx.
5 or so) then the program may prefer to align each sequence opposite a long
gap.
     
8) TRANSITIONS = WEIGHTED or UNWEIGHTED  If transitions are unweighted, then
all nucleic acid mismatches have the same weight.  If transitions are weighted
then transitions (C vs T; A vs G) have an intermediate score between
exact matches and other mismatches.
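
To see how the two gap penalties (items 6 and 7) interact, the sketch below
assumes that the cost of one gap is the fixed penalty plus the varying penalty
times the gap length.  That formula is an assumption based on the descriptions
above, and the penalty values of 10 are chosen only for the example.

```python
def gap_cost(length, fixed, varying):
    """Assumed cost of a single gap: a fixed charge for opening it plus a
    charge for every position it covers."""
    return fixed + varying * length


# Hypothetical penalties, used only to compare two ways of placing 6 gap
# positions.
one_long_gap = gap_cost(6, fixed=10, varying=10)
two_short_gaps = 2 * gap_cost(3, fixed=10, varying=10)
```

Under this assumption one gap of length 6 is cheaper than two gaps of
length 3, because the fixed penalty is paid only once.  Lowering the fixed
penalty reduces that difference; raising the varying penalty makes every
extra gap position more expensive.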
     
>>HELP<< 3     Help on the sequence input format
     
Two input formats are allowed:
     
1) Each sequence is in a separate file, each with a TITLE in line one and the
sequence in free format on line 2 onwards; the input is a file of file names.
Don't forget the title in line 1 of each sequence file.

2) All sequences are in a single file.  Each sequence is delimited by a  >
character in column 1 (same as FASTP format).  Each line beginning with a  >
is treated as a TITLE for the following sequence.
     
Sequences are entered in free format (gaps and punctuation marks are ignored)
with a maximum of 120 residues per line.   Maximum allowed sequence length is
5000 residues (including gaps in the final alignment).  Upper or lower case
may be used.  The 1 letter code is used for amino acids; for DNA U = T.
     
No ambiguity codes are used.  Residues are either a valid amino acid or
nucleotide or unknown.  The program will attempt to determine (from the
sequence composition) whether the sequences are DNA/RNA or protein.
     
     
     
     
     
EXAMPLE:
     
> Mouse nose drying factor
ACGTAGCTAGCACTTAGCTAGCTGCTACGTAGCTAGCTAGCTACGCT
TAGTAGCTACGCGTAGTGTAGCATCGATGCTAGCTGATCGATGCTAGCTGACT
GATGCTATCGACTGATCG
> Rabbit Guinness receptor
CACGTCGCAGCTGCCTAGCCGGGGGATGATACGCTGTATATCGGATTATATGCGCGCGATGCTA
AGGATGCTACGCGTCAGTCGCTAGGCGCAGTAGCGCTAG
> Hamster interferon alpha
  GACGCATGCATTTACGCTCGATCAGFTCGACATCAGCAGAGAT
  GCATCAGATCAGCATCAGCATACGACAGCAGATACGA
     
etc. etc.
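
If you want to check a single-file data set before giving it to CLUSTAL, the
format can be read with a few lines of Python.  This is a minimal sketch and
not the program's own reader: the function name read_sequences is invented,
the 120 residues per line and 5000 residue limits are not checked, and U is
not converted to T.

```python
def read_sequences(text):
    """Parse the single-file format: a '>' in column 1 starts a new sequence
    and the rest of that line is its title."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line[1:].strip(), []])
        elif records:
            # Free format: keep letters only, so blanks, gap characters,
            # digits and punctuation marks are ignored.
            letters = "".join(ch for ch in line if ch.isalpha())
            records[-1][1].append(letters.upper())
    return [(title, "".join(parts)) for title, parts in records]


sample = (
    "> Mouse nose drying factor\n"
    "ACGT AGC-TAG\n"
    "cact.tag\n"
    "> Rabbit Guinness receptor\n"
    "  CACG TCG\n"
)
sequences = read_sequences(sample)
```

Each entry of the result is a (title, sequence) pair, in the order in which
the sequences appear in the file.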
     
>>HELP<< 4     Help for the dendrogram file
     
The intermediate dendrograms used by CLUSTAL are stored in files.  The default
file name is derived from the sequence input file, with the extension .DND
added on.  The dendrogram is a description of the relatedness of the sequences
in the data set.  You can use a dendrogram file immediately for a multiple
alignment or use it again at a later date, to save having to generate it again.
Also, you can edit the file to change the branching order of the sequences.
     
     
Example dendrogram for 7 sequences:
     
  2    122.00  0  0     1200000     each row represents 1 cluster.  There are
  3    102.00  1  0     1120000     N-1 clusters for N sequences.
  2     96.00  0  0     0001200
  5     84.83  2  3     1112200
  6     77.60  4  0     1111120
  7     31.50  5  0     1111112
     
There are 5 columns of numbers.  The first column states how many sequences are
clustered at any stage e.g. in row 1, two sequences are joined.  In row 6, all
7 sequences are joined.  The second column of figures (with the decimal points)
gives the similarity scores at which the sequences in that cluster join e.g. in
row 1, the two sequences join at a level of 122; in the final row, the 7
sequences join at a level of 31.5.    The final block of 0s, 1s and 2s represents
the sequences joining together in each cluster.  Each column is a sequence;
each row is a cluster.  In each cluster, the sequences marked with a 1 join
with those marked with a 2 e.g. in the first cluster, sequences 1 and 2 join;
in the fourth cluster sequences 1,2 and 3 join with sequences 4 and 5.
     
The 2 remaining columns of figures (columns 3 and 4) are pointers to the rows
that the two groups of sequences, joined at this level, come from e.g. in
cluster 4 (row 4) the 2 groups of sequences come from rows 2 and 3.
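
The five columns can be picked apart with a short Python sketch such as the
one below.  It assumes the columns are separated by blanks, as in the example
above; the real file may use fixed-width fields.  The function name
parse_dendrogram and the dictionary keys are invented for the illustration.

```python
def parse_dendrogram(text):
    """Turn dendrogram rows into dictionaries, one per cluster."""
    clusters = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or incomplete lines
        size, score, left, right, pattern = fields[:5]
        clusters.append({
            "size": int(size),                    # column 1
            "score": float(score),                # column 2
            "pointers": (int(left), int(right)),  # columns 3 and 4
            # Column 5: sequences (numbered from 1) marked 1 join those
            # marked 2.
            "group1": [i + 1 for i, c in enumerate(pattern) if c == "1"],
            "group2": [i + 1 for i, c in enumerate(pattern) if c == "2"],
        })
    return clusters


example = """\
  2    122.00  0  0     1200000
  3    102.00  1  0     1120000
  2     96.00  0  0     0001200
  5     84.83  2  3     1112200
  6     77.60  4  0     1111120
  7     31.50  5  0     1111112
"""
clusters = parse_dendrogram(example)
fourth = clusters[3]
```

Here fourth["group1"] is [1, 2, 3], fourth["group2"] is [4, 5] and
fourth["pointers"] is (2, 3), which matches the description of cluster 4
above.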
     
>>HELP<< 5     Help for the alignment file
     
The final multiple alignment is sent to a file whose name is derived from the
sequence input file with the addition of the ending: .ALN .  The output is
self-explanatory.   Positions where all residues are identical are marked with
a star ( * ) and, for proteins, positions where all residues are "similar" are
marked with a dot ( . ).  Residues count as similar if they have a
score of 8 or greater in a Dayhoff PAM matrix of amino-acid similarity.
     
EXAMPLE:
     
* :=>  match across all seqs.
. :=>  conservative substitutions
     
Human DSHUCZ   M-ATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTE-GLHGFHVHEFGDNTAG-
Bovine DSBOCZ  --ATKAVCVLKGDGPVQGTIHFEAKG--DTVVVTGSITGLTE-GDHGFHVHQFGDNTQG-
Swordfish SODL V-L-KAVCVLRGAGETTGTVYFEQEGNANAVGKGIILKGLTP-GEHGFHVHGFGDNTNG-
Drosophila DSF V-V-KAVCVING-D-AKGTVFFEQESSGTPVKVSGEVCGLAK-GLHGFHVHEFGDNTNG-
Maize SDMZ     M-V-KAVAVLAGTD-VKGTIFFSQEGDG-PTTVTGSISGLKP-GLHGFHVHALGDTTNG-
Yeast DSBYC    V---QAVAVLKGDAGVSGVVKFEQASESEPTTVSYEIAGNSPNAERGFHIHEFGDATNG-
Photobacter DS QDLTVKMTDLQTGKPV-GTIELSQNKYG--VVFTPELADLTP-GMHGFHIHQNGSCASSE
                     .  . .   . *.. ...      .     . .    . .***.*  *. . .
etc. etc.
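
The star markers can be reproduced with a short Python sketch.  It covers
identical positions only; the dots would need the Dayhoff PAM matrix, which is
not reproduced here.  The function name identity_line is invented, and leaving
gap columns unmarked is an assumption.

```python
def identity_line(rows):
    """Return a marker line with '*' under every column in which all
    aligned residues are identical.  Columns containing a gap ('-') are
    left unmarked (an assumption)."""
    marks = []
    # zip(*rows) walks through the alignment column by column.
    for column in zip(*rows):
        identical = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if identical else " ")
    return "".join(marks)


# Three short aligned fragments, made up for the example.
rows = [
    "GFHVHEFGD",
    "GFHVHQFGD",
    "GFHIHEFGD",
]
marker = identity_line(rows)
```

The rows passed in should be the aligned residues only, without the sequence
names, and should all have the same length.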
     
>>HELP<< 6     Produce a dendrogram file only
     
This option allows you to calculate all the pairwise similarity scores and
produce a dendrogram, without doing the final multiple alignment.  The
dendrogram will be sent to a file and can be used again at a later date
(by specifying item 4 on the main menu).
     
     
Example dendrogram for 7 sequences:
     
  2    122.00  0  0     1200000     each row represents 1 cluster.  There are
  3    102.00  1  0     1120000     N-1 clusters for N sequences.
  2     96.00  0  0     0001200
  5     84.83  2  3     1112200
  6     77.60  4  0     1111120
  7     31.50  5  0     1111112
     
There are 5 columns of numbers.  The first column states how many sequences are
clustered at any stage e.g. in row 1, two sequences are joined.  In row 6, all
7 sequences are joined.  The second column of figures (with the decimal points)
gives the similarity scores at which the sequences in that cluster join e.g. in
row 1, the two sequences join at a level of 122; in the final row, the 7
sequences join at a level of 31.5.    The final block of 0s, 1s and 2s represents
the sequences joining together in each cluster.  Each column is a sequence;
each row is a cluster.  In each cluster, the sequences marked with a 1 join
with those marked with a 2 e.g. in the first cluster, sequences 1 and 2 join;
in the fourth cluster sequences 1,2 and 3 join with sequences 4 and 5.
     
The 2 remaining columns of figures (columns 3 and 4) are pointers to the rows
that the two groups of sequences, joined at this level, come from e.g. in
cluster 4 (row 4) the 2 groups of sequences come from rows 2 and 3.
     
>>HELP<< 7     Use an old dendrogram file
     
This option allows you to use a dendrogram file that was produced during an
earlier multiple alignment.  This is useful because some dendrograms are
very time consuming to produce.  The format of the dendrogram is complicated;
therefore you should only use a file produced by this program or one that was
edited CAREFULLY.   The number of sequences in the dendrogram file MUST be
the same as the number of sequences in the current sequence data set.  The
number of rows in the file will be equal to the number of clusters which
is the number of sequences - 1.
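
If you have edited a dendrogram file by hand, a quick consistency check like
the Python sketch below may catch simple mistakes before you use it.  It is
not part of CLUSTAL: the function name check_dendrogram is invented, and it
assumes the five columns are separated by blanks.

```python
def check_dendrogram(rows, n_sequences):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    # There must be one row per cluster, i.e. N-1 rows for N sequences.
    if len(rows) != n_sequences - 1:
        problems.append(
            f"expected {n_sequences - 1} rows, found {len(rows)}")
    for number, row in enumerate(rows, start=1):
        fields = row.split()
        if len(fields) < 5:
            problems.append(f"row {number}: fewer than 5 columns")
            continue
        pattern = fields[4]
        if len(pattern) != n_sequences:
            problems.append(
                f"row {number}: pattern covers {len(pattern)} sequences")
        elif int(fields[0]) != pattern.count("1") + pattern.count("2"):
            problems.append(
                f"row {number}: column 1 disagrees with the pattern")
    return problems


rows = [
    "  2    122.00  0  0     1200000",
    "  3    102.00  1  0     1120000",
    "  2     96.00  0  0     0001200",
    "  5     84.83  2  3     1112200",
    "  6     77.60  4  0     1111120",
    "  7     31.50  5  0     1111112",
]
problems_for_7 = check_dendrogram(rows, 7)
problems_for_8 = check_dendrogram(rows, 8)
```

The 6 example rows pass for a data set of 7 sequences but are rejected for a
data set of 8.  Passing the check does not prove that the branching order
itself makes sense.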
     
Every time you do a complete multiple alignment (option 2 from the main menu)
a dendrogram file is automatically produced.