                     *************************
                     **                     **
                     **  COILS version 2.2  **
                     **                     **
                     *************************

                           by A. N. Lupas
                     programmed by J. M. Lupas

1. Introduction
2. Installation
3. Input file formats
4. Scoring options
5. Weighting options
6. Output options
7. Performance: A. Database statistics
                B. Highscoring sequences in globular proteins
                C. Performance on coiled coils
                D. Limits of the method
8. Recommendations for using the program

------------------------------------------------------------------------------

    1. INTRODUCTION

    COILS is a program that compares a sequence to a database of known 
parallel two-stranded coiled-coils and derives a similarity score. By 
comparing this score to the distribution of scores in globular and 
coiled-coil proteins, the program then calculates the probability that 
the sequence will adopt a coiled-coil conformation.
    COILS is described in

    Lupas, A., Van Dyke, M., and Stock, J. (1991) Predicting Coiled Coils
from Protein Sequences, Science 252:1162-1164,

    Lupas, A. (1996) Prediction and Analysis of Coiled-Coil Structures,
Meth. Enzymology 266:513-525,

and is based on a prediction protocol proposed by David Parry:

    Parry, D. A. D. (1982) Coiled-coils in alpha-helix-containing proteins:
analysis of the residue types within the heptad repeat and the use of these 
data in the prediction of coiled coils in other proteins, Biosci. Rep. 
2:1017-1024.

------------------------------------------------------------------------------

    2. INSTALLATION

    Source code for COILS and for the auxiliary programs ALIGNED20,
ALIGNED80, ALLFRAME and CAPS can be downloaded by anonymous FTP from 
the VMS folder of the server FTP.BIOCHEM.MPG.DE. The programs are written 
in VAX Pascal and can be compiled under VAX/VMS or OpenVMS.
    To run the programs on your machine, download the source code (e.g. 
COILS2.PAS) and compile it with the commands PASCAL COILS2 and LINK COILS2.
If you encounter problems during compilation, send the error messages by 
e-mail to LUPAS@VMS.BIOCHEM.MPG.DE and I will try to help.

------------------------------------------------------------------------------

    3. INPUT FILE FORMATS

    COILS accepts files in the following formats:

(a) GCG:

P1;MULI_ERWAM - MAJOR OUTER MEMBRANE LIPOPROTEIN PRECURSOR (MUREIN-LIPOPROTEIN)
ID   MULI_ERWAM     STANDARD;      PRT;    78 AA.
AC   P02939;
DT   21-JUL-1986 (REL. 01, CREATED)
DT   21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
DT   01-APR-1988 (REL. 07, LAST ANNOTATION UPDATE)
DE   MAJOR OUTER MEMBRANE LIPOPROTEIN PRECURSOR (MUREIN-LIPOPROTEIN).
OS   ERWINIA AMYLOVORA.
OC   PROKARYOTA; GRACILICUTES; SCOTOBACTERIA; FACULTATIVELY ANAEROBIC RODS;
OC   ENTEROBACTERIACEAE.
RN   [1]
RP   SEQUENCE FROM N.A.
RM   81117327
RA   YAMAGATA H., NAKAMURA K., INOUYE M.;
RL   J. BIOL. CHEM. 256:2194-2198(1981).
DR   EMBL; J01577; EALPP.
DR   PIR; A03439; NPWCWY.
DR   PROSITE; PS00013; PROKAR_LIPOPROTEIN.
KW   SIGNAL; OUTER MEMBRANE; LIPOPROTEIN; DUPLICATION.
FT   SIGNAL        1     20
FT   CHAIN        21     78       MUREIN-LIPOPROTEIN.
FT   LIPID        21     21       N-ACYL DIGLYCERIDE.
FT   REPEAT       24     34
FT   REPEAT       38     48
SQ   SEQUENCE   78 AA;  8369 MW;  24285 CN;

Muli_Erwam  Length: 78  January 21, 1994  16:04  Type: P  Check: 4477  ..

       1  MNRTKLVLGA VILGSTLLAG CSSNAKIDQL STDVQTLNAK VDQLSNDVTA

      51  IRSDVQAAKD DAARANQRLD NQAHSYRK


(b) Pearson (FASTA):

>MULI_ECOLI - MAJOR OUTER MEMBRANE LIPOPROTEIN PRECURSOR
MKATKLVLGAVILGSTLLAGCSSNAKIDQLSSDVQTLNAKVDQLSNDVNAMRSDVQAAKD
DAARANQRLDNMATKYRK


(c) user-defined:

The program recognizes the start of a sequence by a > at the beginning
or a [space,space,dot,dot] at the end of the line preceding the sequence.
The program recognizes the end of a sequence by a *, a //, or by the
end-of-file character.  The program accepts sequences in upper- and
lowercase letters and ignores all spaces, numbers and other characters not
representing an amino acid.
    If a file contains several proteins, the end of each sequence but the
last must be marked by * or by //:

>M_ECOLI P1;MULI_ECOLI - MAJOR OUTER MEMBRANE LIPOPROTEIN PRECURSOR
MKATKLVLGAVILGSTLLAGCSSNAKIDQLSSDVQTLNAKVDQLSNDVNAMRSDVQAAKD
DAARANQRLDNMATKYRK*
>M_ERWAM P1;MULI_ERWAM - MAJOR OUTER MEMBRANE LIPOPROTEIN PRECURSOR
MNRTKLVLGAVILGSTLLAGCSSNAKIDQLSTDVQTLNAKVDQLSNDVTAIRSDVQAAKD
DAARANQRLDNQAHSYRK*
>M_MORMO P1;MULI_MORMO - MAJOR OUTER MEMBRANE LIPOPROTEIN PRECURSOR
MGRSKIVLGAVVLASALLAGCSSNAKFDQLDNDVKTLNAKVDQLSNDVNAIRADVQQAKD
EAARANQRLDNQVRSYKK
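
The recognition rules above can be sketched in a few lines of Python. This
is only an illustration of the rules as stated in this section, not the
Pascal code of COILS itself. The function name read_sequences and the
restriction to the 20 standard one-letter codes are assumptions made for
the example; the program may accept further letters.

```python
# Assumption: only the 20 standard one-letter codes count as amino acids.
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def read_sequences(text):
    """Return a list of (header, sequence) pairs found in `text`."""
    records = []
    header, residues, in_sequence = "", [], False
    for line in text.splitlines():
        if not in_sequence:
            stripped = line.rstrip()
            # A sequence starts after a line that begins with '>' or that
            # ends in [space,space,dot,dot].
            if stripped.startswith(">") or stripped.endswith("  .."):
                header, residues, in_sequence = stripped.lstrip(">"), [], True
            continue
        # Inside a sequence: look for the earliest '*' or '//' on the line.
        cuts = [i for i in (line.find("*"), line.find("//")) if i >= 0]
        body = line[:min(cuts)] if cuts else line
        # Keep amino-acid letters of either case; drop everything else.
        residues.extend(c.upper() for c in body if c.upper() in AMINO_ACIDS)
        if cuts:
            records.append((header, "".join(residues)))
            in_sequence = False
    if in_sequence:  # the last sequence may simply end with the file
        records.append((header, "".join(residues)))
    return records
```

Note that, as in the rule stated above, a new > line does not terminate the
preceding sequence; without a * or // its letters would be read as residues.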

------------------------------------------------------------------------------

    4. SCORING OPTIONS

    After asking for input and output filenames, the program will offer
the choice of two scoring matrices that it can compare a sequence to:

MTK - is a matrix derived from the sequences of myosins, tropomyosins
and keratins (intermediate filaments type I and II).  It is the one
described in Science, 252:1162 (1991).

MTIDK - is a new matrix derived from myosins, paramyosins, tropomyosins,
intermediate filaments type I - V, desmosomal proteins and kinesins.
The matrix was compiled by weighting the residue frequencies of the
different protein families according to the following scheme:

 0.2 MYOSINS - 0.5 myosins
             - 0.5 paramyosins

 0.2 TROPOMYOSINS

 0.2 INTERMEDIATE FILAMENTS - 0.2 type I (keratin)
                            - 0.2 type II (keratin)
                            - 0.2 type III (desmin, vimentin, GFAP, peripherin)
                            - 0.2 type IV (NF light, medium and heavy chains)
                            - 0.2 type V (lamins A and B)

 0.2 DESMOSOMAL PROTEINS - 0.33 desmoplakin
                         - 0.33 plectin
                         - 0.33 hemidesmosomal plaque prot. (bullous pemphigoid)

 0.2 KINESINS

While the MTIDK matrix provides for a somewhat better resolution between 
the scores of globular and coiled-coil proteins as well as for a more 
consistent evaluation of the different families of coiled-coil proteins, 
the MTK matrix yields fewer highscoring segments in a database of globular 
sequences (see Section 7: PERFORMANCE). Current data are consistent with 
the assumption that the MTK matrix is more specific for two-stranded 
structures and that the MTIDK matrix gives a more realistic assessment 
for other types of coiled coils.
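
The two-level weighting scheme listed above for the MTIDK matrix implies an
effective weight for each subfamily (family weight times subfamily weight).
The following Python sketch works these out and shows how a combined residue
frequency would follow as a weighted average. It assumes that the 0.33
entries stand for exactly one third; the names SCHEME, effective_weights and
combine are hypothetical and do not come from the program.

```python
from fractions import Fraction as F

# Family weight, then subfamily weights within the family (from the scheme
# above; 0.33 is taken to mean 1/3, which is an assumption).
SCHEME = {
    "myosins": (F(1, 5), {"myosins": F(1, 2), "paramyosins": F(1, 2)}),
    "tropomyosins": (F(1, 5), {"tropomyosins": F(1)}),
    "intermediate filaments": (F(1, 5), {
        "type I": F(1, 5), "type II": F(1, 5), "type III": F(1, 5),
        "type IV": F(1, 5), "type V": F(1, 5)}),
    "desmosomal proteins": (F(1, 5), {
        "desmoplakin": F(1, 3), "plectin": F(1, 3),
        "hemidesmosomal plaque protein": F(1, 3)}),
    "kinesins": (F(1, 5), {"kinesins": F(1)}),
}


def effective_weights(scheme):
    """Map (family, subfamily) to family weight times subfamily weight."""
    return {(family, sub): weight * sub_weight
            for family, (weight, subs) in scheme.items()
            for sub, sub_weight in subs.items()}


def combine(frequencies, weights):
    """Weighted average of per-subfamily frequencies for one residue/position."""
    return sum(weights[key] * frequencies[key] for key in weights)
```

With these weights, each intermediate filament type contributes 1/25 and
myosins and paramyosins contribute 1/10 each; all weights sum to 1.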

------------------------------------------------------------------------------

    5. WEIGHTING OPTIONS

    Because coiled coils are generally fibrous, solvent-exposed structures,
all but the internal a and d positions have a high likelihood of being
occupied by hydrophilic residues.  A program that gives equal weight to all
positions is therefore going to be biased towards hydrophilic, charge-rich
sequences. While this does not pose a problem for the vast majority of
natural sequences, some highly charged sequences obtain high coiled-coil 
probabilities in the obvious absence of heptad periodicity and coiled-coil-
forming potential. An extreme case is that of polyglutamate which obtains a 
coiled-coil-forming probability > 99%.
    To counter this problem, COILS2 contains a weighting option, which 
allows the user to assign the same weight to the two hydrophobic 
positions a and d as to the five hydrophilic positions b, c, e, f and g. 
This leads to only slightly worse performance of the program (see 
Section 7: PERFORMANCE) and permits the identification of the class of 
false positives described above. It is recommended to run a weighted and
an unweighted scan in parallel and to compare the outputs. A drop of more 
than 20-30% in the probability is a clear indication of a highly charged 
false positive.
    Two examples (window=21, probabilities abbreviated to the first digit):

sequence  EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
MTK       99999999999999999999999999999999999999999999999999
MTIDK     99999999999999999999999999999999999999999999999999
MTK_W     00000000000000000000000000000000000000000000000000
MTIDK_W   00000000000000000000000000000000000000000000000000

sequence  DDEKRKEKKDKKEKEKERRREKEKKEKEKEKERREKKKRKREEDDEEKKE
MTK       47888999999999999999999999999999999999999998766666
MTIDK     99999999999999999999999999999999999999999999999999
MTK_W     00000000000000000000000000000000000000000000000000
MTIDK_W   00111111122222333333333333333333333222222220000000

In many cases a 21 residue scan yields clearer results than a 28 residue scan.
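
The comparison of weighted and unweighted scans can be automated on the
abbreviated rows shown above. The Python sketch below is one simple reading
of the rule: it compares the peak probability of the two rows and flags a
drop larger than a threshold. The function names and the default threshold
of 30 percentage points are assumptions for illustration; since the rows
carry only the first digit, the drop is resolved in steps of 10.

```python
def peak_percent(digit_row):
    """Lower bound (in %) of the highest probability in a row like '0479'."""
    return 10 * max(int(ch) for ch in digit_row)


def looks_like_charged_false_positive(unweighted_row, weighted_row, min_drop=30):
    """True if the peak probability falls by more than `min_drop` points
    when going from the unweighted to the weighted scan."""
    drop = peak_percent(unweighted_row) - peak_percent(weighted_row)
    return drop > min_drop


# Polyglutamate example from above: MTK versus MTK_W.
poly_e = looks_like_charged_false_positive("9" * 50, "0" * 50)
```

Comparing peaks rather than individual positions avoids flagging a genuine
coiled coil merely because the predicted ends shift slightly between scans.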

    As an alternative, it is possible to use the auxiliary program ALLFRAME.
This program lists the scores (not probabilities) of a sequence in all seven 
frames. The presence and strength of a heptad periodicity can be inferred
directly from the difference between the highest-scoring frame and all
others:

   1 E  a 1.99  b 1.99  c 1.99  d 1.99  e 1.99  f 1.99  g 1.99
   2 E  b 1.99  c 1.99  d 1.99  e 1.99  f 1.99  g 1.99  a 1.99
   3 E  c 1.99  d 1.99  e 1.99  f 1.99  g 1.99  a 1.99  b 1.99
   4 E  d 1.99  e 1.99  f 1.99  g 1.99  a 1.99  b 1.99  c 1.99
   5 E  e 1.99  f 1.99  g 1.99  a 1.99  b 1.99  c 1.99  d 1.99
   6 E  f 1.99  g 1.99  a 1.99  b 1.99  c 1.99  d 1.99  e 1.99
   7 E  g 1.99  a 1.99  b 1.99  c 1.99  d 1.99  e 1.99  f 1.99
   8 E  a 1.99  b 1.99  c 1.99  d 1.99  e 1.99  f 1.99  g 1.99
   9 E  b 1.99  c 1.99  d 1.99  e 1.99  f 1.99  g 1.99  a 1.99
  10 E  c 1.99  d 1.99  e 1.99  f 1.99  g 1.99  a 1.99  b 1.99
.....

   1 D  a 1.21  b 1.38  c 1.51  d 1.42  e 1.58  f 1.18  g 1.14
   2 D  b 1.38  c 1.39  d 1.51  e 1.42  f 1.63  g 1.19  a 1.16
   3 E  c 1.40  d 1.41  e 1.58  f 1.52  g 1.66  a 1.22  b 1.26
   4 K  d 1.40  e 1.41  f 1.58  g 1.52  a 1.66  b 1.29  c 1.26
   5 R  e 1.46  f 1.41  g 1.58  a 1.52  b 1.66  c 1.31  d 1.27
   6 K  f 1.46  g 1.41  a 1.58  b 1.52  c 1.66  d 1.31  e 1.27
   7 E  g 1.46  a 1.41  b 1.58  c 1.52  d 1.66  e 1.31  f 1.27
   8 K  a 1.46  b 1.41  c 1.58  d 1.52  e 1.66  f 1.31  g 1.27
   9 K  b 1.46  c 1.41  d 1.58  e 1.52  f 1.66  g 1.31  a 1.27
  10 D  c 1.46  d 1.41  e 1.58  f 1.52  g 1.66  a 1.31  b 1.27
.....

In both examples, the absence of a heptad periodicity is obvious. For
comparison, here are scores for the GCN4 leucine zipper; the heptad
frame with the leucines in position d is immediately apparent:

.....
  30 L  b 0.79  c 1.17  d 1.91  e 0.98  f 0.94  g 1.21  a 1.17
  31 E  c 0.79  d 1.17  e 1.91  f 0.98  g 0.96  a 1.21  b 1.17
  32 D  d 0.79  e 1.17  f 1.91  g 0.98  a 0.96  b 1.21  c 1.17
  33 K  e 0.79  f 1.17  g 1.91  a 0.98  b 1.02  c 1.21  d 1.17
  34 V  f 0.79  g 1.17  a 1.91  b 1.02  c 1.02  d 1.21  e 1.17
  35 E  g 0.79  a 1.17  b 1.91  c 1.02  d 1.02  e 1.21  f 1.17
  36 E  a 0.78  b 1.17  c 1.91  d 1.02  e 1.02  f 1.21  g 1.17
  37 L  b 1.02  c 1.17  d 1.91  e 1.02  f 1.02  g 1.19  a 1.17
  38 L  c 1.02  d 1.17  e 1.91  f 1.02  g 1.02  a 1.19  b 1.17
  39 S  d 1.02  e 1.17  f 1.91  g 1.02  a 1.02  b 1.17  c 1.17
  40 K  e 1.02  f 1.17  g 1.91  a 1.02  b 1.02  c 1.17  d 1.17
  41 N  f 1.02  g 1.17  a 1.91  b 1.02  c 1.02  d 1.06  e 1.17
  42 Y  g 1.02  a 1.17  b 1.91  c 1.02  d 1.02  e 1.02  f 1.17
  43 H  a 1.02  b 1.17  c 1.91  d 1.02  e 1.02  f 1.02  g 1.17
  44 L  b 1.02  c 1.17  d 1.91  e 1.02  f 1.02  g 1.02  a 1.17
  45 E  c 1.02  d 1.17  e 1.91  f 1.02  g 1.02  a 1.02  b 1.17
  46 N  d 1.02  e 1.17  f 1.91  g 1.02  a 1.02  b 1.02  c 1.17
  47 E  e 1.02  f 1.17  g 1.91  a 1.02  b 1.02  c 1.02  d 1.17
  48 V  f 1.02  g 1.10  a 1.91  b 1.02  c 1.02  d 1.02  e 1.17
  49 A  g 1.02  a 1.10  b 1.91  c 1.02  d 1.02  e 1.02  f 1.17
  50 R  a 1.02  b 1.10  c 1.91  d 1.02  e 1.02  f 1.02  g 1.17
  51 L  b 1.02  c 1.04  d 1.91  e 1.02  f 1.02  g 1.02  a 1.17
.....
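
The difference between the highest-scoring frame and the others can be read
off ALLFRAME output mechanically. The Python sketch below parses one output
line in the layout shown above and reports the best frame together with its
margin over the second-best frame. The function names are hypothetical, and
using the runner-up as the reference is an assumption; comparing against the
mean of the other six frames would be an equally reasonable choice.

```python
def parse_allframe_line(line):
    """Split a line such as '37 L  b 1.02  c 1.17 ...' into
    (position, residue, {frame: score})."""
    tokens = line.split()
    position, residue = int(tokens[0]), tokens[1]
    frames = tokens[2::2]                        # b, c, d, ...
    scores = [float(t) for t in tokens[3::2]]    # 1.02, 1.17, ...
    if len(frames) != 7 or len(scores) != 7:
        raise ValueError("expected seven frame/score pairs")
    return position, residue, dict(zip(frames, scores))


def best_frame_and_margin(frame_scores):
    """Return the highest-scoring frame and its lead over the runner-up."""
    ranked = sorted(frame_scores.items(), key=lambda kv: kv[1], reverse=True)
    (frame, best), (_, runner_up) = ranked[0], ranked[1]
    return frame, best - runner_up


GCN4_37 = "  37 L  b 1.02  c 1.17  d 1.91  e 1.02  f 1.02  g 1.19  a 1.17"
POLY_E_1 = "   1 E  a 1.99  b 1.99  c 1.99  d 1.99  e 1.99  f 1.99  g 1.99"
```

For the GCN4 line the best frame is d with a margin of about 0.72, whereas
the polyglutamate line has a margin of zero.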

------------------------------------------------------------------------------

    6. OUTPUT OPTIONS

    COILS2 offers four output options: 

    The default option gives residue number, residue type and the frame
and coiled-coil-forming probability obtained in scanning windows of 14,
21 and 28 residues:
.....
   61 E        c  0.317       c  0.379       c  0.562
   62 L        d  0.317       d  0.379       d  0.562
   63 E        e  0.317       e  0.379       e  0.562
   64 L        f  0.167       f  0.379       f  0.562
   65 T        c  0.472       c  0.598       g  0.562
   66 H        d  0.472       d  0.740       a  0.562
   67 R        e  0.916       e  0.740       e  0.677
   68 K        f  0.943       f  0.740       f  0.677
   69 M        g  0.943       g  0.740       g  0.677
   70 K        a  0.943       a  0.740       a  0.677
   71 D        b  0.943       b  0.740       b  0.677
.....

    Option a is similar to the default option, except that the results are 
displayed in rows. As a result, residue numbers are indicated by a scale
above the sequence, probabilities are abbreviated to the first digit
(but 100% is also 9) and the frames for the three scans are listed below
the probabilities. This option gives a good overview of the location of
peaks in a protein:
.....
61
    .    |    .    |    .    |    .    |    .    |    .    |
ELELTHRKMKDAYEEEIKHLKLGLEQRDHQIASLTVQQQRQQQQQQQVQQHLQQQQQQLA

111144999999999999999777770000000000000000000333333333333332
333357777777777777777777772222222200004444444444444444444443
111112666666666666666666666666666654422222222222222222222222
cdefcdefgabcdefgabcdefgabcdefgabcdefdefgabcdebcdefgabcdefgab
cdefcdefgabcdefgabcdefgabcdefgabcdefgabcdefgabcdefgabcdefgab
cdefcdefgabcdefgabcdefgabcdefgabcdefgabcdefgabcdefgabcdefgab
.....
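
The abbreviation used in the row display can be stated compactly: the
probability is truncated (not rounded) to its first decimal digit, and 100%
is capped at 9. A minimal Python sketch, with hypothetical function names,
assuming probabilities are given as fractions between 0 and 1:

```python
def first_digit(probability):
    """Abbreviate a probability in [0, 1] to one character, '0' to '9'."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    # Truncate to the first decimal digit; 1.0 (100%) is also shown as 9.
    return str(min(9, int(probability * 10)))


def digit_row(probabilities):
    """Abbreviate a whole list of probabilities into one display row."""
    return "".join(first_digit(p) for p in probabilities)
```

For example, the values 0.317, 0.167, 0.472 and 0.943 from the default
output above would be displayed as 3, 1, 4 and 9.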

    Option b asks the user for the size of the scanning window and returns
scores only. This option allows the user to inspect the scores behind the
probabilities given in the previous options and to scan sequences with
window sizes for which no statistics are currently available. For an 
application, see Seo, J. and Cohen, C. (1993) Pitch diversity in alpha-
helical coiled coils, Proteins 15:223-234.
.....
        61 E      c     1.59
        62 L      d     1.59
        63 E      e     1.59
        64 L      f     1.50
        65 T      c     1.65
        66 H      d     1.65
        67 R      e     1.90
        68 K      f     1.94
        69 M      g     1.94
        70 K      a     1.94
        71 D      b     1.94
.....

    Option c is useful for scanning very large proteins or files containing
many proteins as it only displays (in default format) sequences with
coiled-coil-forming probabilities above a cutoff value that is set by the
user.

------------------------------------------------------------------------------

    7. PERFORMANCE

    A. Database statistics

    The following is a synopsis of the score distributions for the PDB 
and coiled-coil databases. The score distributions are approximated by 
Gaussians and the means and standard deviations of the Gaussians are given.
PDB is a database of globular sequences from The Protein Data Bank 
(32,592 res.) described in Science 252:1162. The combined coiled-coil 
database contains 26,965 residues from various coiled-coil proteins (see 
Section 4: SCORING OPTIONS) and will be described in detail in print. 
Obviously, every family of coiled-coil proteins was scored with a scoring 
matrix that excluded residue frequencies from that family.

                         28 residue scan   21 residue scan   14 residue scan
                         mean   std.dev.   mean   std.dev.   mean   std.dev.
PDB           MTK        0.77   0.20       0.83   0.24       0.94   0.29
              MTIDK      0.80   0.18       0.86   0.21       0.95   0.26
              MTK_W      0.79   0.23       0.86   0.26       1.00   0.33
              MTIDK_W    0.86   0.18       0.92   0.22       1.04   0.27

Coiled coils  MTK        1.63   0.22       1.70   0.25       1.79   0.30
              MTIDK      1.69   0.18       1.74   0.23       1.82   0.28
              MTK_W      1.70   0.24       1.76   0.28       1.88   0.34
              MTIDK_W    1.74   0.20       1.79   0.24       1.89   0.30

From these numbers, several conclusions can be drawn:
    - The difference between the mean scores in PDB and in coiled coils 
is slightly larger with the MTIDK matrix than with the MTK matrix. More
importantly, the standard deviation of the score distribution is lower 
with the MTIDK matrix for both databases. This means that the MTIDK
matrix yields a more consistent evaluation of globular and coiled-coil
sequences and provides for a better resolution between the two score
distributions. Not shown here is that the MTIDK matrix also improves the
score of intermediate filament sequences relative to the scores of other
coiled-coil sequences, thus providing for a more balanced scoring of the
different families of coiled-coil proteins than the MTK matrix.
    - For both matrices, weighting slightly decreases the resolution 
between the globular and coiled-coil score distributions.
    - For all scoring methods, the resolution between the globular and 
coiled-coil score distributions decreases strongly with decreasing size
of the scanning window.
    - The difference in performance between the MTK matrix and the MTIDK 
matrix is small although the MTIDK matrix is derived from over twice the 
number of residues and many more protein families. I conclude that little 
further progress can be expected from even larger coiled-coil databases.
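
The conclusions above can be checked against the table with a little
arithmetic. The Python sketch below stores the means and standard deviations
from the table and computes, for each scoring method and window, the gap
between the means and a separation index (gap divided by the root mean
square of the two standard deviations). This index is an assumption chosen
for illustration; the text above does not define "resolution" numerically.

```python
import math

# STATS[method][window] = ((PDB mean, PDB sd), (coiled-coil mean, sd)),
# copied from the table above.
STATS = {
    "MTK":     {28: ((0.77, 0.20), (1.63, 0.22)),
                21: ((0.83, 0.24), (1.70, 0.25)),
                14: ((0.94, 0.29), (1.79, 0.30))},
    "MTIDK":   {28: ((0.80, 0.18), (1.69, 0.18)),
                21: ((0.86, 0.21), (1.74, 0.23)),
                14: ((0.95, 0.26), (1.82, 0.28))},
    "MTK_W":   {28: ((0.79, 0.23), (1.70, 0.24)),
                21: ((0.86, 0.26), (1.76, 0.28)),
                14: ((1.00, 0.33), (1.88, 0.34))},
    "MTIDK_W": {28: ((0.86, 0.18), (1.74, 0.20)),
                21: ((0.92, 0.22), (1.79, 0.24)),
                14: ((1.04, 0.27), (1.89, 0.30))},
}


def mean_gap(method, window):
    """Difference between the coiled-coil and the PDB mean score."""
    (g_mean, _), (c_mean, _) = STATS[method][window]
    return c_mean - g_mean


def separation(method, window):
    """Mean gap in units of the RMS standard deviation (illustrative index)."""
    (_, g_sd), (_, c_sd) = STATS[method][window]
    return mean_gap(method, window) / math.sqrt((g_sd ** 2 + c_sd ** 2) / 2)
```

With this index, the 28 residue values come out near 4.1 for MTK and 4.9
for MTIDK, and every method loses separation as the window shrinks from 28
to 21 to 14 residues.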


    B. Highscoring sequences in globular proteins

    I scored release 13.0 (8/93) of the NRL_3D database (containing the
sequences of proteins of known structure from PDB) with all four scoring 
methods and counted the number of segments obtaining probabilities >10%. 
The database contained 539 nonredundant protein sequences and excluded
the coiled-coil proteins tropomyosin, hemagglutinin, GCN4, Gal4 and 
apolipoprotein E. Apolipoprotein E  was included with the coiled-coil 
subset because its helices are very long compared to those of other helical
bundles and because it forms a partly three-stranded structure. All other 
helical bundles were included with the globular proteins because their 
helices are short and frequently packed at irregular angles. These features
generally prevent their detection by this algorithm although several helices
from four-helix bundles appear as high-scoring segments in the following
table. Results are compared to the number of segments obtained in a database 
of sequences generated by means of a random number generator (see Science 
252:1162). (1 - MTK; 2 - MTIDK; 3 - MTK_W; 4 - MTIDK_W)
           
                                                         RANDOM SEQUENCES
             28 res.        21 res.        14 res.      28      21      14
           1  2  3  4     1  2  3  4     1  2  3  4    1  2    1  2    1  2

  10-19%   8  5 11 13    37 22 24 35    96 85 99 85    1  2   12 10   51 60
  20-29%   4  1  5  3    18 14 23 14    47 33 51 45    2  1   10  5   21 26
  30-39%   2  0  2  4    14  8  9  9    29 35 42 21    2  0    7  4   14 14
  40-49%   4  0  2  5     6  2 15 10    21 14 17 19    1  0    2  1    8  9
  50-59%   2  2  1  1     1  4  4  7    11  9 11 14    0  0    1  0   10  9
  60-69%   1  0  3  6     3  4  7  5     9 11 12 14    0  0    0  0    5  6
  70-79%   3  2  2  1     4  1  6  1    12  7 12 13    0  0    2  1    6  4
  80-89%   1  2  3  1     3  4  3  4    10 14  8 18    0  0    1  2    2  5
  >= 90%   1  3  1  1     4  9  6  7    11 20  8 15    2  2    2  2    5  7

    In this table, the number of segments per 10% increment levels off above
50% rather than decreasing continuously. This is due to the sigmoid shape of
the curve that relates scores to probabilities which masks a continuing 
decrease in number of segments per score interval.
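
The sigmoid relation between scores and probabilities can be illustrated
with the two Gaussians of part A. The Python sketch below assumes that the
probability is the coiled-coil Gaussian divided by the sum of the
coiled-coil Gaussian and the globular Gaussian, the latter multiplied by a
prior ratio of globular to coiled-coil residues. The form of this expression
and the value ratio=30 are assumptions made for the example and are not
stated in this document; consult the cited papers for the values actually
used by the program.

```python
import math


def gaussian(x, mean, sd):
    """Normal probability density at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def coiled_coil_probability(score, globular, coiled, ratio=30.0):
    """Probability from a score, given (mean, sd) of both distributions.

    `ratio` weights the globular density; 30 is an assumed value.
    """
    g = gaussian(score, *globular)
    c = gaussian(score, *coiled)
    return c / (ratio * g + c)


# MTK matrix, 28 residue scan (means and standard deviations from part A).
MTK_28_GLOBULAR = (0.77, 0.20)
MTK_28_COILED = (1.63, 0.22)
```

Under these assumptions the probability stays near zero up to a score of
about 1.2, rises steeply between roughly 1.2 and 1.6, and is then close to
one, so equal probability increments in the upper half of the curve cover
ever larger score ranges.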
    Above 50%, the number of segments per 10% increment doubles from around 
2 in the 28 res. scan to around 4 in the 21 res. scan and then triples to 
around 12 in the 14 res. scan.  A similar progression at a lower level is 
observed for the random sequence database.  This progression is due to the 
significantly poorer resolution of smaller scanning windows. The difference 
in numbers between PDB and random sequences is attributable to amphipathic 
helices that are frequently present in native proteins but are not a preferred 
element of random sequences.  Outside the tail end of the score distribution 
seen in this table, the score distributions of PDB and random sequences are 
superimposable (see Science 252:1162). This means that the real resolution 
between the globular and coiled-coil score distributions is slightly lower 
than the nominal resolution.
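
The figures of around 2, 4 and 12 segments per 10% increment can be
reproduced from the table by averaging all counts at or above 50% over the
four scoring methods. The following Python sketch does this arithmetic for
the PDB columns and, for comparison, for the random sequence columns (two
methods); the variable and function names are chosen for illustration only.

```python
# Rows 50-59%, 60-69%, 70-79%, 80-89% and >= 90% of the table above.
# PDB columns: MTK, MTIDK, MTK_W, MTIDK_W. Random columns: MTK, MTIDK.
PDB = {
    28: [[2, 2, 1, 1], [1, 0, 3, 6], [3, 2, 2, 1], [1, 2, 3, 1], [1, 3, 1, 1]],
    21: [[1, 4, 4, 7], [3, 4, 7, 5], [4, 1, 6, 1], [3, 4, 3, 4], [4, 9, 6, 7]],
    14: [[11, 9, 11, 14], [9, 11, 12, 14], [12, 7, 12, 13],
         [10, 14, 8, 18], [11, 20, 8, 15]],
}
RANDOM = {
    28: [[0, 0], [0, 0], [0, 0], [0, 0], [2, 2]],
    21: [[1, 0], [0, 0], [2, 1], [1, 2], [2, 2]],
    14: [[10, 9], [5, 6], [6, 4], [2, 5], [5, 7]],
}


def mean_count(rows):
    """Average number of segments per 10% increment and scoring method."""
    counts = [n for row in rows for n in row]
    return sum(counts) / len(counts)


pdb_means = {window: mean_count(rows) for window, rows in PDB.items()}
random_means = {window: mean_count(rows) for window, rows in RANDOM.items()}
```

The PDB averages are 1.85, 4.35 and 11.95 for the 28, 21 and 14 residue
scans; the random sequence averages are 0.4, 1.1 and 5.9.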
    The weighted matrices are less reliable than the unweighted matrices.
    The MTK matrix yields fewer highscoring segments at probabilities >90%
than the MTIDK matrix and thus appears more reliable even though its nominal 
resolution is poorer. This is probably an incorrect conclusion. As is detailed 
in the next paragraph, there are now several examples of sequences that do 
not assume a coiled-coil (or even alpha-helical!) structure under normal 
circumstances but that have the potential to do so if their context is changed. 
It therefore appears likely that the sequences which are assigned elevated 
coiled-coil probabilities by the COILS program actually do have the potential to 
form coiled coils even though they do not do so in the protein context or under 
the conditions in which the structure was determined. The larger number of high-
scoring segments with the MTIDK matrix would then be the result of an increased 
sensitivity of this matrix. 
    Virtually all segments with scores above 50% in 21 and 28 residue scans
are centered on a surface helix, although several contain two discontinuous
helices rather than one continuous helix. Several of the helices are from
four-helix bundles and thus have coiled-coil characteristics. Following recent
developments, it is increasingly likely that most (if not all) of these
high-scoring sequences have an elevated coiled-coil-forming potential and
could form coiled coils in a different context. This follows from three 
recent results:
(1) A loop segment of influenza hemagglutinin (pH7 structure), which was 
predicted by COILS to have elevated coiled-coil potential, in fact forms a 
coiled coil in the pH4 structure (Bullough et al., Nature 371:37, 1994).
(2) The basic region of bZip transcription factors, which is not even 
alpha-helical in the absence of DNA, can be converted into a coiled coil by 
a designed peptide (Krylov et al., EMBO J. 14:5329, 1995).
(3) A peptide from topoisomerase II, which was identified using COILS, forms 
a coiled coil in solution but not in the structure of the full protein 
(Frere et al., J. Biol. Chem. 270:17502, 1995).
    Nevertheless, the decreased coiled-coil-forming potential of these 
sequences relative to "constitutive" coiled coils can be seen from the 
fact that they score highly in one method but generally much lower in at 
least one of the other methods; example: 5LDH - lactate dehydrogenase:

seq        CAISILGKSLTDELALVDVLEDKLKGEMMDLQHGSLFLQTP
MTK        00112444444444444444444444444444411000000
MTK_W      35678999999999999999999999999999911000000
MTIDK      00000000000000000000000000000000000000000
MTIDK_W    00012333333333333333333333333333300000000

and several segments drop considerably in score from a 28 residue scan to a 
21 residue scan; example: 2TS1 - tyrosyl-tRNA synthetase:

seq        PEKRAAQKTLAEEVTKLVHGEEALRQAIRIS
14         0001111111111111100000000000000
21         0222222222222222222222222220000
28         0777777777777777777777777777721.

The latter effect is observed particularly if a segment contains two 
discontinuous helices. These effects can be taken as indicators for a 
decreased likelihood of coiled-coil formation since neither effect is 
normally observed in coiled coils, as can be seen in part C of this section.


    C. Performance on coiled coils

In the following, secondary structure (c = coiled-coil helix) and coiled-
coil-forming probabilities are shown beneath the sequences as scored by 
MTK, MTIDK, MTK_W and MTIDK_W in that order. The values were obtained with 
a 21 residue scanning window which appears to spot the ends of coiled-coil 
segments somewhat more accurately than a 28 residue window. (For spotting
the ends of coiled-coil helices, see also the documentation for the auxiliary
program CAPS.) The coiled coils in Gal4, GreA and human mannose-binding 
protein were analyzed with a 14 residue window because of their short length.
Tropomyosin is not shown; it obtains probabilities >99% over its entire 
length except for the C-terminal 20 residues.


(C1) parallel, two-stranded structures

>GCN4 bZip (Cell 71:1223)

MKDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGER
hhhhhhhhhhhhhhhhhhhhhhhhhhcccccccccccccccccccccccccccccccc
0000000000222779999999999999999999999999999999999999988330
0000011111777999999999999999999999999999999999999999988110
0000000000000224555566699999999999999999999999999999999770
0000000000000889999999999999999999999999999999999999999770

Similar probabilities (>99%) are obtained for the bZip regions of Fos
and Jun (see Meth. Enzymology 266:513). As seen here, the ends of 
coiled-coil segments may be overpredicted significantly in the absence 
of strong flanking helix-breaking residues. This is a particular problem 
in bZip proteins, where the coiled coil follows continuously out of the 
basic-region helix. Note, though, that the basic region also has some 
coiled-coil-forming potential, as demonstrated by Krylov et al. (EMBO J. 
14:5329, 1995).


>Max b-HLH-Zip (Nature 363:38)

ADKRAHHNALERKRRDHIKDSFHSLRDSVPSLQGEKASRAQILDKATEYIQYMRRKNDTH
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhh         hhhhhhhhhhhhhhccccccc
000000000000000000000000000000000000000000111112288889999999
000000000000000000000000000000000000000000000001199999999999
000000000000000000000000000000000000000000000011155556888999
000000000000000000000000000000000000000000111113388889999999

QQDIDDLKRQNALLEQQVRALEKARSSAQLQT
ccccccccccccccccccccc
99999999999999999999999999999884
99999999999999999999999999999996
99999999999999999999999999988771
99999999999999999999999999999992


>Gal4 (Nature 356:408)

MKLLSSIEQACDICRLKKLKCSKEKPKCAKCLKNNWECRYSPKTKRSPLTRAHLTEVESRLERLEF
          hhhhhhhh         hhhhhhhh              ccccccccccccccc
000000000000000000000000000000000000000000000000014888888888888882
000000000000000000000000000000000000000000000000017999999999999992
000000000000000000000000000000000000000000000000006888888888888884
000000000000000000000000000000000000000000000000008999999999999995

COILS works well for parallel two-stranded structures (independently of the 
scoring method used) if they are solvent-exposed.  The parallel two-stranded 
coiled coil buried in CAP is entirely invisible to this program because of 
the absence of a heptad repeat.


(C2) antiparallel, two-stranded structures

>Seryl-tRNA synthetase - Escherichia coli (Nature 347:249)

MLDPNLLRNEPDAVAEKLARRGFKLDVDKLGALEERRKVLQVKTENLQAERNSRSKSIGQ
                          cccccccccccccccccccccccccccccccchh
000000000000000000000000003888888888888888888888888882100000
000000000000000000000000003999999999999999999999999993000000
000000000000000000000000003777777777777777777777773330000000
000000000000000000000000004889999999999999999999998880000000

AKARGEDIEPLRLEVNKLGEELDAAKAELDALQAEIRDIALTIPNLPADEVPVG......
hhhh    cccccccccccccccccccccccccccccccccc
000000000099999999999999999999999999999999900000000000
000007788899999999999999999999999999999999988800000000
000000000099999999999999999999999999999999955500000000
000089999999999999999999999999999999999999999933100000


>Seryl-tRNA synthetase - Thermus thermophilus (JMB 234:222)

MVDRKRLRQEPEVFHRAIREKGVALDLEALLALDREVQELKKRLQEVQTERNQVAKRVPK
                          ccccccccccccccccccccccccccccccc
000000000000000000011124599999999999999999999999999999999910
000000000000000000000013499999999999999999999999999999999986
000000000000000000022236699999999999999999999999999998887700
000000000000000000000014599999999999999999999999999999999954

APPEEKEALIARGKALGEEAKRLEEALREKEARLEALLLQVPLPPWPGAPVG........
   ccccccccccccccccccccccccccccccccccccc
0008888888888999999999999999999999999999920000000000
4009999999999999999999999999999999999999997000000000
0002224444444999999999999999999999999999932000000000
1005556677777999999999999999999999999999999000000000


>GreA transcript cleavage factor (Nature 373:636)

MQAIPMTLRGAEKLREELDFLKSVRRPEIIAAIAEAREHGDLKENAEYHAAREQQGFCEGRIKDIEAKLSNAQVID
     sscccccccccccccccc-ccccccccccccc        cccccccccccccccccccccccccc  ss
0000011366666666666666664200000000000000000000000000000002999999999999998730
0000011388888888888888888500000001111111111111100000000004999999999999997710
0000022688888888888888885300000000000000000000000000000000777777777777776630
0000033899999999999999999800000000000000000000000000000000777777777777776620

In its structural organization, GreA resembles seryl-tRNA synthetase. It is 
currently the only known coiled-coil structure with a true skip residue 
(Val34). The high scores in the two coiled-coil helices correspond to the 
segment of coiled coil that is located between the skip and the globular 
part of the protein.


>Replication terminator protein (Cell 80:651)

MKEEKRSSTGFLVKQRAFLKLYMITMTEQERLYGLKLLEVLRSEFKEIGFKPNHTEVYRSL
             hhhhhhhhhhhhhhhh ssss hhhhhhhhhhh       hhhhhhhh
0000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000

HELLDDGILKQIKVKKEGAKLQEVVLYQFKDYEAAKLYKKQLKVELDRCKKLIEKALSDNF
hhhhh   sssssss       sssssss hhhhhhhhhhccccccccccccccccccccc
0000000000000000000000000001111133666666666666666666666655540
0000000000000000000000000000033333444488888888888888888888880
0000000000000000000000000002222233555555555555555555555533320
0000000000000000000000000001133344555588888888888888888888882

COILS is also generally reliable in the analysis of antiparallel two-stranded
coiled coils, but does not detect the DNA-binding coiled coil in serum response
factor (Nature 376:490), which, because of its special function, has a very 
distinct residue distribution.


(C3) parallel, three-stranded structures

>hemagglutinin (Nature 333:426 and 371:37)

GLFGAIAGFIENGWEGMIDGWYGFRHQNSEGTGQAADLKSTQAAIDQINGKLNRVIEKTN
                                     hhhhhhhhhhhhhhhhhh        pH7
                                       ccccccccccccccccccccc   pH4
000000000000000000000000000000001223466666666666666666666658
000000000000000000000000000000000222455555555555667888888889
000000000000000000000000000000000122344444444444444444444402
000000000000000000000000000000000111222222222222222222222211

EKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFE
               ccccccccccccccccccccccccccccccccccccccccccccc   pH7
ccccccccccccccccccccccccccccccccccccccccccccc       hhhhhhhh   pH4
999999999999999999999999999988800000000000000000000144444444
999999999999999999999999999766611111110000000000000288888888
333377777788888888888888888888800000000000000000000000000000
333355555555555555555555555555533333331000000000000033333333

KTRRQLRENAEEMGNGCFKIYHKCDNACIESIRNGTYDHDVYRDEALNNRFQIKG
cccccc                                                         pH7
hhhhhhhhh                                                      pH4
4444444444444220000000000000000000000000000000000000000
8888888888888440000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000
3333333333333110000000000000000000000000000000000000000

Influenza hemagglutinin is a complex structure that undergoes a large 
structural transition between pH7 and pH4. Several lines of evidence 
indicate that the structure at pH7 is only metastable.


>Mannose-binding protein A, rat (Structure 2:1227)

AIEVKLANMEAEINTLKSKLELTNKLHAFSMGKKSGKKFFVTNHERMPFSKVKALCSELRGTVAIPRNAEENKAI
cccccccccccccccccccccccccccccc        sssssssss hhhhhhhhhh   ss     hhhhhhh
999999999999999999999999997731000000000000000000000000000000000000000000000
999999999999999999999999998830000000000000000000000000000000000000000000000
999999999999999999999999993320000000000000000000000000000000000000000000000
999999999999999999999999995520000000000000000000000000000000000000000000000

QEVAKTSAFLGITDEVTEGQFMYVTGGRLTYSNWKKDEPNDHGSGEDCVTIVDNGLWNDISCQASHTAVCEFPA
hhhh   ssssss        ss                       sssss     ssss     sssssss
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000


>Mannose-binding protein C, human (Nature Struct. Biol. 1:789)

        AASERKALQTEMARIKKWLTFSLGKQVGNKFFLTNGEIMTFEKVKALCVKFQASVATPRNAAENGAI
          cccccccccccccccccccc  sss  ssssssssssshhhhhhhhhh   ss     hhhhhhh
        2246666666666666600000000000000000000000000000000000000000000000000
        5579999999999999900000000000000000000000000000000000000000000000000
        2222222222222222200000000000000000000000000000000000000000000000000
        5555555555555555500000000000000000000000000000000000000000000000000

QNLIKEEAFLGITDEKTEGQFVDLTGNRLTYTNWNEGEPNNAGSDEDCVLLLKNGQWNDVPCSTSHLAVCEFPI
hhh    ssssss        ss                        ssss     ssss    sssssssss
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000


(C4) antiparallel, three-stranded structures

>coil-Ser (Science 259:1288)

EWEALEKKLAALESKLQALEKKLEALEHG
ccccccccccccccccccccccccccccc
99999999999999999999999999999
99999999999999999999999999999
99999999999999999999999999999
99999999999999999999999999999

This is an unusual homotrimeric structure that was produced incidentally 
to the design of a two-stranded coiled coil.


>spectrin (Science 262:2027)

NLDLQLYMRDCELAESWMSAREAFLNADDDANAGGNVEALIKKHEDFDKAINGHEQKIAA
cccccccccccccccccccccccccccc        cccccccccccccccccccccccc
000000000000000000000000000000111114466667777777777777777777
000000000000000000000000000000000003355558888888888888888888
000000000000000000000000000000000000011111111117777777777777
000000000000000000000000000000000000011113333339999999999999

LQTVADQLIAQNHYASNLVDEKRKQVLERWRHLKEGLIEKRSRLGD
cccccccccc     ccccccccccccccccccccccccccccccc
7777777777742220000000000000000000000000000000
8888888888863110000000022222222222222222222200
7777777777755552211110000000000000000000000000
9999999999977773322220044444444444444444444400

As an antiparallel three-helix bundle, spectrin is already fairly far removed
from the reference set of parallel two-stranded structures that is used for
scoring. Accordingly, as with four-helix bundles, the program has problems
identifying all the helices in the structure.  While this does not make the
prediction of helix B as a coiled coil incorrect, it makes it rather useless
and indeed misleading for model-building. In the long run, scoring matrices
that are specific for helical bundles should be the answer, but my experiments 
with a matrix derived from four-helix bundles (Paliakasis & Kokkinidis, 
Prot.Eng. 5:739) show that the ones currently available have little
predictive power. Even in the absence of such matrices, the prediction can be
improved significantly using the auxiliary programs ALIGNED20/80 if homologous 
sequences are available for a protein. Their application to spectrin is shown 
in the documentation file ALIGNED.DOC.
    One of the specific problems of the program with helix A of spectrin
is the presence of Trp and Phe residues in position g of the heptad repeat.
These residues are very rare at that position in both two-stranded and
three-stranded coiled coils. Such residues can occur, or even be important,
in certain structures even though they are disfavored in most others. It is
therefore recommended that a protein with a single peak also be analyzed
with all rare residues (W, C, P) replaced by Ala. The emergence of more peaks
indicates the presence of a helical bundle. Also, if a protein that one
suspects may form a helical bundle has a peak that occurs only in a 14
residue scan, one should check whether replacing a single unfavorable
residue (e.g. D in a) by Ala greatly lengthens the predicted helix or
significantly raises its score. Such "wrong" residues may actually help to
build a model, since their presence needs to be accounted for and limits
the possibilities.
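
    The substitutions recommended above are easy to prepare before a second
scan. The following is a minimal sketch in Python, not part of the COILS
distribution; the function names replace_rare_residues and replace_single
are hypothetical and chosen only for illustration. It produces the modified
sequences, which would then have to be submitted to COILS separately.

```python
# Hypothetical helpers, not part of COILS: they only prepare the
# Ala-substituted sequences that the text suggests scanning as well.

RARE_RESIDUES = "WCP"  # the rare residues named in the text


def replace_rare_residues(sequence, rare=RARE_RESIDUES, replacement="A"):
    """Return the sequence with every residue listed in `rare` replaced."""
    table = str.maketrans({residue: replacement for residue in rare})
    return sequence.upper().translate(table)


def replace_single(sequence, position, replacement="A"):
    """Replace the residue at a 1-based position (e.g. a D in position a)."""
    if not 1 <= position <= len(sequence):
        raise ValueError("position lies outside the sequence")
    return sequence[:position - 1] + replacement + sequence[position:]


# First 30 residues of the spectrin sequence shown above.
helix_a = "NLDLQLYMRDCELAESWMSAREAFLNADDD"
print(replace_rare_residues(helix_a))
print(replace_single(helix_a, 3))
```

Comparing the scan of the original sequence with the scans of the modified
ones then shows whether additional peaks emerge or an existing peak grows.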


(C5) other antiparallel helical bundles

>ApoE (Science 252:1817)

GQRWELALGRFWDYLRWVQTLSEQVQEELLSSQVTQELRALMDETMKELKAYKSELEEQL
 ccccccccccccccccccc hhhhhhhhhhcccccccccccccccccccccccccccc
000000000000000000000000000000013379999999999999999999999999
000000000000000000000000000000026699999999999999999999999999
000000000000000000000000000000001129999999999999999999999999
000000000000000000000000000001689999999999999999999999999999

TPVAEETRARLSKELQAAQARLGADMEDVCGRLVQYRGEVQAMLGQSTEELRVRLASHLR
    cccccccccccccccccccccccccccccccccccc       ccccccccccccc
818999999999999999999999533331111000000000000111111114478999
889999999999999999999999733330000000000000000444444445589999
959999999999999999999999444441111111111111000333333336689999
999999999999999999999999433331111111111110011888888888899999

KLRKRLLRDADDLQKRLAVYQAGA
cccccccccccccccccccccc
999999999999999999988877
999999999999999999999855
999999999999999999999999
999999999999999999999999

The prediction for ApoE is good for the three-stranded part but much
poorer for the four-stranded part: the short N-terminal helix 1 is not seen
by the program, partly because of its length but mostly because of the three
Trp residues, and the C-terminus of helix 3 and the N-terminus of helix 4 
which interact with helix 1 also obtain low scores.  This brings me to:


    D. Limits of the method

    As can be seen from the examples given, the program works well for 
parallel two-stranded structures that are solvent-exposed but runs 
progressively into problems with the addition of more helices, their
antiparallel orientation and their decreasing length. The program fails
entirely on buried structures.  Limits are also set by the statistical
noise which greatly decreases the usefulness of small scanning windows.
Finally, the possibility that sequences with good coiled-coil potential
do not form a coiled coil because of constraints from other parts of the
sequence may add a further limit to the accuracy of the program.
    Because many reasons can lead the program to miss a helix while the
conditions for detection are quite stringent, the absence of a peak is
not nearly as conclusive as the presence of a peak.  The effect of this on
the interpretation of scores from multiple alignments is discussed in
ALIGNED.DOC.
What I believe one can conclude safely from the absence of a peak is that
no solvent-exposed two- or three-stranded coiled-coil of length greater than 
approximately 20 residues is present in the protein. 

------------------------------------------------------------------------------

    8. RECOMMENDATIONS FOR USING THE PROGRAM

    COILS is specific for solvent-exposed, left-handed coiled coils. Other 
types of coiled-coil structure, such as buried coiled coils (e.g. the central
coiled coil in catabolite repressor protein, or some transmembrane domains) 
and right-handed coiled coils, are not detected by the program.

    COILS does not reach yes-or-no decisions based on a threshold value. 
Rather, it yields a set of probabilities that presumably reflect the 
coiled-coil forming potential of a sequence. This means that even at high 
probabilities (e.g. >90%), there will be (and should be) sequences that in 
fact do not form a coiled coil, though they may have the potential to do so 
in a different context. 

    COILS is biased towards hydrophilic, highly charged sequences. For 
this reason, all scans should be performed with a weighted and an unweighted 
matrix, and the results compared. Differences of more than 20-30 percentage 
points in the probabilities should be taken to indicate that a coiled-coil 
structure is unlikely, the elevated scores being mainly due to the high 
incidence of charged residues (note, though, that this would have marked
human mannose-binding protein as a false positive).
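
    The comparison of two scans can be sketched as follows. This is a
minimal Python illustration, not part of COILS; it assumes that the
per-residue probabilities of both scans have already been read into lists
of percentages, and the function name flag_discrepancies and the default
cutoff of 20 points (the lower end of the range given above) are
assumptions made for the example.

```python
# Hypothetical helper, not part of COILS. Both profiles are assumed to
# hold one probability (in percent) per residue of the same sequence.

def flag_discrepancies(profile_a, profile_b, max_difference=20.0):
    """Return the 0-based positions where the two profiles differ by
    more than `max_difference` percentage points."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same length")
    return [
        index
        for index, (a, b) in enumerate(zip(profile_a, profile_b))
        if abs(a - b) > max_difference
    ]


unweighted = [95.0, 92.0, 90.0, 40.0]  # invented values for illustration
weighted = [90.0, 60.0, 55.0, 35.0]
print(flag_discrepancies(unweighted, weighted))  # prints [1, 2]
```

The same check could be applied to any pair of profiles obtained for one
sequence; flagged positions are only a prompt for closer inspection.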

    The MTK and MTIDK matrices both assign high probabilities to known 
coiled-coil segments, but identify different helices at high probability
in a database of globular proteins. This is a surprising feature whose 
reason is as yet unclear, but which can be exploited for predictive purposes. 
It is therefore useful to compare the results of scans made with the two 
matrices. Again, differences of more than 20-30 percentage points in the 
probabilities should be taken to indicate that a coiled-coil structure is 
unlikely (note, though, that this threshold would make the replication 
terminator protein a border-line case).

    The resolution between globular and coiled-coil score distributions 
decreases strongly with a decreasing size of the scanning window. The 
prediction of new coiled-coil segments should therefore be made using a 
28 residue window, or in special cases a 21 residue window. 14 residue 
windows should normally be reserved for the analysis of local parameters 
(such as the frame) in known or predicted coiled coils.

    The ends of coiled-coil segments appear to be most accurately identified 
in a 21 residue window. In general, I assume that residues with probabilities 
>50% are part of a coiled-coil segment. In addition, a search for the most 
likely helix ends using CAPS is generally useful (see also the CAPS 
documentation). 
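
    The 50% rule of thumb can be illustrated by a short Python sketch. It
is not part of COILS; the function name coiled_coil_segments is
hypothetical, and the probabilities are assumed to be available as a list
of percentages, one per residue.

```python
# Hypothetical helper, not part of COILS. It collects runs of residues
# whose probability exceeds a cutoff (50% in the text above).

def coiled_coil_segments(probabilities, threshold=50.0):
    """Return (start, end) pairs, 0-based and end-exclusive, for every
    run of consecutive residues with probability > threshold."""
    segments = []
    start = None
    for index, probability in enumerate(probabilities):
        if probability > threshold and start is None:
            start = index                    # a run begins here
        elif probability <= threshold and start is not None:
            segments.append((start, index))  # the run ended before index
            start = None
    if start is not None:                    # run reaches the sequence end
        segments.append((start, len(probabilities)))
    return segments


profile = [0, 10, 60, 90, 90, 40, 0, 70, 80]  # invented values
print(coiled_coil_segments(profile))  # prints [(2, 5), (7, 9)]
```

The boundaries found this way are only a first estimate of the segment
ends and do not replace a search for helix ends with CAPS.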

    Sequences with high coiled-coil probability from globular proteins
rarely exceed a length of 30 residues. None is longer than 35 residues. 
Sequences with probabilities >80-90% that extend for more than 35 residues 
are therefore more likely to assume a coiled-coil structure than is indicated 
by the obtained probabilities.
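
    A length check of this kind might be sketched in Python as below. The
sketch is not part of COILS; the function name
long_high_probability_runs is hypothetical, the cutoff of 80% is taken
from the lower end of the range given above, and the probabilities are
assumed to be available as a list of percentages, one per residue.

```python
# Hypothetical helper, not part of COILS. It reports high-probability
# runs that are longer than the longest such runs seen in globular
# proteins according to the text (35 residues).

from itertools import groupby


def long_high_probability_runs(probabilities, min_probability=80.0,
                               max_globular_length=35):
    """Return (start, end) pairs, 0-based and end-exclusive, of runs with
    probability > min_probability and length > max_globular_length."""
    runs = []
    position = 0
    for is_high, group in groupby(probabilities,
                                  key=lambda p: p > min_probability):
        length = len(list(group))
        if is_high and length > max_globular_length:
            runs.append((position, position + length))
        position += length
    return runs


# Invented profile: one run of 40 residues and one of 30 residues.
profile = [10] * 5 + [95] * 40 + [10] * 5 + [95] * 30
print(long_high_probability_runs(profile))  # prints [(5, 45)]
```

Only the 40 residue run is reported; the 30 residue run is within the
range that the text describes for globular proteins.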

    Where possible, sequences related to the protein of interest should 
also be analyzed for predicted coiled-coil segments (see the section on 
the ALIGNED programs). It should be kept in mind, though, that the 
sequences must be related in the region of high scores in order for the 
comparison to be significant. 

    Comparison of the coiled-coil prediction with predictions of the 
secondary structure is generally useful, particularly if multiple
related sequences are available.