*************************
**                     **
**  COILS version 2.2  **
**                     **
*************************

by A. N. Lupas
programmed by J. M. Lupas

1. Introduction
2. Installation
3. Input file formats
4. Scoring options
5. Weighting options
6. Output options
7. Performance:
   A. Database statistics
   B. Highscoring sequences in globular proteins
   C. Performance on coiled coils
   D. Limits of the method
8. Recommendations for using the program

------------------------------------------------------------------------------

1. INTRODUCTION

COILS is a program that compares a sequence to a database of known parallel
two-stranded coiled-coils and derives a similarity score. By comparing this
score to the distribution of scores in globular and coiled-coil proteins, the
program then calculates the probability that the sequence will adopt a
coiled-coil conformation.

COILS is described in

   Lupas, A., Van Dyke, M., and Stock, J. (1991) Predicting Coiled Coils from
   Protein Sequences, Science 252:1162-1164,

   Lupas, A. (1996) Prediction and Analysis of Coiled-Coil Structures,
   Meth. Enzymology 266:513-525,

and is based on a prediction protocol proposed by David Parry:

   Parry, D. A. D. (1982) Coiled-coils in alpha-helix-containing proteins:
   analysis of the residue types within the heptad repeat and the use of
   these data in the prediction of coiled coils in other proteins,
   Biosci. Rep. 2:1017-1024.

------------------------------------------------------------------------------

2. INSTALLATION

Source codes for COILS and for the auxiliary programs ALIGNED20, ALIGNED80,
ALLFRAME and CAPS can be downloaded by anonymous FTP from the VMS folder of
the server FTP.BIOCHEM.MPG.DE. The programs are written in VAX Pascal and can
be compiled under VAX/VMS or OpenVMS. To run the programs on your machine,
download the source code (e.g. COILS2.PAS) and compile it with the commands
PASCAL COILS2 and LINK COILS2.
If you encounter problems during compilation, send the error messages by
e-mail to LUPAS@VMS.BIOCHEM.MPG.DE and I will try to help.

------------------------------------------------------------------------------

3. INPUT FILE FORMATS

COILS accepts files in the following formats:

(a) GCG:

P1;MULI_ERWAM - MAJOR OUTER MEMBRANE LIPOPROTEIN PRECURSOR (MUREIN-LIPOPROTEIN)
ID   MULI_ERWAM STANDARD; PRT; 78 AA.
AC   P02939;
DT   21-JUL-1986 (REL. 01, CREATED)
DT   21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
DT   01-APR-1988 (REL. 07, LAST ANNOTATION UPDATE)
DE   MAJOR OUTER MEMBRANE LIPOPROTEIN PRECURSOR (MUREIN-LIPOPROTEIN).
OS   ERWINIA AMYLOVORA.
OC   PROKARYOTA; GRACILICUTES; SCOTOBACTERIA; FACULTATIVELY ANAEROBIC RODS;
OC   ENTEROBACTERIACEAE.
RN   [1]
RP   SEQUENCE FROM N.A.
RM   81117327
RA   YAMAGATA H., NAKAMURA K., INOUYE M.;
RL   J. BIOL. CHEM. 256:2194-2198(1981).
DR   EMBL; J01577; EALPP.
DR   PIR; A03439; NPWCWY.
DR   PROSITE; PS00013; PROKAR_LIPOPROTEIN.
KW   SIGNAL; OUTER MEMBRANE; LIPOPROTEIN; DUPLICATION.
FT   SIGNAL 1 20
FT   CHAIN 21 78 MUREIN-LIPOPROTEIN.
FT   LIPID 21 21 N-ACYL DIGLYCERIDE.
FT   REPEAT 24 34
FT   REPEAT 38 48
SQ   SEQUENCE 78 AA; 8369 MW; 24285 CN;
Muli_Erwam Length: 78 January 21, 1994 16:04 Type: P Check: 4477  ..

 1 MNRTKLVLGA VILGSTLLAG CSSNAKIDQL STDVQTLNAK VDQLSNDVTA
51 IRSDVQAAKD DAARANQRLD NQAHSYRK

(b) Pearson (FASTA):

>MULI_ECOLI - MAJOR OUTER MEMBRANE LIPOPROTEIN PRECURSOR
MKATKLVLGAVILGSTLLAGCSSNAKIDQLSSDVQTLNAKVDQLSNDVNAMRSDVQAAKD
DAARANQRLDNMATKYRK

(c) user-defined:

The program recognizes the start of a sequence by a > at the beginning or a
[space,space,dot,dot] at the end of the line preceding the sequence.
The program recognizes the end of a sequence by a *, a //, or by the
end-of-file character.
The program accepts sequences in upper- and lower-case letters and ignores
all spaces, numbers and other characters not representing an amino acid.
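The recognition rules above can be sketched in a few lines of Python. This is
only an illustration of the rules as stated in this section, not a port of
COILS2.PAS: the function name read_sequences and the choice of the 20 standard
one-letter codes as "characters representing an amino acid" are assumptions,
and the sketch leniently closes an unterminated sequence when it meets the
next start marker.

```python
# Assumption: "amino acid" characters are the 20 standard one-letter codes.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def read_sequences(text):
    """Return (header, sequence) pairs following the start/end rules above.

    read_sequences is a hypothetical helper, not part of COILS.
    """
    records = []
    header, residues = None, None  # residues is None while outside a sequence
    for line in text.splitlines():
        stripped = line.rstrip()
        # Start marker: ">" at the beginning, or "  .." at the end of the line.
        if line.startswith(">") or stripped.endswith("  .."):
            if residues:  # lenient: close a sequence that had no end marker
                records.append((header, "".join(residues)))
            header, residues = stripped, []
            continue
        if residues is None:
            continue  # annotation text before any start marker is skipped
        # End marker: "*" or "//"; keep only what precedes the first one.
        cut, ended = len(stripped), False
        for marker in ("*", "//"):
            pos = stripped.find(marker)
            if pos != -1 and pos < cut:
                cut, ended = pos, True
        # Case-insensitive; spaces, digits and other characters are dropped.
        residues.extend(c for c in stripped[:cut].upper() if c in AMINO_ACIDS)
        if ended:
            records.append((header, "".join(residues)))
            header, residues = None, None
    if residues:  # end of file also ends the last sequence
        records.append((header, "".join(residues)))
    return records


if __name__ == "__main__":
    demo = ">A test\nMKATK lvl 12\nGAV*\nHeader line  ..\n 1 MNRT KLV\n"
    for name, seq in read_sequences(demo):
        print(name, seq)
```

Because everything before a start marker is skipped, the annotation lines of a
GCG entry never reach the residue filter; only the numbered sequence lines
after the ".." line are read.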
If a file contains several proteins, the end of each sequence but the last
must be marked by * or by //:

>M_ECOLI P1;MULI_ECOLI - MAJOR OUTER MEMBRANE LIPOPROTEIN PRECURSOR
MKATKLVLGAVILGSTLLAGCSSNAKIDQLSSDVQTLNAKVDQLSNDVNAMRSDVQAAKD
DAARANQRLDNMATKYRK*
>M_ERWAM P1;MULI_ERWAM - MAJOR OUTER MEMBRANE LIPOPROTEIN PRECURSOR
MNRTKLVLGAVILGSTLLAGCSSNAKIDQLSTDVQTLNAKVDQLSNDVTAIRSDVQAAKD
DAARANQRLDNQAHSYRK*
>M_MORMO P1;MULI_MORMO - MAJOR OUTER MEMBRANE LIPOPROTEIN PRECURSOR
MGRSKIVLGAVVLASALLAGCSSNAKFDQLDNDVKTLNAKVDQLSNDVNAIRADVQQAKD
EAARANQRLDNQVRSYKK

------------------------------------------------------------------------------

4. SCORING OPTIONS

After asking for input and output filenames, the program will offer the
choice of two scoring matrices that it can compare a sequence to:

MTK   - is a matrix derived from the sequences of myosins, tropomyosins and
        keratins (intermediate filaments type I and II). It is the one
        described in Science, 252:1162 (1991).

MTIDK - is a new matrix derived from myosins, paramyosins, tropomyosins,
        intermediate filaments type I - V, desmosomal proteins and kinesins.
        The matrix was compiled by weighting the residue frequencies of the
        different protein families according to the following scheme:

        0.2 MYOSINS                 - 0.5  myosins
                                    - 0.5  paramyosins
        0.2 TROPOMYOSINS
        0.2 INTERMEDIATE FILAMENTS  - 0.2  type I (keratin)
                                    - 0.2  type II (keratin)
                                    - 0.2  type III (desmin, vimentin, GFAP,
                                           peripherin)
                                    - 0.2  type IV (NF light, medium and
                                           heavy chains)
                                    - 0.2  type V (lamins A and B)
        0.2 DESMOSOMAL PROTEINS     - 0.33 desmoplakin
                                    - 0.33 plectin
                                    - 0.33 hemidesmosomal plaque prot.
                                           (bullous pemphigoid)
        0.2 KINESINS

While the MTIDK matrix provides for a somewhat better resolution between the
scores of globular and coiled-coil proteins as well as for a more consistent
evaluation of the different families of coiled-coil proteins, the MTK matrix
yields fewer highscoring segments in a database of globular sequences (see
Section 7: PERFORMANCE).
Current data are consistent with the assumption that the MTK matrix is more
specific for two-stranded structures and that the MTIDK matrix gives a more
realistic assessment for other types of coiled coils.

------------------------------------------------------------------------------

5. WEIGHTING OPTIONS

Because coiled coils are generally fibrous, solvent-exposed structures, all
but the internal a and d positions have a high likelihood of being occupied
by hydrophilic residues. A program that gives equal weight to all positions
is therefore going to be biased towards hydrophilic, charge-rich sequences.
While this does not pose a problem for the vast majority of natural
sequences, some highly charged sequences obtain high coiled-coil
probabilities in the obvious absence of heptad periodicity and
coiled-coil-forming potential. An extreme case is that of polyglutamate,
which obtains a coiled-coil-forming probability > 99%.

To counter this problem, COILS2 contains a weighting option, which allows the
user to assign the same weight to the two hydrophobic positions a and d as to
the five hydrophilic positions b, c, e, f and g. This degrades the
performance of the program only slightly (see Section 7: PERFORMANCE) and
permits the identification of the class of false positives described above.
It is recommended to run a weighted and an unweighted scan in parallel and to
compare the outputs. A drop of more than 20-30% in the probability is a clear
indication of a highly-charged false positive.
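The recommended comparison of a weighted and an unweighted scan can be
sketched as follows. This is a minimal Python illustration, not code from
COILS: the function name, the 0.25 threshold (picked from inside the 20-30%
range and read here as percentage points) and the sample numbers are all
assumptions. The per-position weight of 2.5 is likewise only one reading of
"the same weight to the two hydrophobic positions as to the five hydrophilic
positions" (2 x 2.5 = 5 x 1.0), inferred from the text rather than taken from
the program source.

```python
# Assumption: equal total weight for {a, d} and {b, c, e, f, g} is read as
# 2 * w = 5 * 1.0, so each core position gets w = 2.5 (inferred, not from COILS2.PAS).
POSITION_WEIGHTS = {pos: (2.5 if pos in "ad" else 1.0) for pos in "abcdefg"}


def flag_charged_false_positives(unweighted, weighted, min_drop=0.25):
    """Return 0-based residue indices whose probability drops by more than min_drop.

    unweighted, weighted: per-residue probabilities (0.0 to 1.0) from two scans
    of the same sequence with the same matrix and window size.
    min_drop: hypothetical threshold in percentage points, expressed as a fraction.
    """
    if len(unweighted) != len(weighted):
        raise ValueError("both scans must cover the same residues")
    return [i for i, (u, w) in enumerate(zip(unweighted, weighted)) if u - w > min_drop]


if __name__ == "__main__":
    # Made-up probabilities for a five-residue stretch, for illustration only.
    unweighted = [0.99, 0.99, 0.95, 0.60, 0.10]
    weighted = [0.02, 0.05, 0.90, 0.55, 0.08]
    print(flag_charged_false_positives(unweighted, weighted))  # prints [0, 1]
```

Residues whose probability survives weighting are left unflagged, so the
helper only singles out the charge-driven peaks that this section warns about.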
Two examples (window=21, probabilities abbreviated to the first digit):

sequence  EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
MTK       99999999999999999999999999999999999999999999999999
MTIDK     99999999999999999999999999999999999999999999999999
MTK_W     00000000000000000000000000000000000000000000000000
MTIDK_W   00000000000000000000000000000000000000000000000000

sequence  DDEKRKEKKDKKEKEKERRREKEKKEKEKEKERREKKKRKREEDDEEKKE
MTK       47888999999999999999999999999999999999999998766666
MTIDK     99999999999999999999999999999999999999999999999999
MTK_W     00000000000000000000000000000000000000000000000000
MTIDK_W   00111111122222333333333333333333333222222220000000

In many cases a 21 residue scan yields clearer results than a 28 residue
scan.

As an alternative, it is possible to use the auxiliary program ALLFRAME. This
program lists the scores (not probabilities) of a sequence in all seven
frames. The presence and strength of a heptad periodicity can be inferred
directly from the difference between the highest-scoring frame and all
others:

 1 E a 1.99 b 1.99 c 1.99 d 1.99 e 1.99 f 1.99 g 1.99
 2 E b 1.99 c 1.99 d 1.99 e 1.99 f 1.99 g 1.99 a 1.99
 3 E c 1.99 d 1.99 e 1.99 f 1.99 g 1.99 a 1.99 b 1.99
 4 E d 1.99 e 1.99 f 1.99 g 1.99 a 1.99 b 1.99 c 1.99
 5 E e 1.99 f 1.99 g 1.99 a 1.99 b 1.99 c 1.99 d 1.99
 6 E f 1.99 g 1.99 a 1.99 b 1.99 c 1.99 d 1.99 e 1.99
 7 E g 1.99 a 1.99 b 1.99 c 1.99 d 1.99 e 1.99 f 1.99
 8 E a 1.99 b 1.99 c 1.99 d 1.99 e 1.99 f 1.99 g 1.99
 9 E b 1.99 c 1.99 d 1.99 e 1.99 f 1.99 g 1.99 a 1.99
10 E c 1.99 d 1.99 e 1.99 f 1.99 g 1.99 a 1.99 b 1.99
.....
 1 D a 1.21 b 1.38 c 1.51 d 1.42 e 1.58 f 1.18 g 1.14
 2 D b 1.38 c 1.39 d 1.51 e 1.42 f 1.63 g 1.19 a 1.16
 3 E c 1.40 d 1.41 e 1.58 f 1.52 g 1.66 a 1.22 b 1.26
 4 K d 1.40 e 1.41 f 1.58 g 1.52 a 1.66 b 1.29 c 1.26
 5 R e 1.46 f 1.41 g 1.58 a 1.52 b 1.66 c 1.31 d 1.27
 6 K f 1.46 g 1.41 a 1.58 b 1.52 c 1.66 d 1.31 e 1.27
 7 E g 1.46 a 1.41 b 1.58 c 1.52 d 1.66 e 1.31 f 1.27
 8 K a 1.46 b 1.41 c 1.58 d 1.52 e 1.66 f 1.31 g 1.27
 9 K b 1.46 c 1.41 d 1.58 e 1.52 f 1.66 g 1.31 a 1.27
10 D c 1.46 d 1.41 e 1.58 f 1.52 g 1.66 a 1.31 b 1.27
.....

In both examples, the absence of a heptad periodicity is obvious. For
comparison, here are scores for the GCN4 leucine zipper; the heptad frame
with the leucines in position d is immediately apparent:

.....
30 L b 0.79 c 1.17 d 1.91 e 0.98 f 0.94 g 1.21 a 1.17
31 E c 0.79 d 1.17 e 1.91 f 0.98 g 0.96 a 1.21 b 1.17
32 D d 0.79 e 1.17 f 1.91 g 0.98 a 0.96 b 1.21 c 1.17
33 K e 0.79 f 1.17 g 1.91 a 0.98 b 1.02 c 1.21 d 1.17
34 V f 0.79 g 1.17 a 1.91 b 1.02 c 1.02 d 1.21 e 1.17
35 E g 0.79 a 1.17 b 1.91 c 1.02 d 1.02 e 1.21 f 1.17
36 E a 0.78 b 1.17 c 1.91 d 1.02 e 1.02 f 1.21 g 1.17
37 L b 1.02 c 1.17 d 1.91 e 1.02 f 1.02 g 1.19 a 1.17
38 L c 1.02 d 1.17 e 1.91 f 1.02 g 1.02 a 1.19 b 1.17
39 S d 1.02 e 1.17 f 1.91 g 1.02 a 1.02 b 1.17 c 1.17
40 K e 1.02 f 1.17 g 1.91 a 1.02 b 1.02 c 1.17 d 1.17
41 N f 1.02 g 1.17 a 1.91 b 1.02 c 1.02 d 1.06 e 1.17
42 Y g 1.02 a 1.17 b 1.91 c 1.02 d 1.02 e 1.02 f 1.17
43 H a 1.02 b 1.17 c 1.91 d 1.02 e 1.02 f 1.02 g 1.17
44 L b 1.02 c 1.17 d 1.91 e 1.02 f 1.02 g 1.02 a 1.17
45 E c 1.02 d 1.17 e 1.91 f 1.02 g 1.02 a 1.02 b 1.17
46 N d 1.02 e 1.17 f 1.91 g 1.02 a 1.02 b 1.02 c 1.17
47 E e 1.02 f 1.17 g 1.91 a 1.02 b 1.02 c 1.02 d 1.17
48 V f 1.02 g 1.10 a 1.91 b 1.02 c 1.02 d 1.02 e 1.17
49 A g 1.02 a 1.10 b 1.91 c 1.02 d 1.02 e 1.02 f 1.17
50 R a 1.02 b 1.10 c 1.91 d 1.02 e 1.02 f 1.02 g 1.17
51 L b 1.02 c 1.04 d 1.91 e 1.02 f 1.02 g 1.02 a 1.17
.....
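The frame comparison described in this section can be tried on rows like the
ones above with a short Python sketch. It is an illustration only: the helper
names are hypothetical, the row layout (position, residue, then seven
register/score pairs) is taken from the listings in this section, and
summarizing "the difference between the highest-scoring frame and all others"
by the gap to the runner-up frame is a simplifying assumption.

```python
def parse_allframe_row(line):
    """Split one ALLFRAME-style row into (position, residue, registers, scores)."""
    fields = line.split()
    if len(fields) != 16:
        raise ValueError("expected position, residue and seven register/score pairs")
    position, residue = int(fields[0]), fields[1]
    registers = fields[2::2]                            # a..g letters, one per frame
    scores = [float(value) for value in fields[3::2]]   # matching frame scores
    return position, residue, registers, scores


def frame_gap(scores):
    """Score of the best frame minus the score of the runner-up frame."""
    ranked = sorted(scores, reverse=True)
    return ranked[0] - ranked[1]


def best_register(line):
    """Register letter of the top-scoring frame and its gap to the runner-up."""
    _, _, registers, scores = parse_allframe_row(line)
    top = max(range(len(scores)), key=scores.__getitem__)
    return registers[top], frame_gap(scores)


if __name__ == "__main__":
    poly_e = "1 E a 1.99 b 1.99 c 1.99 d 1.99 e 1.99 f 1.99 g 1.99"
    gcn4 = "30 L b 0.79 c 1.17 d 1.91 e 0.98 f 0.94 g 1.21 a 1.17"
    for row in (poly_e, gcn4):
        register, gap = best_register(row)
        print(register, round(gap, 2))
```

For the polyglutamate row the gap is zero, while the GCN4 row puts its leucine
in register d with a gap of about 0.7 score units, which is the contrast the
listings are meant to show.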
------------------------------------------------------------------------------

6. OUTPUT OPTIONS

COILS2 offers four output options:

The default option gives residue number, residue type and the frame and
coiled-coil-forming probability obtained in scanning windows of 14, 21 and 28
residues:

.....
61 E c 0.317 c 0.379 c 0.562
62 L d 0.317 d 0.379 d 0.562
63 E e 0.317 e 0.379 e 0.562
64 L f 0.167 f 0.379 f 0.562
65 T c 0.472 c 0.598 g 0.562
66 H d 0.472 d 0.740 a 0.562
67 R e 0.916 e 0.740 e 0.677
68 K f 0.943 f 0.740 f 0.677
69 M g 0.943 g 0.740 g 0.677
70 K a 0.943 a 0.740 a 0.677
71 D b 0.943 b 0.740 b 0.677
.....

Option a is similar to the default option, except that the results are
displayed in rows. As a result, residue numbers are indicated by a scale
above the sequence, probabilities are abbreviated to the first digit (100% is
also shown as 9) and the frames for the three scans are listed below the
probabilities. This option gives a good overview of the location of peaks in
a protein:

.....
61  .    |    .    |    .    |    .    |    .    |    .    |
ELELTHRKMKDAYEEEIKHLKLGLEQRDHQIASLTVQQQRQQQQQQQVQQHLQQQQQQLA
111144999999999999999777770000000000000000000333333333333332
333357777777777777777777772222222200004444444444444444444443
111112666666666666666666666666666654422222222222222222222222
cdefcdefgabcdefgabcdefgabcdefgabcdefdefgabcdebcdefgabcdefgab
cdefcdefgabcdefgabcdefgabcdefgabcdefgabcdefgabcdefgabcdefgab
cdefcdefgabcdefgabcdefgabcdefgabcdefgabcdefgabcdefgabcdefgab
.....

Option b asks the user for the size of the scanning window and returns scores
only. This option allows the user to inspect the scores behind the
probabilities given in the previous options and to scan sequences with window
sizes for which no statistics are currently available. For an application,
see Seo, J. and Cohen, C. (1993) Pitch diversity in alpha-helical coiled
coils, Proteins 15:223-234.

.....
61 E c 1.59
62 L d 1.59
63 E e 1.59
64 L f 1.50
65 T c 1.65
66 H d 1.65
67 R e 1.90
68 K f 1.94
69 M g 1.94
70 K a 1.94
71 D b 1.94
.....

Option c is useful for scanning very large proteins or files containing many
proteins as it only displays (in default format) sequences with
coiled-coil-forming probabilities above a cutoff value that is set by the
user.

------------------------------------------------------------------------------

7. PERFORMANCE

A. Database statistics

The following is a synopsis of the score distributions for the PDB and
coiled-coil databases. The score distributions are approximated by Gaussians
and the means and standard deviations of the Gaussians are given. PDB is a
database of globular sequences from The Protein Data Bank (32,592 res.)
described in Science 252:1162. The combined coiled-coil database contains
26,965 residues from various coiled-coil proteins (see Section 4: SCORING
OPTIONS) and will be described in detail in print. Obviously, every family of
coiled-coil proteins was scored with a scoring matrix that excluded residue
frequencies from that family.

               28 residue scan    21 residue scan    14 residue scan
               mean   std.dev.    mean   std.dev.    mean   std.dev.
PDB
  MTK          0.77   0.20        0.83   0.24        0.94   0.29
  MTIDK        0.80   0.18        0.86   0.21        0.95   0.26
  MTK_W        0.79   0.23        0.86   0.26        1.00   0.33
  MTIDK_W      0.86   0.18        0.92   0.22        1.04   0.27
Coiled coils
  MTK          1.63   0.22        1.70   0.25        1.79   0.30
  MTIDK        1.69   0.18        1.74   0.23        1.82   0.28
  MTK_W        1.70   0.24        1.76   0.28        1.88   0.34
  MTIDK_W      1.74   0.20        1.79   0.24        1.89   0.30

From these numbers, several conclusions can be drawn:

- The difference between the mean scores in PDB and in coiled coils is
  slightly larger with the MTIDK matrix than with the MTK matrix. More
  importantly, the standard deviation of the score distribution is lower with
  the MTIDK matrix for both databases. This means that the MTIDK matrix
  yields a more consistent evaluation of globular and coiled-coil sequences
  and provides for a better resolution between the two score distributions.
  Not shown here is that the MTIDK matrix also improves the score of
  intermediate filament sequences relative to the scores of other coiled-coil
  sequences, thus providing for a more balanced scoring of the different
  families of coiled-coil proteins than the MTK matrix.

- For both matrices, weighting slightly decreases the resolution between the
  globular and coiled-coil score distributions.

- For all scoring methods, the resolution between the globular and
  coiled-coil score distributions decreases strongly with decreasing size of
  the scanning window.

- The difference in performance between the MTK matrix and the MTIDK matrix
  is small although the MTIDK matrix is derived from over twice the number of
  residues and many more protein families. I conclude that little further
  progress can be expected from even larger coiled-coil databases.

B. Highscoring sequences in globular proteins

I scored release 13.0 (8/93) of the NRL_3D database (containing the sequences
of proteins of known structure from PDB) with all four scoring methods and
counted the number of segments obtaining probabilities >10%. The database
contained 539 nonredundant protein sequences and excluded the coiled-coil
proteins tropomyosin, hemagglutinin, GCN4, Gal4 and apolipoprotein E.
Apolipoprotein E was included with the coiled-coil subset because its helices
are very long compared to those of other helical bundles and because it forms
a partly three-stranded structure. All other helical bundles were included
with the globular proteins because their helices are short and frequently
packed at irregular angles. These features generally prevent their detection
by this algorithm although several helices from four-helix bundles appear as
high-scoring segments in the following table. Results are compared to the
number of segments obtained in a database of sequences generated by means of
a random number generator (see Science 252:1162).

(1 - MTK; 2 - MTIDK; 3 - MTK_W; 4 - MTIDK_W)

                         NRL_3D                       RANDOM SEQUENCES
          28 res.       21 res.       14 res.       28      21      14
           1  2  3  4    1  2  3  4    1  2  3  4    1  2    1  2    1  2
10-19%     8  5 11 13   37 22 24 35   96 85 99 85    1  2   12 10   51 60
20-29%     4  1  5  3   18 14 23 14   47 33 51 45    2  1   10  5   21 26
30-39%     2  0  2  4   14  8  9  9   29 35 42 21    2  0    7  4   14 14
40-49%     4  0  2  5    6  2 15 10   21 14 17 19    1  0    2  1    8  9
50-59%     2  2  1  1    1  4  4  7   11  9 11 14    0  0    1  0   10  9
60-69%     1  0  3  6    3  4  7  5    9 11 12 14    0  0    0  0    5  6
70-79%     3  2  2  1    4  1  6  1   12  7 12 13    0  0    2  1    6  4
80-89%     1  2  3  1    3  4  3  4   10 14  8 18    0  0    1  2    2  5
>= 90%     1  3  1  1    4  9  6  7   11 20  8 15    2  2    2  2    5  7

In this table, the number of segments per 10% increment levels off above 50%
rather than decreasing continuously. This is due to the sigmoid shape of the
curve that relates scores to probabilities which masks a continuing decrease
in number of segments per score interval.

Above 50%, the number of segments per 10% increment doubles from around 2 in
the 28 res. scan to around 4 in the 21 res. scan and then triples to around
12 in the 14 res. scan. A similar progression at a lower level is observed
for the random sequence database. This progression is due to the
significantly poorer resolution of smaller scanning windows.

The difference in numbers between PDB and random sequences is attributable to
amphipathic helices that are frequently present in native proteins but are
not a preferred element of random sequences. Outside the tail end of the
score distribution seen in this table, the score distributions of PDB and
random sequences are superimposable (see Science 252:1162). This means that
the real resolution between the globular and coiled-coil score distributions
is slightly lower than the nominal resolution.

The weighted matrices are less reliable than the unweighted matrices.

The MTK matrix yields fewer highscoring segments at probabilities >90% than
the MTIDK matrix and thus appears more reliable even though its nominal
resolution is poorer. This is probably an incorrect conclusion.
As is detailed in the next paragraph, there are now several examples of
sequences that do not assume a coiled-coil (or even alpha-helical!) structure
under normal circumstances but that have the potential to do so if their
context is changed. It therefore appears likely that the sequences which are
assigned elevated coiled-coil probabilities by the COILS program actually do
have the potential to form coiled coils even though they do not do so in the
protein context or under the conditions in which the structure was
determined. The larger number of high-scoring segments with the MTIDK matrix
would then be the result of an increased sensitivity of this matrix.

Virtually all segments with scores above 50% in 21 and 28 scans are centered
on a surface helix although several contain two discontinuous helices rather
than one continuous helix. Several of the helices are from four-helix bundles
and thus have coiled-coil characteristics. Following recent developments, it
is increasingly likely that most (if not all) of these high-scoring sequences
have an elevated coiled-coil-forming potential and could form coiled coils in
a different context. This follows from three recent results:

(1) A loop segment of influenza hemagglutinin, pH7, which was predicted by
    COILS to have elevated coiled-coil potential, in fact forms a coiled coil
    in the pH4 structure (Bullough et al., Nature 371:37, 1994).

(2) The basic region of bZip transcription factors, which is not even
    alpha-helical in the absence of DNA, can be converted into a coiled coil
    by a designed peptide (Krylov et al., EMBO J. 14:5329, 1995).

(3) A peptide from topoisomerase II, which was identified using COILS, forms
    a coiled coil in solution but not in the structure of the full protein
    (Frere et al., J. Biol. Chem. 270:17502, 1995).
Nevertheless, the decreased coiled-coil-forming potential of these sequences
relative to "constitutive" coiled coils can be seen from the fact that they
score highly in one method but generally much lower in at least one of the
other methods; example:

5LDH - lactate dehydrogenase:

seq      CAISILGKSLTDELALVDVLEDKLKGEMMDLQHGSLFLQTP
MTK      00112444444444444444444444444444411000000
MTK_W    35678999999999999999999999999999911000000
MTIDK    00000000000000000000000000000000000000000
MTIDK_W  00012333333333333333333333333333300000000

and several segments drop considerably in score from a 28 residue scan to a
21 residue scan; example:

2TS1 - tyrosyl-tRNA synthetase:

seq      PEKRAAQKTLAEEVTKLVHGEEALRQAIRIS
14       0001111111111111100000000000000
21       0222222222222222222222222220000
28       0777777777777777777777777777721

The latter effect is observed particularly if a segment contains two
discontinuous helices. These effects can be taken as indicators for a
decreased likelihood of coiled-coil formation since neither effect is
normally observed in coiled coils, as can be seen in part C of this section.

C. Performance on coiled coils

In the following, secondary structure (c = coiled-coil helix) and
coiled-coil-forming probabilities are shown beneath the sequences as scored
by MTK, MTIDK, MTK_W and MTIDK_W in that order. The values were obtained with
a 21 residue scanning window which appears to spot the ends of coiled-coil
segments somewhat more accurately than a 28 residue window. (For spotting the
ends of coiled coil helices, see also the documentation for the auxiliary
program CAPS). The coiled coils in Gal4, GreA and human mannose-binding
protein were analyzed with a 14 residue window because of their short length.
Tropomyosin is not shown; it obtains probabilities >99% over its entire
length except for the C-terminal 20 residues.
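The two indicators from the end of part B (disagreement between scoring
methods, and a drop from the 28 to the 21 residue scan) can be checked
mechanically on first-digit rows such as those shown for 5LDH and 2TS1. The
Python sketch below is only an illustration: the helper names are
hypothetical, and using the peak digit of each row as the summary value is an
assumption of this sketch, not something COILS reports.

```python
def peak(row):
    """Highest first-digit probability in one abbreviated output row."""
    return max(int(ch) for ch in row)


def spread(rows):
    """Difference between the highest and the lowest peak among the given rows."""
    peaks = [peak(row) for row in rows.values()]
    return max(peaks) - min(peaks)


# Rows copied from the 5LDH example (21 residue window, four scoring methods).
LDH = {
    "MTK":     "00112444444444444444444444444444411000000",
    "MTK_W":   "35678999999999999999999999999999911000000",
    "MTIDK":   "00000000000000000000000000000000000000000",
    "MTIDK_W": "00012333333333333333333333333333300000000",
}

# Rows copied from the 2TS1 example (three window sizes, one scoring method).
TS1 = {
    "14": "0001111111111111100000000000000",
    "21": "0222222222222222222222222220000",
    "28": "0777777777777777777777777777721",
}

if __name__ == "__main__":
    # A large spread means the methods disagree about the segment.
    print("5LDH method spread:", spread(LDH))
    # A large positive value means the peak collapses in the shorter window.
    print("2TS1 drop 28 -> 21:", peak(TS1["28"]) - peak(TS1["21"]))
```

Both numbers are in units of the first probability digit, so a value of 5
corresponds to a difference of roughly 50 percentage points. Constitutive
coiled coils such as those shown in the rest of this part would be expected to
give small values for both checks.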
(C1) parallel, two-stranded structures

>GCN4 bZip (Cell 71:1223)
MKDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGER
hhhhhhhhhhhhhhhhhhhhhhhhhhcccccccccccccccccccccccccccccccc
0000000000222779999999999999999999999999999999999999988330
0000011111777999999999999999999999999999999999999999988110
0000000000000224555566699999999999999999999999999999999770
0000000000000889999999999999999999999999999999999999999770

Similar probabilities (>99%) are obtained for the bZip regions of Fos and Jun
(see Meth. Enzymology 266:513). As seen here, the ends of coiled-coil
segments may be overpredicted significantly in the absence of strong flanking
helix-breaking residues. This is a particular problem in bZip proteins, where
the coiled coil follows continuously out of the basic-region helix. Note,
though, that the basic region also has some coiled-coil-forming potential, as
demonstrated by Krylov et al. (EMBO J. 14:5329, 1995).

>Max b-HLH-Zip (Nature 363:38)
ADKRAHHNALERKRRDHIKDSFHSLRDSVPSLQGEKASRAQILDKATEYIQYMRRKNDTH
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhh hhhhhhhhhhhhhhccccccc
000000000000000000000000000000000000000000111112288889999999
000000000000000000000000000000000000000000000001199999999999
000000000000000000000000000000000000000000000011155556888999
000000000000000000000000000000000000000000111113388889999999

QQDIDDLKRQNALLEQQVRALEKARSSAQLQT
ccccccccccccccccccccc
99999999999999999999999999999884
99999999999999999999999999999996
99999999999999999999999999988771
99999999999999999999999999999992

>Gal4 (Nature 356:408)
MKLLSSIEQACDICRLKKLKCSKEKPKCAKCLKNNWECRYSPKTKRSPLTRAHLTEVESRLERLEF
hhhhhhhh hhhhhhhh ccccccccccccccc
000000000000000000000000000000000000000000000000014888888888888882
000000000000000000000000000000000000000000000000017999999999999992
000000000000000000000000000000000000000000000000006888888888888884
000000000000000000000000000000000000000000000000008999999999999995

COILS works well for parallel two-stranded structures (independently of the
scoring method used) if
they are solvent-exposed. The parallel two-stranded coiled coil buried in CAP
is entirely invisible to this program because of the absence of a heptad
repeat.

(C2) antiparallel, two-stranded structures

>Seryl-tRNA synthetase - Escherichia coli (Nature 347:249)
MLDPNLLRNEPDAVAEKLARRGFKLDVDKLGALEERRKVLQVKTENLQAERNSRSKSIGQ
cccccccccccccccccccccccccccccccchh
000000000000000000000000003888888888888888888888888882100000
000000000000000000000000003999999999999999999999999993000000
000000000000000000000000003777777777777777777777773330000000
000000000000000000000000004889999999999999999999998880000000

AKARGEDIEPLRLEVNKLGEELDAAKAELDALQAEIRDIALTIPNLPADEVPVG......
hhhh cccccccccccccccccccccccccccccccccc
000000000099999999999999999999999999999999900000000000
000007788899999999999999999999999999999999988800000000
000000000099999999999999999999999999999999955500000000
000089999999999999999999999999999999999999999933100000

>Seryl-tRNA synthetase - Thermus thermophilus (JMB 234:222)
MVDRKRLRQEPEVFHRAIREKGVALDLEALLALDREVQELKKRLQEVQTERNQVAKRVPK
ccccccccccccccccccccccccccccccc
000000000000000000011124599999999999999999999999999999999910
000000000000000000000013499999999999999999999999999999999986
000000000000000000022236699999999999999999999999999998887700
000000000000000000000014599999999999999999999999999999999954

APPEEKEALIARGKALGEEAKRLEEALREKEARLEALLLQVPLPPWPGAPVG........
ccccccccccccccccccccccccccccccccccccc
0008888888888999999999999999999999999999920000000000
4009999999999999999999999999999999999999997000000000
0002224444444999999999999999999999999999932000000000
1005556677777999999999999999999999999999999000000000

>GreA transcript cleavage factor (Nature 373:636)
MQAIPMTLRGAEKLREELDFLKSVRRPEIIAAIAEAREHGDLKENAEYHAAREQQGFCEGRIKDIEAKLSNAQVID
sscccccccccccccccc-ccccccccccccc cccccccccccccccccccccccccc ss
0000011366666666666666664200000000000000000000000000000002999999999999998730
0000011388888888888888888500000001111111111111100000000004999999999999997710
0000022688888888888888885300000000000000000000000000000000777777777777776630
0000033899999999999999999800000000000000000000000000000000777777777777776620

In its structural organization, GreA resembles seryl-tRNA synthetase. It is
currently the only known coiled-coil structure with a true skip residue
(Val34). The high scores in the two coiled coil helices correspond to the
segment of coiled coil that is located between the skip and the globular part
of the protein.
>Replication terminator protein (Cell 80:651)
MKEEKRSSTGFLVKQRAFLKLYMITMTEQERLYGLKLLEVLRSEFKEIGFKPNHTEVYRSL
hhhhhhhhhhhhhhhh ssss hhhhhhhhhhh hhhhhhhh
0000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000

HELLDDGILKQIKVKKEGAKLQEVVLYQFKDYEAAKLYKKQLKVELDRCKKLIEKALSDNF
hhhhh sssssss sssssss hhhhhhhhhhccccccccccccccccccccc
0000000000000000000000000001111133666666666666666666666655540
0000000000000000000000000000033333444488888888888888888888880
0000000000000000000000000002222233555555555555555555555533320
0000000000000000000000000001133344555588888888888888888888882

COILS is also generally reliable in the analysis of antiparallel two-stranded
coiled coils, but does not detect the DNA-binding coiled coil in serum
response factor (Nature 376:490), which, because of its special function, has
a very distinct residue distribution.
(C3) parallel, three-stranded structures

>hemagglutinin (Nature 333:426 and 371:37)
GLFGAIAGFIENGWEGMIDGWYGFRHQNSEGTGQAADLKSTQAAIDQINGKLNRVIEKTN
hhhhhhhhhhhhhhhhhh pH7
ccccccccccccccccccccc pH4
000000000000000000000000000000001223466666666666666666666658
000000000000000000000000000000000222455555555555667888888889
000000000000000000000000000000000122344444444444444444444402
000000000000000000000000000000000111222222222222222222222211

EKFHQIEKEFSEVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDLTDSEMNKLFE
ccccccccccccccccccccccccccccccccccccccccccccc pH7
ccccccccccccccccccccccccccccccccccccccccccccc hhhhhhhh pH4
999999999999999999999999999988800000000000000000000144444444
999999999999999999999999999766611111110000000000000288888888
333377777788888888888888888888800000000000000000000000000000
333355555555555555555555555555533333331000000000000033333333

KTRRQLRENAEEMGNGCFKIYHKCDNACIESIRNGTYDHDVYRDEALNNRFQIKG
cccccc pH7
hhhhhhhhh pH4
4444444444444220000000000000000000000000000000000000000
8888888888888440000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000
3333333333333110000000000000000000000000000000000000000

Influenza hemagglutinin is a complex structure which undergoes a large
structural transition between pH7 and pH4. Several lines of evidence indicate
that the structure at pH7 is only metastable.
>Mannose-binding protein A, rat (Structure 2:1227)
AIEVKLANMEAEINTLKSKLELTNKLHAFSMGKKSGKKFFVTNHERMPFSKVKALCSELRGTVAIPRNAEENKAI
cccccccccccccccccccccccccccccc sssssssss hhhhhhhhhh ss hhhhhhh
999999999999999999999999997731000000000000000000000000000000000000000000000
999999999999999999999999998830000000000000000000000000000000000000000000000
999999999999999999999999993320000000000000000000000000000000000000000000000
999999999999999999999999995520000000000000000000000000000000000000000000000

QEVAKTSAFLGITDEVTEGQFMYVTGGRLTYSNWKKDEPNDHGSGEDCVTIVDNGLWNDISCQASHTAVCEFPA
hhhh ssssss ss sssss ssss sssssss
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000

>Mannose-binding protein C, human (Nature Struct. Biol. 1:789)
AASERKALQTEMARIKKWLTFSLGKQVGNKFFLTNGEIMTFEKVKALCVKFQASVATPRNAAENGAI
cccccccccccccccccccc sss ssssssssssshhhhhhhhhh ss hhhhhhh
2246666666666666600000000000000000000000000000000000000000000000000
5579999999999999900000000000000000000000000000000000000000000000000
2222222222222222200000000000000000000000000000000000000000000000000
5555555555555555500000000000000000000000000000000000000000000000000

QNLIKEEAFLGITDEKTEGQFVDLTGNRLTYTNWNEGEPNNAGSDEDCVLLLKNGQWNDVPCSTSHLAVCEFPI
hhh ssssss ss ssss ssss sssssssss
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000

(C4) antiparallel, three-stranded structures

>coil-Ser (Science 259:1288)
EWEALEKKLAALESKLQALEKKLEALEHG
ccccccccccccccccccccccccccccc
99999999999999999999999999999
99999999999999999999999999999
99999999999999999999999999999
99999999999999999999999999999

This is an unusual homotrimeric structure that was produced incidentally to
the design of a two-stranded coiled coil.

>spectrin (Science 262:2027)
NLDLQLYMRDCELAESWMSAREAFLNADDDANAGGNVEALIKKHEDFDKAINGHEQKIAA
cccccccccccccccccccccccccccc cccccccccccccccccccccccc
000000000000000000000000000000111114466667777777777777777777
000000000000000000000000000000000003355558888888888888888888
000000000000000000000000000000000000011111111117777777777777
000000000000000000000000000000000000011113333339999999999999

LQTVADQLIAQNHYASNLVDEKRKQVLERWRHLKEGLIEKRSRLGD
cccccccccc ccccccccccccccccccccccccccccccc
7777777777742220000000000000000000000000000000
8888888888863110000000022222222222222222222200
7777777777755552211110000000000000000000000000
9999999999977773322220044444444444444444444400

As an antiparallel three-helix bundle, spectrin is already fairly far removed
from the reference set of parallel two-stranded structures that is used for
scoring. Accordingly, as with four-helix bundles, the program has problems
identifying all the helices in the structure. While this does not make the
prediction of helix B as a coiled coil incorrect, it makes it rather useless
and indeed misleading for model-building. In the long run, scoring matrices
that are specific for helical bundles should be the answer, but my
experiments with a matrix derived from four-helix bundles (Paliakasis &
Kokkinidis, Prot. Eng. 5:739) show that the ones currently available have
only little predictive power. Even in the absence of such matrices, the
prediction can be improved significantly using the auxiliary programs
ALIGNED20/80 if homologous sequences are available for a protein. Their
application to spectrin is shown in the documentation file ALIGNED.DOC.

One specific problem that the program has with helix A of spectrin is the
presence of Trp and Phe residues in position g of the heptad repeat.
These residues are very rare at that position in both two-stranded and
three-stranded coiled coils. Such residues can occur or even be important in
certain structures even though they are disfavored in most others. It is
therefore recommended that a protein with a single peak also be analyzed with
all rare residues (W, C, P) replaced by Ala. Emergence of more peaks indicates
the presence of a helical bundle. Also, if a protein that one suspects may
form a helical bundle has a peak that occurs only in a 14-residue scan, one
should check whether replacing a single unfavorable residue (e.g. D in
position a) by Ala greatly lengthens the predicted helix or significantly
raises its score. Such "wrong" residues may actually help to build a model
since their presence needs to be accounted for and limits the possibilities.

(C5) other antiparallel helical bundles

>ApoE (Science 252:1817)
GQRWELALGRFWDYLRWVQTLSEQVQEELLSSQVTQELRALMDETMKELKAYKSELEEQL
ccccccccccccccccccc hhhhhhhhhhcccccccccccccccccccccccccccc
000000000000000000000000000000013379999999999999999999999999
000000000000000000000000000000026699999999999999999999999999
000000000000000000000000000000001129999999999999999999999999
000000000000000000000000000001689999999999999999999999999999

TPVAEETRARLSKELQAAQARLGADMEDVCGRLVQYRGEVQAMLGQSTEELRVRLASHLR
cccccccccccccccccccccccccccccccccccc ccccccccccccc
818999999999999999999999533331111000000000000111111114478999
889999999999999999999999733330000000000000000444444445589999
959999999999999999999999444441111111111111000333333336689999
999999999999999999999999433331111111111110011888888888899999

KLRKRLLRDADDLQKRLAVYQAGA
cccccccccccccccccccccc
999999999999999999988877
999999999999999999999855
999999999999999999999999
999999999999999999999999

The prediction for ApoE is good for the three-stranded part but much poorer
for the four-stranded part: the short N-terminal helix 1 is not seen by the
program, partly because of its length but mostly because of the three
Trp residues, and the C-terminus of helix 3 and the N-terminus of helix 4,
which interact with helix 1, also obtain low scores. This brings me to:

D. Limits of the method

As can be seen from the examples given, the program works well for parallel
two-stranded structures that are solvent-exposed but runs progressively into
problems with the addition of more helices, their antiparallel orientation and
their decreasing length. The program fails entirely on buried structures.
Limits are also set by the statistical noise, which greatly decreases the
usefulness of small scanning windows. Finally, the possibility that sequences
with good coiled-coil potential do not form a coiled coil because of
constraints from other parts of the sequence may add a further limit to the
accuracy of the program.

Because many reasons can lead the program to miss a helix while the conditions
for detection are quite stringent, the absence of a peak is not nearly as
conclusive as the presence of a peak. The effect of this on interpreting
scores from multiple alignments is discussed in ALIGNED.DOC. What I believe
one can conclude safely from the absence of a peak is that no solvent-exposed
two- or three-stranded coiled-coil of length greater than approximately 20
residues is present in the protein.

------------------------------------------------------------------------------

8. RECOMMENDATIONS FOR USING THE PROGRAM

COILS is specific for solvent-exposed, left-handed coiled coils. Other types
of coiled-coil structure, such as buried coiled coils (e.g. the central coiled
coil in catabolite repressor protein, or some transmembrane domains) and
right-handed coiled coils, are not detected by the program.

COILS does not reach yes-or-no decisions based on a threshold value. Rather,
it yields a set of probabilities that presumably reflect the coiled-coil
forming potential of a sequence. This means that even at high probabilities
(e.g.
>90%), there will be (and should be) sequences that in fact do not form a
coiled coil, though they may have the potential to do so in a different
context.

COILS is biased towards hydrophilic, highly charged sequences. For this
reason, all scans should be performed with a weighted and an unweighted
matrix, and the results compared. Differences of more than 20-30 percentage
points in the probabilities should be taken to indicate that a coiled-coil
structure is unlikely, the elevated scores being mainly due to the high
incidence of charged residues (note, though, that this would have marked human
mannose-binding protein as a false positive).

The MTK and MTIDK matrices both assign high probabilities to known coiled-coil
segments, but identify different helices at high probability in a database of
globular proteins. This is a surprising feature whose reason is as yet
unclear, but which can be exploited for predictive purposes. It is therefore
useful to compare the results of scans made with the two matrices. Again,
differences of more than 20-30 percentage points in the probabilities should
be taken to indicate that a coiled-coil structure is unlikely (note, though,
that this threshold would make the replication terminator protein a borderline
case).

The resolution between globular and coiled-coil score distributions decreases
strongly with decreasing size of the scanning window. The prediction of new
coiled-coil segments should therefore be made using a 28-residue window, or in
special cases a 21-residue window. 14-residue windows should normally be
reserved for the analysis of local parameters (such as the frame) in known or
predicted coiled coils.

The ends of coiled-coil segments appear to be most accurately identified in a
21-residue window. In general, I assume that residues with probabilities >50%
are part of a coiled-coil segment. In addition, a search for the most likely
helix ends using CAPS is generally useful (see also the CAPS documentation).
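
The two rules of thumb above (residues with probabilities >50% belong to a
segment; a gap of more than 20-30 percentage points between two scans makes a
coiled coil unlikely) can be sketched in Python. This is only an illustration
and not part of COILS. It assumes that the per-residue probabilities of two
scans (weighted and unweighted, or MTK and MTIDK) have already been read into
lists of percentages. The function names call_segments and peak_difference
are hypothetical, and comparing the peak values inside a segment, rather than
residue by residue, is an assumption of this sketch.

```python
def call_segments(percent, threshold=50):
    """Return (start, end) pairs, 0-based and end-exclusive, for every run of
    residues whose probability (in percent) is above the threshold."""
    segments = []
    start = None
    for i, p in enumerate(percent):
        if p > threshold:
            if start is None:
                start = i          # a new run begins here
        elif start is not None:
            segments.append((start, i))
            start = None
    if start is not None:          # run extends to the end of the sequence
        segments.append((start, len(percent)))
    return segments


def peak_difference(scan_a, scan_b, segment):
    """Difference, in percentage points, between the peak probabilities of
    two scans inside one segment (assumption: peaks are what is compared)."""
    start, end = segment
    return abs(max(scan_a[start:end]) - max(scan_b[start:end]))


# Invented numbers for illustration only; they are not COILS output.
unweighted = [10, 20, 60, 90, 95, 90, 70, 40, 10]
weighted = [5, 10, 30, 55, 60, 50, 30, 10, 5]

segments = call_segments(unweighted)                           # [(2, 7)]
gap = peak_difference(unweighted, weighted, segments[0])       # 35
```

With these invented values the gap exceeds the 20-30 point range, so the peak
would be treated with suspicion; the exact cutoff within that range is left to
the user.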
Sequences with high coiled-coil probability from globular proteins rarely
exceed a length of 30 residues. None is longer than 35 residues. Sequences
with probabilities >80-90% that extend for more than 35 residues are therefore
more likely to assume a coiled-coil structure than is indicated by the
obtained probabilities.

Where possible, sequences related to the protein of interest should also be
analyzed for predicted coiled-coil segments (see the section on the ALIGNED
programs). It should be kept in mind, though, that the sequences must be
related in the region of high scores in order for the comparison to be
significant.

Comparison of the coiled-coil prediction with predictions of the secondary
structure is generally useful, particularly if multiple related sequences are
available.
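
The length criterion mentioned in this section (high-probability stretches of
more than 35 residues) can be checked with a short Python sketch. This is not
part of COILS; the function name long_high_runs is hypothetical, the input is
assumed to be a list of per-residue probabilities in percent, and the default
of 80 is simply the lower end of the ">80-90%" range given above.

```python
from itertools import groupby


def long_high_runs(percent, min_prob=80, min_length=35):
    """Return (start, end) pairs, 0-based and end-exclusive, for runs of
    residues with probability above min_prob that are longer than
    min_length residues."""
    runs = []
    pos = 0
    # groupby splits the list into consecutive stretches that are either
    # all above or all not above min_prob.
    for is_high, group in groupby(percent, key=lambda p: p > min_prob):
        length = len(list(group))
        if is_high and length > min_length:
            runs.append((pos, pos + length))
        pos += length
    return runs


# Invented profile: a 40-residue stretch and a 30-residue stretch.
profile = [10] * 5 + [95] * 40 + [20] * 5 + [90] * 30
flagged = long_high_runs(profile)   # [(5, 45)]
```

Only the 40-residue stretch is reported; the 30-residue stretch falls within
the length range that also occurs in globular proteins.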