Readme is this file. It contains description of all files on the floppy and instructions about how to run the program. The programs perform phylogenetic alignment of homologous sequences (DNA/Protein) as described in the articles "A Unified Approach to Alignments and Phylogenies" in Methods of Enzymology, Spring 90 (VOL 183) and two articles in Mol.Biol.Evol(Nov.89). The first article is the better starting point. If you have received this floppy after a reprint request, but not the reprint it is because I don't have any more reprints. I don't have any reprints for the Meth.Enz. article. This program was written, when I worked in different groups. I am grateful for support thanks to Norman Kaplan and Charles Langley for 18 months in North Carolina, a similar period at UCSD thanks to Russell Doolittle, and two shorter periods with Michael Waterman and Temple Smith. Besides these people my understanding of these problems have gained from discussions with Joseph Felsenstein and David Sankoff. Tom Darden has been helpfull in answering my questions about C. Please refer to the date of the version, that you are using. If the program crashes, which it very rarely does and only with very complex data, I will be interested in receiving following: the data set, the parameters used, the value of MEMORY in typedef.h and name of computer that you use. Then I will try to correct it in a future version. Please communicate errors to me, this is a major help in improvements if you do. I presently use an IRIS 4D/220S. There might be problems on other machines as I have not tested the programs on other computers recently. If you use the the result in some publication, I would be interested in receiving a reprint. This is not to check that I have been quoted, but I should like to know about the results obtained by the program (besides generally being interested in evolutionary biology). My Workaddress is Jotun Hein,National Institue of Genetics, Mishima, Shizuoka- Ken, 411 Japan My E-mail address is JHEIN@NIGUTS.NIG.AC.JP. (c) Copyright Nov. 1990 by Jotun Hein. These programs can be copied as long as this message is not deleted and the programs are distributed free of charge. This file has following content: WHAT THE PRODRAM DOES AND DOESN'T FILES THE FUNCTION OF DIFFERENT MODULES HOW TO RUN THE PROGRAM :BEFORE COMPILATION :COMPILATION :AN EXAMPLE :OUTPUT TO SCREEN LIMITS IN PROGRAM NEW IN THIS VERSION FUTURE DEVELOPMENTS :THIS PROGRAM :OTHER PROGRAMS ------- WHAT THE PROGRAM DOES AND DOESN'T------------------------------- WHAT IS DOES: It solves following two problems simultaneously: Parsimony Tree Reconstruction: A very predominant method of choosing a phylogeny for a set of sequences is to find the phylogeny that postulates the least change. Such programs have been made since the late 60s/early 70s, but they could only be applied to prealigned sequences or sequences, that for some reason had not been subject to insertion-deletions. Pairwise alignment: Since late 60s/early 70s (again) methods have existed that could find the set of substitutions and insertion-deletions that represented the least evolution or the two sequences. This is equivalent to align two sequences. Unfortunately this was restricted to two sequences. It can analyze both DNA and protein (either or, though). I have run all existing globins (>300) and up to 20 complete retroviuses (8-10kb (DNA). The last analysis took about 2hrs CPU time, and in this case it is regrettable, that it cannot analyze sequences that is a mix of DNA and protein. But it should be able to handle most reasonable alignment problems. WHAT IT DOESN'T: There are things the program cannot do or should not be used for: i: It cannot test whether homologies are statistically significant, it can only reconstruct phylogenies and make alignments. The sequences should be homologous and there should be a history to be found. If you give it a set of COMPLETELY unrelated sequences, it is possible that it will not be able to align them, since it cannot allocate enough memory. ii: The sequences should not vary in length because they have been sequenced unequally much. Length differences should be due to evolution. Thus it should not be used to look for local homologies. iii: Sequences that has been subject to recombinations or gene conversions should be treated with caution. iv: It is part of the the principle of reconstructing the history of a set of sequences that nucleotides are only put in the same column if the are same column are connected by a path in the tree where the nucleotide has neither been inserted or deleted. This has consequences, that occasionnally gives an alignment, that people feel could be improved, i.e. could be changed a little to look 'nicer'. To illustrate the phenomena, assume 100 sequences is to be aligned. 98 have length 0 and 2 have length 1 and consists of a 'T'. How should these 100 sequences be aligned. If the 2 'T's are brothers in the tree, they will be in the same column, their immediate ancestor is a 'T' and all other positions in the column are occupied by gap signs. The weight of the alignment corresponds to one insertion-deletion. On the other hand if the 2 'T's are at tips very distant in the tree, they should be in two different columns, and the weight of the alignment will correspond to 2 insertion- deletions. Had they been put in the same column all internal nodes in the path connecting them will be assigned 'T's and a large set of insertion- deletions would be postulated, since the 'T's only survived in two tips. -------------------------FILES------------------------------------------ Program. The program is written in the programming language C consists of following modules: au.c dis.c draw.c graph.c hash.c inout.c main.c mat.c metric.c pars.c pre.c story.c stati.c tree.c typedef.h The floppy contains following additional files: Makefile that will automatically compile the modules under unix. mut.dat par.dat test1.pro test2.dna user.tree ---------------------- FUNCTION OF DIFFERENT MODULES ------------------- au.c It contains a series of standard functions like finding the minimum of 2 or 3 variables, checks if memory allocation works, freeing of dynamically allocated memory etc. dis.c This performs nearest neighbour interchanges on the distance tree to improve its fit do the distance data. draw.c Makes drawing of phylogeny. graph.c This performs certain standard operations on graphs. It adds edges to the graph that represents deletions (dummy-edges). It prunes the graph, i.e tries to find a smaller subgraph that generates the same sequences. This algorithm is not perfect at present, it can cut down the graph by too much, but almost never occurs. The pruning is necessary only for DNA because runs of identical nucleotides can be aligned in many ways all equally good, creating a large redundancy in the graph. hash.c Hashes sequences to find a fast prelimenary alignment. inout.c Reads sequences from file and writes out the final results. main.c This contains main, reads in the parameters to be used and declares global variables to be used in many modules. mat.c This performs the dynamic programming algorithm comparing sequences and the algorithm comparing graphs. These algorithms are kept separate, as sequence comparison is much faster than graph comparison. This also performs the backtracking, reconstructing sequence graphs. It is done by a breath first search in the matrix created by mat.c, never visiting the same node twice. The graph is created as minimal nodes and edges are encountered. metric.c This uses the metric space properties of sequences to determine, which distances should be calculated and which not. Upper and lower bounds can be determined on a distance between two sequences, using other known distances. Such a distance to be calculated if it is small (upper bound low), or if it is very uncertain (difference between upper and lower bound is big) and not too big (lower bound high). This method is only used for more than 30 sequences. pars.c Nearest neighbour interchanges on the parsimony tree to make it more parsimonious. Internal edges that are long or next to long edges are not investigated, since parsimony is bad guidance for these. pre.c Finds statistically significant common segments to the two sequences, that is used to speed up the computations. story.c Chooses between the set of equally parsimonious history only one. stati.c Calculates different statistics like mutations and insertion-deletetions and branch lengths in proposed tree. tree.c This constructs a distance tree from the sparse distance matrix created by metric.c or it reads user specified tree from user.tree if this option is used. typedef.h Have definitions of records and constants declarations. --------------- HOW TO RUN THE PROGRAM ------------------------- --------------------------- :BEFORE COMPILATION ------------------------- You must have access to a computer with a C compiler. I am not sure that the program will run on PC's. I suspect it is difficult to run it on computers with less computing power than a VAX 11/730. It is possible that you should change the parameter ,MEMORY, in typedef.h. The more powerfull computer, the higher can this number be. On a VAX 11/730 I would guess that 25 000 is appropriate, on VAX 11/780 100 000. This number is the area of the array used for the dynamical programming algorithm. Each entry in the area corresponds to 3 short int, i.e 6 bytes on most machines. So 25 000 implies that the comparisons will try to done by allocating 150 000 bytes. Another major consumer of memory is the sequences and sequence graphs. If you have set MEMORY too low, the program will come to a halt and will advice you to increase MEMORY. Try with 50 000 first. Three files must be present to run the program: i) par.dat. ii) file with sequences. iii) mut.dat file with mutation distance matrix. Mut.dat is not shown as there is no choice presently for the user to be made concerning this file. a fourth file must be present if the option usertree is to be used: iv) file with user specified tree. Decide with which parameters the program should be run and write these parameters in the file par.dat. The program is completely non-interactive. Make the file with the input sequences and be careful that they have the right format. Next line is the first line in an example of par.dat, the first input file: 1 4 8 3 0 0 test1.pro test.tree test.ali user.tree PRO(1)/DNA(0) nuseq a b ancestorout usertree fileseq filetree fileali usertreefile COMMENTS ON PARAMETER VALUES IN THE par.dat EXAMPLE protein or DNA corresponds to 1 and 0, respectively. Thus protein in this case nuseq is number of sequences to be found on fileseq. 4 in this case. a,b: gap penalty function is gk=a+k*b. 8 + 3*k in this case. b should not be put to less than 3. a should just be non-negative. ancesterout: only present sequences in alignment (0) or also ancestral sequences (1). usertree: should the program reconstruct the phylogeny (0) or will it be supplied (1). No in this case, phylogeny must be reconstructed. It is possible to use this option and then try to find a better tree by setting parcyc larger than 0. fileseq: file with sequences on it. test1.pro in this case. filetree: file where phylogeny will be written. test.tree in this case. fileali: file where main output will be written. test.ali in this case. user.tree: name of file with usertree. user.tree in this case, but not needed. -------------------- :COMPILATION ----------------------------------------- If you run under UNIX: 1: write make. This will create align. To run then 2: write align. The program should then be running. If you run under VMS: 1: write define lnk$library sys$library:vaxcrtl.olb 2: write cc main write cc mat etc. for all modules. 3 write link main,mat, ..., (all modules) you should then have create main.exe, that you can run. 4 write run main. The program should then be running. ---------------------AN EXAMPLE---------------------------------------- Next line is first line on test1.pro, the second inputfile > HEMOGLOBIN ALPHA CHAIN - HUMAN AND CHIMPANZEES V L S P A D K T N V K A A W G K V G A H A G E Y G A E A L E R M F L S F P T T K T Y F P H F D L S H G S A Q V K G H G K K V A D A L T N A V A H V D D M P N A L S A L S D L H A H K L R V D P V N F K L L S H C L L V T L A A H L P A E F T P A V H A S L D K F L A S V S T V L T S K Y R * > HEMOGLOBIN BETA CHAIN - HUMAN, CHIMPANZEES, AND GORILLA V H L T P E E K S A V T A L W G K V N V D E V G G E A L G R L L V V Y P W T Q R F F E S F G D L S T P D A V M G N P K V K A H G K K V L G A F S D G L A H L D N L K G T F A T L S E L H C D K L H V D P E N F R L L G N V L V C V L A H H F G K E F T P P V Q A A Y Q K V V A G V A N A L A H K Y H * > MYOGLOBIN - HUMAN G L S D G E W Q L V L N V W G K V E A D I P G H G Q E V L I R L F K G H P E T L E K F D K F K H L K S E D E M K A S E D L K K H G A T V L T A L G G I L K K K G H H E A E I K P L A Q S H A T K H K I P V K Y L E F I S E C I I Q V L Q S K H P G D F G A D A Q G A M N K A L E L F R K D M A S N Y K E L G F Q G * > LEGHEMOGLOBIN I - YELLOW LUPIN G V L T D V Q V A L V K S S F E E F N A N I P K N T H R F F T L V L E I A P G A K D L F S F L K G S S E V P Q N N P D L Q A H A G K V F K L T Y E A A I Q L E V N G A V A S D A T L K S L G S V H V S K G V V D A H F P V V K E A I L K T I K E V V G D K W S E E L N T A W T I A Y D E L A I I I K K E M K D A A * The line above is the last line read on test1.pro, as it contains end of fourth sequence. A sequence must obey following format.: A ">" on FIRST position on a line, the sequence starts on second line after that and continues until a "*" is encountered. Always check if the sequences obey the format and there are enough sequences on the specified file. Next line is the only line in user.tree, the fourth and optional inputfile. ((1,2)6,(3,4)5)7; This describes the topology of the phylogeny in test.tree. Next line is first line in test.ali, the first output file: 1 EMOGLOBIN ALPHA CHAIN - HUMAN AND CHIMPANZEES 2 EMOGLOBIN BETA CHAIN - HUMAN, CHIMPANZEES, AND GORILLA 3 YOGLOBIN - HUMAN 4 EGHEMOGLOBIN I - YELLOW LUPIN RELATIONSHIP BETWEEN PAIRS OF SEQUENCES: lower triangle distance between pairs upper triangle percent homology in alignment 1 2 3 4 1 45 27 16 2 218 25 17 3 313 339 20 4 381 394 379 weight of insertion-deletion of length k: 8 + k*3 number of sequences: 4 total number of pairwise comparisons used: 9 163 is the total length of alignment TABLES OF GENETIC EVENTS: lower triangle: symmetrisized substitutions upper triangle: mutation distance matrix C S T P A G N D E Q H R K M I L V F Y W C 2 4 4 4 3 4 5 6 5 4 4 6 4 4 4 4 3 3 3 S 0 1 2 1 1 1 2 3 3 3 3 3 4 4 4 2 3 3 4 T 0 7 2 1 2 2 3 3 3 4 3 2 3 3 4 2 4 4 5 P 0 3 0 1 2 4 3 3 3 3 3 4 4 4 3 2 3 4 4 A 2 8 6 4 1 3 2 2 3 4 4 3 3 4 4 1 3 4 4 G 0 5 5 2 13 3 2 2 4 5 3 4 5 4 4 2 4 4 3 N 0 3 1 0 2 1 1 2 3 2 3 2 5 4 5 4 4 3 6 D 0 3 1 2 4 3 5 1 2 3 4 3 4 5 5 3 5 4 6 E 0 1 2 2 2 8 2 8 2 4 3 2 4 5 5 2 4 5 5 Q 0 2 0 0 5 0 0 2 1 2 3 2 4 5 4 4 5 4 5 H 0 3 0 0 2 1 4 2 0 3 2 3 4 4 3 5 4 3 5 R 0 1 1 0 0 0 0 0 1 0 2 1 4 4 4 4 5 5 4 K 0 3 4 1 3 4 1 2 6 7 3 3 4 4 4 3 5 5 5 M 0 0 0 1 0 0 0 0 0 0 1 0 0 2 1 2 3 4 3 I 0 2 1 0 2 1 1 0 0 0 0 0 0 0 1 4 2 3 3 L 0 1 0 1 4 0 0 0 0 0 3 1 4 5 7 4 2 3 2 V 2 4 2 0 9 3 0 1 5 0 0 0 3 0 1 7 2 3 3 F 0 0 0 0 2 0 0 1 0 0 0 0 1 0 0 11 5 1 3 Y 0 0 0 0 0 0 2 0 0 0 0 1 0 0 1 2 1 3 3 W 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 INSERTION-DELETIONS : lengths: 1 2 4 6 numbers: 5 2 2 2 total weight of history: 794 ALIGNMENT OF SEQUENCES: V--LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF------DLSHGSAQVKGHGKKV 62 1 VH-LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKV 67 2 G-VLTDVQVALVKSSFEEFNANIPKNTHRFFTLVLEIAPGAKDLFSFLK--GSSEVPQNNPDLQAHAGKV 67 4 G--LSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDLKKHGATV 68 3 * * * * * * AD-ALTNAVAHVDD----MPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDK 127 1 LG-AFSDGLAHLDN----LKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQK 132 2 FKLTYEAAIQLEVNGAVASDATLKSLGSVHVSKGVVDAHFPVV-KEAILKTIK----EVVGDKWSEELNT 132 4 LT-ALGGILKKKGH----HEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPGDFGADAQGAMNK 133 3 * * * FLASVSTVLTSKYR------- 141 1 VVAGVANALAHKYH------- 146 2 AWTIAYDELAIIIKKEMKDAA 153 4 ALELFRKDMASNYK-ELGFQG 153 3 number of completely conserved sites: 9 Above line was the last line of test.ali Comments: Not all paiwise distance are calculated. The ones not calculated appear as -1 in the distance matrix. The numbers after each line in the alignment is the number of the last position in the sequence and the sequence number. When more than 10 sequences are analyzed a histogram of mutability will appear over the sequences. Next line is the first line of test.tree, the second output file: 1 EMOGLOBIN ALPHA CHAIN - HUMAN AND CHIMPANZEES 2 EMOGLOBIN BETA CHAIN - HUMAN, CHIMPANZEES, AND GORILLA 3 YOGLOBIN - HUMAN 4 EGHEMOGLOBIN I - YELLOW LUPIN ((1:94,2:124)6:167,(4:262,3:117)5:29)7:0; CAUTION: This tree is rooted assuming a clock, which might not be justified. The exact branch lengths can be found in the nested parenthesis representation just above. |--------------------------------------| | |---------5---------| | | | | | | | | | |---------6---------| | | | | | 3 | | | | | | | | | 1 | | 2 4 Above line was the last line of test.tree test.tree contains following information: four globins and the numbers that represent them in the program and the resulting phylogeny. The phylogeny says that 1 and 2 are sisters, the edge leading to 1 is 94 long, the edge leading to 2 is 124 long, the name of the common ancestor is called 6 and has an edge of length 167 leading to it. Analogous for 3 and 4. 7 is an arbitrary root for all of them and naturally has a 0-length edge above it. Comment: The unrooted tree resulting from this method has been rooted by assigning a root to the midpoint of the longest path found in the tree. Several factors can cause this rooting to be wrong: Unequal rates of evolution. Parsimony will underestimate the length of long branches more severly than the length of shorter branches, thus scewing the tree. In the above example, the true root should probably have been on the branch leading to leghemoglobin. Groups of sequences with many members will seem to have experienced more evolution than groups with fewer members. This factor is probably be of less importance than the first two mentioned. You can probably make a more beatiful drawing yourself. --------------------------- OUTPUT TO SCREEN. ------------------------------- analysis of 4 protein sequences gap penalty: 8 + k* 3 The program will reconstruct the phylogeny The alignment will be written on t.a The phylogeny will be written on t.t finished reading parameters finished reading sequences first time finished reading sequences second time calculated 4 pairwise distances of 6 possible finished constructing distance tree using 2 comparisons finished improving distance tree using 0 comparisons finished making and improving parsimony tree using 3 comparisons finished analysis If the first two lines don't appear immediately on the screen something is wrong, either the files were misnamed or the sequences didn't have the right format. The numbers appearing in the above example are naturally specific for that example. If the program cannot analyze the data, it should give a message about why not: "serious alloproblems" means that the computer was not able to allocate sufficient dynamic memory. is making bigger array NMAXL 21000 This means that the dynamical programming array has been increased to 21000 is expanding sequence graphs totall 40000 This means that the memory for sequences (and graphs) can now hold 40 000 nucleotides or amino acids. This increases as the computations progresses and ancestral sequences are determined, and must be stored. -------LIMITS IN THE PROGRAM------------------------------------------- There are a few internal limits in the program: If more then 40 sequences are analyzed, the program will not draw the tree. This is determined by the constant PHYSIZE in typedef.h If more than 22 sequences are analyzed, the program will not write out a distance matrix. This is determined by the constant MATSIZE in typedef.h -------NEW FEATURES RELATIVE TO 90,MAY version------------------------- It now reads signs that indicates uncertainty about which nucleotide or amino acid is actually at a given position. This seems to be especially relevant for sequences obtained by a consensus of three sequences determined by PCR. It used less memory and some bugs have been tracked down and corrected. --------------------FUTURE DEVELOPMENTS--------------------------------- :THIS PROGRAM I am in the middle of adding following two features: The requirement that sequences must be complete sequences is too restrictive, especially as laboratories increasingly sequences large pieces of genomic DNA. This restriction will thus be removed. An option will be included that allows genomic sequences, DNA with coding regions embedded, to be compared. I am convinced that this will be of increasing importance. This has been completed. It is just a question of getting the program sufficiently userfriendly and debugged. One interesting complication arising when allowing for DNA with coding readings is the occurance of overlapping reading frames. How do you for instance compare two pairs of proteins, that is coded for by the same DNA, or one protein with a pair of proteins coded by the same DNA. This can be solved by a generalization of the traditional dynamical programming algorithm. The problem arising is that when two proteins are coded for by the same DNA it is not possible to make an insertion (or deletion) without also making it in the other. Besides this I will later do: More information about the tree will be written out: For every edge following is written: DIST, means that this edge in the tree was determined solely by distance considerations. If the edge was not determined by distance considerations, the weights of the three posssible configurations will be written out, like (768,781,784). This means that the configuration in the tree chosen was 768, while the weights of the two trees that could obtained by rearranging the four subtrees radiating from edge concerned, were less parsimonious (781 and 784). The larger this difference is the more sure the parsimony criteria chooses the present configuration around this edge. It writes a tree with multiple events correction on branchlengths, which should lead to a better rooting of the tree. It draws a tree, that is the best fit of a molecular clock tree to the above tree, which potentially could be usefull in determining, which brances have experienced an acceleration of molecular evolution. I intend to continue to improve this program as I become aware of better algorithms or come up with ideas or options that should be incorporated in the program. Following is planned: Incorporate an option for using weighting in the alignment and phylogeny reconstruction. The central idea is to use the observed mutability of a position to assign a weight to it. The effect should be that highly conser- ved position will be aligned with similar residues, which is frequently not the case with present algorithms, that thus have serious difficulties in aligning very distant molecules correctly. Memory requirements of the program can be reduced and will be. There is no why this program should not be able to run on PCs. I will soon test it on PCs. The treedrawing program can be improved. An additional tree will be drawn, that has multiple events corrected branch- lengths which should give a better rooting in certain cases. Then you can choose the tree you prefer. The program condense.c, that could simplify large phylogenies and was mentioned in the articles on this program is not functional at present. It will soon be. :OTHER PROGRAMS Write a program that finds recombination events. The central idea is described in a forthcoming article in Mathematical Biosciences. I hope that this will be very useful as present methods, that only finds one phylogeny for a set of mole- cules, can fail seriously when used to analyze sequences that has experienced recombinations or gene conversions in their evolutionary history. This has almost been completed. I sincerely hope to be able to distribute early in 1991. If you want it you should contact me then. Write another program that analyzes a single string (DNA/Protein) for a dupli- cations and sorts out in what order they have occurred.