Documentation for SIMWALK version 1.50 dated 1995.09.01

A computer program for haplotype and location score analysis on pedigrees using random walk and simulated annealing algorithms.

Written by Eric Sobel in collaboration with Kenneth Lange, Jeffrey R. O'Connell, and Daniel E. Weeks (c) 1995

The latest version of this Fortran 77 (ANSI standard) program can be obtained via anonymous ftp from watson.hgen.pitt.edu in the directory pub/simwalk (the URL is ftp://watson.hgen.pitt.edu/pub/simwalk). For more detailed distribution information, see the file README.150 . We are maintaining a user e-mail list, so please register by sending e-mail to dweeks@watson.hgen.pitt.edu or daniel.weeks@well.ox.ac.uk.

ABSTRACT:

(HAPLOTYPE ANALYSIS) The program SIMWALK performs a random walk in the space of legal genetic descent states of a pedigree often containing only partial phenotyping of any number of codominant marker loci and, optionally, one trait locus. The program's input is the pedigree and locus data and a marker map. A first legal state is found using an iterative genotype elimination technique. Using simulated annealing during a random walk gives an estimate for the genetic descent state with the largest likelihood, i.e., the best haplotype vector for the pedigree, which is the output.

(LOCATION SCORE ANALYSIS) With access to the general pedigree analysis computer package MENDEL version 3.30 or later, SIMWALK can perform a location score analysis. Location scores indicate the relative likelihood of several positions of the trait locus among the marker loci given the pedigree data and the marker map. In the location score analysis, with the estimate for the most likely genetic descent state as the initial position, a random walk is performed using the Metropolis acceptance criterion. By sampling from this random walk, a number of completely typed representative pedigrees is obtained, proportional to their true likelihood.
These pedigrees are then used to estimate the location score curve for the original pedigree.

COMPILING INSTRUCTIONS:

Executable versions of this program are available for many common platforms at the distribution site. However, since the source code is also available at the distribution site, one can create one's own executable given a Fortran 77 compiler for one's computer system.

If one does NOT have access to the MENDEL package, then to create the SIMWALK executable capable of haplotype analysis simply compile together the two files SIMWALK.F and NOMENDEL.F . (Under Unix, after obtaining the file Makefile from the distribution site, simply type the command 'make' to create a haplotyping SIMWALK.)

If one does have access to the MENDEL package, then to create the SIMWALK executable capable of haplotype and location score analysis simply compile together the two files SIMWALK.F and MENDEL.F .

(When using the Language Systems Fortran compiler to create a Macintosh executable, optionally, for ease of use of the executable, locate the string 'MAC!' in the file SIMWALK.F and uncomment the indicated lines.)

DATA CONSTRAINTS:

Due to the nature of Fortran 77 (e.g., the lack of dynamic memory allocation) constraints on the data must be included in the program. These upper bounds can be increased by altering the code (at the 'PARAMETER' statements) and then recompiling. The program will inform the user if the data exceed any of the upper bounds of the program. There is no limit on the number of pedigrees which can be analyzed.
The default major constraints [and their PARAMETER names] are:

LOCUS DATA
   Maximum number of marker loci (not including trait) =  23  [MXMKLC]
   Maximum number of alleles per locus                 =  24  [MXMKAL]
   Maximum number of phenotypes per locus              =   8  [MXMKPH]
   Maximum number of genotypes per phenotype           =   8  [MXMKGN]

PEDIGREE DATA
   Maximum number of founders per pedigree             =  32  [MXFNDR]
   Maximum number of generations per pedigree          =  16  [MXDPTH]
   Maximum number of people per pedigree               = 128  [MXPEO ]
   Maximum number of children per person               =  16  [MXKID ]
   Maximum number of spouses per person                =   7  [MXSPOU]

INPUT FILES:

There are three input files: the locus file; the pedigree file; and the BATCH.DAT file which contains the user's choices for the program's parameters. Available at the distribution site are example files called respectively: LOCUS.DAT, PEDIGREE.DAT and BATCH.DAT . In general these files follow the same format as required by MENDEL. Please also see the accompanying file FORMATS.TXT which contains a brief description of MENDEL's file format specifications.

The locus file is in the same format required by MENDEL except:
   all loci must be autosomal;
   all marker loci must be codominant;
   the trait locus, if present, must be the initial locus;
   the trait allele names must be at most 3 characters long;
   the trait genotypes must be unordered.

The following two conditions, which are not internally verified, are also required for the TRAIT locus only:
   If some allele appears within one phenotype in combination with more than one allele, then all genotypes containing that allele must be compatible with that phenotype, e.g., 1/1 & 1/2 => 1/*, where * is a wildcard representing any trait allele.
   If two non-overlapping genotypes appear within one phenotype, then all genotypes must be compatible with that phenotype, e.g., 1/1 & 2/3 => */*.

The pedigree file is in the same format required by MENDEL.
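
Because the two TRAIT-locus conditions above are not verified internally, it may be worth checking them before a run. The following Python sketch shows one possible reading of the two rules. It is not part of SIMWALK: the function name check_trait_phenotype and its inputs (the list of trait allele names and the unordered genotype pairs listed under one phenotype) are hypothetical and chosen only for illustration.

```python
from itertools import combinations, combinations_with_replacement


def check_trait_phenotype(alleles, genotypes):
    """Return a list of rule violations for one trait phenotype.

    alleles   -- all allele names at the trait locus, e.g. ["1", "2", "3"]
    genotypes -- unordered allele pairs listed under the phenotype
    (Hypothetical helper; SIMWALK itself does not perform this check.)
    """
    listed = {tuple(sorted(g)) for g in genotypes}
    every = set(combinations_with_replacement(sorted(alleles), 2))
    problems = []

    # Rule 1: an allele paired with more than one allele implies a/*.
    for a in alleles:
        partners = {y if x == a else x for x, y in listed if a in (x, y)}
        if len(partners) > 1:
            needed = {tuple(sorted((a, b))) for b in alleles}
            if not needed <= listed:
                problems.append(f"allele {a}: expected all of {a}/*")

    # Rule 2: two genotypes sharing no allele imply */*.
    for g, h in combinations(sorted(listed), 2):
        if not set(g) & set(h) and not every <= listed:
            problems.append("disjoint genotypes: expected */*")
            break

    return problems


if __name__ == "__main__":
    # 1/1 & 1/2 without 1/3 breaks the first rule at a 3-allele locus.
    print(check_trait_phenotype(["1", "2", "3"], [("1", "1"), ("1", "2")]))
```

An empty list means that, under this reading, the phenotype satisfies both conditions.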

The BATCH.DAT file is similar in format to that required by MENDEL, i.e., the data are contained in a series of menu-driven choices: each instruction is formatted as a BLANK LINE followed by a line containing the MENU ITEM NUMBER, in I6 format, followed by the DATA values. Each data value is on a separate line, except in item #10. Unless otherwise noted each menu item has only one data value. The order in which the menu items appear in the BATCH.DAT file is arbitrary. Menu item #10 is REQUIRED; all others are optional.

If menu item #4 is set to no, i.e., one wishes to find haplotypes as opposed to finding location scores, then the following menu items will have no effect: #1, #11, #12, #14(part 2), #15(part 2), #16(part 2), #17(part 2), and #19.

The menu items are:

#1) Problem title.
    format A40 [Default value: Linkage Analysis By Random Walk]

#2) Locus input file name.
    format A12 [Default value: LOCUS.DAT]

#3) Pedigree input file name.
    format A12 [Default value: PEDIGREE.DAT]

#4) Should a location score analysis be performed?
    format A1 (Y or N) [Default value: N (i.e., only haplotyping)]

#5) An integer label for this run of the program. This label will be appended onto the names of the output files to make them unique. For example, if the label is nn, then for pedigree number mmm the haplotype analysis will be in file HAPLO-nn.mmm .
    format I6 [Default value: 1]

#6) Female symbol and male symbol (NOT case sensitive).
    format A1 [Default values: F and M]
    (Number of lines of data = 2.)

#7) Number of quantitative variables.
    format I6 [Default value: 0]

#8) Is there a trait locus listed in the locus and pedigree files? If so, it must be the initial locus.
    format A1 (Y or N) [Default value: Y]

#9) Reordering of the MARKER loci from the order in the input locus and pedigree files to the genomic order. It is assumed the trait, if present, is already in the initial position. The trait is considered to be in position 0.
    The markers are said to be in positions 1,...,#-of-marker-loci. Labelling the lines of this menu item 0,1,...,#-of-MARKER-loci: line 0 has the number of markers, not including the trait; line j has the GENOMIC marker position for the marker appearing in the input files at position j. All output files and messages use the genomic ordering for the marker loci.
    format I6 (all lines) [Default values: same order as input]
    (Number of lines of data = number-of-MARKER-loci + 1.)

#10) Recombination frequencies between the markers in their GENOMIC order, i.e., after they have been reordered, if necessary (see above). These parameters are REQUIRED. The markers are said to be in positions 1,...,#-of-marker-loci. Labelling the lines of this menu item 0,1,...,(#-of-MARKER-loci)-1: line 0 has the number of markers, not including the trait; line j has the recombination frequency between the GENOMIC MARKERS j and j+1, for females and then males.
    format I6 (line 0) & 2F8.5 (other lines) [No default values]
    (Number of lines of data = number-of-MARKER-loci)

#11) Is this a continuation of a previous analysis whose results are in the partial-results file and into which one wishes to include additional pedigrees? (The pedigree file should now contain only the additional pedigrees and the locus and batch files should be identical to the earlier run except for this menu item and perhaps menu item #5. By changing the run-label in menu item #5 no output files will be overwritten. All references to the number of a pedigree in any output file or error message will reflect all previous pedigrees which were part of this continuation.)
    format A1 (Y or N) [Default value: N]

#12) Number of sampled pedigrees to find for each original pedigree.
    format I6 [Default value: 1000]

#13) Number of parallel runs, i.e., the number of complete runs starting from the initial pedigree. At completion the single best result found over the set of parallel runs is reported.
    format I6 [Default value: 1]

#14) Multiplicative factor for the number of steps: (1) between temperature changes during simulated annealing and (2) between realizations during the location score random walk. (The number of steps = max{1000, MF*TA*P} where MF=this multiplicative factor & TA=total number of alleles over all markers & P=number of people in the current pedigree.)
    format I6 [Default values: 10 and 10]
    (Number of lines of data = 2.)

#15) Mean number of transitions per step: (1) during simulated annealing and (2) during the location score random walk.
    format F6.2 [Default values: 2 and 2]
    (Number of lines of data = 2.)

#16) Fraction of time the next transition within the same step will pivot on a neighboring person and locus of the previous pivot: (1) during simulated annealing and (2) during the location score random walk.
    format F6.2 [Default values: 0.5 and 0.5]
    (Number of lines of data = 2.)

#17) Multiplicative factor of the relative weight given untyped people versus typed people when choosing the pivot person: (1) during simulated annealing and (2) during the location score random walk.
    format I6 [Default values: 10 and 10]
    (Number of lines of data = 2.)

#18) Output the individual pedigrees into the files INPED-nn.mmm ? The pedigrees will reflect any reordering of the loci, any renaming of the alleles and any obligate phenotype additions. Here nn is the integer label for this run of the program and mmm is the number of the pedigree in this run.
    format A1 (Y or N) [Default value: Y]

#19) Output the location scores computed from each original pedigree individually in files SCORE-nn.mmm ? Here nn is the integer label for this run of the program and mmm is the number of the pedigree in this run.
    format A1 (Y or N) [Default value: N]

#20) Output simulated annealing results in files HAPLO-nn.mmm ? Here nn is the integer label for this run of the program and mmm is the number of the pedigree in this run.
    format A1 (Y or N) [Default value: Y]

#21) Create the PDRAW-nn.DAT file containing the estimate of the best haplotype vector for each pedigree in a format compatible with PEDPREP and thus with Ped/Draw?
    format A1 (Y or N) [Default value: N]

#22) Include the trait locus during the simulated annealing, i.e., include the trait in the haplotype analysis? The trait is placed midway in each requested marker interval; see the following menu item to specify the range of intervals.
    format A1 (Y or N) [Default value: N]

#23) First and last marker intervals in which to place the trait during annealing, where 'j'=interval between markers j & j+1. Under the default values, there will be (#-of-marker-loci)+1 runs, each placing the trait locus in a different interval. Upon completion, the haplotyping results with the trait locus in the best supported interval are reported. Clearly this menu item is only relevant if the trait is to be included in the haplotype analysis (see previous menu item).
    format I6 [Default values: 0 and number-of-marker-loci]
    (Number of lines of data = 2.)

#24) Number of haplotypes such that if a pedigree has more than this number of haplotypes with at least two recombinants each, then it will be placed in the RERUN-nn.PED file.
    format I6 [Default value: 2]

#25) Number of temperature changes in simulated annealing.
    format I6 [Default value: 1000]

#26) Factor by which the temperature changes in simulated annealing.
    format F6.2 [Default value: 0.99]

#27) Initial temperature.
    format F6.2 [Default value: 500.0]

#28) Number of pre-simulated annealing steps.
    format I6 [Default value: 0]

#29) Number of random steps between free runs.
    format I6 [Default value: 0 (i.e., no free runs allowed)]

#30) Random seeds: three integers from the interval [1, 30000].
    format I6 [Default values: 27713, 2321 and 18777]
    (Number of lines of data = 3.)

#40) Accept the problem, i.e., end of data file.

OUTPUT FILES:

All output files reflect any requested reordering of the marker loci.
Also for those markers with allele names longer than 3 characters (2 characters if the PDRAW-nn.DAT file is requested to be created), all their alleles are renamed sequentially starting with 1. After the program runs, with the user-specified label nn, several of the files from the following list may be available.

(GENERAL OUTPUT FILES:)

The ERROR.OUT file contains any error messages which were generated. The run completed SUCCESSFULLY only if this file does NOT exist after the run finishes.

The INPED-nn.mmm files contain, in MENDEL pedigree file format, the original pedigrees, one per file. The pedigrees will reflect any reordering of the loci, any renaming of the alleles and any obligate phenotype additions to the pedigree. Creation of these files is controlled through menu item #18. Here nn is the integer label for this run of the program and mmm is the number of the pedigree.

(HAPLOTYPING OUTPUT FILES:)

The HAPLO-nn.mmm files contain the result of the simulated annealing haplotype analysis on each of the original pedigrees.

The PDRAW-nn.DAT file contains, in a form suitable for PEDPREP, all the best-haplotype pedigrees from the simulated annealing runs. Creation of this file can be controlled through menu item #21.

The QUICK-nn.ALL file contains a quick view of the haplotype analysis for each pedigree. The allele source information for each non-founder is given, showing the locations of the recombination events.

The RERUN-nn.BAT file contains, in this program's BATCH.DAT format, the program's menu items necessary to rerun this program using the RERUN-nn.LOC and RERUN-nn.PED files.

The RERUN-nn.LOC file contains, in a form suitable for rerunning this program, the locus file exhibiting the reordered marker loci.

The RERUN-nn.PED file contains, in a form suitable for rerunning this program, the original pedigrees whose best-haplotypes had above the user-specified (in menu item #24) number of recombinants.

The TABLE-nn.OUT file contains a summary table of the results of the haplotype analysis.

(LOCATION SCORE OUTPUT FILES:)

The PARTIALR.OUT file contains, in a form suitable for rerunning this program with additional pedigrees, the total location scores found up to the last completed pedigree, i.e., the partial results.

The SCORE-nn.ALL file contains, in MENDEL output format, the overall location scores and the SCORE-nn.mmm files contain the location scores computed from each original pedigree individually. Creation of the latter files is controlled through menu item #19.

The TRANS-nn.OUT file contains some statistics on the transitions attempted during the location score random walk and their effects.

Several additional files are generated during execution then deleted.

(LEGENDS FOR HAPLOTYPING OUTPUT FILES:)

In the HAPLO-nn.mmm files the best-haplotype pedigree is written in MENDEL format with the following information included in order, for each person at each locus.
   The inferred maternal allele.
   A separator which indicates the recombination events in the SUBSEQUENT interval:
      | = no recombination;
      / = recombination in maternal haplotype;
      \ = recombination in paternal haplotype;
      + = recombination in both haplotypes.
   The inferred paternal allele.
   An asterisk if the phase at this locus is NOT fixed by the parents.
   The source of the maternal allele:
      1 = mother's maternal allele;
      2 = mother's paternal allele.
   The source of the paternal allele:
      1 = father's maternal allele;
      2 = father's paternal allele.
   The phenotype at this locus in the original pedigree file.
Following the pedigree data in the HAPLO-nn.mmm files are some summary statistics on this estimate of the best haplotype vector.

In the QUICK-nn.ALL file each pedigree is included. For each non-founder there are two lines of data after the trait phenotype. The first line is the maternal marker haplotype's source information and the second is the paternal marker haplotype's source information.
A '1' indicates a grand-maternal origin for the allele at this locus; a '2' indicates a grand-paternal origin for the allele at this locus. Thus a change in either haplotype from 1 to 2 or from 2 to 1 indicates a recombination event in that marker interval.

In the PDRAW-nn.DAT file the output is similar to the HAPLO-nn.mmm files except that the asterisk used to designate whether the inferred phase is fixed may take on three values:
   ! = original phenotype was unknown but inferred phase is fixed;
   * = original phenotype was known but inferred phase is NOT fixed;
   & = original phenotype was unknown and inferred phase is NOT fixed.
When the PDRAW-nn.DAT file is processed by PEDPREP and then the pedigree displayed by the Macintosh program Ped/Draw, these symbols are visible while the original phenotype is not visible.

USAGE NOTES:

Please see the file EXAMPLE.TXT for an annotated example haplotyping session using SIMWALK.

Since SIMWALK uses simulated annealing to search a space of often immense size, it may not converge to the best answer on the first run. It may be necessary to run SIMWALK several times on your data in order to be assured of finding the optimal haplotype configuration. If you do rerun the program with the same data and parameters, then remember to alter the seeds to the random number generator (see menu item #30); otherwise the results will be identical. Also you may wish to change the run label (see menu item #5) so that the new results do not overwrite the old output files.

To use SIMWALK on data in LINKAGE-format, first extract the disease locus and the codominant marker loci using the program lsp from the LINKAGE package; this creates the files datafile.dat and pedfile.dat. Next run LINKMEND to convert from LINKAGE-format to MENDEL-format. This will create the files locus.dat and pedm.dat. (Remember that Unix systems have case sensitive file names.) Now create a BATCH.DAT file following the instructions above. Finally, run SIMWALK.
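
As an illustration of the BATCH.DAT layout described under INPUT FILES (a blank line, the menu item number in I6, then the data lines), the following Python sketch builds a minimal haplotyping batch file for the converted locus.dat and pedm.dat files. It is only a sketch under stated assumptions: the helper names menu_item and build_batch are hypothetical, character data are assumed to start in column 1, and the recombination frequencies shown are made-up values. The result should be compared with the example BATCH.DAT and with FORMATS.TXT from the distribution site before use.

```python
def menu_item(number, data_lines=()):
    """One instruction: a blank line, the item number in I6, then data."""
    return ["", f"{number:6d}", *data_lines]


def build_batch(recomb, locus_file="locus.dat", pedigree_file="pedm.dat",
                run_label=1, seeds=(27713, 2321, 18777)):
    """Return the text of a minimal haplotyping BATCH.DAT (hypothetical helper).

    recomb -- (female, male) recombination frequencies for each pair of
              adjacent markers, in genomic order.
    """
    for name in (locus_file, pedigree_file):
        if len(name) > 12:                      # file names are read as A12
            raise ValueError(f"file name too long for A12: {name}")
    n_markers = len(recomb) + 1                 # n markers give n-1 intervals
    lines = []
    lines += menu_item(2, [locus_file])
    lines += menu_item(3, [pedigree_file])
    lines += menu_item(5, [f"{run_label:6d}"])
    # Item #10: marker count in I6, then one 2F8.5 line per interval.
    lines += menu_item(10, [f"{n_markers:6d}"]
                       + [f"{f:8.5f}{m:8.5f}" for f, m in recomb])
    lines += menu_item(30, [f"{s:6d}" for s in seeds])
    lines += menu_item(40)                      # accept the problem
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Three markers, hence two intervals; the frequencies are illustrative.
    print(build_batch([(0.10, 0.10), (0.05, 0.20)]), end="")
```

When rerunning on the same data, passing different values for seeds and run_label follows the advice above about menu items #30 and #5.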

To draw the pedigree data using the Macintosh program Ped/Draw, have SIMWALK produce a file called PDRAW-nn.DAT (see menu item #21). Run this file through PEDPREP to generate a Ped/Draw-format data file.

(The programs LINKMEND and PEDPREP may be obtained via anonymous ftp from watson.hgen.pitt.edu .)

REFERENCES:

If you publish results generated by SIMWALK, then please cite the first two articles from the following reference list.

Sobel E, Lange K, O'Connell JR and Weeks DE (1995) Haplotyping algorithms; in "Genetic Mapping and DNA Sequencing" (IMA Volumes in Mathematics and its Applications, Speed TP and Waterman MS, editors) Springer-Verlag, New York (in press).

Weeks DE, Sobel E, O'Connell JR and Lange K (1995) Computer programs for multilocus haplotyping of general pedigrees. Am J Hum Genet 56:1506-1507.

Sobel E and Lange K (1993) Metropolis sampling in pedigree analysis. Stat Meth in Med Res 2:263-282.

Lange K and Sobel E (1991) A random walk method for computing genetic location scores. Am J Hum Genet 49:1320-1334.

Lange K and Matthysse S (1989) Simulation of pedigree genotypes by random walks. Am J Hum Genet 45:959-970.

Lange K, Weeks DE and Boehnke M (1988) Programs for pedigree analysis: MENDEL, FISHER and dGENE. Genet Epidemiol 5:471-472.

Lange K and Goradia T (1987) An algorithm for automatic genotype elimination. Am J Hum Genet 40:250-256.

Finally, please send any bug reports, queries, suggestions or comments to one of the addresses below.

Thank you,
-- Eric Sobel and Dan Weeks --
_________________________________________________________________

Daniel E. Weeks                        The Wellcome Trust Centre
Department of Human Genetics             for Human Genetics
University of Pittsburgh               University of Oxford
Crabtree Hall, Room A310               Windmill Road
130 DeSoto Street                      Oxford OX3 7BN
Pittsburgh, PA 15261                   U.K.
U.S.A.                                 Tel: (+44) 865 740 043 (desk)
                                       Tel: (+44) 865 742 441 (main)
Tel: 1 412 624-3066                    Fax: (+44) 865 742 196
Fax: 1 412 624-3020                    e-mail: daniel.weeks@well.ox.ac.uk
e-mail: dweeks@watson.hgen.pitt.edu

Eric Sobel
Department of Biomathematics
School of Medicine
University of California, LA
10833 LeConte Avenue
Los Angeles, CA 90095-1766
U.S.A.
Tel: 1 310 825-9623
Fax: 1 310 825-8685
e-mail: esobel@ucla.edu