FASTMAP DOCUMENTATION

by David Curtis

FASTMAP implements an algorithm to provide an approximate multipoint
lod score for a disease against a number of markers from supplied
two-point lod scores. At time of writing this algorithm has been
accepted for publication in Human Heredity (Curtis D & Gurling HMD. A
procedure for combining two-point lod scores into a summary multipoint
map. Human Heredity 1993; 43; 173-185). You should refer to this for
an account of how FASTMAP works and an evaluation of its performance
with real and simulated linkage data. The algorithm, program and
source code are made freely available, though the source code may not
be commercially exploited. However please cite this publication when
writing up any work for which you have found FASTMAP useful.

FASTMAP takes as input two-point lod scores from a number of markers
and as output produces a table of estimated multipoint lod scores, a
graph file suitable for graphing these with the Shareware program
EASIGRAF (supplied with EASISTAT) and a debugging file which contains
additional information about the approximations made. The
approximation is produced very quickly, at least in relation to the
time taken to produce a full multipoint. Overall the approximation is
unbiased and is usually quite accurate, although occasionally there
can be a fairly large difference from the true multipoint lod scores
as produced for example by LINKMAP.

The version currently distributed may be regarded to some extent as a
prototype, although I think I have now got it working about as well as
I am going to. I would be extremely interested to hear any comments
concerning its performance. I also hope that others better qualified
than myself may be able to develop the basic algorithm further and I
would be glad to assist anyone in explaining how the program is
supposed to work.

This documentation contains some additional notes about the
implementation which were not included in the article submitted for
publication.
Also included with these files are some more detailed breakdowns of
FASTMAP's performance in different simulations, contained in the file
FMAPEVAL.DOC.


COPYRIGHT

I hold the copyright to the source code. I hereby authorise anyone to
use, make adjustments to and redistribute this source code provided
only that they do not do so for profit, that my original contribution
is acknowledged, and that any alterations from the original are
clearly marked. Anyone who wishes to distribute the code or programs
compiled from it for profit may only do so with prior agreement from
me.

However the algorithm and ideas embodied in the source code may be
freely used by anybody for any purpose. Naturally I would hope that
such a person would acknowledge my contribution, and in particular I
would urge anyone who finds the procedure helpful to cite the relevant
reference. I would also be grateful if anyone who did come up with any
useful improvements might keep me informed of them, although I would
be very happy to see others take over development of this idea.


PROGRAM INPUT

Input is either from the keyboard (standard input which may be
redirected) or from an input file specified on the command line, e.g.:

    fastmap
        (then input is typed in interactively)
or:
    fastmap < input.dat
or:
    fastmap input.dat

When input is from standard input the program prompts the user for the
required values, but the format of the input is identical regardless
of whether it is from the keyboard or a file.

Line 1:
One to three filenames. The first is for the tabulated output file of
lod scores at each map position. The second filename if specified is a
graph file for input into the EASIGRAF or ACE/gr program. The third
filename if specified contains debugging information which reports
various aspects of the estimates obtained by the program.

Line 2:
Values for the minimum and maximum distances (in centimorgans) of the
map over which lod scores are to be calculated.
If in the next line a number of fixed distances are given, then the
only effect of these two values is to define the horizontal scaling of
the graph.

Line 3:
Either: one number, which consists of the number of equidistant points
at which the lod score is to be evaluated between the minimum and
maximum distance given above.
Or: several values giving specific distances (in centimorgans) at
which the lod score is to be evaluated.

Line 4:
The number of pedigrees for which data will be input. If only total
lod scores are available then enter 1 here. However FASTMAP should
perform better if the individual lod scores are available for each
pedigree. You can get these from MLINK by setting byfamily to true and
then recompiling.

Line 5:
The name of the disease locus (up to 20 characters), followed
optionally by values for the "reliability" with which genotype
predicts phenotype. If no value for reliability is input then the
program will choose best-fitting values for each pedigree. If one
value is input then this value will be used for every pedigree.
Alternatively, a number of values equal to the number of pedigrees may
be input, in which case each pedigree can be assigned a different
value.

There then follows for each marker (which should be entered in the
order they appear on the map):

One line:
The name of the marker (up to 20 characters), followed by one value
giving the position of the marker on the map (in centimorgans)
followed by either one value giving the probability that the marker
will be informative for a given meiosis, or alternatively a number of
allele frequencies (which should sum to 1) from which a conventional
PIC value is calculated by the program.

Second and subsequent lines (one for each pedigree):
A number of pairs of values for recombination fraction (in ascending
order) and observed two-point lod score. To indicate that a marker was
uninformative this line should consist of two zeros (separated by a
space).
If the marker was not tested in a particular pedigree this should be
indicated by leaving the line completely blank.

Input finishes when the end of file is reached, or when a blank line
is encountered instead of a line describing the next marker.
Information pertaining to each marker must be entered in the order in
which the markers appear along the map - the markers must be in order
of ascending distance.


PROGRAM CONSTANTS:

The following constants are defined in fastmap.h:

MAXPEDS     - the maximum number of pedigrees to be used
MAXMARKERS  - the maximum number of markers to be used
MAXPAIRS    - the maximum number of pairs of values for recombination
              fraction/lod score to be entered on each line
MAXDISTS    - the maximum number of specified distances at which the
              lod score can be evaluated (this has no effect on the
              number of equidistant points between the minimum and
              maximum if that option is chosen instead)
MINFRACTION - value specifying fraction of information from a given
              marker which can be discarded, and fractional overlap
              between markers which can be ignored

If desired these constants can be altered and the program recompiled.


NOTES ABOUT INPUT

1. Reliability values

The "reliability" value is the probability of observing the "expected"
phenotype for a given genotype in one offspring of an informative
phase-known meiosis - the combined probability of the offspring being
neither a nonpenetrant carrier nor a phenocopy. It can take values
between 0.5 and 1. In the context of the complex pedigree from which
the two-point lod scores are obtained, it provides some measure of the
extent to which the disease genotype is known for each individual,
given all the phenotypic information in the pedigree. In a large
complex pedigree, this reliability value may be relatively high
despite penetrance values being low or phenocopy rates high.
This is because there can often be a fairly high degree of certainty
of an individual's genotype, for example because of the pattern of
illness in his children.

The effect of different reliability values is to alter the sharpness
of curvature of the graph of expected lod score against recombination
fraction. High values produce more sharply peaked curves which (if
there are any apparent recombinants) go down to minus infinity at zero
recombination; lower values produce flattened out curves.

If a reliability value is not specified for a pedigree, FASTMAP will
find the value which gives the best fit to the input lod score values
for all the markers. (Note that reliability values can only be fitted
if at least one marker contains more than two pairs of recombination
fraction/lod score values, otherwise a reliability value of 1 will be
chosen.) If you are dealing with an incompletely penetrant disease or
one with phenocopies you should begin by letting FASTMAP generate
fitted values for the reliability. Such a fitted value is constrained
to lie between 0.51 and 0.99. If you are dealing with a
fully-penetrant trait then you may wish to specify a reliability of 1.

Fitting the reliability values takes a considerable amount of time
compared to the rest of the procedure. FASTMAP outputs the values that
it has chosen, and if you find that with different markers the same
pedigree always produces about the same reliability value then you can
save time by specifying this value in the input file. If every
pedigree has the same value then you can just specify one value
instead of one for each pedigree. I find that with moderately complex
pedigrees a value of 0.99 is appropriate even when dealing with a
disease with fairly low penetrance.

2. PIC values, etc

Normally, for each marker FASTMAP calculates a conventional PIC value
from input allele frequencies.
This is supposed to provide a value for the proportion of meioses
informative for the disease locus which can be expected to be also
informative for the marker. However the user does have the option of
entering this probability directly, and there are probably two
circumstances when you may wish to do this.

The first case in which this is desirable is when the two-point lod
score has been derived from more than one allelic system. If there are
two polymorphic systems at the same locus, or very close to each
other, then it may be preferable to calculate two-point lod scores
with them jointly (e.g. with MLINK) rather than to enter the results
separately into FASTMAP. In this case a joint PIC value should be
calculated, for the probability that at least one system will be
informative at a given locus. This is:

    PIC = 1-(1-PIC1)*(1-PIC2)

(The PIC values can be obtained by inspection of the debugging file
after the individual markers have been entered with their allele
frequencies.)

The second case when one might want to consider not using the
conventional PIC is to my mind much more dubious, and is when dealing
with a recessive disease. It is true that for certain types of mating
the PIC value does not give the true probability for a meiosis to be
informative. For example if two parents who are carriers of a
recessive disease have the same genotype and are heterozygotes, and if
the disease is known to be in phase with the same marker allele in
each parent, then if the child is affected but is heterozygous for the
marker we can conclude that there has been one recombinant and one
nonrecombinant meiosis. However for a dominant disease we would not be
able to conclude anything from the situation of two such heterozygote
parents (one affected) producing a heterozygote child. There is thus a
case for using a slightly higher value than the conventional PIC when
dealing with recessive diseases. However the difference from the
conventional PIC is small.
It is maximal for a two-allele system with equal allele frequencies,
when I calculate that the proportion of matings between two carriers
producing affected offspring which are informative is 0.469, compared
with a conventional PIC of 0.375. However when dealing with a complex
pedigree information will additionally be obtained from other types of
matings for which the ordinary PIC is probably more appropriate. I
would conclude that the size of the effect is likely to be negligible
in practice. This view is to some extent supported by the simulations
carried out with a recessive disease, which used conventional PIC
values but demonstrated performance which was overall at least as good
as for a dominant disease. Nevertheless, the option to enter values
other than the PIC is available to the user if desired.

3. Recombination fractions and lod scores

FASTMAP fits a number of recombinant and nonrecombinant meioses to the
observed two-point lod scores, and may fit a reliability value as
well. There are three distinct ways in which this fitting is
accomplished, depending on the number of pairs of values which are
entered for recombination fraction and lod score.

If only one pair of values is entered then this is taken to be for the
recombination fraction at which the maximum lod score is obtained. An
exact number of recombinant and nonrecombinant meioses which would
produce this maximum lod can readily be calculated, contingent on a
reliability value. It is only possible to use this form of input if
there is an available lod score at some recombination fraction which
is positive. In addition it is not possible to fit a reliability
value, since that depends on the curvature of the lod score graph.

If two pairs of values are entered then again it is possible to find
an exact solution which would produce a lod score curve going through
these two points. Again the solution is contingent on the reliability
value specified, which cannot be fitted.
This option can be used even when the lod scores are all negative.
However I would advise against only entering two pairs of values. The
reason is that the shape of the actual and fitted curves may not be
exactly the same, and it is easy to imagine that a solution which
passes exactly through the two points specified may be wildly
inaccurate at other recombination fractions.

When more than two pairs of values are entered, numbers of meioses are
chosen to produce a line which most closely approximates to the points
specified. This closeness is in the sense that the sum of squared
distances between points on the line and observed lod score values is
minimised. In this situation a reliability value can be fitted as well
as the number of recombinant and nonrecombinant meioses. Because of
the way the closeness of fit is measured, it is possible to bias the
fitting to give more priority to some recombination fractions than
others. For example if many pairs of values at small recombination
fractions were entered then more attention would be paid to getting
the line to fit well at small recombination fractions than large ones.
Actually, since lod scores at large recombination fractions are
relatively small anyway, it is the lod scores at smaller recombination
fractions which generally have more effect on the values eventually
arrived at. Lod scores at very small recombination fractions can be
very large indeed, so you are (strongly) advised to omit these (e.g.
at recombination fractions less than 0.01).

To summarise, my advice for the information to input would be a series
of lod score values at different recombination fractions ranging from
0.01 to 0.4. FASTMAP was evaluated using lod scores at 0.01, 0.05,
0.1, 0.2, 0.3 and 0.4 and this gave satisfactory results. If three or
more pairs of values are given for at least one of the markers then
this allows a reliability value to be fitted to the shape of the
curve.
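
The least-squares idea for three or more pairs can be illustrated with
a minimal Python sketch. It is not taken from FASTMAP.C. It assumes the
textbook lod score for phase-known meioses with a reliability of 1,
lod = n*log10(2*(1-theta)) + r*log10(2*theta), and the names
fit_meioses and model_lod are hypothetical. FASTMAP itself may also
fit the reliability and constrain the counts, which this sketch does
not attempt.

```python
import math


def model_lod(n, r, theta):
    """Assumed lod for n nonrecombinant and r recombinant meioses."""
    return n * math.log10(2.0 * (1.0 - theta)) + r * math.log10(2.0 * theta)


def fit_meioses(pairs):
    """Least-squares estimate of (nonrecombinants, recombinants).

    pairs is a list of (theta, lod) tuples with 0 < theta <= 0.5.
    The model is linear in n and r, so the 2x2 normal equations
    are solved directly.
    """
    saa = sab = sbb = say = sby = 0.0
    for theta, lod in pairs:
        a = math.log10(2.0 * (1.0 - theta))  # weight of one nonrecombinant
        b = math.log10(2.0 * theta)          # weight of one recombinant
        saa += a * a
        sab += a * b
        sbb += b * b
        say += a * lod
        sby += b * lod
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("need at least two distinct recombination fractions")
    n = (say * sbb - sby * sab) / det
    r = (sby * saa - say * sab) / det
    return n, r


# Synthetic check: lods generated from 3 nonrecombinants and 1 recombinant
# at the recombination fractions used to evaluate FASTMAP.
thetas = [0.01, 0.05, 0.1, 0.2, 0.3, 0.4]
synthetic = [(t, model_lod(3.0, 1.0, t)) for t in thetas]
n_fit, r_fit = fit_meioses(synthetic)
print(round(n_fit, 3), round(r_fit, 3))
```

Under this assumed model the recombinant term log10(2*theta) is about
-1.7 at theta = 0.01 but only about -0.1 at theta = 0.4, which is
consistent with the remark that lod scores at small recombination
fractions have most influence on the fitted values.
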
Avoid entering strongly negative values at very low recombination
fractions to avoid distorting the fitted curve too wildly (the price
of this is that the estimate may be inaccurate very close to the
marker positions, but this is unavoidable).


EXAMPLE INPUT FILE:

UPM6DF.OUT UPM6DF.GRP UPM6DF.DBG
-20 60
100
3
UP
MS5H 0 .2 .2 .2 .2 .2
0.010 -0.9811 0.050 -0.5865 0.100 -0.3641 0.200 -0.1526 0.300 -0.0566 0.400 -0.0127
0.010 -2.8312 0.050 -1.9729 0.100 -1.4685 0.200 -0.8191 0.300 -0.4055 0.400 -0.1670
0.010 -2.4945 0.050 -1.7574 0.100 -1.1036 0.200 -0.3999 0.300 -0.1076 0.400 -0.0125
L6-3 21.2 .43 .57
0.010 -0.0902 0.050 -0.0747 0.100 -0.0579 0.200 -0.0316 0.300 -0.0138 0.400 -0.0034
0.010 -1.5290 0.050 -0.9314 0.100 -0.6415 0.200 -0.3675 0.300 -0.2186 0.400 -0.1029
0.010 0.0000 0.050 0.0000 0.100 0.0000 0.200 0.0000 0.300 0.0000 0.400 0.0000
HD2G 42.4 .24 .76
0.010 -0.0007 0.050 -0.0006 0.100 -0.0005 0.200 -0.0003 0.300 -0.0001 0.400 -0.0000
0.010 -2.3743 0.050 -1.4876 0.100 -0.9106 0.200 -0.3696 0.300 -0.1342 0.400 -0.0296
0.010 -0.8302 0.050 -0.7315 0.100 -0.4721 0.200 -0.0962 0.300 0.0329 0.400 0.0282


OUTPUT FILES

FASTMAP produces up to three output files with the names specified on
the first line of the input file.

1. Table output

The first output file is a simple table of distance against lod score
- total lod score and a breakdown by pedigree. Because the lod score
may be evaluated at a large number of positions (100 in the example
above) the pedigrees are arranged in columns, rather than rows as
might seem more natural.

2. Graph file output

The latest version of FASTMAP allows preparation of graph files for
one of two graphing programs, EASIGRAF which runs under DOS or ACE/gr
which runs on workstations and terminals using the X graphics system.

a) EASIGRAF graphs

The second file, if specified, is a graph file for input into
EASIGRAF, a Shareware graphing program supplied with the EASISTAT
package (obtainable from me or the same source as you acquired
FASTMAP).
This displays a graph of lod score against distance - again both the
total lod score and for each pedigree. A neat feature is that it also
displays each marker on the same graph. It is run by specifying the
name of the graph file on the command line, e.g.:

    EASIGRAF filename.grp

Please consult the EASISTAT documentation for details on how various
aspects of the display may be altered. Essentially, you can use the
"Axes" menu to control aspects of the labelling and scaling of the X
and Y axes, and the "Data" menu to control which columns are displayed
from the graph file (the first column corresponds to map distance, the
second to total lod and subsequent columns for each pedigree's lod
score). If you wish to only display the total lod score this can be
done by pressing D for the "Data" menu, then pressing 5 to select XY
columns, then entering 1,2 to graph the second column against the
first. Then keep pressing Enter to return to the main menu.

There are a couple of points worth mentioning specifically. The marker
labels are implemented as "floating titles" for EASIGRAF, which means
they always appear in the same position on the screen. This means that
if you change the horizontal scale of the graph the marker labels will
no longer be in the correct position (you can change the vertical
scale with no problems). When the graph file is first read in by
EASIGRAF the horizontal scale is determined by the minimum and maximum
distances which were entered to FASTMAP on line 2 of the input file.
If the data is regraphed (for instance if you use the "Data" menu to
graph just the total lod score against distance, columns 1 and 2 of
the graph file) then the graph will be rescaled. The new minimum and
maximum distances will then be determined by the smallest and largest
distances for which a lod score was calculated. If you selected the
option to calculate scores at equidistant points between the minimum
and maximum, then the scale of the graph will be unchanged.
However if lod scores were only calculated for specific points then
the smallest and largest of these distances will determine the new
scale and the floating titles may appear in the wrong place. If you
wish to change the horizontal scaling of the graph, the best way to do
it is to run FASTMAP again with different minimum and maximum
distances specified, otherwise the floating titles for the markers
will appear in the wrong place.

Another point about the marker labels is that if the markers are close
together then the labels may overwrite each other. To fix this just
alter the vertical position of the relevant floating title. Select
"Edit titles" from the "Titles" menu, then select "Edit TITLEF's". Go
through pressing Enter till you get to the desired label. Leave the
text unchanged, but backspace and change the Y value for the position
(e.g. from 0.0 to 0.1) and retype the rotation to 90. Then press Enter
and Escape appropriately to return to the main menu. The marker label
will be moved up a bit, clear of the other labels.

b) ACE/gr graphs

ACE/gr is a graphing program which runs on workstations and terminals
using X (the command used to run this program can be either xvgr or
xmgr). ACE/gr was written by Paul Turner and is available in source
form from ftp.ccalmr.ogi.edu in CCALMR/pub/acegr. If desired then the
graph file specified as the second output file can be produced in a
format suitable for display by this program rather than by EASIGRAF
(ACE/gr is more powerful than EASIGRAF and will produce higher quality
output).

In order to specify that the graph file should be produced in ACE/gr
format, FASTMAP must be run with the command line switch -x (or under
DOS /x). The format is as follows:

    fastmap [input.dat] [-x[labelpos]]    (Unix)
    fastmap [input.dat] [/x[labelpos]]    (MSDOS)

The -x switch can be followed immediately by a number (labelpos) which
determines the position of the marker names on the graph.
By default the names of the markers will appear on the graph at a
height equal to a lod score of -10, but this can be changed by
specifying a different value for labelpos. For example, to have the
marker names appear above the graph at a height equal to a lod score
of 3, one would enter:

    fastmap input.dat -x3

If you have the example files EPLDALL.INP and UPM6DF.INP then
appropriate commands are:

    fastmap epldall.inp -x1
and:
    fastmap upm6df.inp -x-17

The graph is very similar to the one produced for EASIGRAF. However
with ACE/gr it is possible to rescale the graph both vertically and
horizontally because the marker names are placed using the same
coordinate system as the data values (instead of occupying fixed
points on the screen as with EASIGRAF).

In order to display the graph run the program (called xvgr or xmgr)
and select the "Read sets" command from the "File" menu. Read in the
file. Click on the autoscale button (AS) and the graph of lod scores
should be displayed with the markers in the appropriate positions. As
with EASIGRAF, it is possible to make adjustments to the final graph
either by using the facilities of the program or by editing the graph
file before reading it in.

3. Debug file

The output from this is fairly complex, and should be studied in
conjunction with the source code and description of the algorithm. A
detailed description of its contents is given later in the
documentation.


LOD SCORES ASSUMING HETEROGENEITY

As well as simply totalling lod scores across pedigrees, it is
possible to automatically calculate lod scores under the assumption of
heterogeneity - for example that a locus may influence susceptibility
to a disease in only a certain proportion of families. This proportion
is conventionally termed alpha, and desired values of alpha can be
specified using the command line switch -a (or under DOS /a).
The format is as follows:

    fastmap [input.dat] [-aalpha1 [-aalpha2 ...]]    (Unix)
    fastmap [input.dat] [/aalpha1 [/aalpha2 ...]]    (MSDOS)

Any number of alpha values (up to 10) can be provided. For example to
obtain lod scores under the assumptions that 60% or 80% of families
might be linked one would enter:

    fastmap input.dat -a0.6 -a0.8

The adjusted lod scores are appended to the others in both the table
and graph files. To obtain clear graphs you will probably want to
switch off display of the individual lod scores, either by editing the
graph files or by using the relevant functions of the graphing
programs themselves.

Note that there is some debate concerning the statistical properties
of the lod score under the assumption of heterogeneity as a test for
linkage. In addition, the properties of the FASTMAP approximation have
not been explored with regard to this situation. The mean lod score
for each family obtained by FASTMAP is fairly unbiased with respect to
the true multipoint lod score, and this means that the total lod score
will also be unbiased. However, it is possible that if the variance of
individual FASTMAP lod scores were markedly increased or reduced
compared to the true lod scores then the adjusted lod score obtained
under the assumption of heterogeneity might be different to what it
would be if a full multipoint analysis were performed.


USING FASTMAP IN PRACTICE

Supplied with these files is a utility program called TABLE which
produces the pairs of recombination fractions and lod scores needed to
input to FASTMAP. It is run on the output of MLINK, although it does
assume that the output from each two-point analysis will be in a
separate results file. To get these pairs TABLE is run with the /I
switch, e.g.:

    TABLE filename.res /I

This would make a new file called filename.inp containing the pairs of
values at recombination fractions between 0.01 and 0.4.
Of course you would still have to input the additional information
about the number of pedigrees, etc.

Still things can be made even easier. The setup I have is to have
different files containing one line of information about each locus
(its name, position and allele frequencies) in one subdirectory. So
there might be a file called F13A.INP with the following contents:

    F13A -50 .2 .2 .2 .2 .2

(You do have to be careful that the file has one and only one line
feed at the end of it, otherwise you would get extraneous blank lines
in your input file to FASTMAP.)

Then one can have a couple of simple batch files along the lines of:

SETUPINP.BAT

    echo %1.out %1.grp >%1.inp
    echo %2 %3 >>%1.inp
    echo %4 >>%1.inp
    echo %5 >>%1.inp
    echo %6 %7 >>%1.inp

and:

ADDINP.BAT

    type d:\ls4\%2.inp >>%1.inp
    table %3.res /i
    type %3.inp >> %1.inp

These assume that the one line files for each locus are in the
directory D:\LS4. Then a batch file which will take all the relevant
two-point results files, prepare them to make an input file for
FASTMAP and run FASTMAP could look like this:

DOFAST6.BAT

    CALL SETUPINP EPHDALL -80 60 100 25 EPHD
    CALL ADDINP EPHDALL F13A EPHDF13A
    CALL ADDINP EPHDALL 6S89 EPHD6S89
    CALL ADDINP EPHDALL 6109 EPHDF109
    CALL ADDINP EPHDALL 6105 EPHDF105
    CALL ADDINP EPHDALL 6S10 EPHD6S10
    CALL ADDINP EPHDALL C4 EPHDC4
    CALL ADDINP EPHDALL DQA EPHDDQA
    CALL ADDINP EPHDALL TCTE EPHDTCTE
    FASTMAP EPHDALL.INP

The call to SETUPINP.BAT produces the first few lines of EPHDALL.INP,
with no "reliability" value specified. In the following lines, each
call to ADDINP.BAT takes the one line locus description for a marker
(D:\LS4\F13A.INP etc.) and adds it to EPHDALL.INP, then runs table on
the results file (EPHDF13A.RES etc.) and adds the resulting file (e.g.
EPHDF13A.INP) onto EPHDALL.INP. Finally FASTMAP is run with
EPHDALL.INP as input.

Of course you don't have to go to these lengths, but as you grow more
familiar with FASTMAP you might like to bear these examples in mind.
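
For anyone who prefers to assemble the input file in a general-purpose
language, the following is a minimal Python sketch that writes text in
the format described under PROGRAM INPUT. It is not part of the
FASTMAP distribution. The function name build_fastmap_input, its
argument layout and the file names demo.out and demo.grp are all
hypothetical, and the checks simply mirror the stated rules (allele
frequencies sum to 1, markers in ascending map order, one line per
pedigree).

```python
def build_fastmap_input(out_files, min_cm, max_cm, points, n_peds,
                        disease, markers):
    """Return FASTMAP input text.

    markers is a list of (name, position_cM, info, per_pedigree) where
    info is one probability or several allele frequencies, and
    per_pedigree holds, for each pedigree, a list of (theta, lod)
    pairs, an empty list (uninformative) or None (not tested).
    """
    lines = [
        " ".join(out_files),               # line 1: one to three filenames
        f"{min_cm} {max_cm}",              # line 2: map limits in cM
        " ".join(str(p) for p in points),  # line 3: point count or distances
        str(n_peds),                       # line 4: number of pedigrees
        disease,                           # line 5: disease locus name
    ]
    last_pos = None
    for name, pos, info, per_pedigree in markers:
        if last_pos is not None and pos < last_pos:
            raise ValueError("markers must be in ascending map order")
        if len(info) > 1 and abs(sum(info) - 1.0) > 1e-6:
            raise ValueError(f"allele frequencies for {name} must sum to 1")
        if len(per_pedigree) != n_peds:
            raise ValueError(f"{name} needs one line per pedigree")
        last_pos = pos
        lines.append(" ".join([name, str(pos)] + [str(x) for x in info]))
        for pairs in per_pedigree:
            if pairs is None:
                lines.append("")           # marker not tested: blank line
            elif not pairs:
                lines.append("0 0")        # marker uninformative: two zeros
            else:
                lines.append(" ".join(f"{t:.3f} {z:.4f}" for t, z in pairs))
    return "\n".join(lines) + "\n"


# Two pedigrees, one marker; the second pedigree was not tested.
demo = build_fastmap_input(
    ["demo.out", "demo.grp"], -20, 60, [100], 2, "UP",
    [("MS5H", 0, [0.2] * 5, [[(0.01, -0.9811), (0.05, -0.5865)], None])],
)
print(demo, end="")
```

The returned text can be written to a file and passed to FASTMAP on
the command line in the usual way.
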
Gary Williams at HGMP Harrow has produced the following equivalent
shell scripts to prepare input files under Unix. To produce the
filename.inp files the TABLE program should be run with a -i switch
under Unix:

    table filename.res -i

Then the following scripts are equivalent to the batch files described
above.

File setupinp:

    #!/bin/csh -f
    echo $1.out $1.grp > $1.inp
    echo $2 $3 >> $1.inp
    echo $4 >> $1.inp
    echo $5 >> $1.inp
    echo $6 $7 >> $1.inp

File addinp:

    #!/bin/csh -f
    cat $2.inp >> $1.inp
    table $3.res -i
    cat $3.inp >> $1.inp

File dofast6:

    #!/bin/csh -f
    setupinp ephdall -80 60 100 25 ephd
    addinp ephdall f13a ephdf13a
    addinp ephdall 6s89 ephd6s89
    addinp ephdall 6109 ephdf109
    addinp ephdall 6105 ephdf105
    addinp ephdall 6s10 ephd6s10
    addinp ephdall c4 ephdc4
    addinp ephdall dqa ephddqa
    addinp ephdall tcte ephdtcte
    fastmap ephdall.inp

All these script files should be made executable by the command:

    chmod +x filename


PROBLEMS WITH FASTMAP

If FASTMAP seems to be producing poor approximations, there are a
number of things you may want to look at. Certainly it may be helpful
to examine the debugging file to see if any information gives a clue
as to what may be happening. You can check how good FASTMAP is at
fitting to the supplied two-point data by only inputting the data for
one marker at a time and checking to see how closely the output
corresponds to the input. If you have supplied a "reliability" value
then it would be worth removing this and letting FASTMAP fit to the
supplied lod score values with the reliability unconstrained. Make
sure that whenever possible you enter information by pedigree, rather
than as total lod scores summed over all pedigrees. However there are
some occasions when FASTMAP will not produce a very good
approximation, for example if there just happens to be an unexpectedly
large number of recombinations between markers, or if two markers just
happen to be informative for all the same matings, and so on.
I would be interested to see examples of such bad performance, to see
if there are any further improvements which could be made.


DETAILED CONTENTS OF DEBUG FILE

It contains the following information:

For each marker, the proportion of meioses for which it is expected to
be informative. (This may either be input directly by the user, or is
the PIC value calculated from the allele frequencies supplied
instead.)

All the following information is repeated once for each pedigree.

The reliability value is output, which may be supplied by the user or
fitted by the program.

For each marker the estimated equivalent number of recombinant and
nonrecombinant meioses that would produce lod scores close to those
observed is output.

The total estimated number of meioses informative for the disease
locus is output (based on the estimated number of informative meioses
for each marker and the probability of each marker being informative).

For each marker, based on this total, the fraction of meioses for
which that marker is deemed to be actually informative.

The following information is repeated once for every interval on the
map.

Information pertaining to each marker to the right of the disease
locus goes into one column, and each to the left in a row. The
information consists of the number of recombinant meioses which are
expected to be informative for a given marker, and for no other marker
between it and the disease locus. The top row and leftmost column are
for the meioses which are only informative for a marker in the right
group or in the left group (but not both). In the top row the number
of nonrecombinants with each right hand marker is printed just above
and to the left of the number of recombinants. In the leftmost column
the number of nonrecombinants with each marker is two lines above the
number of recombinants. The first set of values, which concerns the
first interval, will all be in one row.
The first pair of numbers is the estimated number of nonrecombinants
and recombinants for the first marker. The second pair relates to the
second marker, but excludes those meioses for which the first marker
is expected to have already been informative, and so on. Reading down
each column and along each row into the table one can see the meioses
which are expected to be informative for a marker in the lefthand
group and in the righthand group simultaneously. These meioses are
categorised as to whether they are nonrecombinant or recombinant for
each marker.

Here is an example debug file containing information about 1 pedigree
and 3 markers:

DQA.prob_inf=0.600000
C4.prob_inf=0.600000
6S10.prob_inf=0.600000
ped 1, "reliability" = 0.990:
DQA: 0.837 nonrec, 0.000 rec
C4: 0.837 nonrec, 0.000 rec
D6S10: 0.000 nonrec, 1.599 rec
Estimated total informative meioses for ped 1: 1.914425
DQA.fraction_used: 0.437224
C4.fraction_used: 0.437224
D6S10.fraction_used: 0.835266

0.837   0.471   0.000
    0.000   0.000   0.597

        0.471   0.000
            0.000   1.006
0.423   0.364   0.000
            0.000   0.049
0.000   0.000   0.000
            0.000   0.000

        0.000
            1.006
0.294   0.000
            0.543
0.000   0.000
            0.000
0.421   0.000
            0.048
0.000   0.000
            0.000

0.000
1.599
0.294
0.000
0.000
0.000

The first three lines say that each marker had a probability of 0.6 of
being informative (this information had been entered directly). The
"reliability" value was set to be 0.99. From the observed lod scores,
the estimated equivalent numbers of meioses were 0.837 nonrecombinants
with no recombinants for the first two markers, and 1.599 recombinants
with no nonrecombinant meioses for the third. The estimated total
number of potentially informative meioses in the whole pedigree was
1.91, yielding the stated values for the fractions for which each
marker actually was informative. (So the third marker, with a higher
estimated total number of meioses, turned out to be slightly more
informative than expected, while the first two were slightly less.)
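
The fraction_used values in this example can be checked by hand. The
short Python sketch below assumes that each fraction is simply the
marker's estimated meioses (nonrecombinant plus recombinant) divided
by the estimated total for the pedigree. The debug file does not state
this formula, so it is an inference from the description above, and
the helper name fraction_used is hypothetical. The small discrepancies
are what one would expect from the counts being printed to only three
decimal places.

```python
# Values copied from the example debug file above.
total_informative = 1.914425
meioses = {            # marker: (nonrecombinant, recombinant)
    "DQA": (0.837, 0.000),
    "C4": (0.837, 0.000),
    "D6S10": (0.000, 1.599),
}
reported = {"DQA": 0.437224, "C4": 0.437224, "D6S10": 0.835266}


def fraction_used(nonrec, rec, total):
    """Assumed definition: share of the pedigree's informative meioses."""
    return (nonrec + rec) / total


computed = {}
for name, (nonrec, rec) in meioses.items():
    computed[name] = fraction_used(nonrec, rec, total_informative)
    print(f"{name}: computed {computed[name]:.4f}, reported {reported[name]:.4f}")
```

Both 0.437 and 0.835 can be compared with the entered probability of
0.6 to see which markers were more or less informative than expected.
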
The first row shows the likely distribution of these meioses. The first marker has 0.837 nonrecombinants. The second marker has 0.471 remaining from its original 0.837 once we have excluded the ones for which the first was informative. By the time we get to the third marker there remain 0.597 of its recombinant meioses for which neither of the first two were informative. (We expect that some of the meioses which were nonrecombinant at the position of the first and/or second markers may have become recombinant by the time we get to the third. Incidentally, although the distances are not shown in the debugging file, there is a recombination fraction of 0.01 between the first two markers and 0.04 between the second and third.)

Now we move on to the next interval. The first marker is now in the lefthand group. Here we see that there are 0.364 meioses for which the first two markers are both nonrecombinant. There are another 0.049 meioses for which the first marker is nonrecombinant and the third marker is recombinant, and there are 0.423 meioses for which the first marker is nonrecombinant and no other marker is informative. The third marker is also recombinant for 1.006 meioses which are not informative for either of the first two.

In the next interval we again see the 1.006 recombinant meioses for which only the third marker is informative. There are 0.543 meioses for which it is recombinant and the second marker is nonrecombinant, and another 0.048 for which the first marker is nonrecombinant. There are 0.294 meioses for which the second marker is nonrecombinant and the third uninformative. There are 0.421 meioses for which the first marker is nonrecombinant and both the others uninformative.

In the final interval all markers are now in the lefthand group. We begin with the third marker which has 1.599 recombinant meioses. Excluding these, there remain 0.294 meioses which are nonrecombinant for the second marker.
On this occasion we estimate that there are no meioses which are informative for the first marker but for neither of the others.

NOTES ABOUT IMPLEMENTATION

FASTMAP.EXE is a DOS executable which should run on any IBM PC compatible running MS-DOS. If a maths coprocessor is present it will speed up calculations, but it is not required. I have been running it on a 486 which gives good performance - estimated multipoints using 25 pedigrees and 8 markers with reliability values to be fitted by the program were produced in 70 seconds. The Sun SPARCServer I have access to produced the same results in 13 seconds.

The file FASTMAP.C is supplied and should compile OK on most compilers with little if any modification. I have compiled it with the Zortech DOS compiler and on a Sun. If you compile it on a DOS machine you may want to ensure that a large stack is provided, and you should use a large memory model so there is room for the data tables.

FASTMAP.H begins with a few #defines to control compilation. You may want to modify these for your own compiler. The issues are whether the compiler can accept ANSI C/C++ style prototypes, whether it can use enums (this is pretty unimportant), and where to find a prototype for calloc (mine is in stdlib.h). There may also be compiler-specific ways to modify the stack size, and with the Zortech compiler this is accomplished with the _stack=30000 statement. Some libraries contain the function index() instead of strchr(). Both do the same thing, so you may need to use the "#define strchr index" statement. As well as declaring functions and variables, the header file defines a few program constants (listed above) which can be changed if desired.

A general point about coding style is that I have tended to keep a fair amount of information in structures, which are passed to functions either by value or reference. This largely reflects my exposure to C++ and an attempt to make the code somewhat object-orientated.
This and other factors may mean that the code is not as efficient as it could be, but on the other hand it should make it easier to modify if improvements can be found for the basic algorithm.

Another slight inefficiency may be the liberal use of doubles rather than floats. A major reason for this is that I have used the old-fashioned argument-passing style so that the code will be compatible with K & R compilers. However ANSI compilers will then report errors if arguments are declared as floats (what happens is that all float arguments are passed as doubles, and ANSI compilers will not make the automatic cast back to float when this style of argument-passing is used). Since using doubles does not actually incur a prohibitive overhead, I have tended to use them throughout to avoid having to worry about this problem.

I have now commented the code fairly comprehensively, and I hope that in conjunction with the paper it should be possible to work out what is going on.

AVAILABILITY

FASTMAP is available directly from me on receipt of a formatted floppy disk. However I would prefer people to obtain it from one of the software libraries on the Internet listed below. The EASISTAT package is available from the same sources, but requires another 720 K of disk space, so if you wish to obtain it from me then please enclose the appropriate number of extra formatted disks.

gene-server:
    Internet:      gene-server@bchs.uh.edu
    BITNET/EARN:   gene-server%bchs.uh.edu@CUNYVM
    UUCP:          gene-server@bchs.UUCP (new style)
    Send mail with Subject: SEND DOS HELP
    Anonymous ftp: ftp.bchs.uh.edu (in /pub/gene-server/dos)

The following are mirror sites for the above collection.
European:
    Anonymous FTP: nic.funet.fi
    E-mail: mailserver@nic.funet.fi
    Send mail message: HELP

European EMBL server:
    NetServ@EMBL-Heidelberg.DE
    Send mail message: DIR DOS_SOFTWARE
    Anonymous ftp: ftp.embl-heidelberg.de (/pub/software/dos)
    Manager: Rainer Fuchs, Fuchs@EMBL-Heidelberg.DE
    Problems: NetHelp@EMBL-Heidelberg.DE

USA anonymous FTP: ftp.bio.indiana.edu

Please feel very free to contact me (email preferred) with comments, questions, etc. I would be very interested in people's views on how well it performs and how useful (or not) it is.

Dave Curtis
Academic Department of Psychiatry
St Mary's Hospital Medical School
Praed Street
London W2 1NY, England

Phone: 071 725 1638
Janet: dcurtis@UK.AC.CRC
Elsewhere: dcurtis@CRC.AC.UK
EARN/Bitnet: dcurtis%CRC@UKACRL
Usenet: ...!mcsun!ukc!mrccrc!D.Curtis