MFLINK documentation

MFLINK is a simple program which automates setting up likelihood calculations for linkage analysis using a variety of different transmission models, and then collating the results in order to produce a (nearly) model free lod score as described in our paper:

Curtis D, Sham PC. (1995) Model-free linkage analysis using likelihoods. Am J Hum Genet 57: 703-716.

Please see this paper for a full discussion of the methodology and interpretation of results.

VERSION HISTORY

1.6 Fixed bug which stopped some pedigrees working with vitesse.
1.5 Fixed bug which might have caused a problem with X-linked analyses (though apparently didn't).
1.4 Fixed bug to allow X-linked analyses.
1.2 Added VITESSE compatibility.
1.1 Documented grid.dat. Removed blank lines from end of example files.

NECESSARY FILES

MFLINK needs three data files, pedfile.dat, linked.dat and unlinked.dat. It also needs a copy of the MLINK program which has been renamed to be called NOSCORE. Ideally, this will have been recompiled with the "score" constant set to be false. This prevents MLINK from outputting the log likelihoods with the test interval set to have a recombination fraction of 50%. However if recompiling MLINK is a problem, then the standard version of MLINK can be used. The first set of log likelihoods produced by MLINK (the ones with the recombination fraction set to 50%) will then simply be ignored by MFLINK when it reads the MLINK output. The only disadvantage is that calculating these likelihoods makes the procedure take twice as long.

You can get a copy of MLINK called NOSCORE just by changing to the directory containing it and copying the program file to have the appropriate name.
If you are using MSDOS then you can issue the following command:

COPY MLINK.EXE NOSCORE.EXE

Or under Unix:

cp mlink noscore

If you do not have the appropriate permission to access the program file or wish to save disk space then you can set up a Unix script file called noscore or a DOS batch file called NOSCORE.BAT which contains the single command "mlink". Alternatively, under Unix you can set up a symbolic link called noscore which links to the mlink executable. For example, on my system the LINKAGE executables are kept in the /packages/fastlink/bin directory, so I can do this by issuing the following command from a directory on my search path:

ln -s /packages/fastlink/bin/mlink noscore

However you set up the NOSCORE executable, it is important that it lies on the search path so that it can be run without having to specify the name or the directory in which it resides. MFLINK runs NOSCORE by executing the C statement:

system("noscore")

This will work under DOS and Unix (and I assume other operating systems) provided the NOSCORE executable is on the path.

NB!!! It is essential that, when MLINK is compiled to produce NOSCORE, the constant "byfamily" is set to be true. This means that NOSCORE will output the log likelihoods for each individual family rather than the log likelihoods totalled over all families. Failing to set "score" to false will merely make MFLINK run twice as slowly, but if "byfamily" is not set to true then the program will fail completely.

DATA FILES

The three data files consist of pedfile.dat, which is a standard LINKAGE pedigree file suitable for input to MLINK, and two MLINK locus data files called linked.dat and unlinked.dat. The two files both relate to pedfile.dat, but differ in their description of the recombination fractions between loci and sometimes the locus order. MFLINK is used to test a specified position on the genetic map (see the paper for full details).
The first file, linked.dat, describes the situation where the affection locus is at this test position, while unlinked.dat describes the situation when the affection locus is unlinked to the marker(s).

The description of the transmission model for the affection locus is used only to obtain the population prevalence for the disease. MFLINK calculates the prevalence which would be produced from the allele frequencies and penetrance values provided, and then uses this value to construct its own transmission models. The transmission model parameters provided are not used for any other purpose, and should simply be chosen to produce the desired prevalence value. Please note however that no penetrance values should be set to 0 or 1, because doing this might cause UNKNOWN to detect an impossible segregation pattern. Please note also that there can only be one liability class for the affection locus.

Here is an example linked.dat file:

2 0 0 5 << no loci, risk locus, sexlinked(if 1)
0 0.0 0.0 0 << mut locus, mut rate, haplotype freq(if 1)
1 2 << order of loci
1 2 << affection, #alleles [MDPN]
0.9 0.1 << gene freqs
1 << number of liability classes
0.00100 0.00100 0.50000
3 4 << numbered alleles, #alleles [5119]
0.25 0.25 0.25 0.25 << gene freqs
0 0
0.01
1 2 1

This file describes an affection and marker locus, the affection locus phenotypes being listed first in pedfile.dat. After the definition of the second locus comes the information concerning the relative positions of the two loci: here the recombination fraction between them is set to 0.01, indicating that the test position is at a recombination fraction of 0.01 with the marker, which might be appropriate for example if the marker were at a candidate gene.
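As a rough illustration of the prevalence calculation mentioned above, the following Python sketch works out the prevalence implied by the affection locus in the example linked.dat. It is not taken from the MFLINK source: the function name is made up for this example, and it assumes Hardy-Weinberg proportions and that the three penetrances are listed in the genotype order 1/1, 1/2, 2/2.

```python
def prevalence(q, penetrances):
    """Population prevalence implied by one two-allele affection locus.

    q is the frequency of allele 2. The penetrances are ASSUMED to be in
    genotype order 1/1, 1/2, 2/2, and genotype frequencies are ASSUMED to
    follow Hardy-Weinberg proportions.
    """
    p = 1.0 - q
    f11, f12, f22 = penetrances
    return p * p * f11 + 2.0 * p * q * f12 + q * q * f22


# Affection locus from the example linked.dat:
# gene freqs 0.9 0.1, penetrances 0.00100 0.00100 0.50000
example_k = prevalence(0.1, (0.001, 0.001, 0.5))
print(f"{example_k:.5f}")  # 0.81*0.001 + 0.18*0.001 + 0.01*0.5 = 0.00599
```

Under these assumptions the example file corresponds to a prevalence of about 0.6%, so the gene frequencies and penetrances would be adjusted if a different prevalence were wanted.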
The corresponding unlinked.dat file would appear as follows:

2 0 0 5 << no loci, risk locus, sexlinked(if 1)
0 0.0 0.0 0 << mut locus, mut rate, haplotype freq(if 1)
1 2 << order of loci
1 2 << affection, #alleles [MDPN]
0.9 0.1 << gene freqs
1 << number of liability classes
0.00100 0.00100 0.50000
3 4 << numbered alleles, #alleles [5119]
0.25 0.25 0.25 0.25 << gene freqs
0 0
0.5
1 2 1

It can be seen that this file is identical to linked.dat except that the recombination fraction is set to 0.5, indicating non-linkage.

The situation is slightly more complicated if it is desired to test a position between flanking markers, because then the locus order must be changed to indicate non-linkage. Suppose we set linked.dat to test a position midway between two markers:

3 0 0 5 << no loci, risk locus, sexlinked(if 1)
0 0.0 0.0 0 << mut locus, mut rate, haplotype freq(if 1)
2 1 3 << order of loci
1 2 << affection, #alleles [MDPN]
0.995 0.005 << gene freqs
1 << number of liability classes
0.00500 0.50000 0.50000
3 5 << numbered alleles, #alleles [PFCC]
0.2 0.3 0.1 0.1 0.3 << gene freqs
3 5 << numbered alleles, #alleles [DRCC]
0.2 0.1 0.3 0.1 0.3 << gene freqs
0 0
0.05 0.05
1 2 1

Now in order to indicate non-linkage, we need to set the affection locus to be on one or other side of the two linked markers, at a recombination fraction of 0.5, so we change the specified locus order as well as the values for the recombination fractions:

3 0 0 5 << no loci, risk locus, sexlinked(if 1)
0 0.0 0.0 0 << mut locus, mut rate, haplotype freq(if 1)
2 3 1 << order of loci
1 2 << affection, #alleles [MDPN]
0.995 0.005 << gene freqs
1 << number of liability classes
0.00500 0.50000 0.50000
3 5 << numbered alleles, #alleles [PFCC]
0.2 0.3 0.1 0.1 0.3 << gene freqs
3 5 << numbered alleles, #alleles [DRCC]
0.2 0.1 0.3 0.1 0.3 << gene freqs
0 0
0.1 0.5
1 2 1

The recombination fraction between the two markers has been set to 0.1, while the affection locus is at a recombination fraction of 0.5 with
the second marker.

If you are used to using LCP or DOLINK then the most convenient way to produce these data files is probably to set up a conventional linkage analysis and then run it with the "nodelete" parameter, e.g.:

pedin nodelete

When this parameter is provided the linkage data files are not deleted by the shell script, and you will be left with the pedfile.dat and datafile.dat files which were used in the conventional analysis. It is then a fairly simple matter to edit datafile.dat to produce linked.dat and unlinked.dat. (The latest version of DOLINK will incorporate a feature to set up MFLINK analyses automatically.)

RUNNING MFLINK

Once you have set up the three data files correctly, begin by copying either linked.dat or unlinked.dat to be called datafile.dat. Then run the UNKNOWN program, which produces speedfile.dat and ipedfile.dat ready to be used by MLINK (which we have renamed NOSCORE). At this point, you might like to try running NOSCORE directly to see that there are no errors. If it works OK, then run MFLINK, and it will automatically set up and run all the necessary likelihood calculations.

What MFLINK does is to carry out likelihood calculations (using NOSCORE) under conditions of linkage and non-linkage for a range of transmission models. These models range between Mendelian recessive and a null effect and then between Mendelian dominant and a null effect. For each transmission model, MFLINK copies unlinked.dat to be called datafile.dat, but first alters the affection locus specification to reflect the desired transmission model, and then calls NOSCORE. It then reads in the log likelihoods for each family output by NOSCORE (these are contained in outfile.dat). Then it copies linked.dat to be called datafile.dat, again respecifying the transmission model parameters of the affection locus. It runs NOSCORE again and inputs the log likelihoods produced.
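Once the per-family log likelihoods from the two NOSCORE runs are in hand, they are combined across families and across transmission models. The Python sketch below shows one way this collation could be written. It is not the MFLINK source: the function names are invented for this example, the inputs are assumed to be natural-log likelihoods, alpha is maximised by a simple grid search, and models whose non-linkage log likelihood is negative infinity are skipped when forming differences. All of these are assumptions made for illustration.

```python
import math


def admixture_loglik(alpha, unlinked, linked):
    """Log likelihood when a proportion alpha of families is linked.

    unlinked and linked each hold one natural-log likelihood per family
    (ASSUMED units), in the same family order.
    """
    total = 0.0
    for l0, l1 in zip(unlinked, linked):
        m = max(l0, l1)
        if m == -math.inf:
            return -math.inf  # family impossible under both hypotheses
        # alpha*L1 + (1 - alpha)*L0, scaled by exp(m) to avoid underflow
        mix = alpha * math.exp(l1 - m) + (1.0 - alpha) * math.exp(l0 - m)
        if mix <= 0.0:
            return -math.inf
        total += m + math.log(mix)
    return total


def summarise_model(unlinked, linked, steps=100):
    """Return (non-linkage, linkage, best admixture) log likelihoods."""
    ll_unlinked = sum(unlinked)
    ll_linked = sum(linked)
    # Grid search over alpha in [0, 1]; an ASSUMPTION, not MFLINK's method.
    ll_admix = max(admixture_loglik(i / steps, unlinked, linked)
                   for i in range(steps + 1))
    return ll_unlinked, ll_linked, ll_admix


def lod_scores(models):
    """models: list of (unlinked, linked) per-family lists, one per model.

    Returns (maximum lod, maximum admixture lod, "model-free" lod) in
    log10 units. Raises ValueError if every model has a non-linkage log
    likelihood of negative infinity.
    """
    rows = [summarise_model(u, l) for u, l in models]
    usable = [r for r in rows if r[0] > -math.inf]  # skip -inf models
    ln10 = math.log(10.0)
    max_lod = max(r[1] - r[0] for r in usable) / ln10
    max_admix_lod = max(r[2] - r[0] for r in usable) / ln10
    model_free = (max(r[2] for r in usable) - max(r[0] for r in usable)) / ln10
    return max_lod, max_admix_lod, model_free
```

Note that the first two statistics compare linkage and non-linkage within the same transmission model, whereas the "model-free" statistic takes the maximum over models separately for each hypothesis before taking the difference.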
For each transmission model, it thus obtains the overall log likelihood under the hypothesis of non-linkage (obtained by totalling all the family log likelihoods from the first set output), the log likelihood under the hypothesis that all families are linked (obtained by summing the log likelihoods from the second set) and the log likelihood assuming a proportion are linked (obtained by using the standard admixture formula maximised over alpha).

MFLINK provides the following output in mflink.out:

The log likelihoods under non-linkage, linkage and admixture for each transmission model. (These are often negative infinity for the Mendelian dominant model.)

The maximum lod score assuming homogeneity, defined as the maximum difference between the log likelihoods under linkage and non-linkage for any transmission model. This is maximised over one parameter, the heterozygote penetrance.

The maximum lod score assuming admixture, defined as the maximum difference between the log likelihoods under admixture and non-linkage for any transmission model. This is maximised over two parameters, the heterozygote penetrance and the proportion of linked families.

The "model-free" lod score, defined as the difference between the maximum log likelihood obtained under non-linkage for any transmission model and the maximum log likelihood obtained under admixture for any model. It has one degree of freedom, the proportion of linked families.

Please see the paper for some discussion of the interpretation of these results.

COMPILATION

The source code is C and consists of mflink.c, linkfile.c and linkfile.h. Various constants are defined in linkfile.h and mflink.c, but I doubt any will need changing except possibly MAXPEDS in mflink.c. This is the maximum number of pedigrees in the dataset and is currently set to 200, although the DOS executable was actually compiled with a value of 100.
I hope it should be straightforward to compile these files into an executable called mflink:

cc mflink.c linkfile.c -o mflink -lm

Let me know if there are any snags. I'll try to produce a proper makefile sometime, though it hardly needs one.

EXAMPLE FILES

Once you have provided a version of MLINK called NOSCORE by one of the means listed above and have if necessary compiled MFLINK itself then you should be able to run it on the three example files supplied called pedfile.dat, linked.dat and unlinked.dat. Just copy one of the locus datafiles to be called datafile.dat, run UNKNOWN and then run MFLINK with these files present in the current working directory and you should see it run through the likelihood calculations. The output file mflink.out will be created. On my system, these example files yield a maximum lod of 0.643, a maximum admixture lod (lod2) of 1.414, and a "model-free" lod score of 1.414. These results may vary somewhat depending on the version of LINKAGE or FASTLINK you have.

Note that although in this case the "model-free" lod score is the same as the admixture lod score maximised over transmission models the former is obtained with only one degree of freedom (alpha) whereas the latter incorporates two degrees of freedom (alpha and the heterozygote penetrance, which defines each transmission model). Thus in this example the "model-free" lod score might be taken to provide (a little) more evidence in favour of linkage than the admixture lod maximised over models.

X-LINKAGE

MFLINK handles X-linked datafiles by following the procedure above to define each female transmission model and then taking the male penetrances simply to be equal to the female homozygote penetrances. This would produce non-constant values for the overall male prevalence, but hopefully this would not be a serious problem in practice.
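To see why the male prevalence is not constant, the following Python sketch compares a fully recessive and a fully dominant model. It is an illustration only, not MFLINK code: the function names are invented, Hardy-Weinberg proportions are assumed, and for each model the allele frequency is chosen here so that the female prevalence comes out at an arbitrary value of 0.1.

```python
import math


def female_prevalence(q, pen):
    """Female prevalence; pen is (1/1, 1/2, 2/2), q = freq of allele 2."""
    p = 1.0 - q
    return p * p * pen[0] + 2.0 * p * q * pen[1] + q * q * pen[2]


def male_prevalence(q, pen):
    """Male prevalence when hemizygous males take the female homozygote
    penetrances, as described in the text above."""
    return (1.0 - q) * pen[0] + q * pen[2]


k = 0.1  # illustrative female prevalence (an ASSUMPTION for this example)

recessive = (0.0, 0.0, 1.0)
q_rec = math.sqrt(k)              # solves q*q = k

dominant = (0.0, 1.0, 1.0)
q_dom = 1.0 - math.sqrt(1.0 - k)  # solves 1 - (1 - q)**2 = k

for name, q, pen in (("recessive", q_rec, recessive),
                     ("dominant", q_dom, dominant)):
    print(name, round(female_prevalence(q, pen), 4),
          round(male_prevalence(q, pen), 4))
```

With the female prevalence held at 0.1 in both models, the male prevalence works out at roughly 0.32 for the recessive model and roughly 0.05 for the dominant one, which is the kind of variation referred to above.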
USING MFLINK WITH VITESSE

As of version 1.2, MFLINK has been set up to work correctly with VITESSE, the rapid linkage analysis program developed by Jeff O'Connell and Dan Weeks. To invoke MFLINK to use VITESSE rather than MLINK, use the -v switch on the command line:

mflink -v

This causes mflink to run with two minor differences. Firstly, it will read the file voutfile.dat which VITESSE produces rather than outfile.dat which MLINK produces. Secondly, it will run a program called VNOSCORE rather than NOSCORE, so you must rename VITESSE to be called VNOSCORE using one of the methods described above.

With the version I have of VITESSE, it isn't actually possible to recompile it to not calculate the likelihoods at theta=0.5. There is a constant called "lodscore" but setting this to FALSE doesn't have the desired effect and may have some undesirable ones, so VNOSCORE is in fact just the ordinary VITESSE program you would use anyway, but with a different name. However presumably it may be possible to recompile future versions of VITESSE so that the likelihood is only calculated for the recombination fractions specified.

When you are using MFLINK with VITESSE there is no need to copy either of the locus data files to be called datafile.dat or to run UNKNOWN first because VITESSE does not use the output from UNKNOWN.

Apart from these minor differences, MFLINK should run with VITESSE in the same way as it runs with MLINK.

CUSTOMISED MODELS

By default, the transmission models tested are as described above. Five dominant, five recessive and a null effect model are tested. However it is possible to specify a different testing procedure if desired. One way to do this is simply to alter the number of dominant and recessive models tested and this is done by specifying the -n switch on the command line, for example to evaluate 10 of each instead of 5 one would enter:

mflink -n10

This would result in a finer search. Note that there is no space preceding the number.
A more flexible way to specify which models are tested is to create a file called grid.dat which is placed in the same directory as the other data files. If MFLINK finds such a file, then it will automatically read it and evaluate the sets of models specified. Each line in grid.dat specifies a set of models according to the following format:

s0,s1,s2 e0,e1,e2 n

Here s0,s1,s2 consists of the penetrance values for the starting model and e0,e1,e2 the penetrance values for the end model, and n models are tested "equally spaced" between these models. The n models tested include the starting model but do not include the finishing model. To (hopefully) clarify this a bit, here's how grid.dat would look to specify the default analyses for a disease with population prevalence of 0.1:

0,0,1 0.1,0.1,0.1 5
0.1,0.1,0.1 0.1,0.1,0.1 1
0,1,1 0.1,0.1,0.1 5

The first line specifies that one starts with a Mendelian recessive model and then also evaluates another 4 (making 5 in total) models "towards" but not including the model of null effect. The second line specifies that the null effect model itself be evaluated since the starting and ending points both define this model - the starting point will be evaluated and the ending point won't. The third line specifies that 5 dominant models be evaluated, again not including the null effect model.

Now suppose that one were worried that one might miss something by not evaluating codominant models. Then one might wish to create the following grid.dat file:

0,0,1 0.1,0.1,0.1 5
0.1,0.1,0.1 0.1,0.1,0.1 1
0,1,1 0.1,0.1,0.1 5
0,0.5,1 0.1,0.1,0.1 5

The additional final line means that a range of codominant models will be evaluated, again beginning with a model having complete homozygote penetrance and no phenocopies, and moving towards, but not including, the model where the locus has no effect on risk.
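To make the grid.dat format a little more concrete, here is a Python sketch which reads lines in the format above and lists the penetrance triples each one would generate. It is only an illustration and not how MFLINK itself is written: the function names are invented, and "equally spaced" is interpreted here as straight linear interpolation of the three penetrances, which is an assumption. The sketch deals with penetrances only and does not attempt to reproduce how MFLINK completes each model so as to match the prevalence.

```python
def parse_grid_line(line):
    """Split one grid.dat line 's0,s1,s2 e0,e1,e2 n' into its parts."""
    start_text, end_text, n_text = line.split()
    start = tuple(float(x) for x in start_text.split(","))
    end = tuple(float(x) for x in end_text.split(","))
    return start, end, int(n_text)


def expand_models(start, end, n):
    """Return n penetrance triples from start towards (not including) end.

    ASSUMPTION: linear interpolation; MFLINK's own spacing may differ.
    """
    models = []
    for k in range(n):
        t = k / n  # k = 0 gives the starting model itself
        models.append(tuple(s + t * (e - s) for s, e in zip(start, end)))
    return models


default_grid = [
    "0,0,1 0.1,0.1,0.1 5",
    "0.1,0.1,0.1 0.1,0.1,0.1 1",
    "0,1,1 0.1,0.1,0.1 5",
]

for line in default_grid:
    for model in expand_models(*parse_grid_line(line)):
        print(" ".join(f"{x:.3f}" for x in model))
```

Run on the three default lines this lists 11 models in all: five starting from the recessive model, the null effect model once, and five starting from the dominant model.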
Other reasons for creating a grid.dat file rather than just using the defaults might include wishing to restrict the analysis to only a limited set of plausible models, for example only dominant models or only models with a certain maximum penetrance. Alternatively one might wish to specify a wide variety of models in order to more fully cover the parameter space. However, as we discuss in our paper, we doubt that restricting the search to the default dominant and recessive models is likely to incur much risk of failing to generate a positive result. Please note that if you do search over a wider range of models then the number of degrees of freedom incorporated in producing the maximised lod scores may be increased, since the search is no longer one-dimensional. However the "model-free" lod score itself would still only incorporate one degree of freedom.

ABOUT THE ZIP FILE

The zip file containing the MFLINK distribution was created under MSDOS. This has two implications: all the file names are stored in upper case, and the text files (the source and example files) have a carriage return and linefeed character at the end of each line rather than just having a linefeed. If you are unzipping the archive on a Unix system you will probably find it more convenient to convert the filenames to lower case - on my system this is done by running unzip with the -L switch. You may also wish to convert the text files by stripping out the extraneous carriage return characters, though leaving them in may not have any ill effects - on my system this is done by running unzip with the -aa switch.

I'll try to keep up-to-date copies of the MFLINK package at John Attwood's ftp site, ftp.gene.ucl.ac.uk, in /pub/packages/dcurtis. Sometimes there may be a slightly more recent version available via my homepage, by following the "software" link.

Please feel very free to provide feedback - the software is in active development and I'm keen to see how it can be improved.
Dave Curtis - dcurtis@hgmp.mrc.ac.uk
http://www.gene.ucl.ac.uk/~dcurtis/
Dept Psychological Medicine, Institute of Psychiatry,
De Crespigny Park, London SE5 8AF, UK.
+44 171 919 3536