Likelihood - Based Disequilibrium Analysis Programs (2/21/96) ---------------------------------------------------------------------- UPDATE - 21 February 1996: Support interval calculations added for lambda and theta in dismult. Maximization routines improved in DISMULT and DISLAMB Haplotype Relative Risk extensions included to read LINKAGE files. I wish to acknowledge some assistance in detecting a bug in the maximization routine discovered by Michael Knapp, and some programming advice from Alejandro Shaffer. ---------------------------------------------------------------------- The algorithms used in these programs are described in detail in the following paper (which should be cited whenever these programs are used): Terwilliger, JD "A Powerful Likelihood Method for the Analysis of Linkage Disequilibrium Between Trait Loci and One or More Polymorphic Marker Loci" American Journal of Human Genetics 56:777-787 (1995). Terwilliger, JD "Response to Sham et al." Am J Hum Genet (1996) (In Press). In this document I will only discuss technical issues regarding their use. Versions are included for either DEC machines or SUN machines. Where other UNIX versions differ, you may need to modify some of the file handling routines on your own. Compiled code is provided with statically linked libraries in case you have no Pascal compiler on yout system for OSF, SUNOS, and SOLARIS. In SUN Pascal, you should compile as follows: pc -L -O4 -o dislamb dislamb.p pc -L -O4 -o dismult dismult.p and for DEC Pascal versions, under OSF it is recommended to compile with pc -O4 -C all -o dislamb dislamb.p pc -O4 -C all -o dismult dismult.p And under VMS, just compile and link with pas dislamb/nocheck link dislamb/notraceback pas dismult/nocheck link dismult/notraceback Then everything should work. If you still have trouble, contact me by sending EMAIL to joe@well.ox.ac.uk DISLAMB is to be used for single locus association analysis. Just enter the data interactively - if the instructions are confusing let me know and I will try to make it more clear. Consider the following dataset, using an assumption that the disease allele frequency is 0.001 (so that case sample provides little information for estimation of population allele frequencies), and using the nonparametric model (this is in general recommended...). This locus has 13 alleles, as follows: CASE 10 48 12 1 3 17 2 5 1 2 3 4 1 CONTROL 8 8 10 3 3 15 2 7 3 4 2 5 2 Since the result was significant, I allowed it to estimate the recombination fraction between marker and disease under different assumptions about alpha and n - note that these estimates are very much dependent on your assumptions, and that your population is in equilibrium, so do not place to much faith in these estimates of map distance, as mentioned in the paper - support intervals are given (normally I would use the same stringency as the test in constructing support intervals for lambda - typically 3-lod-units. It is important to note that the support intervals for theta given do not take into account the user- input age of mutation in this population. To construct a support interval allowing for this, you must separately compute support intervals for each combination of age and heterogeneity parameter you wish to admit. ************************************************************* * * * Program DISLAMB - Version 2.1 (2/3/96) * * * * AJHG 56:777-787 (1995) * * * ************************************************************* Disease allele frequency = 0.00100000 ========================================================================================= CASE | 10. | 48. | 12. | 1. | 3. | 17. | 2. | 5. | 1. | 2. | 3. | 4. | 1. | CONTROL | 8. | 8. | 10. | 3. | 3. | 15. | 2. | 7. | 3. | 4. | 2. | 5. | 2. | ========================================================================================= Estimated parameters for likelihood ratio test: Allele frequencies: Allele H0: H1 1 0.09944751 0.12594046 2 0.30939227 0.12355539 3 0.12154696 0.15425412 4 0.02209945 0.02805155 5 0.03314917 0.04207248 6 0.17679558 0.22436222 7 0.02209945 0.02804928 8 0.06629834 0.08413984 9 0.02209945 0.02804972 10 0.03314917 0.04207317 11 0.02762431 0.03530303 12 0.04972376 0.06311307 13 0.01657459 0.02103565 Lambda 0.00000000 0.361496 LRT Chi-Square = 19.87098 p-value = 0.000004200154388 Lambda = 0.361496 SIGNIFICANT EVIDENCE OF LINKAGE DISEQUILIBRIUM BY LRT TEST 2 x n table Chi-square = 26.27953 P-value = 0.009797348469226 NO SIGNIFICANT EVIDENCE OF LINKAGE DISEQUILIBRIUM BY 2 x N TABLE CHI-SQUARE TEST 3-lod-unit support interval for lambda-hat : (0.09,0.58) Alpha = 0.70 N = 100. Theta-hat = 0.006587 Support Interval = (0.001853,0.020142) In this output file, you first get the results for the likelihood-based association test, which in this case is significant, with a p-value of 0.0000042 (significant because p < 0.0001). The estimate of lambda is 0.3615, which is evident of a not very strong association, even though the p-value is quite low. This is an important distinction, since the p-value tells you whether or not there is an association, while lambda serves as a measure of how strong the association is! The program also performd a more standard chi-square test of independence on the 2 x 13 table one could use to compare case and control allele frequency distributions. Here, that is not significant, having a p-value of only 0.0098 (it is assumed here that you have not proven the existence of a gene through linkage, which normally requires a p-value of 0.0001 (= Lod Score of 3), so you should have the same stringency in your association test - if not more, since there is more potential to have a multiple testing problem with association studies...). Afterwards, you can see how the value of lambda = 0.3615 can be translated into recombination fraction under certain assumptions about the age of this mutation in this population, and its degree of homogeneity. DISMULT does the multipoint analysis of association. The algorithms are explained in the aforementioned paper, and to use it requires an input file called diseqin.dat, which is described below. Again it is not recommended to use the genotype data, but rather allele data, since it is more robust and straightforward. If you have detailed questions, send me a message. Program DISMULT does an association analysis for multiple loci jointly, assuming a fixed map of the marker loci in either cM or kb (with some assumption about the relationship between cM and kb, since disequilibrium is a function of genetic, not physical distance. The input file should be called DISEQIN.DAT 1) On the first line, enter the Disease Allele frequency (if this is smaller than 0.01, it makes little difference what it is as it is used simply to estimate marker allele frequencies). 2) On the next line enter the number of markers 3) on the next line, enter your assumption about the number of centiMorgans/Megabase (assuming the distances are to be entered in kilobases. If you are entering the data in centiMorgans, then the value of this variable should be 1000 (since the distances are assumed to be entered in kb...) If this number is too small, n will be very large, and if this is too large, the estimates of n will be very small. 4) On one line Enter the distances between markers in kb (or in cM if you have entered 1000 as the conversion factor in the preceding line). D(1,2) D(2,3) D(3,4) ... D(n-1,n) 5) On the next line, enter the number of analysis points per intermarker interval (must be <= constant maxpos), and usually 5 is sufficient. This is analogous to number of evaluations in LINKMAP program of LINKAGE package. (It should be noted that in the outer intervals, 10 steps are taken over 10cM out from the map flanking markers) 6) Minimum Value for n - the decay parameter - usually 10 or 20 is best. If you find the maximum likelihood at a boundary value of n, try either lowering it, or making the markers more tightly linked (i.e. alter the assumption about cM/Mb). Note that for points outside the map, this will typically be at or near the boundary, since disequilibrium is found a large distance away, i.e. at (presumably) the first marker locus or later... 7) Maximum Value for n - the decay parameter - usually 250 - 500 is fine. If you hit a boundary value, raise this no. or make the markers more widely spaced... When this value is high, there is a rapid decay in the association as you move small distances from the point you are evaluating the likelihood at. 8) For each marker enter the data in the following format for allele data NUMBER OF ALLELES AT MARKER CASE(1) CONTROL(1) CASE(2) CONTROL(2) CASE(3) CONTROL(3) ... 9) Name the file DISEQIN.DAT, save it, and run the program DISMULT, Output is in a file called diseq.out Example DISEQIN.DAT 0.022 Disease allele frequency 3 3 Markers 1 cM/Mb 500 120 Physical Distance between Markers in Kb 5 5 Points per Interval 20 Minimum n 800 Maximum n 2 Number of alleles at locus 1 48 28 Case(1) Control(1) 58 59 3 Locus 2 70 10 6 44 10 40 5 Locus 3 73 53 69 35 1 10 2 14 3 31 Output file DISEQ.OUT from the above problem: To make a graph just plot the Map position against the "Lod Score" or -2ln(LR) which is interpreted conservatively as 0.5* Chi-square(1df). Quantities like alpha and n are the vertical and horizontal control parameters described in the paper, and the interval and position are given so that you can see where each marker is (i.e. everytime you have position 0, you have a new marker locus, the number of which is given in the Interval column). ************************************************************* * Program DISMULT - Version 2.1 (2/5/96) * * AJHG 56:777-787 (1995) * * Multilocus Linkage Disequilibrium Analysis * ************************************************************* LOCUS 1 ==============|=====|=====| ALLELE: | 1 | 2 | --------------|-----|-----| CASE: | 48 | 58 | CONTROL: | 28 | 59 | ==============|=====|=====| LOCUS 2 ==============|=====|=====|=====| ALLELE: | 1 | 2 | 3 | --------------|-----|-----|-----| CASE: | 70 | 6 | 10 | CONTROL: | 10 | 44 | 40 | ==============|=====|=====|=====| LOCUS 3 ==============|=====|=====|=====|=====|=====| ALLELE: | 1 | 2 | 3 | 4 | 5 | --------------|-----|-----|-----|-----|-----| CASE: | 73 | 69 | 1 | 2 | 3 | CONTROL: | 53 | 35 | 10 | 14 | 31 | ==============|=====|=====|=====|=====|=====| Alpha N Inter. Position Map Position Lod Score -2ln(LR) 1.000000 390. 0 0 -4.25866 0.00000 0.00000 1.000000 20. 0 1 -0.10034 6.07393 27.97147 1.000000 20. 0 2 -0.08935 7.45386 34.32631 1.000000 20. 0 3 -0.07859 8.97969 41.35302 1.000000 20. 0 4 -0.06807 10.57762 48.71173 1.000000 20. 0 5 -0.05776 12.08476 55.65239 1.000000 20. 0 6 -0.04766 13.16856 60.64346 0.916548 20. 0 7 -0.03775 13.36366 61.54191 0.760124 20. 0 8 -0.02804 13.35679 61.51031 0.631592 20. 0 9 -0.01852 13.35003 61.47918 0.525249 20. 0 10 -0.00917 13.34339 61.44857 0.437886 20. 1 0 0.00000 13.33684 61.41844 0.439703 20. 1 1 0.00100 13.65798 62.89733 0.445691 28. 1 2 0.00199 13.95389 64.26002 0.999995 314. 1 3 0.00299 18.45487 84.98783 0.999998 426. 1 4 0.00400 22.16551 102.07595 0.782908 800. 1 5 0.00500 23.83323 109.75606 0.782976 800. 2 0 0.00500 23.83322 109.75605 0.883834 800. 2 1 0.00524 22.65050 104.30941 0.959204 800. 2 2 0.00548 18.32883 84.40738 0.521664 142. 2 3 0.00572 15.80475 72.78358 0.485274 89. 2 4 0.00596 15.11418 69.60335 0.452230 59. 2 5 0.00620 14.68567 67.63001 0.452230 59. 3 0 0.00620 14.68567 67.63001 0.821114 62. 3 1 0.01537 14.69946 67.69351 0.999994 44. 3 2 0.02472 14.63657 67.40389 0.999995 30. 3 3 0.03424 14.50510 66.79845 0.999998 23. 3 4 0.04395 14.40457 66.33549 1.000000 20. 3 5 0.05386 14.17518 65.27912 1.000000 20. 3 6 0.06396 13.02926 60.00196 1.000000 20. 3 7 0.07427 11.41072 52.54832 1.000000 20. 3 8 0.08479 9.69005 44.62431 1.000000 20. 3 9 0.09555 8.04766 37.06086 1.000000 20. 3 10 0.10654 6.56494 30.23266