Changes: 2/14/96 - Added support for sexlinked as well as autosomal data Corrected minor bug in hrrlamb and hrrmult maximization routine (pointed out by Michael Knapp - cf Terwilliger(1996a) ). Improved formatting for output files. Executables made available with statically linked libraries (for those without pascal compilers) Improved stability Documentation for ANALYZE program. This program set simplifies the computation of various statistics to test for linkage and/or linkage disequilibrium between a disease locus and a series of marker loci in a 2-point fashion. It is necessary for the user to have his favorite version of MLINK, UNKNOWN, and LSP from the LINKAGE package (Lathrop et al, 1984) or FASTLINK (Cottingham et al, 1993) (installed on the machine for these programs to work. This collection of programs simplifies the setup and analysis of human pedigree data, and outputs the results of linkage analysis, sibpair analysis, haplotype relative risk analysis, TDT analysis, and some other variations on the above themes in a detailed form, and a summary table, including some simple multilocus disequilibrium and linkage based statistics. How to use the program: INPUT FILES: pedin.dat - LINKAGE format pedigree file (processed by MAKEPED) with the disease locus first, followed by a series of "allele numbers" marker loci in chromosomal order (if they are not in chromosomal order, they can be put in the right order using the "extract" program included with this package). datain.dat - LINKAGE format parameter file (generated with PREPLINK) in either MLINK, ILINK, or LINKMAP format (LODSCORE format is not acceptable). Loci must be in chromosomal order (with the trait locus first) and must all be of the allele numbers locus type. Correct intermarker recombination fractions should be correctly indicated on the next to last line of the file (this can all be done in PREPLINK). Be sure to specify the presumed mode of inheritance in the parameter file for the parametric analyses to be conducted. OUTPUT FILES: analyze.final - Detailed output from each analytical method - discussed in detail below. summary.out - Summarized table of all essential statistics for each locus in chromosomal order with map distances indicated in the first column - discussed in detail later. OTHER PROGRAMS NEEDED: MLINK, UNKNOWN, LSP - You must have your favorite version of MLINK, UNKNOWN, and LSP installed on your computer somewhere in the path. You can use either LINKAGE (Lathrop et al, 1984) or FASTLINK (Cottingham et al, 1993) in any version you have available. INSTRUCTIONS: From the directory with the pedin.dat and datain.dat files, just type analysis to run the program in foreground, or type nohup analysis > analysis.log & to run the program in background - be sure all programs are in the path (i.e. in your bin directory, for example). Interpretation of the Output Files: analyze.final This file contains the detailed statistical analysis of your disease and marker loci with the following sections: FOR EACH MARKER LOCUS 1) Lod Score Analysis with Heterogeneity 2) TDT Analysis 3) Haplotype Relative Risk Analysis 4) Sibpair Analysis 5) Lod Score Analysis on Nuclear Pedigrees With Heterogeneity THEN, MULTILOCUS ANALYSES 6) Polylocus Lod Score Analysis 7) Multilocus Haplotype Relative Risk Analysis summary.out This file contains a tabular summary of the statistical analysis using each of the above methods, but in a much condensed and simplified format for quick examination and summary of the results of each test. Interpretation of Sample Dataset There is a sample dataset in the 2-point/sample/autosomal directory created when you uncompress the tar file containing the program code. There is a pedigree file called pedin.dat and a parameter file called datain.dat. To run the analysis you can simply type analysis and the programs will run (assuming you have installed them correctly and put them in the path on your machine). Let us look at the main output file, analyze.final, and see what each section means. Just as in the final.out file you get when running LINKAGE through LCP (Lathrop et al, 1984), the lsp.log file is attached to let you know which locus you are currently looking at. ******************************************************************************** MLINK Pedigree File : pedin.tpd Parameter File : datain.tdt Output Pedigree File : pedfile.dat Output Parameter File : datafile.dat Log File : lsp.log Stream File : lsp.stm Date Run : 17-Oct-95 16:50:21 Sex Difference : 0 Recomb. Fraction to Vary : 1 Increment Value : 0.02000000 Stop Value : 0.48000000 Locus Order : 1 2 Male Recomb. Fractions : 0.00000000 ******************************************************************************** The only important information here is the locus order - this tells you that currently you are examining locus 1 (the trait) against locus 2 in your input files. If you are using the most recent version of LSP, and have included locus names in the datain.dat file, they will be found on this line as well. The other information is irrelevant. It is important to note that there is a preprocessing program run which renumbers all pedigrees sequentially and all individuals within pedigrees sequentially as well. Therefore, if UNKNOWN should complain about an inconsistency, for example, the pedigree number given will be the sequential pedigree number - from the renumbered file, pedin.tpd. This has no effect on the results, but if you are trying to debug your data, it is important to bear this in mind. The reason this is done is to simplify the programming and memory requirements for the analysis programs. 1) LOD SCORE ANALYSIS WITH HETEROGENEITY Output of analysis with HOMOG For significance testing use the following standard critical values: HOMOGENEITY: Z(theta) > 3 HETEROGENEITY: Z(alpha,theta) > 3.3 Under homogeneity - Maximum Lod Score = 0.909022 Theta = 0.380 Under heterogeneity - Maximum lod score = 3.849152 Theta = 0.100 Proportion of Linked Families, Alpha = 0.350 Under Heterogeneity there is significant evidence of linkage. 3.3-lod-unit support interval for alpha and theta is as follows: Alpha: ( 0.01, 1.00) Theta: ( 0.00, 0.44) This part of the file summarizes the essential linkage information obtained from analysis using the parametric model defined in your datain.dat for the trait locus. The analysis is done using MLINK (Lathrop et al, 1984), letting the recombination fraction vary from 0 to 0.5 in steps of 0.02, so here we see that the maximum lod score occurs at recombination fraction 0.38, with said lod score being 0.91. Then, a heterogeneity analysis is performed using the HOMOG program (Ott, 1991), allowing alpha (the proportion of linked families in the dataset) to vary between 0 and 1 in steps of 0.01. In this case, we can see that the maximum lod score has risen to 3.85 when allowed for the presence of some unlinked families in our dataset. At the top of this section it explains that the conventional criteria have been used to interpret the significance of these lod scores - under homogeneity a lod score of 3 (1000:1 likelihood ratio) is taken as significant evidence of linkage (Morton, 1955; Bailey, 1961; Chotai, 1984), while if this is not significant, it has been suggested (Ott, 1991) that a lod score of 3.3 (likelihood ratio of 2000:1) be used as a critical value to compensate for the added free parameter, alpha. Note that this is by convention, and in reality the distribution of this latter statistic is very complicated (cf. Davies, 1977; Faraway et al, 1993). Because the critical limit for our test has been surpassed, I have included the support interval for our estimated values of alpha and theta - here I used a 3.3 unit support interval to be consistent with the test statistic critical value (cf. Terwilliger and Ott, 1994). Looking at the same section for the second marker locus (locus 3), we see the following: Output of analysis with HOMOG For significance testing use the following standard critical values: HOMOGENEITY: Z(theta) > 3 HETEROGENEITY: Z(alpha,theta) > 3.3 Under homogeneity - Maximum Lod Score = 5.582551 Theta = 0.280 Under homogeneity there is significant evidence of linkage. Test for heterogeneity GIVEN Linkage Chi-square for heterogeneity = 33.510799 Theta = 0.080 Alpha = 0.390 Significant evidence of heterogeneity at p < 0.0001 level! In this example, there was significant evidence of linkage under the assumption of homogeneity - the lod score being 5.58 at recombination fraction 0.28. In this situation, there was no need to look at the lod score test for linkage and homogeneity jointly, since we have already demonstrated that there is linkage. In this situation, a test of heterogeneity given linkage is performed with HOMOG (Smith, 1963; Ott, 1991), and the statistic presented is 2ln(L(theta,alpha)/L(theta,alpha=1)) which is asymptotically distributed as a chi-square statistic with 1 degree of freedom - the test is further one-sided, since the null hypothesis, alpha = 1, is compared with the one-sided alternative alpha < 1. In this case, there was significant evidence of heterogeneity at the 0.0001 level (estimated theta = 0.08, and alpha = 0.39). 2) TDT Analysis ******************************************************** ***** ***** ***** You are using TDTLIKE - Alpha Test Version ***** ***** for computing TDT-like likelihood ratio ***** ***** statistics based on an algorithm of ***** ***** J. Terwilliger (AJHG 56:777-787 (1995)) ***** ***** ***** ******************************************************** Locus 1 Alleles which appear at least 5 times shown. Multiple test corrected ORIG # CASE CONTROL TDT One-Sided P-Value 1 209 129 18.9349117279 0.0000397356 2 46 83 10.6124029160 1.0000000000 3 76 129 13.7024393082 1.0000000000 4 91 93 0.0217391308 0.9880542409 5 87 75 0.8888888955 0.6593401134 Multiallelic Statistic - Based on Terwilliger (AJHG - March 1995) Maximum Likelihood Estimate of TDT Lambda = 0.62000 -2ln(L) difference = 15.8929973831 P-Value = 0.0000338507 This is the summary of results of analysis with the TDT (Spielman et al, 1993). This test looks at all affected offspring of heterozygous parents for a given allele - it compares the frequency with which they transmit said allele to their affected children with the frequency with which they transmit the other allele to their affected children. In other words, if you have a sample affected kids, with a parent with genotype 1/?, and there are X affecteds who received the 1 allele from these heterozygous parents, and Y affecteds who got the other allele (which may be different in each case), the TDT test statistic is of the form (X-Y)^2/(X+Y), which is asymptotically distributed as a chi-square statistic with one degree of freedom - again, this test is one-sided because we are only interested in cases where a specific allele is transmitted more frequently than expected to the affected offspring(Ewens and Spielman, 1993). Some thought might be given to what is actually being tested with the TDT. In the absence of linkage, it is random which allele segregates to any given affected individual from a heterozygous 1/? parent, so under the hypothesis of no linkage, clearly X = Y, and the test is valid, since all meiotic events are independent in this case - even within a pedigree. If you have a sample of unrelated affected kids, and their heterozygous parents, in the absence of allelic association, it is random which allele is "in phase" with the disease allele in these heterozygous parents, so under this hypothesis as well, all observations are independent (even if there were linkage), and it is a valid test of the hypothesis of no allelic association. However, in extended pedigrees, if there is linkage, the different affected kids are not independent relative to the hypothesis of no allelic association, and the test is not a valid one for this null hypothesis. To see this, consider the situation in which you have a set of sibpairs, and there is 0 recombination between marker and disease, and a fully penetrant recessive disease - then all affected sibs would have the identical marker alleles received from the heterozygous parents. Then, effectively you are counting the same parental alleles twice. The effect would be the same as if you had a standard case-control association test, and decided to double the number of counts of each allele in case and control samples - clearly one would never conclude that this was a valid test procedure, and this is the same effect as you have here when there is linkage, so to say that a positive TDT in multiplex families is evidence of association is a misnomer - the only null hypothesis you can reject validly with this test is that of no linkage, especially in a small set of large pedigrees (Ott, 1989; Ewens and Spielman, 1995; Terwilliger, 1996b) In the statistical analysis outlined above, I have not used the chi-square approximation to compute the p-values, but rather, since sample sizes are sometimes a bit too small for the chi-square approximation to hold, I have computed exact p-values from a binomial distribution, looking at the probability of having the allele under study transmitted X or more times to the affected offspring of heterozygous parents out of (X+Y) opportunities. Then, I have corrected for the fact that multiple tests have been performed (if there are 5 alleles at a locus, there are approximately 4 "independent" tests - therefore the p-value presented is the probability of such an extreme result occurring for one of the n alleles at a given locus, rather than assuming you had only tested the one single allele. Only alleles which have had at least 5 opportunities to be transmitted are included, to reduce in an unbiased way, the number of necessary tests one should then correct for. This can be made higher or lower by altering a program constant, min, in the file tdtlikena.p and then recompiling the program. At the bottom, I have also performed a likelihood ratio based TDT test (Terwilliger, 1995;1996b) which considers all the alleles jointly and allows that one of them would be transmitted preferentially to affected offspring, and not the others. The test is parametrized such that the probability of transmitting said allele is equal to lambda, which is then estimated by maximum likelihood, where lambda is constrained to be > 0.5, lambda = 0.5 being the null hypothesis in this nonparametric linkage test. 2ln(L(lambda)/L(lambda=0.5)) is assumed to be asymptotically one-sided chi-square with one df. In this case, there is significant evidence of linkage, with allele 1 being preferentially transmitted to affected offspring of 1/? parents, with a p-value of 0.00004 in the standard multiple test corrected TDT, and using my likelihood ratio test, the p-value is 0.00003, though this test does not provide any determination of which allele is actually preferentially transmitted, but since this TDT is performed using all affected individuals in multiplex pedigrees, no conclusions about allelic association can be made, so this information is really somewhat irrelevant in this test. 3) Haplotype Relative Risk Analysis ************************************************************* * * * Program HRRLAMB - Version 2.1 (1/31/96) * * * * AJHG 56:777-787 (1995) * * * ************************************************************* Disease allele frequency = 0.01000000 ========================================= CASE | 43. | 1. | 13. | 12. | 11. | CONTROL | 27. | 10. | 14. | 18. | 11. | ========================================= Estimated parameters for likelihood ratio test: Allele frequencies: Allele H0: H1 1 0.43750000 0.34604554 2 0.06875000 0.08098217 3 0.16875000 0.19567004 4 0.18750000 0.21743885 5 0.13750000 0.15986340 Lambda 0.00000000 0.291021 LRT Chi-Square = 4.40863 p-value = 0.017888760783279 Lambda = 0.291021 NO SIGNIFICANT EVIDENCE OF LINKAGE DISEQUILIBRIUM BY LRT TEST 2 x n table Chi-square = 12.25782 P-value = 0.015533597133660 NO SIGNIFICANT EVIDENCE OF LINKAGE DISEQUILIBRIUM BY 2 x N TABLE CHI-SQUARE TEST The next set of output is for the haplotype relative risk test for allelic association (Rubinstein et al, 1981; Falk and Rubinstein, 1987; Terwilliger and Ott, 1992). This test is very much related to the TDT with two salient differences - first, only one affected individual is used per pedigree - so they are independent relative to the null hypothesis of no association; second, all alleles transmitted from parents (heterozygous or homozygous) to the affected child are included in the case sample, and all remaining (i.e. not transmitted) alleles in the parents of the affected children are included in the control sample (cf. Falk and Rubinstein, 1987; Ott, 1989; Terwilliger and Ott, 1992). In this case, the affected child selected in any given pedigree is chosen as the first affected individual who is himself typed at the marker locus and has both parents typed at the marker as well. If no affected child in a pedigree meets these criteria, then the program searches for the first child who is himself typed and has one parent typed at the marker locus - in this case there would be two alleles in the case sample ( the two alleles of the affected child) and one allele in the control sample (the not transmitted allele from the genotyped parent). If no affected child meets this criterion in a pedigree, then the first affected child who is typed himself is taken as the case sample, and no matching control is allowed for - the effect of this sampling procedure is that the case sample will be, in general, larger than the control sample, but it avoids the necessity of throwing away information about linkage disequilibrium from independent unrelated affecteds, just because their parents are not available for testing and thus no controls are available. The number of times each allele appears in the case and control samples is summarized in a brief 2 x n table, and two tests are performed of the null hypothesis of no association. First, a likelihood ratio test is performed (Terwilliger, 1995) which implies a linkage disequilibrium model with one associated allele only, where that allele has frequency p + lambda(1-p) in the case sample, where p is the population frequency of that allele, and lambda is a measure of the strength of association. Under the null hypothesis, lambda = 0, and the probability of this allele is the same in case and control samples. If lambda = 1, there is complete association, and the frequency of this allele = 1 in the case sample. Lambda is estimated by maximum likelihood with this program, and the p-value is indicated under the overly conservative assumption that the statistic follows a one-sided chi-square distribution with 1 df. The second test performed is that standard 2 x n table chi-square test, which has n-1 df (where there are n alleles whose frequencies are compared in case and control samples). This test is less powerful by far when there is only one associated allele (i,e, linkage disequilibrium from a founder effect), but can be more sensitive when there are higher order associations with different alleles. The analysis package was written with the assumption that the analyses are part of a genome screening effort to identify new disease loci - for that reason when the p-values are larger than 0.0001 the program says the result is not significant (association tests must be conducted at AT LEAST the same stringency as linkage tests when the goal is to prove the existence of a new disease locus - the reason for this is that under the null hypothesis of no association, there is no correlation between the results of association tests at even tightly linked markers, while for linkage they are strongly correlated, so multiple testing is even a more important factor when association tests are being performed). If you have proven through linkage that a locus is linked to a marker, and you want to test for association given linkage, you may feel free to reduce the stringency of the test as you see fit - remember that p-values are similar to the odds at the racetrack - they give you an idea of how to gamble your resources - if you are satisfied with a p-value of 0.05 to invest your resources in molecular search for a disease gene, feel free to go ahead - but realize there is quite a large risk of false positive results - especially in association tests (Hill and Weir, 1995; Terwilliger, 1996b). Note that for the first marker locus there was a strongly positive TDT result (rejecting the hypothesis of no linkage) but no significant evidence of allelic association - this is not inconsistent, remember, since the TDT in multiplex pedigrees is not, itself, a test of association (Ewens and Spielman, 1995; Terwilliger, 1996b)! Let us look briefly at the second marker locus output as well: ******************************************************** ***** ***** ***** You are using TDTLIKE - Alpha Test Version ***** ***** for computing TDT-like likelihood ratio ***** ***** statistics based on an algorithm of ***** ***** J. Terwilliger (AJHG 56:777-787 (1995)) ***** ***** ***** ******************************************************** Locus 1 Alleles which appear at least 5 times shown. Multiple test corrected ORIG # CASE CONTROL TDT One-Sided P-Value 1 271 71 116.9590606689 0.0000000000 2 81 169 30.9759998322 1.0000000000 3 172 284 27.5087718964 1.0000000000 Multiallelic Statistic - Based on Terwilliger (AJHG - March 1995) Maximum Likelihood Estimate of TDT Lambda = 0.79000 -2ln(L) difference = 122.5419672207 P-Value = 0.0000000000 ************************************************************* * * * Program HRRLAMB - Version 2.1 (1/31/96) * * * * AJHG 56:777-787 (1995) * * * ************************************************************* Disease allele frequency = 0.01000000 ============================= CASE | 39. | 9. | 32. | CONTROL | 1. | 23. | 56. | ============================= Estimated parameters for likelihood ratio test: Allele frequencies: Allele H0: H1 1 0.25000000 0.02719901 2 0.20000000 0.25945108 3 0.55000000 0.71334991 Lambda 0.00000000 0.474385 LRT Chi-Square = 50.69729 p-value = 0.000000000000561 Lambda = 0.474385 SIGNIFICANT EVIDENCE OF LINKAGE DISEQUILIBRIUM BY LRT TEST 2 x n table Chi-square = 48.77045 P-value = 0.000000000025682 SIGNIFICANT EVIDENCE OF LINKAGE DISEQUILIBRIUM BY 2 x N TABLE CHI-SQUARE TEST In this example, for the second marker locus, there was a strongly positive TDT result, and also a strongly positive HRR result - in fact, notice that allele 1 appears to be preferentially transmitted in the TDT test, and appears to be associated in the HRR test. When the HRR is significant (i.e. there is strong evidence of linkage disequilibrium) the TDT test should normally be significant as well, since it is difficult to imagine a model in which you would have allelic association, but not also linkage. The inverse is not true, though, as illustrated with the example of the first marker. Here, you also see the expected pattern that the LRT test has a pvalue that is 2 orders of magnitude smaller than the more general 2 x 3 table chi-square. This effect is much stronger as the number of alleles at the marker locus increases, when there is linkage disequilibrium caused by founder effect. 4) SIBPAIR ANALYSIS ********************************************************************* * Program SIBPAIR - Sibpair analysis on Nuclear Families * ********************************************************************* P-values 0 1 2 | SHARED NOT | 0 vs 2 Mean test | | NA/NA 6.0 15.0 7.0 | 34.0 34.0 | 0.390755713 1.000000000 NA/A 20.7 38.8 17.2 | 102.3 116.0 | 0.284681141 0.177501172 A/A 17.3 53.0 28.3 | 158.7 116.7 | 0.051785618 0.005694087 Overall Chisquare For 0-2 Sharer Comparisons = 2.97342 p-value = 0.1130575 Overall Chisquare For Mean Test Comparisons = 7.26225 p-value = 0.0132432 Affected Sib-Pair Mean Test P-value = 0.005694087 Next, the program performs a sibpair analysis (Penrose, 1935) on all sibpairs in the dataset. When there are multiple sibs in a sibship this program weights the sibpairs according to the sibship size in the following manner - 2 sibs = 1 sibpair, 3 sibs = 2 sibpairs, 4 sibs = 3 sibpairs. This weighting has been selected to conform with the information content of phase unknown nuclear pedigrees in linkage analysis. A detailed discussion of this can be found in (Blackwelder and Elston, 1985; Suarez and van Eerdewegh, 1984). The way this is done is to compute effectively the average number of alleles shared and not shared between all sibs and a given proband - where the average is taken over all affected sibs who could be considered the proband. Further, I allow for the possibility to look at unaffected sibs as well, as has been recommended by Eaves and Meyer (1994) and Risch and Zhang(1995) independently. To extend the same weighting logic to these unaffecteds as well, I consider that a sibship of size n has (n-1) effectively independent sibships, as outlined above - any of the affected sibs could be taken as the proband, for example. Then there are m-1 affected sibpairs, if there are m affected sibs. If there are k unaffected sibs, and any affected sib could be our choice of proband, there are k discordant NA/A sibpairs possible for comparison with any given proband. Then, I estimate the number of shared alleles by taking the average of the results from comparing the unaffecteds with each of the affecteds as proband in turn. Thus a sibship with m affecteds and k unaffecteds would contribute (m-1) affected sibpairs and k discordant affected/not affected sibpairs to the linkage test. The only time there will be NA/NA sibpairs used in the analysis is when there are only unaffected sibs in a given sibship - the rationale for this is that if we only have n-1 independent sibpairs in a sibship of size n, it is wise to apportion them in the most powerful manner - since NA/NA sibpairs would be the least informative for most complex diseases (i.e. they are unaffected possibly due to low penetrance, or lack of environmental exposure, etc.) the most powerful partition of the available data is to use only A/A and A/NA pairs when possible, since they give the most bang for the buck. Eaves and Meyer(1994) originally recommended looking at "very unaffected" individuals, say from the lower tail of some continuous distribution for the trait as the unaffected sample, and the extreme upper tail as the affected sample, with all other sibs being labelled as unknown at the trait locus - unknown sibs being discarded from the analysis. When parents are not genotyped at the marker locus, the program computes the likelihood of each possible genotype for the untyped parents, and computes the number of alleles shared identical by descent (IBD) in a sibpair as the average over all possible parental genotype combinations, weighted by their conditional probabilities given the known data. When a parent is homozygous, there is no information about linkage, so those data are left out, since we cannot tell whether the sibpair does or does not share an allele IBD from that parent. For this reason, one can see there is more data available when looking at number of alleles shared vs not shared, when compared with the number of sibpairs sharing 0, 1, or 2 alleles - if one parent is homozygous, you cannot tell how many alleles that sibpair shares IBD, and assuming that there is 50% chance of sharing the allele IBD and 50% chance of the allele being not IBD tends to dilute the power of the linkage test, and biases strongly the estimate of proportion of alleles shared IBD toward the null hypothesis value of 50%. The output in the table above gives the results of a number of plausible statistical tests - comparing first, the number of sibpairs sharing 2 alleles IBD [X] with the number sharing 0 IBD [Y] by the statistic [X-Y]^2/[X+Y] which is distributed as one-sided chi-square with one df. For concordant pairs, X > Y under the alternative hypothesis, and for discordant pairs X < Y under the alternative. Underneath, I have included a "overall chi-square" statistic for this test as well, which is effectively the sum of the chi-square statistics for A/A sibpairs and A/NA sibpairs (NA/NA sibpairs are left out since they are present in this analysis only when there are sibships where everyone is unaffected), and is assumed to follow a chi-square distribution with 2df. The other test, the mean test compares the number of alleles IBD among all sibpairs with the number of alleles not shared IBD among all sibpairs. Since homozygous parents provide no information, and with a marker of 70% heterozygosity, 42% of all nuclear pedigrees will have only one heterozygous parent and one homozygous parent, the amount of data for which we can identify whether at least one allele was shared IBD by a sibpair is much larger than the proportion of the dataset where we can determine how many alleles were shared IBD from both parents. For this reason, the mean test is generally more powerful in this implementation than the comparison described above - the statistics are of the same form, only now X is the number of alleles shared IBD, and Y is the number of alleles not shared IBD in the equations presented above. The last line of the section reiterates the most important piece of statistical information in general, the p-value for the chi-square test comparing the number of alleles shared and not shared among all affected sib-pairs in a dataset. In complex diseases where the "unaffecteds" have not been selected as the lower tail of some continuous distribution - (for example, they are unaffected even given presence of lots of risk factors, and have a phenotype that is more normal than normal - i.e. low blood pressure as unaffected if hypertension were affected) - there is little power in comparisons other than looking for increased sharing among affected sibs. In this example, the first marker locus showed no significant evidence of linkage - of course, in sibpair analysis you need to have the same stringency as in linkage analysis, p < 0.0001 to call a test result positive. The second marker locus showed much more significant evidence of linkage in the sibpair test, with a p-value of < 0.000000001 as shown below: ********************************************************************* * Program SIBPAIR - Sibpair analysis on Nuclear Families * ********************************************************************* P-values 0 1 2 | SHARED NOT | 0 vs 2 Mean test | | NA/NA 2.0 8.0 4.0 | 24.0 25.0 | 0.207109630 0.943191707 NA/A 22.2 39.0 12.8 | 103.0 128.0 | 0.057323437 0.049996715 A/A 6.7 43.3 43.0 | 201.7 89.3 | 0.000000129 0.000000000 Overall Chisquare For 0-2 Sharer Comparisons = 29.06830 p-value = 0.0000002 Overall Chisquare For Mean Test Comparisons = 46.06913 p-value = 0.0000000 Affected Sib-Pair Mean Test P-value = 0.000000000 5) LOD SCORE ANALYSIS ON NUCLEAR PEDIGREES WITH HETEROGENEITY Output of analysis with HOMOG For significance testing use the following standard critical values: HOMOGENEITY: Z(theta) > 3 HETEROGENEITY: Z(alpha,theta) > 3.3 Under homogeneity - Maximum Lod Score = 3.375684 Theta = 0.220 Under homogeneity there is significant evidence of linkage. Test for heterogeneity GIVEN Linkage Chi-square for heterogeneity = 0.011600 Theta = 0.220 Alpha = 0.970 No significant evidence of heterogeneity. In the case where you presume multiple genes are responsibile for a given disease phenotype, there may be intrafamilial heterogeneity as well as interfamilial heterogeneity, and thus the recombination fraction may appear to be different in different parts of a pedigree, as different disease loci may be segregating in different sibships - to control for this, we eliminate all extended pedigree phase information, and break large multiplex pedigrees into all component nuclear pedigrees in our subsequent linkage analysis - which assumes the same mode of inheritance specified in the datain.dat file. Interpretation of this output is identical to the standard lod score analysis - note that even under hmogeneity there is strong evidence of linkage in this analysis (while there was NO evidence of linkage when the extended pedigrees were analyzed above). This effect is due to elimination of potentially erroneous phase information which relies heavily on the assumed mode of inheritance for the disease locus in extended pedigrees. This analytical approach is intuitively related to affected sibpair strategies (Terwilliger and Ott, 1994; Ginns et al, 1995). 6) POLYLOCUS LOD SCORE ANALYSIS Locus Map Position 2 0.000000 3 0.005025 OUTPUT FROM PROGRAM POLYLOCUS Analysis with Polylocus method of Terwilliger and Ott Genetic Epidemiology 10(6):477-82 (1993) Primary Locus is now Locus Number 2: Polylocus Maximum Lod Score = 1.243432 Theta = 0.380000 Polylocus Maximum Lod Score with Heterogeneity = 5.639449 Theta = 0.060000 Alpha = 0.300 MULTIPOINT Polylocus Maximum Lod Score = 0.601381 Map Position = -0.458145 MULTIPOINT Polylocus Maximum Lod Score with Heterogeneity = 5.621749 Map Position = 0.057705 Alpha = 0.290000 This is the portion of the polylocus analysis output for which marker locus 2 was the primary locus to be analyzed. Polylocus linkage analysis is simply a standard parametric linkage analysis in which for pedigrees in which a given marker is untyped or uninformative, the next nearest marker is used in the analysis instead, to recover some of the lost linkage information without the known biases endemic to multipoint linkage analysis, or the computational demands true multipoint analysis imposes (Baron et al, 1987; Terwilliger and Ott, 1993). The heterogeneity test is performed on the lod scores as if the substitute marker(s) are at the same point as the primary marker locus of interest, using the same algorithm as HOMOG (Smith, 1963; Ott, 1991) described above. The Multipoint Polylocus Lod score method is based on the same principal except that the marker map is presumed known and some correction is made for it, following a recommendation of Schaid and Elston(1994), with some slight simplifications - here since one locus is used per pedigree and I assume the marker map is fixed and the Haldane mapping function applies, The lod scores are determined at certain "map positions" from each marker and then at each map position, the lod scores are added together to create the mathematical equivalent of standard multipoint lod scores when only one marker were used in any given pedigree. Interpretation is as for normal lod scores, with the caveat that for the regular polylocus lod scores, the "theta" does not have much meaning, since the markers used are at different points in the genome, perhaps as much as 10% recombination fraction away. Adding more meioses, up to this point still adds power to the analysis, but when you are more than 10cM away, it becomes counterproductive to add more markers, as if they were at the same point - in those situations, multipoint polylocus lod scores may be substantially more powerful. 7) Multilocus Haplotype Relative Risk Analysis ************************************************************* * Program HRRMULT - Version 2.1 (2/5/96) * * AJHG 56:777-787 (1995) * * Multilocus Haplotype Relative Risk Analysis * ************************************************************* Alpha N Inter. Position Map Position Lod Score -2ln(LR) 1.000000 485. 0 0 -4.25866 0.00000 0.00000 1.000000 20. 0 1 -0.10034 4.41145 20.31549 1.000000 20. 0 2 -0.08935 5.51186 25.38307 1.000000 20. 0 3 -0.07859 6.76540 31.15581 1.000000 20. 0 4 -0.06807 8.10525 37.32604 1.000000 20. 0 5 -0.05776 9.34492 43.03493 1.000000 20. 0 6 -0.04766 10.15622 46.77110 0.903380 20. 0 7 -0.03775 10.26342 47.26480 0.749131 20. 0 8 -0.02804 10.25928 47.24572 0.622254 20. 0 9 -0.01852 10.25520 47.22692 0.517765 20. 0 10 -0.00917 10.25120 47.20852 0.431452 20. 1 0 0.00000 10.24726 47.19038 0.431175 20. 1 1 0.00100 10.41172 47.94774 0.430277 20. 1 2 0.00200 10.56382 48.64819 0.999986 396. 1 3 0.00301 11.47395 52.83950 0.604671 239. 1 4 0.00402 11.61899 53.50741 0.474436 142. 1 5 0.00503 11.61904 53.50765 0.474436 142. 2 0 0.00503 11.61904 53.50765 0.999994 87. 2 1 0.01420 11.50964 53.00385 0.999997 46. 2 2 0.02355 11.23010 51.71653 0.999996 31. 2 3 0.03307 11.07330 50.99442 0.994983 23. 2 4 0.04278 10.97677 50.54991 1.000000 20. 2 5 0.05268 10.85818 50.00378 1.000000 20. 2 6 0.06278 10.08478 46.44213 1.000000 20. 2 7 0.07309 8.81444 40.59199 1.000000 20. 2 8 0.08362 7.38083 33.99000 1.000000 20. 2 9 0.09437 6.01230 27.68767 1.000000 20. 2 10 0.10536 4.80767 22.14015 The final analytical method employed by analysis is multilocus Haplotype Relative Risk analysis (Rubinstein et al, 1981; Falk and Rubinstein, 1987; Terwilliger and Ott, 1992), following an extension of the likelihood ratio test algorithm I used above (Terwilliger, 1995). In this method the transmitted and not transmitted alleles at each locus are analyzed jointly in a multilocus test for linkage disequilibrium assuming one allele per locus to be associated with the disease. The statistic presented is of the form of a likelihood ratio test based on some assumptions about the rate of decay of linkage disequilibrium as a function of map distance and age of the disease allele in the study population (cf. Terwilliger, 1995). The last column is the most relevant statistic, which is very conservatively interpretable as following a chi-square distribution with 1df in a one-sided manner. The previous column "Lod Score" is the same statistic on a common logarithmic scale, which may be more intuitive for those coming from a genetics background. This method has been used in a number of studies to find the most likely fine map location of disease genes based on linkage disequilibrium data (cf. Haberhausen et al, 1995; Nikali et al, 1995; Raha-Chowdhury et al, 1995). In this case there is strong evidence for linkage disequilibrium when both loci are considered jointly (p = 0.00000000000013), and the most likely location of the disease gene is near to the second locus (as was also pretty clear from the linkage results). OVERALL CONDENSED OUTPUT FILE summary.out Summary of statistical results from ANALYZE Autosomal Data: Loc x(cM) Z(t) Theta HetChi Z(a,t) Theta Alpha Sibpair TDT HRR-LRT HRR-2xn 2 0.00000 0.909 0.38 3.849 0.10 0.35 0.005694 0.000034 0.017889 0.015534 3 0.00502 5.583 0.28 33.511 0.08 0.39 0.000000 0.000000 0.000000 0.000000 Loc x(cM) N-Z(t) N-HetChi N-Z(a,t) P-Z(t) P-HetChi P-Z(a,t) MP-Z(t) MP-HetChi MP-Z(a,t) 2 0.00000 3.376 0.012 1.243 5.639 0.601 5.622 3 0.00502 8.841 0.337 5.583 33.511 5.505 33.607 Maximum Multilocus HRR = 53.50765 ~ Chi-Square(1) - One-sided Loc = Locus Number x(cM) = Map Distance from the first marker (Locus 2) in Haldane cM Z(t) = Maximum Lod Score HetChi = Chi-Square for Heterogeneity when Z(t) >= 3 Z(a,t) = Maximum Lod Score with Heterogeneity - when Z(t) < 3 Sibpair = p-value for Affected Sib Pair Mean Test TDT = p-value for Likelihood-Based TDT HRR-LRT = p-value for Likelihood-Based Haplotype Relative Risk HRR-2xn = p-value for 2 x n Contingency Table Chi-Square HRR test N-Z(t) = Maximum Lod Scores When Extended Pedigrees are Broken into Nuclear Pedigrees N-HetChi = Chi-square for Heterogeneity (Nuclear Pedigrees) N-Z(a,t) = Maximum Lod Score with Heterogeneity (Nuclear Pedigrees) P-Z(t) = Polylocus Maximum Lod Score P-HetChi = Chi-square for Heterogeneity (Polylocus) P-Z(a,t) = Maximum Polylocus Lod Score with Heterogeneity MP-Z(t) = Maximum Multipoint Polylocus Lod Score MP-HetChi = Chisquare for Heterogeneity (Multipoint Polylocus) MP-Z(a,t) = Maximum Multipoint Polylocus Lod Score with Heterogeneity The above summary file presents the salient statistical information for all marker loci from all of the aforementioned statistical tests. The first column gives the map position of each marker locus (with the first marker fixed at map position 0). Then the lod score, estimated recombination fraction, heterogeneity test statistic, estimated values of theta and alpha from the heterogeneity test - then the p-values from the nonparametric tests, sibpair (affected sib pair mean test), TDT (likelihood statistic), HRR (likelihood statistic, and 2 x n table chi-square statistic). The following table contains the remaining statistics, linkage on nuclear pedigrees + heterogeneity; polylocus linkage + heterogeneity; multipoint polylocus linkage _ heterogeneity. In each linkage analysis situation, when there was a lod score under homogeneity greater than 3, the chi-square statistic for heterogeneity given linkage is presented, and when there is not significant evidence of linkage the lod score with both theta and alpha as free parameters is given instead of the heterogeneity test since linkage has not been established yet. NONPARAMETRIC ONLY OPTION There is an alternative to the full analysis package which only performs the nonparametric analyses (and is therefore much faster than the full linkage analysis). To use this, follow the same instructions as described above, but instead of running the analysis program, use nonparametric instead. To run it in foreground - type nonparametric To run it in background - type nohup nonparametric > nonparametric.log & The output files are similar to those described above - and are called nonparam.final - equivalent to analyze.final nonparam.summ - equivalent to summary.out from analysis. All output in the nonparam.final is as described for analysis above, and the example output file nonparam.summ is shown below for this same dataset, which should be self- explanatory since the same data are shown in summary.out above: Summary of statistical results from NONPARAM Autosomal data: P-VALUES -------------------------------------------- Loc x(cM) Sibpair TDT HRR-LRT HRR-2xn 2 0.00000 0.005694 0.000034 0.017889 0.015534 3 0.00503 0.000000 0.000000 0.000000 0.000000 Maximum Multilocus HRR = 53.50765 ~ Chi-Square(1) - One-sided Loc = Locus Number x(cM) = Map Distance from the first marker (Locus 2) in Haldane cM Sibpair = p-value for Affected Sib Pair Mean Test TDT = p-value for Likelihood-Based TDT HRR-LRT = p-value for Likelihood-Based Haplotype Relative Risk HRR-2xn = p-value for 2 x n Contingency Table Chi-Square HRR test USEFUL ACCESSORY PROGRAMS: DOWNFREQ - Estimates allele frequencies for marker loci and allows the user to downcode the data - especially useful is the option to eliminate alleles which never occur in your complete dataset to streamline the calculations EXTRACT - reorders the loci in your pedigree and parameter files to put them in chromosome order - this is very important if you plan to use the polylocus or multipoint HRR results, and is used for determination of map positions in the summary.out and nonparam.summ files. (they are described briefly in accompanying documents) REFERENCES: Linkage Analysis with Heterogeneity: Bailey, N.T.J. (1961) "Introduction to the mathematical theory of genetic linkage." Oxford:Clarendon Press. Chotai, J. (1984) "On the lod score method in linkage analysis." Ann Human Genet 48:359-378. Cottingham, R.W., R.M. Idury, A.A. Schaffer (1993) "Faster sequential genetic linkage computations." Am J Hum Genet 53:252-263. Davies, R.B. (1977) "Hypothesis testing when a nuisance parameter is present only under the alternative." Biometrika 64:247-254. Faraway, J.J. (1993) "Distribution of the admixture test for the detection of linkage under heterogeneity." Genetic Epidemiology 10:75-83. Lathrop, G.M., J.M. Lalouel, C. Julier et. al. (1984) "Strategies for multilocus linkage analysis in humans." PNAS 81:3443-3446. Morton, N.E. (1955) "Sequential tests for the detection of linkage." Am J Hum Genet 7:277-318. Ott, J. (1991) "Analysis of Human Genetic Linkage", 2nd edition. Baltimore, Johns Hopkins University Press. Smith, C.A.B. (1953) "The detection of linkage in human genetics." J R Statist Soc 15B:153-184. Smith, C.A.B. (1963) "Testing for heterogeneity of recombination fraction values in human genetics." Ann Hum Genet 27:175-182. Terwilliger, J.D., and J. Ott (1994) "Handbook of Human Genetic Linkage" Johns Hopkins University Press. TDT Analysis: Ewens, W.J., and R.S. Spielman (1995) "The transmission/disequilibrium test: history, subdivision, and admixture." Am J Hum Genet 57:455-464. Ott, J. (1989) "Statistical properties of the haplotype relative risk." Genetic Epidemiology 6:127-130. Spielman, R.S., R.E. McGinnis, and W.J. Ewens (1993) "Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am J Human Genet 52:506-516 Terwilliger, J.D. (1995) "A powerful likelihood method for the analysis of linkage disequilibrium between trait loci and one or more polymorphic marker loci" Am J Human Genet 56:777-787 Terwilliger, J.D. (1996a) "Response to Sham et al." Am J Hum Genet (In Press) Terwilliger, J.D. (1996b) "Pedigree-based likelihood methods for analysis of linkage and linkage disequilibrium for one or more polymorphic marker loci" (In Review) HRR Analysis: Falk, C.T., and P. Rubinstein (1987) "Haplotype Relative Risks: an easy reliable way to construct a proper control sample for risk calculations." Ann Hum Genet 51:227-233. Hill, W.G., and B.S. Weir (1994) "Maximum-likelihood estimation of gene location by linkage disequilibrium." Am J Hum Genet 54:705-714. Ott, J. (1989) "Statistical properties of the haplotype relative risk." Genetic Epidemiology 6:127-130. Rubinstein, P., M. Walker, C. Carpenter, et al. (1981) "Genetics of HLA disease associations: The use of the haplotype relative risk (HRR) and the "haplo-delta" (Dh) estimates in juvenile diabetes from three racial groups." Hum Immunol 3:384. Terwilliger, J.D. (1995) "A powerful likelihood method for the analysis of linkage disequilibrium between trait loci and one or more polymorphic marker loci" Am J Human Genet 56:777-787 Terwilliger, J.D. (1996a) "Response to Sham et al." Am J Hum Genet (In Press) Terwilliger, J.D. (1996b) "Pedigree-based likelihood methods for analysis of linkage and linkage disequilibrium for one or more polymorphic marker loci" (In Review) Terwilliger, J.D. and J. Ott (1992) "A haplotype-based haplotype relative risk statistic" Human Heredity 42:337-346. Sibpair Analysis: Blackwelder, W.C. and R.C. Elston (1985) "A comparison of sib-pair linkage tests for disease susceptibility loci". Genetic Epidemiology 2:85-97 Eaves L., and J. Meyer (1994) "Locating human quantitative trait loci: guidelines for the selection of sibling pairs for genotyping." Behavior Genetics 24:443-455 Penrose, L.S. (1935) "The detection of autosomal linkage in data which consist of pairs of brothers and sisters of unspecified parentage." Ann Eugenics 6:133-38 Risch, N., and H. Zhang (1995) "Extreme discordant sib-pairs for mapping quantitative trait loci in humans." Science 268:1584-1589. Suarez, B.K., and P. Van Eerdewegh (1984) "A comparison of three affected sib-pair scoring methods to detect HLA-linked disease susceptibility genes." Am J Med Genet 18:135-146. Terwilliger, J.D.; W.D. Shannon; G.M. Lathrop; J.P. Nolan; L.R. Goldin; G.A. Chase; D.E. Weeks. (1996) "True and False Positive Peaks in Genome-wide Scans: Applications of Length-Biased Sampling to Linkage Mapping." (In Review) Linkage Analysis on Nuclear Pedigrees: Terwilliger, J.D. and J. Ott (1994) "Handbook of Human Genetic Linkage" Baltimore and London, Johns Hopkins University Press. Polylocus Analysis: Baron, M., N. Risch, R. Hamburger, B. Mandel, S. Kushner, M. Newman, D. Drumer, R.H. Belmaker (1987) "Genetic linkage between X-chromosomal markers and bipolar affective illness." Nature 326:289-292 Pekkarinen, P., J.D. Terwilliger, P.-E. Bredbacka, J. Lonnqvist, L. Peltonen (1995) "Evidence of a susceptibility locus for bipolar disorder on Xq24-q27.1 in a Finnish pedigree." Genome Research (In Press). Schaid D.J., and R.C. Elston (1994) "Combining two-point genetic linkage analyses using mapping functions." Genetic Epidemiology 11:1-17. Terwilliger, J.D. and J. Ott (1993) "A novel polylocus method for linkage analysis using the lod score or affected sib-pair methods" Genetic Epidemiology 10(6):477-82. Multilocus HRR: Falk, C.T., and P. Rubinstein (1987) "Haplotype Relative Risks: an easy reliable way to construct a proper control sample for risk calculations." Ann Hum Genet 51:227-233. Haberhausen, G., I. Schmitt, A. Kohler, U. Paters, S. Rider, J. Chelly, J.D. Terwilliger, A.P. Monaco, U. Muller (1995) "Assignment of the dystonia-parkinsonism syndrome locus, DYT3, to a small region within a 1.8-Mb YAC contig of Xq13.1" Am J Human Genet 57:644-650. Nikali, K. A. Suomalainen, J.D. Terwilliger, T. Koskinen, J Weissenbach, L Peltonen (1995) "A new locus for a hereditary ataxia identified by genotyping of four affected individuals." Am J Human Genet ??:???-??? Raha-Chowdhury, R., D.J. Bowen, C. Stone, J.J. Pointon, J.D. Terwilliger, J.D. Shearman, K.J.H. Robson, A. Bomford, M. Worwood (1995) "New polymorphic microsatellite markers place the hemochromatosis gene telomeric to D6S105." Human Molecular Genetics (In Press). Rubinstein, P., M. Walker, C. Carpenter, et al. (1981) "Genetics of HLA disease associations: The use of the haplotype relative risk (HRR) and the "haplo-delta" (Dh) estimates in juvenile diabetes from three racial groups." Hum Immunol 3:384. Terwilliger, J.D. (1995) "A powerful likelihood method for the analysis of linkage disequilibrium between trait loci and one or more polymorphic marker loci" Am J Human Genet 56:777-787 Terwilliger, J.D. (1996a) "Response to Sham et al." Am J Hum Genet (In Press) Terwilliger, J.D. (1996b) "Pedigree-based likelihood methods for analysis of linkage and linkage disequilibrium for one or more polymorphic marker loci" (In Review) Terwilliger, J.D. and J. Ott (1992) "A haplotype-based haplotype relative risk statistic" Human Heredity 42:337-346.