Changes:

2/14/96 - Added support for sexlinked as well as autosomal data
          Corrected minor bug in hrrlamb and hrrmult maximization routine
          (pointed out by Michael Knapp - cf Terwilliger(1996a) ).
          Improved formatting for output files.
          Executables made available with statically linked libraries (for those
	     without pascal compilers)
	  Improved stability
	  

Documentation for ANALYZE program.

This program set simplifies the computation of various statistics to test
for linkage and/or linkage disequilibrium between a disease locus and a
series of marker loci in a 2-point fashion.  It is necessary for the user 
to have his favorite version of MLINK, UNKNOWN, and LSP from the LINKAGE
package (Lathrop et al, 1984) or FASTLINK (Cottingham et al, 1993)
(installed on the machine for these programs to work.  This collection of
programs simplifies the setup and analysis of human pedigree
data, and outputs the results of linkage analysis, sibpair analysis, 
haplotype relative risk analysis, TDT analysis, and some other variations
on the above themes in a detailed form, and a summary table, including
some simple multilocus disequilibrium and linkage based statistics.

How to use the program:

INPUT FILES:

pedin.dat - LINKAGE format pedigree file (processed by MAKEPED) with the
            disease locus first, followed by a series of "allele numbers"
            marker loci in chromosomal order (if they are not in chromosomal
            order, they can be put in the right order using the "extract"
            program included with this package).

datain.dat - LINKAGE format parameter file (generated with PREPLINK) in either
             MLINK, ILINK, or LINKMAP format (LODSCORE format is not acceptable).
             Loci must be in chromosomal order (with the trait locus first) and
             must all be of the allele numbers locus type.  Correct intermarker
             recombination fractions should be correctly indicated on the next to
             last line of the file (this can all be done in PREPLINK).  Be sure
             to specify the presumed mode of inheritance in the parameter file
             for the parametric analyses to be conducted.

OUTPUT FILES:

analyze.final - Detailed output from each analytical method - discussed in detail below.

summary.out - Summarized table of all essential statistics for each locus in chromosomal
              order with map distances indicated in the first column - discussed in detail
              later.


OTHER PROGRAMS NEEDED:

MLINK, UNKNOWN, LSP - You must have your favorite version of MLINK, UNKNOWN, and LSP 
	installed on your computer somewhere in the path.  You can use either LINKAGE
	(Lathrop et al, 1984) or FASTLINK (Cottingham et al, 1993) in any version you
	have available.


INSTRUCTIONS:  From the directory with the pedin.dat and datain.dat files, just type

               analysis   

               to run the program in foreground, or type

               nohup analysis > analysis.log &

               to run the program in background - be sure all programs are in the path
               (i.e. in your bin directory, for example).


Interpretation of the Output Files:

analyze.final

This file contains the detailed statistical analysis of your disease and marker loci with 
the following sections:

FOR EACH MARKER LOCUS
	1) Lod Score Analysis with Heterogeneity
	2) TDT Analysis
	3) Haplotype Relative Risk Analysis
	4) Sibpair Analysis
	5) Lod Score Analysis on Nuclear Pedigrees With Heterogeneity

THEN, MULTILOCUS ANALYSES
	6) Polylocus Lod Score Analysis
	7) Multilocus Haplotype Relative Risk Analysis

summary.out

This file contains a tabular summary of the statistical analysis using each
of the above methods, but in a much condensed and simplified format for
quick examination and summary of the results of each test.

Interpretation of Sample Dataset


        There is a sample dataset in the 2-point/sample/autosomal directory created
when you uncompress the tar file containing the program code.  There is a pedigree 
file called pedin.dat and a parameter file called datain.dat.  To run the analysis
you can simply type analysis and the programs will run (assuming you have installed
them correctly and put them in the path on your machine).  Let us look at the main
output file, analyze.final, and see what each section means.

Just as in the final.out file you get when running LINKAGE through LCP (Lathrop et al, 1984),
the lsp.log file is attached to let you know which locus you are currently looking at.


********************************************************************************

				     MLINK


	Pedigree File              : pedin.tpd
	Parameter File             : datain.tdt
	Output Pedigree File       : pedfile.dat
	Output Parameter File      : datafile.dat
	Log File                   : lsp.log
	Stream File                : lsp.stm

	Date Run                   : 17-Oct-95  16:50:21

	Sex Difference             : 0
	Recomb. Fraction to Vary   : 1
	Increment Value            : 0.02000000
	Stop Value                 : 0.48000000

	Locus Order                : 1 2
	Male Recomb. Fractions     : 0.00000000

********************************************************************************

The only important information here is the locus order - this tells you that currently
you are examining locus 1 (the trait) against locus 2 in your input files.  If you are
using the most recent version of LSP, and have included locus names in the datain.dat
file, they will be found on this line as well.  The other information is irrelevant.

It is important to note that there is a preprocessing program run which renumbers all
pedigrees sequentially and all individuals within pedigrees sequentially as well.  Therefore,
if UNKNOWN should complain about an inconsistency, for example, the pedigree number given will
be the sequential pedigree number - from the renumbered file, pedin.tpd.  This has no effect
on the results, but if you are trying to debug your data, it is important to bear this in mind.
The reason this is done is to simplify the programming and memory requirements for the analysis
programs.

1) LOD SCORE ANALYSIS WITH HETEROGENEITY

Output of analysis with HOMOG

For significance testing use the following standard critical values:
  HOMOGENEITY:        Z(theta) > 3
HETEROGENEITY:  Z(alpha,theta) > 3.3 


  Under homogeneity - Maximum Lod Score =   0.909022
                                  Theta =   0.380


Under heterogeneity - Maximum lod score =   3.849152
                                  Theta =   0.100
   Proportion of Linked Families, Alpha =   0.350

Under Heterogeneity there is significant evidence of linkage.

3.3-lod-unit support interval for alpha and theta is as follows:
 Alpha:  ( 0.01, 1.00)
 Theta:  ( 0.00, 0.44)


This part of the file summarizes the essential linkage information obtained
from analysis using the parametric model defined in your datain.dat for the
trait locus.  The analysis is done using MLINK (Lathrop et al, 1984), letting the 
recombination fraction vary from 0 to 0.5 in steps of 0.02, so here we see that 
the maximum lod score occurs at recombination fraction 0.38, with said lod score 
being 0.91.  Then, a heterogeneity analysis is performed using the HOMOG program
(Ott, 1991), allowing alpha (the proportion of linked families in the dataset)
to vary between 0 and 1 in steps of 0.01.  In this case, we can see that the maximum
lod score has risen to 3.85 when allowed for the presence of some unlinked families
in our dataset.  At the top of this section it explains that the conventional criteria
have been used to interpret the significance of these lod scores - under homogeneity
a lod score of 3 (1000:1 likelihood ratio) is taken as significant evidence of linkage 
(Morton, 1955; Bailey, 1961; Chotai, 1984), while if this is not significant, it 
has been suggested (Ott, 1991) that a lod score of 3.3 (likelihood ratio of 2000:1) 
be used as a critical value to compensate for the added free parameter, alpha.  Note 
that this is by convention, and in reality the distribution of this latter statistic 
is very complicated (cf. Davies, 1977; Faraway et al, 1993).  Because the
critical limit for our test has been surpassed, I have included the support interval for
our estimated values of alpha and theta - here I used a 3.3 unit support interval to be
consistent with the test statistic critical value (cf. Terwilliger and Ott, 1994). 

Looking at the same section for the second marker locus (locus 3), we see the following:

Output of analysis with HOMOG

For significance testing use the following standard critical values:
  HOMOGENEITY:        Z(theta) > 3
HETEROGENEITY:  Z(alpha,theta) > 3.3 


  Under homogeneity - Maximum Lod Score =   5.582551
                                  Theta =   0.280

Under homogeneity there is significant evidence of linkage.


   Test for heterogeneity GIVEN Linkage 
           Chi-square for heterogeneity =  33.510799
                                  Theta =   0.080
                                  Alpha =   0.390
Significant evidence of heterogeneity at p < 0.0001 level! 

In this example, there was significant evidence of linkage under the assumption of
homogeneity - the lod score being 5.58 at recombination fraction 0.28.  In this situation,
there was no need to look at the lod score test for linkage and homogeneity jointly, since
we have already demonstrated that there is linkage.  In this situation, a test of heterogeneity
given linkage is performed with HOMOG (Smith, 1963; Ott, 1991), and the statistic presented
is 2ln(L(theta,alpha)/L(theta,alpha=1)) which is asymptotically distributed as a chi-square
statistic with 1 degree of freedom - the test is further one-sided, since the null hypothesis,
alpha = 1, is compared with the one-sided alternative alpha < 1.  In this case, there was significant
evidence of heterogeneity at the 0.0001 level (estimated theta = 0.08, and alpha = 0.39).

2) TDT Analysis


********************************************************
*****                                              *****
*****  You are using TDTLIKE - Alpha Test Version  *****
*****  for computing TDT-like likelihood ratio     *****
*****  statistics based on an algorithm of         *****
*****  J. Terwilliger (AJHG 56:777-787 (1995))     *****
*****                                              *****
********************************************************

     Locus   1   Alleles which appear at least 5 times shown.

                                           Multiple test corrected
    ORIG #      CASE     CONTROL          TDT          One-Sided P-Value
         1       209       129       18.9349117279        0.0000397356
         2        46        83       10.6124029160        1.0000000000
         3        76       129       13.7024393082        1.0000000000
         4        91        93        0.0217391308        0.9880542409
         5        87        75        0.8888888955        0.6593401134

Multiallelic Statistic - Based on Terwilliger (AJHG - March 1995)

Maximum Likelihood Estimate of TDT Lambda =    0.62000
-2ln(L) difference =        15.8929973831
P-Value =         0.0000338507

This is the summary of results of analysis with the TDT (Spielman et al, 1993).
This test looks at all affected offspring of heterozygous parents for a given
allele - it compares the frequency with which they transmit said allele to their
affected children with the frequency with which they transmit the other allele
to their affected children.  In other words, if you have a sample affected kids,
with a parent with genotype 1/?,  and there are X affecteds who received the 1 allele
from these heterozygous parents, and Y affecteds who got the other allele (which 
may be different in each case), the TDT test statistic is of the form (X-Y)^2/(X+Y),
which is asymptotically distributed as a chi-square statistic with one degree of freedom 
- again, this test is one-sided because we are only interested in cases where a specific
allele is transmitted more frequently than expected to the affected offspring(Ewens and 
Spielman, 1993).

Some thought might be given to what is actually being tested with the TDT.  In the
absence of linkage, it is random which allele segregates to any given affected
individual from a heterozygous 1/? parent, so under the hypothesis of no linkage,
clearly X = Y, and the test is valid, since all meiotic events are independent
in this case - even within a pedigree.  If you have a sample of unrelated affected
kids, and their heterozygous parents, in the absence of allelic association, it is
random which allele is "in phase" with the disease allele in these heterozygous 
parents, so under this hypothesis as well, all observations are independent (even
if there were linkage), and it is a valid test of the hypothesis of no allelic association.  
However, in extended pedigrees, if there is linkage, the different affected kids are not
independent relative to the hypothesis of no allelic association, and the test is not 
a valid one for this null hypothesis.  To see this, consider the situation in which you
have a set of sibpairs, and there is 0 recombination between marker and disease, and a
fully penetrant recessive disease - then all affected sibs would have the identical
marker alleles received from the heterozygous parents.  Then, effectively you are counting
the same parental alleles twice.  The effect would be the same as if you had a standard
case-control association test, and decided to double the number of counts of each allele
in case and control samples - clearly one would never conclude that this was a valid test
procedure, and this is the same effect as you have here when there is linkage, so to say
that a positive TDT in multiplex families is evidence of association is a misnomer - the
only null hypothesis you can reject validly with this test is that of no linkage, especially
in a small set of large pedigrees (Ott, 1989; Ewens and Spielman, 1995; Terwilliger, 1996b)

In the statistical analysis outlined above, I have not used the chi-square approximation to
compute the p-values, but rather, since sample sizes are sometimes a bit too small for the
chi-square approximation to hold, I have computed exact p-values from a binomial distribution,
looking at the probability of having the allele under study transmitted X or more times to
the affected offspring of heterozygous parents out of (X+Y) opportunities.  Then, I have 
corrected for the fact that multiple tests have been performed (if there are 5 alleles at a locus,
there are approximately 4 "independent" tests - therefore the p-value presented is the probability
of such an extreme result occurring for one of the n alleles at a given locus, rather than
assuming you had only tested the one single allele.  Only alleles which have had at least 5 opportunities
to be transmitted are included, to reduce in an unbiased way, the number of necessary tests one
should then correct for.  This can be made higher or lower by altering a program constant, min,
in the file tdtlikena.p and then recompiling the program. 

At the bottom, I have also performed a likelihood ratio based TDT test (Terwilliger, 1995;1996b) which 
considers all the alleles jointly and allows that one of them would be transmitted preferentially 
to affected offspring, and not the others. The test is parametrized such that the probability of 
transmitting said allele is equal to lambda, which is then estimated by maximum likelihood, where 
lambda is constrained to be > 0.5, lambda = 0.5 being the null hypothesis in this nonparametric 
linkage test.  2ln(L(lambda)/L(lambda=0.5)) is assumed to be asymptotically one-sided chi-square
with one df. 

In this case, there is significant evidence of linkage, with allele 1 being preferentially transmitted
to affected offspring of 1/? parents, with a p-value of 0.00004 in the standard multiple test corrected
TDT, and using my likelihood ratio test, the p-value is 0.00003, though this test does not provide any
determination of which allele is actually preferentially transmitted, but since this TDT is performed
using all affected individuals in multiplex pedigrees, no conclusions about allelic association can be 
made, so this information is really somewhat irrelevant in this test.

3) Haplotype Relative Risk Analysis

*************************************************************
*                                                           *
*       Program HRRLAMB - Version 2.1   (1/31/96)           *
*                                                           *
*       AJHG 56:777-787 (1995)                              *
*                                                           *
*************************************************************


Disease allele frequency = 0.01000000

=========================================
CASE      | 43. |  1. | 13. | 12. | 11. |
CONTROL   | 27. | 10. | 14. | 18. | 11. |
=========================================

Estimated parameters for likelihood ratio test:
 Allele frequencies: 
    Allele            H0:             H1
         1     0.43750000     0.34604554
         2     0.06875000     0.08098217
         3     0.16875000     0.19567004
         4     0.18750000     0.21743885
         5     0.13750000     0.15986340
    Lambda     0.00000000       0.291021

LRT Chi-Square =         4.40863 p-value =    0.017888760783279 Lambda =   0.291021
NO SIGNIFICANT EVIDENCE OF LINKAGE DISEQUILIBRIUM BY LRT TEST
2 x n table Chi-square =   12.25782 P-value =    0.015533597133660
NO SIGNIFICANT EVIDENCE OF LINKAGE DISEQUILIBRIUM BY 2 x N TABLE CHI-SQUARE TEST


The next set of output is for the haplotype relative risk test for allelic association
(Rubinstein et al, 1981; Falk and Rubinstein, 1987; Terwilliger and Ott, 1992).
This test is very much related to the TDT with two salient differences - first, only one
affected individual is used per pedigree - so they are independent relative to the null
hypothesis of no association; second, all alleles transmitted from parents (heterozygous
or homozygous) to the affected child are included in the case sample, and all remaining
(i.e. not transmitted) alleles in the parents of the affected children are included in 
the control sample (cf. Falk and Rubinstein, 1987; Ott, 1989; Terwilliger and Ott, 1992).  
In this case, the affected child selected in any given pedigree is chosen as the first affected
individual who is himself typed at the marker locus and has both parents typed at the 
marker as well.  If no affected child in a pedigree meets these criteria, then the program
searches for the first child who is himself typed and has one parent typed at the marker
locus - in this case there would be two alleles in the case sample ( the two alleles of the
affected child) and one allele in the control sample (the not transmitted allele from the
genotyped parent).  If no affected child meets this criterion in a pedigree, then the first
affected child who is typed himself is taken as the case sample, and no matching control
is allowed for - the effect of this sampling procedure is that the case sample will be, in
general, larger than the control sample, but it avoids the necessity of throwing away 
information about linkage disequilibrium from independent unrelated affecteds, just because
their parents are not available for testing and thus no controls are available.

The number of times each allele appears in the case and control samples is summarized in a
brief 2 x n table, and two tests are performed of the null hypothesis of no association.
First, a likelihood ratio test is performed  (Terwilliger, 1995) which implies a linkage
disequilibrium model with one associated allele only, where that allele has frequency
p + lambda(1-p) in the case sample, where p is the population frequency of that allele,
and lambda is a measure of the strength of association.  Under the null hypothesis,
lambda = 0, and the probability of this allele is the same in case and control samples.
If lambda = 1, there is complete association, and the frequency of this allele = 1 in the case
sample.  Lambda is estimated by maximum likelihood with this program, and the p-value is indicated
under the overly conservative assumption that the statistic follows a one-sided chi-square
distribution with 1 df.  The second test performed is that standard 2 x n table chi-square test,
which has n-1 df (where there are n alleles whose frequencies are compared in case and control
samples).  This test is less powerful by far when there is only one associated allele (i,e,
linkage disequilibrium from a founder effect), but can be more sensitive when there are higher
order associations with different alleles.  

The analysis package was written with the assumption that the analyses are part of a genome 
screening effort to identify new disease loci - for that reason when the p-values are larger than
0.0001 the program says the result is not significant (association tests must be conducted at 
AT LEAST the same stringency as linkage tests when the goal is to prove the existence of a new
disease locus - the reason for this is that under the null hypothesis of no association, there is
no correlation between the results of association tests at even tightly linked markers, while
for linkage they are strongly correlated, so multiple testing is even a more important factor
when association tests are being performed).  If you have proven through linkage that a locus
is linked to a marker, and you want to test for association given linkage, you may feel free
to reduce the stringency of the test as you see fit - remember that p-values are similar to
the odds at the racetrack - they give you an idea of how to gamble your resources - if you are
satisfied with a p-value of 0.05 to invest your resources in molecular search for a disease gene,
feel free to go ahead - but realize there is quite a large risk of false positive results - especially
in association tests (Hill and Weir, 1995; Terwilliger, 1996b).

Note that for the first marker locus there was a strongly positive TDT result (rejecting the hypothesis
of no linkage) but no significant evidence of allelic association - this is not inconsistent, remember,
since the TDT in multiplex pedigrees is not, itself, a test of association (Ewens and Spielman, 1995;
Terwilliger, 1996b)!

Let us look briefly at the second marker locus output as well:


********************************************************
*****                                              *****
*****  You are using TDTLIKE - Alpha Test Version  *****
*****  for computing TDT-like likelihood ratio     *****
*****  statistics based on an algorithm of         *****
*****  J. Terwilliger (AJHG 56:777-787 (1995))     *****
*****                                              *****
********************************************************

     Locus   1   Alleles which appear at least 5 times shown.

                                           Multiple test corrected
    ORIG #      CASE     CONTROL          TDT          One-Sided P-Value
         1       271        71      116.9590606689        0.0000000000
         2        81       169       30.9759998322        1.0000000000
         3       172       284       27.5087718964        1.0000000000

Multiallelic Statistic - Based on Terwilliger (AJHG - March 1995)

Maximum Likelihood Estimate of TDT Lambda =    0.79000
-2ln(L) difference =       122.5419672207
P-Value =         0.0000000000


*************************************************************
*                                                           *
*       Program HRRLAMB - Version 2.1   (1/31/96)           *
*                                                           *
*       AJHG 56:777-787 (1995)                              *
*                                                           *
*************************************************************


Disease allele frequency = 0.01000000

=============================
CASE      | 39. |  9. | 32. |
CONTROL   |  1. | 23. | 56. |
=============================

Estimated parameters for likelihood ratio test:
 Allele frequencies: 
    Allele            H0:             H1
         1     0.25000000     0.02719901
         2     0.20000000     0.25945108
         3     0.55000000     0.71334991
    Lambda     0.00000000       0.474385

LRT Chi-Square =        50.69729 p-value =    0.000000000000561 Lambda =   0.474385
SIGNIFICANT EVIDENCE OF LINKAGE DISEQUILIBRIUM BY LRT TEST
2 x n table Chi-square =   48.77045 P-value =    0.000000000025682
SIGNIFICANT EVIDENCE OF LINKAGE DISEQUILIBRIUM BY 2 x N TABLE CHI-SQUARE TEST


In this example, for the second marker locus, there was a strongly positive TDT result,
and also a strongly positive HRR result - in fact, notice that allele 1 appears to
be preferentially transmitted in the TDT test, and appears to be associated in the
HRR test.  When the HRR is significant (i.e. there is strong evidence of linkage disequilibrium)
the TDT test should normally be significant as well, since it is difficult to imagine a model in
which you would have allelic association, but not also linkage.  The inverse is not true,
though, as illustrated with the example of the first marker.  Here, you also see the expected
pattern that the LRT test has a pvalue that is 2 orders of magnitude smaller than the more 
general 2 x 3 table chi-square.  This effect is much stronger as the number of alleles at the
marker locus increases, when there is linkage disequilibrium caused by founder effect.

4) SIBPAIR ANALYSIS

*********************************************************************
*    Program SIBPAIR - Sibpair analysis on Nuclear Families         *
*********************************************************************

                                                             P-values
              0      1      2   |  SHARED    NOT  |      0 vs 2     Mean test
                                |                 |
   NA/NA      6.0   15.0    7.0 |    34.0   34.0  |   0.390755713  1.000000000
   NA/A      20.7   38.8   17.2 |   102.3  116.0  |   0.284681141  0.177501172
    A/A      17.3   53.0   28.3 |   158.7  116.7  |   0.051785618  0.005694087

Overall Chisquare For 0-2 Sharer Comparisons =    2.97342  p-value =  0.1130575
Overall Chisquare For Mean Test  Comparisons =    7.26225  p-value =  0.0132432

Affected Sib-Pair Mean Test P-value =   0.005694087


Next, the program performs a sibpair analysis (Penrose, 1935) on all sibpairs in the dataset.  When
there are multiple sibs in a sibship this program weights the sibpairs according to the
sibship size in the following manner - 2 sibs = 1 sibpair, 3 sibs = 2 sibpairs, 4 sibs = 3 sibpairs.
This weighting has been selected to conform with the information content of phase unknown nuclear 
pedigrees in linkage analysis.  A detailed discussion of this can be found in  (Blackwelder and 
Elston, 1985; Suarez and van Eerdewegh, 1984).  The way this is done is to compute effectively 
the average number of alleles shared and not shared between all sibs and a given proband - 
where the average is taken over all affected sibs who could be considered the proband.  
Further, I allow for the possibility to look at unaffected sibs as well, as has been recommended 
by Eaves and Meyer (1994) and Risch and Zhang(1995) independently.  To extend the same weighting 
logic to these unaffecteds as well, I consider that a sibship of size n has (n-1) effectively
independent sibships, as outlined above - any of the affected sibs could be taken as the proband,
for example.  Then there are m-1 affected sibpairs, if there are m affected sibs.  If there are k
unaffected sibs, and any affected sib could be our choice of proband, there are k discordant NA/A
sibpairs possible for comparison with any given proband.  Then, I estimate the number of shared alleles
by taking the average of the results from comparing the unaffecteds with each of the affecteds as proband
in turn.  Thus a sibship with m affecteds and k unaffecteds would contribute (m-1) affected sibpairs and
k discordant affected/not affected sibpairs to the linkage test.  The only time there will be NA/NA sibpairs
used in the analysis is when there are only unaffected sibs in a given sibship - the rationale for this is
that if we only have n-1 independent sibpairs in a sibship of size n, it is wise to apportion them in the most
powerful manner - since NA/NA sibpairs would be the least informative for most complex diseases (i.e. they are
unaffected possibly due to low penetrance, or lack of environmental exposure, etc.) the most powerful partition
of the available data is to use only A/A and A/NA pairs when possible, since they give the most bang for the buck.
Eaves and Meyer(1994) originally recommended looking at "very unaffected" individuals, say from the lower tail of 
some continuous distribution for the trait as the unaffected sample, and the extreme upper tail as the 
affected sample, with all other sibs being labelled as unknown at the trait locus - unknown sibs being 
discarded from the analysis.  

When parents are not genotyped at the marker locus, the program computes the likelihood of each possible 
genotype for the untyped parents, and computes the number of alleles shared identical by descent 
(IBD) in a sibpair as the average over all possible parental genotype combinations, weighted by their 
conditional probabilities given the known data.  When a parent is homozygous, there is no information about 
linkage, so those data are left out, since we cannot tell whether the sibpair does or does not share an 
allele IBD from that parent.  For this reason, one can see there is more data available when looking at 
number of alleles shared vs not shared, when compared with the number of sibpairs sharing 0, 1, or 2 
alleles - if one parent is homozygous, you cannot tell how many alleles that sibpair shares IBD, and 
assuming that there is 50% chance of sharing the allele IBD and 50% chance of the allele being not IBD
tends to dilute the power of the linkage test, and biases strongly the estimate of proportion of alleles 
shared IBD toward the null hypothesis value of 50%.

The output in the table above gives the results of a number of plausible statistical tests - comparing first,
the number of sibpairs sharing 2 alleles IBD [X] with the number sharing 0 IBD  [Y] by the statistic 
[X-Y]^2/[X+Y] which is distributed as one-sided chi-square with one df.  For concordant pairs, X > Y under
the alternative hypothesis, and for discordant pairs X < Y under the alternative.  Underneath, I have included
a "overall chi-square" statistic for this test as well, which is effectively the sum of the chi-square statistics
for A/A sibpairs and A/NA sibpairs (NA/NA sibpairs are left out since they are present in this analysis only when
there are sibships where everyone is unaffected), and is assumed to follow a chi-square distribution with 2df.

The other test, the mean test compares the number of alleles IBD among all sibpairs with the number of alleles not
shared IBD among all sibpairs.  Since homozygous parents provide no information, and with a marker of 70%
heterozygosity, 42% of all nuclear pedigrees will have only one heterozygous parent and one homozygous parent,
the amount of data for which we can identify whether at least one allele was shared IBD by a sibpair is much larger
than the proportion of the dataset where we can determine how many alleles were shared IBD from both parents.
For this reason, the mean test is generally more powerful in this implementation than the comparison described 
above - the statistics are of the same form, only now X is the number of alleles shared IBD, and Y is the number
of alleles not shared IBD in the equations presented above.  The last line of the section reiterates the most
important piece of statistical information in general, the p-value for the chi-square test comparing
the number of alleles shared and not shared among all affected sib-pairs in a dataset.  In complex diseases where
the "unaffecteds" have not been selected as the lower tail of some continuous distribution - (for example, they are 
unaffected even given presence of lots of risk factors, and have a phenotype that is more normal than normal - i.e. 
low blood pressure as unaffected if hypertension were affected) - there is little power in comparisons other than 
looking for increased sharing among affected sibs.

In this example, the first marker locus showed no significant evidence of linkage - of course, in sibpair
analysis you need to have the same stringency as in linkage analysis, p < 0.0001 to call a test result positive.
The second marker locus showed much more significant evidence of linkage in the sibpair test, with a p-value of 
< 0.000000001 as shown below:

*********************************************************************
*    Program SIBPAIR - Sibpair analysis on Nuclear Families         *
*********************************************************************

                                                             P-values
              0      1      2   |  SHARED    NOT  |      0 vs 2     Mean test
                                |                 |
   NA/NA      2.0    8.0    4.0 |    24.0   25.0  |   0.207109630  0.943191707
   NA/A      22.2   39.0   12.8 |   103.0  128.0  |   0.057323437  0.049996715
    A/A       6.7   43.3   43.0 |   201.7   89.3  |   0.000000129  0.000000000

Overall Chisquare For 0-2 Sharer Comparisons =   29.06830  p-value =  0.0000002
Overall Chisquare For Mean Test  Comparisons =   46.06913  p-value =  0.0000000

Affected Sib-Pair Mean Test P-value =   0.000000000


5) LOD SCORE ANALYSIS ON NUCLEAR PEDIGREES WITH HETEROGENEITY


Output of analysis with HOMOG

For significance testing use the following standard critical values:
  HOMOGENEITY:        Z(theta) > 3
HETEROGENEITY:  Z(alpha,theta) > 3.3 


  Under homogeneity - Maximum Lod Score =   3.375684
                                  Theta =   0.220

Under homogeneity there is significant evidence of linkage.


   Test for heterogeneity GIVEN Linkage 
           Chi-square for heterogeneity =   0.011600
                                  Theta =   0.220
                                  Alpha =   0.970
No significant evidence of heterogeneity. 

In the case where you presume multiple genes are responsibile for a given disease phenotype,
there may be intrafamilial heterogeneity as well as interfamilial heterogeneity, and thus 
the recombination fraction may appear to be different in different parts of a pedigree, as
different disease loci may be segregating in different sibships - to control for this, we
eliminate all extended pedigree phase information, and break large multiplex pedigrees into
all component nuclear pedigrees in our subsequent linkage analysis - which assumes the same
mode of inheritance specified in the datain.dat file.  Interpretation of this output is identical
to the standard lod score analysis - note that even under hmogeneity there is strong evidence
of linkage in this analysis (while there was NO evidence of linkage when the extended pedigrees 
were analyzed above).  This effect is due to elimination of potentially erroneous phase information
which relies heavily on the assumed mode of inheritance for the disease locus in extended pedigrees.
This analytical approach is intuitively related to affected sibpair strategies (Terwilliger and Ott, 
1994; Ginns et al, 1995).

6) POLYLOCUS LOD SCORE ANALYSIS

          Locus        Map Position
              2            0.000000
              3            0.005025

OUTPUT FROM PROGRAM POLYLOCUS 
 Analysis with Polylocus method of Terwilliger and Ott 
 Genetic Epidemiology 10(6):477-82 (1993) 


Primary Locus is now Locus Number   2: 


Polylocus Maximum Lod Score =   1.243432
                      Theta =   0.380000

Polylocus Maximum Lod Score with Heterogeneity =   5.639449
                                         Theta =   0.060000
                                         Alpha =      0.300


MULTIPOINT Polylocus Maximum Lod Score =   0.601381
                          Map Position =  -0.458145

MULTIPOINT Polylocus Maximum Lod Score with Heterogeneity =   5.621749
                                             Map Position =   0.057705
                                                    Alpha =   0.290000


This is the portion of the polylocus analysis output for which marker locus 2 was
the primary locus to be analyzed.  Polylocus linkage analysis is simply a standard 
parametric linkage analysis in which for pedigrees in which a given marker is untyped
or uninformative, the next nearest marker is used in the analysis instead, to recover
some of the lost linkage information without the known biases endemic to multipoint
linkage analysis, or the computational demands true multipoint analysis imposes (Baron et al, 1987;
Terwilliger and Ott, 1993).  The heterogeneity test is performed on the lod scores as if
the substitute marker(s) are at the same point as the primary marker locus of interest,
using the same algorithm as HOMOG (Smith, 1963; Ott, 1991) described above.   The Multipoint
Polylocus Lod score method is based on the same principal except that the marker map is
presumed known and some correction is made for it, following a recommendation of Schaid 
and Elston(1994), with some slight simplifications - here since one locus is used
per pedigree and I assume the marker map is fixed and the Haldane mapping function applies,
The lod scores are determined at certain "map positions" from each marker and then at each map
position, the lod scores are added together to create the mathematical equivalent of 
standard multipoint lod scores when only one marker were used in any given pedigree. 
Interpretation is as for normal lod scores, with the caveat that for the regular polylocus
lod scores, the "theta" does not have much meaning, since the markers used are at different 
points in the genome, perhaps as much as 10% recombination fraction away.  Adding more meioses,
up to this point still adds power to the analysis, but when you are more than 10cM away, it
becomes counterproductive to add more markers, as if they were at the same point - in those
situations, multipoint polylocus lod scores may be substantially more powerful.  

7) Multilocus Haplotype Relative Risk Analysis


*************************************************************
*       Program HRRMULT - Version 2.1   (2/5/96)            *
*       AJHG 56:777-787 (1995)                              *
*       Multilocus Haplotype Relative Risk Analysis         *
*************************************************************

     Alpha     N   Inter.  Position   Map Position      Lod Score       -2ln(LR)
  1.000000    485.      0         0       -4.25866        0.00000        0.00000
  1.000000     20.      0         1       -0.10034        4.41145       20.31549
  1.000000     20.      0         2       -0.08935        5.51186       25.38307
  1.000000     20.      0         3       -0.07859        6.76540       31.15581
  1.000000     20.      0         4       -0.06807        8.10525       37.32604
  1.000000     20.      0         5       -0.05776        9.34492       43.03493
  1.000000     20.      0         6       -0.04766       10.15622       46.77110
  0.903380     20.      0         7       -0.03775       10.26342       47.26480
  0.749131     20.      0         8       -0.02804       10.25928       47.24572
  0.622254     20.      0         9       -0.01852       10.25520       47.22692
  0.517765     20.      0        10       -0.00917       10.25120       47.20852
  0.431452     20.      1         0        0.00000       10.24726       47.19038
  0.431175     20.      1         1        0.00100       10.41172       47.94774
  0.430277     20.      1         2        0.00200       10.56382       48.64819
  0.999986    396.      1         3        0.00301       11.47395       52.83950
  0.604671    239.      1         4        0.00402       11.61899       53.50741
  0.474436    142.      1         5        0.00503       11.61904       53.50765
  0.474436    142.      2         0        0.00503       11.61904       53.50765
  0.999994     87.      2         1        0.01420       11.50964       53.00385
  0.999997     46.      2         2        0.02355       11.23010       51.71653
  0.999996     31.      2         3        0.03307       11.07330       50.99442
  0.994983     23.      2         4        0.04278       10.97677       50.54991
  1.000000     20.      2         5        0.05268       10.85818       50.00378
  1.000000     20.      2         6        0.06278       10.08478       46.44213
  1.000000     20.      2         7        0.07309        8.81444       40.59199
  1.000000     20.      2         8        0.08362        7.38083       33.99000
  1.000000     20.      2         9        0.09437        6.01230       27.68767
  1.000000     20.      2        10        0.10536        4.80767       22.14015


The final analytical method employed by analysis is multilocus Haplotype Relative
Risk analysis (Rubinstein et al, 1981; Falk and Rubinstein, 1987; Terwilliger and Ott, 1992),
following an extension of the likelihood ratio test algorithm I used
above (Terwilliger, 1995).  In this method the transmitted and not transmitted 
alleles at each locus are analyzed jointly in a multilocus test for linkage disequilibrium
assuming one allele per locus to be associated with the disease.  The statistic presented
is of the form of a likelihood ratio test based on some assumptions about the rate of decay
of linkage disequilibrium as a function of map distance and age of the disease allele in the
study population (cf. Terwilliger, 1995).�The last column is the most relevant statistic,
which is very conservatively interpretable as following a chi-square distribution with 1df
in a one-sided manner.  The previous column "Lod Score" is the same statistic on a common
logarithmic scale, which may be more intuitive for those coming from a genetics background.
This method has been used in a number of studies to find the most likely fine map location
of disease genes based on linkage disequilibrium data (cf. Haberhausen et al, 1995; Nikali et al, 
1995; Raha-Chowdhury et al, 1995).  In this case there is strong evidence for linkage
disequilibrium when both loci are considered jointly (p = 0.00000000000013), and the most likely
location of the disease gene is near to the second locus (as was also pretty clear from
the linkage results).


OVERALL CONDENSED OUTPUT FILE summary.out

Summary of statistical results from ANALYZE

Autosomal Data:

  Loc     x(cM)    Z(t) Theta    HetChi  Z(a,t) Theta Alpha     Sibpair      TDT  HRR-LRT  HRR-2xn
    2   0.00000   0.909  0.38             3.849  0.10  0.35    0.005694 0.000034 0.017889 0.015534
    3   0.00502   5.583  0.28    33.511          0.08  0.39    0.000000 0.000000 0.000000 0.000000

  Loc     x(cM)   N-Z(t) N-HetChi N-Z(a,t)   P-Z(t) P-HetChi P-Z(a,t)  MP-Z(t) MP-HetChi MP-Z(a,t)
    2   0.00000    3.376    0.012             1.243             5.639    0.601               5.622
    3   0.00502    8.841    0.337             5.583   33.511             5.505    33.607

Maximum Multilocus HRR =   53.50765 ~ Chi-Square(1) - One-sided


Loc  = Locus Number
x(cM) = Map Distance from the first marker (Locus 2) in Haldane cM
Z(t) = Maximum Lod Score
HetChi = Chi-Square for Heterogeneity when Z(t) >= 3
Z(a,t) = Maximum Lod Score with Heterogeneity - when Z(t) < 3
Sibpair = p-value for Affected Sib Pair Mean Test
TDT = p-value for Likelihood-Based TDT
HRR-LRT = p-value for Likelihood-Based Haplotype Relative Risk
HRR-2xn = p-value for 2 x n Contingency Table Chi-Square HRR test
N-Z(t) = Maximum Lod Scores When Extended Pedigrees are Broken into Nuclear Pedigrees
N-HetChi = Chi-square for Heterogeneity (Nuclear Pedigrees)
N-Z(a,t) = Maximum Lod Score with Heterogeneity (Nuclear Pedigrees)
P-Z(t) = Polylocus Maximum Lod Score
P-HetChi = Chi-square for Heterogeneity (Polylocus)
P-Z(a,t) = Maximum Polylocus Lod Score with Heterogeneity
MP-Z(t) = Maximum Multipoint Polylocus Lod Score
MP-HetChi = Chisquare for Heterogeneity (Multipoint Polylocus)
MP-Z(a,t) = Maximum Multipoint Polylocus Lod Score with Heterogeneity


The above summary file presents the salient statistical information for all marker loci
from all of the aforementioned statistical tests.  The first column gives the map position
of each marker locus (with the first marker fixed at map position 0).  Then the lod score,
estimated recombination fraction, heterogeneity test statistic, estimated values of theta
and alpha from the heterogeneity test - then the p-values from the nonparametric tests,
sibpair (affected sib pair mean test), TDT (likelihood statistic), HRR (likelihood statistic,
and 2 x n table chi-square statistic).  The following table contains the remaining statistics,
linkage on nuclear pedigrees + heterogeneity; polylocus linkage + heterogeneity; multipoint
polylocus linkage _ heterogeneity.  In each linkage analysis situation, when there was a
lod score under homogeneity greater than 3, the chi-square statistic for heterogeneity given
linkage is presented, and when there is not significant evidence of linkage the lod score
with both theta and alpha as free parameters is given instead of the heterogeneity test since
linkage has not been established yet.

NONPARAMETRIC ONLY OPTION

There is an alternative to the full analysis package which only performs the nonparametric analyses
(and is therefore much faster than the full linkage analysis).  To use this, follow the same 
instructions as described above, but instead of running the analysis program, use nonparametric instead.

To run it in foreground - type

nonparametric

To run it in background - type

nohup nonparametric > nonparametric.log &

The output files are similar to those described above - and are called
nonparam.final - equivalent to analyze.final
nonparam.summ  - equivalent to summary.out from analysis.

All output in the nonparam.final is as described for analysis above, and the example
output file nonparam.summ is shown below for this same dataset, which should be self-
explanatory since the same data are shown in summary.out above:

Summary of statistical results from NONPARAM

Autosomal data: 

                                    P-VALUES
                   --------------------------------------------
  Loc     x(cM)     Sibpair         TDT     HRR-LRT     HRR-2xn
    2   0.00000    0.005694    0.000034    0.017889    0.015534
    3   0.00503    0.000000    0.000000    0.000000    0.000000


Maximum Multilocus HRR =   53.50765 ~ Chi-Square(1) - One-sided


Loc  = Locus Number
x(cM) = Map Distance from the first marker (Locus 2) in Haldane cM
Sibpair = p-value for Affected Sib Pair Mean Test
TDT = p-value for Likelihood-Based TDT
HRR-LRT = p-value for Likelihood-Based Haplotype Relative Risk
HRR-2xn = p-value for 2 x n Contingency Table Chi-Square HRR test

USEFUL ACCESSORY PROGRAMS:

DOWNFREQ - Estimates allele frequencies for marker loci and allows the user to downcode
the data - especially useful is the option to eliminate alleles which never occur in your
complete dataset to streamline the calculations

EXTRACT - reorders the loci in your pedigree and parameter files to put them in chromosome
order - this is very important if you plan to use the polylocus or multipoint HRR results,
and is used for determination of map positions in the summary.out and nonparam.summ files.

(they are described briefly in accompanying documents)

REFERENCES:

Linkage Analysis with Heterogeneity:

Bailey, N.T.J. (1961) "Introduction to the mathematical theory of genetic linkage."
	Oxford:Clarendon Press.

Chotai, J. (1984) "On the lod score method in linkage analysis." Ann Human Genet 48:359-378.

Cottingham, R.W., R.M. Idury, A.A. Schaffer (1993) "Faster sequential genetic linkage
	computations." Am J Hum Genet 53:252-263.

Davies, R.B. (1977) "Hypothesis testing when a nuisance parameter is present only
	under the alternative." Biometrika 64:247-254.

Faraway, J.J. (1993) "Distribution of the admixture test for the detection of linkage
	under heterogeneity." Genetic Epidemiology 10:75-83.

Lathrop, G.M., J.M. Lalouel, C. Julier et. al. (1984) "Strategies for multilocus 
     	linkage analysis in humans." PNAS 81:3443-3446.

Morton, N.E. (1955) "Sequential tests for the detection of linkage." Am J Hum Genet
	7:277-318.

Ott, J.  (1991)  "Analysis of Human Genetic Linkage", 2nd edition. Baltimore, Johns 
     	Hopkins University Press.

Smith, C.A.B. (1953) "The detection of linkage in human genetics." J R Statist Soc 15B:153-184.

Smith, C.A.B. (1963) "Testing for heterogeneity of recombination fraction values in human
	genetics." Ann Hum Genet 27:175-182.

Terwilliger, J.D., and J. Ott (1994) "Handbook of Human Genetic Linkage" Johns Hopkins
	University Press.

TDT Analysis:

Ewens, W.J., and R.S. Spielman (1995) "The transmission/disequilibrium test: history,
	subdivision, and admixture." Am J Hum Genet 57:455-464.

Ott, J. (1989) "Statistical properties of the haplotype relative risk." Genetic Epidemiology 6:127-130.

Spielman, R.S., R.E. McGinnis, and W.J. Ewens (1993) "Transmission test for linkage
     	disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus
     	(IDDM).  Am J Human Genet 52:506-516

Terwilliger, J.D. (1995) "A powerful likelihood method for the analysis of linkage 
     	disequilibrium between trait loci and one or more polymorphic marker loci" Am J 
     	Human Genet 56:777-787

Terwilliger, J.D. (1996a) "Response to Sham et al." Am J Hum Genet (In Press)

Terwilliger, J.D. (1996b) "Pedigree-based likelihood methods for analysis of linkage
     	and linkage disequilibrium for one or more polymorphic marker loci" (In Review)

HRR Analysis:

Falk, C.T., and P. Rubinstein (1987) "Haplotype Relative Risks: an easy reliable way to
	construct a proper control sample for risk calculations." Ann Hum Genet 51:227-233.

Hill, W.G., and B.S. Weir (1994) "Maximum-likelihood estimation of gene location by
	linkage disequilibrium." Am J Hum Genet 54:705-714.

Ott, J. (1989) "Statistical properties of the haplotype relative risk." Genetic Epidemiology 6:127-130.

Rubinstein, P., M. Walker, C. Carpenter, et al. (1981) "Genetics of HLA disease associations:
	The use of the haplotype relative risk (HRR) and the "haplo-delta" (Dh) estimates
	in juvenile diabetes from three racial groups." Hum Immunol 3:384.

Terwilliger, J.D. (1995) "A powerful likelihood method for the analysis of linkage 
     	disequilibrium between trait loci and one or more polymorphic marker loci" Am J 
     	Human Genet 56:777-787

Terwilliger, J.D. (1996a) "Response to Sham et al." Am J Hum Genet (In Press)

Terwilliger, J.D. (1996b) "Pedigree-based likelihood methods for analysis of linkage
     	and linkage disequilibrium for one or more polymorphic marker loci" (In Review)

Terwilliger, J.D. and J. Ott (1992) "A haplotype-based haplotype relative risk 
     	statistic" Human Heredity 42:337-346.

Sibpair Analysis:

Blackwelder, W.C. and R.C. Elston (1985) "A comparison of sib-pair linkage tests 
     	for disease susceptibility loci". Genetic Epidemiology 2:85-97

Eaves L., and J. Meyer (1994) "Locating human quantitative trait loci: guidelines
	for the selection of sibling pairs for genotyping." Behavior Genetics 24:443-455

Penrose, L.S. (1935) "The detection of autosomal linkage in data which consist of
	pairs of brothers and sisters of unspecified parentage." Ann Eugenics 6:133-38

Risch, N., and H. Zhang (1995) "Extreme discordant sib-pairs for mapping quantitative trait
	loci in humans." Science 268:1584-1589.

Suarez, B.K., and P. Van Eerdewegh (1984) "A comparison of three affected sib-pair scoring
	methods to detect HLA-linked disease susceptibility genes." Am J Med Genet 18:135-146.

Terwilliger, J.D.; W.D. Shannon; G.M. Lathrop; J.P. Nolan; L.R. Goldin; G.A. Chase; D.E. Weeks.
   	(1996) "True and False Positive Peaks in Genome-wide Scans: Applications of Length-Biased
	Sampling to Linkage Mapping." (In Review)

Linkage Analysis on Nuclear Pedigrees:

Terwilliger, J.D. and J. Ott (1994) "Handbook of Human Genetic Linkage" Baltimore 
     	and London, Johns Hopkins University Press.

Polylocus Analysis:

Baron, M., N. Risch, R. Hamburger, B. Mandel, S. Kushner, M. Newman, D. Drumer, R.H. Belmaker
	(1987) "Genetic linkage between X-chromosomal markers and bipolar affective illness."
  	Nature 326:289-292

Pekkarinen, P., J.D. Terwilliger, P.-E. Bredbacka, J. Lonnqvist, L. Peltonen (1995)
	"Evidence of a susceptibility locus for bipolar disorder on Xq24-q27.1 in a
	Finnish pedigree." Genome Research (In Press).

Schaid D.J., and R.C. Elston (1994) "Combining two-point genetic linkage analyses
	using mapping functions." Genetic Epidemiology 11:1-17.

Terwilliger, J.D. and J. Ott (1993) "A novel polylocus method for linkage analysis 
     	using the lod score or affected sib-pair methods"  Genetic Epidemiology 
     	10(6):477-82.

Multilocus HRR:

Falk, C.T., and P. Rubinstein (1987) "Haplotype Relative Risks: an easy reliable way to
	construct a proper control sample for risk calculations." Ann Hum Genet 51:227-233.

Haberhausen, G., I. Schmitt, A. Kohler, U. Paters, S. Rider, J. Chelly, J.D. Terwilliger, 
	A.P. Monaco, U. Muller (1995) "Assignment of the dystonia-parkinsonism syndrome locus, 
	DYT3, to a small region within a 1.8-Mb YAC contig of Xq13.1" Am J Human Genet 57:644-650.

Nikali, K. A. Suomalainen, J.D. Terwilliger, T. Koskinen, J Weissenbach, L Peltonen (1995) 
	"A new locus for a hereditary ataxia identified by genotyping of four affected 
    	individuals." Am J Human Genet ??:???-???

Raha-Chowdhury, R., D.J. Bowen, C. Stone, J.J. Pointon, J.D. Terwilliger, J.D. Shearman,
	K.J.H. Robson, A. Bomford, M. Worwood (1995) "New polymorphic microsatellite markers
	place the hemochromatosis gene telomeric to D6S105." Human Molecular Genetics (In Press).

Rubinstein, P., M. Walker, C. Carpenter, et al. (1981) "Genetics of HLA disease associations:
	The use of the haplotype relative risk (HRR) and the "haplo-delta" (Dh) estimates
	in juvenile diabetes from three racial groups." Hum Immunol 3:384.

Terwilliger, J.D. (1995) "A powerful likelihood method for the analysis of linkage 
     	disequilibrium between trait loci and one or more polymorphic marker loci" Am J 
     	Human Genet 56:777-787

Terwilliger, J.D. (1996a) "Response to Sham et al." Am J Hum Genet (In Press)

Terwilliger, J.D. (1996b) "Pedigree-based likelihood methods for analysis of linkage
     	and linkage disequilibrium for one or more polymorphic marker loci" (In Review)

Terwilliger, J.D. and J. Ott (1992) "A haplotype-based haplotype relative risk 
     	statistic" Human Heredity 42:337-346.