Documentation for DOWNFREQ program

DOWNFREQ is a very simple program which can be used to get rough estimates of
marker allele frequencies, and to downcode the number of alleles at a marker
locus based on their frequencies in the pedigree data.

INPUT FILES:

inped.dat   - LINKAGE format pedigree file with a disease locus followed by a
              series of allele numbers marker loci.
indata.dat  - LINKAGE format parameter file with a disease locus followed by a
              series of allele numbers marker loci.

OUTPUT FILES:

Outfile.txt - summary of the allele frequency estimates and, where relevant,
              the post-downcoding allele frequency estimates.
datain.dat  - LINKAGE format parameter file with the disease locus followed by
              the allele numbers marker loci, with estimated allele
              frequencies, or with the allele frequencies corresponding to the
              downcoded marker alleles after running DOWNFREQ.
pedin.dat   - LINKAGE format pedigree file with downcoded marker genotypes;
              written only if the downcoding option is used. If the program is
              used strictly for allele frequency estimation, this file is not
              created, as nothing would have changed from inped.dat.

ALLELE FREQUENCY ESTIMATION

When you run the DOWNFREQ program, you will see a screen like the following:

$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$                                                           $$$
$$$             Program DOWNFREQ - Version 1.1                $$$
$$$                                                           $$$
$$$            Joseph D. Terwilliger   9/21/95                $$$
$$$                                                           $$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

This program estimates allele frequencies from pedigree data by
1) Allele frequencies in Founder Individuals, where typed
2) All Individuals in all pedigrees - needed to see if alleles never occur

Downcoding is also done in a rudimentary fashion by one of two simple methods:
1) Simply eliminating alleles which never occur in this pedigree material
2) Manual user-driven collapsing of alleles into allele classes

Do you wish to downcode the data (Y/n) ?

If you answer no (for the sample data in the directory 2-point/sample/autosomal),
you will get screen output like the following:

Reading indata.dat
Reading inped.dat

Frequency estimates for locus 2

       Allele Frequency Estimates
   ---------------------------------
   Allele   Total   Founders   All Individuals
      1      662    0.310417      0.394048
      2      198    0.172917      0.117857
      3      287    0.189583      0.170833
      4      299    0.183333      0.177976
      5      234    0.143750      0.139286

   Heterozygosity = 0.750576
   Polymorphism Information Content = 0.714898

Frequency estimates for locus 3

       Allele Frequency Estimates
   ---------------------------------
   Allele   Total   Founders   All Individuals
      1      393    0.085417      0.233929
      2      365    0.302083      0.217262
      3      922    0.612500      0.548810

   Heterozygosity = 0.596883
   Polymorphism Information Content = 0.530319

Which allele frequency estimates should be used?
1 = founders; 2 = all individuals; 3 = frequencies given in original indata.dat

In this output, for each locus, it is indicated how many times each allele
occurred in the pedigree file (over all individuals), and then the allele
frequency estimates are given from two different methods: first they are
estimated from all typed founder individuals (if all your founders are typed,
this gives exactly the maximum likelihood estimates of the allele frequencies),
and then they are estimated from all individuals in all pedigrees - including
multiple sibs and parent/child combinations, which are not really independent.
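As a cross-check on the figures in these tables, here is a minimal sketch (in
Python; this is not part of DOWNFREQ itself) showing how the "All Individuals"
frequencies, the heterozygosity (one minus the sum of squared allele
frequencies) and the PIC can be recomputed from the allele counts reported for
locus 2:

    # Minimal cross-check (not part of DOWNFREQ): recompute the
    # "All Individuals" frequencies, heterozygosity and PIC for locus 2
    # from the allele counts in the table above.
    counts = [662, 198, 287, 299, 234]             # "Total" column, locus 2
    total = sum(counts)                            # 1680 alleles counted
    freqs = [c / total for c in counts]            # 0.394048, 0.117857, ...

    het = 1.0 - sum(p * p for p in freqs)          # heterozygosity
    pic = het - sum(2 * freqs[i] ** 2 * freqs[j] ** 2
                    for i in range(len(freqs))
                    for j in range(i + 1, len(freqs)))

    print("Heterozygosity = %.6f" % het)           # ~0.750576
    print("PIC            = %.6f" % pic)           # ~0.714898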
However, asymptotically the all-individuals method also gives an unbiased
estimate of the allele frequencies, and when you have large sets of small
pedigrees with founders often untyped, it may be more reliable than the
founders-only estimate. Further, in linkage analysis it is well known that the
allele frequencies have a rather significant effect on the lod score: there is
an increased tendency to get false positive evidence of linkage when an allele
which occurs commonly in the population is treated as rare in the analysis
(the typical effect of using 1/n as the allele frequency for an n-allele
marker locus, where n is large). If you use the second option (estimation from
all individuals), you are well insured against this problem, as if anything
there would be a tendency to overestimate the frequency of alleles which might
appear to be cosegregating with the trait, rather than to make them too small.
There is often a naive assumption that you would have lower power to detect
linkage if you used the right allele frequencies than if you used 1/n as your
allele frequency estimates. In fact, the difference between the expected lod
scores under linkage and under no linkage becomes smaller when you use 1/n, so
while you get more "high positive lod scores" with 1/n estimates, the power to
discriminate the null hypothesis from the alternative hypothesis is greatly
diminished (cf. Ott, 1992; Terwilliger and Ott, 1994, pp. 272-273).

You are then asked to select which allele frequency estimates to use in the
output file, datain.dat. You can select from:

1) Allele frequency estimates from founder individuals who were typed
2) Allele frequency estimates from ALL typed individuals in ALL pedigrees
3) The originally specified allele frequencies from indata.dat

If you have all your founder individuals typed, option 1 is probably the best
choice. If you have many untyped founders, option 2 may be your best hedge
against false positives, and it is the method I generally recommend for most
datasets (you can see the differences between the estimates in the tables
above). Option 3 would almost never be used - at least when no downcoding is
done, since that would mean the program has not done anything at all!

The resulting output file is called datain.dat and is ready for use in the
analysis package, or any other LINKAGE-based programs. Since the program does
not change anything in the pedigree file when downcoding is not performed, no
pedin.dat file is written; you should manually copy inped.dat to pedin.dat
before continuing.

DOWNCODING MARKER SYSTEMS

The other use of the program is for downcoding marker allele systems in an
unbiased manner. There are two options, as you will see. If we go back to the
opening screen:

$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$                                                           $$$
$$$             Program DOWNFREQ - Version 1.1                $$$
$$$                                                           $$$
$$$            Joseph D. Terwilliger   9/21/95                $$$
$$$                                                           $$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

This program estimates allele frequencies from pedigree data by
1) Allele frequencies in Founder Individuals, where typed
2) All Individuals in all pedigrees - needed to see if alleles never occur

Downcoding is also done in a rudimentary fashion by one of two simple methods:
1) Simply eliminating alleles which never occur in this pedigree material
2) Manual user-driven collapsing of alleles into allele classes

Do you wish to downcode the data (Y/n) ?
This time we would answer 'Y' (or just hit 'Enter') to say we do wish to
downcode the data. The program then gives you two options, as follows:

Do you wish to :
(1) Eliminate missing alleles or
(2) Collapse alleles into classes manually?
The first option will keep exact lod scores, while the second will be
approximate, but will allow for much greater reduction in number of alleles.

ELIMINATE MISSING ALLELES OPTION:

The first option scans through your pedigree file and renumbers the marker
alleles so that they run sequentially from 1 to m; alleles which never occur
are eliminated, as they are irrelevant. This kind of downcoding preserves the
exact lod scores of the original marker locus (provided you use the follow-up
option, where relevant, to add one additional allele corresponding to "all the
alleles which never occur in this pedigree data", as explained in Ott (1992)).
IT IS RECOMMENDED TO RUN YOUR PEDIGREE MATERIAL THROUGH THIS FORM OF
DOWNCODING IN EVERY INSTANCE BEFORE RUNNING LINKAGE, AS IT SAVES SUBSTANTIAL
COMPUTING RESOURCES WHEN ALLELES DO NOT OCCUR IN YOUR PEDIGREES; FURTHER, THE
PROGRAM WILL THEN GENERATE A PEDIN.DAT FILE FOR YOU EVEN IF NO ALLELES ARE
ABSENT FROM YOUR PEDIGREE DATA!

COLLAPSE ALLELES INTO CLASSES MANUALLY OPTION:

In this option, you are given the opportunity to collapse the alleles into
classes manually, according to their allele frequencies. In other words, you
can look at the frequency with which certain alleles occur in your pedigree
data and lump sets of alleles together as if they were one allele. For
example, in our sample dataset you could renumber the marker alleles such that
alleles 1, 2, and 3 were collapsed into a single new allele. There is some
associated loss of information, of course, since 1/2 heterozygotes would then
be considered homozygous x/x for the new "superallele" x. That said, it is
typically possible to downcode alleles in such a way that you minimize the
loss of information.

This is most useful when you want to carry out multipoint linkage analysis and
use multiple markers jointly. Consider, for a moment, a set of six markers,
each with 25 alleles and a heterozygosity of 85%. Typically, if you are
careful enough in the downcoding process, it is possible to look at the allele
frequency estimates and collapse each marker down to four alleles with
approximately equal frequencies of 25%. The new marker heterozygosity would
then be 75%, so you would have lost a number of informative meioses at each
marker. However, in the LINKAGE programs there is a bottleneck in how large a
problem can be analyzed, and that bottleneck is related to the number of
possible haplotypes the program must consider. A 25-allele marker locus and a
2-allele disease locus give 50 haplotypes. If you wanted to consider two of
these 25-allele marker loci jointly, that would require 25 x 25 x 2 = 1250
haplotypes - on a PC, for example, the upper bound on the number of haplotypes
is about 128, so this would be totally impossible. However, if the number of
alleles were reduced to 4, that would require only 4 x 4 x 2 = 32 haplotypes,
which is no problem. Even on large UNIX workstations, when the number of
haplotypes exceeds a few hundred, the computing time becomes intractable as
well. Now, with these 4-allele loci, we could use four of them jointly with a
need for only 4 x 4 x 4 x 4 x 2 = 512 haplotypes.
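The arithmetic behind these haplotype counts is simply the product of the
numbers of alleles at the loci analysed together. A short illustrative sketch
(Python; the function n_haplotypes is just for this example, it is not part of
DOWNFREQ or LINKAGE):

    # Illustrative only: the number of joint haplotypes the LINKAGE programs
    # must consider is the product of the allele counts at the loci analysed
    # together.
    def n_haplotypes(allele_counts):
        n = 1
        for k in allele_counts:
            n *= k
        return n

    print(n_haplotypes([25, 2]))           # one 25-allele marker + disease locus = 50
    print(n_haplotypes([25, 25, 2]))       # two 25-allele markers + disease locus = 1250
    print(n_haplotypes([4, 4, 2]))         # two downcoded 4-allele markers + disease = 32
    print(n_haplotypes([4, 4, 4, 4, 2]))   # four downcoded 4-allele markers + disease = 512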
Now, what is the probability, given that there is now only 75% heterozygosity
at each of the loci, that at least one of the four markers is heterozygous in
a given meiosis? That is equal to 1 - (1-0.75)^4 = 99.6%, so virtually all
meioses will be informative for at least one locus, whereas with the single
25-allele marker only 85% of meioses were informative, and if we allowed for
1250 haplotypes and considered two such markers jointly, we would still have
only 1 - (1-0.85)^2 = 97.75% of meioses informative at one of the two loci,
while needing much greater computer resources than for the four 4-allele
marker loci.

When doing this stepwise manual downcoding it is of course desirable to
eliminate missing alleles as well, by collapsing them into "superallele"
classes together with other alleles. You use the program to select a set of
alleles to collapse together; the program then updates the allele frequencies
and asks if you want to continue. In this way you can downcode stepwise until
you have reached the desired number of alleles for your final result.

As an example, I will lead you through the downcoding of the first marker
locus ("locus 2" in the indata.dat file), in which we first combine alleles 2
and 3, as follows:

Frequency estimates for locus 2

       Allele Frequency Estimates
   ---------------------------------
   Allele   Total   Founders   All Individuals
      1      662    0.310417      0.394048
      2      198    0.172917      0.117857
      3      287    0.189583      0.170833
      4      299    0.183333      0.177976
      5      234    0.143750      0.139286

   Heterozygosity = 0.750576
   Polymorphism Information Content = 0.714898

Do you wish to collapse alleles at locus 2? (Y/n)

(Respond 'y' or 'Enter' to say yes)

Please list all alleles to be lumped together, followed by a 0

(Here you would enter '2 3 0' to collapse alleles 2 and 3 into one new allele)

New alleles and corresponding frequencies:
   Allele   Founders    All Inds
      1     0.362500    0.288690
      2     0.310417    0.394048
      3     0.183333    0.177976
      4     0.143750    0.139286

   Old Heterozygosity = 0.750576
   Old Polymorphism Information Content = 0.714898
   New Heterozygosity = 0.710308
   New Polymorphism Information Content = 0.658822

Do you wish to collapse alleles at locus 2? (Y/n)

(Let us say that we wish to downcode this locus further by combining "new"
alleles 3 and 4 into a single class - so we answer 'Y' again here)

Please list all alleles to be lumped together, followed by a 0

(Now we would enter '3 4 0' to collapse alleles 3 and 4 together - the list of
alleles to be lumped together can be as long as you want, so long as it ends
with a 0)

New alleles and corresponding frequencies:
   Allele   Founders    All Inds
      1     0.327083    0.317262
      2     0.310417    0.394048
      3     0.362500    0.288690

   Old Heterozygosity = 0.750576
   Old Polymorphism Information Content = 0.714898
   New Heterozygosity = 0.660729
   New Polymorphism Information Content = 0.586812

Do you wish to collapse alleles at locus 2? (Y/n)

(This time we would answer 'n', since we do not want to downcode this locus
any further.)
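For reference, here is a minimal sketch (Python; this is not DOWNFREQ's own
code, and the collapse() and heterozygosity() helpers are purely illustrative)
of what each collapsing step does to the frequency estimates and the
heterozygosity, using the "All Individuals" frequencies for locus 2:

    # Minimal sketch (not DOWNFREQ's own code): lump a set of alleles into one
    # class, renumber so the merged class becomes allele 1, and recompute the
    # heterozygosity from the updated frequencies.  Note that the renumbering
    # of the remaining alleles here need not match the program's exactly; the
    # heterozygosity does not depend on the ordering.
    def collapse(freqs, lump):
        """freqs: dict {allele: frequency}; lump: set of alleles to merge."""
        merged = sum(freqs[a] for a in lump)
        kept = [freqs[a] for a in sorted(freqs) if a not in lump]
        new = [merged] + kept                      # merged class listed first
        return {i + 1: f for i, f in enumerate(new)}

    def heterozygosity(freqs):
        return 1.0 - sum(p * p for p in freqs.values())

    # "All Individuals" frequencies for locus 2 before any downcoding:
    locus2 = {1: 0.394048, 2: 0.117857, 3: 0.170833, 4: 0.177976, 5: 0.139286}

    step1 = collapse(locus2, {2, 3})               # lump alleles 2 and 3
    print(heterozygosity(step1))                   # ~0.710308 (was ~0.750576)

    step2 = collapse(step1, {3, 4})                # then lump new alleles 3 and 4
    print(heterozygosity(step2))                   # ~0.660729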
Notice that the heterozygosity has dropped from 75% to 66% as a result of
reducing the number of alleles from 5 to 3, and the PIC has decreased from 71%
to 59% - these figures are updated at every step so you know how much
information you have lost through your efforts.

Final data for locus 2

             Estimated Frequencies
   Original   New    Founders    All Inds
      1        2     0.310417    0.394048
      2        3     0.172917    0.117857
      3        3     0.189583    0.170833
      4        1     0.183333    0.177976
      5        1     0.143750    0.139286

   Old Heterozygosity = 0.750576
   Old Polymorphism Information Content = 0.714898

Press <Enter> to Continue

New alleles and corresponding frequencies:
   Allele   Founders    All Inds
      1     0.327083    0.317262
      2     0.310417    0.394048
      3     0.362500    0.288690

   Old Heterozygosity = 0.750576
   Old Polymorphism Information Content = 0.714898
   New Heterozygosity = 0.660729
   New Polymorphism Information Content = 0.586812

*The first table above tells you how the alleles have been renumbered from the
original data, and the second table gives the new, updated alleles and their
frequencies, for your reference.

The same procedure can then be repeated for each locus in your file. At the
end, you will again have the option of using the original allele frequencies
from indata.dat, or having them estimated from the founders or from all
individuals in the pedigree set, as before. This generates datain.dat and
pedin.dat with the new downcoded dataset, and Outfile.txt containing a summary
of the downcoding and allele frequency estimation you have carried out.

REFERENCES:

Ott, J. (1992) "Strategies for characterizing highly polymorphic markers in
human gene mapping." Am J Hum Genet 51:283-290.

Terwilliger, J.D., and J. Ott (1994) "Handbook of Human Genetic Linkage."
Johns Hopkins University Press.