Documentation for DOWNFREQ program

DOWNFREQ is a very simple program which can be used to get rough estimates of
marker allele frequencies, and to downcode the number of alleles at a marker
locus based on their frequencies in the pedigree data.

INPUT FILES:

inped.dat   - LINKAGE format pedigree file with a disease locus followed by a
              series of allele numbers marker loci.
indata.dat  - LINKAGE format parameter file with a disease locus followed by a
              series of allele numbers marker loci.

OUTPUT FILES:

Outfile.txt - summary of the allele frequency estimates and, where relevant,
              the post-downcoding allele frequency estimates.
datain.dat  - LINKAGE format parameter file with the disease locus followed by
              the allele numbers marker loci, with estimated allele
              frequencies, or with the allele frequencies corresponding to the
              downcoded marker alleles after running DOWNFREQ.
pedin.dat   - LINKAGE format pedigree file with downcoded marker genotypes;
              written only if the downcoding option is used. If the program is
              used strictly for allele frequency estimation, this file is not
              created, as nothing would have changed from inped.dat.

ALLELE FREQUENCY ESTIMATION

When you run the DOWNFREQ program, you will see a screen like the following:

$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$                                                           $$$
$$$             Program DOWNFREQ - Version 1.1                $$$
$$$                                                           $$$
$$$            Joseph D. Terwilliger   9/21/95                $$$
$$$                                                           $$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

This program estimates allele frequencies from pedigree data by
1) Allele frequencies in Founder Individuals, where typed
2) All Individuals in all pedigrees - needed to see if alleles never occur

Downcoding is also done in a rudimentary fashion by one of two simple methods:
1) Simply eliminating alleles which never occur in this pedigree material
2) Manual user-driven collapsing of alleles into allele classes

Do you wish to downcode the data (Y/n) ?

If you answer no (for the sample data in the directory 2-point/sample/autosomal),
you will get screen output like the following:

Reading indata.dat
Reading inped.dat

Frequency estimates for locus 2

       Allele Frequency Estimates
   ---------------------------------
   Allele   Total   Founders   All Individuals
      1      662    0.310417      0.394048
      2      198    0.172917      0.117857
      3      287    0.189583      0.170833
      4      299    0.183333      0.177976
      5      234    0.143750      0.139286

   Heterozygosity = 0.750576
   Polymorphism Information Content = 0.714898

Frequency estimates for locus 3

       Allele Frequency Estimates
   ---------------------------------
   Allele   Total   Founders   All Individuals
      1      393    0.085417      0.233929
      2      365    0.302083      0.217262
      3      922    0.612500      0.548810

   Heterozygosity = 0.596883
   Polymorphism Information Content = 0.530319

Which allele frequency estimates should be used?
1 = founders; 2 = all individuals; 3 = frequencies given in original indata.dat

In this output, for each locus, it is indicated how many times each allele
occurred in the pedigree file (over all individuals), and then the allele
frequency estimates are given from two different methods: first they are
estimated from all typed founder individuals (if all your founders are typed,
this gives exactly the maximum likelihood estimates of the allele frequencies),
and then they are estimated from all individuals in all pedigrees - including
multiple sibs and parent/child combinations, which are not really independent.
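As a cross-check on the figures in these tables, here is a minimal sketch (in
Python; this is not part of DOWNFREQ itself) showing how the "All Individuals"
frequencies, the heterozygosity (one minus the sum of squared allele
frequencies) and the PIC can be recomputed from the allele counts reported for
locus 2:

    # Minimal cross-check (not part of DOWNFREQ): recompute the
    # "All Individuals" frequencies, heterozygosity and PIC for locus 2
    # from the allele counts in the table above.
    counts = [662, 198, 287, 299, 234]             # "Total" column, locus 2
    total = sum(counts)                            # 1680 alleles counted
    freqs = [c / total for c in counts]            # 0.394048, 0.117857, ...

    het = 1.0 - sum(p * p for p in freqs)          # heterozygosity
    pic = het - sum(2 * freqs[i] ** 2 * freqs[j] ** 2
                    for i in range(len(freqs))
                    for j in range(i + 1, len(freqs)))

    print("Heterozygosity = %.6f" % het)           # ~0.750576
    print("PIC            = %.6f" % pic)           # ~0.714898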
However, asymptotically the all-individuals method also gives an unbiased
estimate of the allele frequencies, and when you have large sets of small
pedigrees with founders often untyped, it may be more reliable than the
founders-only estimate. Further, in linkage analysis it is well known that the
allele frequencies have a rather significant effect on the lod score: there is
an increased tendency to get false positive evidence of linkage when an allele
which occurs commonly in the population is treated as rare in the analysis
(the typical effect of using 1/n as the allele frequency for an n-allele
marker locus, where n is large). If you use the second option (estimation from
all individuals), you are well insured against this problem, as if anything
there would be a tendency to overestimate the frequency of alleles which might
appear to be cosegregating with the trait, rather than to make them too small.
There is often a naive assumption that you would have lower power to detect
linkage if you used the right allele frequencies than if you used 1/n as your
allele frequency estimates. In fact, the difference between the expected lod
scores under linkage and under no linkage becomes smaller when you use 1/n, so
while you get more "high positive lod scores" with 1/n estimates, the power to
discriminate the null hypothesis from the alternative hypothesis is greatly
diminished (cf. Ott, 1992; Terwilliger and Ott, 1994, pp. 272-273).

You are then asked to select which allele frequency estimates to use in the
output file, datain.dat. You can select from:

1) Allele frequency estimates from founder individuals who were typed
2) Allele frequency estimates from ALL typed individuals in ALL pedigrees
3) The originally specified allele frequencies from indata.dat

If you have all your founder individuals typed, option 1 is probably the best
choice. If you have many untyped founders, option 2 may be your best hedge
against false positives, and it is the method I generally recommend for most
datasets (you can see the differences between the estimates in the tables
above). Option 3 would almost never be used - at least when no downcoding is
done, since that would mean the program has not done anything at all!

The resulting output file is called datain.dat and is ready for use in the
analysis package, or any other LINKAGE-based programs. Since the program does
not change anything in the pedigree file when downcoding is not performed, no
pedin.dat file is written; you should manually copy inped.dat to pedin.dat
before continuing.

DOWNCODING MARKER SYSTEMS

The other use of the program is for downcoding marker allele systems in an
unbiased manner. There are two options, as you will see. If we go back to the
opening screen:

$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
$$$                                                           $$$
$$$             Program DOWNFREQ - Version 1.1                $$$
$$$                                                           $$$
$$$            Joseph D. Terwilliger   9/21/95                $$$
$$$                                                           $$$
$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$

This program estimates allele frequencies from pedigree data by
1) Allele frequencies in Founder Individuals, where typed
2) All Individuals in all pedigrees - needed to see if alleles never occur

Downcoding is also done in a rudimentary fashion by one of two simple methods:
1) Simply eliminating alleles which never occur in this pedigree material
2) Manual user-driven collapsing of alleles into allele classes

Do you wish to downcode the data (Y/n) ?
This time we would answer 'Y' (or just hit 'Enter') to say we do wish to
downcode the data. The program then gives you two options, as follows:

Do you wish to :
(1) Eliminate missing alleles or
(2) Collapse alleles into classes manually?
The first option will keep exact lod scores, while the second will be
approximate, but will allow for much greater reduction in number of alleles.

ELIMINATE MISSING ALLELES OPTION:

The first option scans through your pedigree file and renumbers the marker
alleles so that they run sequentially from 1 to m; alleles which never occur
are eliminated, as they are irrelevant. This kind of downcoding preserves the
exact lod scores of the original marker locus (provided you use the follow-up
option, where relevant, to add one additional allele corresponding to "all the
alleles which never occur in this pedigree data", as explained in Ott (1992)).
IT IS RECOMMENDED TO RUN YOUR PEDIGREE MATERIAL THROUGH THIS FORM OF
DOWNCODING IN EVERY INSTANCE BEFORE RUNNING LINKAGE, AS IT SAVES SUBSTANTIAL
COMPUTING RESOURCES WHEN ALLELES DO NOT OCCUR IN YOUR PEDIGREES; FURTHER, THE
PROGRAM WILL THEN GENERATE A PEDIN.DAT FILE FOR YOU EVEN IF NO ALLELES ARE
ABSENT FROM YOUR PEDIGREE DATA!

COLLAPSE ALLELES INTO CLASSES MANUALLY OPTION:

In this option, you are given the opportunity to collapse the alleles into
classes manually, according to their allele frequencies. In other words, you
can look at the frequency with which certain alleles occur in your pedigree
data and lump sets of alleles together as if they were one allele. For
example, in our sample dataset you could renumber the marker alleles such that
alleles 1, 2, and 3 were collapsed into a single new allele. There is some
associated loss of information, of course, since 1/2 heterozygotes would then
be considered homozygous x/x for the new "superallele" x. That said, it is
typically possible to downcode alleles in such a way that you minimize the
loss of information.

This is most useful when you want to carry out multipoint linkage analysis and
use multiple markers jointly. Consider, for a moment, a set of six markers,
each with 25 alleles and a heterozygosity of 85%. Typically, if you are
careful enough in the downcoding process, it is possible to look at the allele
frequency estimates and collapse each marker down to four alleles with
approximately equal frequencies of 25%. The new marker heterozygosity would
then be 75%, so you would have lost a number of informative meioses at each
marker. However, in the LINKAGE programs there is a bottleneck in how large a
problem can be analyzed, and that bottleneck is related to the number of
possible haplotypes the program must consider. A 25-allele marker locus and a
2-allele disease locus give 50 haplotypes. If you wanted to consider two of
these 25-allele marker loci jointly, that would require 25 x 25 x 2 = 1250
haplotypes - on a PC, for example, the upper bound on the number of haplotypes
is about 128, so this would be totally impossible. However, if the number of
alleles were reduced to 4, that would require only 4 x 4 x 2 = 32 haplotypes,
which is no problem. Even on large UNIX workstations, when the number of
haplotypes exceeds a few hundred, the computing time becomes intractable as
well. Now, with these 4-allele loci, we could use four of them jointly with a
need for only 4 x 4 x 4 x 4 x 2 = 512 haplotypes.
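The arithmetic behind these haplotype counts is simply the product of the
numbers of alleles at the loci analysed together. A short illustrative sketch
(Python; the function n_haplotypes is just for this example, it is not part of
DOWNFREQ or LINKAGE):

    # Illustrative only: the number of joint haplotypes the LINKAGE programs
    # must consider is the product of the allele counts at the loci analysed
    # together.
    def n_haplotypes(allele_counts):
        n = 1
        for k in allele_counts:
            n *= k
        return n

    print(n_haplotypes([25, 2]))           # one 25-allele marker + disease locus = 50
    print(n_haplotypes([25, 25, 2]))       # two 25-allele markers + disease locus = 1250
    print(n_haplotypes([4, 4, 2]))         # two downcoded 4-allele markers + disease = 32
    print(n_haplotypes([4, 4, 4, 4, 2]))   # four downcoded 4-allele markers + disease = 512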
Now, what is the probability, given that there is now only 75% heterozygosity
at each of the loci, that at least one of the four markers is heterozygous in
a given meiosis? That is equal to 1 - (1-0.75)^4 = 99.6%, so virtually all
meioses will be informative for at least one locus, whereas with the single
25-allele marker only 85% of meioses were informative, and if we allowed for
1250 haplotypes and considered two such markers jointly, we would still have
only 1 - (1-0.85)^2 = 97.75% of meioses informative at one of the two loci,
while needing much greater computer resources than for the four 4-allele
marker loci.

When doing this stepwise manual downcoding it is of course desirable to
eliminate missing alleles as well, by collapsing them into "superallele"
classes together with other alleles. You use the program to select a set of
alleles to collapse together; the program then updates the allele frequencies
and asks if you want to continue. In this way you can downcode stepwise until
you have reached the desired number of alleles for your final result.

As an example, I will lead you through the downcoding of the first marker
locus ("locus 2" in the indata.dat file), in which we first combine alleles 2
and 3, as follows:

Frequency estimates for locus 2

       Allele Frequency Estimates
   ---------------------------------
   Allele   Total   Founders   All Individuals
      1      662    0.310417      0.394048
      2      198    0.172917      0.117857
      3      287    0.189583      0.170833
      4      299    0.183333      0.177976
      5      234    0.143750      0.139286

   Heterozygosity = 0.750576
   Polymorphism Information Content = 0.714898

Do you wish to collapse alleles at locus 2? (Y/n)

(Respond 'y' or 'Enter' to say yes)

Please list all alleles to be lumped together, followed by a 0

(Here you would enter '2 3 0' to collapse alleles 2 and 3 into one new allele)

New alleles and corresponding frequencies:
   Allele   Founders    All Inds
      1     0.362500    0.288690
      2     0.310417    0.394048
      3     0.183333    0.177976
      4     0.143750    0.139286

   Old Heterozygosity = 0.750576
   Old Polymorphism Information Content = 0.714898
   New Heterozygosity = 0.710308
   New Polymorphism Information Content = 0.658822

Do you wish to collapse alleles at locus 2? (Y/n)

(Let us say that we wish to downcode this locus further by combining "new"
alleles 3 and 4 into a single class - so we answer 'Y' again here)

Please list all alleles to be lumped together, followed by a 0

(Now we would enter '3 4 0' to collapse alleles 3 and 4 together - the list of
alleles to be lumped together can be as long as you want, so long as it ends
with a 0)

New alleles and corresponding frequencies:
   Allele   Founders    All Inds
      1     0.327083    0.317262
      2     0.310417    0.394048
      3     0.362500    0.288690

   Old Heterozygosity = 0.750576
   Old Polymorphism Information Content = 0.714898
   New Heterozygosity = 0.660729
   New Polymorphism Information Content = 0.586812

Do you wish to collapse alleles at locus 2? (Y/n)

(This time we would answer 'n', since we do not want to downcode this locus
any further.)
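For reference, here is a minimal sketch (Python; this is not DOWNFREQ's own
code, and the collapse() and heterozygosity() helpers are purely illustrative)
of what each collapsing step does to the frequency estimates and the
heterozygosity, using the "All Individuals" frequencies for locus 2:

    # Minimal sketch (not DOWNFREQ's own code): lump a set of alleles into one
    # class, renumber so the merged class becomes allele 1, and recompute the
    # heterozygosity from the updated frequencies.  Note that the renumbering
    # of the remaining alleles here need not match the program's exactly; the
    # heterozygosity does not depend on the ordering.
    def collapse(freqs, lump):
        """freqs: dict {allele: frequency}; lump: set of alleles to merge."""
        merged = sum(freqs[a] for a in lump)
        kept = [freqs[a] for a in sorted(freqs) if a not in lump]
        new = [merged] + kept                      # merged class listed first
        return {i + 1: f for i, f in enumerate(new)}

    def heterozygosity(freqs):
        return 1.0 - sum(p * p for p in freqs.values())

    # "All Individuals" frequencies for locus 2 before any downcoding:
    locus2 = {1: 0.394048, 2: 0.117857, 3: 0.170833, 4: 0.177976, 5: 0.139286}

    step1 = collapse(locus2, {2, 3})               # lump alleles 2 and 3
    print(heterozygosity(step1))                   # ~0.710308 (was ~0.750576)

    step2 = collapse(step1, {3, 4})                # then lump new alleles 3 and 4
    print(heterozygosity(step2))                   # ~0.660729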
Notice that the heterozygosity has dropped from 75% to 66% as a result of
reducing the number of alleles from 5 to 3, and the PIC has decreased from 71%
to 59% - these figures are updated at every step so you know how much
information you have lost through your efforts.

Final data for locus 2

             Estimated Frequencies
   Original   New    Founders    All Inds
      1        2     0.310417    0.394048
      2        3     0.172917    0.117857
      3        3     0.189583    0.170833
      4        1     0.183333    0.177976
      5        1     0.143750    0.139286

   Old Heterozygosity = 0.750576
   Old Polymorphism Information Content = 0.714898

Press <Enter> to Continue

New alleles and corresponding frequencies:
   Allele   Founders    All Inds
      1     0.327083    0.317262
      2     0.310417    0.394048
      3     0.362500    0.288690

   Old Heterozygosity = 0.750576
   Old Polymorphism Information Content = 0.714898
   New Heterozygosity = 0.660729
   New Polymorphism Information Content = 0.586812

*The first table above tells you how the alleles have been renumbered from the
original data, and the second table gives the new, updated alleles and their
frequencies, for your reference.

The same procedure can then be repeated for each locus in your file. At the
end, you will again have the option of using the original allele frequencies
from indata.dat, or having them estimated from the founders or from all
individuals in the pedigree set, as before. This generates datain.dat and
pedin.dat with the new downcoded dataset, and Outfile.txt containing a summary
of the downcoding and allele frequency estimation you have carried out.

REFERENCES:

Ott, J. (1992) "Strategies for characterizing highly polymorphic markers in
human gene mapping." Am J Hum Genet 51:283-290.

Terwilliger, J.D., and J. Ott (1994) "Handbook of Human Genetic Linkage."
Johns Hopkins University Press.