[ This file contains a brief overview of the APM method, its programs,
  and its file formats. It also contains some helpful hints, guidelines,
  and an example run of the apm programs on a fictitious data file. ]

The Affected Pedigree Member Method Linkage Analysis Programs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This distribution package contains programs that use the Affected
Pedigree Member (APM) method to analyze pedigrees for linkage between a
disease trait and a number of marker loci. For each marker they report a
statistic and a p-value showing the degree of linkage under three
different weighting functions of the allele frequencies: f(p) = 1,
f(p) = 1/p, and f(p) = 1/sqrt(p) = p^(-0.5).

There are two analysis programs: apm and apmmult. Apm analyzes several
marker loci independently, while apmmult analyzes linked markers. The
programs require distinctly different file formats; apm uses what is
known as ML (many-locus) format, whereas apmmult uses a format called
MULT (multi-locus format). (Eventually the programs will be combined and
there will be only one file format.) These formats are described in
detail in the documentation for each program, but here we describe some
of the more subtle structure behind them and go through an example in
detail.

Files of both formats begin with a line containing the number of
families and the number of marker loci. The next few lines contain
information on each locus. For each locus there are two lines: the first
contains the number of alleles and the human-readable name of the locus,
and the second contains the frequency of each allele (separated by
spaces, tabs, or newlines). Following the locus information are
descriptions of each pedigree in turn.

Before a pedigree can be included in the data file, each individual must
be assigned a number (called an ID).
Each member's ID must be unique, at least 1, and no greater than the
number of members (thus there is exactly one member for each ID from 1
to the number of members). In addition, each member's ID must be greater
than the ID's of both of his parents. A good way to number a pedigree is
to first assign ID's to all founders, starting at 1, then to assign an
ID to each member for whom both parents have ID's, continuing until each
member has an ID. (This is Karigl's convention for numbering the
pedigree.) Genotypic data is included in the description of the
pedigree, but only for the affected individuals (thus the name of the
method).

The pedigree descriptions are different for the two file formats. First
let us consider ML format (we will provide a simple example later):

For each pedigree, there is a line containing just the name of the
pedigree, then a line with the number of members, the number of affected
individuals, and the number of marker loci for which this pedigree is
typed (which must, of course, be no more than the number of loci
specified at the beginning of the file). Following this is a line with
the ID's of the mothers of all members, so that the first number is the
ID of the mother of the member with ID 1, the second is the ID of the
mother of the member with ID 2, and so on. For founders, the mother
should be 0. Then there is a line giving the same thing for the fathers.
(It is not acceptable to have a 0 for only one parent; both parents must
be defined or both must be undefined.) After that, there is a list of
the ID's of the affecteds, in increasing numerical order. Finally, there
are a number of lines, one for each locus for which the pedigree is
typed, each containing the number of the typed locus (referring to the
order in which the loci are defined at the beginning of the file) and
the genotypes for each affected member (in the same order as the list of
the ID's of the affecteds).
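
The numbering procedure described above (founders first, then anyone
whose parents are both already numbered) can be sketched in a few lines
of Python. This is only an illustration: the function names and the
name-to-parents dictionary are assumptions made for the sketch and are
not part of the APM package, and the family is made up.

```python
def number_pedigree(parents):
    """Return {name: ID}, with IDs 1..n and every parent numbered before its child.

    `parents` maps each member's name to a (mother, father) pair of names,
    or to (None, None) for a founder. This input layout is an assumption
    made for this sketch, not something the APM programs read.
    """
    for name, (mother, father) in parents.items():
        if (mother is None) != (father is None):
            raise ValueError(f"{name}: give both parents or neither")

    ids = {}
    remaining = list(parents)
    while remaining:
        # First pass: founders. Later passes: members whose parents were
        # both numbered in an earlier pass.
        ready = [n for n in remaining
                 if all(p is None or p in ids for p in parents[n])]
        if not ready:
            raise ValueError("a parent is missing from the pedigree, or it has a loop")
        for name in ready:
            ids[name] = len(ids) + 1
        remaining = [n for n in remaining if n not in ids]
    return ids


def parent_lines(parents, ids):
    """Return the mother and father ID lists in ID order (0 for founders)."""
    order = sorted(ids, key=ids.get)
    mothers = [ids[parents[n][0]] if parents[n][0] else 0 for n in order]
    fathers = [ids[parents[n][1]] if parents[n][1] else 0 for n in order]
    return mothers, fathers


# Made-up three-generation family, deliberately listed out of order.
family = {
    "kid": ("mom", "dad"),
    "mom": ("gran", "gramps"),
    "gran": (None, None),
    "gramps": (None, None),
    "dad": (None, None),
}
ids = number_pedigree(family)
mothers, fathers = parent_lines(family, ids)
print(ids)
print(*mothers)
print(*fathers)
```

The two printed lists have the layout of the mother and father lines of
a pedigree description: one entry per member, in ID order, with 0 for
founders.
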
Note that, if an affected member is untyped at a locus, his genotype may
be specified as "0 0" to reflect this. The lines must be ordered so that
the locus numbers increase.

Now consider MULT files. The main difference is in the way the genotypes
of the affecteds are specified. Let's look at the detail:

For each pedigree, there is a line containing the pedigree title, as
with ML files. There is then a line containing the number of members and
the number of affecteds (there is no third number giving the number of
loci at which the family is typed, because the family must be typed at
all loci). Following this are the lists of mother and father ID's, just
as in ML format. However, the next few lines, which contain the
genotypic data, are different: there is a line for each affected
individual, containing the affected member's ID and his genotypes for
each locus (in the order the loci were specified at the beginning of the
file). All affecteds must be typed at all marker loci.

Let's look at an example, a made-up pedigree. Say you know that A and B
had two children, C and D, and that C married E and had a child called
F. Let's also say that we have two marker loci, Locus1 and Locus2, that
Locus1 has three alleles with frequencies 0.1, 0.5, and 0.4, and that
Locus2 has three alleles with frequencies 0.3, 0.2, and 0.5. (The allele
frequencies for each locus must add up to 1.0.) Sex is relevant, so
let's make A, D, and E female and B, C, and F male. Now let's say that D
and F are affected with the disease of interest, and their genotypes are
1/1 and 2/1, respectively, at our first marker locus, and 3/1 and 2/1,
respectively, at our second marker locus. (The allele numbers must
correspond with the allele frequencies, so that, for the first locus,
allele 1 is the one with frequency 0.1, allele 2 is the one with
frequency 0.5, etc.)
The pedigree looks like this:

              A --- B
                 |
             ---------
             |       |
       E --- C       D
          |         1/1
          F         3/1
         2/1
         2/1

We have enough information now to construct our data file. First we must
number the members - start with the founders (A, B, and E), and then
each member for whom both parents have ID's (C and D). And then repeat
the last step, giving F his much-needed ID. Now all members have ID's,
and our pedigree looks like this, with the ID's in parentheses (I've
also added the sex for each, m or f):

           (1)Af---mB(2)
                 |
             ---------
             |       |
    (3)Ef---mC(4)   fD(5)
          |          1/1
         mF(6)       3/1
         2/1
         2/1

Let's make an ML format file first. Since we have one pedigree and two
marker loci, the first line is:

    1 2

Now we describe each locus, in turn:

    3 Locus1
    0.1 0.5 0.4
    3 Locus2
    0.3 0.2 0.5

And the pedigree name, let's call it DUMMY (prekin and dGENE permit only
8 character titles):

    DUMMY

Now the number of members, number of affecteds, and number of loci:

    6 2 2

And the mothers:

    0 0 0 1 1 3

And the fathers:

    0 0 0 2 2 4

Now the list of all affecteds, in increasing order:

    5 6

We are typed for the first locus (Locus1), so:

    1 1 1 2 1

and for the second locus (Locus2) as well:

    2 3 1 2 1

That's it. The final data file looks like this:

    1 2
    3 Locus1
    0.1 0.5 0.4
    3 Locus2
    0.3 0.2 0.5
    DUMMY
    6 2 2
    0 0 0 1 1 3
    0 0 0 2 2 4
    5 6
    1 1 1 2 1
    2 3 1 2 1

This file could then be used by apm. We suggest that before you use a
data file, however, you run chapm with the -check option to make sure
everything's ok.

If we wanted a MULT file, we could either run chapm on the above file to
convert it (if we have that file), or we could create it from scratch.
The first few lines are the same as for ML; only the pedigree
description is different.
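
The ML file built by hand above can also be produced by a short script.
The following Python sketch writes the same twelve lines from a small
data structure. The function name and the dictionary layout are
assumptions made for this illustration; nothing here is part of the APM
package, and the result should still be checked with chapm as suggested
above.

```python
def ml_file(loci, pedigrees):
    """Return the text of an ML-format file.

    loci:      list of (name, [allele frequencies])
    pedigrees: list of dicts with keys name, mothers, fathers, affecteds,
               and typed = {locus number: [(allele, allele) per affected]}
    (This data layout is an assumption made for the sketch.)
    """
    def row(values):
        return " ".join(str(v) for v in values)

    # Header: number of families and number of marker loci.
    lines = [row([len(pedigrees), len(loci)])]
    # Two lines per locus: allele count and name, then the frequencies.
    for name, freqs in loci:
        lines.append(f"{len(freqs)} {name}")
        lines.append(row(freqs))
    for ped in pedigrees:
        lines.append(ped["name"])
        lines.append(row([len(ped["mothers"]),
                          len(ped["affecteds"]),
                          len(ped["typed"])]))
        lines.append(row(ped["mothers"]))
        lines.append(row(ped["fathers"]))
        lines.append(row(ped["affecteds"]))
        # One line per typed locus, in increasing locus order.
        for locus in sorted(ped["typed"]):
            alleles = [a for pair in ped["typed"][locus] for a in pair]
            lines.append(row([locus] + alleles))
    return "\n".join(lines) + "\n"


loci = [("Locus1", [0.1, 0.5, 0.4]), ("Locus2", [0.3, 0.2, 0.5])]
dummy = {
    "name": "DUMMY",
    "mothers": [0, 0, 0, 1, 1, 3],
    "fathers": [0, 0, 0, 2, 2, 4],
    "affecteds": [5, 6],
    "typed": {1: [(1, 1), (2, 1)], 2: [(3, 1), (2, 1)]},
}
text = ml_file(loci, [dummy])
print(text, end="")
```

A MULT writer would differ only in the pedigree description, which is
the part that changes between the two formats.
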
Let's start with the pedigree description - first the pedigree name:

    DUMMY

Now the number of members and affecteds:

    6 2

And the mothers and fathers as before:

    0 0 0 1 1 3
    0 0 0 2 2 4

Now the first affected and his genotypes:

    5 1 1 3 1

And the second:

    6 2 1 2 1

Thus the file looks like this:

    1 2
    3 Locus1
    0.1 0.5 0.4
    3 Locus2
    0.3 0.2 0.5
    DUMMY
    6 2
    0 0 0 1 1 3
    0 0 0 2 2 4
    5 1 1 3 1
    6 2 1 2 1

In all cases, the white space (spaces and tabs) can be tailored to
taste. Newlines can be inserted in the middle of lines, but they are
required at the ends of all the lines as they are described here.

HINT #1
~~~~~~~

The Affected Pedigree Member Method needs only two forms of information:

  1) Who are the affected individuals and what are their genotypes?
  2) What are the relationships between all the affected and typed
     individuals?

Thus, the data file for these programs need not contain any unaffected
individuals who are not necessary for defining the relationships between
affected (and typed) individuals. For example, in the example files,
there could have been children of F and D, more children of A and B or
of C and E, etc., but they needn't be included if none of them (and no
descendants of them) are affected.

HINT #2
~~~~~~~

If you already have your genetic data in a database or files, it is much
less error prone to use a program to convert your data into the format
described above (rather than doing it by hand).

CONTACT
~~~~~~~

You can contact Dr. Weeks to receive copies of the programs for DOS or
Unix (source code only for Unix systems), ask questions, make comments,
or report bugs.

    ********************************
    | Daniel E. Weeks              |
    | Department of Human Genetics |  Internet:
    | University of Pittsburgh     |    dweeks@watson.hgen.pitt.edu
    | Crabtree Hall, Room A310     |
    | 130 DeSoto Street            |  Bitnet:
    | Pittsburgh, PA 15261         |    weeks@pittvms.bitnet
    |                              |
    | (412) 624-3066               |
    | FAX: (412) 624-3020          |
    ********************************

Thank you!

Daniel E.
Weeks & Mark Schroeder

[ The APM programs are Copyright (C) 1993 Daniel E. Weeks ]

ADDENDUM
~~~~~~~~

The following pointers are taken from an article written for an INSERM
workshop March 30-31, 1992, with only slight modification to bring it up
to date. The original article is entitled "Using the Affected Pedigree
Member Method of Linkage Analysis" and was authored by Daniel E. Weeks,
Lisa D. Harby, Carmella A. Sarneso, and Michael B. Gorin.

Practical Issues

1. Carefully estimate marker allele frequencies

The APM method appears to be sensitive to misspecification of marker
allele frequencies. This can be a problem due to the difficulty of
accurately determining allele frequencies for highly polymorphic
markers. Underestimating or misspecifying the frequency of a marker
allele may falsely inflate the evidence for linkage, since matches
between rare marker alleles contribute more to the APM statistic than
matches between common alleles.

2. Exclude families containing only a parent-child pair of affecteds

The APM statistic is based on distortions in the identity-by-state (IBS)
status at the marker loci, which hopefully reflect the underlying
identity-by-descent (IBD) status. There is no opportunity for any
meaningful distortion in the IBS status in pedigrees containing only a
parent-child pair, because a parent-child pair always shares only one
allele IBD. Therefore, we recommend deleting those families that contain
only a parent-child pair (Weeks and Lange 1992). However, simulations
and analytical results indicate that inadvertently including such
families should have little effect on the false positive rate.

3. Affecteds must be typed at all markers under consideration

For the multilocus version of the APM statistic, an affected individual
must be typed at all the marker loci under consideration for that
individual to be included in the statistic (Weeks and Lange 1992). There
is currently no provision for missing marker data on the affecteds.

4.
Do not collapse missing alleles

In traditional linkage analysis, if a marker has 25 alleles and only 4
alleles appear in the pedigrees, then one may carry out the analysis
using only 5 alleles, by collapsing all 21 missing alleles into 1 dummy
allele with the appropriate frequency. However, in the APM method it is
not desirable to collapse the alleles in this manner (Weeks and Lange
1992), since the APM statistic depends on the sums of the squares and
cubes of the allele frequencies. Collapsing the alleles will
artificially change these sums.

5. Use the intermediate weighting function only

The APM programs calculate the results in terms of three weighting
functions: A) f(p)=1, B) f(p)=1/sqrt(p), and C) f(p)=1/p. These
functions are used to make the sharing of rare marker alleles between
affecteds a more significant event than the sharing of common marker
alleles. Function A corresponds to no weighting, while function C
corresponds to strong weighting. We recommend using and reporting only
the results based on the intermediate function B, since extreme
weighting (function C) usually leads to non-normality of the statistic,
while no weighting (function A) may be too conservative (Weeks and Lange
1992).

It can nevertheless be useful to examine the APM results under all three
weighting functions. For example, if the result under function C is
extremely significant, while the result under function A is not
significant, then this may indicate that the affected relatives are
sharing an extremely rare marker allele and that most of the
significance comes from the fact that the marker allele is very rare. If
the allele frequency isn't accurately determined for the particular
population being analyzed, this may provide motivation to estimate the
allele frequency more accurately.

6.
Use simulation to check "significant" APM results

The approximate significance of the APM statistic may be determined
quickly and easily by referring to a statistical table for the standard
normal distribution. However, since this is an approximation, it is
always desirable to check the "significant" APM results by simulation,
especially if the number of families is less than 20 (Weeks and Lange
1992). Two simulation programs, a single-locus and a multi-locus
version, are distributed in the APM package for precisely this purpose.

7. Always examine the data carefully

A significant APM result should provide the impetus to examine the data
more carefully, especially because a significant APM result only
indicates that the marker is not segregating independently of the
disease. A significant result does not necessarily imply that a linkage
has been discovered, since the result could be due to association or
misspecified marker allele frequencies. For example, if a large pedigree
was completely homozygous at the marker locus for a very rare allele,
then this may generate a significant APM result, even though, under
traditional linkage analysis, this pedigree contains no linkage
information. However, if the marker allele is really as rare as
indicated, then this may be a valid result, since it is extremely
unlikely that all founders in the pedigree will be homozygous for that
marker allele.

APM Programs

Kenneth Lange and Daniel Weeks originally wrote the APM programs, and
they have been modified by Mark Schroeder for this release. The programs
come in two versions. The many-locus version analyzes any number of
markers, independently, one at a time. The multiple-locus version
analyzes a set of closely linked markers. Both versions accept similar
input files which require numbering each pedigree from top to bottom.
We provide a utility program for creating an APM file from LINKAGE data
files and for converting between APM formats (a program which can
convert from MENDEL was in the last release and has not been changed).
These utility programs require that the affected individuals be
identified by a unique code at the disease locus.

The DOS-version and Unix-source of the APM programs may be obtained from
Dr. Weeks by sending two formatted DOS diskettes and a return envelope.
Alternatively, Unix-source alone may be obtained by e-mailing a request
to Dr. Weeks at dweeks@watson.hgen.pitt.edu.

References

[ see the accompanying REFERENCES file ]

An Example Run of APM, Sim, and Hist
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Here is a typical run of apm, sim, and hist on the test_ml.dat data file
(included in the ./examples directory). Aside from introducing some
blank lines here and there (to improve legibility) and some comments,
the output of the programs is unaltered. The output produced by your
copies should be the same (or very close). The use of apmmult, simmult,
and hist is very similar.

holmes-51:ls
test_ml.dat
holmes-52:apm

The Affected Pedigree Member Method of Linkage Analysis
Single marker program apm, version 2.00; June 14, 1993
by Daniel E. Weeks and Kenneth Lange
modified by Mark Schroeder
Copyright (C) 1993 Daniel E. Weeks

    ********************************
    | Daniel E. Weeks              |
    | Department of Human Genetics |  Internet:
    | University of Pittsburgh     |    dweeks@watson.hgen.pitt.edu
    | Crabtree Hall, Room A310     |
    | 130 DeSoto Street            |  Bitnet:
    | Pittsburgh, PA 15261         |    weeks@pittvms.bitnet
    |                              |
    | (412) 624-3066               |
    | FAX: (412) 624-3020          |
    ********************************

Input the datafile file name and extension
> test_ml.dat
Input the limitation on memory use in megabytes
Entering 0 will set it to about 20 megabytes
> 0
Now enter the name of the file of coefficients if it exists
(if not just press ).
>
If you wish to create a new file of coefficients, enter the name
(if not just press ).
>

TESTPED1 <--- pedigree title
15 3
0 0 2 2 2 0 0 0 5 5 5 7 12 12 12
0 0 1 1 1 0 0 0 6 6 6 8 11 11 11

The 3 allele frequencies for locus 1 ( ACK1 ) are
0.45000 0.30000 0.25000
The 3 allele frequencies for locus 2 ( ACK2 ) are
0.55000 0.20000 0.25000
The 2 allele frequencies for locus 3 ( ACK3 ) are
0.46500 0.53500

LOCUS 1 3 4 10 13
making some new kinship calculations
f(p) = 1
mean 1.26656 variance 0.27363 observed x 2.00000
The statistic for this family at this locus is 1.40212
f(p) = 1/sqrt(p)
mean 2.12586 variance 0.70359 observed x 3.65148
The statistic for this family at this locus is 1.81882
f(p) = 1/p
mean 3.62500 variance 2.19444 observed x 6.66667
The statistic for this family at this locus is 2.05329

LOCUS 2 2 4 13
making some new kinship calculations
f(p) = 1
mean 0.44219 variance 0.07280 observed x 0.50000
The statistic for this family at this locus is 0.21426
f(p) = 1/sqrt(p)
mean 0.68899 variance 0.16435 observed x 1.00000
The statistic for this family at this locus is 0.76717
f(p) = 1/p
mean 1.12500 variance 0.54403 observed x 2.00000
The statistic for this family at this locus is 1.18630

LOCUS 3 2 10 13
making some new kinship calculations
f(p) = 1
mean 0.56464 variance 0.05909 observed x 0.50000
The statistic for this family at this locus is -0.26594
f(p) = 1/sqrt(p)
mean 0.79652 variance 0.11693 observed x 0.73324
The statistic for this family at this locus is -0.18508
f(p) = 1/p
mean 1.12500 variance 0.23499 observed x 1.07527
The statistic for this family at this locus is -0.10259

TESTPED2 <--- pedigree title
8 2
0 0 1 1 1 0 6 6
0 0 2 2 2 0 5 5

LOCUS 2 2 4 7
making some new kinship calculations
f(p) = 1
mean 0.47938 variance 0.06961 observed x 0.50000
The statistic for this family at this locus is 0.07817
f(p) = 1/sqrt(p)
mean 0.75565 variance 0.15779 observed x 1.00000
The statistic for this family at this locus is 0.61515
f(p) = 1/p
mean 1.25000 variance
0.55682 observed x 2.00000
The statistic for this family at this locus is 1.00509

Overall Results:

Weight            Locus  N.Fam.  Statistic  P-value
f(p) = 1          ACK1   1        1.40212   0.08043
f(p) = 1          ACK2   2        0.20678   0.41808
f(p) = 1          ACK3   1       -0.26594   0.60486
f(p) = 1/sqrt(p)  ACK1   1        1.81882   0.03447
f(p) = 1/sqrt(p)  ACK2   2        0.97745   0.16417
f(p) = 1/sqrt(p)  ACK3   1       -0.18508   0.57342
f(p) = 1/p        ACK1   1        2.05329   0.02003
f(p) = 1/p        ACK2   2        1.54955   0.06062
f(p) = 1/p        ACK3   1       -0.10259   0.54087

There were 3 affecteds used for locus ACK1
There were 4 affecteds used for locus ACK2
There were 2 affecteds used for locus ACK3

Note: The p-values may be unreliable for small numbers of families.
We recommend using the simulation program "sim" and the histogram
generator "hist" to compute empirical p-values.

[ Since the number of families is small, the distribution of the
  statistic doesn't sufficiently resemble the asymptotic distribution.
  Therefore, the p-value approximation given by apm may not be reliable.
  We then need to run sim to produce replicates of the families, each of
  which will have random genotypes assigned to all the members (subject
  to the constraints of the pedigree structures and the allele
  frequencies, but NOT assuming linkage between the markers and the
  disease). This effectively simulates many different sets of families
  under the assumption of no linkage. ]

holmes-53:ls
out1.dat out1p.dat outsqr.dat table.out test_ml.dat
holmes-54:sim
random seed: 75449
Please enter the data file name: outsqr.dat
f(p) = 1/sqrt(p)
[ We usually use this weighting function.
]
2 3
3
0.45 0.3 0.25
3
0.55 0.2 0.25
2
0.465 0.535
Input the desired number of iterations (1000 is good): 1000
REMEMBER: Not all families may be used at all loci

TESTPED1
15 3 3
0 0 2 2 2 0 0 0 5 5 5 7 12 12 12
0 0 1 1 1 0 0 0 6 6 6 8 11 11 11
1 3 4 10 13 2.12586 0.70359
2 2 4 13 0.68899 0.16435
3 2 10 13 0.79652 0.11693
TESTPED2
8 2 1
0 0 1 1 1 0 6 6
0 0 2 2 2 0 5 5
2 2 4 7 0.75565 0.15779

For locus 1, the mean is -0.0122673 and the variance 0.9499
For locus 2, the mean is -0.0273151 and the variance 0.945936
For locus 3, the mean is -0.0213009 and the variance 0.946877

[ For a sufficiently simulated normal distribution, the mean should be
  near zero and the variance should be near one. ]

1000 iterations
f(p) = 1/sqrt(p)
holmes-55:ls
out1.dat   outsqr.dat  test_ml.dat  tstat2.out
out1p.dat  table.out   tstat1.out   tstat3.out

[ Now that we have the simulated families, we can use hist to see where
  our real families lie in the normal distribution. Basically, we
  calculate the p-value of the statistic generated by apm by finding
  where it lies within the (approximately) normal distribution that we
  have generated. In this example we will do this only for the first
  locus. ]

holmes-56:hist -p - -s tstat1.out
[ tstat1.out contains all the simulated statistics for the first locus.
]
Reading samples from file 'tstat1.out'

Number of Samples: 1000
Sample Range: -2.5344 to 4.61865
Mean: -0.0122677
RMS: 0.974705
Mean Deviation from Mean: 0.773703
Variance: 0.95085
Standard Deviation: 0.975115
Skewness: 0.955855
Kurtosis: 1.43903

     -2.3344    2 ]
     -1.9344    0
  B  -1.5344   56 ###########[
  I  -1.1344  121 ########################[
  N  -0.7344  162 ################################]
     -0.3344  216 ###########################################[
  C   0.0656   88 #################]
  E   0.4656  157 ###############################]
  N   0.8656  124 #########################
  T   1.2656    0
  E   1.6656   31 ######[
  R   2.0656   18 ###]
      2.4656    0
      2.8656   21 ####[
      3.2656    0
      3.6656    0
      4.0656    3 ]
      4.4656    1 [

Bin Width: 0.4
Block Width: 5
Key for Partial Blocks:
  #: >~ 2/3
  ]: >~ 1/3
  [: <~ 1/3

Computing empirical p-values:
Enter a statistic (Ctrl-D when done): 1.81882
P-value for 1.81882: 0.043
Enter a statistic (Ctrl-D when done): ^D

[ We want to see what the p-value is for 1.81882, since that is the
  statistic generated by apm for the first locus and for the weighting
  function we are using (f(p) = 1/sqrt(p)). 1.81882 lies in the bin
  centered on 1.6656. There are many fewer simulated statistics greater
  than 1.81882 than there are less than that number; hence, the p-value
  is much less than 0.5. The p-value can be thought of as the
  probability of a false positive, and in this case 0.043 may be low
  enough to be significant. ]

holmes-57:ls
out1.dat   outsqr.dat  test_ml.dat  tstat2.out
out1p.dat  table.out   tstat1.out   tstat3.out
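
[ The numbers in the run above can be cross-checked by hand. The Python
  sketch below is only an illustration and is not part of the APM
  package; its function names are made up. It assumes that the
  per-family statistic is (observed - mean) / sqrt(variance), that the
  p-value printed by apm is the one-sided upper tail of the standard
  normal distribution, and that hist reports the fraction of simulated
  statistics at or above the value entered. These assumptions reproduce
  the printed figures to within rounding. For ACK2 the overall statistic
  in this run also coincides with the equal-weight combination of the
  two family statistics divided by sqrt(2); both families have two typed
  affecteds here, so this run does not show how apm weights families of
  different sizes. ]

```python
import math


def family_statistic(observed, mean, variance):
    """Standardize an observed value against its mean and variance."""
    return (observed - mean) / math.sqrt(variance)


def combine_equal_weights(stats):
    """Equal-weight combination of family statistics (an assumption; see above)."""
    return sum(stats) / math.sqrt(len(stats))


def normal_p_value(z):
    """One-sided upper-tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def empirical_p_value(simulated, observed):
    """Fraction of simulated statistics that are at least as large as `observed`."""
    return sum(1 for s in simulated if s >= observed) / len(simulated)


# TESTPED1, locus 1, f(p) = 1: mean, variance, and observed x as printed by apm.
z_ack1 = family_statistic(2.00000, 1.26656, 0.27363)
# Locus 2, f(p) = 1: the two family statistics printed by apm.
z_ack2 = combine_equal_weights([0.21426, 0.07817])
# Locus 1, f(p) = 1/sqrt(p): the statistic later passed to hist.
p_ack1 = normal_p_value(1.81882)

print(f"{z_ack1:.4f} {z_ack2:.4f} {p_ack1:.4f}")
```

[ With the contents of tstat1.out loaded into a list, empirical_p_value
  would play the role of the interactive hist query shown above. ]
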