[ This file contains a brief overview of the APM method, its programs,
  and its file formats. It also contains some helpful hints, guidelines,
  and an example run of the apm programs on a fictitious data file. ]


The Affected Pedigree Member Method Linkage Analysis Programs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This distribution package contains programs that use the Affected
Pedigree Member (APM) method to analyze pedigrees for linkage between
a disease trait and a number of marker loci. For each marker they
generate a test statistic and p-value showing the evidence for linkage
under three different weighting functions of the allele frequencies:
f(p) = 1, f(p) = 1/p, and f(p) = 1/sqrt(p) = p^(-0.5) ("1 over the
square root of p").
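
As a minimal sketch (this is not code from the APM package, and the
function names are hypothetical), the three weighting functions can be
written in Python as follows. It only illustrates how much more weight
a rare allele receives than a common one under each function.

```python
import math

def weight_none(p):
    """f(p) = 1: every allele counts equally."""
    return 1.0

def weight_sqrt(p):
    """f(p) = 1/sqrt(p): intermediate weighting."""
    return 1.0 / math.sqrt(p)

def weight_inverse(p):
    """f(p) = 1/p: strongest weighting of rare alleles."""
    return 1.0 / p

# Compare a rare allele (p = 0.1) with a common one (p = 0.5).
for p in (0.1, 0.5):
    print(p, weight_none(p), weight_sqrt(p), weight_inverse(p))
```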

There are two analysis programs: apm and apmmult. Apm analyzes several
marker loci independently, while apmmult analyzes linked markers. The
programs require distinctly different file formats; apm uses what is
known as ML (many-locus) format, whereas apmmult uses a format called
MULT (multi-locus format). (Eventually the programs will be combined
and there will be only one file format.) These formats are described
in detail in the documentation for each program, but here we will
describe some of the more subtle structure behind them and go through
an example in detail.

Files of both formats begin with a line containing the number of
families and the number of marker loci. The next lines describe each
locus in turn, two lines per locus: the first contains the number of
alleles and the human-meaningful name of the locus, and the second
contains the frequencies of the alleles, in allele order (separated
by spaces, newlines, or tabs).
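
The header layout just described could be read with a sketch like the
one below. It is an illustration only, not part of the APM programs.
It assumes that locus names contain no white space and that a
tolerance of 1e-6 is acceptable when checking that the frequencies of
each locus add up to 1.0; parse_header is a hypothetical name.

```python
def parse_header(text):
    """Parse the leading lines shared by ML and MULT files.

    Returns (n_families, loci, rest), where loci is a list of
    (name, [frequencies]) and rest holds the tokens not yet consumed.
    """
    tokens = text.split()  # any mix of spaces, tabs, and newlines
    n_families, n_loci = int(tokens[0]), int(tokens[1])
    pos = 2
    loci = []
    for _ in range(n_loci):
        n_alleles, name = int(tokens[pos]), tokens[pos + 1]
        freqs = [float(t) for t in tokens[pos + 2 : pos + 2 + n_alleles]]
        if len(freqs) != n_alleles:
            raise ValueError(f"locus {name}: too few allele frequencies")
        # Assumed tolerance; the frequencies of a locus must sum to 1.0.
        if abs(sum(freqs) - 1.0) > 1e-6:
            raise ValueError(f"locus {name}: frequencies do not sum to 1")
        loci.append((name, freqs))
        pos += 2 + n_alleles
    return n_families, loci, tokens[pos:]

header = "1 2\n 3 Locus1\n 0.1 0.5 0.4\n 3 Locus2\n 0.3 0.2 0.5\n"
print(parse_header(header))
```

Splitting on white space works for the header, but a pedigree title
may need line-based handling, so the remaining tokens are returned
unparsed.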

Following the locus information are descriptions of each pedigree in
turn. Before a pedigree can be included in the data file, each
individual must be ascribed a number (called an ID). Each member's ID
must be unique, at least 1, and no greater than the number of
members (thus there is exactly one member for each ID from 1 to the
number of members). There is an additional stipulation that each member
has an ID number which is greater than the ID numbers of both parents. A
good way to number a pedigree is to first assign all founders ID's,
starting at 1, then to assign each member for whom both parents have
ID's an ID, continuing until each member has an ID. (This is Karigl's
convention for numbering the pedigree.) Genotypic data is included in
the description of the pedigree, but only for the affected individuals
(thus the name of the method).
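
The numbering procedure above can be sketched in Python as follows.
This is only an illustration under stated assumptions: the function
name number_pedigree and the input layout (a dictionary from member
label to a (mother, father) pair, or None for a founder) are
hypothetical, not something the APM programs read.

```python
def number_pedigree(parents):
    """Assign IDs 1..n so every member's ID exceeds both parents' IDs.

    parents maps a member label to (mother, father), or None for a founder.
    """
    ids = {}
    # Founders first, starting at 1.
    for member, pair in parents.items():
        if pair is None:
            ids[member] = len(ids) + 1
    # Then repeatedly number the members whose parents both have IDs.
    while len(ids) < len(parents):
        ready = [m for m, pair in parents.items()
                 if m not in ids and pair[0] in ids and pair[1] in ids]
        if not ready:
            raise ValueError("some parents are missing from the pedigree")
        for member in ready:
            ids[member] = len(ids) + 1
    return ids

# Two founders with one child, who has a child with a third founder.
family = {"Ma": None, "Pa": None, "Kid": ("Ma", "Pa"),
          "Spouse": None, "Grandkid": ("Spouse", "Kid")}
print(number_pedigree(family))
```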

The pedigree descriptions are different for the two file formats. First
let us consider ML format (we will provide a simple example later):

	For each pedigree, there is a line containing just the name of
the pedigree, then a line with the number of members, the number of
affected individuals, and the number of marker loci for which this
pedigree is typed (which must, of course, be no more than the number
of loci specified at the beginning of the file). Following this is a
line with a string of numbers which are the ID's of the mothers of all
members, so that the first number is the ID of the mother of the
member with ID 1, the second is the ID of the mother of the member
with ID 2, and so on. For founders, the mother should be 0. Then there
is a line which is the same thing for the fathers. (It is not
acceptable to have a 0 for only one parent; both parents must be
defined or both must be undefined.) After that, there is a list of the
ID's of the affecteds, in increasing numerical order.  Finally, there
are a number of lines, one for each locus for which the pedigree is
typed, which contain the number of the typed locus (referring to the
order in which they are defined at the beginning of the file) and the
genotypes for each affected member (in the same order as the list of
the ID's of the affecteds). Note that, if an affected member is
untyped at a locus, his genotype may be specified as "0 0" to reflect
this. The lines must be in the order that makes the locus numbers
increase.
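
The rules for the mother, father, and affected lists can be checked
with a small sketch such as the one below. It is not a substitute for
chapm's -check option and makes no claim about what chapm tests; the
function name and the wording of the messages are hypothetical.

```python
def check_ml_pedigree(mothers, fathers, affecteds):
    """Return a list of rule violations (empty if none) for one pedigree."""
    problems = []
    n = len(mothers)
    if len(fathers) != n:
        return ["mother and father lists differ in length"]
    for i, (m, f) in enumerate(zip(mothers, fathers), start=1):
        if (m == 0) != (f == 0):
            # Both parents must be defined or both undefined.
            problems.append(f"member {i}: only one parent is defined")
        elif m != 0 and not (1 <= m < i and 1 <= f < i):
            # Each member's ID must exceed both parents' IDs.
            problems.append(f"member {i}: parent ID not smaller than own ID")
    if any(a >= b for a, b in zip(affecteds, affecteds[1:])):
        problems.append("affected IDs are not in increasing order")
    if any(not 1 <= a <= n for a in affecteds):
        problems.append("affected ID out of range")
    return problems

print(check_ml_pedigree([0, 0, 1, 1], [0, 0, 2, 2], [3, 4]))
```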

Now consider MULT files. The main differences are in the way the
genotypes of the affecteds are specified. But let's look at the
detail:

	For each pedigree, there is a line containing the pedigree
title, as with ML files. There is then a line containing the number of
members and the number of affecteds (there is no third number giving
the number of loci for which the family is typed, because the
family must be typed at all loci). Following this are the lists of
mother and father ID's, just as in ML format. However, the next few
lines, which contain the genotypic data, are different: There is a
line for each affected individual, containing the affected member's
ID and his genotypes for each locus (in the order the loci were
specified at the beginning of the file). All affecteds must be typed
at all marker loci.
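
The difference between the two genotype layouts amounts to a
regrouping: ML has one row per locus, MULT has one row per affected.
The sketch below illustrates that regrouping with made-up genotypes.
It is not the chapm converter, and how chapm treats untyped affecteds
is not described here, so this sketch simply rejects them; the
function name and data layout are hypothetical.

```python
def ml_to_mult_rows(affecteds, ml_rows, n_loci):
    """Regroup ML genotype rows (per locus) into MULT rows (per affected).

    affecteds: affected IDs in increasing order.
    ml_rows:   {locus_number: [(a1, a2), ...]} in the order of affecteds.
    Returns a list of (id, [(a1, a2) for each locus in file order]).
    """
    rows = []
    for k, member in enumerate(affecteds):
        genotypes = []
        for locus in range(1, n_loci + 1):
            if locus not in ml_rows or ml_rows[locus][k] == (0, 0):
                raise ValueError(
                    f"member {member} is untyped at locus {locus}; "
                    "MULT requires complete typing")
            genotypes.append(ml_rows[locus][k])
        rows.append((member, genotypes))
    return rows

# Two affecteds (IDs 5 and 6) typed at two loci.
ml = {1: [(1, 1), (2, 1)], 2: [(3, 1), (2, 1)]}
for member, genotypes in ml_to_mult_rows([5, 6], ml, 2):
    print(member, genotypes)
```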

Let's look at an example, a made-up pedigree. Say you know that A and
B had two children, C and D, and that C married E and had a child
called F. Let's also say that we have two marker loci, Locus1 and
Locus2, that Locus1 has three alleles with frequencies 0.1, 0.5, and
0.4, and that Locus2 has three alleles with frequencies 0.3, 0.2, and
0.5. (The allele frequencies for each locus must add up to 1.0.)

Sex is relevant, so let's make A, D, and E female and B, C, and F
male.

Now let's say that D and F are affected with the disease of interest,
and their genotypes are 1/1 and 2/1, respectively, at our first marker
locus, and 3/1 and 2/1, respectively, at our second marker locus. (The
allele numbers must correspond with the allele frequencies, so that,
for the first locus, allele 1 is the one with frequency 0.1, allele 2
is the one with frequency 0.5, etc.)

The pedigree looks like this:

          A --- B
             |
         ---------
         |       |
   E --- C       D
      |         1/1
      F         3/1
     2/1
     2/1

We have enough information now to construct our data file. First we
must number the members - start with the founders (A, B, and E), and
then each member for whom both parents have ID's (C and D). And then
repeat the last step, giving F his much-needed ID. Now all members
have ID's, and our pedigree looks like this, with the ID's in
parentheses (I've also added the sex for each, m or f):

       (1)Af---mB(2)
             |
         ---------
         |       |
(3)Ef---mC(4)   fD(5)
      |         1/1
     mF(6)      3/1
     2/1
     2/1

Let's make an ML format file first. Since we have one pedigree and two
marker loci, the first line is:
	1    2
Now we describe each locus, in turn:
     3   Locus1
  0.1  0.5  0.4
     3   Locus2
  0.3  0.2  0.5
And the pedigree name, let's call it DUMMY (prekin and dGENE permit
only 8 character titles):
DUMMY
Now the number of members, number of affecteds, and number of loci:
        6    2    2
And the mothers:
  0  0  0  1  1  3
And the fathers:
  0  0  0  2  2  4
Now the list of all affecteds, in increasing order:
       5      6
We are typed for the first locus (Locus1), so:
 1   1  1   2  1
and for the second locus (Locus2) as well:
 2   3  1   2  1


That's it. The final data file looks like this:

	1    2
     3   Locus1
  0.1  0.5  0.4
     3   Locus2
  0.3  0.2  0.5
DUMMY
        6    2    2
  0  0  0  1  1  3
  0  0  0  2  2  4
       5      6
 1   1  1   2  1
 2   3  1   2  1


This file could then be used by apm. We suggest that before you use a
data file, however, you run chapm with the -check option to make sure
everything's ok.

If we wanted a MULT file, we could either run chapm on the above file
to convert it (if we have that file), or we could create it from
scratch. The first few lines are the same as for ML; only the pedigree
description is different. Let's start there - first the pedigree name:
DUMMY
Now the number of members and affecteds:
        6    2
And the mothers and fathers as before:
  0  0  0  1  1  3
  0  0  0  2  2  4
Now the first affected (D, who is female) and her genotypes:
 5   1  1   3  1
And the second:
 6   2  1   2  1


Thus the file looks like this:

	1    2
     3   Locus1
  0.1  0.5  0.4
     3   Locus2
  0.3  0.2  0.5
DUMMY
        6    2
  0  0  0  1  1  3
  0  0  0  2  2  4
 5   1  1   3  1
 6   2  1   2  1


In all cases, the white space (spaces and tabs) can be tailored to
taste. Newlines can be inserted in the middle of lines, but they are
required at the ends of all the lines as they are described here.


HINT #1
~~~~~~~
The Affected Pedigree Member Method needs only two forms of
information:

      1) Who are the affected individuals and what are their                   
         genotypes?
      2) What are the relationships between all the affected and
         typed individuals?

Thus, the data file for these programs need not contain any unaffected
individuals who are not necessary for defining the relationships between
affected (and typed) individuals.  For example, in the example files,
there could have been children of F and D, more children of A and B or
of C and E, etc., but they needn't be included if none of them (and no
descendants of them) are affected.
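
One conservative way to decide whom to keep is to retain every
affected individual together with all of his or her ancestors, as in
the sketch below. This is an assumption-laden illustration, not the
authors' procedure: it may keep more members than strictly necessary,
and the kept members would still have to be renumbered before the file
is written. The function name is hypothetical.

```python
def members_to_keep(mothers, fathers, affecteds):
    """Return the sorted IDs of the affecteds plus all their ancestors.

    mothers[i-1] and fathers[i-1] give the parents of member i (0 = founder).
    """
    keep = set()
    stack = list(affecteds)
    while stack:
        member = stack.pop()
        if member in keep:
            continue
        keep.add(member)
        for parent in (mothers[member - 1], fathers[member - 1]):
            if parent != 0:
                stack.append(parent)
    return sorted(keep)

# The example pedigree plus an extra unaffected child (ID 7) of A and B.
mothers = [0, 0, 0, 1, 1, 3, 1]
fathers = [0, 0, 0, 2, 2, 4, 2]
print(members_to_keep(mothers, fathers, [5, 6]))
```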


HINT #2
~~~~~~~
If you already have your genetic data in a database or files, it is
much less error prone to use a program to convert your data into the
format described above (rather than doing it by hand).


CONTACT
~~~~~~~
You can contact Dr. Weeks to receive copies of the programs for DOS
or Unix (source code only for Unix systems), ask questions, make
comments, or report bugs.


********************************
| Daniel E. Weeks              | 
| Department of Human Genetics |  Internet: 
| University of Pittsburgh     |    dweeks@watson.hgen.pitt.edu
| Crabtree Hall, Room A310     |
| 130 DeSoto Street            |  Bitnet:
| Pittsburgh, PA 15261         |    weeks@pittvms.bitnet
|                              |
| (412) 624-3066               |
| FAX: (412) 624-3020          |
********************************


                 Thank you!

                 Daniel E. Weeks  &  Mark Schroeder

[ The APM programs are Copyright (C) 1993 Daniel E. Weeks ]


ADDENDUM
~~~~~~~~
The following pointers are taken from an article written for an INSERM
workshop March 30-31, 1992, with only slight modification to bring it
up to date. The original article is entitled "Using the Affected
Pedigree Member Method of Linkage Analysis" and was authored by Daniel
E. Weeks, Lisa D. Harby, Carmella A. Sarneso, and Michael B. Gorin.

Practical Issues

1. Carefully estimate marker allele frequencies 

The APM method appears to be sensitive to misspecification of 
marker allele frequencies.  This can be a problem due to the 
difficulty of accurately determining allele frequencies for highly 
polymorphic markers.  Underestimating or misspecifying the 
frequency of a marker allele may falsely inflate the evidence for 
linkage, since matches between rare marker alleles contribute more 
to the APM statistic than matches between common alleles.  

2. Exclude families containing only a parent-child pair of 
affecteds

The APM statistic is based on distortions in the identity-by-state 
(IBS) status at the marker loci, which hopefully reflect the 
underlying identity-by-descent (IBD) status.  There is no 
opportunity for any meaningful distortion in the IBS status in 
pedigrees containing only a parent-child pair, because a parent-
child pair always shares only one allele IBD. Therefore, we 
recommend deleting those families that contain only a parent-child 
pair (Weeks and Lange 1992).  However, simulations and analytical 
results indicate that inadvertently including such families should 
have little effect on the false positive rate.

3. Affecteds must be typed at all markers under consideration

For the multilocus version of the APM statistic, an affected 
individual must be typed at all the marker loci under 
consideration for that individual to be included in the statistic 
(Weeks and Lange 1992).  There is currently no provision for 
missing marker data on the affecteds.

4. Do not collapse missing alleles

In traditional linkage analysis, if a marker has 25 alleles and 
only 4 alleles appear in the pedigrees, then one may carry out the 
analysis using only 5 alleles, by collapsing all 21 missing 
alleles into 1 dummy allele with the appropriate frequency.  
However, in the APM method it is not desirable to collapse the
alleles in this manner (Weeks and Lange 1992), since the APM
statistic depends on the sums of the squares and cubes of the
allele frequencies.  Collapsing the alleles will artificially
change these sums.
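
The effect on those sums is easy to see numerically. The sketch below
uses an entirely hypothetical marker (4 observed alleles plus 21
unobserved alleles of frequency 0.02 each); the frequencies are made
up for illustration and are not taken from any data set.

```python
def power_sums(freqs):
    """Return (sum of p^2, sum of p^3) over the allele frequencies."""
    return sum(p ** 2 for p in freqs), sum(p ** 3 for p in freqs)

# Hypothetical marker with 25 alleles, only 4 of which are observed.
observed = [0.20, 0.15, 0.13, 0.10]
unobserved = [0.02] * 21
full = observed + unobserved
collapsed = observed + [sum(unobserved)]  # one dummy allele, about 0.42

print("all 25 alleles:", power_sums(full))
print("collapsed to 5:", power_sums(collapsed))
```

With these made-up frequencies the sum of squares rises from about
0.098 to about 0.266 after collapsing, which is the kind of artificial
change the text warns about.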

5. Use the intermediate weighting function only

The APM programs calculate the results in terms of three weighting 
functions: A) f(p)=1, B) f(p)=1/sqrt(p), and C) f(p)=1/p.  These 
functions are used to make the sharing of rare marker alleles 
between affecteds a more significant event than the sharing of 
common marker alleles.  Function A corresponds to no weighting, 
while function C corresponds to strong weighting.  We recommend 
using and reporting only the results based on the intermediate 
function B, since extreme weighting (function C) usually leads to 
non-normality of the statistic, while no weighting (function A)  
may be too conservative (Weeks and Lange 1992).

It can be useful to examine the APM results under the three 
different weighting functions.  For example, if the result under 
function C is extremely significant, while the result under 
function A is not significant, then this may indicate that the 
affected relatives are sharing an extremely rare marker allele and 
that most of the significance comes from the fact that the marker 
allele is very rare.  If the allele frequency isn't accurately 
determined for the particular population being analyzed, this may 
provide motivation to estimate the allele frequency more 
accurately.   

6. Use simulation to check "significant" APM results 

The approximate significance of the APM statistic may be 
determined quickly and easily by referring to a statistical table 
for the standard normal distribution.  However, since this is an 
approximation, it is always desirable to check the "significant" 
APM results by simulation, especially if the number of families is 
less than 20 (Weeks and Lange 1992).  Two simulation programs, a 
single-locus and a multi-locus version, are distributed in the APM 
package for precisely this purpose.
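
For readers without a normal table at hand, the table lookup can be
sketched in Python as below. The one-sided upper-tail form is an
inference from the sample run later in this file (for example, a
statistic of 1.81882 is reported with p-value 0.03447), not something
taken from the program source. Likewise, the per-family statistics in
that run are consistent with (observed - mean) / sqrt(variance), which
the hypothetical helper standardized() reproduces.

```python
import math

def standardized(observed, mean, variance):
    """Standardize an observed value by its mean and variance."""
    return (observed - mean) / math.sqrt(variance)

def normal_p_value(z):
    """One-sided upper-tail p-value P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Figures printed for TESTPED1, locus 1, f(p) = 1 in the sample run.
z = standardized(2.00000, 1.26656, 0.27363)
print(round(z, 4), round(normal_p_value(z), 4))
```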

7. Always examine the data carefully

A significant APM result should provide the impetus to examine the 
data more carefully, especially because a significant APM result 
only indicates that the marker is not segregating independently of 
the disease.  A significant result does not necessarily imply that 
a linkage has been discovered, since the result could be due to 
association or misspecified marker allele frequencies.  For
example, if a large pedigree were completely homozygous at the
marker locus for a very rare allele, then this may generate a
significant APM result, even though, under traditional linkage
analysis, this pedigree contains no linkage information.  However,
if the marker allele is really as rare as indicated, then this may
be a valid result, since it is extremely unlikely that all
founders in the pedigree would be homozygous for that allele.


APM Programs

Kenneth Lange and Daniel Weeks originally wrote the APM programs, and
they have been modified by Mark Schroeder for this release.  The
programs come in two versions.  The many-locus version analyzes any
number of markers, independently, one at a time.  The multiple-locus
version analyzes a set of closely linked markers.  Both versions
accept similar input files which require numbering each pedigree from
top to bottom.  We provide a utility program for creating an APM file
from LINKAGE data files and for converting between APM formats (a
program which can convert from MENDEL was in the last release and has
not been changed). These utility programs require that the affected
individuals be identified by a unique code at the disease locus.  The
DOS-version and Unix-source of the APM programs may be obtained from
Dr. Weeks by sending two formatted DOS diskettes and a return
envelope.  Alternatively, Unix-source alone may be obtained by
e-mailing a request to Dr. Weeks at dweeks@watson.hgen.pitt.edu.


References  

[ see the accompanying REFERENCES file ]


An Example Run of APM, Sim, and Hist
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Here is a typical run of apm, sim, and hist on the test_ml.dat data
file (included in the ./examples directory). Aside from introducing
some blank lines here and there (to improve legibility) and some
comments, the output of the programs is unaltered. The output produced
by your copies should be the same (or very close). The use of apmmult,
simmult, and hist is very similar.

holmes-51:ls
test_ml.dat

holmes-52:apm
  The Affected Pedigree Member Method of Linkage Analysis 
    Single marker program apm, version 2.00; June 14, 1993
    by Daniel E. Weeks and Kenneth Lange 
    modified by Mark Schroeder
  Copyright (C) 1993  Daniel E. Weeks
 
********************************
| Daniel E. Weeks              |
| Department of Human Genetics |  Internet: 
| University of Pittsburgh     |    dweeks@watson.hgen.pitt.edu
| Crabtree Hall, Room A310     |
| 130 DeSoto Street            |  Bitnet:
| Pittsburgh, PA 15261         |    weeks@pittvms.bitnet
|                              |
| (412) 624-3066               |
| FAX: (412) 624-3020          |
********************************
 
Input the datafile file name and extension
> test_ml.dat
Input the limitation on memory use in megabytes
Entering 0 will set it to about 20 megabytes
> 0
Now enter the name of the file of coefficients
if it exists (if not just press <return>).
> 
If you wish to create a new file of coefficients,
enter the name (if not just press <return>).
> 
   TESTPED1            <--- pedigree title
    15     3
   0   0   2   2   2   0   0   0   5   5   5   7  12  12  12
   0   0   1   1   1   0   0   0   6   6   6   8  11  11  11
The    3 allele frequencies for locus   1  (   ACK1   ) are 
   0.45000
   0.30000
   0.25000
The    3 allele frequencies for locus   2  (   ACK2   ) are 
   0.55000
   0.20000
   0.25000
The    2 allele frequencies for locus   3  (   ACK3   ) are 
   0.46500
   0.53500
LOCUS   1       3
       4
      10
      13
making some new kinship calculations
f(p) = 1        
mean          1.26656
variance      0.27363
observed x    2.00000
The statistic for this family at this locus is    1.40212
f(p) = 1/sqrt(p)
mean          2.12586
variance      0.70359
observed x    3.65148
The statistic for this family at this locus is    1.81882
f(p) = 1/p      
mean          3.62500
variance      2.19444
observed x    6.66667
The statistic for this family at this locus is    2.05329
LOCUS   2       2
       4
      13
making some new kinship calculations
f(p) = 1        
mean          0.44219
variance      0.07280
observed x    0.50000
The statistic for this family at this locus is    0.21426
f(p) = 1/sqrt(p)
mean          0.68899
variance      0.16435
observed x    1.00000
The statistic for this family at this locus is    0.76717
f(p) = 1/p      
mean          1.12500
variance      0.54403
observed x    2.00000
The statistic for this family at this locus is    1.18630
LOCUS   3       2
      10
      13
making some new kinship calculations
f(p) = 1        
mean          0.56464
variance      0.05909
observed x    0.50000
The statistic for this family at this locus is   -0.26594
f(p) = 1/sqrt(p)
mean          0.79652
variance      0.11693
observed x    0.73324
The statistic for this family at this locus is   -0.18508
f(p) = 1/p      
mean          1.12500
variance      0.23499
observed x    1.07527
The statistic for this family at this locus is   -0.10259
   TESTPED2            <--- pedigree title
     8     2
   0   0   1   1   1   0   6   6
   0   0   2   2   2   0   5   5
LOCUS   2       2
       4
       7
making some new kinship calculations
f(p) = 1        
mean          0.47938
variance      0.06961
observed x    0.50000
The statistic for this family at this locus is    0.07817
f(p) = 1/sqrt(p)
mean          0.75565
variance      0.15779
observed x    1.00000
The statistic for this family at this locus is    0.61515
f(p) = 1/p      
mean          1.25000
variance      0.55682
observed x    2.00000
The statistic for this family at this locus is    1.00509
 
Overall Results:
  Weight         Locus   N.Fam. Statistic  P-value
f(p) = 1           ACK1      1   1.40212   0.08043
f(p) = 1           ACK2      2   0.20678   0.41808
f(p) = 1           ACK3      1  -0.26594   0.60486
f(p) = 1/sqrt(p)   ACK1      1   1.81882   0.03447
f(p) = 1/sqrt(p)   ACK2      2   0.97745   0.16417
f(p) = 1/sqrt(p)   ACK3      1  -0.18508   0.57342
f(p) = 1/p         ACK1      1   2.05329   0.02003
f(p) = 1/p         ACK2      2   1.54955   0.06062
f(p) = 1/p         ACK3      1  -0.10259   0.54087
 
There were     3 affecteds used for locus    ACK1   
There were     4 affecteds used for locus    ACK2   
There were     2 affecteds used for locus    ACK3   
 
Note: The p-values may be unreliable for small numbers of
families. We recommend using the simulation program "sim" and
the histogram generator "hist" to compute empirical p-values.
 
[ Since the number of families is small, the distribution
  of the statistic may not sufficiently resemble the asymptotic
  (normal) distribution, so the p-value approximation given by
  apm may not be reliable. We therefore run sim to produce
  replicates of the families, each of which will have random
  genotypes assigned to all the members (subject to the
  constraints of the pedigree structures and the allele
  frequencies, but NOT assuming linkage between the markers
  and the disease). This effectively simulates the distribution
  of the statistic when there is no linkage. ]

holmes-53:ls
out1.dat        out1p.dat       outsqr.dat      table.out       test_ml.dat

holmes-54:sim
random seed: 75449

Please enter the data file name: outsqr.dat
f(p) = 1/sqrt(p)

[ We usually use this weighting function. ]

   2   3
   3
     0.45
     0.3
     0.25
   3
     0.55
     0.2
     0.25
   2
     0.465
     0.535
Input the desired number of iterations (1000 is good): 1000
REMEMBER: Not all families may be used at all loci

TESTPED1            
    15    3    3
  0  0  2  2  2  0  0  0  5  5  5  7  12  12  12
  0  0  1  1  1  0  0  0  6  6  6  8  11  11  11
   1   3
     4
     10
     13
   2.12586 0.70359
   2   2
     4
     13
   0.68899 0.16435
   3   2
     10
     13
   0.79652 0.11693
 
TESTPED2            
    8    2    1
  0  0  1  1  1  0  6  6
  0  0  2  2  2  0  5  5
   2   2
     4
     7
   0.75565 0.15779
 
For locus 1, the mean is -0.0122673 and the variance 0.9499
For locus 2, the mean is -0.0273151 and the variance 0.945936
For locus 3, the mean is -0.0213009 and the variance 0.946877

[ If the normal approximation holds for the simulated
  statistics, the mean should be near zero and the variance
  should be near one. ]

1000 iterations       f(p) = 1/sqrt(p)

holmes-55:ls
out1.dat        outsqr.dat      test_ml.dat     tstat2.out
out1p.dat       table.out       tstat1.out      tstat3.out

[ Now that we have the simulated families, we can use
  hist to see where our real families lie in the normal
  distribution. Basically, we calculate the p-value of
  the statistic generated by apm by finding where it
  lies within the (approximately) normal distribution
  that we have generated. In this example we will do
  this only for the first locus. ]

holmes-56:hist -p - -s tstat1.out

[ tstat1.out contains all the simulated statistics for
  the first locus. ]

Reading samples from file 'tstat1.out'
Number of Samples: 1000     Sample Range: -2.5344 to 4.61865
Mean: -0.0122677    RMS: 0.974705    Mean Deviation from Mean: 0.773703
Variance:  0.95085   Standard Deviation: 0.975115
Skewness: 0.955855   Kurtosis:  1.43903
   -2.3344        2 ]
   -1.9344        0 
B  -1.5344       56 ###########[
I  -1.1344      121 ########################[
N  -0.7344      162 ################################]
   -0.3344      216 ###########################################[
C   0.0656       88 #################]
E   0.4656      157 ###############################]
N   0.8656      124 #########################
T   1.2656        0 
E   1.6656       31 ######[
R   2.0656       18 ###]
    2.4656        0 
    2.8656       21 ####[
    3.2656        0 
    3.6656        0 
    4.0656        3 ]
    4.4656        1 [
Bin Width: 0.4     Block Width: 5
Key for Partial Blocks:   #: >~ 2/3   ]: >~ 1/3   [: <~ 1/3

Computing empirical p-values:
  Enter a statistic (Ctrl-D when done): 1.81882
  P-value for 1.81882: 0.043
  Enter a statistic (Ctrl-D when done): ^D

[ We want to see what the p-value is for 1.81882, since that
  is the statistic generated by apm for the first locus and
  for the weighting function we are using (f(p) = 1/sqrt(p)).
  1.81882 lies in the bin centered on 1.6656. There are
  many fewer simulated statistics greater than 1.81882
  than there are less than that number; hence, the p-value
  is much less than 0.5. The p-value estimates the probability
  of obtaining a statistic at least this large when there is
  no linkage, and in this case 0.043 may be low enough to be
  significant. ]

holmes-57:ls
out1.dat        outsqr.dat      test_ml.dat     tstat2.out
out1p.dat       table.out       tstat1.out      tstat3.out
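
The empirical p-value reported by hist in the run above can be
understood through a sketch like the following. It is not the hist
source: whether hist counts simulated statistics that are "greater
than or equal to" or strictly "greater than" the entered value is an
assumption here, and reading tstat1.out is omitted because its layout
is not described in this file. The function name is hypothetical.

```python
def empirical_p_value(simulated, observed):
    """Fraction of simulated statistics at least as large as observed."""
    hits = sum(1 for s in simulated if s >= observed)
    return hits / len(simulated)

# Tiny made-up sample; a real run would use the 1000 simulated values.
sample = [0.1, 2.0, -1.0, 1.9]
print(empirical_p_value(sample, 1.81882))
```

In the run above, 43 of the 1000 simulated statistics fall in the
histogram bins above the one containing 1.81882, which agrees with the
reported p-value of 0.043.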