From: softlib.cs.rice.edu                           Last mod: September 27, 1995	

                         GenoCheck, version 1.0

Each section in each README file starts with the string "|*|". To
browse the sections, use your file viewer to search for this unique
string. 

This file describes in detail the error checking scheme that was 
implemented by M. G. Ehm, R. W. Cottingham, Jr., and M. Kimmel.
The error checking system in conjunction with FASTLINK
identifies individuals and loci likely to contain errors using
a likelihood based method.  

|*| Introduction
    ------------

As described in the papers:

M. G. Ehm, R. W. Cottingham Jr., and M. Kimmel.  Error Detection
in Genetic Linkage Data Using Likelihood Based Methods.  Journal 
of Biological Systems, Vol. 3, No. 1 (1995) 13-25.

M. G. Ehm, R. W. Cottingham Jr., and M. Kimmel.  Error Detection
in Genetic Linkage Data Using Likelihood Based Methods.  American
Journal of Human Genetics, Vol. 58, No. 1 (1996) (to appear).

this directory and its subdirectories contain version 1.0 of the
error detection scheme described in the papers above.  The error 
detection algorithm, called GenoCheck, uses an altered version of 
the ILINK program, called ILINKERR, from FASTLINK 2.2. 

The occurrence of laboratory typing error in pedigree data for
linkage analysis cannot be ignored.  When studying linked markers between
which crossovers rarely occur, errors in the data will often result in         
false recombinations.  Erroneous recombinations in a dense map are given
substantial weight, thereby increasing the estimate of theta, the
recombination fraction.  In dense maps, theta approaches the error
rate and most observed crossovers will be spurious.  We present
a method for detecting errors in pedigree data.  Its index is a variant of
the likelihood ratio test statistic and is used to test the null hypothesis
of no error for each individual at each locus versus the alternative 
hypothesis of error.  High values of the index pinpoint individuals and 
loci with relatively unlikely genotypes.  Power and significance studies 
using Monte Carlo methods show that the index detects errors for 
small values of theta with a small false positive rate.


|*| The Process
    -----------

When pedigree data are obtained by typing individuals, the observed genotype
is equal to the true genotype unless a typing error has occurred. 
We represent error in pedigree data as incomplete penetrance of genotypes. 
The observed genotypes are considered phenotypes and may not correspond to 
the true genotypes due to errors.  Therefore, modeling error in pedigree
data is easily accomplished using the likelihood method of genetic linkage 
analysis by altering the penetrance function.  Our method is designed to 
identify individuals and loci likely to contain errors.  The method is 
equivalent to a hypothesis test for error for each individual and locus 
in the pedigree.  Each hypothesis test entails:

(1)  Specifying a penetrance function based on an assumed error rate.

(2)  Calculating the difference between the log-likelihood of the data at
     the maximum likelihood estimates of theta assuming complete penetrance
     (i.e. no errors) and the log-likelihood of the data at the maximum
     likelihood estimates of theta assuming incomplete penetrance (errors
     possible).

(3)  Identifying test statistics with relatively large values as indicative
     of an unlikely genotype, since large values are associated with more
     evidence for errors than for no errors.

The GenoCheck program implements steps (1)-(3).  Its output is a file
containing the values of the test statistic separated by family and locus
and ranked in decreasing order.
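
As an illustration only, the ranking in step (3) can be sketched in Python.
The function name, the dictionary layouts, and the sign convention (error-model
log-likelihood minus no-error log-likelihood, so that larger values mean more
evidence for an error) are assumptions made for this sketch and are not taken
from ILINKERR.  The log-likelihoods are assumed to be already maximized over
theta, and the numbers are made up.

```python
from collections import defaultdict


def rank_test_statistics(loglik_no_error, loglik_with_error):
    """Rank individuals by test statistic within each (pedigree, locus).

    loglik_no_error:   {pedigree: log-likelihood assuming no errors}
    loglik_with_error: {(pedigree, locus, individual): log-likelihood
                        when an error is allowed for that individual
                        at that locus}
    Both layouts are hypothetical, chosen only for this sketch.
    """
    groups = defaultdict(list)
    for (ped, locus, person), ll_err in loglik_with_error.items():
        # Assumed sign convention: larger value = more evidence of error.
        stat = ll_err - loglik_no_error[ped]
        groups[(ped, locus)].append((person, stat))
    # Decreasing order within each pedigree and locus.
    return {
        key: sorted(rows, key=lambda row: row[1], reverse=True)
        for key, rows in groups.items()
    }


# Made-up numbers for one pedigree and two loci.
no_error = {1: -120.0}
with_error = {
    (1, "D1", 101): -119.5,
    (1, "D1", 102): -121.0,
    (1, "D2", 101): -120.25,
}
ranked = rank_test_statistics(no_error, with_error)
```

In this toy input, individual 101 at locus D1 is the only entry with a
statistic above 0.0 and therefore heads its list.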


|*| The Files
    ---------

The following is a list of the files associated with GenoCheck.

pedin.dat        -  File that contains pedigree data.  All loci being
                    checked for errors must be in the affection status
                    locus type.  

datain.dat       -  File that contains locus and parameter data.  Again all
                    loci being checked for errors must be in the affection 
                    status locus type.  Only the loci formats must be 
                    specified.  All program specific parameters at the end 
                    of the file should be omitted.  (i.e. Do not specify 
                    any recombination fraction information.)

toaff.c          -  Auxiliary program that converts the files: indat 
                    (datain.dat format) and inped (pedin.dat format) containing
                    any locus type to the files: outdat (datain.dat format) and
                    outped (pedin.dat format) containing the affection status 
                    locus type.  The program requires the input pedin.dat file 
                    to be named inped and the input datain.dat file to be named
                    indat.  The output files are outped and outdat.  The 
                    program in no way alters the information content of the 
                    files.  For example, the lod scores obtained with the input
                    files should be the same as those for the output files.  

ilinkerr         -  The executable file that calculates the test statistics for
                    error checking.

PosError         -  Output file of the ILINKERR program.  Contains a list
                    of the test statistics for each individual at each locus
                    in each pedigree.  The test statistics within each pedigree
                    and locus are listed in order of decreasing test statistic.

pedinerr         -  Modified version of the pedin file created by lcp.  It is
                    identical to pedin except that in pedinerr, ILINKERR is 
                    called instead of ilink.  You may alter pedin yourself
                    using a text editor or use suberr described below.  

suberr           -  Command that creates the file pedinerr by replacing the 
                    call to ilink with a call to ILINKERR in the pedin file 
                    created by lcp.    
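
The substitution that suberr is described as performing can be sketched in
Python as follows.  This is not the actual suberr command: the function name
is hypothetical, and the sketch assumes that the call to ilink appears as the
first word on its own command line in the pedin script.  The script fragment
is invented for the example and is not real lcp output.

```python
import re


def make_pedinerr(pedin_text):
    """Return the script text with calls to ilink changed to ilinkerr.

    Assumes (hypothetically) that the call is the first word on its line.
    Other mentions of "ilink", such as file names, are left alone.
    """
    return re.sub(
        r"^([ \t]*)ilink(?=\s|$)", r"\1ilinkerr", pedin_text, flags=re.MULTILINE
    )


# Hypothetical script fragment, not actual lcp output.
example = "unknown\nilink\ncat final.out\n"
converted = make_pedinerr(example)
```

Applying the function a second time leaves the text unchanged, because
"ilinkerr" no longer matches the pattern.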
                    

|*| Setting up an Error Checking Run
    --------------------------------

In order to perform error checking on marker data, you must complete
the following checklist.

(0)  Note that the error checking capability is not available for
     sex-linked data, mutation data, or sex difference data (male and
     female theta are assumed to be the same).  The program will exit
     politely with an error message in these situations.  

(1)  The marker data being tested for errors must be in the affection status
     format.  The program TOAFF will convert any data format to the affection 
     status format.  The file "inped" (pedin.dat format) should contain the
     pedigree data to be converted to affection status.  The file "indat"
     (datain.dat format) should contain the parameter information corresponding
     to inped.  TOAFF requires no parameters.  To run TOAFF, type "toaff" on
     the command line.  For convenience, copy the output file "outped" to
     pedin.dat and the output file "outdat" to datain.dat.

(2)  Partition the ordered markers into 2-, 3-, and 4-point analyses.
     If n is the number of individuals and m is the number of loci to be
     analyzed jointly, then GenoCheck requires n*m likelihood evaluations
     beyond those needed to find the maximum likelihood estimate of the
     recombination fractions.  Therefore, in general, if the recombination
     fractions can be estimated using 2-point analysis, then error checking
     is possible using 2-point analysis; likewise, if they can be estimated
     using 3-point analysis, then error checking is possible using 3-point
     analysis.

(3)  Assume the published order for the markers to be checked for error or
     find the most likely order.  

(4)  Create a subdirectory for each error analysis.  Each subdirectory should
     contain a pedin.dat and datain.dat file (markers in the affection status
     format).  

(5)  Use lcp to create a script for each run.  The guide below will assist you
     with the options.  The output of lcp is a script named pedin.

     Pedigree Options:  General Pedigrees
     General Pedigree Analysis Options:  ILINK
     ILINK - Order Options:  Specific order
     ILINK - Sex Difference Options:  No sex difference
     ILINK - Locus Order Specification:  (Specify the most likely order with
                                          recombination fractions equal to 0.1
                                          or the published values if available.)

(6)  Run suberr.   This command uses the file pedin created in step (5) and
     creates the executable file pedinerr which contains the commands needed
     to run ILINKERR instead of ILINK.

(7)  Run pedinerr.  The file PosError will contain the error checking results.  
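
To get a feel for the cost noted in step (2), the following Python sketch
counts the n*m extra likelihood evaluations for a set of multipoint analyses.
Splitting the ordered markers into overlapping windows of adjacent markers is
only one possible partition and is an assumption of this sketch, as are the
function names, the marker names, and the pedigree size.

```python
def adjacent_windows(markers, k):
    """Overlapping windows of k adjacent markers (an assumed partition)."""
    return [markers[i:i + k] for i in range(len(markers) - k + 1)]


def extra_evaluations(n_individuals, windows):
    """Sum of n*m extra likelihood evaluations over windows of m loci."""
    return sum(n_individuals * len(window) for window in windows)


markers = ["M1", "M2", "M3", "M4"]          # hypothetical ordered markers
three_point = adjacent_windows(markers, 3)  # two 3-point analyses
cost = extra_evaluations(40, three_point)   # 40 individuals, made up
```

Here each 3-point analysis adds 40*3 = 120 evaluations, so the two analyses
together add 240 on top of the evaluations needed to estimate theta.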
         

|*| Interpreting an Error Checking Run
    ----------------------------------

In the file PosError, the test statistics are separated by locus within
each pedigree.  Within each pedigree and locus, each individual is listed 
with its associated test statistic in order of decreasing test statistic.
As briefly described above, test statistics with relatively large values
are indicative of an unlikely genotype for that individual at that locus.
Test statistics greater than 0.0 are of particular interest. Note that test 
statistics are not comparable across different pedigrees or loci.  In the 
presence of multiple errors, the program is likely to catch only some errors.
Therefore, it is very important to correct any errors found and rerun the
program.

The ordered list of individuals within pedigree and locus given in PosError 
should be thought of as a priority list for retyping.  Interpreting an error 
checking run includes the following steps:


(1) Reread gels and check computer file entries for individuals in the top
    20% of the locus lists within each pedigree.   If no errors are found and
    all the test statistics are less than 0 then stop error checking.  If there 
    are any errors, correct them, run the analysis again, and go to step 2.

(2) Retype each individual in the top 10% of the locus lists within each
    pedigree.  If there are no errors, then stop error checking.  If errors 
    are present, correct them and run the analysis again.
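
The two selection steps above can be sketched in Python for a single
pedigree and locus.  The sketch works on a list that is already sorted by
decreasing test statistic; it does not read PosError, whose layout is not
assumed here.  The function names, the rounding rule (round up), and the
numbers are assumptions made for illustration.

```python
def top_percent(ranked, percent):
    """First `percent` percent of a list sorted by decreasing statistic.

    Rounds up, so a non-empty list always yields at least one individual
    (this rounding rule is an assumption of the sketch).
    """
    count = (len(ranked) * percent + 99) // 100
    return ranked[:count]


def all_negative(ranked):
    """True if every test statistic in the list is less than 0."""
    return all(stat < 0 for _, stat in ranked)


# Made-up (individual, test statistic) pairs for one pedigree and locus.
ranked = [
    (7, 1.2), (3, 0.4), (12, -0.1), (5, -0.3), (9, -0.8),
    (1, -1.0), (2, -1.5), (4, -2.0), (6, -2.2), (8, -3.1),
]
reread = top_percent(ranked, 20)   # step (1): reread gels, check entries
retype = top_percent(ranked, 10)   # step (2): retype
keep_checking = not all_negative(ranked)
```

With ten individuals, step (1) selects the first two entries and step (2)
the first one; because some statistics exceed 0, the stopping condition in
step (1) is not met for this list.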