From: softlib.cs.rice.edu
Last mod: September 27, 1995

GenoCheck, version 1.0

Each section in each README file starts with the string "|*|".
To browse the sections, use your file viewer to search for this
unique string.

This file describes in detail the error checking scheme that was
implemented by M. G. Ehm, R. W. Cottingham, Jr., and M. Kimmel.
The error checking system, in conjunction with FASTLINK, identifies
individuals and loci likely to contain errors using a likelihood
based method.

|*| INTRODUCTION
    ------------

This directory and its subdirectories contain version 1.0 of the
error detection scheme described in the following papers:

  M. G. Ehm, R. W. Cottingham Jr., and M. Kimmel.
  Error Detection in Genetic Linkage Data Using Likelihood Based
  Methods.
  Journal of Biological Systems, Vol. 3, No. 1 (1995) 13-25.

  M. G. Ehm, R. W. Cottingham Jr., and M. Kimmel.
  Error Detection in Genetic Linkage Data Using Likelihood Based
  Methods.
  American Journal of Human Genetics, Vol. 58, No. 1 (1996)
  (to appear).

The error detection algorithm, called GenoCheck, uses an altered
version of the ILINK program, called ILINKERR, from FASTLINK 2.2.

The occurrence of laboratory typing error in pedigree data for
linkage analysis cannot be ignored. When studying linked markers
between which crossovers rarely occur, errors in the data will often
result in false recombinations. Erroneous recombinations in a dense
map are given substantial weight, thereby increasing the estimate of
theta, the recombination fraction. In dense maps, theta approaches
the error rate and most observed crossovers will be spurious.

We present a method for detecting errors in pedigree data. The method
uses an index that is a variant of the likelihood ratio test
statistic. The index is used to test the null hypothesis of no error
for each individual at each locus versus the alternative hypothesis
of error. High values of the index pinpoint individuals and loci with
relatively unlikely genotypes.
Power and significance studies using Monte Carlo methods show that
the index detects errors for small values of theta with a small false
positive rate.

[This README file has been organized with each section starting with
the string "|*|". To browse the sections, you can thus use your file
viewer to search for this unique string, thus getting from one
section to the other without having to read the intervening
material.]

|*| The Process
    -----------

When pedigree data are obtained by typing individuals, the observed
genotype is equal to the true genotype unless a typing error has
occurred. We represent error in pedigree data as incomplete
penetrance of genotypes. The observed genotypes are considered
phenotypes and may not correspond to the true genotypes due to
errors. Therefore, modeling error in pedigree data is easily
accomplished using the likelihood method of genetic linkage analysis
by altering the penetrance function.

Our method is designed to identify individuals and loci likely to
contain errors. The method is equivalent to a hypothesis test for
error for each individual and locus in the pedigree. Each hypothesis
test entails:

(1) specifying a penetrance function based on an assumed error rate,

(2) calculating the difference between the log-likelihood of the
    data at the maximum likelihood estimates of theta assuming
    complete penetrance (i.e. no errors) and the log-likelihood of
    the data at the maximum likelihood estimates of theta assuming
    incomplete penetrance (errors possible),

(3) identifying test statistics with relatively large values as
    indicative of an unlikely genotype since large values are
    associated with more evidence for errors than for no errors.

The GenoCheck program implements steps (1)-(3). Its output is a file
containing the values of the test statistic separated by family and
locus and ranked in decreasing order.

|*| The Files
    ---------

The following is a list of the files associated with GenoCheck.

pedin.dat  - File that contains pedigree data. All loci being
             checked for errors must be in the affection status
             locus type.

datain.dat - File that contains locus and parameter data. Again all
             loci being checked for errors must be in the affection
             status locus type. Only the loci formats must be
             specified. All program specific parameters at the end
             of the file should be omitted. (i.e. Do not specify any
             recombination fraction information.)

toaff.c    - Auxiliary program that converts the files:
               indat (datain.dat format) and
               inped (pedin.dat format)
             containing any locus type to the files:
               outdat (datain.dat format) and
               outped (pedin.dat format)
             containing the affection status locus type.
             The program requires the input pedin.dat file to be
             named inped and the input datain.dat file to be named
             indat. The output files are outped and outdat. The
             program in no way alters the information content of the
             files. For example, the lod scores obtained with the
             input files should be the same as those for the output
             files.

ilinkerr   - The executable file that calculates the test statistics
             for error checking.

PosError   - Output file of the ILINKERR program. Contains a list of
             the test statistics for each individual at each locus
             in each pedigree. The test statistics within each
             pedigree and locus are listed in order of decreasing
             test statistic.

pedinerr   - Modified version of the pedin file created by lcp. It
             is identical to pedin except that in pedinerr, ILINKERR
             is called instead of ilink. You may alter pedin
             yourself using a text editor or use suberr described
             below.

suberr     - Command that creates the file pedinerr by replacing the
             call to ilink with a call to ILINKERR in the pedin file
             created by lcp.

|*| Setting up an Error Checking Run
    --------------------------------

In order to perform error checking on marker data, you must complete
the following checklist.

(0) Note that the error checking capability is not available for
    sex-linked data, mutation data and sex difference data (male and
    female theta are assumed to be the same). The program will exit
    politely with an error message in these situations.

(1) The marker data being tested for errors must be in the affection
    status format. The program TOAFF will convert any data format to
    the affection status format. The file "inped" (pedin.dat format)
    should contain the pedigree data to be converted to affection
    status. The file "indat" (datain.dat format) should contain the
    parameter information corresponding to inped. TOAFF requires no
    parameters. To run TOAFF type "toaff" on the command line. For
    convenience copy the output files "outped" into pedin.dat and
    "outdat" into datain.dat.

(2) Partition the ordered markers into 2, 3, and 4 point analyses.
    If n is the number of individuals and m is the number of loci to
    be analyzed jointly, then GenoCheck requires n*m more likelihood
    evaluations beyond finding the maximum likelihood estimate of
    the recombination fractions. Therefore, in general, if the
    recombination fractions can be estimated using 2-point analysis,
    then error checking is possible using 2-point analysis; likewise,
    if the recombination fractions can be estimated using 3-point
    analysis, then error checking is possible using 3-point analysis.

(3) Assume the published order for the markers to be checked for
    error or find the most likely order.

(4) Create a subdirectory for each error analysis. Each subdirectory
    should contain a pedin.dat and datain.dat file (markers in the
    affection status format).

(5) Use lcp to create a script for each run. The guide below will
    assist you with the options. The output of lcp is a script named
    pedin.

    Pedigree Options:                    General Pedigrees
    General Pedigree Analysis Options:   ILINK
    ILINK - Order Options:               Specific order
    ILINK - Sex Difference Options:      No sex difference
    ILINK - Locus Order Specification:   (Specify the most likely
                                         order with recombination
                                         fractions equal to 0.1 or
                                         the published values if
                                         available.)

(6) Run suberr. This command uses the file pedin created in step (5)
    and creates the executable file pedinerr which contains the
    commands needed to run ILINKERR instead of ILINK.

(7) Run pedinerr. The file PosError will contain the error checking
    results.

|*| Interpreting an Error Checking Run
    ----------------------------------

In the file PosError, the test statistics are separated by locus
within each pedigree. Within each pedigree and locus, each individual
is listed with its associated test statistic in order of decreasing
test statistic. As briefly described above, test statistics with
relatively large values are indicative of an unlikely genotype for
that individual at that locus. Test statistics greater than 0.0 are
of particular interest. Note that test statistics are not comparable
across different pedigrees or loci.

In the presence of multiple errors, the program is likely to catch
only some errors. Therefore correcting any errors and rerunning the
program is very important. The ordered list of individuals within
pedigree and locus given in PosError should be thought of as a
priority list for retyping.

Interpreting an error checking run includes the following steps:

(1) Reread gels and check computer file entries for individuals in
    the top 20% of the locus lists within each pedigree. If no
    errors are found and all the test statistics are less than 0
    then stop error checking. If there are any errors, correct them,
    run the analysis again, and go to step 2.

(2) Retype each individual in the top 10% of the locus lists within
    each pedigree. If there are no errors, then stop error checking.
    If errors are present, correct them and run the analysis again.
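
The selection rule in interpretation steps (1) and (2) can be
sketched in Python. This is an illustration only and is not part of
GenoCheck. The layout of PosError is not described in this file, so
the sketch assumes the statistics have already been read into a
dictionary keyed by (pedigree, locus); the function names and the
choice to round the 20% or 10% cut-off up to a whole individual are
assumptions made for this example.

```python
def top_percent(ranked, percent):
    """Return the leading `percent` of one pedigree/locus list.

    `ranked` is a list of (individual_id, statistic) pairs. The list is
    re-sorted in decreasing order of statistic, which is the order the
    README says PosError uses. Rounding the cut-off up to a whole
    individual is an assumption of this sketch.
    """
    ordered = sorted(ranked, key=lambda pair: pair[1], reverse=True)
    # Integer ceiling of percent * n / 100, avoiding float rounding.
    count = (percent * len(ordered) + 99) // 100
    return ordered[:count]


def retyping_candidates(results, percent):
    """Apply top_percent separately to every (pedigree, locus) list.

    Lists are never merged, because the statistics are not comparable
    across different pedigrees or loci.
    """
    return {key: top_percent(pairs, percent)
            for key, pairs in results.items()}


def all_statistics_negative(results):
    """True if every statistic in every list is below zero."""
    return all(stat < 0
               for pairs in results.values()
               for _, stat in pairs)


if __name__ == "__main__":
    # Hypothetical, hand-made values; not real GenoCheck output.
    example = {
        ("ped1", "locus1"): [("101", -0.5), ("102", 1.2), ("103", 0.1),
                             ("104", -2.0), ("105", -0.1)],
    }
    print(retyping_candidates(example, 20))   # step (1): reread gels
    print(retyping_candidates(example, 10))   # step (2): retype
    print(all_statistics_negative(example))
```

Keeping each (pedigree, locus) list separate mirrors the note above
that statistics should only be ranked within a pedigree and locus.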