Documentation for 'block'.
--------------------------
(August 1996)

Contents
--------
1. Description and Purpose.
2. Options.
3. File formats.
   a. Input files.
   b. Log files.
   c. Block files.
   d. Results.
4. Hints and tips.
   a. Block selection.
   b. More than 2 alleles - Reducibility.
   c. Differences for DOS version.
5. References.

1. Description and Purpose
--------------------------
The 'block' program can be used for performing pedigree and linkage
analysis. More specifically, it can be used for :

 - any pedigree analysis involving an arbitrary number of alleles,
   incomplete penetrance and liability classes. The pedigree may
   contain an arbitrary number of loops. The number of loops is
   limited only by memory (but may be large).

 - any two-point linkage analysis involving an arbitrary number of
   alleles at each locus. Convergence is guaranteed only in the case
   where both loci have two alleles. In cases with more alleles,
   convergence can be obtained by specifying user-defined blocks (see
   the 'block' keyword in section 3.a and the discussion in section
   4.b).

To my knowledge, no other programs in the public domain can perform the
two tasks mentioned above. Currently available programs are very much
limited by the number of loops in the pedigree, and are able to handle
only very few (10-20 ?). 'block' has successfully run examples in
pedigree analysis with thousands of loops.

The program basically functions in the following way :

 a. A pedigree is read into memory, and converted to the junction tree
    representation, see [1]. If linkage analysis is performed, an
    initial recombination fraction must be specified in the pedigree
    specification.

 b. A number of blocks are selected, all of which can be sampled
    exactly. Read about the block selection procedure in [2].
    Precompiled blocks may also be read from disk.

 c. A starting configuration is found.

 d. Warm-up is performed. This is usually 10% of the iterations.

 e. The specified number of iterations of blocking Gibbs are performed.
    If linkage analysis is performed, the numbers of recombinations and
    non-recombinations are recorded. A paper regarding this is
    currently under way.
    If linkage analysis is performed, the results can be further
    processed with the 'theta' program, described in theta.man.

 f. The results are stored on disk.

2. Options
----------
The 'block' program is run from the command line, and accepts a large
number of options. Each of these options is explained below. Help can
also be obtained with the '-h' option.

This is the format of the 'block' program :

  block [-hvBEHLQS] [-b#] [-C] [-d] [-i#] [-m#] [-M] [-n#] [-N#] [-O#]
        [-r#] [-R#] [-t#] [-w#] [-x#] [-Z#] netfile

'netfile' is the name of the file describing the pedigree or linkage
analysis problem.

Option  Description

-b      This option specifies how to treat the blocks :
          0 - load precompiled blocks from disk.
          1 - construct new blocks, but don't save them (default).
          2 - construct new blocks and save them.

-B      Use most probable (Best) starting configuration. Finding a
        starting configuration may be very difficult. This option
        selects an alternative method, which attempts to find the most
        probable starting configuration, as this may be easier in some
        cases. Use this option if the ordinary method has problems
        finding a starting configuration.

-C      Specify file to load starting configuration from. This can be
        used to avoid having to find a new starting configuration on
        each run, which may be very time-consuming.

-D      Specify the method for selecting the blocks. There are three
        methods to choose from, ranging from a slow method providing
        high quality blocks to a fast method providing medium quality
        blocks. The slow, most optimal method is described in [2].
          0 - slow most optimal method (default).
          1 - faster less optimal method.
          2 - fastest least optimal method.

-E      Attempt to treat the net exactly. This may be possible for
        smaller pedigrees. For pedigree analysis this yields the exact
        results. For linkage analysis this would yield exact simulation
        (not implemented yet).

-f      This option controls the forward sampling of barren nodes. A
        node is defined as barren if there is no evidence on it, and
        it either has no offspring or all of its offspring are barren
        as well. All barren nodes can be forward sampled instead of
        being included in the blocking Gibbs sampler. This enables
        'block' to make the blocks smaller, and thus use less memory.
        The precision of the estimates seems similar to that obtained
        when using blocking Gibbs on all individuals. Forward sampling
        of barren nodes is on by default; if '-f' is specified, it is
        turned off.

-h      Show a help page.

-H      Use memory for backup of tables. The default is to back up on
        disk. Extra memory is necessary for storing the initial values
        of the tables, as these values are entered at each
        initialization of the junction trees. Depending on your memory
        setup, you may be forced to store tables on the disk (which is
        the default), but you may also be able to get 'block' to run
        faster by storing the tables in main memory.

-i      Number of iterations. At each iteration, each block is sampled
        once.

-L      Perform linkage analysis. This option may be specified only
        when option -N3 is also used. If option -N3 is used, and -L is
        not specified, a simple inference is performed, and the
        marginal probabilities of all variables (given the starting
        recombination fraction) are saved.

-m      Maximal amount of memory available for blocks, specified in
        units of 8 bytes (which is what one floating point number
        usually takes up). Default is 100,000.

-M      Specify a list of strings in one of two ways :
          1 : -M#string1,string2,...,stringn#
          2 : -Mstring1,string2,...,stringn
        The first form causes 'block' to monitor those variables whose
        name is identical to one of the strings. The second causes
        'block' to monitor variables whose names contain one of the
        strings. Thus, you can obtain either exact or substring match.

-n      Number of blocks to be constructed. The default is 5. You will
        notice that more blocks are often constructed than specified.
        This is because 'block' in many cases must construct extra
        blocks to ensure irreducibility. If a very large and complex
        problem is being handled, it will most likely be necessary to
        specify a large number of blocks. Try the default 5 blocks
        first, then 10, 15, etc., until blocking Gibbs is able to
        handle the problem in a satisfying way.

-N      Type of input file given to 'block' :
          1 - pedigree 1 format. Pedigree analysis with complete
              penetrance.
          2 - pedigree 2 format. Pedigree analysis with incomplete
              penetrance.
          3 - linkage analysis format.
        The input file formats are described further down.

-O      Specify number of iterations after which 'block' is forced to
        output the configuration of the net. The configuration will be
        stored in the file : 'work//results/conf.<#iterations>'.

-Q      Run 'block' quietly with very little output.

-r      The type of representation to use for pedigrees. This option is
        valid only for pedigree analysis (-N1 and -N2). It has two
        values :
          1 - variables represent genotypes (default).
          2 - variables represent alleles.
        A description of representation 2 and its virtues can be found
        in [3]. Only representation 2 can be used when running linkage
        analysis. In a few words, representation 1 uses less memory
        than representation 2, but representation 2 provides more
        information than representation 1. Specifically, representation
        2 provides all the information that representation 1 provides
        _and_ in addition information on the level of the allele. This
        information is needed when running linkage analysis, thus
        representation 1 cannot be used there.

-R      Force 'block' to output intermediary results at the specified
        iterations. If no value is passed to -R, the default is to
        output intermediary results at 100, 200, 500, 1000, 2000, 5000,
        ... iterations. A different list of numbers can be specified
        with -R<#1>,<#2>,<#3>,<#4>,...,<#n>. The intermediary results
        are printed in the file : 'work//results/results.<#iterations>'.

-s      Criterion used when selecting the blocks. This criterion
        specifies the maximum number of blocks that a variable may be
        removed from. Read more about this in [2]. A variable cannot be
        removed from all blocks, as it would then never be sampled. If
        a very large and complex pedigree is being handled, it may be
        necessary to remove certain variables from most of the blocks
        for blocking Gibbs to be able to perform exact sampling on the
        blocks. In this case, option -s2 should most likely be used, as
        this allows variables to be removed from all blocks except one.
          1 - #blocks/2+1 (default)
          2 - #blocks-1
          3 - #blocks/4+1
          4 - 2*#blocks/3+1

-t      Triangulation method to use on the pedigree. Read more about
        triangulation methods in [4]. The default method is usually
        adequate, but in very hard cases, -t5 should be tried.
          0 - default
          1 - minimum fill-in edges
          2 - minimum clique size
          3 - minimum clique weight (current default)
          4 - minimum fill-in weight
          5 - try each of the above 10 times and select the best

-v      Verbose mode on. Run with this option to get a lot of extra
        information.

-w      Number of iterations in the warm-up phase. The default is to do
        10% of the specified number of iterations as warm-up. For
        example, if -i100 is specified, 10 iterations of warm-up are
        performed first, followed by the 100 main iterations.

-x      Number of extra simulations to do when each block is treated.
        This option would be important if simulations were fast
        compared with stepping from one iteration to the next. This
        does not seem to be the case here, though, so the option is
        believed to be rarely useful.

-Z      Seed option. This option allows you to use and modify the seed
        used for random operations in 'block'. The seed is stored in
        the file 'work//general/SEED'.
          0 - use old seed in 'SEED'
          1 - find new seed and save it in 'SEED'
          2 - use new seed but do not save it (default)

3. File formats
---------------
In this section the formats of the files used by 'block' are described.
First, the input files describing the pedigree or linkage analysis
problem, then the log files and the per-block files, and finally the
files containing the results.

3.a. Input files
----------------
There are three types of input files, declared with either the -N1, -N2
or -N3 option.

-N1 - pedigree 1 format.

  This pedigree format should be used if a pedigree analysis with
  complete penetrance is wanted. Examples of such pedigrees can be
  found in 'peds/ped_ex1' and 'peds/ped_ex2'. This format is very
  simple :

  o # : Comments can be specified by starting the line with '#'.

  o nalleles : Number of alleles can be specified with 'nalleles ='.

  o palleles : Prior allele population probabilities can be specified
      with 'palleles = (p1 ... pn)'. If it is left out, uniform
      probabilities will be assigned.

  o block : A block can be specified with one of the following (see
      'peds/ped_ex2' for an example) :

      'block expand =' : a list of individual names must be given.
          The variables that are created for these individuals will
          all be contained in at least one block (and will thus be
          sampled simultaneously).

      'block exact =' : a list of variable names must be given. These
          variables will all be contained in at least one block. The
          variables that are created for an individual vary with the
          representation (-r1 or -r2). For representation 1, one
          variable is created for each individual (with the same name
          as the individual). For representation 2, the following
          variables may be created for an individual A :
            A.f : one of A's alleles. The one inherited from A's father
            A.m : the second of A's alleles. The one inherited from A's
                  mother
            A.g : the genotype of A.
            A.x : variable created if there is evidence that A is
                  heterozygous.

      'block =' : the same as 'block expand ='.

      In 'peds/ped_ex1', if the variables 1, 2 and 3 were not placed
      in a block, the Markov chain would not be irreducible, and the
      pedigree analysis would be stuck in the initial configuration.
      In 'peds/ped_ex2' there is an example of each block type.
      Without these two blocks, this example would also be stuck in
      its initial configuration.

  o 'Pedigree:' : This must be present in the pedigree file, before
      the pedigree specification can begin.

  o individual : Then, line after line, the data of individuals can be
      specified. There is one line for each individual. Each individual
      is specified as follows :
        1 - the name of the individual (up to 20 characters)
        2 - the name of the father (0 if not in the pedigree)
        3 - the name of the mother (0 if not in the pedigree)
            Currently, either both parents must be specified, or
            neither of them.
        4 - the sex of the individual (u - undefined, m - male,
            f - female). Alternatively, you can use the syntax :
            (0 - undefined, 1 - male, 2 - female).
        5 - allele 1 (number between 1 and nalleles, 0 if undefined)
        6 - allele 2 (number between 1 and nalleles, 0 if undefined)

-N2 - pedigree 2 format.

  This pedigree format should be used if a pedigree analysis with
  incomplete penetrance is wanted. An example of such a pedigree can be
  found in 'peds/ped_ex3'. This format is like -N1, but with some
  extensions and minor changes :

  o nphenotypes : Number of phenotypes can be specified with
      'nphenotypes = #'.

  o phenotype names : The phenotype names can be specified with
      'phenotype names = ( ... )'. See an example of this in
      'peds/ped_ex3'. The length of these names can be up to 20
      characters.

  o penetrance : The penetrance probabilities can be specified with
      'penetrance = ...'. As seen in 'peds/ped_ex3' there must be one
      line for each genotype. First, the genotype is listed, then the
      probabilities that each phenotype is observed given this
      genotype.

  o block : For an individual A, a variable called 'A.p' is now
      created, which represents the phenotype of A.

  o individual : The pedigree specification is much like that of the
      pedigree 1 format. The individual is specified as before, but
      with a phenotype instead of a genotype. 0 specifies an unknown
      phenotype.

-N3 - linkage analysis format.

  This format should be used if a two-point linkage analysis is
  wanted. An example of an input file following this format can be
  found in 'peds/ped_ex4'. The format is similar to the previous ones,
  but most keywords have been extended, and some new ones have been
  introduced to be able to handle two loci :

  o nloci : Number of loci can be specified with 'nloci = #'.
      Currently this can only be set to 2.

  o loci names : The names of the 2 loci can be specified with
      'loci names = ( )'. The length of the names can be up to 20
      characters.

  o theta : This is the recombination fraction used throughout the
      blocking Gibbs sampling. The results will be produced _given_
      this value. It must be between 0 and 0.5.

  o nalleles : The number of alleles is now specified with
      'nalleles = ( )'.

  o palleles# : The prior allele population probabilities are now
      specified with 'palleles = (p1 ... pn)'.

  o use penetrance : This keyword specifies for each locus whether it
      has complete or incomplete penetrance. If incomplete penetrance
      is wanted for some locus, this is specified with a 1. Thus, this
      is specified for both loci with 'use penetrance = ( )'.

  o nphenotypes# : The number of phenotypes at a locus is now specified
      with 'nphenotypes = '.

  o phenotype names# : The phenotype names at a specific locus are now
      specified with 'phenotype names = ( ... )'.

  o penetrance# : The penetrance probabilities at a locus are now
      specified with 'penetrance = ...'. The actual specification of
      the probabilities is similar to -N2.

  o block# : A block is now specified as belonging to a certain locus.
      I.e., a block belonging to the pedigree at a specific locus is
      specified as 'block = ( ... )'.

  o individual : The specification of an individual in the pedigree is
      much like before. First, the names of the individual itself and
      its father and mother are given.
      Then follows the sex of the individual and, for each of the two
      loci, either the two alleles or the phenotype, depending on
      whether complete or incomplete penetrance is specified for the
      locus.

3.b. Log files
--------------
In this section, the log files output by 'block' are described. The log
files reside in 'work//log' if nothing else is mentioned.

1. 'main_log'
   This file contains a log from the compilation of the pedigree to the
   junction-tree representation described in [1]. The file contains
   much information that can be useful, for instance on the cliques
   that are constructed (their size and which variables they contain).
   The file also contains the size of the junction tree.

2. 'generations'
   This file contains information on the number of generations in the
   pedigree, and the generation number of each variable.

3. 'complexity_reduction'
   This file contains output from the algorithm that finds the optimal
   blocks, which is described in [2]. First, the complete junction tree
   is listed, first the cliques and then the links. Then, the selection
   algorithm is started. One variable at a time is selected, and each
   line contains information regarding the domain the variable is
   removed from, the complexity reduction (c.r.) of the variable, how
   much complexity is left over in the domain, and last the size of the
   largest clique (lc), and some information on the largest cliques.
   The complexity represents the amount of space taken up by the
   junction tree.

4. 'barren_nodes'
   This file lists the barren nodes of the pedigree. See option '-f'
   for a definition of barren nodes.

5. 'initial_conf'
   This file resides in the directory 'general'. It contains the
   initial configuration for the first block. It is always saved, and
   can be re-used to make 'block' start faster (with -C).

6. 'exact_log'
   This file only appears when 'block' is able to treat the pedigree in
   an exact manner.
   It is a log from the compilation of the pedigree to a junction tree,
   with the same format as 'main_log'.

7. 'exact.tables'
   This file only appears when 'block' is able to treat the pedigree
   exactly. It is used internally by 'block' to save information about
   the junction tree.

8. 'SEED'
   This file resides in the directory 'general'. It contains the seed
   for the random number generator. It can be managed with the -Z
   option.

9. 'blocks_log'
   This file contains information about the selected blocks. For each
   block, the number of variables that have been removed from it is
   listed, along with the percentage out of the total number of
   variables that have been removed.

3.c. Block files
----------------
In this section, the files output for each block are described. They
all reside in 'work//blocks/block<#>'.

1. 'compile_log'
   This is a log from the compilation of the pedigree representing this
   block to a junction tree. It has the same format as 'main_log'.

2. '..net'
   This file contains the block represented with a HUGIN specification.
   For more information on this, refer to [5] and [6]. In this file the
   names and relations of all variables can be read, as well as their
   prior probability tables.

3. 'B-set'
   This file lists the variables that have been removed from this
   block.

4. 'cut_corrs'
   This file lists some more information on the variables that have
   been removed from this block. It lists the names of the variables
   that are created when breaking the loops for these variables.

5. 'block.bg'
   This is a save-file for the junction tree of the block. If option
   -b0 is used, this file is loaded and used for the block.

6. 'tables.bg'
   This is an internal file used by 'block'.

3.d. Results
------------
In this section, the files containing results from 'block' are
described. These files all reside in 'work//results'.

1. 'results.'
   This file contains the results for each variable in the pedigree.
   It lists the name of the variable, and the resulting marginal
   distribution after iterations.

2. 'short.'
   This file contains the same as 'results.', but in a shorter format.

3. 'conf.'
   This file is constructed if option -O is used. It contains the
   configuration of the net after iterations.

4. 'link,,,'
   This file is constructed when doing linkage analysis. For each
   iteration and each block treated at that iteration, it lists the
   number of recombinations, the number of non-recombinations, and a
   list of estimated recombination probabilities. This file should be
   used as input to the 'theta' program.

4. Hints and tips
-----------------

4.a. Block selection
--------------------
If a very large and complex pedigree or linkage analysis is performed,
'block' may have trouble selecting small enough blocks. Various
parameters can be adjusted to help it do this :

1. The number of blocks can be raised. This is controlled with option
   -n. If 'block' is allowed to construct more blocks, it is also able
   to make the individual blocks smaller.

2. The number of blocks that a variable can be removed from should be
   raised. This is controlled by option -s. Option -s2 should be used.

3. The most optimal method for selecting blocks should be used. This is
   controlled with option -D.

4. If a pedigree analysis is performed, -r1 should be used, as this
   representation uses less memory.

5. More triangulations of the initial pedigree should be attempted.
   This is controlled with option -t. -t5 ensures that each
   triangulation method is attempted 10 times, and the best
   triangulation is used.

6. Force 'block' to use less memory by using the -m option. Using this
   option does not guarantee that 'block' will stay within the
   specified amount of memory, but it will attempt to.

4.b. More than 2 alleles - Reducibility
---------------------------------------
'block' cannot always ensure that pedigree/linkage analyses with more
than 2 alleles are handled correctly, as it does not know how to
construct the blocks such that irreducibility is obtained. This
capability will probably be present in a later version. It is likely,
though, that in many cases with more than 2 alleles, 'block' does yield
the correct results, as it is able to make the blocks large and sample
most of the variables jointly.

To ensure irreducibility, blocks can be enforced. Examples of this can
be seen in 'ped_ex1', 'ped_ex2' and 'ped_ex3'. How to correctly specify
these blocks is a large topic, and will be described in a later paper.
In smaller linkage studies it is often possible to block all the
variables of the first locus in one block, and all the variables of the
second locus in another block. This always ensures irreducibility.

To test whether irreducibility holds, the pedigree/linkage study can be
converted to a 2 allele study, by representing n-1 alleles as 1 allele.
If this study yields similar probabilities on the remaining allele for
all variables, this is a clear indication that irreducibility holds.

4.c. Differences for DOS version
--------------------------------
Due to the limitations of MS-DOS, this version uses different names for
the various files. Here is a list of the names used under DOS, and the
files they correspond to :

Name under DOS      Name everywhere else
----------------------------------------------------
_.net               ..net
data.err            data_errors
data.ld             data_loaded
generati.ons        generations
compl.red           complexity_reduction
initial.con         initial_conf
cut.cor             cut_corrs
link,               link,,,

5. References
-------------
[1] Finn V. Jensen : "Junction Trees and Decomposable Hypergraphs",
    Judex Datasystemer A/S, Aalborg, Denmark. 1988.

[2] Claus S.
    Jensen, Augustine Kong and Uffe Kjaerulff : "Blocking-Gibbs
    Sampling in Very Large Probabilistic Expert Systems", International
    Journal of Human-Computer Studies, 1995. pp. 647-666.

[3] Augustine Kong : "Efficient Methods for Computing Linkage
    Likelihoods of Recessive Diseases in Inbred Pedigrees", Genetic
    Epidemiology, 1991. pp. 81-103.

[4] Uffe Kjaerulff : "Triangulation of Graphs - Algorithms Giving Small
    Total State Space", Technical Report, Department of Computer
    Science, Aalborg University, Denmark, 1990. No. R90-09.

[5] L. P. Fischer : "Reference Manual for the HUGIN Application Program
    Interface", Hugin Expert A/S, 1st Edition, 1990.

[6] S. K. Andersen, K. G. Olesen, F. V. Jensen and F. Jensen : "HUGIN -
    a Shell for Building Bayesian Belief Universes for Expert Systems",
    Proceedings of the Int. Joint Conference on Artificial Intelligence
    - 11, 1989.
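
As an illustration of the 2-allele test described in section 4.b, the
following Python sketch recodes the individual lines of a pedigree 1
format (-N1) file so that one chosen allele is kept and all other
alleles are merged into a single allele. It is not part of 'block'. The
function names are hypothetical, and the sketch assumes that the six
fields of an individual line are separated by whitespace, in the order
given in section 3.a.

```python
def collapse_allele(allele, keep):
    """Map an allele code to a 2-allele coding.

    0 (undefined) stays 0, the kept allele becomes 1,
    and every other allele becomes 2.
    """
    if allele == 0:
        return 0
    return 1 if allele == keep else 2


def collapse_individual(line, keep):
    """Recode one individual line of a pedigree 1 format file.

    Assumed field order (section 3.a): name, father, mother, sex,
    allele 1, allele 2, separated by whitespace (an assumption).
    """
    name, father, mother, sex, a1, a2 = line.split()
    a1 = collapse_allele(int(a1), keep)
    a2 = collapse_allele(int(a2), keep)
    return f"{name} {father} {mother} {sex} {a1} {a2}"


def collapse_palleles(probs, keep):
    """Merge prior allele probabilities: kept allele vs. all the rest."""
    p_keep = probs[keep - 1]
    return [p_keep, sum(probs) - p_keep]
```

The recoded study would then be run with 'nalleles' set to 2 and the
merged 'palleles', and the probabilities of the kept allele compared
with those from the full study, as section 4.b suggests.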