              Documentation for 'block'.
              --------------------------
                    (August 1996)

Contents
--------

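1. Description and Purpose.

2. Options.

3. File formats.
   a. Input files.
   b. Log files.
   c. Block files.
   d. Results.

4. Hints and tips.
   a. Block selection.
   b. More than 2 alleles - Reducibility.
   c. Differences for DOS version.

5. References.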


1. Description and Purpose
--------------------------

The 'block' program can be used for performing pedigree and linkage
analysis.  More specifically, it can be used for :

 - any pedigree analysis involving an arbitrary number of alleles,
   incomplete penetrance and liability classes.  The pedigree may
   contain an arbitrary number of loops.  The number of loops is
   limited only by memory (but may be large).

 - any two-point linkage analysis involving an arbitrary number
   of alleles at each locus.  Convergence is guaranteed only in
   the case where both loci have two alleles.  In cases with more
   alleles, convergence can be obtained by specifying user-defined
   blocks (read more about this later).

To my knowledge, no other programs in the public domain can perform
the two above mentioned tasks.  Currently available programs are
severely limited by the number of loops in the pedigree, and are able
to handle only very few (10-20 ?).  'block' has successfully run
pedigree analysis examples with thousands of loops.

The program basically functions in the following way :

   a. A pedigree is read into memory, and converted to the junction
      tree representation, see [1].  In the pedigree specification
      an initial recombination fraction must be specified, if
      linkage analysis is performed.
   b. A number of blocks are selected, that all can be sampled
      exactly.  Read about the block selection procedure in [2].
      Precompiled blocks may also be read from disk.
   c. A starting configuration is found.
   d. Warm-up is performed.  This is usually 10% of the iterations.
   e. The specified number of iterations of blocking Gibbs are performed.
      If linkage analysis is performed, the numbers of recombinations and
      non-recombinations are counted.  A paper regarding this is
      currently under way.  If linkage analysis is performed, the results
      can be further processed with the 'theta' program, described
      in theta.man.
   f. The results are stored on disk.
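
The warm-up and sampling steps (d and e) can be illustrated with a toy
example.  The following Python sketch is not the junction-tree
implementation used by 'block'; it samples each block by brute-force
enumeration, and the three-variable model, its weights and all
function names are assumptions made for illustration only.

```python
import itertools
import random

def weight(state):
    """Unnormalised probability of a toy 3-variable model (hypothetical)."""
    agree_ab = 3.0 if state["a"] == state["b"] else 1.0
    agree_bc = 3.0 if state["b"] == state["c"] else 1.0
    return agree_ab * agree_bc

def sample_block(state, block, rng):
    """Draw all variables in `block` jointly, given the remaining ones."""
    trials = []
    for values in itertools.product((0, 1), repeat=len(block)):
        trial = dict(state)
        trial.update(zip(block, values))
        trials.append(trial)
    return rng.choices(trials, weights=[weight(t) for t in trials], k=1)[0]

def blocking_gibbs(blocks, iterations, seed=0):
    rng = random.Random(seed)
    state = {"a": 0, "b": 0, "c": 0}       # starting configuration
    warmup = iterations // 10              # 10% warm-up, as in step d
    ones = {name: 0 for name in state}
    for step in range(warmup + iterations):
        for block in blocks:               # each block is sampled once
            state = sample_block(state, block, rng)
        if step >= warmup:                 # warm-up iterations are discarded
            for name in state:
                ones[name] += state[name]
    # Estimated marginal probability that each variable equals 1.
    return {name: ones[name] / iterations for name in ones}

estimates = blocking_gibbs([("a", "b"), ("b", "c")], iterations=20000)
print(estimates)
```

Since the toy model is symmetric in 0 and 1, each estimate should be
close to 0.5.  Every variable appears in at least one block, so every
variable is sampled.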

2. Options
----------

The 'block' program is run from the command line, and can be supplied
a large number of options.  In the following each of these options
will be explained.  Help can also be obtained with the '-h' option.

This is the format of the 'block' program :

block [-hvBEHLQS] [-b#] [-C<conf-file>] [-d<data-file>] [-i#] [-m#]
[-M<substring>] [-n#] [-N#] [-O#] [-r#] [-R#] [-t#] [-w#] [-x#] [-Z#]
netfile

The 'netfile' contains the name of the file describing the pedigree
or linkage analysis problem.
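
When 'block' is run from a script, the argument list can be assembled
from the format shown above.  The following is a minimal Python
sketch; the helper name 'build_block_command' is hypothetical, and it
assumes that option values are appended directly to the option letter
and that the net file comes last, as the format line suggests.

```python
def build_block_command(netfile, input_type=1, iterations=None, quiet=False):
    """Assemble an argument list for 'block' (hypothetical helper)."""
    if input_type not in (1, 2, 3):
        raise ValueError("-N takes 1, 2 or 3")
    args = ["block", "-N%d" % input_type]
    if iterations is not None:
        args.append("-i%d" % iterations)
    if quiet:
        args.append("-Q")
    args.append(netfile)          # the net file is given last
    return args

command = build_block_command("peds/ped_ex1", input_type=1, iterations=1000)
print(" ".join(command))
```

This prints 'block -N1 -i1000 peds/ped_ex1'.  The list form can be
passed unchanged to subprocess.run() if 'block' is installed.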

Option	Description

 -b	This option specifies how to treat the blocks :
	 0 - load precompiled blocks from disk.
	 1 - construct new blocks, but don't save them (default).
	 2 - construct new blocks and save them.

 -B	Use most probable (Best) starting configuration.  By specifying
	this option it is possible to use an alternative method for the
	selection of a starting configuration (which may be very
	difficult).  This method attempts to select the most probable
	starting configuration, as this may be easier in some cases.
	Use this option, if the ordinary method has problems finding
	a starting configuration.

 -C	Specify file to load starting configuration from.  This can be
	used to avoid having to find a new starting configuration each
	run, which may be very time-consuming.

 -D	Specify the method for selecting the blocks.  There are three
	methods to choose from ranging from a slow method providing
	high quality blocks to a fast method providing medium quality
	blocks.  The slow most optimal method is described in [2].
	 0 - slow most optimal method (default).
	 1 - faster less optimal method.
	 2 - fastest least optimal method.

 -E	Attempt to treat the net exactly.  This may be possible for
	smaller pedigrees.  For pedigree analysis this results in
	getting the exact results.  For linkage analysis this results
	in exact simulation (not implemented yet).

 -f	This option controls the forward sampling of barren nodes.
	A node is defined as barren, if there is no evidence on it,
	and it has no offspring, or its offspring are barren as well.
	All barren nodes can be forward sampled instead of being
	included in the blocking Gibbs sampler.  This enables 'block'
	to make the blocks smaller, and thus use less memory.
	The precision of the estimates seems similar to the one
	obtained when using blocking Gibbs on all individuals.
	If '-f' is specified, the forward sampling of barren nodes
	is turned off.

 -h	Show a help page.

 -H	Use memory for backup of tables.  The default is to backup on
	disk.  Extra memory is necessary for storing the initial values
	of the tables, as these values are entered at each initialization
	of the junction trees.  Depending on your memory setup, you
	may be forced to store tables on the disk (which is the default),
	but you may also be able to get 'block' to run faster, by storing
	the tables in main memory.

 -i	Number of iterations.  At each iteration, each block is sampled
	once.

 -L	Perform linkage analysis.  This option may be specified only when
	option -N3 is also used.  If option -N3 is used, and -L is not
	specified, a simple inference is performed, and the marginal
	probabilities of all variables (given the starting recombination
	fraction) are saved.

 -m	Maximal amount of memory available for blocks, specified in units
	of 8 bytes (which is what one floating point number usually
	takes up).  Default is 100,000.

 -M	Specify a list of strings in one of two ways :
	 1 : -M#string1,string2,...,stringn#
	 2 : -Mstring1,string2,...,stringn
	The first method causes 'block' to monitor those variables
	with name identical to one of the strings.
	The second causes 'block' to monitor variables with names
	that contain one of the strings.
	Thus, you can obtain either exact or substring match.

 -n	Number of blocks to be constructed.  The default is 5.
	You will notice that more blocks are often constructed than
	specified.  This is because 'block' in many cases must
	construct extra blocks to ensure irreducibility.
	If a very large and complex problem is being handled, it will
	most likely be necessary to specify a large number of blocks.
	Try the default of 5 blocks first, then 10, 15, etc., until
	blocking Gibbs is able to handle the problem satisfactorily.

 -N	Type of input file given to 'block' :
	 1 - pedigree 1 format.  Pedigree analysis with complete
	     penetrance.
	 2 - pedigree 2 format.  Pedigree analysis with incomplete
	     penetrance.
	 3 - linkage analysis format.
	The input file formats are described further down.

 -O	Specify number of iterations after which 'block' is forced to
	output the configuration of the net.  The configuration will
	be stored in the file :
	  'work/<pedigree-name>/results/conf.<#iterations>'.

 -Q	Run 'block' quietly with very little output.

 -r	The type of representation to use for pedigrees.  This option
	is valid only for pedigree analysis (-N1 and -N2).  It has two
	values :
	 1 - variables represent genotypes (default).
	 2 - variables represent alleles.
	A description of representation 2 and its virtues can be found
	in [3].  Only representation 2 can be used when running linkage
	analysis.  In short, representation 1 uses less memory than
	representation 2, but representation 2 provides more information
	than representation 1.  Specifically, representation 2 provides
	all the information that representation 1 provides _and_ in
	addition information on the level of the allele.  This information
	is needed when running linkage analysis, thus representation 1
	cannot be used here.

 -R	Force 'block' to output intermediary results at the specified
	iterations.  If no value is passed to -R, the default is to
	output intermediary results at 100, 200, 500, 1000, 2000,
	5000, ... iterations.  A different list of numbers can be
	specified with -R<#1>,<#2>,<#3>,<#4>,...,<#n>.
	The intermediary results are printed in the file :
	  'work/<pedigree-name>/results/results.<#iterations>'.

 -s	Criterion used when selecting the blocks.  This criterion specifies
	the maximum number of blocks that a variable may be removed from.
	Read more about this in [2].  A variable cannot be removed from
	all blocks, as it would never be sampled then.  If a very large
	and complex pedigree is being handled, it may be necessary to
	remove certain variables from most of the blocks for blocking
	Gibbs to be able to perform exact sampling on the blocks.
	In this case, option -s2 should most likely be used, as this
	allows variables to be removed from all blocks except one.
	 1 - #blocks/2+1 (default)
	 2 - #blocks-1
	 3 - #blocks/4+1
	 4 - 2*#blocks/3+1

 -t	Triangulation method to use on the pedigree.  Read more about
	triangulation methods in [4].  The default method is usually
	adequate, but in very hard cases, -t5 should be tried.
	 0 - default
	 1 - minimum fill-in edges
	 2 - minimum clique size
	 3 - minimum clique weight (current default)
	 4 - minimum fill-in weight
	 5 - try each of the above 10 times and select the best

 -v	Verbose mode on.  Run with this option to get a lot of extra
	information.

 -w	Number of iterations in the warm-up phase.  The default is to
	do 10% of the specified number of iterations as warm-up.
	If -i100 is specified, first 10 iterations of warm-up are
	performed, and then the 100 main iterations.

 -x	Number of extra simulations to do when each block is treated.
	This option would be important if simulations were fast compared
	with stepping from one iteration to the next.  This does not
	seem to be the case here, though.  This option is believed to
	be rarely useful.

 -Z	Seed option.  This option allows you to use and modify the
	seed used for random operations in 'block'.  The seed is stored
	in the file 'work/<pedigree-name>/general/SEED'.
	 0 - use old seed in 'SEED'
	 1 - find new seed and save it in 'SEED'
	 2 - use new seed but do not save it (default)
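
The four criteria listed under option -s can be tabulated for a given
number of blocks.  The following Python sketch assumes that the
divisions round down and that the result is capped at #blocks-1 (since
a variable cannot be removed from all blocks); neither detail is
stated above, and the function name is hypothetical.

```python
def max_removals(criterion, nblocks):
    """Maximum number of blocks a variable may be removed from (-s)."""
    formulas = {
        1: nblocks // 2 + 1,        # -s1 (default)
        2: nblocks - 1,             # -s2
        3: nblocks // 4 + 1,        # -s3
        4: 2 * nblocks // 3 + 1,    # -s4
    }
    # Assumption: a variable must remain in at least one block.
    return min(formulas[criterion], nblocks - 1)

for criterion in (1, 2, 3, 4):
    print(criterion, max_removals(criterion, 10))
```

Under these assumptions, 10 blocks give limits of 6, 9, 3 and 7 for
-s1 to -s4.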


3. File formats
---------------

In this section the formats of the files used by 'block' will be
described.  First, the input files describing the pedigree or linkage
analysis problem, then the log files, and finally the files containing
the results.

3.a. Input files
----------------

There are three types of input files, declared with either the -N1,
-N2 or -N3 option.

 -N1 - pedigree 1 format.  This pedigree format should be used if a
       pedigree analysis with complete penetrance is wanted.  Examples
       of such pedigrees can be found in 'peds/ped_ex1' and 'peds/ped_ex2'.
       This format is very simple :
	 o # : Comments can be specified by starting the line with '#'.
	 o nalleles : Number of alleles can be specified with 'nalleles ='.
	 o palleles : Prior allele population probabilities can be specified
	   with 'palleles = (p1 ... pn)'.  If it is left out, uniform
	   probabilities will be assigned.
	 o block : A block can be specified with either (see 'peds/ped_ex2'
	   for an example) :
	     'block expand =' : a list of individual names must be given.
		 The variables that are created for these individuals will
		 all be contained in at least one block (and will thus be
		 sampled simultaneously).
	     'block exact =' : a list of variable names must be given.
		 These variables will all be contained in at least one
		 block.  The variables that are created for an individual
		 vary given the representation (-r1 or -r2).  For
		 representation 1, one variable is created for each
		 individual (and called the same).  For representation
		 2, the following variables may be created for an
		 individual A :
		  A.f : one of A's alleles.  The one inherited from A's
		    father
		  A.m : the second of A's alleles.  The one inherited from
		    A's mother
		  A.g : the genotype of A.
		  A.x : variable created if there is evidence that
		    A is heterozygous.
             'block =' : the same as 'block expand ='.  In 'peds/ped_ex1',
		 if the variables 1, 2 and 3 were not blocked together, the
		 Markov chain would not be irreducible, and the pedigree
		 analysis would be stuck in the initial configuration.
		 In 'peds/ped_ex2' there is an example of each block
		 type.  Without these two blocks, this example would
		 also be stuck in its initial configuration.
         o 'Pedigree:' : This must be present in the pedigree file, before
	   the pedigree specification can begin.
	 o individual : Then, line after line, the data of individuals can
	   be specified.  There's one line for each individual.
	   Each individual is specified as follows :
	     1 - the name of the individual (up to 20 characters)
	     2 - the name of the father (0 if not in the pedigree)
	     3 - the name of the mother (0 if not in the pedigree)
		 Currently, either both parents must be specified,
		 or none of them.
             4 - the sex of the individual (u - undefined, m - male,
		 f - female).  Alternatively, you can use the syntax :
		 (0 - undefined, 1 - male, 2 - female).
	     5 - allele 1 (number between 1 and nalleles, 0 if undefined)
	     6 - allele 2 (number between 1 and nalleles, 0 if undefined)

 -N2 - pedigree 2 format.  This pedigree format should be used if a
       pedigree analysis with incomplete penetrance is wanted.  An
       example of such a pedigree can be found in 'peds/ped_ex3'.
       This format is like -N1, but with some extensions and minor
       changes :
	 o nphenotypes : Number of phenotypes can be specified with
	   'nphenotypes = #'.
	 o phenotype names : The phenotype names can be specified
	   with 'phenotype names = (<name 1> ... <name n>)'.  See
	   an example of this in 'peds/ped_ex3'.  The length of these
	   names can be up to 20 characters.
	 o penetrance : the penetrance probabilities can be specified
	   with 'penetrance = ...'.  As seen in 'peds/ped_ex3' there
	   must be one line for each genotype.  First, the genotype
	   is listed, then the probabilities that each phenotype is
	   observed given this genotype.
         o block : for an individual A, there is now created a variable
	   called 'A.p', which represents the phenotype of A.
	 o individual : The pedigree specification is much like with
	   pedigree 1 format.  Here, the individual is specified like
	   before, but with a phenotype instead of a genotype.
	   0 specifies an unknown phenotype.

 -N3 - linkage analysis format.  This format should be used if a
       two-point linkage analysis is wanted.  An example of an input
       file following this format can be found in 'peds/ped_ex4'.
       The format is similar to the previous, but most keywords have
       been extended, and some new have been introduced to be able
       to handle two loci :
	 o nloci : Number of loci can be specified with 'nloci = #'.
	   Currently this can only be set to 2.
         o loci names : The names of the 2 loci can be specified with
	   'loci names = (<name 1> <name 2>)'.  The length of the names
	   can be up to 20 characters.
	 o theta : This is the recombination fraction used throughout
	   the blocking Gibbs sampling.  The results will be
	   produced _given_ this value.  It must be between 0 and 0.5.
	 o nalleles : The number of alleles is now specified with
	   'nalleles = (<nalleles at locus 1> <nalleles at locus 2>)'.
	 o palleles# : The prior allele population probabilities are
	   now specified with 'palleles<locus #> = (p1 ... pn)'.
	 o use penetrance : This keyword specifies for each locus
	   whether it has complete or incomplete penetrance.  If
	   incomplete penetrance is wanted for some locus, this is
	   specified with a 1.  Thus, this is specified for both loci
	   with 'use penetrance = (<pen1> <pen2>)'.
	 o nphenotypes# : The number of phenotypes at a locus is now
	   specified with 'nphenotypes<locus #> = <no. of phenotypes>'.
	 o phenotype names# : The phenotype names at a specific locus
	   are now specified with 'phenotype names<locus #> =
	   (<name 1> ... <name n>)'.
	 o penetrance# : The penetrance probabilities at a locus are
	   now specified with 'penetrance<locus #> = ...'.  The actual
	   specification of the probabilities is similar to -N2.
	 o block# : A block is now specified as belonging to a certain
	   locus.  I.e., a block belonging to the pedigree at a specific
	   locus is specified as 'block<locus #> = (<name 1> ... <name n>)'.
	 o individual : The specification of an individual in the pedigree
	   is much like before.  First, the names of the individual itself
	   and its father and mother are given.  Then, the sex of the
	   individual, and then following, for each of the two loci
	   either the two alleles or the phenotype depending on whether
	   complete or incomplete penetrance is specified for the locus.
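
The individual lines of the pedigree 1 format (-N1) described above
can be checked before 'block' is run.  The following Python sketch
assumes that the six fields are separated by whitespace; the function
and type names are hypothetical, and the example line is made up
rather than taken from 'peds/ped_ex1'.

```python
from collections import namedtuple

Individual = namedtuple("Individual", "name father mother sex allele1 allele2")

# Both sex syntaxes (u/m/f and 0/1/2) are mapped to the letter form.
SEX_CODES = {"u": "u", "m": "m", "f": "f", "0": "u", "1": "m", "2": "f"}

def parse_individual(line, nalleles):
    """Parse one individual line of the pedigree 1 format (sketch)."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d" % len(fields))
    name, father, mother, sex, a1, a2 = fields
    if len(name) > 20:
        raise ValueError("name longer than 20 characters")
    # Either both parents are specified, or none of them.
    if (father == "0") != (mother == "0"):
        raise ValueError("give both parents or none")
    if sex not in SEX_CODES:
        raise ValueError("unknown sex code: " + sex)
    alleles = (int(a1), int(a2))
    for allele in alleles:
        if not 0 <= allele <= nalleles:   # 0 means undefined
            raise ValueError("allele out of range")
    return Individual(name, father, mother, SEX_CODES[sex], *alleles)

print(parse_individual("3 1 2 m 1 2", nalleles=2))
```

For the -N2 and -N3 formats the trailing fields differ (phenotypes,
or data for two loci), so this check applies to -N1 files only.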


3.b. Log files
--------------

In this section, the log files output by 'block' will be described.
The log files reside in 'work/<pedname>/log' if nothing else is
mentioned.

  1. 'main_log'
     This file contains a log from the compilation of the pedigree
     to the junction-tree representation described in [1].  The file
     contains much information that can be useful, for instance
     on the cliques that are constructed (the size of them and which
     variables they contain).  The file also contains the size of
     the junction tree.

  2. 'generations'
     This file contains information on the number of generations in
     the pedigree, and the generation number of each variable.

  3. 'complexity_reduction'
     This file contains output from the algorithm that finds the
     optimal blocks, that is described in [2].  First, the complete
     junction tree is listed, first the cliques and then the links.
     Then, the selection algorithm is started.  One variable at a
     time is selected, and each line contains information regarding
     the domain the variable is removed from, the complexity reduction
     (c.r.) of the variable, how much complexity is left over in the
     domain, and last the size of the largest clique (lc), and some
     information on the largest cliques.  The complexity represents
     the amount of space taken up by the junction tree.

  4. 'barren_nodes'
     This file lists the barren nodes of the pedigree.  Read option '-f'
     for a definition of barren nodes.

  5. 'initial_conf'
     This file resides in the directory 'general'.  It contains the
     initial configuration for the first block.  It is always saved,
     and can be re-used to make 'block' start faster (with -C).

  6. 'exact_log'
     This file only appears, when 'block' is able to treat the pedigree
     in an exact manner.  It is a log from the compilation of the
     pedigree to a junction tree, with the same format as 'main_log'.

  7. 'exact.tables'
     This file only appears, when 'block' is able to treat the pedigree
     exactly.  It is used internally by 'block' to save information
     about the junction-tree.

  8. 'SEED'
     This file resides in the directory 'general'.  It contains the seed
     for the random number generator.  It can be managed with the -Z
     option.

  9. 'blocks_log'
     This file contains information about the selected blocks.  For each
     block, the number of variables that have been removed from it is
     listed, along with the percentage out of the total number of variables
     that have been removed.
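
The definition of barren nodes given under option '-f' is recursive,
and the content of 'barren_nodes' can be cross-checked with a few
lines of code.  The following Python sketch works on individuals
rather than on the variables that 'block' creates for them; the
function name and the small pedigree are hypothetical.

```python
def find_barren(offspring, evidence):
    """Return the set of barren nodes.

    offspring: dict mapping each node to a list of its offspring
    evidence:  set of nodes on which there is evidence
    """
    barren = {}

    def is_barren(node):
        # Barren: no evidence, and no offspring or only barren offspring.
        if node not in barren:
            barren[node] = node not in evidence and all(
                is_barren(child) for child in offspring.get(node, []))
        return barren[node]

    return {node for node in offspring if is_barren(node)}

# Hypothetical pedigree: 1 and 2 are parents of 3 and 4; 4 has child 5.
offspring = {"1": ["3", "4"], "2": ["3", "4"], "3": [], "4": ["5"], "5": []}
print(sorted(find_barren(offspring, evidence={"3"})))
```

With evidence on individual 3 only, individuals 4 and 5 are barren,
while 1 and 2 are not, because their offspring 3 carries evidence.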

3.c. Block files
----------------

In this section, the files output for each block are described.
They all reside in 'work/<pedname>/blocks/block<#>'.

  1. 'compile_log'
     This is a log from the compilation of the pedigree representing
     this block to a junction tree.  It has the same format as 'main_log'.

  2. '<pedname>.<block#>.net'
     This file contains the block represented with a HUGIN specification.
     For more information on this, refer to [5] and [6].  In this file
     the names and relations of all variables can be read, as well
     as their prior probability tables.

  3. 'B-set'
     This file lists the variables that have been removed from this block.

  4. 'cut_corrs'
     This file lists some more information on the variables that have been
     removed from this block.  It lists the names of the variables that
     are created when breaking the loops for these variables.

  5. 'block.bg'
     This is a save-file for the junction tree of the block.  If option
     -b0 is used, this file is loaded and used for the block.

  6. 'tables.bg'
     This is an internal file used by 'block'.

3.d. Results
------------

In this section, the files containing results from 'block' are
described.  These files all reside in 'work/<pedname>/results'.

  1. 'results.<iteration>'
     This file contains the results for each variable in the pedigree.
     It lists the name of the variable, and the resulting marginal
     distribution after <iteration> iterations.

  2. 'short.<iteration>'
     This file contains the same as 'results.<iteration>', but in a
     shorter format.

  3. 'conf.<iteration>'
     This file is constructed, if option -O is used.  It contains the
     configuration of the net after <iteration> iterations.

  4. 'link,<pedname>,<theta0>,<iteration>'
     This file is constructed when doing linkage analysis.  At each
     iteration and each block treated at that iteration, it lists
     the number of recombinations, the number of non-recombinations,
     and a list of estimated recombination probabilities.  This file
     should be used as input to the 'theta' program.
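
When option -R is used, the intermediary result files can be located
from the pedigree name and the iteration count.  The following Python
sketch covers only the default iteration counts that are explicitly
listed under -R (the list continues beyond 5000); the function name is
hypothetical, and the pedigree name is passed in because this manual
does not state how it is derived.

```python
# Default iteration counts explicitly named under option -R.
LISTED_DEFAULTS = (100, 200, 500, 1000, 2000, 5000)

def intermediary_files(pedname, iterations):
    """Paths of 'results.<#iterations>' files written up to `iterations`."""
    return ["work/%s/results/results.%d" % (pedname, count)
            for count in LISTED_DEFAULTS if count <= iterations]

for path in intermediary_files("ped_ex1", 1000):
    print(path)
```

For 1000 iterations this lists the files for 100, 200, 500 and 1000
iterations.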


4. Hints and tips
-----------------

4.a. Block selection
--------------------

If a very large and complex pedigree or linkage analysis is performed,
'block' may have trouble selecting small enough blocks.  Various
parameters can be adjusted to help it do this :

 1. The number of blocks can be raised.  This is controlled with
    option -n.  If 'block' is allowed to construct more blocks,
    it is also able to make the individual blocks smaller.

 2. The number of blocks that a variable can be removed from should
    be raised.  This is controlled by option -s.  Option -s2 should
    be used.

 3. The most optimal method for selecting blocks should be used.
    This is controlled with option -D.

 4. If a pedigree analysis is performed, -r1 should be used, as
    this representation uses less memory.

 5. More triangulations of the initial pedigree should be attempted.
    This is controlled with option -t.  -t5 ensures that each
    triangulation method is attempted 10 times, and the best
    triangulation is used.

 6. Force 'block' to use less memory by using the -m option.  Using
    this option doesn't guarantee, that 'block' will use the specified
    amount of memory, but it will attempt to.
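
Since option -m is given in units of 8 bytes, a memory budget in bytes
must be converted before it is passed to 'block'.  The following
Python sketch does the arithmetic; the function names and the 4
megabyte budget are only examples.

```python
UNIT_BYTES = 8   # one -m unit, as stated under option -m

def units_to_bytes(units):
    return units * UNIT_BYTES

def bytes_to_units(nbytes):
    return nbytes // UNIT_BYTES

print(units_to_bytes(100000))                      # default -m, in bytes
print("-m%d" % bytes_to_units(4 * 1024 * 1024))    # option for 4 megabytes
```

The default of 100,000 units thus corresponds to 800,000 bytes.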

4.b. More than 2 alleles - Reducibility
---------------------------------------

'block' cannot always ensure that pedigree/linkage analyses with more
than 2 alleles are handled correctly, as it does not know how to construct
the blocks such that irreducibility is obtained.  This capability will
probably be present in a later version.  It is likely, though, that in
many cases with more than 2 alleles, 'block' does yield the correct
results, as it is able to make the blocks large and sample most of the
variables jointly.

To ensure irreducibility, blocks can be enforced.  This is done in
the examples 'ped_ex1', 'ped_ex2' and 'ped_ex3'.  How to
correctly specify these blocks is a large study, and will be described
in a later paper.  In smaller linkage studies it is often possible to
block all the variables of the first locus in one block, and all the
variables of the second locus in another block.  This always ensures
irreducibility.

To test whether irreducibility holds, the pedigree/linkage study can
be converted to a 2 allele study, by representing n-1 alleles as 1 allele.
If this study yields similar probabilities on the remaining allele for all
variables, this is a clear indication that irreducibility holds.
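
The conversion to a 2 allele study amounts to recoding the observed
alleles and merging the population probabilities.  The following
Python sketch shows one way to do this; the function names are
hypothetical, and the choice to number the kept allele 1 and the
merged allele 2 is an assumption.

```python
def collapse_allele(allele, keep):
    """Recode an allele for a 2 allele study: `keep` -> 1, others -> 2."""
    if allele == 0:
        return 0                      # 0 still means undefined
    return 1 if allele == keep else 2

def collapse_frequencies(palleles, keep):
    """Merge the population probabilities of all alleles except `keep`."""
    p_keep = palleles[keep - 1]       # alleles are numbered from 1
    return (p_keep, sum(palleles) - p_keep)

print([collapse_allele(a, keep=3) for a in (0, 1, 2, 3, 4)])
print(collapse_frequencies((0.1, 0.2, 0.3, 0.4), keep=3))
```

The recoded study can then be run with 'nalleles = 2' and the merged
probabilities, once for each allele that is to be checked.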


4.c. Differences for DOS version
--------------------------------

Due to the limitations of MS-DOS, the DOS version uses different names
for the various files.  Here is a list of the names used under DOS,
and the files they correspond to :


   Name under DOS                Name everywhere else
  ----------------------------------------------------
  <pedname>_<block#>.net	<pedname>.<block#>.net
  data.err			data_errors
  data.ld			data_loaded
  generati.ons			generations
  compl.red			complexity_reduction
  initial.con			initial_conf
  cut.cor			cut_corrs
  link,<iteration>		link,<pedname>,<theta0>,<iteration>


5. References
-------------

[1] Finn V. Jensen : "Junction Trees and Decomposable Hypergraphs",
  Judex Datasystemer A/S, Aalborg, Denmark.  1988.
[2] Claus S. Jensen, Augustine Kong and Uffe Kjaerulff :
  "Blocking-Gibbs Sampling in Very Large Probabilistic Expert Systems",
  International Journal of Human-Computer Studies, 1995. pp. 647-666.
[3] Augustine Kong : "Efficient Methods for Computing Linkage Likelihoods
  of Recessive Diseases in Inbred Pedigrees", Genetic Epidemiology,
  1991. pp. 81-103.
[4] Uffe Kjaerulff : "Triangulation of Graphs - Algorithms Giving
  Small Total State Space", Technical Report, Department of Computer Science,
  Aalborg University, Denmark, 1990.  No. R90-09.
[5] L. P. Fischer : "Reference Manual for the HUGIN Application Program
  Interface", Hugin Expert A/S, 1st Edition, 1990.
[6] S. K. Andersen, K. G. Olesen, F. V. Jensen and F. Jensen :
  "HUGIN - a Shell for Building Bayesian Belief Universes for Expert
  Systems", Proceedings of the Int. Joint Conference on Artificial
  Intelligence - 11, 1989.