HMMER Manual                                          hmmbuild(1)


NAME
     hmmbuild - build a profile HMM from an alignment


SYNOPSIS
     hmmbuild [_o_p_t_i_o_n_s] _h_m_m_f_i_l_e _a_l_i_g_n_f_i_l_e


DESCRIPTION
     hmmbuild reads a multiple sequence alignment file alignfile,
     builds a new profile HMM, and saves the HMM in hmmfile.


     _a_l_i_g_n_f_i_l_e may be in ClustalW, GCG MSF, SELEX, Stockholm,  or
     aligned  FASTA alignment format. The format is automatically
     detected.


     By default, the model is configured to find one or more
     nonoverlapping alignments to the complete model: multiple
     alignments that are global with respect to the model and
     local with respect to the sequence. This is analogous to the
     behavior of the hmmls program of HMMER 1. To configure the
     model for multiple alignments that are local with respect to
     both the model and the sequence, a la the old program hmmfs,
     use the -f (fragment) option. More rarely, you may want to
     configure the model for a single global alignment (global
     with respect to both model and sequence), using the -g
     option; or for a single local/local alignment (a la standard
     Smith/Waterman, or the old hmmsw program), using the -s
     option.
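
     As a rough illustration of how the four configurations above
     map onto command lines, the following Python sketch assembles
     an argument list without running anything. The helper name
     hmmbuild_argv, the mode labels ("ls", "fs", "global", "sw"),
     and the file names are hypothetical and exist only for this
     example; only the -f, -g, -s, and -n flags come from this
     page.

```python
def hmmbuild_argv(hmmfile, alignfile, mode="ls", name=None):
    """Return an argv list for hmmbuild; nothing is executed."""
    # Hypothetical mode labels mapped to the flags documented here.
    # "ls" stands for the default configuration, which needs no flag.
    mode_flags = {"ls": [], "fs": ["-f"], "global": ["-g"], "sw": ["-s"]}
    if mode not in mode_flags:
        raise ValueError("unknown mode: %r" % (mode,))
    argv = ["hmmbuild"] + mode_flags[mode]
    if name is not None:
        # -n takes a single "word": no whitespace allowed.
        if not name or any(ch.isspace() for ch in name):
            raise ValueError("HMM name must be one non-whitespace word")
        argv += ["-n", name]
    return argv + [hmmfile, alignfile]


# File names below are placeholders, not files shipped with HMMER.
print(" ".join(hmmbuild_argv("globin.hmm", "globins.sto", mode="fs")))
```

     The resulting list could be handed to a process launcher, but
     the sketch deliberately stops at building the command.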


OPTIONS
     -f   Configure the model for finding  multiple  domains  per
          sequence,  where  each  domain can be a local (fragmen-
          tary) alignment. This is analogous  to  the  old  hmmfs
          program of HMMER 1.


     -g   Configure the model for finding a single global  align-
          ment  to  a  target sequence, analogous to the old hmms
          program of HMMER 1.


     -h   Print brief help; includes version number  and  summary
          of all options, including expert options.


     -n <_s>
          Name this HMM <_s>.  <_s>  can  be  any  string  of  non-
          whitespace  characters  (e.g. one "word").  There is no


HMMER @RELEASE@    Last change: @RELEASEDATE@                   1


HMMER Manual                                          hmmbuild(1)


          length limit (at least not one imposed by  HMMER;  your
          shell will complain about command line lengths first).


     -o <_f>
          Re-save the starting alignment  to  <_f>,  in  Stockholm
          format.   The  columns  which  were  assigned  to match
          states will be marked with x's in  an  #=RF  annotation
          line.  If  either  the  --hand  or  --fast construction
          options  were  chosen,  the  alignment  may  have  been
          slightly  altered  to be compatible with Plan 7 transi-
          tions, so saving the final alignment and  comparing  to
          the  starting  alignment can let you view these altera-
          tions.  See the User's Guide for  more  information  on
          this arcane side effect.


     -s   Configure the model for finding a single  local  align-
          ment  per  target  sequence.  This  is analogous to the
          standard Smith/Waterman algorithm or the hmmsw  program
          of HMMER 1.


     -A   Append this model to an existing hmmfile rather than
          creating hmmfile. Useful for building HMM libraries
          (like Pfam).


     -F   Force overwriting of an existing hmmfile. Otherwise
          HMMER will refuse to clobber your existing HMM files,
          for safety's sake.


EXPERT OPTIONS
     --amino
          Force the sequence alignment to be interpreted as amino
          acid sequences. Normally HMMER autodetects whether the
          alignment is protein or DNA, but sometimes alignments
          are so small that autodetection is ambiguous. See
          --nucleic.


     --archpri <_x>
          Set the "architecture prior" used by  MAP  architecture
          construction to <_x>, where <_x> is a probability between
          0 and 1. This parameter governs a geometric prior  dis-
          tribution  over model lengths. As <_x> increases, longer
          models are favored a  priori.   As  <_x>  decreases,  it
          takes  more  residue conservation in a column to make a
          column a "consensus" match column in the  model  archi-
          tecture.   The 0.85 default has been chosen empirically
          as a reasonable setting.


     --binary
          Write the HMM to hmmfile in HMMER binary format instead
          of readable ASCII text.


     --cfile <_f>
          Save the observed emission and transition counts to <_f>
          after  the architecture has been determined (e.g. after
          residues/gaps have been assigned to match, delete,  and
          insert  states).  This option is used in HMMER develop-
          ment for generating data files useful for training  new
          Dirichlet  priors.  The  format of count files is docu-
          mented in the User's Guide.


     --fast
          Quickly and heuristically determine the architecture of
          the model by assigning all columns with more than a
          certain fraction of gap characters to insert states. By
          default this fraction is 0.5, and it can be changed
          using the --gapmax option. The default construction
          algorithm is a maximum a posteriori (MAP) algorithm,
          which is slower.


     --gapmax <_x>
          Controls the --_f_a_s_t model construction  algorithm,  but
          if  --_f_a_s_t  is  not  being  used,  has no effect.  If a
          column has more than a fraction <_x> of gap  symbols  in
          it,  it  gets  assigned  to an insert column.  <_x> is a
          frequency from 0 to 1, and by default is  set  to  0.5.
          Higher  values of <_x> mean more columns get assigned to
          consensus, and models get longer; smaller values of <_x>
          mean  fewer  columns  get  assigned  to  consensus, and
          models get smaller.  <_x>
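
          The gap-fraction rule can be sketched in a few lines of
          Python. This is only an illustration of the rule as
          worded here, not HMMER's code: it counts gaps per column
          without sequence weights, and the set of gap characters,
          the function name, and the toy alignment are all
          assumptions made for the example.

```python
GAP_CHARS = set("-._~")  # assumed gap symbols, for illustration only


def fast_architecture(alignment, gapmax=0.5):
    """Mark each column 'M' (consensus/match) or 'I' (insert)."""
    if not alignment:
        return ""
    ncols = len(alignment[0])
    if any(len(row) != ncols for row in alignment):
        raise ValueError("all aligned sequences must have equal length")
    marks = []
    for col in range(ncols):
        gaps = sum(1 for row in alignment if row[col] in GAP_CHARS)
        # Strictly more than gapmax gaps makes the column an insert.
        marks.append("I" if gaps / len(alignment) > gapmax else "M")
    return "".join(marks)


toy = ["AC-GT",
       "A--GT",
       "AC-G-",
       "A-TGT"]
print(fast_architecture(toy))              # MMIMM
print(fast_architecture(toy, gapmax=0.2))  # MIIMI
```

          Lowering the threshold from 0.5 to 0.2 turns two more
          columns into inserts, so the toy model gets shorter.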


     --hand
          Specify the architecture of  the  model  by  hand:  the
          alignment  file  must  be in SELEX or Stockholm format,
          and the reference annotation line (#=RF in SELEX,  #=GC
          RF  in  Stockholm) is used to specify the architecture.
          Any column marked with a non-gap  symbol  (such  as  an
          'x',  for  instance) is assigned as a consensus (match)
          column in the model.
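
          A minimal Python sketch of reading such a reference
          line follows. It assumes the annotation has already
          been extracted as one string per alignment; the gap
          character set (including the space) and the function
          name are assumptions, not taken from HMMER.

```python
RF_GAP_CHARS = set("-._~ ")  # assumed gap symbols in a reference line


def match_columns_from_rf(rf_line):
    """Return 0-based indices of columns marked as consensus."""
    return [i for i, ch in enumerate(rf_line) if ch not in RF_GAP_CHARS]


rf = "xx..xxx.x"
print(match_columns_from_rf(rf))  # [0, 1, 4, 5, 6, 8]
```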


     --idlevel <_x>
          Controls both the determination of  effective  sequence
          number  and  the  behavior  of  the --_w_b_l_o_s_u_m weighting
          option. The sequence alignment is clustered by  percent
          identity,  and  the  number  of  clusters  at  a cutoff


HMMER @RELEASE@    Last change: @RELEASEDATE@                   3


HMMER Manual                                          hmmbuild(1)


          threshold of <_x> is used  to  determine  the  effective
          sequence  number.  Higher values of <_x> give more clus-
          ters  and  higher  effective  sequence  numbers;  lower
          values  of  <_x> give fewer clusters and lower effective
          sequence numbers.  <_x> is a fraction from 0 to  1,  and
          by  default  is set to 0.62 (corresponding to the clus-
          tering level used in constructing the BLOSUM62  substi-
          tution matrix).
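
          The effect of the threshold can be illustrated with a
          small Python sketch. It assumes single-linkage
          clustering and defines identity over columns where both
          sequences have a residue; both choices, like the
          function names and the toy alignment, are assumptions
          for illustration and may differ from what HMMER does
          internally.

```python
GAP_CHARS = set("-._~")  # assumed gap symbols


def pairwise_identity(a, b):
    """Identical residues / columns where both rows are ungapped."""
    both = [(x, y) for x, y in zip(a, b)
            if x not in GAP_CHARS and y not in GAP_CHARS]
    if not both:
        return 0.0
    return sum(1 for x, y in both if x.upper() == y.upper()) / len(both)


def cluster_by_identity(alignment, idlevel=0.62):
    """Single-linkage clustering; returns one cluster id per row."""
    parent = list(range(len(alignment)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(alignment)):
        for j in range(i + 1, len(alignment)):
            if pairwise_identity(alignment[i], alignment[j]) >= idlevel:
                parent[find(i)] = find(j)
    # Renumber cluster roots 0..k-1 in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(alignment))]


toy = ["ACDEFGHIKL",
       "ACDEFGHIKV",   # 90% identical to the first row
       "MNPQRSTVWY"]   # shares no residues with the others
print(len(set(cluster_by_identity(toy, idlevel=0.62))))  # 2
print(len(set(cluster_by_identity(toy, idlevel=0.95))))  # 3
```

          Raising the cutoff from 0.62 to 0.95 splits the two
          similar rows apart, giving more clusters and therefore
          a higher effective sequence number.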


     --informat <_s>
          Assert that the input _s_e_q_f_i_l_e is in format <_s>; do  not
          run  Babelfish  format  autodection. This increases the
          reliability  of  the  program  somewhat,  because   the
          Babelfish  can  make mistakes; particularly recommended
          for unattended, high-throughput runs  of  HMMER.  Valid
          format  strings include FASTA, GENBANK, EMBL, GCG, PIR,
          STOCKHOLM, SELEX, MSF, CLUSTAL,  and  PHYLIP.  See  the
          User's Guide for a complete list.


     --noeff
          Turn off the effective sequence number calculation, and
          use  the  true  number  of sequences instead. This will
          usually reduce the sensitivity of the final  model  (so
          don't do it without good reason!)


     --nucleic
          Force the alignment to be interpreted as  nucleic  acid
          sequence, either RNA or DNA. Normally HMMER autodetects
          whether the alignment is protein or DNA, but  sometimes
          alignments  are  so small that autodetection is ambigu-
          ous. See --amino.


     --null <_f>
          Read a null model from <_f>. The default for protein  is
          to use average amino acid frequencies from Swissprot 34
          and p1 = 350/351; for nucleic acid, the default  is  to
          use 0.25 for each base and p1 = 1000/1001. For documen-
          tation of the format of the null model file and further
          explanation  of  how  the  null  model is used, see the
          User's Guide.


     --pam <_f>
          Apply a heuristic  PAM-  (substitution  matrix-)  based
          prior  on  match  emission probabilities instead of the
          default mixture Dirichlet. The substitution  matrix  is
          read from <_f>. See --pamwgt.


HMMER @RELEASE@    Last change: @RELEASEDATE@                   4


HMMER Manual                                          hmmbuild(1)


          The default Dirichlet state transition prior and insert
          emission  prior  are unaffected. Therefore in principle
          you could combine --prior with  --pam  but  this  isn't
          recommended,  as  it hasn't been tested. ( --pam itself
          hasn't been tested much!)


     --pamwgt <_x>
          Controls the weight on  a  PAM-based  prior.  Only  has
          effect  if  --pam option is also in use. <_x> is a posi-
          tive real number, 20.0 by default. <_x> is the number of
          "pseudocounts"  contriubuted  by  the  heuristic prior.
          Very high values of <_x> can force a scoring system that
          is  entirely  driven by the substitution matrix, making
          HMMER somewhat approximate Gribskov profiles.


     --pbswitch <_n>
          For alignments with a very large number  of  sequences,
          the  GSC,  BLOSUM,  and  Voronoi  weighting schemes are
          slow;  they're  O(N^2)  for   N   sequences.   Henikoff
          position-based weights (PB weights) are more efficient.
          At or above a certain  threshold  sequence  number  <_n>
          hmmbuild  will  switch  from  GSC,  BLOSUM,  or Voronoi
          weights  to  PB  weights.  To  disable  this  switching
          behavior  (at  the  cost of compute time, set <_n> to be
          something larger than the number of sequences  in  your
          alignment.   <_n>  is a positive integer; the default is
          1000.
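
          The switching rule reads as a simple threshold test. A
          Python sketch of that reading follows; the scheme
          labels and the function name are hypothetical, and the
          sketch only models the decision, not the weighting
          itself.

```python
# Schemes this page describes as O(N^2) and subject to switching.
QUADRATIC_SCHEMES = {"gsc", "blosum", "voronoi"}


def effective_scheme(requested, nseq, pbswitch=1000):
    """Return the label of the weighting scheme that would apply."""
    if pbswitch < 1:
        raise ValueError("pbswitch must be a positive integer")
    # "At or above" the threshold: the comparison is inclusive.
    if requested in QUADRATIC_SCHEMES and nseq >= pbswitch:
        return "pb"
    return requested


print(effective_scheme("gsc", 999))                   # gsc
print(effective_scheme("gsc", 1000))                  # pb
print(effective_scheme("gsc", 5000, pbswitch=5001))   # gsc
```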


     --prior <_f>
          Read a Dirichlet prior from <_f>, replacing the  default
          mixture  Dirichlet.  The format of prior files is docu-
          mented in the User's Guide, and an example is given  in
          the Demos directory of the HMMER distribution.


     --swentry <_x>
          Controls the total probability that is  distributed  to
          local  entries  into  the model, versus starting at the
          beginning of the model as in a global  alignment.   <_x>
          is  a probability from 0 to 1, and by default is set to
          0.5.  Higher values of <_x>  mean  that  hits  that  are
          fragments on their left (N or 5'-terminal) side will be
          penalized less, but complete global alignments will  be
          penalized  more.   Lower  values of <_x> mean that frag-
          ments on the left will be penalized  more,  and  global
          alignments  on  this side will be favored.  This option
          only affects the configurations that allow local align-
          ments,  e.g.  -s and -f; unless one of these options is
          also activated, this option has no  effect.   You  have


HMMER @RELEASE@    Last change: @RELEASEDATE@                   5


HMMER Manual                                          hmmbuild(1)


          independent   control   over   local/global   alignment
          behavior for the N/C (5'/3')  termini  of  your  target
          sequences using --swentry and --swexit.


     --swexit <_x>
          Controls the total probability that is  distributed  to
          local  exits from the model, versus ending an alignment
          at the end of the model as in a global alignment.   <_x>
          is  a probability from 0 to 1, and by default is set to
          0.5.  Higher values of <_x>  mean  that  hits  that  are
          fragments  on  their right (C or 3'-terminal) side will
          be penalized less, but complete global alignments  will
          be penalized more.  Lower values of <_x> mean that frag-
          ments on the right will be penalized more,  and  global
          alignments  on  this side will be favored.  This option
          only affects the configurations that allow local align-
          ments,  e.g.  -s and -f; unless one of these options is
          also activated, this option has no  effect.   You  have
          independent   control   over   local/global   alignment
          behavior for the N/C (5'/3')  termini  of  your  target
          sequences using --swentry and --swexit.


     --verbose
          Print more possibly useful stuff, such as  the  indivi-
          dual scores for each sequence in the alignment.


     --wblosum
          Use the BLOSUM filtering algorithm to weight the
          sequences, instead of the default. Cluster the
          sequences at a given percentage identity (see
          --idlevel); assign each cluster a total weight of 1.0,
          distributed equally amongst the members of that
          cluster.
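
          Given a cluster assignment, the weighting step itself
          is short. The Python sketch below takes one cluster id
          per sequence as input (how the clusters are found is
          outside the sketch) and uses a hypothetical function
          name; it illustrates the rule stated here rather than
          HMMER's implementation.

```python
from collections import Counter


def blosum_weights(cluster_ids):
    """Each cluster gets total weight 1.0, shared equally by members."""
    sizes = Counter(cluster_ids)
    return [1.0 / sizes[c] for c in cluster_ids]


# Two sequences in cluster 0, one sequence alone in cluster 1.
print(blosum_weights([0, 0, 1]))  # [0.5, 0.5, 1.0]
```

          The weights sum to the number of clusters, so heavily
          sampled families of near-identical sequences do not
          dominate the model.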


     --wgsc
          Use the  Gerstein/Sonnhammer/Chothia  ad  hoc  sequence
          weighting  algorithm.  This  is already the default, so
          this option has no effect (unless  it  follows  another
          option  in  the  --w family, in which case it overrides
          it).


     --wme
          Use the Krogh/Mitchison maximum entropy algorithm to
          "weight" the sequences. This supersedes the
          Eddy/Mitchison/Durbin maximum discrimination algorithm,
          which gives almost identical weights but is less
          robust. ME weighting seems to give a marginal increase
          in sensitivity over the default GSC weights, but takes
          a fair amount of time.


     --wnone
          Turn off all sequence weighting.


     --wpb
          Use the Henikoff position-based weighting scheme.
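
          For readers unfamiliar with the scheme, the following
          Python sketch follows the published Henikoff idea: in
          each column, a residue contributes 1/(k*n), where k is
          the number of distinct residue types in the column and
          n is how many sequences share that residue. Skipping
          gap characters, the gap symbol set, and rescaling the
          weights to sum to the number of sequences are
          assumptions of this sketch and may not match HMMER.

```python
from collections import Counter

GAP_CHARS = set("-._~")  # assumed gap symbols


def position_based_weights(alignment):
    """Henikoff-style position-based weights, one per sequence."""
    nseq = len(alignment)
    weights = [0.0] * nseq
    for col in zip(*alignment):
        counts = Counter(ch.upper() for ch in col if ch not in GAP_CHARS)
        if not counts:
            continue  # all-gap column contributes nothing
        ntypes = len(counts)
        for i, ch in enumerate(col):
            if ch not in GAP_CHARS:
                weights[i] += 1.0 / (ntypes * counts[ch.upper()])
    total = sum(weights)
    # Assumption: rescale so the weights sum to the sequence count.
    return [w * nseq / total for w in weights] if total > 0 else weights


toy = ["AAA",
       "AAA",
       "CCC"]
print(position_based_weights(toy))  # [0.75, 0.75, 1.5]
```

          Each column is handled in one pass, which is why the
          scheme scales linearly with the number of sequences.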


     --wvoronoi
          Use the Sibbald/Argos Voronoi sequence weighting  algo-
          rithm in place of the default GSC weighting.


SEE ALSO
     Master man page, with full list of and guide to the  indivi-
     dual man pages: see hmmer(1).

     A User's Guide and tutorial come with the distribution:
     Userguide.ps [Postscript] and/or Userguide.pdf [PDF].

     Finally, all documentation is also available online via WWW:
     http://hmmer.wustl.edu/


AUTHOR
     This software and documentation is:
     HMMER - Biological sequence analysis with profile HMMs
     Copyright (C) 1992-1999 Washington University School of Medicine
     All Rights Reserved

     This source code is distributed under the terms of the GNU
     General Public License. See the files COPYING and LICENSE in
     your distribution for complete details.

     Sean Eddy
     HHMI/Dept. of Genetics
     Washington Univ. School of Medicine
     4566 Scott Ave.
     St Louis, MO 63110 USA
     Phone: 1-314-362-7666
     FAX  : 1-314-362-7855
     Email: eddy@genetics.wustl.edu

