HMMER Manual                                                 hmmbuild(1)
HMMER @RELEASE@                            Last change: @RELEASEDATE@

NAME
     hmmbuild - build a profile HMM from an alignment

SYNOPSIS
     hmmbuild [options] hmmfile alignfile

DESCRIPTION
     hmmbuild reads a multiple sequence alignment file alignfile, builds
     a new profile HMM, and saves the HMM in hmmfile.

     alignfile may be in ClustalW, GCG MSF, SELEX, Stockholm, or aligned
     FASTA alignment format. The format is automatically detected.

     By default, the model is configured to find one or more
     nonoverlapping alignments to the complete model: multiple global
     alignments with respect to the model, and local with respect to the
     sequence. This is analogous to the behavior of the hmmls program of
     HMMER 1. To configure the model for multiple local alignments with
     respect to the model and local with respect to the sequence, a la
     the old program hmmfs, use the -f (fragment) option. More rarely,
     you may want to configure the model for a single global alignment
     (global with respect to both model and sequence), using the -g
     option, or for a single local/local alignment (a la standard
     Smith/Waterman, or the old hmmsw program), using the -s option.

OPTIONS
     -f
          Configure the model for finding multiple domains per sequence,
          where each domain can be a local (fragmentary) alignment. This
          is analogous to the old hmmfs program of HMMER 1.

     -g
          Configure the model for finding a single global alignment to a
          target sequence, analogous to the old hmms program of HMMER 1.

     -h
          Print brief help; includes version number and summary of all
          options, including expert options.

     -n <s>
          Name this HMM <s>. <s> can be any string of non-whitespace
          characters (i.e. one "word"). There is no length limit (at
          least not one imposed by HMMER; your shell will complain about
          command line lengths first).

     -o <f>
          Re-save the starting alignment to <f>, in Stockholm format.
          The columns that were assigned to match states are marked with
          x's in an #=RF annotation line. If either the --hand or --fast
          construction option was chosen, the alignment may have been
          slightly altered to be compatible with Plan 7 transitions, so
          saving the final alignment and comparing it to the starting
          alignment lets you view these alterations. See the User's
          Guide for more information on this arcane side effect.

     -s
          Configure the model for finding a single local alignment per
          target sequence. This is analogous to the standard
          Smith/Waterman algorithm or the hmmsw program of HMMER 1.

     -A
          Append this model to an existing hmmfile rather than creating
          hmmfile. Useful for building HMM libraries (like Pfam).

     -F
          Force overwriting of an existing hmmfile. Otherwise HMMER will
          refuse to clobber your existing HMM files, for safety's sake.

EXPERT OPTIONS
     --amino
          Force the sequence alignment to be interpreted as amino acid
          sequences. Normally HMMER autodetects whether the alignment is
          protein or DNA, but sometimes alignments are so small that
          autodetection is ambiguous. See --nucleic.

     --archpri <x>
          Set the "architecture prior" used by MAP architecture
          construction to <x>, where <x> is a probability between 0 and
          1. This parameter governs a geometric prior distribution over
          model lengths. As <x> increases, longer models are favored a
          priori. As <x> decreases, it takes more residue conservation
          in a column to make that column a "consensus" match column in
          the model architecture. The 0.85 default has been chosen
          empirically as a reasonable setting.

     --binary
          Write the HMM to hmmfile in HMMER binary format instead of
          readable ASCII text.

     --cfile <f>
          Save the observed emission and transition counts to <f> after
          the architecture has been determined (i.e. after residues/gaps
          have been assigned to match, delete, and insert states).
          This option is used in HMMER development for generating data
          files useful for training new Dirichlet priors. The format of
          count files is documented in the User's Guide.

     --fast
          Quickly and heuristically determine the architecture of the
          model by assigning all columns with more than a certain
          fraction of gap characters to insert states. By default this
          fraction is 0.5, and it can be changed using the --gapmax
          option. The default construction algorithm is a maximum a
          posteriori (MAP) algorithm, which is slower.

     --gapmax <x>
          Controls the --fast model construction algorithm; if --fast is
          not being used, it has no effect. If a column has more than a
          fraction <x> of gap symbols in it, it gets assigned to an
          insert column. <x> is a frequency from 0 to 1, and by default
          is set to 0.5. Higher values of <x> mean more columns get
          assigned to consensus, and models get longer; smaller values
          of <x> mean fewer columns get assigned to consensus, and
          models get shorter.

     --hand
          Specify the architecture of the model by hand: the alignment
          file must be in SELEX or Stockholm format, and the reference
          annotation line (#=RF in SELEX, #=GC RF in Stockholm) is used
          to specify the architecture. Any column marked with a non-gap
          symbol (such as an 'x', for instance) is assigned as a
          consensus (match) column in the model.

     --idlevel <x>
          Controls both the determination of effective sequence number
          and the behavior of the --wblosum weighting option. The
          sequence alignment is clustered by percent identity, and the
          number of clusters at a cutoff threshold of <x> is used to
          determine the effective sequence number. Higher values of <x>
          give more clusters and higher effective sequence numbers;
          lower values of <x> give fewer clusters and lower effective
          sequence numbers. <x> is a fraction from 0 to 1, and by
          default is set to 0.62 (corresponding to the clustering level
          used in constructing the BLOSUM62 substitution matrix).

     --informat <s>
          Assert that the input alignfile is in format <s>; do not run
          Babelfish format autodetection. This increases the reliability
          of the program somewhat, because the Babelfish can make
          mistakes; particularly recommended for unattended,
          high-throughput runs of HMMER. Valid format strings include
          FASTA, GENBANK, EMBL, GCG, PIR, STOCKHOLM, SELEX, MSF,
          CLUSTAL, and PHYLIP. See the User's Guide for a complete list.

     --noeff
          Turn off the effective sequence number calculation, and use
          the true number of sequences instead. This will usually reduce
          the sensitivity of the final model (so don't do it without
          good reason!).

     --nucleic
          Force the alignment to be interpreted as nucleic acid
          sequence, either RNA or DNA. Normally HMMER autodetects
          whether the alignment is protein or DNA, but sometimes
          alignments are so small that autodetection is ambiguous. See
          --amino.

     --null <f>
          Read a null model from <f>. The default for protein is to use
          average amino acid frequencies from Swissprot 34 and
          p1 = 350/351; for nucleic acid, the default is to use 0.25 for
          each base and p1 = 1000/1001. For documentation of the format
          of the null model file and further explanation of how the null
          model is used, see the User's Guide.

     --pam <f>
          Apply a heuristic PAM- (substitution matrix-) based prior on
          match emission probabilities instead of the default mixture
          Dirichlet. The substitution matrix is read from <f>. See
          --pamwgt.

          The default Dirichlet state transition prior and insert
          emission prior are unaffected. Therefore in principle you
          could combine --prior with --pam, but this isn't recommended,
          as it hasn't been tested. (--pam itself hasn't been tested
          much!)

     --pamwgt <x>
          Controls the weight on a PAM-based prior.
          It only has an effect if the --pam option is also in use. <x>
          is a positive real number, 20.0 by default. <x> is the number
          of "pseudocounts" contributed by the heuristic prior. Very
          high values of <x> can force a scoring system that is entirely
          driven by the substitution matrix, making HMMER somewhat
          approximate Gribskov profiles.

     --pbswitch <n>
          For alignments with a very large number of sequences, the GSC,
          BLOSUM, and Voronoi weighting schemes are slow; they're O(N^2)
          for N sequences. Henikoff position-based weights (PB weights)
          are more efficient. At or above a certain threshold sequence
          number <n>, hmmbuild will switch from GSC, BLOSUM, or Voronoi
          weights to PB weights. To disable this switching behavior (at
          the cost of compute time), set <n> to be something larger than
          the number of sequences in your alignment. <n> is a positive
          integer; the default is 1000.

     --prior <f>
          Read a Dirichlet prior from <f>, replacing the default mixture
          Dirichlet. The format of prior files is documented in the
          User's Guide, and an example is given in the Demos directory
          of the HMMER distribution.

     --swentry <x>
          Controls the total probability that is distributed to local
          entries into the model, versus starting at the beginning of
          the model as in a global alignment. <x> is a probability from
          0 to 1, and by default is set to 0.5. Higher values of <x>
          mean that hits that are fragments on their left (N or
          5'-terminal) side will be penalized less, but complete global
          alignments will be penalized more. Lower values of <x> mean
          that fragments on the left will be penalized more, and global
          alignments on this side will be favored. This option only
          affects the configurations that allow local alignments, i.e.
          -s and -f; unless one of these options is also activated, this
          option has no effect.
          You have independent control over local/global alignment
          behavior for the N/C (5'/3') termini of your target sequences
          using --swentry and --swexit.

     --swexit <x>
          Controls the total probability that is distributed to local
          exits from the model, versus ending an alignment at the end of
          the model as in a global alignment. <x> is a probability from
          0 to 1, and by default is set to 0.5. Higher values of <x>
          mean that hits that are fragments on their right (C or
          3'-terminal) side will be penalized less, but complete global
          alignments will be penalized more. Lower values of <x> mean
          that fragments on the right will be penalized more, and global
          alignments on this side will be favored. This option only
          affects the configurations that allow local alignments, i.e.
          -s and -f; unless one of these options is also activated, this
          option has no effect. You have independent control over
          local/global alignment behavior for the N/C (5'/3') termini of
          your target sequences using --swentry and --swexit.

     --verbose
          Print more possibly useful output, such as the individual
          scores for each sequence in the alignment.

     --wblosum
          Use the BLOSUM filtering algorithm to weight the sequences,
          instead of the default. Cluster the sequences at a given
          percentage identity (see --idlevel); assign each cluster a
          total weight of 1.0, distributed equally amongst the members
          of that cluster.

     --wgsc
          Use the Gerstein/Sonnhammer/Chothia ad hoc sequence weighting
          algorithm. This is already the default, so this option has no
          effect (unless it follows another option in the --w family, in
          which case it overrides it).

     --wme
          Use the Krogh/Mitchison maximum entropy algorithm to "weight"
          the sequences. This supersedes the Eddy/Mitchison/Durbin
          maximum discrimination algorithm, which gives almost identical
          weights but is less robust.
          ME weighting seems to give a marginal increase in sensitivity
          over the default GSC weights, but takes a fair amount of time.

     --wnone
          Turn off all sequence weighting.

     --wpb
          Use the Henikoff position-based weighting scheme.

     --wvoronoi
          Use the Sibbald/Argos Voronoi sequence weighting algorithm in
          place of the default GSC weighting.

SEE ALSO
     Master man page, with full list of and guide to the individual man
     pages: see hmmer(1).

     A User guide and tutorial came with the distribution: Userguide.ps
     [Postscript] and/or Userguide.pdf [PDF].

     Finally, all documentation is also available online via WWW:
     http://hmmer.wustl.edu/

AUTHOR
     This software and documentation is:

     @COPYRIGHT@
     HMMER - Biological sequence analysis with profile HMMs
     Copyright (C) 1992-1999 Washington University School of Medicine
     All Rights Reserved

     This source code is distributed under the terms of the GNU General
     Public License. See the files COPYING and LICENSE for details.

     See the file COPYING in your distribution for complete details.

     Sean Eddy
     HHMI/Dept. of Genetics
     Washington Univ. School of Medicine
     4566 Scott Ave.
     St Louis, MO 63110 USA
     Phone: 1-314-362-7666
     FAX  : 1-314-362-7855
     Email: eddy@genetics.wustl.edu
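
EXAMPLES
     The column assignment rule described under --fast and --gapmax can
     be illustrated with a short Python sketch. This is not HMMER's
     source code: the function name assign_columns, the toy alignment,
     and the choice of '-' and '.' as gap symbols are assumptions made
     for illustration only. Columns are marked 'x' (match) or '.'
     (insert), loosely mirroring the #=RF line written by -o.

```python
# Sketch of the --fast rule: a column whose fraction of gap symbols is
# greater than gapmax becomes an insert column; every other column
# becomes a consensus (match) column.
GAP_CHARS = set("-.")  # assumption: gap symbols used in this illustration


def assign_columns(alignment, gapmax=0.5):
    """Return one mark per column: 'x' for match, '.' for insert."""
    if not alignment:
        return ""
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    marks = []
    for col in range(width):
        gaps = sum(1 for row in alignment if row[col] in GAP_CHARS)
        gap_fraction = gaps / len(alignment)
        # Strictly "more than" gapmax, as the option text states.
        marks.append("." if gap_fraction > gapmax else "x")
    return "".join(marks)


# Hypothetical four-sequence alignment, five columns wide.
toy = [
    "AC-GT",
    "A--GT",
    "ACTG-",
    "A--GT",
]

print(assign_columns(toy))              # default gapmax of 0.5
print(assign_columns(toy, gapmax=0.8))  # more permissive threshold
```

     With the default threshold, the third column (75% gaps) becomes an
     insert column while the second column (exactly 50% gaps) stays a
     match column. Raising the threshold to 0.8 keeps all five columns
     as match columns, so the model gets one position longer.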