SAM : Sequence alignment and modeling software system

/**************************************************************************\
 *  SAM:  Sequence Alignment and Modeling Software System                 *
 *                                                                        *
 *  Baskin Center for Computer Engineering and Information Sciences       *
 *  University of California, Santa Cruz                                  *
 *                                                                        *
 *  Copyright 1992-1995, The Regents of the University of California      *
 *                                                                        *
 *  Citations:  A. Krogh et al., JMB 235:1501-1531, Feb. 1994.            *
 *              R. Hughey, A. Krogh, UCSC TR UCSC-CRL-95-7, Jan 1995      *
 *                                                                        *
 *  Distributed for non-commercial use only.                              *
 *                                                                        *
 *  Questions or comments to sam-info@cse.ucsc.edu                        *
\**************************************************************************/


The Sequence Alignment and Modeling system (SAM) is a collection of
flexible software tools for creating, refining, and using linear
hidden Markov models for biological sequence analysis.  The model
states can be viewed as representing the sequence of columns in a
multiple sequence alignment, with provisions for arbitrary
position-dependent insertions and deletions in each sequence.  The
models are trained on a family of protein or nucleic acid sequences
using an expectation-maximization algorithm and a variety of
algorithmic heuristics.  A trained model can then be used to both
generate multiple alignments and search databases for new members of
the family.  SAM is written in the C programming language for Unix
machines and MasPar parallel computers, and includes extensive
documentation.

The algorithms and methods used by SAM have been described in several
pioneering papers from the University of California, Santa Cruz.
These papers (citations below), as well as the SAM software suite, are
available via anonymous ftp to ftp.cse.ucsc.edu in the pub/protein
directory, or via the World-Wide Web to
http://www.cse.ucsc.edu/research/compbio/sam.html.

The software is freely available for non-commercial research use;
however, you will need an encryption key to decrypt the
pub/protein/sam1.0.tar.Z.crypt file available from the ftp server or
the WWW page.  Please send email to sam-info@cse.ucsc.edu to receive
the key or to make other arrangements if you do not have the crypt
utility.  The unencrypted documentation (UCSC Technical Report
UCSC-CRL-95-7) is in pub/protein/sam1.0_doc.ps.Z.

Although we plan to create an email or WWW server in the future, one
is currently not available.  If you wish to use SAM, you must grab the
code and compile it yourself, a process we have tried to make as
painless as possible.

Richard Hughey
Anders Krogh

sam-info@cse.ucsc.edu
http://www.cse.ucsc.edu/research/compbio/sam.html
-----------------------------------
Related papers:

A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler.
 Hidden Markov models in computational biology: Applications to
 protein modeling.
 Journal of Molecular Biology, 235:1501--1531, February 1994.

R. Hughey and A. Krogh,
 SAM: Sequence alignment and modeling software system.
 Technical Report UCSC-CRL-95-7, University of California,
 Santa Cruz, CA, January 1995.

M. P. Brown, R. Hughey, A. Krogh, I. S. Mian, K. Sjolander, and D. Haussler.
 Using Dirichlet mixture priors to derive hidden Markov models
 for protein families.
 In L. Hunter, D. Searls, and J. Shavlik, editors, Proc. of First
 Int. Conf. on Intelligent Systems for Molecular Biology, pages 47--55, Menlo
 Park, CA, July 1993. AAAI/MIT Press.

D. Haussler, A. Krogh, I. S. Mian, and K. Sjolander.
 Protein modeling using hidden Markov models: Analysis of globins.
 In Proceedings of the Hawaii International Conference on System
 Sciences, volume 1, pages 792--802, Los Alamitos, CA, 1993. IEEE Computer
 Society Press.

R. Hughey.
 Massively parallel biosequence analysis.
 Technical Report UCSC-CRL-93-14, University of California, Santa
 Cruz, CA, April 1993.

A. Krogh, I. S. Mian, and D. Haussler.
 A hidden Markov model that finds genes in E. coli DNA.
 Nucleic Acids Research, 1994.
 In press.

=============================================================================
Information on version 1.1 (28 Oct 1995)

Date: Fri, 27 Oct 1995 18:00:39 -0700
From: Richard Hughey <rph@cse.ucsc.edu>
Subject: SAM Version 1.1 Available

We didn't send out much information on the last group of changes
(1.03), though some of you have picked up a copy.  The current round,
though, is something I'd encourage all of you to switch to.  It's
numbered version 1.1, and the major changes include....

1. The multdomain program for iteratively aligning a model with a
   sequence to find multiple occurrences of a domain of interest.
2. An implementation of null-model-based scoring.  Basically, rather
   than use Z-scores, we are calculating how well the trained model
   performs in comparison to a simple null model (this is log-odds
   scoring, as, for example, Sean Eddy's HMMER uses).  To use this
   scoring method, FIMs should be placed at both ends of the model (if
   the model was initially trained without them).  The NULL model is
   another FIM.  The score is the difference in NLL scores between
   running the sequence through the FIM-enhanced model and the NULL
   model.  A sequence will score around 0 in this NLL-NULL difference
   if it does not match the model.  When this method (the default) is
   used, the hmmscore output file provides the score needed for
   0.01 significance.
3. Changes in wildcard handling to aid both Z-scoring and NLL-NULL
   scoring.
4. A variety of user interface changes, in particular column numbers
   and other options in prettyalign, and more forgiving reading of
   parameter files (especially model node specifications, which no
   longer have to be on one continuous line).
5. The default regularizer now features background protein frequencies
   in the insert states, rather than 1/20 values.  The scoring method
   also works with the flat (1/20) distributions in the FIMs, but is
   less robust.  If you prefer the alignments and training with the flat
   regularizer, set it in your .samrc, as detailed in the
   documentation.  Also, if you have any existing models, be sure to
   insert the old regularizer at the top of the model file, or start
   again from scratch.  It is best not to mix the two versions.
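
The NLL-NULL score described in item 2 can be illustrated with a small
Python sketch.  This is not SAM's implementation: the null model is
simplified here to a position-independent residue distribution (the
transition costs of a real FIM are ignored), natural logarithms are an
arbitrary choice, and the function names, the example sequence, and the
model NLL of 25.0 are assumptions made up for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Flat (1/20) residue distribution.  A table of background protein
# frequencies could be passed to the functions below instead.
FLAT = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def null_nll(sequence, residue_probs=FLAT):
    """Negative log-likelihood of a sequence under a simplified null model.

    Assumption: each residue is drawn independently from residue_probs;
    FIM transition costs are ignored.
    """
    return -sum(math.log(residue_probs[res]) for res in sequence)


def nll_minus_null(model_nll, sequence, residue_probs=FLAT):
    """NLL-NULL score: NLL under the trained model minus NLL under the null.

    Values near 0 mean the model explains the sequence no better than the
    null model; more negative values mean a better fit to the model.
    """
    return model_nll - null_nll(sequence, residue_probs)


if __name__ == "__main__":
    seq = "ACDEFGHIKL"      # hypothetical 10-residue sequence
    model_nll = 25.0        # hypothetical NLL reported for a trained model
    print(null_nll(seq))                   # 10 * ln(20), about 29.96
    print(nll_minus_null(model_nll, seq))  # about -4.96
```

Passing a different residue_probs table is how the flat and
background-frequency choices from item 5 would be compared in this
sketch.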

We've also become more web-oriented:

1. Distribution is now WWW based --- from the SAM page:
       http://www.cse.ucsc.edu/research/compbio/sam.html
   Click to the distribution page.  It will ask you for a name
   and password, which are 'sam' and 'ucschmm', respectively.

2. We have a WWW interface for running several of the programs on our
   machines here.  The most useful will be the access to our MasPar for
   model building.  If there is enough interest, we could also try to
   keep an up-to-date version of, for example, the FBSC Non-Redundant
   Protein Database here for external searches (searching against a
   model requires a couple of hours of MasPar time, so we may have
   some sort of gatekeeping to take care of).  Feel free to try these
   (and send us comments!); they are also accessible from the SAM page.

In creating this version, Christian Barrett (a new addition to
UCSC's compbio group here) developed the multiple domain finder, WWW
server, and other pieces of programming, and Saira Mian (now at LBL)
was once again our most excellent beta (and at times, pre-alpha!)
tester.

We also have a few new paper links on the SAM page.

In the future....

1. More flexible models, including looping and other features, for
   training on repeated domains, rather than just locating them.
2. Space-saving implementations of hmmscore and buildmodel (important
   for the MasPar versions).
3. Further refinement of the scoring system.
4. Maximum-discrimination training (see Sean Eddy's HMMER for more on
   this -- there's a link from the SAM page) 

Let us know if you have any questions!

Richard Hughey
Anders Krogh

sam-info@cse.ucsc.edu

The complete documentation is in the PostScript file:
/env/infobiogen/pub/ftp/pub/doc/bio/sam/sam_doc.ps

The following introduction gives an overview of the software that is
sufficient for minimal use.

SAM uses a linear hidden Markov model to represent nucleic acid or
protein sequences.  The model is a linear sequence of "nodes", each of
which includes match, insert, and delete states.
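
As a rough picture of the linear node structure described above, the
following Python sketch represents a model as a list of nodes.  The
class and field names are hypothetical; they do not reflect SAM's
internal data structures or its model file format, and transition
probabilities are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List

DNA = "ACGT"


def flat(alphabet: str) -> Dict[str, float]:
    """Uniform emission distribution over the alphabet."""
    return {sym: 1.0 / len(alphabet) for sym in alphabet}


@dataclass
class Node:
    """One model position (one alignment column).

    The match and insert states emit symbols.  The delete state is
    silent (it skips the position), so it needs no emission table.
    """
    match_emit: Dict[str, float]
    insert_emit: Dict[str, float]


@dataclass
class LinearHMM:
    alphabet: str
    nodes: List[Node] = field(default_factory=list)


def new_model(length: int, alphabet: str = DNA) -> LinearHMM:
    """Build an untrained model with `length` nodes and flat emissions."""
    nodes = [Node(flat(alphabet), flat(alphabet)) for _ in range(length)]
    return LinearHMM(alphabet, nodes)


model = new_model(3)
print(len(model.nodes))           # number of nodes in the model
print(model.nodes[0].match_emit)  # emission table of the first match state
```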

"buildmodel": main program
   creates a new model from a set of sequences (a multiple-sequence
   file in a format compatible with READSEQ).

"align2model"
   creates a multiple alignment of sequences against a model.
   The "prettyalign" filter makes the output more readable.

"addfims" 
   adds insertion modules (FIMs) to existing models.

"hmmscore"
   computes a negative log-likelihood (NLL) score for a sequence
   file with respect to a model.

"modelfromalign"
   creates a model from a multiple alignment.


Simplified example:
===================

1) Creating the model
   ------------------
Given a multiple-sequence file fic_de_seq in FASTA, MSF,
IntelliGenetics, ... format, choose a generic model name (nom_modele):

buildmodel nom_modele  -train fic_de_seq
or
buildmodel nom_modele [-alphabet protein|DNA|RNA] -train fic_de_seq

This generates the file nom_modele.mod.

2) Multiple alignment
   ------------------

align2model nom_modele.mod fic_de_seq |prettyalign -lNN >fichier_sortie

(where NN is the line length; the option letter is a lowercase "l", as in "length")

3) Sequence scores
   ---------------

hmmscore nom_modele nom_modele.mod fic_de_seq
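
The three steps can also be driven from a script.  The Python sketch
below only assembles the command lines exactly as they are written in
this example (no additional SAM options are assumed) and runs them with
the standard subprocess module.  It assumes the SAM programs are on the
PATH; the helper names and the default line width of 60 are
hypothetical.

```python
import subprocess


def build_cmd(name, seqs, alphabet=None):
    """Step 1: buildmodel command (produces <name>.mod per the example)."""
    cmd = ["buildmodel", name]
    if alphabet is not None:
        cmd += ["-alphabet", alphabet]
    return cmd + ["-train", seqs]


def align_cmds(name, seqs, width):
    """Step 2: align2model and prettyalign commands, to be piped together."""
    return (["align2model", name + ".mod", seqs],
            ["prettyalign", "-l%d" % width])


def score_cmd(name, seqs):
    """Step 3: hmmscore command, as written in the example."""
    return ["hmmscore", name, name + ".mod", seqs]


def run_pipeline(name, seqs, out_path, width=60):
    """Run the three steps in order, stopping at the first failure."""
    subprocess.run(build_cmd(name, seqs), check=True)

    first, second = align_cmds(name, seqs, width)
    with open(out_path, "w") as out:
        p1 = subprocess.Popen(first, stdout=subprocess.PIPE)
        p2 = subprocess.Popen(second, stdin=p1.stdout, stdout=out)
        p1.stdout.close()  # so align2model is signalled if prettyalign exits
        rc2 = p2.wait()
        rc1 = p1.wait()
    if rc1 or rc2:
        raise RuntimeError("alignment step failed")

    subprocess.run(score_cmd(name, seqs), check=True)


# Show the first command without running anything.
print(build_cmd("nom_modele", "fic_de_seq", alphabet="protein"))
```

Calling run_pipeline("nom_modele", "fic_de_seq", "fichier_sortie")
would execute all three steps with the file names used above.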