SAM: Sequence Alignment and Modeling Software System

/**************************************************************************\
 * SAM: Sequence Alignment and Modeling Software System                   *
 *                                                                        *
 * Baskin Center for Computer Engineering and Information Sciences        *
 * University of California, Santa Cruz                                   *
 *                                                                        *
 * Copyright 1992-1995, The Regents of the University of California       *
 *                                                                        *
 * Citations: A. Krogh et al., JMB 235:1501-1531, Feb. 1994.              *
 *            R. Hughey, A. Krogh, UCSC TR UCSC-CRL-95-7, Jan 1995        *
 *                                                                        *
 * Distributed for non-commercial use only.                               *
 *                                                                        *
 * Questions or comments to sam-info@cse.ucsc.edu                         *
\**************************************************************************/

The Sequence Alignment and Modeling system (SAM) is a collection of
flexible software tools for creating, refining, and using linear hidden
Markov models for biological sequence analysis. The model states can be
viewed as representing the sequence of columns in a multiple sequence
alignment, with provisions for arbitrary position-dependent insertions
and deletions in each sequence. The models are trained on a family of
protein or nucleic acid sequences using an expectation-maximization
algorithm and a variety of algorithmic heuristics. A trained model can
then be used both to generate multiple alignments and to search
databases for new members of the family.

SAM is written in the C programming language for Unix machines and
MasPar parallel computers, and includes extensive documentation.

The algorithms and methods used by SAM have been described in several
pioneering papers from the University of California, Santa Cruz. These
papers (citations below), as well as the SAM software suite, are
available via anonymous ftp to ftp.cse.ucsc.edu in the pub/protein
directory, or via the World-Wide Web at
http://www.cse.ucsc.edu/research/compbio/sam.html.
The software is freely available for non-commercial research use;
however, you will need an encryption key to decrypt the
pub/protein/sam1.0.tar.Z.crypt file available from the ftp server or
the WWW page. Please send email to sam-info@cse.ucsc.edu to receive the
key, or to make other arrangements if you do not have the crypt
utility. The unencrypted documentation (UCSC Technical Report
UCSC-CRL-95-7) is in pub/protein/sam1.0_doc.ps.Z.

Although we plan to create an email or WWW server in the future, one is
not currently available. If you wish to use SAM, you must download the
code and compile it yourself, a process we have tried to make as
painless as possible.

Richard Hughey
Anders Krogh
sam-info@cse.ucsc.edu
http://www.cse.ucsc.edu/research/compbio/sam.html

-----------------------------------

Related papers:

A. Krogh, M. Brown, I. S. Mian, K. Sjolander, and D. Haussler.
Hidden Markov models in computational biology: Applications to protein
modeling. Journal of Molecular Biology, 235:1501-1531, February 1994.

R. Hughey and A. Krogh. SAM: Sequence alignment and modeling software
system. Technical Report UCSC-CRL-95-7, University of California,
Santa Cruz, CA, January 1995.

M. P. Brown, R. Hughey, A. Krogh, I. S. Mian, K. Sjolander, and
D. Haussler. Using Dirichlet mixture priors to derive hidden Markov
models for protein families. In L. Hunter, D. Searls, and J. Shavlik,
editors, Proc. of First Int. Conf. on Intelligent Systems for Molecular
Biology, pages 47-55, Menlo Park, CA, July 1993. AAAI/MIT Press.

D. Haussler, A. Krogh, I. S. Mian, and K. Sjolander. Protein modeling
using hidden Markov models: Analysis of globins. In Proceedings of the
Hawaii International Conference on System Sciences, volume 1, pages
792-802, Los Alamitos, CA, 1993. IEEE Computer Society Press.

R. Hughey. Massively parallel biosequence analysis. Technical Report
UCSC-CRL-93-14, University of California, Santa Cruz, CA, April 1993.

A. Krogh, I. S. Mian, and D. Haussler.
A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids
Research, 1994. In press.

=============================================================================

Information on version 1.1 (28.10.95)

Date: Fri, 27 Oct 1995 18:00:39 -0700
From: Richard Hughey
Subject: SAM Version 1.1 Available

We didn't send out much information on the last group of changes
(1.03), though some of you have picked up a copy. The current round,
though, is something I'd encourage all of you to switch to. It's
numbered version 1.1, and the major changes include:

1. The multdomain program for iteratively aligning a model with a
   sequence to find multiple occurrences of a domain of interest.

2. An implementation of null-model based scoring. Basically, rather
   than use Z-scores, we are calculating how well the trained model
   performs in comparison to a simple null model (this is log-odds
   scoring, as, for example, Sean Eddy's HMMER uses). To use this
   scoring method, free-insertion modules (FIMs) should be placed at
   both ends of the model (if the model was initially trained without
   them). The NULL model is another FIM. The score is the difference
   in NLL scores between running the sequence through the FIM-enhanced
   model and the NULL model. A sequence will score around 0 in this
   NLL-NULL difference if it does not match the model. The hmmscore
   output file, when this method (the default) is used, provides the
   score needed for 0.01 significance.

3. Changes in wildcard handling to aid both Z-scoring and NLL-NULL
   scoring.

4. A variety of user interface changes, in particular column numbers
   and other options in prettyalign, and more forgiving reading of
   parameter files (notably, model node specifications no longer have
   to be on one continuous line).

5. The default regularizer now features background protein frequencies
   in the insert states, rather than 1/20 values. The scoring method
   works, but is not so robust, with the flat (1/20) distributions in
   the FIMs.
   If you prefer the alignments and training with the flat
   regularizer, set it in your .samrc, as detailed in the
   documentation. Also, if you have any existing models, be sure to
   insert the old regularizer at the top of the model file, or start
   again from scratch. It is best not to mix the two versions.

We've also become more web-oriented:

1. Distribution is now WWW based, from the SAM page:
   http://www.cse.ucsc.edu/research/compbio/sam.html
   Click through to the distribution page. It will ask you for a name
   and password, which are 'sam' and 'ucschmm', respectively.

2. We have a WWW interface for running several of the programs on our
   machines here. The most useful will be access to our MasPar for
   model building. If there is enough interest, we could also try to
   keep an up-to-date version of, for example, the FBSC Non-Redundant
   Protein Database here for external searches (searching against a
   model requires a couple of hours of MasPar time, so we may need
   some sort of gatekeeping). Feel free to try these (and send us
   comments!); they are also accessible from the SAM page.

In creating this version, Christian Barrett (a new addition to UCSC's
compbio group here) developed the multiple domain finder, WWW server,
and other pieces of programming, and Saira Mian (now at LBL) was once
again our most excellent beta (and at times, pre-alpha!) tester.

We also have a few new paper links on the SAM page.

In the future:

1. More flexible models, including looping and other features, for
   training on repeated domains, rather than just locating them.

2. Space-saving implementations of hmmscore and buildmodel (important
   for the MasPar versions).

3. Further refinement of the scoring system.

4. Maximum-discrimination training (see Sean Eddy's HMMER for more on
   this; there's a link from the SAM page).

Let us know if you have any questions!
Richard Hughey
Anders Krogh
sam-info@cse.ucsc.edu

=============================================================================

The complete documentation is in the PostScript file:

    /env/infobiogen/pub/ftp/pub/doc/bio/sam/sam_doc.ps

The following introduction gives an overview of the software that is
sufficient for basic use.

SAM uses a linear hidden Markov model to represent nucleic acid or
protein sequences. The model is a linear sequence of "nodes", each of
which comprises match, insert, and delete states.

"buildmodel"      the main program; creates a new model from a set of
                  sequences (a multiple-sequence file in a format
                  compatible with READSEQ).

"align2model"     creates a multiple sequence alignment from a model.
                  The "prettyalign" filter makes the output more
                  readable.

"addfims"         adds insertion modules to existing models.

"hmmscore"        computes a negative log-likelihood (NLL) score for a
                  sequence file against a model.

"modelfromalign"  creates a model from a multiple alignment.

Simplified example:
===================

1) Creating the model
   ------------------

Given a multiple-sequence file fic_de_seq in fasta, msf,
IntelliGenetics, ... format, choose a generic model name (nom_modele):

    buildmodel nom_modele -train fic_de_seq

or

    buildmodel nom_modele [-alphabet protein|DNA|RNA] -train fic_de_seq

This generates a file nom_modele.mod.

2) Multiple alignment
   ------------------

    align2model nom_modele.mod fic_de_seq | prettyalign -lNN > fichier_sortie

(where NN is the line length, and l stands for "length")

3) Sequence scores
   ---------------

    hmmscore nom_modele nom_modele.mod fic_de_seq
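
As a rough illustration of what an NLL score and the NLL-NULL
difference from the version 1.1 announcement mean, here is a minimal
Python sketch. It is not SAM's algorithm: SAM scores a sequence over
paths through match, insert, and delete states, whereas this toy uses
one emission distribution per position and no insertions or deletions.
The function names (nll, nll_minus_null) and all probabilities are
hypothetical, chosen only for illustration.

```python
import math

# Toy illustration only (assumption: one emission distribution per
# position, no insert/delete states, no FIMs). Not SAM's implementation.


def nll(sequence, emissions):
    """Negative log-likelihood (natural log) of a sequence, given one
    emission distribution (symbol -> probability) per position."""
    return -sum(math.log(dist[symbol])
                for symbol, dist in zip(sequence, emissions))


def nll_minus_null(sequence, model, background):
    """NLL under the model minus NLL under a null model that emits every
    position from the same background distribution."""
    if len(sequence) != len(model):
        raise ValueError("this toy model has no insert or delete states")
    null_model = [background] * len(sequence)
    return nll(sequence, model) - nll(sequence, null_model)


# Hypothetical numbers for a 3-position DNA model (not taken from SAM).
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
MODEL = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]

if __name__ == "__main__":
    for seq in ("ACG", "TTT"):
        print(seq, round(nll_minus_null(seq, MODEL, BACKGROUND), 3))
```

Because a lower NLL means a better fit, "ACG" (which follows the toy
model's preferred symbols) gives a negative difference, while "TTT"
gives a positive one. The toy has no FIMs, so a non-matching sequence
lands above 0 here rather than near 0 as the announcement describes
for SAM.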