saps - Statistical Analysis of Protein Sequences

saps [ -d ] [ -t ] [ -v ] [ -T ] [ -s species ] [ -H ] [ -a XY... ]
     [ -b libfname ] [ -l lstfname ] [ -p ] [ seqfname(s) ]

SAPS evaluates by statistical criteria a wide variety of  protein
sequence  properties. Properties considered include compositional
biases; clusters and runs of charge and other amino  acid  types;
different  kinds  and  extents  of repetitive structures; locally
periodic motifs; and anomalous spacings between identical residue
types.   The   statistics   are   computed  for  any  single  (or
appropriately concatenated) protein sequence input. Statistically
significant  sequence  features  highlighted by SAPS in the input
sequence  may  suggest   promising   regions   for   experimental
investigation.   The   program  also  finds  application  in  the
description of conserved features of families of proteins as well
as  in  the  inverse  problem of deriving protein groupings based
upon sequence features.

Short sequences are subject to larger statistical fluctuations
than longer sequences. The statistical evaluations of SAPS are
reliable only for sequences of at least about 200 residues.
Shorter sequences may in some cases be appropriately concatenated
and analyzed as a representative combined sequence (e.g.,
histones, or Ras family proteins).

The SAPS program was developed in the group of Prof. Samuel
Karlin at Stanford University. The program is available via
anonymous ftp from gnomic.stanford.edu. Correspondence relating
to SAPS should be addressed to Volker Brendel at the Department
of Mathematics, Stanford University, Stanford CA 94305, U.S.A.;
phone: (415) 723-9256; fax: (415) 725-2040; email:
volker@gnomic.stanford.edu.

Users of the program should cite the following reference:
     Brendel, V., Bucher, P., Nourbakhsh, I., Blaisdell, B.E.,
     Karlin, S. (1992)
     Methods and algorithms for statistical analysis of protein
     sequences.
     Proc. Natl. Acad. Sci. USA 89: 2002-2006.

-d            Generate documented output.
-t            Generate terse output.
-v            Generate verbose output.
-T            Append computer-readable summary output to file
              `saps.table'.
-s species    Use species.q for quantile comparisons.
-H            Count H as positive charge.
-a XY...      Analyze spacings of amino acids X, Y, ....
-b libfname   Read protein sequence data from library file
              libfname.
-l lstfname   Read protein sequence data from files specified in
              LST_lstfname.
-p            Read protein sequence data from stdin.
seqfname(s)   Read protein sequence data from file(s) seqfname(s).

Input File Format

Input to SAPS consists of individual protein sequences of lengths
not exceeding 10,000 residues. Input is supplied by the arguments
seqfname(s), -p, -l lstfname, and -b libfname.

A. seqfname(s)
   Individual sequences are supplied via the files seqfname(s) in
minimal EMBL format: the first line of the file is a descriptor
line, which will be printed; following lines (if any) are
annotation; the first line of the sequence is immediately
preceded by a line beginning with the delimiter `SQ'; and
subsequent symbols are either A-Z (one-letter-code symbols),
which are read as part of the sequence, or irrelevant characters
(like numbers and blanks). Non-standard symbols for ambiguous or
missing residues are ignored. Lines should not exceed 512
characters. SWISS-PROT files may be used without change in the
distributed format (for such files, the DE line is also printed
by default). Example (SWISS-PROT entry for Drosophila cut
protein):

ID   HMCU_DROME     STANDARD;      PRT;  2175 AA.
(any number of comment lines not beginning with `SQ')
SQ
      1  MQPTLPQAAG TADMDLTAVQ SINDWFFKKE QIYLLAQFWQ QRATLAEKEV
        (sequence continued)
   2161  AVTTAAATAA AGWNY
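
The reading rules described under A can be illustrated with a
short Python sketch. It is not the SAPS parser; the function name
parse_minimal_embl is hypothetical, and two points are assumptions
not stated above: only the 20 standard one-letter codes are kept
as residues, and lowercase letters are skipped.

```python
# Assumption: only the 20 standard one-letter amino acid codes are
# treated as residues; digits, blanks, lowercase letters and symbols
# such as B, Z or X are skipped.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_minimal_embl(text):
    """Return (descriptor, sequence) for text in minimal EMBL format."""
    lines = text.splitlines()
    if not lines:
        raise ValueError("empty input")
    descriptor = lines[0]          # first line: descriptor line
    residues = []
    in_sequence = False
    for line in lines[1:]:
        if in_sequence:
            # After the SQ line, keep residue symbols and drop the rest.
            residues.extend(ch for ch in line if ch in STANDARD_RESIDUES)
        elif line.startswith("SQ"):
            # The sequence starts on the line after the SQ delimiter.
            in_sequence = True
    if not in_sequence:
        raise ValueError("no line beginning with 'SQ' found")
    return descriptor, "".join(residues)


example = (
    "ID   HMCU_DROME     STANDARD;      PRT;  2175 AA.\n"
    "CC   (an annotation line)\n"
    "SQ\n"
    "      1  MQPTLPQAAG TADMDLTAVQ\n"
)
descriptor, sequence = parse_minimal_embl(example)
print(descriptor)
print(len(sequence), sequence)
```

A check such as len(sequence) <= 10000 could be added to mirror
the length limit stated above.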

B. ----pppp
   This option allows to read input formatted as described  under
A  to  be read from stdin. One possible use for this option is in
conjunction with a file  reformatting  program  such  as  ReadSeq
(D.G.    Gilbert;    available    via    anonymous    ftp    from
ftp.bio.indiana.edu,  directory  molbio/readseq).  Thus,  for   a
protein  data  file  in any format recognized by ReadSeq, one may
run saps with the command `readseq -p -f4 _s_e_q_f_n_a_m_e  |  saps  -p',
for example.  C. ----llll _l_s_t_f_n_a_m_e
   There are two other possible inputs to SAPS that can be used
alternatively or in conjunction with sequence file input as
described above. If the -l lstfname command line flag is
specified, input is taken from files in minimal EMBL format, the
names of which are specified in the file LST_lstfname. A list
file must be named with the prefix LST_ and an arbitrary suffix
lstfname. It must begin with two comment lines, each indicated by
a # symbol in the first position, followed by lines giving the
names of input files in minimal EMBL format, one per line. Memory
limitations on the system may limit the number of input files
that can be specified in this way. Example:

#'HELIX.*LOOP.*HELIX' proteins:
#
ARLC_MAIZE
ARRS_MAIZE
ASH1_RAT
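
A list file of this shape can be produced by hand or by a small
script. The Python sketch below is one way to do it; the helper
names list_file_name, format_list_file and write_list_file are
hypothetical and not part of SAPS.

```python
def list_file_name(suffix):
    """Name of the list file that `saps -l suffix' would look for."""
    return "LST_" + suffix


def format_list_file(comment, seq_filenames):
    """Return list-file text: two '#' comment lines, then one name per line."""
    header = ["#" + comment, "#"]
    return "\n".join(header + list(seq_filenames)) + "\n"


def write_list_file(suffix, comment, seq_filenames):
    """Write the list file into the current directory and return its name."""
    name = list_file_name(suffix)
    with open(name, "w") as handle:
        handle.write(format_list_file(comment, seq_filenames))
    return name


text = format_list_file(
    "'HELIX.*LOOP.*HELIX' proteins:",
    ["ARLC_MAIZE", "ARRS_MAIZE", "ASH1_RAT"],
)
print(text, end="")
```

Calling write_list_file("hlh", ...) with the same arguments would
create a file named LST_hlh.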

D. ----bbbb _l_i_b_f_n_a_m_e Library files (invoked by the command line flag ----bbbb
_l_i_b_f_n_a_m_e)  contain  one  or  more sequence files assembled in LIB
format:  one-line descriptors  beginning  with  >  in  the  first
position  followed  by  the  sequence  in  free  format (non-one-
letter-code symbols again being ignored; up to 10,000  characters
per line).  Memory limitations on the system may limit the number
of input files that can be specified in this way.
Example:

>SW;ARLC_MAIZE: ANTHOCYANIN REGULATORY LC PROTEIN (GENE NAME: LC).
MALSASRVQQAEELLQRPAERQLMRSQLAAAARSINWSYALFWSISDTQP(sequence continued)
>SW;ARRS_MAIZE: ANTHOCYANIN REGULATORY R-S PROTEIN (GENE NAME: R-S).
MAVSASRVQQAEELLQRPAERQLMRSQLAAAARSINWSYALFWSISDTQP(sequence continued)
>SW;ASH1_RAT: ACHAETE-SCUTE HOMOLOGUE 1 (GENE NAME: MASH-1).
MESSGKMESGAGQQPQPPQPFLPPAACFFATAAAAAAAAAAAAAQSAQQQ(sequence continued)
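
As an illustration of the LIB layout, the following Python sketch
splits library text into (descriptor, sequence) pairs. It is not
the SAPS reader; the name parse_library is hypothetical, and it
assumes that only the 20 standard uppercase one-letter codes count
as residues.

```python
# Assumption: only the 20 standard one-letter codes are residues;
# all other characters on sequence lines are skipped.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_library(text):
    """Return a list of (descriptor, sequence) pairs from LIB-format text."""
    entries = []
    descriptor = None
    residues = []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new descriptor line closes the previous entry, if any.
            if descriptor is not None:
                entries.append((descriptor, "".join(residues)))
            descriptor = line[1:]
            residues = []
        elif descriptor is not None:
            residues.extend(ch for ch in line if ch in STANDARD_RESIDUES)
    if descriptor is not None:
        entries.append((descriptor, "".join(residues)))
    return entries


library = (
    ">SW;ARLC_MAIZE: ANTHOCYANIN REGULATORY LC PROTEIN (GENE NAME: LC).\n"
    "MALSASRVQQ AEELLQRPAE\n"
    ">SW;ASH1_RAT: ACHAETE-SCUTE HOMOLOGUE 1 (GENE NAME: MASH-1).\n"
    "MESSGKMESG\n"
)
for name, seq in parse_library(library):
    print(name, len(seq))
```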

Running SAPS on each of the above three sequences could  thus  be
done  in  any  of the following ways (assuming that the list file
under C is named LST_hlh and that the library  file  under  D  is
named LIBhlh):

a) saps ARLC_MAIZE ARRS_MAIZE ASH1_RAT > OUTPUT
b) saps -b LIBhlh > OUTPUT
c) saps -l hlh > OUTPUT

Output is directed to standard output. To run SAPS on the
sequence file HMCU_DROME, for example (see above), one might type
the command `saps HMCU_DROME | more' or
`saps HMCU_DROME > OUTPUT'. The output format can be modified by
the flags -d, -t, or -v, and -T:

-d   The output will come with documentation that annotates each
     part of the program; this flag should be set when SAPS is
     used for the first time as it provides helpful explanations
     with respect to the statistics being used and the layout of
     the output.
-t   This flag specifies terse output that is limited to the
     analysis of the charge distribution and of high scoring
     segments.
-v   This flag specifies verbose output with more detail than
     normally required.
-T   This flag is used in conjunction with the analysis of sets
     of proteins (specified typically with the -b libfname or
     -l lstfname options); if specified, computer-readable lines
     describing the input files and their significant features
     are appended to the file `saps.table'.

   The residue composition of the input protein may be evaluated
relative to standard sets of proteins grouped by species, size
class, subcellular location, function, or other criteria.
Specifically, the composition of the input protein is compared
with the quantile table of residue usage for the user-specified
standard set. Extremal usages which fall in the tails of the
reference distribution are indicated for individual amino acids,
charged residues, and hydrophobic residues. The reference set is
selected with the command line flag `-s species'. The following
options for `species' are currently supported: BACSU (Bacillus
subtilis); CHICK (chicken); DROME (Drosophila melanogaster);
ECOLI (Escherichia coli); HUMAN (human); MOUSE (mouse); RAT
(rat); XENLA (frog); YEAST (Saccharomyces cerevisiae); swp23s
(random sample of proteins from SWISS-PROT, Release 23.0). By
default, a sequence file whose name ends in _SPECIES is evaluated
with the quantile table SPECIES (if among the ones listed above);
otherwise swp23s is used. For each reference set, only proteins
of lengths at least 200 residues were included; redundant entries
were culled (for lists of the SWISS-PROT file names composing each
set and for the quantile tables, see directory SAPS/Inc).
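
The default rule for choosing a quantile table can be sketched in
Python as follows. This is an illustration rather than the SAPS
source; the function name default_quantile_table is hypothetical,
and matching the suffix case-sensitively is an assumption.

```python
import os

# Species tables listed above; swp23s is the fallback.
SPECIES_TABLES = (
    "BACSU", "CHICK", "DROME", "ECOLI", "HUMAN",
    "MOUSE", "RAT", "XENLA", "YEAST",
)
FALLBACK_TABLE = "swp23s"


def default_quantile_table(seq_filename):
    """Pick the table used when no `-s species' flag is given."""
    name = os.path.basename(seq_filename)
    if "_" in name:
        # Text after the last underscore, e.g. DROME in HMCU_DROME.
        suffix = name.rsplit("_", 1)[1]
        if suffix in SPECIES_TABLES:  # assumption: case-sensitive match
            return suffix
    return FALLBACK_TABLE


for filename in ("HMCU_DROME", "ASH1_RAT", "ARLC_MAIZE"):
    print(filename, default_quantile_table(filename))
```

Under this sketch, HMCU_DROME and ASH1_RAT select DROME and RAT,
while ARLC_MAIZE falls back to swp23s because MAIZE is not among
the listed tables.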
   By default, SAPS treats only lysine (K) and arginine (R) as
positively charged residues. If the command line flag `-H' is
set, then histidine (H) is also treated as positively charged in
all parts of the program involving the charge alphabet.
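
The effect of `-H' on the charge alphabet can be pictured with a
small Python sketch. It is illustrative only: the function name
charge_string and the symbols +, - and 0 are assumptions, and the
treatment of aspartate (D) and glutamate (E) as negatively charged
is standard biochemistry rather than something stated on this
page.

```python
def charge_string(sequence, count_histidine=False):
    """Map each residue to '+', '-' or '0' in a simple charge alphabet."""
    # K and R are positive by default; H is added when the -H behaviour
    # is requested.
    positive = set("KRH") if count_histidine else set("KR")
    negative = set("DE")  # assumption: D and E are the negative residues
    symbols = []
    for residue in sequence:
        if residue in positive:
            symbols.append("+")
        elif residue in negative:
            symbols.append("-")
        else:
            symbols.append("0")
    return "".join(symbols)


print(charge_string("MKDEHR"))
print(charge_string("MKDEHR", count_histidine=True))
```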
   Clusters of particular amino acid types may be evaluated by
means of the same tests that are used to detect clustering of
charged residues (binomial model and scoring statistics). These
tests are invoked by setting the `-a' flag; for example, to test
(separately) for clusters of alanine (A) and serine (S), set
`-a AS'. The binomial test is also programmed for certain
combinations of amino acids: AG (flag `-a a'), PEST (flag
`-a p'), QP (flag `-a q'), ST (flag `-a s').

SAPS/Inc/(files)
SAPS/README
SAPS/testpro
SAPS/testout
A hardcopy of this manual page is  obtained  by  `man  -t  saps'.
Volker Brendel <volker@gnomic.stanford.edu>