saps - Statistical Analysis of Protein Sequences

SAPS evaluates a wide variety of protein sequence properties by
statistical criteria. Properties considered include compositional
biases; clusters and runs of charge and other amino acid types;
different kinds and extents of repetitive structures; locally periodic
motifs; and anomalous spacings between identical residue types. The
statistics are computed for any single (or appropriately concatenated)
protein sequence input.

Statistically significant sequence features highlighted by SAPS in the
input sequence may suggest promising regions for experimental
investigation. The program also finds application in the description of
conserved features of families of proteins as well as in the inverse
problem of deriving protein groupings based upon sequence features.

Short sequences are subject to larger statistical fluctuations than
longer sequences. The statistical evaluations of SAPS are reliable only
for sequences of at least about 200 residues. Shorter sequences may in
some cases be appropriately concatenated and analyzed as a
representative combined sequence (e.g., histones, or Ras family
proteins).

The SAPS program was developed in the group of Prof. Samuel Karlin at
Stanford University. The program is available via anonymous ftp from
gnomic.stanford.edu. Correspondence relating to SAPS should be addressed
to Volker Brendel at the Department of Mathematics, Stanford University,
Stanford CA 94305, U.S.A.; phone: (415) 723-9256; fax: (415) 725-2040;
email: volker@gnomic.stanford.edu.

Users of the program should cite the following reference:

    Brendel, V., Bucher, P., Nourbakhsh, I., Blaisdell, B.E.,
    Karlin, S. (1992) Methods and algorithms for statistical analysis
    of protein sequences. Proc. Natl. Acad. Sci. USA 89: 2002-2006.

-d              Generate documented output.
-t              Generate terse output.
-v              Generate verbose output.
-T              Append computer-readable summary output to file
                `saps.table'.
-s species      Use species.q for quantile comparisons.
-H              Count H as positive charge.
-a XY...        Analyze spacings of amino acids X, Y, ....
-b libfname     Read protein sequence data from library file libfname.
-l lstfname     Read protein sequence data from files specified in
                LST_lstfname.
-p              Read protein sequence data from stdin.
seqfname(s)     Read protein sequence data from file(s) seqfname(s).

Input File Format

Input to SAPS consists of individual protein sequences of lengths not
exceeding 10,000 residues. Input is supplied by the arguments
seqfname(s), -p, -l lstfname, and -b libfname.

A. seqfname(s)

Individual sequences are supplied via the files seqfname(s) in minimal
EMBL format: the first line of the file is a descriptor line which will
be printed; following lines (if any) are annotation; the first line of
the sequence is immediately preceded by a line beginning with the
delimiter `SQ'; and subsequent symbols are either A-Z (one-letter-code
symbols), which are read as part of the sequence, or irrelevant
characters (like numbers and blanks). Non-standard symbols for ambiguous
or missing residues are ignored. Lines should not exceed 512 characters.
SWISS-PROT files may be used without change in the distributed format
(for such files, the DE line is also printed by default).

Example (SWISS-PROT entry for Drosophila cut protein):

    ID   HMCU_DROME     STANDARD;      PRT;  2175 AA.
    (any number of comment lines not beginning with `SQ')
    SQ
       1 MQPTLPQAAG TADMDLTAVQ SINDWFFKKE QIYLLAQFWQ QRATLAEKEV
         (sequence continued)
    2161 AVTTAAATAA AGWNY

B. -p

This option allows input formatted as described under A to be read from
stdin. One possible use for this option is in conjunction with a file
reformatting program such as ReadSeq (D.G. Gilbert; available via
anonymous ftp from ftp.bio.indiana.edu, directory molbio/readseq).
Thus, for a protein data file in any format recognized by ReadSeq, one
may run saps with the command `readseq -p -f4 seqfname | saps -p', for
example.

C. -l lstfname

There are two other possible inputs to SAPS that can be used
alternatively or in conjunction with sequence file input as described
above. If the -l lstfname command line flag is specified, input is taken
from files in minimal EMBL format, the names of which are specified in
the file LST_lstfname. A list file must be named with the prefix LST_
and an arbitrary suffix lstfname. It must have two lines of comments,
each indicated by a # symbol in the first position, followed by lines
giving the names of input files in minimal EMBL format, one per line.
Memory limitations on the system may limit the number of input files
that can be specified in this way.

Example:

    #'HELIX.*LOOP.*HELIX' proteins:
    #
    ARLC_MAIZE
    ARRS_MAIZE
    ASH1_RAT

D. -b libfname

Library files (invoked by the command line flag -b libfname) contain one
or more sequence files assembled in LIB format: one-line descriptors
beginning with > in the first position, each followed by the sequence in
free format (non-one-letter-code symbols again being ignored; up to
10,000 characters per line). Memory limitations on the system may limit
the number of input files that can be specified in this way.

Example:

    >SW;ARLC_MAIZE: ANTHOCYANIN REGULATORY LC PROTEIN (GENE NAME: LC).
    MALSASRVQQAEELLQRPAERQLMRSQLAAAARSINWSYALFWSISDTQP(sequence continued)
    >SW;ARRS_MAIZE: ANTHOCYANIN REGULATORY R-S PROTEIN (GENE NAME: R-S).
    MAVSASRVQQAEELLQRPAERQLMRSQLAAAARSINWSYALFWSISDTQP(sequence continued)
    >SW;ASH1_RAT: ACHAETE-SCUTE HOMOLOGUE 1 (GENE NAME: MASH-1).
    MESSGKMESGAGQQPQPPQPFLPPAACFFATAAAAAAAAAAAAAQSAQQQ(sequence continued)

Running SAPS on each of the above three sequences could thus be done in
any of the following ways (assuming that the list file under C is named
LST_hlh and that the library file under D is named LIBhlh):

    a) saps ARLC_MAIZE ARRS_MAIZE ASH1_RAT > OUTPUT
    b) saps -b LIBhlh > OUTPUT
    c) saps -l hlh > OUTPUT

Output is directed to standard output. To run SAPS on the sequence file
HMCU_DROME, for example (see above), one might type the command
`saps HMCU_DROME | more' or `saps HMCU_DROME > OUTPUT'. The output
format can be modified by the flags -d, -t, or -v, and -T:

-d      The output comes with documentation that annotates each part of
        the program; this flag should be set when SAPS is used for the
        first time, as it provides helpful explanations of the
        statistics being used and the layout of the output.

-t      This flag specifies terse output that is limited to the analysis
        of the charge distribution and of high scoring segments.

-v      This flag specifies verbose output with more detail than
        normally required.

-T      This flag is used in conjunction with the analysis of sets of
        proteins (specified typically with the -b libfname or
        -l lstfname options); if specified, computer-readable lines
        describing the input files and their significant features are
        appended to the file `saps.table'.

The residue composition of the input protein may be evaluated relative
to standard sets of proteins grouped by species, size class, subcellular
location, function, or other criteria. Specifically, the composition of
the input protein is compared with the quantile table of residue usage
for the user-specified standard set. Extremal usages which fall in the
tails of the reference distribution are indicated for individual amino
acids and for charged and hydrophobic residues. The reference set is
selected with the command line flag `-s species'.
The following options for `species' are currently supported:

    BACSU     (Bacillus subtilis)
    CHICK     (chicken)
    DROME     (Drosophila melanogaster)
    ECOLI     (Escherichia coli)
    HUMAN     (human)
    MOUSE     (mouse)
    RAT       (rat)
    XENLA     (frog)
    YEAST     (Saccharomyces cerevisiae)
    swp23s    (random sample of proteins from SWISS-PROT, Release 23.0)

By default, a sequence file ending in _SPECIES is evaluated with the
quantile table SPECIES (if among the ones listed above); otherwise
swp23s is used. For each reference set, only proteins of lengths at
least 200 residues were included; redundant entries were culled (for
lists of the SWISS-PROT file names composing each set and for the
quantile tables, see directory SAPS/Inc).

By default, SAPS treats only lysine (K) and arginine (R) as positively
charged residues. If the command line flag `-H' is set, then histidine
(H) is also treated as positively charged in all parts of the program
involving the charge alphabet.

Clusters of particular amino acid types may be evaluated by means of the
same tests that are used to detect clustering of charged residues
(binomial model and scoring statistics). These tests are invoked by
setting the `-a' flag; for example, to test (separately) for clusters of
alanine (A) and serine (S), set `-a AS'. The binomial test is also
programmed for certain combinations of amino acids: AG (flag `-a a'),
PEST (flag `-a p'), QP (flag `-a q'), ST (flag `-a s').

    SAPS/Inc/(files)
    SAPS/README
    SAPS/testpro
    SAPS/testout

A hardcopy of this manual page is obtained by `man -t saps'.

Volker Brendel
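The three input layouts described under Input File Format (minimal EMBL files, a LST_ list file, and a LIB library file) can be generated from sequences held in memory. The following Python sketch is not part of SAPS; the function names, the bare `ID   name' descriptor line, and the 60-residue line width are assumptions made for illustration. It relies only on the format rules stated in this manual page.

```python
from pathlib import Path

MAX_LEN = 10_000      # input length limit stated in this manual page
MIN_RELIABLE = 200    # below this the statistics are described as unreliable


def _chunks(seq, width=60):
    """Split a sequence into lines of at most `width` residues."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]


def embl_text(name, seq):
    """Minimal EMBL layout: descriptor line, a line starting with SQ, sequence.

    The descriptor content ("ID   name") is an assumption; the manual only
    requires that the first line be a descriptor.
    """
    if len(seq) > MAX_LEN:
        raise ValueError(f"{name}: {len(seq)} residues exceeds {MAX_LEN}")
    return "\n".join([f"ID   {name}", "SQ", *_chunks(seq)]) + "\n"


def list_text(title, names):
    """List file body: two comment lines starting with #, then one name per line."""
    return "\n".join([f"#{title}", "#", *names]) + "\n"


def lib_text(records):
    """LIB layout: a descriptor line starting with >, then the sequence."""
    lines = []
    for name, seq in records.items():
        lines.append(f">{name}")
        lines.extend(_chunks(seq))
    return "\n".join(lines) + "\n"


def write_inputs(records, suffix, outdir="."):
    """Write one EMBL file per record, plus LST_<suffix> and LIB<suffix>.

    Returns the names of records shorter than MIN_RELIABLE residues so the
    caller can decide whether to concatenate or skip them.
    """
    outdir = Path(outdir)
    for name, seq in records.items():
        (outdir / name).write_text(embl_text(name, seq))
    (outdir / f"LST_{suffix}").write_text(list_text(suffix, list(records)))
    (outdir / f"LIB{suffix}").write_text(lib_text(records))
    return [name for name, seq in records.items() if len(seq) < MIN_RELIABLE]
```

After a call such as write_inputs(records, "hlh"), the files would be analyzed from the same directory with `saps -l hlh' or `saps -b LIBhlh', as in the examples above. The list file holds bare file names, so this sketch assumes saps is run from the directory the files were written to.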