fragfit(1)                    User Commands                    fragfit(1)

NAME
     fragfit - find proteins in a database that contain specified
     molecular ions

SYNOPSIS
     fragfit [-Dfile] [-Efile] [-Ofile] [-c] [-eenz#] [-i] [-mmisses]
     [-p] [-rlow-high] [-ttol] ions

DESCRIPTION
     fragfit searches a protein database for sequences which contain
     a specified set of molecular ions. See Henzel et al. (PNAS 90,
     5011-5015, 1993) for experimental details. The program "digests"
     each protein in the database and reports those which match the
     specified ion masses.

     The minimum input is a list of molecular ion masses, which may
     be specified individually on the command line (e.g.,
     "fragfit 1146.5 1274.7"), or in a file. Any argument not
     starting with "-" which starts with a digit is taken to be a
     molecular ion mass; any argument not starting with "-" which
     does not start with a digit is taken to be a file containing a
     list of ion masses. If a file is used, each ion mass should
     appear on a separate line. At least two ion masses must be
     specified.

PARAMETERS
     The following parameters modify the behavior of the program.
     For those parameters which take an argument (e.g., "-Dfile"),
     the argument immediately follows the parameter, i.e., no space
     between -D and file.

     -Dfile    file is the name of a database file. Up to eight files
               can be specified, each preceded by -D, so that updates
               can be easily integrated. Example:

                    -D/usr/pub/fasta.db -D/usr/pub/fasta.update

     -Efile    file contains enzyme data. Only one enzyme data file
               may be specified. The most commonly used enzyme should
               appear first in the file because the program will
               automatically use the first enzyme unless otherwise
               specified. The format of an enzyme description is
               given below. Each enzyme description appears on a
               single line. Any line not starting with a number is
               assumed to be a comment and is ignored.
               The format of an enzyme description is

                    number name (cleavage site)

               The cleavage site consists of "C-side" or "N-side",
               depending on which side the cleavage occurs, followed
               by a list of residues, in three-letter or one-letter
               code. Any constraint on the residue following the
               cleavage site is specified by providing a list of
               prohibited residues after "; next not". Examples:

                    1. Trypsin I (C-side of Lys, Arg; next not Pro)
                    2. Chymotrypsin (C-side of Phe, Tyr, Trp)

               If the cleavage site requires more than one residue,
               put a hyphen between residues; for example,
               "enz x (C-side of Gly-Arg)" cuts after a Gly-Arg pair.
               Use "X" to indicate any residue; for example,
               "enz y (C-side of Arg-X)" specifies an enzyme which
               cuts after any residue following an Arg.

               A more complex enzyme can be specified by using an
               apostrophe to indicate the cut in the cleavage site.
               For example, the specification for Hydroxylamine
               specifies that the cleavage occurs between Asn and
               Gly.

                    3. Hydroxylamine (Asn'Gly)

     -Ofile    file is the name of the file containing the output,
               usually "fragfit.out".

     -c        Convert each Cys to CarboxymethylCys.

     -eenz#    Use the specified enzyme number (according to the
               numbering in the enzyme description file). Multiple
               enzymes can be specified; each must be preceded by -e.
               Example:

                    -e1 -e3

     -i        Use monoisotopic (most abundant isotope) atomic
               weights when calculating fragment weights; otherwise
               average atomic weights are used.

     -mmisses  Allow the specified number of misses. For example,
               "-m2" specifies that up to two of the specified
               molecular ions may not occur in a matching database
               protein.

     -p        Do a partial digest, i.e., in addition to the usual
               fragments, create additional fragments by joining two
               adjacent fragments to simulate incomplete cleavage.

     -rlow-high
               Examine only proteins whose molecular weight lies
               between low and high, which are given in Daltons.
               For example, -r500-30000 specifies that only proteins
               between 500 Da and 30 kDa should be examined.

     -ttol     tol specifies a tolerance. For example, "-t3"
               specifies that any mass within +/-3 Da of a specified
               molecular ion should be taken as a match. The default
               tolerance is 4 Da.

OUTPUT
     A sample of the output appears below. The header information
     records the parameters used. For each database protein found,
     the one-line description is printed, any misses are noted in
     [ ], followed by data for each fragment matched. The first
     column gives the computed mass of the fragment; the second
     column gives the difference between the measured mass and the
     computed mass; the third column gives the starting residue of
     the fragment; the fourth column gives the sequence of the
     fragment.

          Fri Jun 18 10:04:49 1993
          enzyme: Trypsin I (C-side of Lys, Arg; next not Pro)
          database: /usr/seqdb/gp/genpept.fasta /usr/seqdb/gp/gpcu.fasta
             (77778 sequences, 23189618 residues)
          Ion Wts: 1146.500 1274.700 2398.300 2101.900
          Tolerance: 4.000
          Number of misses allowed: 1
          All masses represent protonated ions
          Cys -> CarboxymethylCys
          molecular mass range: 500-35000

          >gp|Y00129|ECMDH1_1 E. coli mdh gene for malate dehydrogenase
             [Escherichia coli] (312 aa, 32561.51 Da)
          [3 specified molwts matched. not found: 2101.90]
           1277.46  -2.76   88: SDLFNVNAGIVK
           2401.57  -3.27  241: ALQGEQGVVECAYVEGDGQYAR
           1150.41  -3.91  263: FFSQPLLLGK

CAVEATS AND LIMITATIONS
     Database entries which are incomplete proteins are not
     consistently identified in most available databases. The
     program has only been tested on a DEC 8650 running Berkeley
     UNIX, and a DEC 5900 running OSF/1. Other platforms may require
     some modification of the source code.

AUTHOR
     Colin Watanabe (ckw@gene.com)

Sun Microsystems          Last change: 1 June 1993
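
EXAMPLE
     The digest-and-match procedure described above can be
     illustrated with a minimal Python sketch. It is not taken from
     the fragfit source: the function names (digest,
     protonated_mass, match_ions), the toy sequence, and the use of
     standard average residue masses are assumptions made for
     illustration only. The sketch handles only the default case of
     Trypsin I (C-side of Lys, Arg; next not Pro) with average
     masses and the default 4 Da tolerance.

```python
# Illustrative sketch only; names and mass table are assumptions,
# not part of fragfit itself.

# Standard average residue masses in Da (one-letter code).
AVG_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153     # added once per peptide (free N- and C-termini)
HYDROGEN = 1.0079   # added because masses represent protonated ions


def digest(seq, cut_after="KR", not_before="P"):
    """Cut on the C-side of any residue in cut_after, unless the next
    residue is in not_before. Returns (1-based start, fragment) pairs."""
    fragments = []
    start = 0
    for i, res in enumerate(seq):
        is_last = i == len(seq) - 1
        if is_last or (res in cut_after and seq[i + 1] not in not_before):
            fragments.append((start + 1, seq[start:i + 1]))
            start = i + 1
    return fragments


def protonated_mass(fragment):
    """Average mass of the protonated fragment, in Da."""
    return sum(AVG_RESIDUE_MASS[r] for r in fragment) + WATER + HYDROGEN


def match_ions(seq, ions, tol=4.0):
    """Return (hits, missed). Each hit is (computed mass, measured minus
    computed, start residue, fragment), mirroring the output columns."""
    frags = [(s, f, protonated_mass(f)) for s, f in digest(seq)]
    hits, missed = [], []
    for ion in ions:
        found = [(round(m, 2), round(ion - m, 2), s, f)
                 for s, f, m in frags if abs(ion - m) <= tol]
        if found:
            hits.extend(found)
        else:
            missed.append(ion)
    return hits, missed


if __name__ == "__main__":
    # Toy sequence built from two fragments in the sample output.
    toy = "MKSDLFNVNAGIVKFFSQPLLLGKRPGA"
    hits, missed = match_ions(toy, [1146.5, 1274.7, 2101.9])
    for mass, diff, start, frag in hits:
        print(f"{mass:8.2f} {diff:6.2f} {start:4d}: {frag}")
    print("not found:", missed)
```

     For the two fragments it shares with the sample output
     (SDLFNVNAGIVK and FFSQPLLLGK), the sketch gives computed masses
     of 1277.46 and 1150.41 and differences of -2.76 and -3.91. The
     Arg-Pro bond in the toy sequence is left uncut, as the "next
     not Pro" constraint requires. A protein would be reported when
     the number of missed ions does not exceed the value given with
     -m.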