SIGNAL SCAN

SIGNAL SCAN is a program developed to facilitate the analysis of DNA sequences for known eukaryotic signals.

This program is FREE. You may copy and distribute this program, but you may not charge for its distribution. You MUST register the program by sending your name and address. Registering this program helps to justify the program for funding purposes. By registering the program you will automatically be notified of SIGNAL SCAN changes and updates. PLEASE REGISTER; you will help assure the future of SIGNAL SCAN. Note that if you obtained this program directly from me, you are already registered.

The source code is written in the 'C' language and is fairly easy to port to other hardware and operating systems. The source code will be made available upon request. You may make changes to the source code, but may not release the modified program or source code without my authorization.

If you use this program in published research, please cite:

Prestridge, D.S. (1991) SIGNAL SCAN: A computer program that scans DNA sequences for eukaryotic transcriptional elements. CABIOS 7, 203-206.

The author welcomes comments and suggestions on the program or additions to the database. Please contact:

Dr. Dan S. Prestridge                    Tele: (612) 625-3744
Advanced Biosciences Computing Center    E-mail: danp@biosci.umn.edu
1479 Gortner Ave.
University of Minnesota
St. Paul, MN 55108

SEQUENCE FORMAT

At present, SIGNAL SCAN will accept Staden, Fasta, and GCG formatted sequences. An exception is the IMD database search, which presently accepts only Staden and Fasta formats. We hope to add GCG later. Please look at the sample sequence files for details. Sample files included are sample.seq (Staden), sample.tfa (Fasta), and sample.gcg (GCG).

The sequence file must be ASCII, which means that if you use a word processor (such as WORD (tm), WORDPERFECT (tm), etc.) you must export the file into an ASCII format.
This is because a word processor adds many things to a file that are hidden from the user (such as page formatting, the type of printer you selected, and many other things). If you are familiar with, and have access to, the Genetics Computer Group programs, you can use their "TOSTADEN" formatting program to reformat GenBank files. Currently the maximum length of an input sequence is 20 kb. If you have MS-DOS 5.0 or greater, its 'edit' program serves as a good ASCII sequence editor.

[GCG sequence note] Note that, presently, SIGNAL SCAN will not accept GCG sequences with inline comments (comments inserted into the sequence using the seqed editor; header comments are OK). If you want to scan a sequence with inline comments, convert it to Staden format.

[IBM DOS note] Please note that your DNA sequence must be in the same subdirectory as the SIGNAL SCAN program.

NEW TO SIGNAL SCAN VERSION 4.0

In addition to Ghosh's TFD (Ghosh, D. (1991) TIBS 16: 445-447), SIGNAL SCAN now contains the TRANSFAC (Wingender, E. NAR 16: 1879-1902) and IMD (Chen, Hertz, and Stormo. MATRIX SEARCH 1.0: a computer program that scans DNA sequences for transcriptional elements using a database of weighted matrices of transcription factor binding sites [In preparation]) transcription factor databases. Ghosh's TFD has not been updated since 8/93. Wingender's TRANSFAC continues to be updated. Chen's IMD (Information Matrix Database) is a new database of weighted matrices of transcription factor binding sites. All 3 databases are now searchable in SIGNAL SCAN. Also, SIGNAL SCAN now accepts GCG and FASTA formatted sequences.

________________________PROGRAM INFORMATION_________________________________

SIGNAL SCAN is offered to you as is, and so are its results, with no promises. A signal, as defined here, is any short DNA sequence that may have known significance.
What SIGNAL SCAN does is find homologies to published signal sequences in your sequence; most of these are transcriptional elements. It cannot, at this time, predict whether what it finds has any meaning. The interpretation of those results is up to you. Most signal elements found probably will not have any meaning, as the elements are in the wrong milieu, wrong cell type, or wrong organism. Consequently, SIGNAL SCAN will find many more erroneous signals than significant ones. The significance probably varies greatly with the signal length. There are many matches for CP1 in any sequence because it is a very short sequence with a high probability of random occurrence. There are fewer, and likely more significant, matches for glucocorticoid elements because their signal sequence is longer. There is also a great possibility that elements that are in your sequence will be missed by SIGNAL SCAN, even if those elements are represented in the data files. This can happen if your element does not fall within the consensus of the signal as reported in the literature. Use the Journal Citation feature to find references to the signals. Probably the major benefit and use of SIGNAL SCAN is to find out the identity of unknown proteins bound to characterized binding sites in DNA sequences.

You can create your own signal database files with the keyboard entry user signal database utility, and save them for future use. First you are prompted either to use an existing file that you created previously or to create a new one. If you choose to create a new one and then give the name of an existing database file, the existing file will be erased and overwritten. Once you select an existing file or create a new one, you can then add new signals to the signal file or use the existing file as is. Entering signals is the same as in previous versions of SIGNAL SCAN. If you decide not to enter a new signal when already in the Add Signal part of the program, ^C (Ctrl-C) out of it as soon as possible.
If you make a mistake in the signal, you will have to edit the file with an ASCII editor such as the MS-DOS 5.0 "edit" editor. Be sure to back up your file before editing. Be VERY CAREFUL when editing these files. Keep proper spacing.

Note that in the scan results displayed or saved to a file, the database selected will be shown as "user.dat", no matter what your file name is. SIGNAL SCAN does not use your file directly; it copies your selected or created signal file into a file called user.dat. It is done this way for programming reasons.

Use this utility to either create or select a user signal database file. Once selected here, start one of the scan programs and choose the "User Signal Database" selection and any others you want. The user database selected here will be used in all subsequent searches until you select a new one with this utility.

THE MAIN MENU

The main menu options are:

Keyboard entry user signal database utility
This utility can be used to build your own signal database.

Information Matrix Database
This part of the program is used to scan a DNA sequence against a database of transcription factor binding site weighted matrices (the IMD database).

Consensus Signal Databases
This part of the program is used to scan a DNA sequence against either the TFD or TRANSFAC consensus transcription factor binding site databases. It contains options for a journal citation lookup feature and a choice of 3 types of scans.

GROUP SIGNAL SCAN, LINEAR SIGNAL SCAN or MAP SIGNAL SCAN?

Group Signal Scan groups the results of the search by signal, so that all occurrences of each signal are listed together. Linear scan lists the signals present in your sequence in the order in which they occur along your sequence. Map scan shows your sequence and displays signals below it. All three output types report the same results; which one you use is a matter of preference.
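As a rough illustration of how the group and linear scans present the same hits in a different order, here is a minimal Python sketch. It is not SIGNAL SCAN's own code: the hit list, the signal names other than CP1, and the function names are all made up for the example.

```python
from collections import defaultdict

# Hypothetical hits: (signal name, strand, 1-based location).
# These values are invented for illustration only.
hits = [
    ("TATA-box", "+", 120),
    ("CP1", "-", 15),
    ("CP1", "+", 88),
    ("GRE", "+", 40),
]


def linear_order(hits):
    """Order hits by location, the way a linear scan lists them."""
    return sorted(hits, key=lambda hit: hit[2])


def group_order(hits):
    """Group hits by signal name; each group stays sorted by location."""
    groups = defaultdict(list)
    for hit in linear_order(hits):
        groups[hit[0]].append(hit)
    # Return the groups in alphabetical order of signal name.
    return dict(sorted(groups.items()))


for name, members in group_order(hits).items():
    print(name, [(strand, loc) for _, strand, loc in members])
```

Both functions return exactly the same set of hits; only the ordering differs.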
Note that in map scan, each signal reported begins in your sequence directly above its (+) or (-) symbol (for + or - strand). For (+) strand signals, the first bp of the signal is directly above the +.

WHAT NEXT?

You are prompted for the name of the file that contains your sequence, which must be in the proper format; see the sample files for examples, and HELP FORMAT. Next you are prompted for the classes of signals you want to search your sequence with; this selects which signal data files SIGNAL SCAN will use in the search. You can choose User Signal Database to use your own signals. To scan with your own signal database, you must first create or select your database in the Keyboard entry selection from the main menu. You are then prompted for the name of the file in which to store the search results, which can be any legal file name, such as "SAMPLE.SIG". As the program runs, the results are saved to this file.

Quit
Obvious.

HELP
You're looking at it.

Update TFD and TRANSFAC Databases
This is a utility that you can use to update the TFD and TRANSFAC databases in SIGNAL SCAN. You must first obtain a current copy of the database (instructions are included), then use the utility to convert the database file to SIGNAL SCAN format.

INTERPRETING THE RESULTS:

The results are written to a file that you name, and can be printed out using the DOS 'PRINT' command once the program has completed. The results show the name of the signal, the published signal sequence, and the location (loc) of the first base pair of your sequence that includes that signal. A (-) symbol indicates that the signal sequence was found on the opposite strand of your input sequence, and that the signal sequence is in the reverse orientation, such that the 1st base pair listed is actually the last base pair in the signal, but still the first base pair in your sequence.
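The (+)/(-) reporting convention can be sketched in Python as follows. This is only an illustration under simplifying assumptions (exact matching of plain A/C/G/T strings, hypothetical function names); it is not how SIGNAL SCAN itself is implemented. The test sequence is the one used in the illustration in this section.

```python
# Translation table for complementing a DNA string (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_signal(sequence, signal):
    """Find exact matches of a signal on both strands.

    Returns (strand, loc) pairs, where loc is the 1-based position on the
    INPUT strand of the first base pair covered by the signal. A (-) hit is
    found by searching the input strand for the signal's reverse complement.
    """
    hits = []
    for strand, pattern in (("+", signal), ("-", reverse_complement(signal))):
        start = sequence.find(pattern)
        while start != -1:
            hits.append((strand, start + 1))
            start = sequence.find(pattern, start + 1)
    return sorted(hits, key=lambda hit: hit[1])


sequence = "GGTTTCTGAAAGCATTGCCTAAATGAGATGAATGCAAAATTTGGCGCGCGTTGTCCC"
print(find_signal(sequence, "AATGC"))  # [('-', 12), ('+', 31)]
```

Note that a signal that equals its own reverse complement would be reported once per strand by this sketch.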
Let me illustrate:

Signal: AATGC

                              signal found on forward strand, (+)
                                             AATGC
Your seq:   5' GGTTTCTGAAAGCATTGCCTAAATGAGATGAATGCAAAATTTGGCGCGCGTTGTCCC 3'
opp.strand: 3' CCAAAGACTTTCGTAACGGATTTACTCTACTTACGTTTTAAACCGCGCGCAACAGGG 5'
                          CGTAA
               same signal found on opposite strand, (-)

The 1st bp of the signal on the original sequence strand is the first A of AATGC. The 1st bp of the signal on the opposite strand is the C of CGTAA, the opposite strand equivalent of AATGC. 'C' is the 3' end of the signal.

Note that starting with version 3.0, the binding factor name is given if possible. If the binding factor is unknown then the TFD site name is used. Each signal found in your sequence has its TFD site number (S#####) shown. These can be used to find the factor name, specific site name, and journal citation. The same is true of TRANSFAC site numbers (R#####) and IMD site numbers (M#####) in Version 4.0.

MATRIX-SEARCH

Matrix-search is a program developed to facilitate the analysis of DNA sequences for known transcription factor binding sites. It scores input sequences against matrices of transcription factor binding sites using information theory (Hertz GZ, Hartzell GW, and Stormo GD Comput. Appl. Biosci. 6:81-92 (1990)). The starting positions of patterns that score above the cutoff of each matrix are reported. In order to reduce false positives, the cutoff scores are set stringently, such that a sequence with a single base mismatch from the consensus pattern is deemed not to match unless that mismatch has previously been documented in our databases. However, more than one mismatch may be allowed if they are documented in the database.

The Match Ratio listed in the output file is the ratio of the information score of a sequence alignment to the score of an alignment with the maximum score. The higher this value is for an alignment, the more closely the alignment resembles a perfect match.
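The idea of a match ratio can be sketched in Python as follows. This is an illustration only: the count matrix, the pseudocount, the uniform background frequency, and the log-odds weighting are all assumptions made for the example, not the matrices or the exact formula that Matrix-search uses.

```python
import math

# Hypothetical 4-position count matrix: how often each base was observed
# at each position of an imaginary binding site. Invented for illustration.
COUNTS = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 1, "G": 0, "T": 9},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
]
BACKGROUND = 0.25   # assumed uniform base frequency
PSEUDOCOUNT = 1.0   # assumed; keeps log2 defined for unobserved bases


def weight(column, base):
    """Log-odds weight of one base at one matrix position."""
    total = sum(column.values()) + 4 * PSEUDOCOUNT
    freq = (column[base] + PSEUDOCOUNT) / total
    return math.log2(freq / BACKGROUND)


def score(site):
    """Sum of the position weights for a candidate site."""
    if len(site) != len(COUNTS):
        raise ValueError("site length must equal matrix length")
    return sum(weight(col, base) for col, base in zip(COUNTS, site))


def max_score():
    """Score of the best possible site for this matrix."""
    return sum(max(weight(col, base) for base in "ACGT") for col in COUNTS)


def match_ratio(site):
    """Ratio of a site's score to the maximum score (1.0 = perfect match)."""
    return score(site) / max_score()


print(round(match_ratio("ATGC"), 3))  # the consensus scores 1.0
print(match_ratio("ATGA") < 1.0)      # a mismatch lowers the ratio
```

In this sketch a cutoff would simply be a minimum score below which a site is not reported.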
To visualize the composition of the matrix for a transcription factor, and to get the citation of a journal article about it, please use the "Viewing a matrix" option in the matrix-search menu. In the case of overlapping sites for the same factor, only the one with the highest information score is selected.

This program is based on the information theory developed by Dr. Gary Stormo. If you use this program in published research, please cite:

Hertz GZ, Hartzell GW, and Stormo GD Comput. Appl. Biosci. 6:81-92 (1990)

Comments and suggestions on the program or additions to the database are welcome. Please contact: Dr. Qing Chen at chenq@beagle.colorado.edu

HOW TO OBTAIN JOURNAL CITATIONS FOR SIGNALS

Before you attempt to find journal citations you must scan a sequence for signals. In the results file you will find an "S number" associated with every signal found in your sequence. The S numbers (site numbers, obtained from Ghosh's TFD) are found in the last column of the group or linear scan results, and are associated with every signal in the map scan results. The same is true for the TRANSFAC database, except that the numbers are preceded by "R", and for the IMD database, in which the numbers are preceded by "M". Note that searching the TRANSFAC database takes significantly longer, since there may be more than one reference citation for a signal. The IMD reference search is located in the IMD part of SIGNAL SCAN.

Simply enter the number when prompted. You may enter it as "S00023" or simply as "23". Either format works. All previous results are kept on the screen, and are saved to a file. If you do not supply a file name, the search results are stored in a file called "save.ref". DO NOT use "ref.dat" or any SIGNAL SCAN file name (any name *.ref is OK).

OTHER REFERENCE PROGRAMS

There are two related reference and information lookup programs available: InfoTrac TFD and TINY-TRP. Information on each is given below, copied from the programs themselves.
******************TINY-TRP***********************************

TINY-TRP is a computer readable version of the TRANSFAC database. The address is ftp.gbf-braunschweig.de or 193.175.244.2. You will find the new version in the directory /pub/transfac/tiny, or send an e-mail to karas@gbf-braunschweig.d400.de and you will get an announcement when the new version is available.

by Edgar Wingender, Rainer Knueppel, and Holger Karas
Gesellschaft fuer Biotechnologische Forschung mbH
Mascheroder Weg 1, D-38124 Braunschweig, Germany

**************** I N F O T R A C   T F D   7.0 *****************

InfoTrac TFD is a microcomputer implementation of the Transcription Factor Database TFD (D. Ghosh; NAR 18 (1990): 1749-1756) with a graphical user interface. For detailed information on the structure of TFD data fields refer to the cited references (D. Ghosh; NAR 20 (1992): 2091-2093 and TIBS 16 (1991): 445-447). InfoTrac TFD is freeware (see "Disclaimer") and requires Filemaker Pro 2.0 for Macintosh or Windows.

InfoTrac TFD Demos are available from the EMBL e-mail server (netserv@EMBL-Heidelberg.DE), from the University of Indiana ftp-archive (ftp.bio.indiana.edu) or the corresponding gopher holes (look for InfoTracTFD_Demo.hqx or INFOTRAC.EXE). Demos can also be requested from the regular mailing address listed below.

InfoTrac TFD is made available by:

Wolfgang G. Hoeck, Ph.D.
MBIT Molecular Biology Information Technology
126 Flynn Ave. Apt.A
Mountain View, CA 94043 USA
phone: (415) 969-3604
e-mail: wk01177@worldlink.com
America Online: WolfMac

Updating the SIGNAL SCAN database

The database files that come with SIGNAL SCAN are derived from David Ghosh's Transcription Factor Database, Wingender's TRANSFAC database, and Chen's IMD database. Only the TFD and TRANSFAC databases can be updated using this facility.

UPDATING THE TFD DATABASE

The Ghosh TFD has not been updated since 8/93.
You can get the current copy of the TFD by ftp to NCBI.NLM.NIH.GOV; use "anonymous" for the user name and your email address for the password. You will find the file used by SIGNAL SCAN in the repository/TFD/tfd.ascii subdirectory. You must 'get' the "sites.dat" file. Once you have a local copy of the sites.dat file, place it in the SIGNAL SCAN directory. All you have to do now is select Update from the menu, and the updating takes place automatically. Updating may take several minutes.

Before you do this, make sure the original SIGNAL SCAN database is backed up, or at least that you have your original SIGNAL SCAN disks. If the TFD changes format sometime in the future, the update utility will not work, and your current SIGNAL SCAN database will be destroyed. If this happens, recopy the original signal database files (*.dat) into the SIGNAL SCAN directory and contact me to get an updated version of SIGNAL SCAN or the newest tfd2sig.exe program.

UPDATING THE TRANSFAC DATABASE

The Wingender TRANSFAC database is currently being maintained. You can get the current copy of the TRANSFAC database by ftp to 193.175.244.2; use "anonymous" for the user name and your email address for the password. Once you log in, change directory to pub/transfac/EBI and get the site.dat file (it's about 4 MB in size). Once you have a local copy, place it in the SIGNAL SCAN directory and proceed as above for TFD.

UPDATING THE IMD DATABASE

Contact chenq@beagle.colorado.edu for information on how to update the IMD database.