SIGNAL SCAN

  SIGNAL SCAN is a program developed to facilitate the analysis of DNA 
sequences for known eukaryotic signals.  This program is FREE. You may copy 
and distribute this program, but you may not charge for its distribution. 
You MUST register the program by sending your name and address. Registering 
this program helps to justify the program for funding purposes.  When you 
register, you will automatically be notified of SIGNAL SCAN changes and 
updates. PLEASE REGISTER; you will help assure the future of SIGNAL SCAN.  
Note that if you obtained this program directly from me, you are already 
registered.
  The source code is written in the `C' language and is fairly easy to port
over to other hardware and operating systems.  The source code will be made
available upon request. You may make changes to the source code, but may not 
release the modified program or source code without my authorization.
  If you use this program in published research, please cite: 
Prestridge, D.S. (1991) SIGNAL SCAN: A computer program that scans DNA 
sequences for eukaryotic transcriptional elements. CABIOS 7, 203-206.
  The author welcomes comments and suggestions on the program or additions
to the database.  Please contact:
  Dr. Dan S. Prestridge                        Tele:(612) 625-3744
  Advanced Biosciences Computing Center        E-mail:danp@biosci.umn.edu
  1479 Gortner Ave.
  University of Minnesota
  St. Paul, MN    55108

                            SEQUENCE FORMAT 
                    
   At present, SIGNAL SCAN will accept Staden, Fasta, and GCG formatted 
sequences. An exception is when using the IMD database search, which
presently accepts only Staden and Fasta formats. We hope to add GCG later.
Please look at sample sequence files for details. Sample files
included are sample.seq (Staden), sample.tfa (Fasta), and sample.gcg (GCG).
The sequence file must be ASCII. If you use a word processor (such as 
WORD (tm) or WORDPERFECT (tm)), you must export the file in an ASCII 
format, because a word processor adds much more to a file than most users 
realize (page formatting, the type of printer selected, and many other 
things), all hidden from the user.  If you are familiar with, and have 
access to, the Genetics Computer Group programs, you can use their 
"TOSTADEN" formatting program to reformat GenBank files.  Currently the 
maximum length of an input sequence is 20kb. If you have MS-DOS 5.0 or 
greater, its 'edit' program serves as a good ASCII sequence editor.
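
As a rough illustration of the requirements above, the following Python
sketch checks that a Fasta style file is plain ASCII and within the length
limit. It is not part of SIGNAL SCAN. The function name, the reading of
"20kb" as exactly 20000 bases, and the simple parsing rules are assumptions
made for this example only.

```python
MAX_BASES = 20000  # assumed reading of the "20kb" limit


def check_fasta_text(raw: bytes) -> int:
    """Return the number of bases, or raise ValueError if the file is unusable."""
    try:
        text = raw.decode("ascii")  # files with non-ASCII bytes fail here
    except UnicodeDecodeError as exc:
        raise ValueError("file is not plain ASCII") from exc
    # Skip Fasta header lines (those beginning with '>') and join the rest.
    lines = [ln.strip() for ln in text.splitlines() if not ln.startswith(">")]
    bases = "".join(lines).replace(" ", "")
    if len(bases) > MAX_BASES:
        raise ValueError(f"sequence has {len(bases)} bases; limit is {MAX_BASES}")
    return len(bases)
```

Reading the file in binary mode (open(name, "rb").read()) and passing the
bytes to check_fasta_text() would catch most word processor files before
they are given to the program.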

[GCG sequence note]
Note that, presently, SIGNAL SCAN will not accept GCG sequences with
inline comments (comments inserted into the sequence using the seqed 
editor; header comments are ok). If you want to scan a sequence with
inline comments, convert it to Staden format.

[IBM DOS note]
Please note that your DNA sequence must be in the same subdirectory as the
SIGNAL SCAN program.

NEW TO SIGNAL SCAN VERSION 4.0
In addition to Ghosh's TFD (Ghosh,D. (1991) TIBS 16: 445-447)
SIGNAL SCAN now contains the TRANSFAC (Wingender,E. NAR 16: 1879-1902) 
and IMD (Chen, Hertz, and Stormo. MATRIX SEARCH 1.0: a computer program that 
scans DNA sequences for transcriptional elements using a database of 
weighted matrices of transcription factor binding sites [In preparation]) 
transcription factor databases. Ghosh's TFD has not been updated since 
8/93. Wingender's TRANSFAC continues to be updated. Chen's IMD 
(Information Matrix Database) is a new database of weighted matrices of 
transcription factor binding sites. All 3 databases are now searchable in
SIGNAL SCAN. Also, SIGNAL SCAN now accepts GCG and FASTA formatted sequences.


________________________PROGRAM INFORMATION_________________________________  
 
   SIGNAL SCAN is offered to you as is, and so are its results, with no
promises.  A signal, as defined here, is any short DNA sequence that may have
known significance.  SIGNAL SCAN finds matches to published signal sequences
in your sequence; most of these signals are transcriptional elements.  It
cannot, at this time, predict whether what it finds has any meaning.  The
interpretation of the results is up to you.  Most signal elements found
probably will not have any meaning, as the elements are in the wrong milieu,
wrong cell type, or wrong organism.  Consequently, SIGNAL SCAN will find many
more erroneous signals than significant ones.
   The significance probably varies greatly with the signal length.  There 
are many matches for CP1 in any sequence because it is a very short signal 
with a high probability of random occurrence.  There are fewer, and likely 
more significant, matches for glucocorticoid elements because their signal 
sequence is longer.
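
To give a feel for how strongly signal length affects chance matches, here
is a small Python sketch. It assumes equal frequencies of the four bases,
independent positions, and a fixed (non-degenerate) signal, none of which
holds exactly for real DNA. The lengths 5 and 15 are illustrative only and
are not the actual lengths of the CP1 or glucocorticoid signals.

```python
def expected_random_matches(signal_len: int, seq_len: int, strands: int = 2) -> float:
    """Expected chance occurrences of one fixed signal of the given length."""
    positions = max(seq_len - signal_len + 1, 0)  # possible start positions
    return strands * positions / 4 ** signal_len


# A 5 bp signal versus a 15 bp signal in a 20,000 bp sequence, both strands:
short = expected_random_matches(5, 20000)
long_ = expected_random_matches(15, 20000)
print(f"{short:.1f} vs {long_:.6f}")
```

Under these assumptions the 5 bp signal is expected about 39 times by chance
alone, while the 15 bp signal is expected far less than once, so a match to
the longer signal is much less likely to be a random occurrence.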
   There is also a great possibility that elements that are in your sequence
will be missed by SIGNAL SCAN, even if those elements are represented in the 
data files.  This can happen if your element does not fall within the 
consensus of the reported signal in the literature.  Use the Journal Citation 
feature to find references to the signals.
   Probably the major benefit and use of SIGNAL SCAN is to find out the 
identity of unknown proteins bound to characterized binding sites in DNA
sequences.


   You can create your own signal database files with this utility, and
save them for future use.  First you are prompted to either use an existing
file that you have created previously or to create a new one.  If you
select to create a new one and then give the name of an existing database
file, the existing file will be erased and overwritten.  Once you select an
existing file or create a new one, you can then add new signals to the
signal file or use the existing file as is.  Entering signals is the same
as in previous versions of SIGNAL SCAN.
   If you decide not to enter a new signal when already in the Add Signal
part of the program, <CTRL>C out of it as soon as possible.  If you make
a mistake in the signal, you will have to edit the file with an ASCII editor
such as the MS-DOS 5.0 "edit" editor.  Be sure to back up your file before 
editing. Be VERY CAREFUL when editing these files. Keep proper spacing.
  Note that in the scan results displayed or saved to a file, the database
selected will be "user.dat", no matter what your file name is.  SIGNAL SCAN
does not use your file directly; it makes a copy of your selected or created
signal file in a file called user.dat.  It is done this way for programming
reasons.
   Use this utility to either create or select a user signal database file.
Once selected here, start one of the scan programs and choose the "User
Signal Database" selection and any others you want.  Unless you change the
user signal database with this utility, the user database selected here will
be used in all subsequent searches until a new one is selected here.

THE MAIN MENU

The main menu options are:
Keyboard entry user signal database utility
	This utility can be used to build your own signal database.

Information Matrix Database
	This part of the program is used to scan a DNA sequence against
	a database of transcription factor binding site weighted matrices
	(the IMD database).

Consensus Signal Databases
	This part of the program is used to scan a DNA sequence against
	either the TFD or TRANSFAC consensus transcription factor
	binding site databases. It contains options for a journal citation
	lookup feature and choices of 3 types of scans:

GROUP SIGNAL SCAN, LINEAR SIGNAL SCAN or MAP SIGNAL SCAN?
	Group Signal Scan groups the results of the search by signal, so 
	that all of the signal groups are together.  Linear scan lists the 
	different signals present in your sequence as it moves along your 
	sequence.  Map scan shows your sequence and displays signals below 
	it.  All three output types report the same results; the choice is 
	a matter of preference.  Note that in map scan, each reported signal
	begins in your sequence directly above the (+) or (-) symbol (for 
	the + or - strand). For example, the first bp of a (+) strand signal
	lies directly above the + symbol.

	WHAT NEXT?
   	You are prompted for the file name that contains your sequence, which
	must be in proper format, see sample files for examples, and HELP 
	FORMAT. Next you are prompted for the classes of signals you want to
	search your sequence with; this selects which signal data files 
	SIGNAL SCAN will use in the search. You can choose User Signal 
	Database to use your own signals. To use your own signal database to
	scan with, you must first create or select your database in the 
	Keyboard entry selection from the main menu.  You are then prompted
	for a filename that you wish to store the search results in, which 
	can be any legal file name, such as "SAMPLE.SIG". As the program 
	runs, the results are saved to this file. 

Quit
	Obvious.

HELP
	You're looking at it.

Update TFD and TRANSFAC Databases
	This is a utility that you can use to update the TFD and
	TRANSFAC databases in SIGNAL SCAN. You must first obtain
	a current copy of the database (instructions are included),
	then use the utility to convert the database file to 
	SIGNAL SCAN format.
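
The difference between the Group and Linear scan layouts described under
Consensus Signal Databases can be pictured with a short Python sketch. The
hit tuples, signal names, and function names here are hypothetical and are
not SIGNAL SCAN's internal format; the point is only that both layouts
present the same hits in a different order.

```python
from collections import defaultdict

# Hypothetical hits: (signal name, strand, location of first bp in the input)
hits = [("sigB", "+", 40), ("sigA", "-", 12), ("sigB", "-", 5), ("sigA", "+", 31)]


def linear_order(found):
    """Linear layout: hits in the order they occur along the sequence."""
    return sorted(found, key=lambda h: h[2])


def group_order(found):
    """Group layout: hits collected under each signal name."""
    groups = defaultdict(list)
    for name, strand, loc in sorted(found, key=lambda h: h[2]):
        groups[name].append((strand, loc))
    return dict(sorted(groups.items()))


print(linear_order(hits))
print(group_order(hits))
```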


INTERPRETING THE RESULTS:

   The results are written to a file that you name, and can be printed out
using a DOS 'PRINT' command once the program has completed. The results show 
the name of the signal, the published signal sequence, and the location (loc)
of the first base pair of your sequence that includes that signal. A (-) 
symbol indicates that the signal sequence was found on the opposite strand of 
your input sequence, and that the signal sequence is in the reverse 
orientation, such that the 1st base pair listed is actually the last base 
pair in the signal, but still the first base pair in your sequence.  Let me 
illustrate, to wit:
Signal:   AATGC              signal found on forward strand, (+) 
                                            AATGC
Your seq:  5' GGTTTCTGAAAGCATTGCCTAAATGAGATGAATGCAAAATTTGGCGCGCGTTGTCCC 3'
opp.strand:3' CCAAAGACTTTCGTAACGGATTTACTCTACTTACGTTTTAAACCGCGCGCAACAGGG 5'
                         CGTAA
                  same signal found on opposite strand, (-) 
The 1st bp on the original seq. strand of the signal is the first A of AATGC.
The 1st bp of the signal on the opposite strand is the C of CGTAA, 
the opposite strand equivalent of AATGC. 'C' is the 3' end of the signal.
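
The same example can be worked through in a few lines of Python. This is
only a sketch of the idea, not SIGNAL SCAN's own code: it handles exact
matches only, and the 1-based numbering of locations is an assumption.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand read in the 5' to 3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def scan(sequence: str, signal: str):
    """Return (location, strand) pairs; location is 1-based on the input strand."""
    hits = []
    for strand, pattern in (("+", signal), ("-", reverse_complement(signal))):
        start = sequence.find(pattern)
        while start != -1:
            hits.append((start + 1, strand))
            start = sequence.find(pattern, start + 1)
    return sorted(hits)


seq = "GGTTTCTGAAAGCATTGCCTAAATGAGATGAATGCAAAATTTGGCGCGCGTTGTCCC"
print(scan(seq, "AATGC"))  # [(12, '-'), (31, '+')]
```

The (-) hit is found by looking for GCATT, the reverse complement of AATGC,
on the input strand. Its location, 12, is the first base pair of the match
in your sequence, which lies opposite the C of CGTAA on the other strand.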
   Note that starting with version 3.0, the binding factor name is given if
possible. If the binding factor is unknown then the TFD site name is used.
Each signal found in your sequence has its TFD S##### shown. These can be used
to find the factor name, specific site name, and journal citation. The same
is true with TRANSFAC site numbers (R#####) or IMD site numbers (M#####) in
Version 4.0.

MATRIX-SEARCH

Matrix-search is a program developed  to facilitate the analysis of 
DNA sequences for known transcription factor binding sites.  It 
scores input sequences against matrices of transcription factor 
binding sites using information theory (Hertz GZ, Hartzell GW, and 
Stormo GD Comput. Appl. Biosci. 6:81-92 (1990) ). The starting 
position of patterns with scores above the cutoffs of each matrix 
are indicated.

In order to reduce false positives, the cutoff scores are determined 
stringently such that a single base mismatch from the consensus pattern, 
if not previously demonstrated in our databases, will be deemed by the 
program as not-matching the consensus pattern. However, more than one
mismatch might be allowed if they are documented in the database.

The Match Ratio listed in the output file is the ratio of the information 
score of a sequence alignment to the score of an alignment with the maximum
score. The higher this value for an alignment, the more closely it resembles
a perfect match.
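
A minimal sketch of the idea of a ratio to the maximum score is given below
in Python. It assumes a simple log-odds weight matrix with a pseudocount and
a uniform background of 0.25 per base; the actual scoring in Matrix-search
follows Hertz et al. (1990) and may differ in detail. The counts and the
function names are hypothetical.

```python
import math

BASES = "ACGT"


def weights(counts, pseudo=1.0, background=0.25):
    """Convert per-position base counts into log2-odds weights."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + pseudo * len(BASES)
        matrix.append({b: math.log2((column.get(b, 0) + pseudo) / total / background)
                       for b in BASES})
    return matrix


def match_ratio(matrix, site):
    """Score of `site` divided by the best score the matrix can give."""
    score = sum(col[base] for col, base in zip(matrix, site))
    best = sum(max(col.values()) for col in matrix)
    return score / best


# Hypothetical 4-position matrix built from 10 aligned binding sites.
counts = [{"T": 10}, {"A": 9, "G": 1}, {"T": 10}, {"A": 8, "T": 2}]
m = weights(counts)
print(match_ratio(m, "TATA"), match_ratio(m, "TGTA"))
```

In this sketch the consensus TATA gets a ratio of 1, and TGTA, which differs
at one position, gets a lower ratio. The site is assumed to have the same
length as the matrix.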

To visualize the composition of the matrix for a transcription factor, and
to get the citation of a journal article about it, please use the Viewing
a matrix option in the matrix-search menu.

In the case of overlapping sites for the same factor, only the one with
the highest information score is selected.
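
One simple way to apply such a rule is sketched below in Python. The greedy
strategy, the (start, score) representation, and the function name are
assumptions for illustration; the document does not describe how
Matrix-search itself resolves overlaps.

```python
def best_nonoverlapping(hits, width):
    """hits: (start, score) pairs for one factor; width: matrix length in bp."""
    kept = []
    for start, score in sorted(hits, key=lambda h: h[1], reverse=True):
        # Keep a hit only if it does not overlap a higher-scoring hit already kept.
        if all(abs(start - k_start) >= width for k_start, _ in kept):
            kept.append((start, score))
    return sorted(kept)


# Sites at 10 and 12 overlap for a 6 bp matrix; only the better one survives.
print(best_nonoverlapping([(10, 5.0), (12, 7.5), (30, 6.0)], width=6))
```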

This program is based on the information theory developed by Dr. Gary 
Stormo. If you use this program in published research, please cite: 
Hertz GZ, Hartzell GW, and Stormo GD Comput. Appl. Biosci. 6:81-92 (1990)

Comments and suggestions on the program or additions to the database are 
welcome.  Please contact: Dr. Qing Chen at chenq@beagle.colorado.edu 

		  HOW TO OBTAIN JOURNAL CITATIONS FOR SIGNALS

	Before you attempt to find journal citations you must scan a sequence
for signals.  In the results file you will find an "S number" associated
with every signal found in your sequence.  The S numbers (site numbers
obtained from Ghosh's TFD) are found in the last column of the signal
group or linear searches, and are associated with every signal in the map
search. The same is true for the TRANSFAC database, except that the numbers
are preceded by "R", and for the IMD database, in which the numbers are
preceded by "M". Note that searching the TRANSFAC database
takes significantly longer, since there may be more than one reference
citation for a signal. The IMD reference search is located in the IMD
part of SIGNAL SCAN.

	Simply enter the number when prompted.  You may enter it such as
"S00023" or simply as "23".  Either format works.  All previous results are
kept on the screen, and are saved to a file.  If you do not supply a file
name, the search results are stored in a file called "save.ref".  DO NOT
use "ref.dat" or any SIGNAL SCAN file name (any name *.ref is OK).
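
The two accepted entry styles can be reduced to one form with a few lines of
Python, shown here only as a sketch. The function name is hypothetical, and
the five-digit zero padding is inferred from the S##### pattern used in this
document rather than taken from the program.

```python
import re


def normalize_site_number(entry: str, prefix: str = "S") -> str:
    """Turn '23' or 'S00023' into 'S00023'."""
    match = re.fullmatch(rf"\s*{prefix}?(\d{{1,5}})\s*", entry, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a valid site number: {entry!r}")
    return f"{prefix.upper()}{int(match.group(1)):05d}"


print(normalize_site_number("23"), normalize_site_number("S00023"))
```

Passing prefix="R" or prefix="M" would handle TRANSFAC or IMD numbers in the
same way.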

OTHER REFERENCE PROGRAMS
There are two related reference and information lookup programs available:
InfoTrac TFD and TINY-TRP. Information on each is below and is copied from
each of the programs. 

******************TINY-TRP***********************************
TINY-TRP is a computer readable version of the TRANSFAC database.
The address is ftp.gbf-braunschweig.de or 193.175.244.2. You will find the 
new version in the directory /pub/transfac/tiny, or send an E-mail to 
karas@gbf-braunschweig.d400.de and you will get an announcement when the new 
version is available.

by 
Edgar Wingender,Rainer Knueppel, and Holger Karas
Gesellschaft fuer Biotechnologische Forschung mbH
Mascheroder Weg 1, D-38124  Braunschweig, Germany

**************** I N F O T R A C  T F D  7.0 *****************

InfoTrac TFD is a microcomputer implementation of the Transcription Factor
Database TFD (D. Ghosh; NAR 18 (1990): 1749-1756) with a graphical user
interface. For detailed information on the structure of TFD data fields refer
to the cited references (D.Ghosh; NAR 20 (1992): 2091-2093 and TIBS 16
(1991): 445-447).

InfoTrac TFD is freeware (see "Disclaimer") and requires Filemaker Pro 2.0
for Macintosh or Windows.

InfoTrac TFD Demos are available from the EMBL e-mail server
(netserv@EMBL-Heidelberg.DE), from the University of Indiana ftp-archive
(ftp.bio.indiana.edu) or the corresponding gopher holes (look for
InfoTracTFD_Demo.hqx or INFOTRAC.EXE). Demos can also be requested from the
regular mailing address listed below.


InfoTrac TFD is made available by:

Wolfgang G. Hoeck, Ph.D.
MBIT Molecular Biology Information Technology
126 Flynn Ave. Apt.A
Mountain View, CA 94043
USA
phone: (415) 969-3604
e-mail: wk01177@worldlink.com
America Online: WolfMac

			 Updating the SIGNAL SCAN database

   The database files that come with SIGNAL SCAN are derived from David Ghosh's
Transcription Factor Database, Wingender's TRANSFAC database, and Chen's IMD
database.  Only the TFD and TRANSFAC databases can be updated using this
facility.

UPDATING THE TFD DATABASE

The Ghosh TFD has not been updated since 8/93.  You can get the current 
copy of the TFD by ftp to NCBI.NLM.NIH.GOV; use "anonymous" for the user 
name and your email address for the password.  You will find the file used 
by SIGNAL SCAN in the repository/TFD/tfd.ascii subdirectory. You must 'get'
the "sites.dat" file. Once you have a local copy of the sites.dat file, place
it in the SIGNAL SCAN directory. All you have to do now is select Update 
from the menu, and the updating takes place automatically. Updating may take
several minutes.
   Before you do this, make sure the original SIGNAL SCAN database is backed
up, or at least have your original SIGNAL SCAN disks.  If the TFD changes
format sometime in the future, the update utility will not work, and your
current SIGNAL SCAN database will be destroyed. If this happens,
recopy the original signal database files (*.dat) into the SIGNAL SCAN
directory and contact me to get an updated version of SIGNAL SCAN or the
newest tfd2sig.exe program.
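
If Python is available, the backup step can be done with a short script such
as the sketch below. The function name and the directory names passed to it
are hypothetical; a plain DOS COPY of the *.dat files does the same job.

```python
import shutil
from pathlib import Path


def backup_dat_files(program_dir: str, backup_dir: str) -> list:
    """Copy every *.dat file from program_dir into backup_dir; return the names copied."""
    src, dst = Path(program_dir), Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for dat in sorted(src.glob("*.dat")):
        shutil.copy2(dat, dst / dat.name)  # copy2 also preserves timestamps
        copied.append(dat.name)
    return copied
```

Note that the *.dat pattern is matched case-sensitively on some systems, so
files named in upper case (for example USER.DAT) may need the pattern
adjusted.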

UPDATING THE TRANSFAC DATABASE
The Wingender TRANSFAC database is currently being maintained. You can get the
current copy of the TRANSFAC database by ftp to 193.175.244.2; use
"anonymous" for the user name and your email address for the password.
Once you log in, change directory to pub/transfac/EBI and get the
site.dat file (it's about 4MB in size). Once you have a local copy, place
it in the SIGNAL SCAN directory and proceed as above for TFD.

UPDATING THE IMD DATABASE
Contact chenq@beagle.colorado.edu for information on how to update the
IMD database.