User's Guide to GRAIL and GENQUEST

Sequence Analysis, Gene Assembly
And Sequence Comparison Systems

E-mail Servers
&
XGRAIL (Version 1.2) and XGENQUEST (Version 1.1)
Client-Server Systems

(July, 1994)

Informatics Group
Oak Ridge National Laboratory
Oak Ridge, Tennessee U.S.A.

----------
HIGHLIGHTS
----------

GRAIL

   GRAILs (1, 1a, 2)      Protein Coding Regions
   GAP                    Gene Modeling
   PROTEIN TRANSLATIONS
   FUNCTIONAL SITES       PolyA sites
                          Pol II Promoters
                          CpG Islands
   HUMAN REPETITIVE DNA ELEMENTS
   GRAIL ANNOTATION REPORT

GENQUEST (Database Searches and Alignments)

   DATABASES              METHODS
   Swiss-Prot             Fasta
   PDB                    Blast
   Prosite                Smith-Waterman
   GSDB                   BLIMPS
   BLOCKS                 QuikSrch
   dbEST
   Human Repetitive DNA

TABLE OF CONTENTS
-----------------

   GRAIL OVERVIEW
   GRAIL E-MAIL SERVER USER MANUAL
   GENQUEST OVERVIEW
   GENQUEST (Q) E-MAIL SERVER USER MANUAL
   XGRAIL CLIENT-SERVER SYSTEM USER MANUAL
   XGENQUEST CLIENT-SERVER SYSTEM USER MANUAL
   ACKNOWLEDGEMENTS
   SOFTWARE SUPPORT
   GRAIL PUBLICATIONS
   REFERENCES

--------------
GRAIL OVERVIEW
--------------

GRAIL is a suite of tools designed to provide analysis and putative annotation of DNA sequences both interactively and through the use of automated computation. The capabilities of GRAIL are available by several methods. These include an e-mail server at ORNL, which processes DNA sequence(s) contained in e-mail messages, and an interactive graphical X-based client-server system called XGRAIL, which supports a wide range of analysis tools, including gene modeling.

The current e-mail implementation of GRAIL provides analysis of protein coding potential of a DNA sequence, and an option for protein sequence database searches of putative coding regions.

GRAIL VERSIONS:

The coding recognition portion of the system uses a neural network which combines a series of coding prediction algorithms. There are three basic versions of this neural network, GRAIL 1, GRAIL 1a and GRAIL 2.

GRAIL 1 has been in place for about three years.
It uses a neural network described in PNAS 88, 11261-11265, which recognizes coding potential within a fixed size (100 base) window. It evaluates coding potential without looking for additional features (information such as splice junctions, etc).

GRAIL 1a is an updated version of GRAIL 1. It uses a fixed-length window to locate the potential coding regions and then evaluates a number of discrete candidates of different lengths around each potential coding region, using information from the two 60-base regions adjacent to that coding region, to find the "best" boundaries for that coding region.

GRAIL 2 uses variable-length windows tailored to each potential exon candidate, defined as an open reading frame bounded by a pair of start/donor, acceptor/donor or acceptor/stop sites. This scheme facilitates the use of more genomic context information (splice junctions, translation starts, non-coding scores of 60-base regions on either side of a putative exon) in the exon recognition process. GRAIL 2 is therefore not appropriate for sequences without genomic context (when the regions adjacent to an exon are not present). These changes have improved the overall performance compared to GRAIL 1, particularly for short exons.

All three systems have been trained to recognize coding regions in human DNA sequences, although they also work well on a number of other organisms, particularly other mammals.

[For convenience we use the term "exon" to refer to coding regions, and a note of caution is that non-coding exons, or non-coding portions of exons will not be recognized by the system.]

GRAIL PERFORMANCE STATISTICS

GRAIL 1 typically finds about 90% of coding regions greater than 100 bases with performance falling off for shorter exons. GRAIL 1 has been tested on a set of human genes containing 102kb of sequence. This set contained 70 coding exons and the system identified 62 (89%) and assigned them all to the correct strand. Of the eight missed 6 were less than 100 bases long.
In a larger test set strand assignment was 90-95% correct. The preferred reading frame assignment was correct for 60 (95%) of these exons while the frame assignment for the other two had some ambiguity. Of the predicted exons with a quality score of "excellent" all were actual coding exons. Of predicted exons scoring "good" 69% were real and of the predicted exons with a score of "marginal" only 16% were real. Though this is a rather limited test set, the results of this analysis give some guidance for interpreting GRAIL 1 output.

GRAIL 1a performs much better than GRAIL 1 in finding true exons and eliminating false positives. It is also better than GRAIL 1 in terms of finding the boundaries (edges) of coding regions. GRAIL 1a has been tested on a set of 137 sequences containing 954 exons. The system recognized 82% (787) of the exons in the set, with a false positive rate of 11%. Of the 954 exons in the set, 711 exons were greater than 100 bases long. The system recognized 95% (675) of these exons. The frame assignment was correct virtually always (greater than 98% of the time).

GRAIL 2 finds about 91% of all coding regions, with a performance that is close to being independent of exon size. Its false positive level is similar to or even slightly better than that of GRAIL 1. GRAIL 2 has been tested on a set of 137 sequences containing 954 exons. The system recognized 91% (857) of the exons in the set, with an apparent false positive rate of 8.6% (most of these were outside the domain of the known genes and some may actually be real). Of exons less than 100 bases long GRAIL 2 found 102 out of 122 or 84%. GRAIL 2 provides the best candidate for a given coding region in a manner which includes splice junctions (or translation start/stop) at the candidate's edges, so the user will note that the edges of the candidates in the initial and summary tables correspond to putative edge signals.
In the test set, about 61% of the recognized exons had both edges exactly correct (the right splice junctions picked) and about 96% had at least one edge correct. GRAIL 2 is perhaps better at estimating the true extent of an exon compared to GRAIL 1 and this additional accuracy may help in experimental protocols such as those involving PCR.

-------------------------------
GRAIL E-MAIL SERVER USER MANUAL
-------------------------------

The GRAIL e-mail server finds potential protein coding regions in anonymous DNA sequences and provides a means of searching the translations of these regions against protein and motif databases.

To have sequences analyzed by e-mail, send e-mail to:

    GRAIL@ornl.gov

Please note that:

  (i)   GRAIL is case-insensitive,
  (ii)  More than one sequence can be sent in an e-mail message,
  (iii) The length of a sequence must be at least 100 bases (for GRAIL 1) and at most 100 kilo-bases, and
  (iv)  The sequence must consist of letters A, C, G, T or U. U is converted to T. Any other character is converted to C. Blanks are ignored.

The first line of the message MUST be in the following format:

    Sequences NUM_SEQ [-1a / -2] [-S / -E / -P / -p / -B / -b]

The word Sequences, followed by the number of sequences in the message, followed by OPTIONAL switches: (a) one of -1a and -2 and (b) one of -S, -E, -P or -p or -B or -b.

The first line is followed by the sequences in the following format:

    >sequence_name
    sequence

A typical message is shown below:

    Sequences 3 -2 -E
    >seq1name
    AAAATTTCGGG........
    >seq2name
    GGCTGTTCATG........
    >seq3name
    ATTGCAGACAG

OPTIONAL SWITCHES
-----------------

One of the following two:

  -1a  switch specifies that GRAIL 1a will be used for the analysis.
  -2   switch specifies that GRAIL 2 will be used for the analysis.

  The default is GRAIL 1.
and one of the following six:

  -S   switch specifies that translations of all potential coding regions be searched against SwissProt using an implementation of the Smith-Waterman algorithm on an Intel iPSC/860 parallel computer.
  -E   switch is the same as -S, except that only "excellent" potential coding regions are considered for the searches.
  -P   switch is the same as -E, except that instead of Swiss-Prot, the Prosite Database is searched.
  -p   switch is the same as -P, except that abbreviated Prosite Database search output is returned.
  -B   switch is the same as -E, except that the Blast method is used instead of Smith-Waterman. The top 40 database hits are returned.
  -b   switch is the same as -B, except that the top 10 database hits are returned.

The database search hits provide an indication of homology between recognized exons and existing proteins.

RETURN MESSAGE
--------------

For each sequence the following information will be returned:

1. Initial Coding Scores: GRAIL 1 reports the score for the coding potential for each position analyzed on each strand (the f-(forward) strand represents the sequence as received, and the r-(reverse) strand represents the reverse complement). These scores range from 0.0 to 1.0 and a score greater than 0.5 identifies a region with protein encoding potential. Non-coding regions often have a score of 0.000. To reduce the output, only regions with scores of at least 0.01 are reported.

   GRAIL 1a and GRAIL 2 use a somewhat more concise format appropriate for their design and implementation. Instead of a position by position score, they report a table for the forward strand and a table for the reverse strand, which lists potential exon candidates and their scores. Sometimes a single exon is perceived in both the forward and reverse direction, and the issue of which is the coding strand is resolved in a later step (described below).

2.
Frame: In calculating the coding potential, the system calculates the reading-frame which is "preferred" in the window over which the calculation is done (100 bases for GRAIL 1 and the exon candidate length for GRAIL 1a and GRAIL 2). In GRAIL 1 this information is returned for positions with scores over 0.5, while in GRAIL 1a and GRAIL 2 each candidate exon has an associated frame. In GRAIL 1 the translation frame predicted is true for about 95% of true exons, while in GRAIL 1a and GRAIL 2, it is true virtually always (greater than 98% of the time).

3. ORF: The limits between which the preferred frame is open are returned for windows with scores over 0.5 (GRAIL 1) or exon candidates (GRAIL 1a and GRAIL 2).

4. EXON Summary Table: The second part of the output is the system's interpretation of the raw data (neural net outputs). This summary table provides the estimated limits of the coding exon, the most likely strand for the exon with a probability for the correctness of the strand assignment, the preferred reading frame for the exon and a quality assessment. An interesting phenomenon we have noted is that some exons seem to have coding character on both strands, so be aware that strand assignments are not always correct, and it is sometimes useful to consider both strands as possible. Strand assignment is correct about 95% of the time in GRAIL 1 and greater than 98% of the time in GRAIL 1a and GRAIL 2. Any exon with a quality score of "excellent" is worth further consideration.

-----------------
GENQUEST OVERVIEW
-----------------

GENQUEST is an integrated sequence comparison server which allows users to make use of a wide variety of sequence comparison methods and target databases, through either e-mail or an X-based client server system, XGENQUEST. GENQUEST can also be transparently accessed from XGRAIL. The purpose of the system is to allow rapid and sensitive comparison of DNA and protein sequences to existing DNA and protein sequence databases.
The databases which can be accessed from the GENQUEST server include:

   GSDB (Genome Sequence Database: a DNA sequence database satellite maintained at ORNL and updated daily from the primary database at Los Alamos National Laboratory),
   SWISSPROT[1],
   PROSITE[2] (a library of protein motifs),
   PDB[4] (Protein Databank sequences of proteins with solved structures),
   BLOCKS[9] (Protein motif database based on conserved blocks),
   dbEST (Expressed Sequence Tag database), and
   a library of human repetitive DNA sequences (from J. Jurka[3]).

GENQUEST uses a specialized parallel computing environment at Oak Ridge National Laboratory and is supported and curated by a number of groups in the Genome community. As new analysis tools become available, the modular nature of the GENQUEST server will facilitate their implementation and broaden their accessibility to the research community.

The GENQUEST server not only allows the user to access multiple databases but also allows several databases to be queried from the same message. The GENQUEST server also supports a number of methods for database searching.

----------------------------------
GENQUEST E-MAIL SERVER USER MANUAL
----------------------------------

GenQuest can be accessed by sending e-mail to:

    Q@ornl.gov

Messages to GENQUEST begin with a set of keywords which specify the options to be used in the search. Two key words are mandatory: TYPE and SEQ. The remainder are optional or have default settings. GENQUEST is case insensitive.

EXAMPLE of a typical query:

    TYPE DNA6
    TARGET SwissProt
    METHOD SW -g 13
    MATRIX PAM120
    SCORE 50
    ALIGN 20
    SEQ
    ATCTATCGTCGAGCTGGTGTCTGTGCTAGTCCACAGACAGHCTCGCTATATATGCT
    CGTTTTAAAGCTCGTATATATGCTCTCGCTAGTCCGATCGATGCTCGATCGCTAGTA
    TCGTATGATTCTTG
    END

This example translates the given DNA sequence in 6 frames and searches SwissProt, using Smith-Waterman with gap penalty of 13, PAM120 matrix, and showing top 50 matches and top 20 alignments.
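
The query layout shown in the example above can also be assembled by a small script. The following Python sketch is illustrative only and is not part of the GENQUEST software: the function name build_genquest_query, its parameters and the 60-character sequence line width are assumptions made for this example. It emits only the keywords that appear in the example query (TYPE, TARGET, METHOD, MATRIX, SCORE, ALIGN, SEQ, END) and does not check the values against what the server accepts.

```python
def build_genquest_query(sequence, seq_type, targets=(), method=None,
                         matrix=None, score=None, align=None, width=60):
    """Return the text of a GENQUEST e-mail query (hypothetical helper)."""
    # Strip blanks and line breaks so the sequence can be re-wrapped evenly.
    cleaned = "".join(sequence.split())
    if not cleaned:
        raise ValueError("sequence must not be empty")

    # TYPE is mandatory; every other keyword line is optional.
    lines = ["TYPE {}".format(seq_type)]
    # One TARGET line per database.
    lines.extend("TARGET {}".format(t) for t in targets)
    if method is not None:
        lines.append("METHOD {}".format(method))
    if matrix is not None:
        lines.append("MATRIX {}".format(matrix))
    if score is not None:
        lines.append("SCORE {}".format(score))
    if align is not None:
        lines.append("ALIGN {}".format(align))

    # SEQ stands alone on its line; the sequence starts on the next line
    # and is terminated by END.
    lines.append("SEQ")
    lines.extend(cleaned[i:i + width] for i in range(0, len(cleaned), width))
    lines.append("END")
    return "\n".join(lines) + "\n"


# Example: the options of the typical query, with a shortened sequence.
query = build_genquest_query(
    "ATCTATCGTCGAGCTGGTGTCTGTGCTAGTCC",
    seq_type="DNA6",
    targets=["SwissProt"],
    method="SW -g 13",
    matrix="PAM120",
    score=50,
    align=20,
)
print(query)
```

The resulting text would then be pasted into, or sent as, the body of a message to the GENQUEST address given above.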
KEYWORDS:

The keywords and options supported by the server are listed below:

1] TYPE ( DNA / DNA6 / PROTEIN ): the type of sequence being submitted.

   PROTEIN specifies that the input is an amino acid sequence.

   DNA6 specifies that the input sequence is DNA and is to be translated in all 6 reading frames for search against protein databases.

   DNA specifies a DNA input type which can be searched against DNA target databases or, if a protein database is selected as target, translated only in the frame of the first base in the sequence and searched against protein databases.

   The DNA6 option requires quite a long search time and is not recommended for DNA sequences of more than 1000 to 2000 bases.

2] TARGET ( GSDB | REPETITIVE | dbEST | SWISSPROT | PDB | PROSITE | BLOCKS ): specifies the database to which the sequence will be compared. Multiple targets are allowed to specify comparison against more than one database.

   a) SWISSPROT: Swissprot protein sequence database (updated quarterly).
   b) GSDB: Genome Sequence Database, a daily updated DNA sequence database.
   c) PDB[4]: Structure database, (Brookhaven) Protein Databank. Hits represent homologous proteins of known structure.
   d) PROSITE[2]: Protein motif library which can provide clues as to protein function or classification.
   e) REPETITIVE: Comparison of DNA against a library of human repetitive DNA from J. Jurka, which helps provide annotation of repetitive DNA elements.
   f) BLOCKS[9]: Protein motif database based on conserved blocks. The BLIMPS (Blocks IMproved Searcher) search tool is used for BLOCKS database searches.
   g) DBEST: Expressed Sequence Tag database.

   NOTE: The version of the database searched is listed in the results from GENQUEST.

3] METHOD ( SW / FASTA / BLAST / FLASH ): specifies the comparison algorithm to be used in the search. The options are Smith-Waterman (SW) [5], FASTA [6], BLAST [7], and FLASH [11]. The default method is SW.
   Exceptions: For BLOCKS and PROSITE databases, no method needs to be specified, since special methods are used for searching those databases.

   The defaults for FASTA and BLAST are the standard defaults used by these programs. BLAST and FASTA options can also be set on this line. Descriptions for these are available by sending "help fasta" or "help blast" e-mail to the grail@ornl.gov address.

   The gap penalty used in the SW program is set on this line using -g. For example, SW -g 10 sets the gap penalty to 10. The default is 13.

4] MATRIX ( PAM [n] / Blosum [m] ): specifies the matrix used for protein sequence comparison. [n] specifies any valid PAM matrix, viz. a multiple of ten, within the range 10 to 250. For example, PAM 250 [8]. [m] can be 62 or 80 [9]. The default is Blosum 62. These are not used for DNA-DNA comparison.

5] FILTER: specifies that repetitive DNA elements recognized in the query sequence should be masked so as not to lead to unwanted matches against the DNA sequence database. This filtering system uses a library of human repetitive DNAs from J. Jurka. The default is no filtering. Filtering a DNA query which is then translated and searched against the protein databases also avoids spurious hits that can arise from the translation of repetitive elements. The utility of such a filter is well documented [10].

6] SCORE num_score: specifies the number of hits to be reported. Default is 10.

7] ALIGN num_align [-g]: specifies the number of hits for which alignment should be performed. Default is 10. Usually, for proteins, 10 to 200 is an appropriate range. The SCORE value should be greater than or equal to the ALIGN value. The program normally does a local alignment; however, a global alignment may be requested using -g on this line. For example, ALIGN 10 -g returns global alignments of the top 10 hits.

   The keywords SCORE and ALIGN apply only to method SW (parallel implementation of Smith-Waterman).
8] COMMENT comment: specifies one line of text to be prepended to the return message from GENQUEST.

9] SEQ
   sequence ......
   END

   SEQ and END are keywords which specify where the sequence starts and ends in the e-mail message. The sequence must begin on the line following the keyword SEQ (not on the same line as SEQ). The sequence can be either standard single letter protein or DNA sequence. The length of the sequence lines should be less than 512 characters. In DNA sequences, any characters other than A, C, G, T and U are converted to X (and therefore will be filtered out). U is converted to T. Blanks are ignored in DNA and Protein sequences.

ADDITIONAL EXAMPLES
-------------------

The examples below illustrate typical queries for various types of searches:

i) This example searches the given protein sequence against SwissProt, using FASTA with default parameters and default BLOSUM 62 matrix; and, also searches Prosite (using a special method).

    TYPE Protein
    TARGET SwissProt
    TARGET Prosite
    METHOD FASTA
    COMMENT this is my protein sequence comparison run
    SEQ
    LYSEGRTAAGLVPPRTYILGREFWAAGLUTRYTHISPLEASE
    END

ii) This example searches the given DNA sequence against GSDB and the Repetitive DNA library (using the SW default).

    TYPE DNA
    TARGET GSDB
    TARGET REPETITIVE
    SEQ
    ATAGATAAAGGGTGCTGTTTGGCGAAATATTGCTGCTGGCGCCGTAGATATATAG
    CTGTGCTGTGATGTCGCTCGTAGATATAGCTAGTCTAGTCGATCG
    END

---------------------------------------
XGRAIL CLIENT-SERVER SYSTEM USER MANUAL
---------------------------------------

XGRAIL is a client-server implementation of a group of analysis tools for sequence exploration and gene discovery. It allows the user to find protein coding regions in anonymous DNA sequences, to assemble gene models, translate part or all of these models, and search these translations against various databases. Database searches of a region of a DNA sequence against various databases are also supported.
XGRAIL also provides information about GC content, and the location of several types of functional sites (splice junctions, polyA sites, Pol II promoters and CpG Islands) and a variety of human repetitive DNA sequences. All the information generated during the analysis of a DNA sequence can be saved for future retrieval and further processing. Additionally, an annotation tool is provided within XGRAIL, which facilitates marking (annotating) items of significance to the user, and generating an annotation report which can then be saved to a file or printed.

Currently the client software has been tested on SPARCstations running Open Windows 3.0 and SunOS 4.1.3. Connection of the user's machine to the Internet is required.

OBTAINING AND INSTALLING XGRAIL (Version 1.2) CLIENT SOFTWARE
-------------------------------------------------------------

1. Create a subdirectory in which you wish to install XGRAIL (Version 1.2).

       % mkdir XGRAIL_1.2

   Go to that subdirectory

       % cd XGRAIL_1.2

2. Obtain the XGRAIL (version 1.2) distribution by anonymous ftp, as follows:

       % ftp arthur.epm.ornl.gov        (or ftp 128.219.9.76)
       Name: anonymous
       Password: [your internet address]
       ftp> cd pub/xgrail/sun/ver1.2
       ftp> binary
       ftp> get README
       ftp> get xgrail.sun.ver1.2.tar.Z
       ftp> quit

3. Extract the files from xgrail.sun.ver1.2.tar.Z

       % zcat xgrail.sun.ver1.2.tar.Z | tar xvf -

4. At this point, there should be the following files in the XGRAIL subdirectory:

       Manual.grail-genquest.July94   (Grail-Genquest User Manual)
       README
       testseqs                       (Subdirectory containing test sequences)
       xgrail_1.2
       xgrail.sun.ver1.2.tar.Z        (Can be deleted at this point)

5. You can start up the xgrail program:

   From the command line

       % xgrail_1.2 &

   OR

   From the file manager by double-clicking on the xgrail_1.2 icon.

DESCRIPTION OF XGRAIL (Version 1.2):
------------------------------------

This section has been organized in the form of a step-by-step tutorial.
The best way to understand the operation and capabilities of XGRAIL is to read the following description while running XGRAIL with one of the sample DNA sequences provided with the software. MAIN WINDOWS: There are three main windows in XGRAIL: the (top) XGRAIL window, the (middle) DNA Sequence window and the (lower) ANALYSIS window. When XGRAIL is started on the client machine, it first contacts the GRAIL server to check for any informational messages. If there are any, they are retrieved and displayed in a notice window. On clearing this window, the empty XGRAIL window is displayed. Across the top of the window is a menu bar with a number of buttons, menus and controls. A button can be selected by clicking on it with the left mouse button. A menu is indicated by an inverted triangle. The menu options can be viewed by holding down the right mouse on the menu. A menu option can be selected by holding down the right mouse button and moving the cursor to the appropriate option, and then releasing the button. Clicking with the left mouse on a menu results in the selection of the default (typically the first) menu option in the list. Initially only the File Menu is enabled, since a sequence must be loaded before any other actions can be taken. MENUS IN XGRAIL WINDOW: FILE MENU (LOAD & SAVE): The first step in using XGRAIL is to load a DNA sequence file into the system. Selecting the menu option Load pops up a sequence directory window which displays subdirectories, sequence files (.seq) and XGRAIL (Version 1.2) analysis files (.xgr.1.2). A file or subdirectory can be selected by double clicking (left mouse) on the name. Alternatively, clicking (left mouse) on the file name and then clicking on the Load button at the bottom of the directory window loads the file. If an analysis file (.xgr.1.2) is selected or a sequence (.seq) file is selected and an analysis file (.xgr.1.2) exists for it, then the information from the analysis file is read in and displayed. 
If a .seq file is selected and no analysis file (.xgr.1.2) exists for it, then the sequence is read from the file, sent to the GRAIL server for calculation of coding probability, exon prediction and polyA functional sites prediction. [For purposes of this discussion and on the XGRAIL Display, the term exon is used interchangeably with coding region. Non-coding exons or portions of non-coding exons are not currently recognized by the system.] Depending on the size of the sequence and the load on the GRAIL server, it may take a few seconds to a few minutes for the results to come back from GRAIL server. At this point, the other menus and controls are enabled and the GRAIL analysis displayed in several windows: XGRAIL WINDOW displays the GRAIL analysis of the query sequence, identifying potential coding exons on the forward and reverse strands which are color coded for quality with green = "excellent" (about 90% probable), blue = "good" (about 60% probable) and red = "marginal" (about 20% probable). Gene models are also represented in this window by a set of linked cyan bars. Several other features which will be described below are also displayed in this window. This window is initially 10kb wide and longer sequences can be fit into the window by using the zoom feature. Dragging the zoom indicator with left mouse changes the zoom. DNA SEQUENCE WINDOW displays 100 bases of DNA sequence from both strands. The position of this sequence is indicated by the double vertical green lines in the central regions of the XGRAIL window. The position of this blow-up region can be moved by clicking at the desired location on the central horizontal band (the gray-scale band showing GC content) of the XGRAIL window or by clicking the arrows on either side of the DNA Sequence window (left mouse). This window also displays exons from the Exon Table as color coded horizontal bars and exons from gene models similarly in cyan. 
Translations of exons are also shown in the central region of this window (described later). Other features (PolyA sites, Promoter regions, CpG Islands, Repetitive DNA elements) are displayed as color-coded sequence characters. ANALYSIS WINDOW: This window displays information about exons and gene models found in the sequence, in three subwindows: Exon Table (leftmost) subwindow displays information for each of the exons found by GRAIL: Strand (Forward or Reverse), reading frame, position of the exon on the sequence, limits between which the preferred reading frame is open, quality score, and the number of database searches done. Model Exon Table (central) subwindow displays information for each of the exons in the currently selected gene model, assembled by GRAIL: reading frame, position of the exon on the sequence, quality scores of translation start, acceptor and donor splice junctions used in building the gene model, and the number of database searches done. A * in front of the first model exon score indicates that this score is for translation start, not acceptor junction, and an absence of * means that the assembly program did not find a suitable start site. A * and blank score after the last exon indicates a suitable stop codon has been found, while a numerical score and absence of a * indicates that this is a donor junction, and no stop codon has been found. Gene Model Table (rightmost) subwindow displays information for each of the gene models assembled by GRAIL: date of assembly, strand (Forward or Reverse), region of the sequence considered in assembling the model, score, number of exons in each model and the number of database searches done. Any time during the session, the user has the option to save the current state of analysis to the analysis file, by selecting Save option from File Menu. (PLEASE NOTE that the previous analysis file for that sequence is overwritten). 
GRAIL 1-1a-2 MENU: Clicking on 1, 1a or 2 in this menu results in the display of information related to that version of GRAIL analysis. (GRAIL 2 is the default). The difference between the three versions is as follows: GRAIL 1 recognizes coding potential without using other signals and is perhaps best suited for those cases when small fragments are to be evaluated or when genomic context is considered to be inappropriate (as in cDNA sequences). GRAIL 1a is an updated version of GRAIL 1. It first uses a fixed-length window to locate the potential coding regions and then evaluates a number of discrete candidates of different lengths around each potential coding region, using information from the two 60-base regions adjacent to that coding region, to find the "best" boundaries for each such region. GRAIL 1a, like GRAIL 1, is more useful for non-genomic sequences (like cDNA sequences). GRAIL 2 identifies exons by using signals such as splice junctions and other genomic context. It is therefore best suited for analysis of genomic sequences. Please note that: (a) Models of genes can be constructed only from GRAIL 2 exons. (b) Database searches and protein translations can be done from any version. WINDOWS MENU: Clicking the right mouse button on Windows menu displays the list of several additional windows: DNA Sequence, Analysis, Features, Annotations, Range Markers, Sketch and Grail Publ windows. Releasing the right mouse button on one of the options results in the display of the corresponding popup window. FEATURES WINDOW: This window displays the list of features (of the currently selected feature type) found in the sequence by GRAIL: PolyA sites, Promoters, CpG Islands and Repetitive DNA elements. The feature type to be displayed can be selected from a selection menu found on left side of the window. A specific feature item can be highlighted by clicking on its entry in the list. The item is highlighted in the XGRAIL and Features Windows. 
All functional features supported by GRAIL are described later in the manual. ANNOTATIONS WINDOW: This window displays items selected by the user for inclusion in an annotation report. An item can be selected for annotation by clicking with the right mouse button on its entry in the relevant Table: Exon Table, Gene Model Exon Table or Gene Model Table in Analysis Window; Feature Table(PolyA, Promoter, CpG Island or Repetitive DNA) in Features Window; Database Search Table in Database Search Info Window. The Annotation Tool is described in detail later in the manual. RANGE MARKERS WINDOW: This window displays positions of the markers which set the limits for various operations, viz. constructing a single gene model, performing a database search for a region of the DNA sequence. The markers are the blue arrows at the ends of the central region of the XGRAIL window which can be pulled to any position along the sequence using the sliders on this window. Alternatively the arrows can themselves be dragged on the main XGRAIL window. SKETCH WINDOW: This window is overlaid on XGRAIL window and displays the coding probability over the entire sequence and provides a reference for the user's location in the whole sequence. The red horizontal marker in the Sketch window corresponds to the portion of the sequence displayed in the larger XGRAIL window. GRAIL PUBL WINDOW: This window displays all GRAIL-related publications. FEATURES MENU: This menu toggles on and off the display of any of the feature types in the XGRAIL and DNA Sequence windows, viz. PolyA sites, Promoters, CpG Islands or Repetitive DNA elements. Clicking on an individual feature item in XGRAIL Window highlights it and the corresponding entry in the Features Window. ASSEMBLE MENU: is used to construct gene models within specified regions of the sequence. The region for assembly is defined using the Gene Assembly Markers window (described earlier). 
There are three options for Assembly: Auto Select which allows the program to pick the "best" model, Forward Strand which assembles exons on the forward strand and Reverse Strand, which assembles exons on the reverse strand. This version of the gene assembly program, GAP III, uses dynamic programming and heuristics, and takes only a few seconds to run. The results of model construction can be viewed in the XGRAIL window as a series of linked cyan bars and in the DNA Sequence window as cyan bars. The details of the model are listed in the Model Exon Table and Gene Model Table. Selection of Exons and Models: For a number of operations including translation of individual exons or models, and database searches for individual exons or models, a particular exon or model must first be selected. Exon selection is done by clicking on the desired exon bar in the XGRAIL or DNA Sequence window or by clicking on the corresponding row in the Exon Table, or Model Exon Table. A particular gene model can be selected by clicking on the corresponding row of the Gene Model table. TRANSLATION MENU: This displays the translation of exons in the exon table, gene model exons, or entire gene models based on a choice in the Translation submenu. For exons in the exon table, a translation is provided in only the statistically preferred reading frame (one frame for a given exon). This frame is listed in the Exon Table window. In GRAIL 2 and GRAIL 1a, the choice of this translation frame is correct greater than 98% of the time, while in GRAIL 1 it is about 95% correct. For gene models, the frame appropriate to the exon and model is used (frames listed in Model Exon table). Since the gene model is constructed in a manner which is reading frame consistent with the initial statistical estimates of frame, the frame used here is virtually always the same as in the original exon table. The resulting translation appears in a Translation pop-up window. 
The extent of the exons and their translations can also be viewed in the DNA Sequence window, in the central horizontal area between the two DNA sequence strands. A yellow single-letter protein translation is displayed when an exon in the exon table is selected. Selecting a gene model exon displays its translation in cyan, overlying the yellow translation of the same exon from the exon table. If there is a frame discrepancy at a given location, both translations will appear simultaneously.

SEARCH DATABASE MENU: There are two options in the submenu:

GENQUEST SEARCH: allows the user to access the GENQUEST (Q) sequence comparison server. A GENQUEST Search Options window comes up and displays all the available options. Exons, gene model exons, gene models and other selected parts of the DNA sequence can be searched against SwissProt, Prosite, PDB (protein structure database), the Genome Sequence Database (GSDB), BLOCKS, dbEST and the repetitive DNA library, using a number of algorithms. Further details of these options are described in the GENQUEST manual.

QUIKSRCH: searches selected translated exons or models against SwissProt. The choice of exon, gene model exon, or gene model is made through the QuikSrch submenu. This search uses a Fasta-like prescreen followed by a second optimization step based on the Smith-Waterman method.

The results of QuikSrch and GENQUEST searches are displayed in a pop-up window.

DB SEARCH INFO MENU: Tracks database searches and allows one to find and display previous database search results. The submenu allows selection of the GRAIL Exon, Model Exon, or Gene Model search list. The selected list is displayed in the Search Info pop-up window, from which the results of a given search may be chosen for display. These results appear in a pop-up window which lists the matches and the target database used for the search. Search results can be deleted from the Search Info window.
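The Smith-Waterman method named under QUIKSRCH finds the best-scoring local alignment between two sequences by dynamic programming. The Python sketch below computes only the optimal local alignment score. The match, mismatch and gap values are placeholder assumptions; the scoring scheme actually used by QuikSrch is not described in this manual, and protein comparisons would normally use a substitution matrix rather than a simple match/mismatch score.

```python
# Illustrative sketch only; not the QuikSrch implementation.
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Only two rows of the matrix are kept because the score alone is wanted; recovering the alignment itself would require keeping the full matrix (or traceback pointers).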
ZOOM-TO-FIT BUTTON: Between the Db Search Info menu and the Zoom slider is a circular button which, when selected, automatically fits the sequence within the XGRAIL window.

ZOOM: The Zoom slider rescales the loaded sequence in the XGRAIL window. The default zoom value is 1, which corresponds to 10 kb per screen width. The zoom can be changed by dragging the zoom slider.

QUIT BUTTON: Ends the client-server interaction after giving the user the option of saving new analysis and changes made during the session. The current state of the analysis, including database searches, can be saved in a .xgr.1.2 file.

DESCRIPTION OF FEATURES: GRAIL can find the following functional sites in a DNA sequence:

POLYA Site: The vertical cyan bars above and below the GC band of the XGRAIL window mark the positions of potential poly-A addition signals.

PROMOTERS: Pol II Promoter regions are displayed in the XGRAIL window as hollow yellow rectangles with a red vertical bar (representing the 'TATA' location) above or below the GC band.

NOTE: The Pol II Promoter recognition system [7] is a prototype. The current system is trained to recognize only Pol II promoter regions with TATA-like elements containing the subsequences TATA or ATA. The system detects about 60% of Pol II promoter regions with TATA-like elements. The false positive rate is approximately 1 per 7100 bases of DNA sequence. These statistics were calculated from annotated GSDB sequences. The true false positive rate may be lower, since some apparent false positives may be promoters that are not annotated.

CpG ISLANDS: CpG Islands are displayed in the XGRAIL window as hollow purple rectangles with vertical tabs, superimposed over the GC band.
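The CpG island criteria usually attributed to Gardiner-Garden and Frommer (the definition on which the XGRAIL algorithm is based) can be sketched in Python as follows: a region of at least 200 bases with GC content above 50% and an observed/expected CpG ratio above 0.6. The function names and the test of a single fixed region are assumptions for illustration; the thresholds and scanning strategy used inside XGRAIL may differ.

```python
# Illustrative sketch only; not the XGRAIL CpG island algorithm.
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # number of CpG dinucleotides
    gc_fraction = (c + g) / n
    # Expected CpG count under independence is (c * g) / n.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the three criteria to one candidate region (assumed thresholds)."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp > min_ratio
```

In practice such a test would be applied to windows moved along the sequence, with overlapping qualifying windows merged into a single island.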
REPTTV DNA: An option for locating various repetitive DNA elements is provided. These elements are indicated by centrally located hollow yellow rectangles with vertical tabs, and cyan arrow-heads indicating their orientation. Analysis of repetitive DNA requires detailed sequence comparison using Smith-Waterman and may take some time, especially for very long sequences or those with many repetitive elements (about 6 minutes for a 21kb sequence with 23 hits). Once the repetitive analysis is done, this feature can be toggled on and off the display like any other feature type using the Features menu. The human repetitive DNA annotations come from a library of 65 elements provided by J. Jurka.

GC CONTENT: GC content is represented by gray shading in a central horizontal band in the XGRAIL window. This band reflects the GC content of a sliding 50 base region, with white being high GC and black low GC.

ALTERNATE EXONS: In cases where exons on the two strands overlap, GRAIL 1 & 2 incorporate a strand-determination algorithm to determine the more likely strand. The rejected exon's coding probability is nonetheless displayed by default. This display can be toggled on or off using this option.

DESCRIPTION OF ANNOTATION TOOL:

Select Menu: The annotation of only a single type of item is displayed at a time. The type of item to be displayed can be selected from this menu in the Annotation window.

User-Input: Selecting the "User-Input" option from this menu displays a window with fields which can be filled in by the user. Unlike all other items, User-Input is always a part of the annotation.

Sequence: Selecting the "Sequence" option from this menu displays the first and last 250 bases of the DNA sequence. The entire sequence will be included in the annotation report.

Grail Publications: Selecting the "Grail Publ" option from this menu displays the list of GRAIL-related publications.
Grab, Ungrab: All features of the feature type currently selected in the annotation menu can be brought into the annotation report by clicking on the "Grab" button. Similarly, they can all be "deannotated" by clicking on the "Ungrab" button in this window.

Protein Translations: To include the protein translation of Exons, Gene Model Exons or Gene Models, click with the right mouse button on the item's entry in the annotation window while that particular item type is being displayed. A (T) is displayed to the left of that entry, indicating that the protein translation of that exon (or gene model) will be included in the annotation report.

Incl/All: The user can select the item types to be included in the annotation report by checking the boxes next to the corresponding Menu options (under the Incl column), and then selecting the "Incl" option (from the Incl-All menu). The Incl option allows the user to include only the item types of interest in the annotation report. Selecting "All" (from the Incl-All menu) overrides the checkmarks and includes all the annotated items from all item types in the annotation report.

Print/Save: The annotation report can be printed by clicking on the "Print" button or saved to a file by clicking on the "Save" button.

Annotation File: The annotation report is saved in a file, which is stored in the same directory as the sequence file. The annotation report file name consists of the sequence file name, appended with .subset.anno.1.2 (for the "Incl" option) or .full.anno.1.2 (for the "All" option), followed by the current date (e.g. humactga.seq.subset.anno.1.2.07_26_1994).

NOTES:
i) Windows can be moved by grabbing and dragging their edges.
ii) Popup Windows can be removed by "unpinning".
iii) Some windows can be resized by grabbing and dragging the corners.
iv) Selecting the File button allows another sequence to be chosen for analysis.
v) Any popup window can be dismissed by clicking on the pushpin in the upper left corner of the window and popped up again using the Windows menu.

CHANGES IN XGRAIL (Version 1.2)
-------------------------------

1. Message from Grail Staff (on startup, if any): When XGRAIL (version 1.2) is started up, it first tries to get any messages from the server. This provides a mechanism for Grail Staff to inform the user of new versions or other relevant information regarding GRAIL.

2. Sequence base numbers: The program now ignores digits, and can therefore read sequence files which contain sequence base numbers.

3. Features Highlighting: A feature can be highlighted by clicking on its graphical representation (in the XGRAIL or DNA SEQUENCE window) or on its entry in the FEATURES window.

4. Features Window: This window displays a list of all features of the selected feature type.

5. DNA sequence database search on reverse strand: In version 1.1, database searches on DNA sequence were limited to the forward strand. In this version, searches can also be performed on the reverse strand.

6. Alt Exons Toggle: In cases where exons on both strands overlap, GRAIL 1 and 2 use a strand-determination algorithm to determine the more likely strand. The rejected exon's coding probability is nonetheless displayed by default. It can be toggled on or off using the "Alt Exons" option under the "Features" menu in the XGRAIL window.

7. CpG Islands: This version of XGRAIL incorporates an algorithm to determine CpG Islands in a DNA sequence. The algorithm is based on the definition of CpG Islands by Gardiner-Garden and Frommer (J. Mol. Biol. 196:261-282, 1987).

8. Annotation: A new annotation tool is incorporated in this version of XGRAIL. It allows the user to mark items of interest, generated in the process of analysis, to be included in an annotation report. The annotation report can then be saved to a file or printed.

9. INCOMPATIBILITY with previous XGRAIL (Versions 1 & 1.1) analysis files (.xgr and .xgr.1.1): The analysis is saved in an analysis file, whose name consists of the sequence file name with .xgr.1.2 appended. Since several analysis algorithms have been altered (and improved), the older analysis files (.xgr and .xgr.1.1) are no longer supported by xgrail_1.2. The old analysis files can still be accessed using the previous versions, viz. xgrail and xgrail_1.1, respectively.

------------------------------------------
XGENQUEST CLIENT-SERVER SYSTEM USER MANUAL
------------------------------------------

XGENQUEST is a client-server implementation of the integrated sequence comparison system. Currently the client software has been tested on Sparc stations running Open Windows 3.0 and SunOS 4.1.3. Connection of the user's machine to the Internet is required.

Differences between XGENQUEST & GENQUEST E-mail Access
------------------------------------------------------

The GENQUEST server expects the query to be in a specified format, described above in the GENQUEST (Q) E-MAIL SERVER USER MANUAL. The XGENQUEST client software formats the query based on the options selected by the user, thus relieving the user of that responsibility.

XGENQUEST allows only a single database target to be specified in a query, whereas an e-mail query can specify multiple database targets.

XGENQUEST does not support the IBM FLASH method of sequence comparison.

File Management in XGENQUEST
----------------------------

XGENQUEST allows the user to browse the filesystem, and displays subdirectory names and all filenames with the extensions .seq (for DNA sequences) and .prt (for protein sequences). When the user selects a file for loading (by double-clicking on the filename), the sequence is displayed in a popup window.

File Format: XGENQUEST expects the .seq and .prt files to be in FASTA format; please refer to the examples included in the software distribution.
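A minimal Python sketch of reading a FASTA-format .seq or .prt file is shown below. It assumes the usual FASTA layout (a header line beginning with ">" followed by one or more sequence lines). The function name is hypothetical, and the example files in the software distribution remain the authoritative guide to what XGENQUEST accepts.

```python
# Illustrative sketch only; not part of the XGENQUEST distribution.
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines are joined; embedded spaces are dropped.
            chunks.append(line.replace(" ", "").upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

A file would be read by passing its contents to the function, for example parse_fasta(open("humvpnp.seq").read()), where the file name is taken from the example used in this manual.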
The user can save searches to disk. Each search is saved in its own file, whose name is the name of the sequence file with .gqr and the search number appended (e.g. humvpnp.seq.gqr1). The user can select a search file (using the browser) to be displayed in a popup window. The user can also delete a search file, using the Delete Search button in this popup window.

OBTAINING AND INSTALLING XGENQUEST (Version 1.1) CLIENT SOFTWARE
----------------------------------------------------------------

1. Create a subdirectory in which you wish to install XGENQUEST (Version 1.1).
       % mkdir XGENQUEST
   Go to that subdirectory
       % cd XGENQUEST

2. Obtain the XGENQUEST (ver 1.1) distribution by anonymous ftp, as follows:
       % ftp arthur.epm.ornl.gov (or ftp 128.219.9.76)
       Name: anonymous
       Password: [your internet address]
       ftp> cd pub/xgenQuest/sun/ver1.1
       ftp> binary
       ftp> get README
       ftp> get xgenQuest.sun.ver1.1.tar.Z
       ftp> quit

3. Extract the files from xgenQuest.sun.ver1.1.tar.Z
       % zcat xgenQuest.sun.ver1.1.tar.Z | tar xvf -

4. At this point, the following files should be in the XGENQUEST subdirectory:
       Manual.grail-genquest.July94    (Grail-Genquest User Manual)
       README
       testseqs                        (Subdirectory containing test sequences)
       xgenQuest_1.1
       xgenQuest.sun.ver1.1.tar.Z      (Can be deleted at this point)

5. You can start up the xgenQuest_1.1 program:
   From the command line
       % xgenQuest_1.1
   OR
   From the file manager by double-clicking on the xgenQuest_1.1 icon.

----------------
ACKNOWLEDGEMENTS
----------------

GRAIL Research and Development is supported by the Office of Health and Environmental Research, United States Department of Energy under contract No. DE-AC05-84OR21400 with Martin Marietta Energy Systems, Inc.
DATABASES
---------

We thank the administrators of the following databases:
       SWISS-PROT (Bairoch and Boeckmann, 1992)
       PDB (Brookhaven National Laboratory)
       PROSITE (Bairoch, 1993)
       GSDB (Bilofsky and Burks, 1988)
       BLOCKS/BLIMPS (Henikoff and Henikoff, 1991)
       DBEST (Boguski et al., 1993)
       HUMAN REPETITIVE DNA (Jurka, 1990; Jurka, Walichiewicz and Milosavljevic, 1992; Jurka et al., 1993)

METHODS
-------

We thank the authors of the following methods:
       FASTA (Pearson and Lipman, 1988)
       BLAST (Altschul et al., 1990)
       Smith-Waterman (Smith and Waterman, 1981)

SERVERS
-------

We thank IBM T. J. Watson Research Center for the use of their dFLASH server.

----------------
SOFTWARE SUPPORT
----------------

A copy of this Manual can be obtained by sending a message to GRAIL@ornl.gov or Q@ornl.gov with the word HELP on the subject line or on the first text line.

Questions, Suggestions and Help: If you have any questions or suggestions, or need further help with any GRAIL system, please send e-mail to GRAILMAIL@ornl.gov.

------------------
GRAIL PUBLICATIONS
------------------

1. E. C. Uberbacher and R. J. Mural, "Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach," Proc. Natl. Acad. Sci. USA, vol. 88, pp. 11261-11265 (December 1991).

2. R. J. Mural, J. R. Einstein, X. Guan, R. C. Mann and E. C. Uberbacher, "An Artificial Intelligence Approach to DNA Sequence Feature Recognition," TIBTECH, Vol 10 (Jan-Feb 1992).

3. X. Guan, R. J. Mural, J. R. Einstein, R. C. Mann and E. C. Uberbacher, "GRAIL: An Integrated Artificial Intelligence System for Gene Recognition and Interpretation," Proc., The Eighth IEEE Conference on AI Applications, pp. 9-13 (1992).

4. E. C. Uberbacher, J. R. Einstein, X. Guan and R. J. Mural, "Gene Recognition and Assembly in the GRAIL system: Progress and Challenges," Proceedings of the Second International Conference on Bioinformatics, Supercomputing, and Complex Genome Analysis, eds. Lim, H. A., Fickett, J. W., Cantor, C. R. and Robbins, R. J. (World Sci., USA), pp. 465-476 (June 1992).

5. Y. Xu, R. J. Mural, M. B. Shah and E. C. Uberbacher, "Recognizing Exons in Genomic Sequence Using GRAIL II," Genetic Engineering: Principles and Methods, Jane Setlow (Ed.), Plenum Press, Vol 15 (June 1994). (In press)

6. Y. Xu, J. R. Einstein, R. J. Mural, M. B. Shah and E. C. Uberbacher, "An Improved System for Exon Recognition and Gene Modeling in Human DNA Sequences," Proceedings of The 2nd International Conference on Intelligent Systems for Molecular Biology, AAAI Press (August 1994). (In press)

7. Y. Xu, R. J. Mural and E. C. Uberbacher, "Constructing Gene Models from Accurately-predicted Exons: An Application of Dynamic Programming," CABIOS (In press).

8. S. Matis, R. J. Mural, M. B. Shah and E. C. Uberbacher, "An Artificial Intelligence Method for Locating Promoters in Human DNA Sequences," To be submitted to Nucleic Acids Research.

9. M. B. Shah, X. Guan, J. R. Einstein, S. Matis, Y. Xu, R. J. Mural and E. C. Uberbacher, "User's Guide to GRAIL and GENQUEST (Sequence Analysis, Gene Assembly And Sequence Comparison Systems) E-mail Servers and XGRAIL (Version 1.2) and XGENQUEST (Version 1.1) Client-Server Systems," Available by anonymous ftp to arthur.epm.ornl.gov (128.219.9.76) from directory pub/xgrail or pub/xgenQuest as file Manual.grail-genquest.July94 (July 1994).

----------
REFERENCES
----------

[1] Bairoch, A. and B. Boeckmann. 1992. Nucl. Acids Res., 20: 2019-2022.

[2] Bairoch, A. 1993. Nucl. Acids Res., 21: 3097-3103.

[3] Jurka, J., Walichiewicz, J. and A. Milosavljevic. 1992. J. Mol. Evol., 35: 286-291.

[4] Abola, E. E., Bernstein, F. C., Bryant, S. H., Koetzle, T. F. and J. Weng. 1987. Protein data bank. pp. 107-132 in "Crystallographic Databases - Information Content, Software Systems, Scientific Applications," F. H. Allen, G. Bergerhoff and R. Sievers, eds. Data Commission of the International Union of Crystallography, Cambridge.

[5] Smith, T. F. and M. Waterman. 1981. Advan. Appl. Math., 2: 482-489.

[6] Pearson, W. R. and D. J. Lipman. 1988. Proc. Natl. Acad. Sci. USA, 85: 2444-2448.

[7] Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and D. J. Lipman. 1990. J. Mol. Biol., 215: 403-410.

[8] Dayhoff, M. O., Schwartz, R. M. and B. C. Orcutt. 1978. In "Atlas of Protein Sequences and Structure," (Dayhoff, M. O. ed) Vol. 5, Suppl. 3, pp. 345-352. Nat. Biomed. Res. Found., Washington, D. C.

[9] Henikoff, S. and J. G. Henikoff. 1992. Proc. Natl. Acad. Sci. USA, 89: 10915-10919.

[10] Claverie, J-M. and D. J. States. 1993. Computers Chem., 17: 191-201.

[11] Califano, A. and I. Rigoutsos. 1993. In: "Proceedings of the First International Conference on Intelligent Systems for Molecular Biology," July, 1993, Bethesda, MD.