File: README

*|***************************************************************************|*
*|                                                                           |*
*|   Programs: phd2seqfasta and phd2qualfasta                                |*
*|   Version: 0.960108                                                       |*
*|                                                                           |*
*|   Copyright (C) 1995-1997 by Phil Green and Brent Ewing.                  |*
*|   All rights reserved.                                                    |*
*|                                                                           |*
*|   This software is a beta-test version of the phd2fasta                   |*
*|   program set.                                                            |*
*|   It should not be redistributed or used for any commercial               |*
*|   purpose, including commercially funded sequencing, without              |*
*|   written permission from the author and the University of                |*
*|   Washington.                                                             |*
*|                                                                           |*
*|   This software is provided ``AS IS'' and any express or                  |*
*|   implied warranties, including, but not limited to, the                  |*
*|   implied warranties of merchantability and fitness for a                 |*
*|   particular purpose, are disclaimed.  In no event shall                  |*
*|   the authors or the University of Washington be liable for               |*
*|   any direct, indirect, incidental, special, exemplary, or                |*
*|   consequential damages (including, but not limited to,                   |*
*|   procurement of substitute goods or services; loss of use,               |*
*|   data, or profits; or business interruption) however caused              |*
*|   and on any theory of liability, whether in contract, strict             |*
*|   liability, or tort (including negligence or otherwise)                  |*
*|   arising in any way out of the use of this software, even                |*
*|   if advised of the possibility of such damage.                           |*
*|                                                                           |*
*|***************************************************************************|*

This document discusses the programs "phd2seqfasta" and "phd2qualfasta".  You
can avoid dealing directly with these programs by using the perl script
"phred_and_phrap.perl".  See the documentation supplied with "consed".


Program: phd2seqfasta

  The program "phd2seqfasta" reads PHD files, extracts the sequences, and creates
  a single FASTA file containing the sequences.  The PHD files may be created
  either by Phil Green's base calling program "phred" or by the sequence editing
  program "consed".  Each read may be represented by one or more versions of
  a PHD file in the same directory as follows.  Initially, "phred" creates the
  first version of a PHD file, whose name ends with ".phd.1".  Subsequently,
  "consed" may write edited versions of the PHD file, whose names end with a
  digit that indicates the version number; for example, ".phd.2".  The program
  "phd2seqfasta" finds the highest numbered version of each PHD file that lies
  within the directory specified on the command line, extracts the sequence from
  each file, and builds a single FASTA file wherein each sequence begins with
  a FASTA header followed by the sequence of called bases.  The name of the
  output sequence FASTA file is specified as the second command line parameter.
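
  The version selection described above can be sketched in Python.  This is
  an illustration only, not the source of "phd2seqfasta".  The function name
  "latest_phd_files" is hypothetical, and the sketch assumes that file names
  have the form "<read>.phd.<version>" and that versions compare numerically.

```python
import os
import re
import tempfile

# Assumed naming convention, taken from the text above: "<read>.phd.<version>".
PHD_NAME = re.compile(r"^(?P<read>.+)\.phd\.(?P<version>\d+)$")


def latest_phd_files(phd_dir):
    """Map each read name to the file name of its highest-numbered PHD version."""
    best = {}  # read name -> (version number, file name)
    for name in os.listdir(phd_dir):
        match = PHD_NAME.match(name)
        if match is None:
            continue  # not named like a PHD file; skip it
        read = match.group("read")
        version = int(match.group("version"))  # numeric, so 10 beats 9
        if read not in best or version > best[read][0]:
            best[read] = (version, name)
    return {read: name for read, (_version, name) in best.items()}


if __name__ == "__main__":
    # Build a throwaway directory with made-up file names and list the winners.
    with tempfile.TemporaryDirectory() as phd_dir:
        for name in ("readA.phd.1", "readA.phd.2", "readB.phd.1", "notes.txt"):
            open(os.path.join(phd_dir, name), "w").close()
        for read, name in sorted(latest_phd_files(phd_dir).items()):
            print(read, name)
```

  Only the file chosen for each read would then be opened and converted.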

  Usage:

  % phd2seqfasta <phd dirname> <seq filename>

  <phd dirname>			name of the directory containing the PHD
				files to process.

  <seq filename>		name of the output sequence FASTA file.


Program: phd2qualfasta

  The program "phd2qualfasta" reads PHD files, extracts the quality data, and
  creates a single FASTA file containing the quality data.  The PHD files may be
  created either by Phil Green's base calling program "phred" or by the sequence
  editing program "consed".  The program "phd2qualfasta" finds the highest
  numbered version of each PHD file that lies within the directory specified on
  the command line, extracts the quality information from each file, and builds
  a single FASTA file wherein the quality data for each sequence begins with a
  FASTA header followed by a list of the quality scores.  The name of the output
  quality FASTA file is specified as the second command line parameter.
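
  The conversion from one PHD file to a FASTA record can be sketched in
  Python.  This is an illustration only, not the source of either program.
  It assumes the usual PHD layout, which this document does not describe: a
  "BEGIN_SEQUENCE <name>" line, and base calls between "BEGIN_DNA" and
  "END_DNA", one per line, as base, quality, and trace position.  The names
  "parse_phd" and "fasta_record" and the line width of 50 are hypothetical.

```python
def parse_phd(text):
    """Return (read name, bases, qualities) from the text of one PHD file."""
    name, bases, quals = None, [], []
    in_dna = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # blank line
        if fields[0] == "BEGIN_SEQUENCE" and len(fields) > 1:
            name = fields[1]
        elif fields[0] == "BEGIN_DNA":
            in_dna = True
        elif fields[0] == "END_DNA":
            in_dna = False
        elif in_dna and len(fields) >= 2:
            bases.append(fields[0])       # called base
            quals.append(int(fields[1]))  # quality score for that base
    return name, "".join(bases), quals


def fasta_record(name, items, sep, per_line):
    """Format a FASTA header followed by the items, per_line items per line."""
    lines = [">" + name]
    for start in range(0, len(items), per_line):
        lines.append(sep.join(items[start:start + per_line]))
    return "\n".join(lines) + "\n"


# Made-up PHD content, for illustration only.
EXAMPLE_PHD = """\
BEGIN_SEQUENCE read1
BEGIN_DNA
a 10 3
c 25 15
g 40 27
END_DNA
END_SEQUENCE
"""

if __name__ == "__main__":
    name, bases, quals = parse_phd(EXAMPLE_PHD)
    # Sequence record: bases written with no separator.
    print(fasta_record(name, list(bases), "", 50), end="")
    # Quality record: scores written as space-separated integers.
    print(fasta_record(name, [str(q) for q in quals], " ", 50), end="")
```

  Both records carry the same header, so a sequence and its quality scores
  can be matched by name.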

  Usage:

  % phd2qualfasta <phd dirname> <qual filename>

  <phd dirname>			name of the directory containing the PHD
				files to process.

  <qual filename>		name of the quality FASTA output file.


Notes:

  1.  The sequence and quality FASTA files are used as input to Phil Green's
      sequence comparison program called "cross_match" and his sequence
      assembly program called "phrap".  An example of the data flow begins
      with the transfer of the chromat files from the ABI Macintosh to a
      UNIX workstation.  Subsequently, the bases are called using "phred",
      the vector sequence is screened out using "cross_match", and the reads
      are assembled using "phrap".  A more detailed summary of the data flow
      consists of the following steps:

                        o create a directory called "chromat_dir"
                          for the chromat files and a directory
                          called "phd_dir" for the PHD files on
                          the UNIX workstation where further
                          processing will occur using the command

                              % mkdir chromat_dir phd_dir

                        o transfer chromat files from the ABI Macintosh
                          to the directory named "chromat_dir" using
                          the Macintosh program "fetch".

                        o run "phred" in the directory above "chromat_dir"
                          and store the PHD files that "phred" creates in
                          the directory "phd_dir" using the command

                             % phred -id chromat_dir -pd phd_dir

                        o run phd2seqfasta with the command line parameters

                             % phd2seqfasta phd_dir cosmid_fasta

                          to create a sequence FASTA file called "cosmid_fasta",
                          which can be read by "cross_match" and "phrap".

                        o run phd2qualfasta with the command line parameters

                             % phd2qualfasta phd_dir cosmid_fasta.qual

                          to create a quality FASTA file, which can be read
                          by "cross_match" and "phrap".

                        o run "cross_match" to screen vector sequence out
                          of the sequence FASTA file using the command

                             % cross_match cosmid_fasta vector.seq -minmatch 12 \
                               -penalty -2 -minscore 20 -screen > screen.out

                          to create the vector screened FASTA file named
                          "cosmid_fasta.screen".

                        o run "phrap" to assemble the reads into contigs
                          using the command

                             % phrap cosmid_fasta.screen -ace > phrap.out

                        
      A significant improvement in assembly is gained by using "phrap" with
      the base quality evaluation performed by "phred".
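
      The command sequence above can be collected into a small Python
      driver.  This is a sketch only: the function names are hypothetical,
      the commands and file names are copied from the steps above, and it
      assumes that the programs are on the search path and that
      "chromat_dir" and "phd_dir" already exist with the chromat files
      transferred.

```python
import subprocess


def pipeline_commands(chromat_dir="chromat_dir", phd_dir="phd_dir",
                      fasta="cosmid_fasta", vector="vector.seq"):
    """Return (argument list, stdout file or None) for each processing step."""
    return [
        (["phred", "-id", chromat_dir, "-pd", phd_dir], None),
        (["phd2seqfasta", phd_dir, fasta], None),
        (["phd2qualfasta", phd_dir, fasta + ".qual"], None),
        (["cross_match", fasta, vector, "-minmatch", "12", "-penalty", "-2",
          "-minscore", "20", "-screen"], "screen.out"),
        (["phrap", fasta + ".screen", "-ace"], "phrap.out"),
    ]


def run_pipeline(commands):
    """Run each step in order, stopping at the first one that fails."""
    for argv, stdout_path in commands:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            # Same effect as "> file" in the shell commands above.
            with open(stdout_path, "w") as out:
                subprocess.run(argv, check=True, stdout=out)


if __name__ == "__main__":
    # Dry run: print the commands rather than executing them.
    for argv, stdout_path in pipeline_commands():
        redirect = "" if stdout_path is None else " > " + stdout_path
        print(" ".join(argv) + redirect)
```

      Calling run_pipeline(pipeline_commands()) would execute the steps;
      the dry run only prints them for inspection.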

  2.  The program "consed" allows one to examine the assembled sequence, to
      examine the individual reads as they were aligned by "phrap" in the
      assembly, to examine the trace data corresponding to any base used by
      "phrap" in the assembly, and to edit the base calls.  "consed" requires
      a ".ace" file generated by "phrap" during the assembly and the PHD files
      and the ABI chromat files for the reads used in the assembly.

  3.  "cross_match" and "phrap" are available from Phil Green:

                        Phil Green
                        Department of Molecular Biotechnology
                        University of Washington
                        Box 357730
                        Seattle, WA   98195-7730
                        phg@u.washington.edu


      "consed" is available from David Gordon:

                        David Gordon
                        Department of Molecular Biotechnology
                        University of Washington
                        Box 352145
                        Seattle, WA   98195-7730
                        gordon@mbt.washington.edu


      "phred", "phd2seqfasta", and "phd2qualfasta" are available
      from Brent Ewing:

                        Brent Ewing
                        Department of Molecular Biotechnology
                        University of Washington
                        Box 357730
                        Seattle, WA   98195-7730
                        bge@u.washington.edu


End: README