Database of p53 somatic mutations in human tumors and cell lines
----------------------------------------------------------------

Release July 1995


M. Hollstein, C. Rice, M.S. Greenblatt, T. Soussi, R. Fuchs, T. Sorlie, E.
Hovig, B. Smith-Sorensen, R. Montesano and C.C. Harris

German Cancer Research Center, Heidelberg (MH), International Agency for
Research on Cancer, Lyon (TS, RM), EMBL Heidelberg (CR, RF), Hopital Saint
Louis, Paris (TS), National Cancer Institute, Bethesda (MSG, CCH), Norwegian
Radium Hospital, Oslo (EH, BSS).

Nucleic Acids Research, submitted.

----------------------------------------------------------------------

Release Information
-------------------

The July 1995 release of the database contains 4496 entries.


Content of the database
-----------------------

This database is a compilation of p53 mutations in human tumor cells and cell
lines from a systematic search of reports published before 1 January 1994. These
mutations were identified by DNA sequencing of PCR-amplified material or cloned
PCR products. Preliminary screening for mutations by techniques such as those
employing SSCP or DGGE/CDGE (reviewed in Rossiter & Caskey, 1990; Grompe, 1993)
was often performed. Most analyses were confined to exons 5-8, since early
studies noted that mutations occurred primarily in this evolutionarily conserved 
midregion. A bias against identification of DNA sequence alterations outside
this mutation cluster region can thus be expected. If the same mutations were
published in more than one article, only one report is referenced, either the
first or the most complete report, and the data are only entered once in the
database. If the identical mutation was found in two separate samples from the
same patient, for example in the primary tumor and in the metastatic tissue, the
mutation is considered to be a single event and is entered only once. Tandem
mutations, i.e. two adjacent base substitutions, are also considered as one
mutation event and are entered together; therefore there will be only one
identification number (see below) for this mutation pair. Discrepancies in
published reports that are clearly due to typographical errors or that can be
explained by other information in the publication have been corrected. In this
case, or if there are uncorrected errors or ambiguities regarding a mutation
record, the letter 'e' appears in column M (see below). Information that does
not permit us to identify the nature and location of the mutation has not been
entered. Mutations found by digestion of DNA with a restriction enzyme and
demonstration of an RFLP are not entered; however, publications reporting such
data will be cited in the electronic version as a second appendix. Mutations
identified in tumors are presumed to be somatic unless 1) analysis of normal
tissue from the same patient demonstrated that the mutation was constitutional
in that individual, or 2) the mutation corresponded to one of the known
constitutional polymorphisms of the human p53 gene (at codons 21, 31, 47, 72,
and 213), as these are unlikely to be mutations that arose in the tumors.
Germline mutations, including those identified in families with the Li-Fraumeni
cancer syndrome, are not in this database.

Distribution formats
--------------------

The data are provided to the scientific community in two different formats.
First, the database is available as an Excel spreadsheet which requires the use
of the Microsoft Excel program on either an MS-DOS system or an Apple
Macintosh. Second, the data have been converted into a flatfile format modeled
on the standard used by the EMBL nucleotide sequence database. In this format
the data are stored in a normal text file with each column of the spreadsheet
represented by a special line type. The flatfile format can be used on any
computer system and with standard text editors. The database can be obtained
from the EBI network servers in the following ways:

* Anonymous ftp to:  ftp.ebi.ac.uk   under   /pub/databases/p53

* Through the WWW server at:  http://www.ebi.ac.uk/

* Through gopher at:  gopher.ebi.ac.uk  (port 70)

* By email to:  netserv@ebi.ac.uk  (include the line "help p53" in the message)

Excel format description
------------------------

Each row represents a single tumor mutation with an arbitrarily assigned unique
number in column A.

The columns contain the following information and abbreviations:

Column A:
        Unique mutation identification number

Column B:
        Codon number at which the mutation is located (1-393). If a tandem
dinucleotide mutation spans two codons, both codons are entered. If other
mutations span more than one codon, e.g. there is a deletion of several bases,
only the first (5') codon is entered. If the mutation is located in an intron
sequence, this is indicated by 'intron' and the intron number.

Column C:
        Normal and mutated base sequence of the codon in which the mutation
occurred. If the mutation is a base pair deletion or insertion this is indicated
by 'del' or 'ins'.

Column D:
        Nucleotide position at which the mutation is located (1-1179), numbered
from the ATG to the termination codon.

Column E:
        Base change, read from the coding strand by convention, for base
substitutions. For deletions (indicated by '-') and insertions (indicated by
'+') the number of bases deleted or inserted is given in parentheses.

Column F:
        The name or number given by the authors to the tumor sample or cell line
is entered here. If the name is not sufficiently distinctive, e.g. if the
publication refers to samples 1,2,3, etc., then we have assigned a name,
usually the first letters of the first author's name, followed by the number in
the series. If more than one mutation has been found in the same sample, the
tumor name in column F is suffixed with an apostrophe.

Column G:
        Anatomical site or type of the tumor as described in the publication
cited. Abbreviations used in this column are: HCC, hepatocellular carcinoma;
Leuk/Lym, leukemias and lymphomas.

Column H:
        Reference number (1-312). The full citation is given as a separate file.

Column I:
        This column contains notes regarding the tumor or the patient, such as
histological type of tumor, exposure history or other clinical parameters
emphasized by authors reporting the mutations. The terminology used by the
authors has been retained and therefore may not be uniform. Pre-cancer lesions
are also included, e.g. dysplasia.
        Abbreviations of tumor subtype or cell type are as follows: SCLC, small
cell lung cancer; adenoca, adenocarcinoma; osteo, osteosarcoma; rhab,
rhabdomyosarcoma; leiomyo, leiomyosarcoma; eryth, erythroleukemia; medull,
medulloblastoma; SCC, squamous cell carcinoma; TCC, transitional cell carcinoma;
hypoph, hypopharynx; NPC, nasopharyngeal carcinoma. For abbreviations of
leukemia and lymphoma subclassifications, e.g., ATL (adult T-cell leukemia),
refer to cited reference. Uniformity of these abbreviations in the different
reports has not been verified.
        Other abbreviations: UC, ulcerative colitis; FAP, familial adenomatous
polyposis; XP, xeroderma pigmentosum; HPV, tumor harbors human papilloma virus
DNA (HPV+), or lacks virus DNA (HPV-); diff or undiff, (un)differentiated tumor;
CIS, carcinoma in situ; premal, premalignant.
        Other information: 1) "metastatis" specifies that the DNA analyzed for
the mutation was obtained from metastatic tissue. The primary tumor is in column
G. 2) exposure history: tobacco smoke; radon gas.

Column J:
        An entry 'L' indicates the material examined was from a tumor cell line.
If there is no entry, the material is from tumor tissue or biopsy (most
instances), or xenograft, or unspecified.

Column K:
        Mutations that are transitions of CpG dinucleotides, i.e. CpG to TpG or
CpG to CpA, are designated by 'yes'. If there is no entry, the mutation does not
fall into this category.

Column L:
        Amino acid substitution. Chain termination mutations due to single base
substitutions are designated by '(amino acid)->stop'. Frameshift mutations are
designated by 'frameshift', whereas in-frame deletions and insertions are
designated 'deletion' or 'insertion'. Mutations that do not result in an amino
acid change are designated by 'silent'. Mutations that occurred in intron
sequences are indicated by the term 'splicing' even though in most instances it
was not determined whether splicing errors did result from the mutation; some of
these mutations are likely to be phenotypically silent.

Column M:
        If the information on the nature or location of the mutation in the
reference is ambiguous or contradictory, the letter 'e' appears in this column.
Where possible we have made a presumptive correction of the published
discrepancy in the database entry.
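
The rule for column K can be checked mechanically once the bases next to the
mutated position are known. The following Python function is a minimal sketch
of that check, not the procedure used to build the database. The function name
and its arguments are hypothetical, and the sequence passed in is assumed to be
the coding strand, which is not distributed with this file.

```python
def is_cpg_transition(sequence, index, mutated_base):
    """Return True for CpG->TpG or CpG->CpA, read on the coding strand.

    sequence     -- coding-strand bases surrounding the mutation (assumed input)
    index        -- 0-based index of the mutated base within `sequence`
    mutated_base -- the base found in the tumor at that position
    """
    seq = sequence.upper()
    ref = seq[index]
    alt = mutated_base.upper()
    next_base = seq[index + 1] if index + 1 < len(seq) else ""
    prev_base = seq[index - 1] if index > 0 else ""

    # CpG -> TpG: the C of a CpG dinucleotide is replaced by T.
    if ref == "C" and alt == "T" and next_base == "G":
        return True
    # CpG -> CpA: the G of a CpG dinucleotide is replaced by A.
    if ref == "G" and alt == "A" and prev_base == "C":
        return True
    return False
```

For a codon change CCG->CTG, the mutated C (index 1 of "CCG") is followed by G
inside the same codon, so is_cpg_transition("CCG", 1, "T") is true. When the
mutated base is the first or last base of a codon, the neighbouring codon must
be included in the sequence for the check to be meaningful.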

Examples:
A     B    C         D   E     F       G        H    I    J  K    L         M
466   7    GAT->CAT  19  G->C  N16     Skin     54   SCC          Asp->His
1207  152  CCG->CTG  455 C->T  HTC/C3  Thyroid  149       L  yes  Pro->Leu
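
For single base substitutions, the values in columns D and L follow from
columns B and C. The Python sketch below illustrates that relationship under
two assumptions: position 1 is the A of the initiating ATG, and the standard
genetic code applies. All names are hypothetical. Frameshifts, deletions,
insertions, intron mutations and tandem mutations spanning two codons are not
handled.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "stop",
}


def nucleotide_position(codon, normal, mutated):
    """Column D: coding-sequence position of the single substituted base."""
    if len(normal) != 3 or len(mutated) != 3:
        raise ValueError("codons must have three bases")
    diffs = [i for i in range(3) if normal[i].upper() != mutated[i].upper()]
    if len(diffs) != 1:
        raise ValueError("expected exactly one substituted base")
    # Codon n covers positions 3*(n-1)+1 .. 3*n.
    return 3 * (codon - 1) + diffs[0] + 1


def amino_acid_change(normal, mutated):
    """Column L label for a base substitution within one codon."""
    before = THREE_LETTER[CODON_TABLE[normal.upper()]]
    after = THREE_LETTER[CODON_TABLE[mutated.upper()]]
    if before == after:
        return "silent"
    return before + "->" + after
```

Applied to the two example rows, nucleotide_position(7, "GAT", "CAT") gives 19
and amino_acid_change("GAT", "CAT") gives "Asp->His"; for codon 152,
"CCG" and "CTG" give position 455 and "Pro->Leu".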


Flatfile format
---------------

Each database entry consists of a series of lines, each one tagged by a
two-character identifier separated from the text of the line by three blanks.
The mapping of line contents to the columns in the Excel format is indicated.

ID (1 per entry)
        IDentifier; contains mutation id (column A)

DC (0 or 1 per entry)
        Data Correctness. If the report is ambiguous or incorrect (column M),
the line "DC   ambiguous" is added to the entry.

CD (1 per entry)
        CoDon change; this line has three semicolon-separated fields of closely
related information, terminated by a period: the codon number (column B); the
codon change (column C); the amino acid change (column L). If any field is not
known, a question mark is substituted.

BC (1 per entry)
        Base Change; this line has two semicolon-separated fields of closely
related information, terminated by a period: the nucleotide position (column D);
the base change (column E). If any field is not known, a question mark is
substituted.

CT (0 or 1 per entry)
        CpG Transition; optional line. If a CpG transition occurred (column K),
the line "CT   yes" appears.

TS (0 or 1 per entry)
        Tumor Specifics; this line has three semicolon-separated fields of
closely related information, terminated by a period: the tumor name (column F);
the tumor source (column G); tumor cell line (column J). If any field is not
known, a question mark is substituted. If the source is a tumor cell line, the
third field is "Y", otherwise "N".

CC (0 or 1 per entry)
        Comments; allows free text comments and keeps contents of column I.

RP (1 per entry)
        Reference Pointer; contains cross-reference to literature reference file
(column H).

// (1 per entry)
        Marks end of entry.
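
The mapping from spreadsheet columns to flatfile lines can be sketched in
Python as follows. This is an illustration only, not the conversion program
used to produce the distributed flatfile. The function name and the dictionary
keyed by column letter are hypothetical. The reference line is written with the
tag RP, as in the example entries, and a TS line is always written, which is an
assumption.

```python
def format_entry(row):
    """Render one spreadsheet row (dict keyed 'A'..'M') as a flatfile entry."""

    def field(key):
        # Unknown or empty fields are written as a question mark.
        value = row.get(key)
        text = "" if value is None else str(value).strip()
        return text if text else "?"

    lines = ["ID   " + field("A")]
    if row.get("M"):                       # column M holds 'e' when ambiguous
        lines.append("DC   ambiguous")
    lines.append("CD   {}; {}; {}.".format(field("B"), field("C"), field("L")))
    lines.append("BC   {}; {}.".format(field("D"), field("E")))
    if row.get("K"):                       # column K holds 'yes' for CpG
        lines.append("CT   yes")
    cell_line = "Y" if row.get("J") else "N"   # column J holds 'L' for lines
    lines.append("TS   {}; {}; {}.".format(field("F"), field("G"), cell_line))
    if row.get("I"):
        lines.append("CC   " + str(row["I"]).strip())
    lines.append("RP   " + field("H"))
    lines.append("//")
    return "\n".join(lines)
```

A row given as {"A": 466, "B": 7, "C": "GAT->CAT", "D": 19, "E": "G->C",
"F": "N16", "G": "Skin", "H": 54, "I": "SCC", "L": "Asp->His"} is rendered as
an entry beginning with "ID   466" and ending with "//".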

Examples:
ID   466
CD   7; GAT->CAT; Asp->His.
BC   19; G->C.
TS   N16; Skin; N.
CC   SCC
RP   54
//
ID   1207
CD   152; CCG->CTG; Pro->Leu.
BC   455; C->T.
CT   yes
TS   HTC/C3; Thyroid; Y.
RP   149
//
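
Entries in this format can be read with a few lines of code. The Python
function below is a minimal sketch of such a reader, written for illustration
and not distributed with the database. The function name and the field names
given to the semicolon-separated parts are hypothetical, and the sketch assumes
that no field value itself contains a semicolon.

```python
# Names for the semicolon-separated fields of the multi-field line types.
MULTI_FIELD = {
    "CD": ("codon", "codon_change", "amino_acid_change"),
    "BC": ("position", "base_change"),
    "TS": ("tumor_name", "tumor_source", "cell_line"),
}


def parse_flatfile(text):
    """Parse flatfile text into a list of dicts, one per entry."""
    entries = []
    current = {}
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        if line.startswith("//"):          # end-of-entry marker
            if current:
                entries.append(current)
            current = {}
            continue
        tag = line[:2]                     # two-character line identifier
        value = line[2:].strip()           # text after the blanks
        if tag in MULTI_FIELD:
            if value.endswith("."):        # drop the terminating period
                value = value[:-1]
            parts = [p.strip() for p in value.split(";")]
            current[tag] = dict(zip(MULTI_FIELD[tag], parts))
        else:
            current[tag] = value
    return entries


SAMPLE = """ID   466
CD   7; GAT->CAT; Asp->His.
BC   19; G->C.
TS   N16; Skin; N.
CC   SCC
RP   54
//
ID   1207
CD   152; CCG->CTG; Pro->Leu.
BC   455; C->T.
CT   yes
TS   HTC/C3; Thyroid; Y.
RP   149
//
"""

entries = parse_flatfile(SAMPLE)
```

Here entries holds two dictionaries. Optional lines that are absent from an
entry, such as DC, CT or CC, are simply missing from its dictionary, and all
values are kept as text.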


Updates
-------

This compilation of p53 mutations is intended to provide the scientific
community with a database of rapidly accumulating data that can be useful to
various disciplines in cancer research, including epidemiology, medicine and
basic science. Future versions of the database may include separate sections on
germline mutations, mutations detected by RFLP, anamnestic data on patients, and
standardization of terminology with the International Classification of
Diseases for Oncology (ICD-O). Notifications of omissions and errors in the
current version would be gratefully received by the authors. When individual
records in the present version require correction, they will be revised and the
date of revision will be noted in a new column, column N. Data published in the
first six months of 1994 will be added at regular intervals during the second
half of the year.


References
----------

Caron de Fromentel, C. and Soussi, T. (1992) Genes Chromosomes Cancer 4, 1-15.
Donehower, L.A. and Bradley, A. (1993) Biochim. Biophys. Acta 1155, 181-207.
Greenblatt, M.S., Bennett, W.P., Hollstein, M.C. and Harris, C.C. (1994) Cancer
        Res. (in press).
Grompe, M. (1993) Nature Genet. 5, 111-117.
Harris, C.C. and Hollstein, M. (1993) New Engl. J. Med. 329, 1318-1327.
Hollstein, M.C., Sidransky, D., Vogelstein, B. and Harris, C.C. (1991) Science
        253, 49-53.
Jones, P.A., Buckley, J.D., Henderson, B.E., Ross, R.K. and Pike, M.C. (1991)
        Cancer Res. 51, 3617-3620.
Kunkel, T.A. (1990) Biochemistry 29, 8003-8011.
Kunkel, T.A. (1993) Nature 365, 207-208.
Levine, A.J. (1993) Annu. Rev. Biochem. 62, 623-651.
Lindahl, T. (1993) Nature 362, 709-715.
Mellon, I. and Hanawalt, P.C. (1989) Nature 342, 95-98.
Rice, C., Fuchs, R., Higgins, D.G., Stoehr, P.S. and Cameron, G.N. (1993) Nucl.
        Acids Res. 21, 2967-2971.
Rossiter, B.J.F. and Caskey, C.T. (1990) J. Biol. Chem. 265, 12753-12756.
Selby, C.P. and Sancar, A. (1993) Science 260, 53-58.
Takeshima, S., Seyama, T., Bennett, W.P. et al. (1991) Lancet 342, 1520-1521.