The compilers of the sequence databases cooperate extensively: EMBL, DDBJ (DNA
DataBank of Japan), and GenBank exchange new sequences
daily. The vast majority of the sequences in GenBank are also in EMBL.
The DNA databases, in particular, hold identical information for each sequence
but organise it differently. Compare the header information for the
HSHEPSH sequence as stored in EMBL vs. GenBank.
In the HSHEPSH entry, the accession number is
X07732. It also has a secondary accession number, M18930,
probably indicating that another sequence was combined with HSHEPSH.
Each sequence database has a corresponding data library,
usually named after the database. For example, EMBL,
SwissProt, and GenBank are the names of databases, and are
also the logical names of E/GCG data
libraries. The GenEMBL data library represents a fusion of EMBL with
GenBank.
All these data library names have short forms to save typing:
em refers to EMBL, gb refers to GenBank, ge refers to GenEMBL, etc.
To specify a particular sequence in a particular data library, you give the
logical name (or short form) of the data library together with the sequence
identifier, separated by a colon. "gb:humrep2" specifies the
humrep2 sequence from GenBank.
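The `library:identifier` convention can be illustrated with a small sketch. The function name `parse_spec` is hypothetical and not part of E/GCG; the short forms are only the three named in the text, and any other library name is passed through unchanged.

```python
# Short forms named in the text; this is not the complete E/GCG list.
SHORT_FORMS = {"em": "EMBL", "gb": "GenBank", "ge": "GenEMBL"}


def parse_spec(spec):
    """Split 'library:identifier' into (library name, identifier).

    Hypothetical helper for illustration only.
    """
    library, sep, identifier = spec.partition(":")
    if not sep or not library or not identifier:
        raise ValueError(f"expected 'library:identifier', got {spec!r}")
    # Expand a known short form; leave anything else as typed.
    return SHORT_FORMS.get(library.lower(), library), identifier


print(parse_spec("gb:humrep2"))
```

With the mapping above, `gb:humrep2` resolves to the humrep2 entry in GenBank, matching the example in the text.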
Databases Available
The most commonly used sequence databases can be accessed from within the
E/GCG packages. Databases are regularly updated where possible.
Nucleic Acid Sequences
Peptide Sequences
Other
Sequence Formats
Each sequence database has its own distinctive format, and all database formats
are different in detail from the E/GCG sequence file format. Broadly
speaking, though, ALL sequence files consist of commentary (header information),
followed by sequence data. This similarity makes the inter-conversion of
sequences relatively straightforward.
EMBL Format
ID HSHEPSH standard; RNA; PRI; 2363 BP.
XX
AC X07732; M18930;
XX
DT 16-JUL-1988 (Rel. 16, Created)
DT 22-SEP-1995 (Rel. 45, Last updated, Version 9)
XX
DE Human hepatoma mRNA for serine protease hepsin
XX
KW hepsin; membrane protein; serine protease; zymogen.
XX
OS Homo sapiens (human)
OC Eukaryota; Animalia; Metazoa; Chordata; Vertebrata; Mammalia;
OC Theria; Eutheria; Primates; Haplorhini; Catarrhini; Hominidae.
...
GenBank Format
LOCUS HSHEPSH 2363 bp RNA PRI 22-SEP-1995
DEFINITION Human hepatoma mRNA for serine protease hepsin.
ACCESSION X07732 M18930
KEYWORDS hepsin; membrane protein; serine protease; zymogen.
SOURCE human.
ORGANISM Homo sapiens
Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
Vertebrata; Sarcopterygii; Mammalia; Eutheria; Primates;
Catarrhini; Hominidae; Homo.
...
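The two headers carry the same accession numbers in different layouts: EMBL uses two-letter line codes (`AC`), while GenBank uses keywords (`ACCESSION`). The sketch below, which assumes single-space separation as in the listings here and uses hypothetical function names, pulls the accession numbers out of each header so they can be compared.

```python
EMBL_HEADER = """\
ID HSHEPSH standard; RNA; PRI; 2363 BP.
XX
AC X07732; M18930;
XX
DE Human hepatoma mRNA for serine protease hepsin
"""

GENBANK_HEADER = """\
LOCUS HSHEPSH 2363 bp RNA PRI 22-SEP-1995
DEFINITION Human hepatoma mRNA for serine protease hepsin.
ACCESSION X07732 M18930
"""


def embl_accessions(header):
    """Collect accession numbers from EMBL 'AC' lines (semicolon-separated)."""
    found = []
    for line in header.splitlines():
        code, _, rest = line.partition(" ")
        if code == "AC":
            found.extend(a.strip() for a in rest.split(";") if a.strip())
    return found


def genbank_accessions(header):
    """Collect accession numbers from GenBank 'ACCESSION' lines (space-separated)."""
    found = []
    for line in header.splitlines():
        keyword, _, rest = line.partition(" ")
        if keyword == "ACCESSION":
            found.extend(rest.split())
    return found


print(embl_accessions(EMBL_HEADER) == genbank_accessions(GENBANK_HEADER))
```

In both cases the first number is the primary accession number and the remainder are secondary.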
Accession Numbers
These are the unique, and therefore absolutely reliable, identifiers
assigned to sequences in the databases. Each sequence has a unique accession
number, used for that sequence in all the databases containing it.
An accession number is permanently associated with its sequence. On occasion,
two or more sequences are merged; the merged sequence is likely to be given
a new accession number. All the old accession numbers are retained with the
new sequence, becoming secondary accession numbers.
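The merge rule can be sketched as a small data model. The `Entry` class, the `merge` function, and the accession numbers `A00000` to `A00003` are all invented for illustration; the sketch covers only the case where the merged sequence receives a new primary number.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    primary: str
    secondary: list = field(default_factory=list)


def merge(entries, new_accession):
    """Merge entries under a new primary accession number.

    Every earlier number, primary or secondary, is retained as a
    secondary accession number of the merged entry.
    """
    retained = []
    for entry in entries:
        for acc in [entry.primary, *entry.secondary]:
            if acc not in retained:
                retained.append(acc)
    return Entry(primary=new_accession, secondary=retained)


# Made-up accession numbers, not real database entries.
merged = merge([Entry("A00001"), Entry("A00002", ["A00000"])], "A00003")
print(merged)
```

Because no number is ever discarded, a search on any of the old accession numbers still leads to the merged entry.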
Data Libraries
E/GCG converts the commentary and sequence
information available in these databases into E/GCG format, and organises it into data
libraries. Thus, any sequence you obtain using E/GCG programmes
will automatically be in E/GCG format.
Subsections of the Databases
DNA Databases
The EMBL and GenBank sequence databases are split into many different subsections
or divisions in the E/GCG data libraries.
The main purpose of this is to allow the searching of only the most relevant sequences. These
divisions may contain certain taxonomic categories, individual species, or even
special classes of loci. What are the advantages?
Logical Name | Abbreviation | Subsection Accessed |
---|---|---|
phage:* | ph:* | Bacteriophages |
Viral:* | vi:* | Viral |
Bacterial:* | ba:* | Bacterial (prokaryotes) |
Eukaryote:* | or:* | Eukaryote organelles |
Organelle:* | or:* | Organelle sequences |
Fungal:* | fun:* | Fungal (EMBL only) |
Plant:* | pl:* | Plant (includes fungi in GenBank) |
Invertebrate:* | in:* | Invertebrates |
Human:* | hu:* | Human sequences |
Rodent:* | ro:* | Rodent sequences |
Primate:* | pr:* | Primate sequences |
other_mammalian:* | om:* | Other Mammalian (not primate or rodent) |
Other_vertebrate:* | ov:* | Other Vertebrate |
sts:* | sts:* | Sequence-tagged site sequences (NEW) |
est:* | est:* | Expressed sequence tags (NEW) |
tags:* | tags:* | STSs and ESTs (NEW) |
Structural:* | st:* | Structural RNA |
Synthetic:* | sy:* | Synthetic |
Unclassified:* | un:* | Unclassified |
Patent:* | pat:* | Patented sequences |
There are three relatively new DNA database divisions available as E/GCG data libraries: sequence-tagged sites, expressed sequence tags, and the union of these two, called simply "tags". These subsections have grown so quickly in number that if you wish to include these sequences in a database search, you must now ask for them explicitly.
Data Accessed | GenEMBL | EMBL | GenBank |
---|---|---|---|
Entire sequence database | GenEMBLPlus:*, geplus:*, gep:* | EMBLPlus:*, emplus:*, emp:* | GenBankPlus:*, gbplus:*, gbp:* |
All sequences except tags | genembl:*, ge:* | embl:*, em:* | genbank:*, gb:* |
Only tags | tags:* | em_tags:* | gb_tags:* |
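The practical consequence of the table is that the familiar short forms (`ge`, `em`, `gb`) no longer search the tag divisions. The sketch below transcribes the shortest form from each cell of the table; the function name `choose_library` is hypothetical and simply picks the specification to use depending on whether tags should be included.

```python
# short form -> (database, covers non-tag sequences, covers tags),
# transcribed from the table above.
LIBRARIES = {
    "gep": ("GenEMBL", True, True),
    "ge": ("GenEMBL", True, False),
    "tags": ("GenEMBL", False, True),
    "emp": ("EMBL", True, True),
    "em": ("EMBL", True, False),
    "em_tags": ("EMBL", False, True),
    "gbp": ("GenBank", True, True),
    "gb": ("GenBank", True, False),
    "gb_tags": ("GenBank", False, True),
}


def choose_library(database, include_tags):
    """Return the wildcard specification covering the whole database,
    with or without the STS/EST tag divisions."""
    for short_form, (db, has_main, has_tags) in LIBRARIES.items():
        if db == database and has_main and has_tags == include_tags:
            return f"{short_form}:*"
    raise KeyError(database)


print(choose_library("GenEMBL", include_tags=True))
print(choose_library("GenEMBL", include_tags=False))
```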
Data Accessed | SwissProt | PIR | TREMBL |
---|---|---|---|
Entire sequence database (Annotated in PIR) | swissprot:*, swiss:*, sw:* | protein:*, prot:*, pir1:* | not avail |
PIR Preliminary sequences | | pir2:* | |
PIR Unverified seqs | | pir3:* | |
PIR Unencoded/untranslated seqs | | pir4:* | |
prompt> lookup

LookUp identifies sequences by name, accession number, author, organism, keyword, title, reference, feature, definition, length, or date. The output is a list of sequences.

The LookUp program is experimental in this release--please look carefully at your results.

LOOKUP in what sequence libraries:

a) sw_release
b) pir
c) embl
d) genbank
e) em_tags
f) gb_tags
g) gb_new
h) em_new
i) sw_new
j) epd
k) All libraries
q) quit

Please choose one or more (* k *): c

... a new screen is written ...

Complete the query form below:

All text:
Definition: mRNA
Author:
Keyword:
Sequence name:
Accession number:
Organism: Carassius auratus
Reference:
Title:
Feature:
On or after (dd-mmm-yy):
On or before (dd-mmm-yy):
Shortest sequence length:
Longest sequence length:
Inter-field operator: AND
Form of output list: Whole Entries

Press <Ctrl>D to continue.

Searching embl

53 entries were found. Do you wish to:

1) write out this list to a file
2) preview the results
3) refine the query
4) choose different libraries
q) quit

Please choose one (* 1 *):

What should I call the output file (* lookup.list *) ? .

53 entries were written to "lookup.list"

prompt>
The resulting file "lookup.list" lists the matching EMBL database entries. Comments describing each sequence are introduced by an exclamation mark:
prompt> more lookup.list

LOOKUP in: embl of: "([SQ-DEF: mRNA*] & [SQ-ORG: Carassius auratus*])"

53 entries October 27, 1995 11:05 ..

EM_OV:CA07056 ! ID: a0000103
! DE Carassius auratus homeobox protein mRNA, complete cds.
EM_OV:CA08016 ! ID: a1000103
! DE Carassius auratus kainate receptor beta subunit mRNA, complete cds.
EM_OV:CA08017 ! ID: a2000103
! DE Carassius auratus kainate receptor alpha subunit mRNA, complete
! DE cds.
EM_OV:CA12018 ! ID: a3000103
! DE Carassius auratus glutamate receptor 4 (glur4) mRNA, partial cds.
...
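A list file like this can be read back programmatically. The sketch below is based only on the sample shown here and rests on two assumptions: that the header ends at the first line finishing with `..`, and that each entry sits on its own line with everything after `!` being a comment. The function name `list_entries` is hypothetical.

```python
# Sample laid out one entry per line (assumed layout of the original file).
LIST_FILE = """\
LOOKUP in: embl of: "([SQ-DEF: mRNA*] & [SQ-ORG: Carassius auratus*])"

53 entries October 27, 1995 11:05 ..

EM_OV:CA07056 ! ID: a0000103
! DE Carassius auratus homeobox protein mRNA, complete cds.
EM_OV:CA08016 ! ID: a1000103
! DE Carassius auratus kainate receptor beta subunit mRNA, complete cds.
"""


def list_entries(text):
    """Return the library:identifier specifications in a list file."""
    names = []
    in_body = False
    for line in text.splitlines():
        if not in_body:
            # Assumption: the header ends at the first line ending in "..".
            in_body = line.rstrip().endswith("..")
            continue
        # Everything after "!" is a comment.
        spec = line.split("!", 1)[0].strip()
        if spec:
            names.append(spec)
    return names


print(list_entries(LIST_FILE))
```

Each name returned is a `library:identifier` specification, here the EM_OV (EMBL, other vertebrate) division followed by the entry name.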
prompt> lookup -out=rhodopsin.list
...Choose EMBL as the database ...
...Enter rhodopsin in the "All text:", "Definition:", and "Keyword:" fields, selecting OR as the "Inter-field operator:" ...
...Press <CTRL>D to continue, and accept the remaining defaults.
prompt> more rhodopsin.list
To copy a sequence entry from one of the E/GCG data libraries to a UNIX file, use the programme called fetch. It takes the database:entry you want as its argument. fetch responds by describing itself, and then prints the filename it has copied the database entry to.
prompt> fetch gb:hsef2
FETCH copies GCG sequences or data files from the GCG database into your directory or displays them on your terminal screen.
hsef2.gb_pr
The name of the new UNIX file holding the E/GCG format sequence data is "hsef2.gb_pr". Because it is a normal UNIX file, you may use any normal UNIX commands on it. You can type it to the screen (using "more"), delete it (using "rm"), edit it (please use "seqed", NOT "pico, vi, emacs, etc."!), transfer it to your local site over the computer network, and use it as an input file to other E/GCG programs.
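The file name "hsef2.gb_pr" appears to combine the entry name with the library short form and the division abbreviation (here GenBank, Primate). That reading is an inference from the examples in this document, not a documented rule, and the helper `describe_fetched` below is hypothetical. It only covers a few of the abbreviations listed in the tables.

```python
# Subsets of the short forms and division abbreviations given in this document.
LIBRARIES = {"gb": "GenBank", "em": "EMBL", "ge": "GenEMBL"}
DIVISIONS = {"pr": "Primate", "ro": "Rodent", "ov": "Other Vertebrate"}


def describe_fetched(filename):
    """Split a fetched file name into (entry, library, division).

    Assumes the pattern entry.library_division, inferred from
    'hsef2.gb_pr'; unknown abbreviations are returned unchanged.
    """
    entry, _, suffix = filename.rpartition(".")
    library, _, division = suffix.partition("_")
    return (
        entry,
        LIBRARIES.get(library, library),
        DIVISIONS.get(division, division),
    )


print(describe_fetched("hsef2.gb_pr"))
```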
prompt> fetch ge:hsef2
prompt> more hsef2.ge_pr
prompt> etc.
prompt> typedata ge:hsef2 | more
prompt> etc.
This can be frustrating if you want to fetch long sequences, rather than search through data libraries! Retrieving complete long sequences is easier with specialist sequence retrieval programmes like SRS.