Searching Databases 1
- Creating databases to search
- Virtual databases via list files
- Actual databases via dataset
Creating databases to search
It is often useful to specify a set of sequences - a "personal
database" - for your particular research interests, or for a
special series of analyses. A personal database can have members from
different data libraries, as well as your unpublished results. Depending on
your needs, a personal database can be either virtual, with the
sequences still existing in the E/GCG data libraries,
or actual, with the sequences stored in files in your own directories.
Virtual databases
The contents of a virtual personal database are described by list files, like
the one produced by the lookup programme (see
Sequence Databases). Since several
E/GCG programmes that search data libraries also
write list files, you can create virtual personal databases of high
precision simply by running two or three different searches in tandem.
Virtual personal databases are easy to create with the various
search programmes, easy to amend, and use hardly any disk space compared
with their actual counterparts. They are, however, limited to sequences
that are found in the E/GCG data libraries: a list file can usually hold
references only to valid data library sequences. Nonetheless, virtual personal
databases are the recommended approach!
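Refining a virtual database by running a second search on the list from the first amounts to keeping only the entries that satisfy both searches. The Python below is a rough sketch of that idea, not an E/GCG tool. It assumes a list-file layout in which free-text comments are followed by a line containing ".." and then one sequence specification per line. The function names, the library prefix and most entry names are invented for illustration.

```python
def read_list_entries(text):
    """Return the set of sequence names listed after the '..' divider line."""
    entries = set()
    in_body = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_body:
            # Assumption: everything up to the first line containing ".."
            # is free-text comment and is ignored.
            if ".." in stripped:
                in_body = True
            continue
        if not stripped or stripped.startswith("!"):
            continue  # skip blank lines and whole-line comments
        # Keep only the name; ignore any attributes that follow it.
        entries.add(stripped.split()[0].lower())
    return entries


def intersect_lists(*texts):
    """Names that occur in every one of the given list-file texts."""
    sets = [read_list_entries(t) for t in texts]
    return sorted(set.intersection(*sets)) if sets else []


# Hypothetical list-file contents; the entries are made up.
keyword_hits = """Hypothetical output of a keyword search
..
gb_ov:crablu
gb_ov:cragre  ! a trailing comment
gb_ov:crared
"""
pattern_hits = """Hypothetical output of a pattern search
..
GB_OV:CRABLU Begin: 1 End: 1257
gb_ov:crared
gb_ov:other
"""

common = intersect_lists(keyword_hits, pattern_hits)
print(common)
```

Only the names present in both lists survive, which is why each extra search narrows the virtual database.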
Actual databases
Actual personal databases are created with the dataset programme.
These are full E/GCG data libraries that take up room in your own
disk space. Analysing or manipulating sequences from an actual personal database
will be slightly faster than from the E/GCG data libraries because the
search time will be shorter. Further, you can select subsets of an actual
personal database by using wildcards in the name, just as you can with GenBank,
EMBL, etc.
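As a rough picture of what wildcard selection does, the Python sketch below filters a list of entry names against a pattern such as hs*. It is not how E/GCG implements the feature. It assumes the wildcard behaves like a shell-style "*" and that names are compared without regard to case, and the entry names are invented.

```python
from fnmatch import fnmatchcase


def select_entries(entry_names, pattern):
    """Return the entry names matching a shell-style wildcard pattern."""
    pattern = pattern.lower()
    # Lower-case both sides so the comparison ignores case (an assumption).
    return [name for name in entry_names if fnmatchcase(name.lower(), pattern)]


# Invented entry names in a hypothetical personal database.
entries = ["hsabc1", "HSXYZ2", "mmabc1", "crablu"]
human = select_entries(entries, "hs*")
print(human)
```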
Use an actual personal database if you have a large set of sequences that you
will process often, that do not occur in the public databases, and that
will not be changed or added to.
Virtual databases via list files
Many programmes write list files, including:
- lookup
- stringsearch
- wordsearch
- findpatterns -names
- fasta -noalign
- tfasta -noalign
To illustrate the creation and refinement of a virtual database, we will find
all the mRNA sequences for goldfish, filter out those lacking a particular
restriction enzyme cutting site, and view the sequences on the screen.
- Exercise DNA Analysis - Searching Databases 1.1: create and refine a virtual database with lookup & findpatterns
- Query the GenBank & EMBL data libraries for mRNA sequences from the
goldfish. (see Sequence Databases to refresh
your memory on lookup)
prompt> lookup -lib=gb,em -all=mRNA -org="Carassius auratus" -out=gofishmrna.list
- (Or you may enter only lookup, and respond to all the
prompts.)
"<CTRL> D" begins the
search and "1" writes the list file.
- Refine this set of sequences to hold only sequences containing two or
more EcoRI recognition sites (GAATTC).
prompt> findpatterns @gofishmrna.list -pat=GAATTC -minc=2 -names -out=gofishmrnaecor1.list
- The findpatterns programme is given the output list file from
lookup as its input file, preceded by an "@"
symbol to indicate that gofishmrna.list is a list file.
The "-names" switch tells findpatterns to write
a list file as its output.
- View the sequences.
prompt> typedata @gofishmrnaecor1.list | more
FETCH copies GCG sequences or data files from the GCG database
into your directory or displays them on your terminal screen.
crablu
LOCUS CRABLU 1257 bp ss-mRNA VRT 03-MAR-1993
DEFINITION Carassius auratus blue cone opsin mRNA, complete cds.
ACCESSION L11864
KEYWORDS blue sensitive cone opsin; opsin.
...
- Can any of these sequences be almost completely sub-cloned using only
EcoRI? (Hint!)
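To make the filtering step concrete, here is a small Python sketch of the idea behind it: locate every GAATTC in each sequence and keep the sequences with at least two. It is an illustration only, not the findpatterns algorithm. The function names and toy sequences are invented. Printing the site positions next to the sequence length also shows how one might judge whether the sites lie near the two ends.

```python
def find_sites(sequence, site="GAATTC"):
    """Return the 1-based start position of every occurrence of `site`."""
    seq = sequence.upper()
    site = site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + 1)  # report 1-based positions
        start = seq.find(site, start + 1)
    return positions


def keep_with_min_sites(sequences, site="GAATTC", min_count=2):
    """Keep named sequences with at least `min_count` occurrences of `site`."""
    kept = {}
    for name, seq in sequences.items():
        positions = find_sites(seq, site)
        if len(positions) >= min_count:
            kept[name] = positions
    return kept


# Invented toy sequences, not real database entries.
# GAATTC is its own reverse complement, so scanning one strand is enough.
toy = {
    "toy_two_sites": "ccGAATTCaaaaggggttttGAATTCcc",
    "toy_one_site": "aaaaGAATTCgggg",
    "toy_no_site": "acgtacgtacgt",
}

hits = keep_with_min_sites(toy)
for name, positions in hits.items():
    print(name, positions, "length", len(toy[name]))
```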
On-line help for lookup, findpatterns,
stringsearch, and wordsearch
is available via the commands
prompt> genhelp lookup
prompt> genhelp findpatterns
prompt> genhelp stringsearch
prompt> genhelp wordsearch
You may also check the manual web pages for complete details:
lookup, findpatterns, stringsearch, & wordsearch.
Actual databases via dataset
Some warnings about creating actual personal databases:
- It is a very good way to fill up your file space.
- They are best used for a large number of private sequences that will
not change, and will be searched often.
- Large personal databases are easily re-created at each login, if you
have access to temporary file space.
NB: It is far better to use virtual personal databases via
list files - these are more flexible and use far, far less disk space!
To illustrate the creation of an actual database, we will first make a list
file, edit it to hold references to ~20 sequences, and use it as an input
file for dataset.
- Exercise DNA Analysis - Searching Databases 1.2: create and refine a list file with lookup & findpatterns; create a personal database with dataset
- Query the GenBank & EMBL data libraries for sequences having one of
"jewel", "hippo", or
"broom" in the header information.
prompt> lookup -lib=gb,em -all=hippo -out=hippo.list
- If the number of entries is >>20 (I found 468 with "hippo"),
use findpatterns to trim the list size. (E.g., keep only sequences that
have exactly three EcoRI &/or XhoI sites.)
prompt> findpatterns @hippo.list -pat=GAATTC,CTCGAG -minc=3 -maxc=3 -names -out=hippo2.list
- Use dataset to create a database named hippodb.
prompt> dataset @hippo2.list -out=hippodb -sn=hi
- But contigcg.seq is also relevant to the
hippodb database! Add this sequence.
prompt> dataset contigcg.seq -append -out=hippodb
- Look at the human sequences in hippodb.
prompt> typedata hi:hs* | more
- When through experimenting with the new personal database, delete it
to conserve disk space. Check that you got ALL of it removed!
prompt> rm *hippo* ; ls -l hippo*
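Note that a pattern such as *hippo* matches every file whose name contains "hippo", not only the database files. The Python sketch below shows that effect in a throwaway directory. The file names are invented, since the names dataset actually writes are not shown here, and the helper function is hypothetical.

```python
import glob
import os
import tempfile


def remove_matching(directory, pattern):
    """Delete regular files in `directory` matching `pattern`; return their names."""
    removed = []
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        if os.path.isfile(path):
            os.remove(path)
            removed.append(os.path.basename(path))
    return removed


# Demonstration in a temporary directory with invented file names.
with tempfile.TemporaryDirectory() as workdir:
    for name in ("hippo.list", "hippo2.list", "hippodb.seq",
                 "myhippo.txt", "notes.txt"):
        open(os.path.join(workdir, name), "w").close()

    removed = remove_matching(workdir, "*hippo*")
    # Equivalent of "ls hippo*": anything left that starts with "hippo"?
    leftover = glob.glob(os.path.join(workdir, "hippo*"))
    remaining = sorted(os.listdir(workdir))

print("removed:  ", removed)
print("leftover: ", leftover)
print("remaining:", remaining)
```

In this sketch the unrelated file myhippo.txt is deleted along with the database files, so it is worth listing what the pattern matches before removing anything.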
On-line help for dataset is available via the command
prompt> genhelp dataset
You may also check the manual web pages for complete details:
dataset.
Please continue with Part
9 - Multiple Sequence Analysis (under construction!)
Comments? Questions? Accolades?
Please send them to David Featherston (dwf@biobase.dk)
Updated on Thursday, 24 October, 1996
Copyright © 1995-1996 by Gary Williams, Peter Woollard, & David W. Featherston