One inconvenience of having a number of different DNA sequence analysis
packages available is that they use different formats for storing more-or-less
the same information. Further, most packages refuse to accept even the simplest
files from one another. And finally, the DNA sequence libraries - EMBL, Genbank,
GSDB, and DDBJ - each have their own sequence file format, incompatible
with both Staden and E/GCG.
Recognising this inconvenience, the developers of Staden and E/GCG have
built small exchange programmes that inter-convert DNA sequence files
among a number of different formats. There are several (23, in fact)
exchange programmes in the E/GCG suite; the table below details these
programmes and the formats they convert.
Sequence exchange programmes
"Foreign" Format | E/GCG Programme | Comments |
---|---|---|
Staden | fromstaden | for Staden Package files |
Staden | tostaden | |
Staden | efromstaden | Staden sequences from a fasta output |
Staden | etostaden | = tostaden + command line control |
IntelliGenetics (SEQ) | fromig | for IntelliGenetics files |
IntelliGenetics (SEQ) | toig | |
EMBL | fromembl | EMBL sequence database |
EMBL | toembl | |
Genbank | fromgenbank | Genbank sequence database |
Genbank | togenbank | |
PIR | frompir | PIR sequence database |
PIR | topir | |
PIR | topirall | for a list of sequence files |
Fasta & Blast | fromfasta | for FASTA files |
Fasta & Blast | efromfasta | |
Fasta & Blast | toblast | makes a blast sequence set for submission to BLAST |
Misc. | totext | plain text output |
Misc. | toprimer | for PRIMER |
Misc. | torelate | for NBRF RELATE |
Misc. | getseq | gets sequence from a local terminal to a remote E/GCG host |
Misc. | egetseq | = getseq + command line control |
Misc. | reformat | for corrupted sequence files |
Misc. | creformat | for corrupted sequence, scoring matrix, or enzyme data files |
Having explored sequence assembly and editing through the Staden Package programmes, let's convert a few of these Staden format DNA sequence files to E/GCG format. E/GCG conversion programmes generally take the sequence file to be converted as an argument, and prompt for the name of the file to be written.
prompt> gcg
prompt> fromstaden contig.seq
prompt> more contig.seq
prompt> more contigcg.seq
These conversion programmes may also be used to trade data between databases, for use by other analysis programmes. For example, topir can be used to create PIR compatible files from Genbank sequences, by giving topir an argument that is a Genbank data library entry. (More on this feature in Sequence Databases.)
prompt> topir gb:hsfau
TOPIR writes GCG sequence(s) into a single file in PIR format.
Begin (* 1 *) ?
End (* 518 *) ?
Reverse (* No *) ?
What should I call the output file (* hsfau.pir *) ?
HSFAU 518 characters.
prompt>
An E/GCG sequence file has two sections, a header and a sequence, separated by two dots (".."). For example:

This is my own file, with header and sequence. I made it myself. Good isn't it?

fred.seq  Length: 123  February 2, 1994 13:21  Type: N  Check: 1639  ..

1  aaacccgggt ttatcgagcg tatcgatcga ctgagtcgta cgtcatatcg
51  tgactagcgt acgtacgtat gtgacgactg acgatgcgtg tatgcgtacg
101  tacgtgcagc agatgtgcag atg
The header contains all non-sequence information relevant to the data. In fact, if the file has been produced by one of the exchange programmes (fromstaden, fromembl, frompir, etc.), the header holds all the comments in the original database entry. The header section always ends with the name of the sequence file, its length, creation date and time, type of sequence, and the "Check:" number.
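The two-section layout described above can be pulled apart with a few lines of Python. This is a minimal sketch, not part of E/GCG: the function name `split_gcg_file` is hypothetical, it treats the first ".." in the text as the divider, and it keeps only letters from the sequence section (so gap or stop symbols would be dropped).

```python
def split_gcg_file(text):
    """Split E/GCG-style file text into (header, sequence).

    Assumption: the first ".." in the text marks the end of the header.
    """
    head, divider, body = text.partition("..")
    if not divider:
        raise ValueError("no '..' divider found between header and sequence")
    # Keep letters only: this discards position numbers and whitespace.
    sequence = "".join(ch for ch in body if ch.isalpha())
    return head.strip(), sequence


EXAMPLE = """This is my own file, with header and sequence. I made it myself. Good isn't it?

fred.seq  Length: 123  February 2, 1994 13:21  Type: N  Check: 1639  ..

       1  aaacccgggt ttatcgagcg tatcgatcga ctgagtcgta cgtcatatcg
      51  tgactagcgt acgtacgtat gtgacgactg acgatgcgtg tatgcgtacg
     101  tacgtgcagc agatgtgcag atg
"""

header, sequence = split_gcg_file(EXAMPLE)
print(len(sequence))  # 123, matching the "Length:" field in the header
```

The last header line still has to be parsed separately if you want the individual fields (name, length, date, type, check).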
The "Check:" number just before the two dots is calculated from the sequence and is inserted automatically by the E/GCG program that produced it. When another E/GCG program reads in the sequence, it will see if the original "Check:" number matches the new one it creates from the sequence. If the two "Check:" numbers don't match, then it assumes the sequence has become corrupted and will stop with an error message.
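The text above does not say how the "Check:" number is computed. The sketch below uses the position-weighted scheme commonly described for GCG checksums (weights cycling from 1 to 57, upper-cased characters, result modulo 10000). Treat that scheme, and the function names `gcg_checksum` and `is_intact`, as assumptions for illustration rather than a statement of what E/GCG does internally.

```python
def gcg_checksum(sequence):
    """Position-weighted checksum (assumed GCG-style scheme).

    Each character is upper-cased and multiplied by a weight that
    cycles 1, 2, ..., 57, 1, 2, ...; the sum is reduced modulo 10000.
    """
    total = 0
    for i, ch in enumerate(sequence):
        weight = i % 57 + 1
        total += weight * ord(ch.upper())
    return total % 10000


def is_intact(sequence, stored_check):
    """Return True if the recomputed checksum matches the stored one."""
    return gcg_checksum(sequence) == stored_check


print(gcg_checksum("ac"))    # 1*65 + 2*67 = 199
print(is_intact("ag", 199))  # False: one changed base changes the checksum
```

Because the weights depend on position, inserting, deleting or swapping bases will usually alter the result, which is why an edit made in an ordinary text editor is detected.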
It is therefore a very good idea to use only E/GCG programs to alter E/GCG sequence files. "Normal" text editors cannot calculate the "Check:" number for altered sequences. To edit E/GCG sequence files, use the GCG editor, seqed.

The sequence section of the file displays 50 nucleotides per line, grouped in five sets of 10, with the position of the first nucleotide on each line shown at the start of that line.
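The 50-per-line, groups-of-10 layout is easy to reproduce. The following is a sketch only: `format_sequence_lines` is a hypothetical helper, and the width of the number column (8 characters here) is an assumption, since the exact column positions are not given in the text.

```python
def format_sequence_lines(sequence, per_line=50, group=10, number_width=8):
    """Lay out a sequence as numbered lines of space-separated groups."""
    lines = []
    for start in range(0, len(sequence), per_line):
        chunk = sequence[start:start + per_line]
        groups = [chunk[i:i + group] for i in range(0, len(chunk), group)]
        # The number is the 1-based position of the first residue on the line.
        lines.append(f"{start + 1:>{number_width}}  {' '.join(groups)}")
    return lines


demo = "acgt" * 30 + "acg"  # 123 nucleotides, same length as fred.seq
for line in format_sequence_lines(demo):
    print(line)
```

For a 123-nucleotide sequence this gives three lines, numbered 1, 51 and 101, with the last line holding two full groups and one group of three.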
The program seqed is the editor used for entering and modifying single sequences in E/GCG.
To edit a new or existing E/GCG sequence file called seqfilename, enter seqed seqfilename at the UNIX prompt. To edit a protein sequence, add the switch "-protein" to the command, e.g., seqed seqfilename -protein.
There are two areas on the seqed screen, corresponding to the two sections of an E/GCG sequence file: header and sequence. Four lines of the header are shown in the upper area, and 70 bases of sequence, plus an indication of the position in the entire sequence, are shown below. If the file given as the argument to the seqed command does not exist, seqed will create it and you will start in the header area. If the file does already exist, you will start editing the sequence.
prompt> seqed new.seq
new.seq   ***** K E Y B O A R D *****   seqed
: This is a header :
: :
: :
: :
....|.........|.........|.........|.........|.........|.........|.........|....
0 10 20 30 40 50 60 70
^
|......|......|......|......|......|......|......|......|......|......|
0 10 20 30 40 50 60 70 80 90 100
Press ^D to quit editing header comments.
There are many keystroke and text commands in seqed. Check the GCG Manual seqed web page for complete details. Some essential seqed keystroke commands are shown in the table below.
Keystroke | Function |
---|---|
A,G,C,N,R,T,Y | enter nucleotide |
<DELETE> | delete |
/TAA<RETURN> | find next occurrence of TAA |
n <RIGHT CURSOR> | ahead 'n' characters |
n <LEFT CURSOR> | back 'n' characters |
n <RETURN> | go to sequence position 'n' |
<CTRL>D | enter "command mode" (see below) |
These keystroke commands are the basic ones for sequence entry and editing. seqed has another set of commands, with other, more global functions, which are entered as text. To use these text string commands, you must first type the keystroke command <CTRL>D to enter "command mode".
Text | Function |
---|---|
help | show help |
exit | write the sequence file and exit |
s,f delete | delete a block of sequence from start to finish |
n include file.seq | include another sequence from file.seq at position 'n' |
insert | insert new nucleotides at the present position; nucleotides to the right move |
overstrike | change nucleotides at the present position; nucleotides to the right are replaced |
n comment comments | add your comments at position 'n' |
n heading | edit line 'n' of the header |
<RETURN> | re-enter keystroke mode (see above) |
prompt> seqed new.seq
prompt> more new.seq
Either reformat or creformat should be used if a sequence file has been edited by anything other than an E/GCG program like seqed. You can take sequence data in any format from a file and separate existing comments from the sequence information by placing two dots ("..") between them. reformat will recognise such files as corrupted E/GCG format files and correct them. Comments become header lines, and the "Check:" number is computed and inserted before the two dots. If there are no comments, just raw sequence data in the file, reformat enters the two dots for you.
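The repair behaviour described above can be imitated in a short Python sketch. It is not the reformat program: `repair_text` and `gcg_checksum` are hypothetical names, the checksum scheme (weights cycling 1 to 57, modulo 10000) is an assumption, the creation date is left out of the summary line, and only letters are kept as sequence.

```python
def gcg_checksum(sequence):
    """Position-weighted checksum (assumed scheme, see the lead-in)."""
    total = 0
    for i, ch in enumerate(sequence):
        total += (i % 57 + 1) * ord(ch.upper())
    return total % 10000


def repair_text(text, name="fixed.seq", seq_type="N"):
    """Rebuild an E/GCG-like file from comments + '..' + sequence."""
    head, divider, body = text.partition("..")
    if not divider:
        # No two dots: treat everything as raw sequence with no comments.
        head, body = "", text
    sequence = "".join(ch for ch in body if ch.isalpha())

    # Comments become header lines; the summary line ends with the two dots.
    lines = [ln.rstrip() for ln in head.strip().splitlines()]
    if lines:
        lines.append("")
    summary = (f"{name}  Length: {len(sequence)}  Type: {seq_type}  "
               f"Check: {gcg_checksum(sequence)}  ..")
    lines += [summary, ""]

    # Sequence section: 50 per line, in groups of 10, numbered at the left.
    for start in range(0, len(sequence), 50):
        chunk = sequence[start:start + 50]
        groups = " ".join(chunk[i:i + 10] for i in range(0, len(chunk), 10))
        lines.append(f"{start + 1:>8}  {groups}")
    return "\n".join(lines) + "\n"


print(repair_text("my comment\n..\nACGT ACGT\n", name="x.seq"))
```

Both inputs, with and without the two dots, end up with a summary line that carries a freshly computed "Check:" value immediately before "..".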
prompt> cp new.seq uncorrupt.seq
prompt> translate corrupt.seq -outfile=corrupt.pep
prompt> reformat corrupt.seq -outfile=fixednew.seq
prompt> translate fixednew.seq -outfile=fixednew.pep
prompt> more fixednew.pep
reformat has other uses, too, for valid E/GCG files.
prompt> reformat -ONEIntothree fixednew.pep -outfile=fixednew3lc.pep
prompt> more fixednew3lc.pep
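To make clear what a one-letter to three-letter conversion involves, here is a small Python sketch of the mapping for the 20 standard amino acids. The function name `one_to_three`, the use of "Xaa" for any unrecognised symbol, and the unseparated output are assumptions for illustration; the layout that reformat actually writes may differ.

```python
# Standard one-letter to three-letter amino acid codes.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def one_to_three(peptide, unknown="Xaa"):
    """Convert a one-letter peptide string to three-letter codes.

    Symbols outside the 20 standard residues are written as `unknown`
    (a placeholder choice made for this sketch).
    """
    return "".join(ONE_TO_THREE.get(res.upper(), unknown) for res in peptide)


print(one_to_three("MKV"))  # MetLysVal
```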