
Sequence Editing and Exchange


Table of Contents

Sequence exchange programmes
E/GCG sequence file format
Entering a new sequence
Reformatting a corrupted sequence file


Sequence exchange programmes

One inconvenience of having a number of different DNA sequence analysis packages available is that they use different formats for storing more-or-less the same information. Further, most packages refuse to accept even the simplest files from one another. And finally, the DNA sequence libraries - EMBL, Genbank, GSDB, and DDBJ - each has its own sequence file format, incompatible with both Staden and E/GCG.

Recognising this inconvenience, the developers of Staden and E/GCG have built small exchange programmes that inter-convert DNA sequence files among a number of different formats. There are several (23, in fact) exchange programmes in the E/GCG suite; the table below details these programmes and the formats they convert.

E/GCG Sequence File Conversion Programmes

"Foreign" Format       E/GCG Pgms    Comments
---------------------  ------------  ----------------------------------------------------------
Staden                 fromstaden    for Staden Package files
                       tostaden
                       efromstaden   Staden sequences from a fasta output
                       etostaden     = tostaden + command line control
IntelliGenetics (SEQ)  fromig        for IntelliGenetics files
                       toig
EMBL                   fromembl      EMBL sequence database
                       toembl
Genbank                fromgenbank   Genbank sequence database
                       togenbank
PIR                    frompir       PIR sequence database
                       topir
                       topirall      for a list of sequence files
Fasta & Blast          fromfasta     for FASTA files
                       efromfasta
                       toblast       makes a blast sequence set for submission to BLAST
Misc.                  totext        plain text output
                       toprimer      for PRIMER
                       torelate      for NBRF RELATE
                       getseq        gets sequence from a local terminal to a remote E/GCG host
                       egetseq       = getseq + command line control
                       reformat      for corrupted sequence files
                       ereformat     for corrupted sequence, scoring matrix, or enzyme data files

Having explored sequence assembly and editing through the Staden Package programmes, let's convert a few of these Staden format DNA sequence files to E/GCG format. E/GCG conversion programmes generally take the sequence file to be converted as an argument, and prompt for the name of the file to be written.

Exercise DNA Analysis - Sequence Editing & Exchange 1: begin GCG and convert Staden sequence files to E/GCG format
Begin GCG by typing "gcg" at the prompt.

prompt> gcg

The system should respond with a greeting.

Convert the Staden contig sequence file created in Sequence Assembly 1. Accept the default settings, but name the output file contigcg.seq . (Need a Staden format sequence file? Download one from the Datasets page.)

prompt> fromstaden contig.seq

Check the differences between the two sequence file formats.

prompt> more contig.seq

prompt> more contigcg.seq

These conversion programmes may also be used to trade data between databases, for use by other analysis programmes. For example, topir can be used to create PIR compatible files from Genbank sequences, by giving topir an argument that is a Genbank data library entry. (More on this feature in Sequence Databases.)



How would you convert contig.seq from Staden format to PIR format?

 



E/GCG sequence file format

An E/GCG sequence file consists of a header and a sequence, separated by a line that contains two dots ("..").



The header contains all non-sequence information relevant to the data. In fact, if the file has been produced by one of the exchange programmes (fromstaden, fromembl, frompir, etc.), the header holds all the comments in the original database entry. The header section always ends with the name of the sequence file, its length, creation date and time, type of sequence, and the "Check:" number.

The "Check:" number just before the two dots is calculated from the sequence and is inserted automatically by the E/GCG program that produced it. When another E/GCG program reads in the sequence, it will see if the original "Check:" number matches the new one it creates from the sequence. If the two "Check:" numbers don't match, then it assumes the sequence has become corrupted and will stop with an error message.

It is therefore a very good idea to only use E/GCG programs to alter E/GCG sequence files. "Normal" text editors cannot calculate the "Check:" number for altered sequences. To edit E/GCG sequence files, use the GCG editor, seqed.
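
The check described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not code from the E/GCG sources: it follows the checksum commonly documented for GCG files (each character's upper-case ASCII code is weighted by its position in a repeating cycle of 1 to 57, and the sum is taken modulo 10000). The function name gcg_checksum is hypothetical.

```python
def gcg_checksum(sequence):
    """Return a GCG-style checksum (0-9999) for a sequence string.

    Assumption: weights cycle through 1..57 and the total is reduced
    modulo 10000, as commonly documented for GCG sequence files.
    """
    total = 0
    for index, symbol in enumerate(sequence):
        weight = (index % 57) + 1
        total += weight * ord(symbol.upper())
    return total % 10000


original = "GATTACAGATTACA"
damaged = "GATTACGATTACA"  # one nucleotide deleted in a plain text editor

# A stored number that no longer matches the recomputed one signals corruption.
print(gcg_checksum(original), gcg_checksum(damaged))
```

Because every position carries its own weight, deleting or moving a single nucleotide almost always changes the number, which is why a file edited outside E/GCG is rejected.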
The sequence section of the file displays 50 nucleotides per line, grouping them in five sets of 10, and numbering the first nucleotide on each line.
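
The layout of the sequence section can be illustrated with a short Python sketch. The function name format_sequence_lines and the width of the number column are assumptions made for illustration; only the 50-per-line, groups-of-10, numbered-first-nucleotide layout comes from the description above.

```python
def format_sequence_lines(sequence, per_line=50, group=10):
    """Lay out a sequence as numbered lines of space-separated groups."""
    lines = []
    for start in range(0, len(sequence), per_line):
        chunk = sequence[start:start + per_line]
        groups = [chunk[i:i + group] for i in range(0, len(chunk), group)]
        # Label each line with the 1-based position of its first nucleotide.
        # The 8-character label width is an assumption, not the E/GCG value.
        lines.append("{:>8}  {}".format(start + 1, " ".join(groups)))
    return lines


for line in format_sequence_lines("ACGT" * 30):
    print(line)
```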


Entering a new sequence

The program seqed is the editor used for entering and modifying single sequences in E/GCG.

To edit a new or existing E/GCG sequence file called seqfilename, enter "seqed seqfilename" at the UNIX prompt. To edit a protein sequence, add the switch "-protein" to the command.

There are two areas on the seqed screen, corresponding to the two sections of an E/GCG sequence file: header and sequence. Four lines of the header are shown in the upper area, and 70 bases of sequence, plus an indication of the position in the entire sequence, are shown below. If the file given as the argument to the seqed command does not exist, seqed will create it and you will start in the header area. If the file does already exist, you will start editing the sequence.


prompt> seqed new.seq

new.seq                    ***** K E Y B O A R D *****                 seqed
    : This is a header                                                      :
    :                                                                       :
    :                                                                       :
    :                                                                       :
 
 
 
 
....|.........|.........|.........|.........|.........|.........|.........|....
    0        10        20        30        40        50        60        70
 
 
    ^
    |......|......|......|......|......|......|......|......|......|......|
    0     10     20     30     40     50     60     70     80     90     100
 
 
 Press ^D to quit editing header comments.

There are many keystroke and text commands in seqed. Check the GCG Manual seqed web page for complete details. Some essential seqed keystroke commands are shown in the table below.

Essential seqed keystroke commands

Keystroke           Function
------------------  ---------------------------------
A,G,C,N,R,T,Y       enter nucleotide
<DELETE>            delete
/TAA<RETURN>        find next occurrence of TAA
n <RIGHT CURSOR>    ahead 'n' characters
n <LEFT CURSOR>     back 'n' characters
n <RETURN>          go to sequence position 'n'
<CTRL>D             enter "command mode" (see below)

These keystroke commands are the basic ones for sequence entry and editing. seqed has another set of commands, with other, more global functions, which are entered as text. To use these text string commands, you must first type the keystroke command <CTRL>D to enter "command mode".

Essential seqed "command mode" (text string) commands

Text                 Function
-------------------  ---------------------------------------------------------------------------------
help                 show help
exit                 write the sequence file and exit
s,f delete           delete a block of sequence from start to finish
n include file.seq   include another sequence from file.seq at position 'n'
insert               insert new nucleotides at the present position; nucleotides to the right move
overstrike           change nucleotides at the present position; nucleotides to the right are replaced
n comment comments   add your comments at position 'n'
n heading            edit line 'n' of the header
<RETURN>             re-enter keystroke mode (see above)
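
The difference between the insert, overstrike, and block delete commands is easiest to see on a plain string. The Python sketch below only models the effect on the sequence; it is not how seqed is implemented. The function names are hypothetical, and two details are assumptions: positions are counted from 1, and a block delete removes both the start and the finish positions.

```python
def insert_at(sequence, position, new):
    """Insert mode: nucleotides from `position` onward move to the right."""
    i = position - 1  # assumed 1-based positions
    return sequence[:i] + new + sequence[i:]


def overstrike_at(sequence, position, new):
    """Overstrike mode: nucleotides from `position` onward are replaced."""
    i = position - 1
    return sequence[:i] + new + sequence[i + len(new):]


def delete_block(sequence, start, finish):
    """Delete positions start..finish (assumed inclusive at both ends)."""
    return sequence[:start - 1] + sequence[finish:]


seq = "AAAACCCC"
print(insert_at(seq, 5, "GG"))      # the sequence grows by two nucleotides
print(overstrike_at(seq, 5, "GG"))  # the length stays the same
print(delete_block(seq, 3, 6))      # four nucleotides are removed
```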

Exercise DNA Analysis - Sequence Editing & Exchange 2: use seqed to create and enter a new E/GCG format sequence file
Create a DNA sequence file called new.seq .

prompt> seqed new.seq

Describe the sequence in the header section. Type "<CTRL>D" to switch from the header to the sequence area. Invent and enter some sequence (as long as you like), and go to position 20.

Enter "command mode", and insert contigcg.seq at this position. Back in keystroke mode, search for the first occurrence of the pattern "TATA" and note its position number. Now search for "AAAAA"; in "command mode", delete all sequence between these two patterns.

Exit and save new.seq . Don't be shy about using the "command mode" help!

Display new.seq to the screen to see how seqed saved it.

prompt> more new.seq

 



Reformatting a corrupted sequence file

The programmes reformat and ereformat are probably the most useful in the packages. Their primary role is to take damaged or "edited elsewhere" sequence files (and, for ereformat, scoring matrix or enzyme data files) that aren't quite in E/GCG format, and to turn them into usable files.
One of these should be used whenever a sequence file has been edited by anything other than an E/GCG program like seqed.
You can take sequence data in any format from a file and separate existing comments from the sequence information by placing two dots ("..") between them. reformat will recognise such files as corrupted E/GCG format files and correct them: comments become header lines, and the "Check:" number is computed and inserted before the two dots. If the file holds no comments, just raw sequence data, reformat enters the two dots for you.
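
The two-dot rule described above can be sketched in Python as follows. This is a simplified model of the splitting step only, not the reformat programme itself: the function name split_at_two_dots is hypothetical, and keeping only alphabetic characters as sequence symbols is an assumption that ignores gap or stop symbols.

```python
def split_at_two_dots(text):
    """Split file text into (header, sequence) at the first '..'.

    If no two-dot divider is present, the whole text is treated as
    raw sequence data with an empty header.
    """
    header, divider, body = text.partition("..")
    if not divider:
        header, body = "", text
    # Assumption: only letters are sequence symbols, so position
    # numbers, spaces, and line breaks are dropped.
    sequence = "".join(ch for ch in body if ch.isalpha())
    return header.strip(), sequence


edited = "My clone, trimmed by hand\n..\n       1  acgtacgtac gtacg\n"
print(split_at_two_dots(edited))
print(split_at_two_dots("ACGT ACGT\nAC"))
```

Note that a comment containing two consecutive dots would be mistaken for the divider in this sketch, so comments placed before the divider should avoid them.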

Exercise DNA Analysis - Sequence Editing & Exchange 3: use reformat to correct a corrupted E/GCG format sequence file
Copy new.seq to uncorrupt.seq .

prompt> cp new.seq uncorrupt.seq

Delete a few nucleotides from uncorrupt.seq using the pico UNIX editor (or vi, emacs, etc.). Save the changes as corrupt.seq .

Try to use translate on corrupt.seq . Notice the complaint.

prompt> translate corrupt.seq -outfile=corrupt.pep

Use reformat on corrupt.seq, saving the output as fixednew.seq . Try another translation.

prompt> reformat corrupt.seq -outfile=fixednew.seq
prompt> translate fixednew.seq -outfile=fixednew.pep

Display fixednew.pep to the screen to see the single letter amino acid code.

prompt> more fixednew.pep
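
For readers curious about what a translation involves, here is a minimal Python sketch using the standard genetic code. It is not the E/GCG translate programme, which offers more options than shown here; the function name translate_dna, the frame argument, and the use of "X" for codons containing ambiguity codes are assumptions made for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_dna(dna, frame=1):
    """Translate DNA to one-letter amino acid codes; '*' marks a stop."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame - 1, len(dna) - 2, 3))
    # Codons with ambiguity codes such as N are written as 'X' (assumption).
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


print(translate_dna("ATGGCCTAA"))
```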

reformat has other uses, too, for valid E/GCG files.

Exercise DNA Analysis - Sequence Editing & Exchange 4: use reformat to change the data format in an E/GCG format sequence file
Change the single letter amino acid code to the three letter code. Display the results to screen again.

prompt> reformat -ONEIntothree fixednew.pep -outfile=fixednew3lc.pep

prompt> more fixednew3lc.pep
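
The one-letter to three-letter conversion is a simple table lookup, sketched below in Python. The function name one_to_three is hypothetical, and two choices are assumptions rather than a description of reformat's output: residues are separated by spaces, and any symbol outside the 20 standard amino acids (including a stop symbol) is written as "Xaa".

```python
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def one_to_three(peptide, unknown="Xaa"):
    """Convert one-letter amino acid codes to three-letter codes."""
    return " ".join(ONE_TO_THREE.get(residue.upper(), unknown) for residue in peptide)


print(one_to_three("MKV"))
```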

 



Please continue with Part 5 - Sequence Databases.


Comments? Questions? Accolades?
Please send them to David Featherston ( dwf@biobase.dk )
Updated on Thursday, 24 October, 1996
Copyright © 1995-1996 by Gary Williams, Peter Woollard, & David W. Featherston