European Molecular Biology Computing Network - Biocomputing Tutorials

Sequence Editing and Exchange

Sequence exchange programmes
Exercise 1: begin E/GCG and convert a Staden format sequence

E/GCG DNA sequence file format
Entering a new sequence
Exercise 2: use seqed to create and enter a new E/GCG format sequence file

Reformatting a corrupted sequence file
Exercise 3: use reformat to correct a corrupted E/GCG format sequence file
Exercise 4: use reformat to change the data format in an E/GCG format sequence file

Sequence exchange programmes

One inconvenience of having a number of different DNA sequence analysis packages available is that they use different formats for storing more or less the same information. Further, most packages refuse to accept even the simplest files from one another. And finally, the DNA sequence libraries (EMBL, Genbank, GSDB, and DDBJ) each have their own sequence file format, incompatible with both Staden and E/GCG.

Recognising this inconvenience, the developers of Staden and E/GCG have built small exchange programmes that inter-convert DNA sequence files among a number of different formats. There are several (23, in fact) exchange programmes in the E/GCG suite; the table below details these programmes and the formats they convert.

**E/GCG Sequence File Conversion Programmes**
"Foreign" Format	E/GCG Pgms	Comments
Staden	fromstaden	for Staden Package files
	tostaden
	efromstaden	Staden sequences from a fasta output
	etostaden	= tostaden + command line control
IntelliGenetics (SEQ)	fromig	for IntelliGenetics files
	toig
EMBL	fromembl	EMBL sequence database
	toembl
Genbank	fromgenbank	Genbank sequence database
	togenbank
PIR	frompir	PIR sequence database
	topir
	topirall	for a list of sequence files
Fasta & Blast	fromfasta	for FASTA files
	efromfasta
	toblast	makes a blast sequence set for submission to BLAST
Misc.	totext	plain text output
	toprimer	for PRIMER
	torelate	for NBRF RELATE
	getseq	gets sequence from a local terminal to a remote E/GCG host
	egetseq	= getseq + command line control
	reformat	for corrupted sequence files
	ereformat	for corrupted sequence, scoring matrix, or enzyme data files

Having explored sequence assembly and editing through the Staden Package programmes, let's convert a few of these Staden format DNA sequence files to E/GCG format. E/GCG conversion programmes generally take the sequence file to be converted as an argument, and prompt for the name of the file to be written.

Exercise 1: begin E/GCG and convert Staden sequence files to E/GCG format

Begin GCG by typing "gcg" at the prompt

prompt> gcg

The system should respond with a greeting.

Convert the Staden contig sequence file created in Sequence Assembly 1. Accept the default settings, but name the output file contigcg.seq. (Need a Staden format sequence file? Download one from the Datasets page.)

prompt> fromstaden contig.seq

Check the differences between the two sequence file formats.

prompt> more contig.seq

prompt> more contigcg.seq

These conversion programmes may also be used to trade data between databases, for use by other analysis programmes. For example, topir can be used to create PIR compatible files from Genbank sequences, by giving topir an argument that is a Genbank data library entry. (More on this feature in Sequence Databases.)

prompt> topir gb:hsfau
 
TOPIR writes GCG sequence(s) into a single file in PIR format. 
 
                  Begin (* 1 *) ?  
                End (*   518 *) ?  
               Reverse (* No *) ?  
 
 What should I call the output file (* hsfau.pir *) ?  
 
 HSFAU 518 characters.

prompt>

How would you convert contig.seq from Staden format to PIR format? Hint: use fromstaden, then topir. On-line help for both programmes is available via the commands

prompt> genhelp fromstaden

prompt> genhelp topir

E/GCG sequence file format

An E/GCG sequence file consists of a header and a sequence, separated by a dividing line that ends with two dots ("..").

This is my own file, with header and sequence.
I made it myself.
Good isn't it?
fred.seq  Length: 123  February 2, 1994  13:21  Type: N  Check: 1639  ..
 
       1  aaacccgggt ttatcgagcg tatcgatcga ctgagtcgta cgtcatatcg 
 
      51  tgactagcgt acgtacgtat gtgacgactg acgatgcgtg tatgcgtacg 
 
     101  tacgtgcagc agatgtgcag atg

The header contains all non-sequence information relevant to the data. In fact, if the file has been produced by one of the exchange programmes (fromstaden, fromembl, frompir, etc.), the header holds all the comments in the original database entry. The header section always ends with the name of the sequence file, its length, creation date and time, type of sequence, and the "Check:" number.
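As a rough illustration of this layout, the Python sketch below splits the text of an E/GCG file into header and sequence at the first line ending in two dots, and picks the "Length:" and "Check:" fields out of that line. The function name `split_gcg` and the regular expressions are assumptions made for this example; they are not part of E/GCG.

```python
import re


def split_gcg(text):
    """Split E/GCG file text into (header_lines, info, sequence).

    Assumes the header ends at the first line whose last
    non-blank characters are "..", as described above.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            divider = i
            break
    else:
        raise ValueError("no '..' divider line found")

    header = lines[:divider]

    # Pull the numeric fields out of the dividing line.
    info = {}
    for key in ("Length", "Check"):
        match = re.search(key + r":\s*(\d+)", lines[divider])
        if match:
            info[key] = int(match.group(1))

    # Sequence lines carry position numbers and blanks; keep letters only.
    body = "".join(lines[divider + 1:])
    sequence = re.sub(r"[^A-Za-z]", "", body)
    return header, info, sequence
```

One caveat of this simple rule: a comment line that happens to end in two or more dots would be mistaken for the divider.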

The "Check:" number just before the two dots is calculated from the sequence and is inserted automatically by the E/GCG program that produced it. When another E/GCG program reads in the sequence, it will see if the original "Check:" number matches the new one it creates from the sequence. If the two "Check:" numbers don't match, then it assumes the sequence has become corrupted and will stop with an error message.
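The checksum usually documented for GCG files (and implemented, for example, in Biopython's `Bio.SeqUtils.CheckSum.gcg`) weights each upper-cased character by a position counter that cycles from 1 to 57, sums the products, and keeps the result modulo 10000. The sketch below assumes E/GCG uses this same scheme; `gcg_checksum` is a name chosen for this example.

```python
def gcg_checksum(sequence):
    """Return a GCG-style checksum (0..9999) for a sequence string."""
    total = 0
    for position, char in enumerate(sequence):
        weight = position % 57 + 1  # weights cycle 1, 2, ..., 57, 1, ...
        total += weight * ord(char.upper())
    return total % 10000
```

Because each character is weighted by its position, deleting even one base shifts the weights of everything after it, so the recomputed value will almost always differ from the stored "Check:" number.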

It is therefore a very good idea to only use E/GCG programs to alter E/GCG sequence files. "Normal" text editors cannot calculate the "Check:" number for altered sequences. To edit E/GCG sequence files, use the GCG editor, seqed.

The sequence section of the file displays 50 nucleotides per line, grouping them in five sets of 10, and numbering the first nucleotide on each line.
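A minimal Python sketch of this layout follows. The function name `format_gcg_sequence` and the column width of the position number are assumptions read off the example file above; the blank lines that appear between rows in that file are omitted here.

```python
def format_gcg_sequence(sequence, per_line=50, group=10):
    """Lay out a sequence in numbered lines of space-separated blocks."""
    lines = []
    for start in range(0, len(sequence), per_line):
        chunk = sequence[start:start + per_line]
        blocks = [chunk[i:i + group] for i in range(0, len(chunk), group)]
        # Right-align the 1-based position of the first residue on the line.
        lines.append(f"{start + 1:>8}  " + " ".join(blocks))
    return "\n".join(lines)
```

For a 123-nucleotide sequence this gives three lines, numbered 1, 51, and 101, with the last line holding the remaining 23 nucleotides.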

Entering a new sequence

The program seqed is the editor used for entering and modifying single sequences in E/GCG.

To edit a new or existing E/GCG sequence file called seqfilename, enter seqed seqfilename at the UNIX prompt. To edit a protein sequence, add the switch "-protein" to the command. E.g.,

prompt> seqed -protein tubulin.seq

There are two areas on the seqed screen, corresponding to the two sections of an E/GCG sequence file: header and sequence. Four lines of the header are shown in the upper area, and 70 bases of sequence, plus an indication of the position in the entire sequence, are shown below. If the file given as the argument to the seqed command does not exist, seqed will create it and you will start in the header area. If the file does already exist, you will start editing the sequence.

prompt> seqed new.seq

new.seq                    ***** K E Y B O A R D *****                 seqed
    : This is a header                                                      :
    :                                                                       :
    :                                                                       :
    :                                                                       :
 
 
 
 
....|.........|.........|.........|.........|.........|.........|.........|....
    0        10        20        30        40        50        60        70
 
 
    ^
    |......|......|......|......|......|......|......|......|......|......|
    0     10     20     30     40     50     60     70     80     90     100
 
 
 Press ^D to quit editing header comments.

There are many keystroke and text commands in seqed. Check the GCG Manual seqed web page for complete details. Some essential seqed keystroke commands are shown in the table below.

**Essential `seqed` keystroke commands**
Keystroke	Function
A,G,C,N,R,T,Y	enter nucleotide
<DELETE>	delete
/TAA<RETURN>	find next occurrence of TAA
n <RIGHT CURSOR>	ahead 'n' characters
n <LEFT CURSOR>	back 'n' characters
n <RETURN>	go to sequence position 'n'
<CTRL>D	enter "command mode" (see below)

These keystroke commands are the basic ones for sequence entry and editing. seqed has another set of commands, with other, more global functions, which are entered as text. To use these text string commands, you must first type the keystroke command <CTRL>D to enter "command mode".

**Essential `seqed` "command mode" (text string) commands**
Text	Function
help	show help
exit	write the sequence file and exit
s,f delete	delete the block of sequence from position 's' (start) to 'f' (finish)
n include `file.seq`	include another sequence from `file.seq` at position 'n'
insert	insert new nucleotides at the present position; nucleotides to the right move
overstrike	change nucleotides at the present position; nucleotides to the right are replaced
n comment comments	adds your comments to position 'n'
n heading	edit line 'n' of the header
<RETURN>	re-enter keystroke mode (see above)

Exercise 2: use seqed to create and enter a new E/GCG format sequence file

Create a DNA sequence file called new.seq.

prompt> seqed new.seq

Describe the sequence in the header section. Type "<CTRL>D" to switch from the header to the sequence area.

Invent and enter some sequence (as long as you like), and go to position 20. Enter "command mode", and insert contigcg.seq at this position.

Back in keystroke mode, search for the first occurrence of the pattern "TATA" and note its position number. Now search for "AAAAA"; in "command mode", delete all sequence between these two patterns.

Exit and save new.seq.
Don't be shy about using the "command mode" help!

Display new.seq to the screen to see how seqed saved it.

prompt> more new.seq

On-line help for seqed is available via the command

prompt> genhelp seqed

You may also read the GCG manual seqed web page for more details.

Reformatting a corrupted sequence file

The programmes reformat and ereformat are probably the most useful in the packages. Their primary role is to take damaged or "edited elsewhere" sequence files (and scoring matrix or enzyme data files), files that aren't quite in E/GCG format, and turn them into useful files. One of these should be used whenever a sequence file has been edited by anything other than an E/GCG program like seqed.

You can take sequence data in any format from a file and separate existing comments from the sequence information by placing two dots ("..") between them. reformat will recognise such files as corrupted E/GCG format files and correct them. Comments become header lines, and the "Check:" number is computed and inserted before the two dots. If there are no comments, just raw sequence data in the file, reformat enters the two dots for you.

Exercise 3: use reformat to correct a corrupted E/GCG format sequence file

Copy new.seq to uncorrupt.seq.

prompt> cp new.seq uncorrupt.seq

Delete a few nucleotides from uncorrupt.seq using the pico UNIX editor (or vi, emacs, etc.). Save the changes as corrupt.seq.

Try to use translate on corrupt.seq. Notice the complaint.

prompt> translate corrupt.seq -outfile=corrupt.pep

Use reformat on corrupt.seq, saving the output as fixednew.seq. Try another translation.

prompt> reformat corrupt.seq -outfile=fixednew.seq

prompt> translate fixednew.seq -outfile=fixednew.pep

Display fixednew.pep to the screen to see the single letter amino acid code.

prompt> more fixednew.pep

reformat has other uses, too, for valid E/GCG files.
Exercise 4: use reformat to change the data format in an E/GCG format sequence file

Change the single letter amino acid code to the three letter code. Display the results to screen again.

prompt> reformat -ONEIntothree fixednew.pep -outfile=fixednew3lc.pep

prompt> more fixednew3lc.pep

On-line help for reformat and ereformat is available via the optional commands

prompt> genhelp reformat

prompt> egenhelp ereformat

You may also check the manual web pages for complete details: reformat & ereformat.

Please continue with Part 5 - Sequence Databases

Comments? Questions? Accolades?
Please send them to David Featherston (dwf@biobase.dk)