European Molecular
Biology Computing Network - Biocomputing Tutorials DNA Sequence Analysis Sequence
Comparison

Sequence Comparison


Table of Contents

Comparing two sequences


Comparing Two Sequences

There are three variations on the theme of sequence comparison. You can find the BEST region of similarity between two sequences, the best OVERALL alignment of two sequences, or ALL regions of similarity between them. E/GCG provides three programme sets for these three different tasks:

NB: Be careful when using these programmes; it is possible to align one sequence with any other, if you really want to. False alignments, and the research you plan using them, may have no biological significance!

We will begin with the most common use of bestfit - to find the best region of similarity between two distantly-related (but homologous) sequences.

Exercise DNA Analysis - Sequence Comparison 1: compare distantly-related sequences with bestfit
fetch rhodopsin mRNA sequences for the rat and the octopus.

prompt> fetch gb_in:pdrhod -out=pdrhod.gb_in
prompt> fetch gb_ro:rnops -out=rnops.gb_ro

Compare them along their entire lengths with bestfit. Accept the defaults for gap creation & extension penalties.
prompt> bestfit pdrhod.gb_in rnops.gb_ro -out=rhodop.pair

BestFit makes an optimal alignment of the best segment of similarity
between two sequences. Optimal alignments are found by inserting gaps to
maximize the number of matches using the local homology algorithm of
Smith and Waterman. 

                  Begin (* 1 *) ?  
                End (*  1675 *) ?  
               Reverse (* No *) ?  
 
                  Begin (* 1 *) ?  
                End (*  1493 *) ?  
               Reverse (* No *) ?  


 What is the gap creation penalty (* 5.00 *) ?  
 
 What is the gap extension penalty (* 0.30 *) ?  
 
 Aligning ..................................................
          ........................-.

          Gaps:     0
       Quality:  20.1
 Quality Ratio: 0.490
  % Similarity: 73.171
        Length:    41

prompt>
View the alignment.
prompt> more rhodop.pair
 BESTFIT of: pdrhod.ge_in  check: 8638  from: 1  to: 1675
 
LOCUS       PDRHOD       1675 bp    RNA             INV       12-SEP-1993
DEFINITION  Octopus mRNA for rhodopsin.
ACCESSION   X07797
NID         g9822
KEYWORDS    rhodopsin.
SOURCE      Octopus dofleini. . . . 
 
 to: rnops.ge_ro  check: 6230  from: 1  to: 1493
 
LOCUS       RNOPS        1493 bp    RNA             ROD       20-DEC-1994
DEFINITION  R.norvegicus mRNA for rhodopsin.
ACCESSION   Z46957
NID         g603874
KEYWORDS    rhodopsin.
SOURCE      Norway rat. . . . 

 Symbol comparison table: /usr/prog/gcg/gcgcore/data/rundata/swgapdna.cmp
 CompCheck: 5234

         Gap Weight:  5.000      Average Match:  1.000
      Length Weight:  0.300   Average Mismatch: -0.900

            Quality:   20.1             Length:     41
              Ratio:  0.490               Gaps:      0
 Percent Similarity: 73.171   Percent Identity: 73.171

 pdrhod.ge_in x rnops.ge_ro September 24, 1996 10:00  ..

                  .         .         .         . 
     982 TGTTTGCTAAAGCTTCAGCTATCCACAACCCAATTGTCTAC 1022
         | ||||||||  |  |  | ||| ||||||||||  |||||
     961 TCTTTGCTAAGACCGCCTCCATCTACAACCCAATCATCTAC 1001

prompt>
Find and view the second best region of similarity. (Hint!)

Although the small scale (41 bases) of this particular best region allows us to see all of it via more, send the result to the graphics window.

prompt> gapshow pdrhod.ge_in rnops.ge_ro -begin1=982 begin2=961 end1=1022 end2=1001

Note that matching bases are marked with vertical bars in the text file, but base mis-matches are marked in the graphics display. How can you change this? (Hint!)

Look!

 

gap is for aligning two sequences over their entire length. While it will work with distantly-related sequences (as in the example above), much of the alignment may have little to no biological significance. Instead, we will align two more closely-related rhodopsin mRNAs.

Exercise DNA Analysis - Sequence Comparison 2: align closely-related sequences with gap
lookup and fetch the rabbit rhodopsin mRNA sequence.

prompt>lookup -library=genbank -definition=rhodopsin -organism="Oryctolagus cuniculus" -out=rabbitrh.list
prompt> more rabbitrh.list

LOOKUP in: genbank  of: "([SQ-DEF: rhodopsin*] & [SQ-ORG: oryctolagus cuniculus*])"

 1 entry  September 24, 1996 14:52 ..

gb:OCU21688 ! ID: a4f20006
! DEFINITION  Oryctolagus cuniculus rhodopsin mRNA, complete cds.
prompt> fetch gb:OCU21688 -out=ocops.gb_om

Align the rabbit and rat sequences with gap. Accept the defaults for gap creation & extension penalties. Include output files for use by gapshow.

prompt> gap ocops.gb_om rnops.gb_ro -out=rhodopm.pair -outfile2=ocops.gap -outfile3=rnops.gap

Gap uses the algorithm of Needleman and Wunsch to find the alignment of
two complete sequences that maximizes the number of matches and minimizes
the number of gaps. 

                  Begin (* 1 *) ?
                  End (*  1198 *) ?
                  Reverse (* No *) ?
  
                  Begin (* 1 *) ?
                  End (*  1493 *) ?
                  Reverse (* No *) ?  

 What is the gap creation penalty (* 5.00 *) ?  

 What is the gap extension penalty (* 0.30 *) ?  

 Aligning ..................................................
          .........-......

          Gaps:     5
       Quality: 1004.7
 Quality Ratio: 0.839
  % Similarity: 86.880
        Length:  1502

prompt>
View the alignment as text and as a graphic.

prompt> more rhodopm.pair
prompt> gapshow ocops.gap rnops.gap

Look!

 

compare, together with the graphing programme dotplot, is used to show regions of similarity within a sequence or between two sequences.

In the example sequences for bestfit, the two distantly-related rhodopsin mRNAs showed a best alignment region having ~73% similarity, and a second best one with ~70%; overall, though, these two sequences have only ~43% similarity (data from gap not shown). Thus, for compare to show only the best regions of similarity for the two distantly-related sequences, we need to use a stringency of between 60% & 70% matching bases. When compare checks for the percentage of matching bases, it does so in every possible comparison register, and within a window, i.e., a certain number of bases at a time. In a window of size 10, at least 6 to 7 bases must match (our best alignment region stringency conditions) for compare to score a "hit" between the two sequences.

Exercise DNA Analysis - Sequence Comparison 3: check for sequence similarities between two distantly-related sequences with compare
Using a window size of 10 and a stringency of 70%, find the best region of similarity between the octopus and rat rhodopsin mRNAs. Give the two input sequence files as arguments for the command, accept the defaults for the beginning and ending points, and respond to the "What comparison window size" prompt with 10, and the"What stringency" prompt with 7. View the output file from compare with dotplot, accepting its defaults.

prompt> compare pdrhod.ge_in rnops.ge_ro
...
prompt> dotplot pdrhod.pnt

The graphic output from dotplot shows the two distantly-related sequences as the two axes of a plot. A point on the plot - the dots - indicates where 7 or more bases out of 10 are identical for an alignment of the bases (the co-ordinates) at that point. Our expectation for homologous sequences is a diagonal line running from lower left to upper right. In this plot, however, there are over 10,000 points, scattered everywhere. Although the best region, originally found with bestfit is found again (the small diagonal line highlighted with the purple circle), many other small diagonal lines appear, as well as many spurious, non-diagonal matches, making the plot "noisy".

Decrease the noise by increasing the window size. Find only the best regions by using a window of 41 and a stringency of ~68%.

prompt> compare pdrhod.ge_in rnops.ge_ro -win=41 -stri=28 -out=rhod4128.pnt
...
prompt> dotplot rhod4128.pnt

This time there are only 11 points on the plot, one cluster for the best region (co-ords ~1,000;1,000) and one for the second best (co-ords ~1,400;1,200). At this window size, the stringency is too high for a diagonal line to form. But there is no noise.

Generally, distantly-related sequences reveal their significant homologies when the window size is high and the stringency is low. With closely-related sequences, a medium size window with high stringency is best. E/GCG recommends the default window size of 21 and stringency of 67% (14) only as a starting point.

Increase the window size and lower the stringency.

prompt> compare pdrhod.ge_in rnops.ge_ro -win=100 -stri=40 -out=rhod0040.pnt
...
prompt> dotplot rhod0040.pnt

Compare the closely-related rhodopsin sequences from rabbit and rat. Try the programme defaults first, and then explore other settings.

prompt> compare ocops.ge_om rnops.ge_ro -out=rhodm2114.pnt
...
prompt> dotplot rhodm2114.pnt

Finally, compare & dotplot can show regions of similarity within a sequence by comparing a sequence with itself or with its reverse compliment. For example, the 8 base duplication generated on either side of a P-element when it inserts can be seen as two short, off-centre diagonals, equidistant from the main diagonal.

View the structure of the degenerate CCA tandem repeat in pdrhod.ge_in between bases 1220 and 1400 using compare & dotplot.

prompt> compare pdrhod.ge_in pdrhod.ge_in -out=pdrhod2114.pnt
...
prompt> dotplot pdrhod2114.pnt -all

Look!


Table of Contents Please continue with Part 8 - Searching Databases - 1   Searching Databases 1

Comments? Questions? Accolades? Comments? Questions? Accolades?
Please send them to David Featherston Please   ( dwf@biobase.dk )
Updated on Thursday, 24 October, 1996
Copyright © 1995-1996 by Gary Williams, Peter Woollard, &David W. Featherston