Use e-PCR to map sequences using STS
database
Use re-PCR to map STSes or short primers in sequence
database
Use famap and fahash to prepare
sequence database for re-PCR searches.
Forward e-PCR
Example
work> e-PCR -w9 -f 1 -m100 mystsdb.sts D=100-400 myfastafile.fa N=1 G=1 T=3
Synopsis
e-PCR [-hV] [posix-options] stsfile [fasta ...] [compat-options]
where posix-options are:
-m ## Margin (default 50)
-w ## Wordsize (default 7)
-n ## Max mismatches allowed (default 0)
-g ## Max indels allowed (default 0)
-f ## Use ## discontiguous words
-o ## Set output file
-t ## Set output format:
1 - classic, range (pos1..pos2)
2 - classic, midpoint
3 - tabular
4 - tabular with alignment in comments (slow)
-d ##-## Set default sts size
-p +- Turn hits postprocess on/off
-v +- Verbose on/Off
-a a|f Use precise alignments (only if gaps>0), slow
a - Always or f - as Fallback
-x +- Use 5'-end lowercase masking of primers (default -)
-u +- Uppercase all primers (default -)
and compat-options (duplicate posix-options) are:
M=## Margin (default 50)
W=## Wordsize (default 7)
N=## Max mismatches allowed (default 0)
G=## Max indels allowed (default 0)
F=## Use ## discontiguous words
O=## Set output file to ##
T=## Set output format (1..4)
D=##-## Set default sts size
P=+- Postprocess hits on/off
V=+- Verbose on/Off
A=a|f Use precise alignments (only if gaps>0), slow
a - Always or f - as Fallback
X=+- Use 5'-end lowercase masking of primers (default -)
U=+- Uppercase all primers (default -)
-mid Same as T=2
Description
e-PCR parses stsfile in unists
format, then reads nucleotide sequence data in
FASTA format from the files listed on the
command line, if any, or from stdin otherwise. For each input
sequence e-PCR finds matches and prints output in one of
four formats.
Options
Two sets of options are used: POSIX-compatible options and
old-style options provided for compatibility with previous versions of
e-PCR.
POSIX-style options can appear only before the first
parameter that does not start with '-'. The argument '--' explicitly stops
the parsing of arguments as POSIX options.
Compatibility options can appear anywhere on the command line.
'-mid' can appear anywhere and does not stop POSIX option
recognition.
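The ordering rules above can be sketched as a small command builder. This is only an illustration: build_epcr_command is a hypothetical helper, not part of e-PCR, and it assumes that values may be attached directly to POSIX flags, as in the -w9 and -m100 forms used in the example at the top of this document.

```python
from typing import Dict, List, Optional, Sequence


def build_epcr_command(
    stsfile: str,
    fasta_files: Sequence[str] = (),
    posix_options: Optional[Dict[str, str]] = None,
    compat_options: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble an e-PCR argument list (hypothetical helper)."""
    cmd = ["e-PCR"]
    # POSIX-style options go first: they are only recognised before
    # the first parameter that does not start with '-'.
    for flag, value in (posix_options or {}).items():
        cmd.append(f"-{flag}{value}")
    # Positional parameters: the STS file, then any FASTA files.
    cmd.append(stsfile)
    cmd.extend(fasta_files)
    # Compatibility options (KEY=value) may appear anywhere;
    # here they are simply appended at the end.
    for key, value in (compat_options or {}).items():
        cmd.append(f"{key}={value}")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_epcr_command(
        "mystsdb.sts",
        ["myfastafile.fa"],
        posix_options={"w": "9", "m": "100"},
        compat_options={"N": "1", "G": "1", "T": "3"},
    )))
```

The resulting list can be passed to subprocess.run without a shell, which avoids quoting problems with file names.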
General options
- -V
- Print version, exit after parsing
commandline
- -h
- Print help, exit after parsing
commandline
Hash building options
- -w wordsize | W=wordsize
- Set word size for
primers hash (nucleotide positions). Longer word size decreases
hash collision rate, but increases memory usage. Also no
mismatches are allowed within word size near "inner" boundary of
primers unless one uses discontiguous words, and no
gaps are ever allowed in that region.
- -f wordcnt | F=wordcnt
- Set discontiguous word
count for primers hash (1 means "use contiguous
words"). Discontiguous words increase the number of hash
tables and decrease the "effective" word size (thus increasing
the hash collision rate), so they make the search significantly slower,
but increase sensitivity by allowing mismatches within
the word size. Reasonable values are 1 (contiguous words)
and 3.
- -d lo-hi | D=lo-hi
- Set default STS size
range: the values used for STSs that have no size associated
in the file.
Hit quality options
- -m margin | M=margin
- Set maximal allowed
deviation of hit product size from expected STS size.
- -n mism | N=mism
- Set maximal number of
mismatches allowed in primer-to-sequence alignment
(per primer!).
- -g gaps | G=gaps
- Set maximal number of
gaps allowed in primer-to-sequence alignment (per primer!).
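As an illustration of how these three limits combine, the sketch below checks a candidate hit. It is not e-PCR's code: the function name hit_passes is hypothetical, and it assumes that the margin widens the expected size range symmetrically (lo - margin to hi + margin) and that the mismatch and gap limits apply to each primer separately, as the "(per primer!)" notes suggest.

```python
from typing import Tuple


def hit_passes(
    observed_size: int,
    lo: int,
    hi: int,
    left: Tuple[int, int],
    right: Tuple[int, int],
    margin: int = 50,
    max_mism: int = 0,
    max_gaps: int = 0,
) -> bool:
    """Return True if a hit satisfies the size, mismatch and gap limits.

    left and right are (mismatches, gaps) pairs for the two primers.
    Defaults mirror the documented defaults of -m, -n and -g.
    """
    # Assumption: the margin is applied on both sides of the expected range.
    if not (lo - margin <= observed_size <= hi + margin):
        return False
    # Limits are checked per primer, not on the sum over both primers.
    for mism, gaps in (left, right):
        if mism > max_mism or gaps > max_gaps:
            return False
    return True
```

For example, with -n 1 a hit with one mismatch in each primer would pass this check, while a hit with two mismatches in one primer and none in the other would not.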
Alignment algorithms options
- -a a|f | A=a|f
- Use NW algorithm to align
primers to sequence: a - always, f - as fallback if fast
algorithm gives no hit at this position.
- -x +|- | X=+|-
- Turn on/off recognising of
lowercase characters at 5'-ends of primers as nucleotides
that don't need to be aligned to sequence (floppy tails).
- -u +|- | U=+|-
- Uppercase primers. Use
with files prepared for -x+ mode when full
primer alignment is required.
If the STS file contains primers with lowercase characters, you have
to use either the -x+ or the -u+ flag.
Report options
- -o output | O=output
- Set output file.
- -t 1|2|3|4 | T=1|2|3|4
- Set output format.
- -p +|- | P=+|-
-
Set hit grouping on/off: when using discontiguous words
and gaps, some hits may be reported multiple times with
slightly different quality. When grouping is on, only the
best hit of each group of overlapping hits is reported. The default
depends on the F and G values.
- -v +|- | V=+|-
-
Report sequence ids to stderr on/off.
Output formats
- 1: Traditional: reports whitespace-separated
-
- Sequence FASTA identifier
- POS1..POS2 -- start and end positions of hit
(includes length of floppy tail)
- STS identifier (col. 1 from STS file)
- STS description (columns 5..last from STS file)
In this format the product size equals POS2-POS1+1
- 2: Traditional midpoint: reports whitespace-separated
-
- Sequence FASTA identifier
- POS -- middle point position of hit
- STS identifier (col. 1 from STS file)
- STS description (columns 5..last from STS file)
- 3: Tab-separated detailed
-
- Sequence FASTA identifier
- STS identifier (col. 1 from STS file)
- +|- -- strand of hit (order of primers in hit)
- POS1 -- start position of hit (does not include
floppy tail if any)
- POS2 -- end position of hit (does not include
floppy tail)
- SIZE/MIN..MAX -- observed size of hit/expected
size range of STS
- MISM -- Total number of mismatches for two primers
- GAPS -- Total number of gaps for two primers
- STS description (columns 5..last from STS file)
In this format the product size may be greater than
POS2-POS1+1 for probes with floppy tails
- 4: Tab-separated detailed with alignment
-
Same as format 3, but also contains visualisations of
alignments in comment lines (lines starting with ``#'')
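The tabular formats (3 and 4) are convenient to post-process. The following is a minimal parsing sketch based only on the field list given above; the EpcrHit class and parse_tabular_line function are hypothetical names, and the sample line in the usage part is made up for illustration, not real e-PCR output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EpcrHit:
    seq_id: str
    sts_id: str
    strand: str
    pos1: int
    pos2: int
    size: int        # observed product size
    size_min: int    # expected size range of the STS
    size_max: int
    mismatches: int  # total for two primers
    gaps: int        # total for two primers
    description: str


def parse_tabular_line(line: str) -> Optional[EpcrHit]:
    """Parse one line of output format 3 or 4; return None for comments."""
    # Format 4 adds alignment visualisations in lines starting with '#'.
    if line.startswith("#") or not line.strip():
        return None
    fields = line.rstrip("\n").split("\t")
    seq_id, sts_id, strand, pos1, pos2, size_field, mism, gaps = fields[:8]
    # The description covers STS file columns 5..last, so it may span
    # several tab-separated fields.
    description = "\t".join(fields[8:])
    # SIZE/MIN..MAX
    size, _, size_range = size_field.partition("/")
    size_min, _, size_max = size_range.partition("..")
    return EpcrHit(seq_id, sts_id, strand, int(pos1), int(pos2),
                   int(size), int(size_min), int(size_max),
                   int(mism), int(gaps), description)


if __name__ == "__main__":
    sample = "chr1\tSTS001\t+\t1000\t1199\t200/100..400\t1\t0\tmy marker"
    print(parse_tabular_line(sample))
```

Keeping the observed size as its own field matters because, for primers with floppy tails, it can be larger than pos2 - pos1 + 1.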
Exit codes
Zero on success, nonzero on failure
Reverse e-PCR
Example
work> famap -tN -b genome.famap org/chr_*.fa
work> fahash -b genome.hash -w 12 -f3 ${PWD}/genome.famap
work> re-PCR -s genome.hash -n1 -g1 ACTATTGATGATGA AGGTAGATGTTTTT 120-200
Synopsis
famap [-hV]
famap -b mmapped-file [-t cvt] [fasta-file ...]
famap -d mmapped-file [ord ...]
famap -l mmapped-file [ord ...]
where cvt is one of: off n N nx NX
fahash [-hV]
fahash -b hash-file [build-options] mmapped-file ...
fahash -T hash-file [-o output]
where:
-b hash-file Build hash tables (hash-file) from sequence files,
-T hash-file Print word usage statistics for hash-file
-o outfile Set output file name for -T
build-options:
-w wordsize Set word size when building hash tables
-f period Set discontiguity when building hash tables
-k Skip repeats when building indexfile
-F min,max Set watermarks for fragment size (in Mb) for -v1
-v 1|2 Build file of format version 1 or 2
-c cachesize Use cache size cachesize (for -v2)
re-PCR [-hV]
re-PCR -p hash-file [-g gaps] [-n mism] [primer ...]
re-PCR -P hash-file [-g gaps] [-n mism] [primer-file ...]
re-PCR -s hash-file [search-options] [-o output] [left right lo hi [...]]
re-PCR -S hash-file [search-options] [-o output] [-C bcnt] [stsfile ...]
where:
-p hash-file Perform primer lookup using hash-file, primers in cmdline
-P hash-file Perform primer lookup using hash-file, primers in file
-s hash-file Perform STS lookup using hash-file, STSs in cmdline
-S hash-file Perform STS lookup using hash-file, STSs in file
search-options:
-n mism Set max allowed mismatches per primer for lookup
-g gaps Set max allowed indels per primer for lookup
-m margin Set variability for STS size for lookup
-d min-max Set default STS size (for STSs without size set)
-r +|- Enable/disable reverse STS lookup
-O +|- Enable/disable syscall optimisation
-C batchcnt Set number of STSes per batch
-o outfile Set output file name
Description
Reverse e-PCR (re-PCR) performs STS or
primer lookup against a sequence database. Two files are
required for the database: a mmapped-file with sequence data in a fast
random-access format, and a hash-file that keeps
precalculated positions of all words of the sequence
database.
Use famap to build mmapped-file from FASTA
files.
Use fahash to build hash-file and to print
word usage statistics.
Use re-PCR to perform STS and primer searches.
Discontiguous words are supported by re-PCR as well as
contiguous.
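The three steps can be scripted. The sketch below only assembles the argument lists, mirroring the flags used in the example at the start of this section; the function names are hypothetical. It passes the mmapped-file to fahash as an absolute path, as the example does with ${PWD}, because the path is stored in the hash-file as given.

```python
import os
from typing import List, Sequence, Tuple


def build_database_commands(
    fasta_files: Sequence[str],
    famap_file: str,
    hash_file: str,
    wordsize: int,
    wordcnt: int,
) -> Tuple[List[str], List[str]]:
    """Return the famap and fahash argument lists (hypothetical helper)."""
    # Step 1: famap builds the mmapped-file; -tN uppercases the input
    # and converts other letters to N.
    famap_cmd = ["famap", "-tN", "-b", famap_file, *fasta_files]
    # Step 2: fahash builds the hash-file. The mmapped-file is given as
    # an absolute path so that later searches can find it.
    fahash_cmd = ["fahash", "-b", hash_file, "-w", str(wordsize),
                  f"-f{wordcnt}", os.path.abspath(famap_file)]
    return famap_cmd, fahash_cmd


def build_sts_lookup(
    hash_file: str, left: str, right: str, size_range: str,
    mism: int, gaps: int,
) -> List[str]:
    """Return the re-PCR argument list for one STS given on the command line."""
    # Step 3: re-PCR -s looks up an STS given as two primers and a size range.
    return ["re-PCR", "-s", hash_file, f"-n{mism}", f"-g{gaps}",
            left, right, size_range]


if __name__ == "__main__":
    famap_cmd, fahash_cmd = build_database_commands(
        ["org/chr_1.fa"], "genome.famap", "genome.hash", 12, 3)
    lookup_cmd = build_sts_lookup(
        "genome.hash", "ACTATTGATGATGA", "AGGTAGATGTTTTT", "120-200", 1, 1)
    for cmd in (famap_cmd, fahash_cmd, lookup_cmd):
        print(" ".join(cmd))
```

Each list can be run with subprocess.run(cmd, check=True), which raises an error when the tool exits with a non-zero code.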
Options
Common options
- -V
- Print version, exit after parsing
commandline
- -h
- Print help, exit after parsing
commandline
famap options
- -b mmapped-file
- Build famap-file from input fasta
file(s). If no fasta files are set in commandline, use
stdin as input.
- -d mmapped-file
- Dump famap-file contents in
fasta format. If ord number(s) are set, print only
sequences with given ordinals.
- -l mmapped-file
- List famap-file sequence
identifiers. If ord number(s) are set, print only
sequences with given ordinals.
- -t cvt-table
- Use compiled-in table to
convert input.
- n
- Nucleotides. Allowed characters are
[actgACTGnN]. Other letters are converted to n or N.
Rest of symbols are ignored. Case is preserved.
- nx
- Nucleotides with extended IUPAC ambiguity
codes (iupac_na); lowercase is allowed. Other letters
are converted to n or N.
Rest of symbols are ignored. Case is preserved.
- N
- Nucleotides. Allowed characters are
[ACTGN]. [actgn] are converted to uppercase.
Other letters are converted to N.
Rest of symbols are ignored.
- NX
- Nucleotides with extended IUPAC ambiguity
codes (iupac_na); lowercase is converted to uppercase.
Other letters are converted to N.
Rest of symbols are ignored.
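To make the conversion rules concrete, here is a small emulation of the n and N tables as they are described above. It is an illustration of the documented behaviour, not famap's actual code; convert_nucleotides is a hypothetical name, and for the n table it assumes that an unknown lowercase letter becomes n and an unknown uppercase letter becomes N.

```python
def convert_nucleotides(seq: str, table: str = "N") -> str:
    """Emulate the documented famap 'n' and 'N' conversion tables."""
    if table not in ("n", "N"):
        raise ValueError("only the 'n' and 'N' tables are sketched here")
    out = []
    for ch in seq:
        # "Rest of symbols are ignored": skip anything that is not a letter.
        if not (ch.isascii() and ch.isalpha()):
            continue
        if table == "N":
            # [actgn] are uppercased; other letters become N.
            upper = ch.upper()
            out.append(upper if upper in "ACGTN" else "N")
        else:
            # Case is preserved; other letters become n or N
            # (assumed to follow the case of the input letter).
            if ch in "acgtnACGTN":
                out.append(ch)
            else:
                out.append("N" if ch.isupper() else "n")
    return "".join(out)


if __name__ == "__main__":
    print(convert_nucleotides("acgtRy-12 N", "N"))
    print(convert_nucleotides("acgtRy-12 N", "n"))
```

The nx and NX tables would differ only in the set of letters that are kept unchanged (the extended IUPAC codes).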
fahash options
- -b hash-file
- Build hash-file for
mmapped-file(s).
- -T hash-file
- Dump word usage statistics for
hash-file.
- -v version
- Build hash-file of version 1 or 2
(2 is default).
- -w wordsize
- Build hash-file for word
wordsize nucleotides long.
- -f wordcnt
- Build hash-file for
wordcnt discontiguous words. 1 stands for
contiguous words.
- -F min,max
- Use memory watermarks (Mbytes)
for hash table size (for -v 1).
- -c cachesize
- Set cache size for -v 2.
- -o output-file
- Write the result of -T to
output-file.
re-PCR commands
- -p hash-file
- Perform lookup for primers
given in commandline.
- -s hash-file
- Perform lookup for STSes
given in commandline.
- -S hash-file
- Perform lookup for STSes
taken from unists file(s) given in commandline.
Search options
- -n mism
- Number of mismatches allowed per
primer.
- -g gaps
- Number of gaps allowed per
primer.
- -m margin
- Maximal deviation of observed
product size to expected STS size.
- -d lo-hi
- Set default STS size
range: the values used for STSs that have no size associated
in the file.
- -r +|-
- Enable|disable flipped STS lookup
(default is "enabled").
- -O +|-
- Enable|disable syscall optimisation.
Since lookup is i/o expensive, enabling this parameter may
improve search performance diskwise. On the other hand, it
takes significantly more memory and CPU.
- -C batchcount
- Number of STSs from the input file
to look up in one pass. May affect performance, especially
when used with -O +.
- -o output-file
- Use output-file for output.
Output format
Tab-separated file with the following fields:
For primer lookup
- Primer ID
- Sequence ID
- Strand
- Hit start
- Hit end
- Mismatches
- Gaps
- Size
For STS lookup
- STS ID
- Sequence ID
- Strand
- Hit start
- Hit end
- Mismatches
- Gaps
- Observed Size/Expected size range
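Since both result types are tab-separated with eight fields, they can be read with a generic reader. The sketch below names the columns after the lists above; read_hits and the column names are hypothetical. The last STS column is kept as a raw string because its exact layout is not specified here, skipping lines that start with '#' is an assumption, and the sample row in the usage part is made up.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, List, Union

PRIMER_FIELDS = ["primer_id", "seq_id", "strand", "start", "end",
                 "mismatches", "gaps", "size"]
STS_FIELDS = ["sts_id", "seq_id", "strand", "start", "end",
              "mismatches", "gaps", "size_info"]
INT_FIELDS = ("start", "end", "mismatches", "gaps")


def read_hits(
    stream: Iterable[str], fields: List[str]
) -> Iterator[Dict[str, Union[str, int]]]:
    """Yield one dict per re-PCR output row, keyed by the given field names."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Assumption: blank lines and '#' comment lines carry no hits.
        if not row or row[0].startswith("#"):
            continue
        hit: Dict[str, Union[str, int]] = dict(zip(fields, row))
        for name in INT_FIELDS:
            hit[name] = int(hit[name])
        yield hit


if __name__ == "__main__":
    sample = "STS001\tchr1\t+\t1000\t1199\t1\t0\t200/120-200\n"
    for hit in read_hits(io.StringIO(sample), STS_FIELDS):
        print(hit)
```

The same function reads primer lookup results when called with PRIMER_FIELDS.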
Exit codes
Zero on success, non-zero on errors
Bugs and features
- The mmapped-file path is stored in the hash-file exactly as it
was given on the command line when the hash-file was built. When
performing searches, the mmapped-file must therefore be
accessible under that same name from the current directory.
- Mmapped-file is a proprietary format, that could be
substituted with megablast database format, but is not
(yet?) for performance reasons.
- If sequence sizes are large, it may be tricky to
create database with discontiguous words because of memory
usage requirements. Changing parameter -F (for -v 1) or -c
(for -v 2) may help.