SSS - Sequence Similarity Search service

HGC Sequence Similarity Search service manual

Execute several sequence similarity search programs against various biological sequence databases supported at Human Genome Center.

Manual pages

NAME
     exonerate - a generic tool for sequence comparison

SYNOPSIS
     exonerate [ options ] <query path>

DESCRIPTION
     exonerate is a general tool for sequence comparison.

     It uses the C4 dynamic programming library.  It is  designed
     to  be  both general and fast.  It can produce either gapped
     or ungapped alignments, according to a variety of  different
     alignment  models.  The C4 library allows sequence alignment
     using a reduced space full dynamic  programming  implementa-
     tion,  but  also  allows  automated generation of heuristics
     from the alignment models, using bounded sparse dynamic pro-
     gramming,  so that these alignments may also be rapidly gen-
     erated.  Alignments generated using  these  heuristics  will
     represent  a  valid  path  through  the alignment model, yet
     (unlike the exhaustive  alignments),  the  results  are  not
     guaranteed to be optimal.

CONVENTIONS
     A number of conventions (and idiosyncrasies) are used within
     exonerate.  An understanding of them facilitates interpreta-
     tion of the output.

     Coordinates
          An in-between coordinate  system  is  used,  where  the
          positions  are counted between the symbols, rather than
          on the symbols.   This  numbering  scheme  starts  from
          zero.   This  numbering is shown below for the sequence
          "ACGT":

           A C G T
          0 1 2 3 4

          Hence the subsequence "CG" would have  start=1,  end=3,
          and  length=2.   This  coordinate system is used inter-
          nally in exonerate, and for all the output formats pro-
          duced with the exception of the "human readable" align-
          ment display and the GFF output  where  convention  and
          standards dictate otherwise.
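          As an illustration only, the in-between numbering above
          can be related to ordinary 1-based positions with a short
          Python sketch.  The function name is hypothetical and is
          not part of exonerate.

```python
def inbetween_to_one_based(start, end):
    """Convert an in-between (start, end) pair to 1-based inclusive positions.

    Only meaningful for non-empty spans (start < end).
    """
    if not 0 <= start < end:
        raise ValueError("expected 0 <= start < end")
    # The first symbol of the span lies just after boundary `start`;
    # the last symbol lies just before boundary `end`.
    return start + 1, end


seq = "ACGT"
start, end = 1, 3
# Python slices happen to use the same in-between convention.
print(seq[start:end], end - start)         # CG 2
print(inbetween_to_one_based(start, end))  # (2, 3)
```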

     Reverse Complements
          When an alignment is reported on the reverse complement
          of  a sequence, the coordinates are simply given on the
          reverse complement copy of the sequence.   Hence  posi-
          tions  on the sequences are never negative.  Generally,
          the forward strand is indicated  by  '+',  the  reverse
          strand  by '-', and an unknown or not-applicable strand
          (as in the case of a protein sequence) is indicated  by
          '.'
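          If coordinates are given on the reverse complement copy
          as described above, they can be mapped back to the
          forward strand by subtracting from the sequence length.
          The following Python sketch assumes in-between
          coordinates; the helper names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def revcomp_to_forward(start, end, length):
    """Map an in-between span on the reverse complement to the forward strand."""
    return length - end, length - start


forward = "AACGT"
rc = reverse_complement(forward)              # "ACGTT"
start, end = 0, 2                             # "AC" on the reverse complement
f_start, f_end = revcomp_to_forward(start, end, len(forward))
print(f_start, f_end)                         # 3 5
# The same bases, read from the forward strand and reverse complemented:
print(reverse_complement(forward[f_start:f_end]) == rc[start:end])  # True
```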

     Alignment Scores
          Currently, only the raw alignment scores are displayed.
          This  score  is simply the sum of the transition scores used
          in the dynamic programming.  For example, in  the  case
          of  a Smith-Waterman alignment, this will be the sum of
          the substitution matrix scores and the gap penalties.
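          A minimal Python sketch of such a sum for an affine gap
          alignment follows.  The score values and the convention
          that the first gap column costs the open penalty and each
          further column costs the extend penalty are assumptions
          made for illustration; they are not taken from exonerate.

```python
def raw_score(query_row, target_row, match=5, mismatch=-4,
              gap_open=-12, gap_extend=-4):
    """Sum per-column scores of a pairwise alignment given as two gapped rows."""
    if len(query_row) != len(target_row):
        raise ValueError("alignment rows must have equal length")
    total = 0
    in_gap = False
    for q, t in zip(query_row, target_row):
        if q == "-" or t == "-":
            # Simplification: adjacent gap columns count as one gap,
            # whichever row they occur in.
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += match if q == t else mismatch
            in_gap = False
    return total


print(raw_score("ACGT-A", "ACCTGA"))  # 4
```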

GENERAL OPTIONS
     Most arguments have short and long forms.   The  long  forms
     are  more likely to be stable over time, and hence should be
     used in scripts which call exonerate.

     -h | --shorthelp <boolean>
          Show help.  This will display a concise summary of  the
          available options, defaults and values currently set.

     --help <boolean>
          This shows all the help options including the defaults,
          the  value  currently set, and the environment variable
          which may be used to set each parameter.  There will be
          an  indication  of which options are mandatory.  Manda-
          tory options have no default, and  must  have  a  value
          supplied  for  exonerate  to run.  If mandatory options
          are used in order, their flags may be skipped from  the
          command  line  (see  examples  below).  Unlike this man
          page, the information from this option will  always  be
          up to date with the latest version of the program.

     -v | --version <boolean>
          Display the version number.  Also displays other infor-
          mation such as the build date and glib version used.

SEQUENCE INPUT OPTIONS
     Pairwise comparisons will be  performed  between  all  query
     sequences and all target sequences.  Generally, for the best
     performance, shorter sequences  (eg.  ESTs,  shotgun  reads,
     proteins)  should be used as the query sequences, and longer
     sequences (eg. genomic sequences) should be used as the tar-
     get sequences.

     -q | --query  <paths>
          Specify the query sequences required.  These must be in
          a FASTA format file.  Single or multiple query sequences
          may be supplied.  Additionally, multiple FASTA files may
          be supplied following a single --query flag, or by using
          multiple --query flags.

     -t | --target <paths>
          Specify the target sequences required.  These must also
          be in a FASTA format file.  As with the query sequences,
          single or multiple target sequences and  files  may  be
          supplied.

     -Q | --querytype <sequence type>
          Specify the query type to use.  If  this  is  not  sup-
          plied,  the  query  type  is assumed to be DNA when the
          first sequence in  the  file  contains  more  than  85%
          [ACGTN] bases.  Otherwise, it is assumed to be peptide.
          This option forces the query type  as  some  nucleotide
          and  peptide  sequences  can  fall  either side of this
          threshold.
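          The guess described above can be sketched in Python as
          follows.  This is an illustration, not exonerate's own
          code: the function name is hypothetical, and treating
          lower-case symbols the same as upper-case ones is an
          assumption.

```python
def guess_sequence_type(sequence, threshold=0.85):
    """Guess 'DNA' or 'peptide' from the fraction of [ACGTN] symbols."""
    if not sequence:
        raise ValueError("empty sequence")
    # Assumption: case is ignored when counting nucleotide symbols.
    nucleotide_like = sum(1 for symbol in sequence.upper() if symbol in "ACGTN")
    if nucleotide_like / len(sequence) > threshold:
        return "DNA"
    return "peptide"


print(guess_sequence_type("ACGTACGTNN"))  # DNA
print(guess_sequence_type("MKVLAAGIVG"))  # peptide
```

          A short peptide rich in A, C, G and T residues would be
          misclassified by such a rule, which is why --querytype
          exists.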

     -T | --targettype <sequence type>
          Specify the target type to use.  The same as --querytype
          (above), except that it applies to the target.
          Specifying the sequence type will avoid the overhead of
          having to read the first sequence in the database twice
          (which may be significant with chromosome-sized
          sequences).

     --querychunkid <id>

     --querychunktotal <total>

     --targetchunkid <id>

     --targetchunktotal <total>
          These options facilitate running exonerate on compute
          farms, and avoid having to split up sequence databases
          into small chunks to run on different nodes.  If,
          for  example,  you  wished to split the target database
          into three parts, you would run three exonerate jobs on
          different nodes including the options:

          --targetchunkid 1 --targetchunktotal 3
          --targetchunkid 2 --targetchunktotal 3
          --targetchunkid 3 --targetchunktotal 3
          NB. The granularity offered by this  option  only  goes
          down  to  a  single  sequence,  so  when there are more
          chunks than sequences in the database,  some  processes
          will do nothing.
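          When many jobs are submitted from a script, the option
          pairs shown above can be generated rather than typed.  A
          minimal Python sketch, in which the helper name is
          hypothetical and job submission itself is left out:

```python
def chunk_options(total, side="target"):
    """Return one list of chunk options per job, numbered from 1 to total."""
    if side not in ("query", "target"):
        raise ValueError("side must be 'query' or 'target'")
    if total < 1:
        raise ValueError("total must be at least 1")
    return [
        ["--%schunkid" % side, str(chunk_id),
         "--%schunktotal" % side, str(total)]
        for chunk_id in range(1, total + 1)
    ]


for options in chunk_options(3):
    print(" ".join(options))
```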

ANALYSIS OPTIONS
     -E | --exhaustive <boolean>
          Specify whether or not exhaustive alignment  should  be
          used.  By default, this is FALSE, and alignment heuris-
          tics will be used.  If it is set to TRUE, an exhaustive
          alignment  will be calculated.  This requires quadratic
          time, and will be much, much slower, but  will  provide
          the optimal result for the given model.
     -B | --bigseq <boolean>
          Perform alignment of large (multi-megabase)  sequences.
          This  is  very  memory  efficient  and  fast  when both
          sequences are chromosome-sized, but does not
          currently  permit  the use of a word neighbourhood (ie.
          exactly matching seeds only).
     -V | --verbose <int>
          Be verbose - show information about what  is  going  on
          during  the  analysis.   The  default is 0 (no informa-
          tion), the higher the number given, the  more  informa-
          tion is printed.
     --forcescan <none | query | target>
          Force the FSM to scan the query  sequence  rather  than
          the target.  This option is useful, for example, if you
          have a single piece of genomic sequence and you wish to
          compare  it  to  the  whole  of dbEST.  By scanning the
          database, rather than the query, the analysis  will  be
          completed much more quickly, as the overheads of multi-
          ple query FSM construction, multiple target reading and
          splice  site  predictions will be removed.  By default,
          exonerate will guess  the  optimal  strategy  based  on
          database sequence sizes.
     -w | --dnawordlen <bases>
     -W | --proteinwordlen <residues>
          The word length used for DNA  or  protein  words.   NB.
          When  this is set to zero, the word length used will be
          the full length of the query sequence.  This
          can  be  useful  for searches using short oligos.  When
          performing DNA vs protein comparisons, the DNA
          wordlength  will  always  (automatically) be triple the
          protein wordlength.
     --saturatethreshold <number>
          When set to zero, this option does nothing.  Otherwise,
          once more than this number of words (in addition to the
          expected number of words  by  chance)  have  matched  a
          position  on  the query, the position on the query will
          be 'numbed' (ignore further matches)  for  the  current
          pairwise comparison.
FASTA DATABASE OPTIONS
     --fastasuffix <suffix>
          If any of the inputs given with --query or --target are
          directories, then exonerate will recursively descend
          these directories, reading all files ending with this
          suffix as fasta format input.
UNGAPPED ALIGNMENT OPTIONS
     -M | --fsmmemory <Mb>
          Specify the amount of memory to  use  for  the  FSM  in
          heuristic analyses.  exonerate multiplexes the query to
          accelerate  large-throughput  database  queries.   This
          figure  should  always be less than the physical memory
          on the machine, but  when  searching  large  databases,
          generally,  the  more  memory it is allowed to use, the
          faster it will go.
     --forcefsm <none | normal | compact>
          Force the use of more compact finite state machines for
          analyses  involving big sequences and large word neigh-
          bourhoods.  By default, exonerate will pick a  sensible
          strategy, so this option will rarely need to be set.
     --useworddropoff <boolean>
          When this is TRUE, the score  threshold  for  admitting
          words into the word neighbourhood is set to be the ini-
          tial word score minus the word threshold  (see  below).
          This  strategy  is  designed to prevent restricting the
          word neighbourhood of low scoring words.  When this  is
          FALSE,  the  word  threshold is taken to be an absolute
          value.
     --wordjump <int>
          The jump between query words used  to  yield  the  word
          neighbourhood.  If set to 1, every word is used, if set
          to 2, every other word is  used,  and  if  set  to  the
          wordlength,  only  non-overlapping  words will be used.
          This option reduces the memory requirements when using
          very large query sequences, and makes the search run
          faster, but it also damages search sensitivity when high
          values are set.
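          The effect of the jump on which query words are used can
          be sketched in Python.  Counting word starts from
          position 0 is an assumption, and the function name is
          hypothetical.

```python
def query_word_starts(query_length, word_length, word_jump):
    """Start positions of the query words used for a given jump."""
    if word_length < 1 or word_jump < 1:
        raise ValueError("word_length and word_jump must be positive")
    return list(range(0, query_length - word_length + 1, word_jump))


print(query_word_starts(12, 3, 1))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(query_word_starts(12, 3, 2))  # [0, 2, 4, 6, 8]
print(query_word_starts(12, 3, 3))  # [0, 3, 6, 9]
```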
     --dnawordthreshold <score>
     --proteinwordthreshold <score>
          The threshold for admitting DNA or protein  words  into
          the  word  neighbourhood.  The behaviour of this option
          is altered by the --useworddropoff option (see above).
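          The two interpretations of the word threshold can be
          written out as a small Python sketch.  Here the "initial
          word score" is taken to be the score of the query word
          itself, which is an assumption, and the function name is
          hypothetical.

```python
def neighbourhood_cutoff(initial_word_score, word_threshold, use_word_dropoff):
    """Score a neighbouring word must reach to enter the word neighbourhood."""
    if use_word_dropoff:
        # Relative: allow words scoring up to word_threshold below the query word.
        return initial_word_score - word_threshold
    # Absolute: the threshold is used as given.
    return word_threshold


print(neighbourhood_cutoff(30, 5, True))   # 25
print(neighbourhood_cutoff(30, 5, False))  # 5
```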

GAPPED ALIGNMENT OPTIONS
     -m | --model <alignment model>
          Specify  the  alignment  model  to  use.   The   models
          currently supported are:
          ungapped        The simplest type  of  model,  used  by
                          default.   An appropriate model will be
                          selected automatically for the type  of
                          input sequences provided.
          ungapped:trans  This ungapped model  includes  transla-
                          tion  of  all  frames of both the query
                          and target sequences.  This is  similar
                          to an ungapped tblastx type search.
          affine:global   This performs gapped global  alignment,
                          similar  to  the Needleman-Wunsch algo-
                          rithm, except with affine gaps.  Global
                          alignment   requires   that   both  the
                          sequences   in   their   entirety   are
                          included in the alignment.
          affine:bestfit  This performs a best fit or best  loca-
                          tion  alignment  of  the query onto the
                          target  sequence.   The  entire   query
                          sequence will be included in the align-
                          ment, but only the  best  location  for
                          its alignment on the target sequence.
          affine:local    This is  local  alignment  with  affine
                          gaps,  similar  to  the Smith-Waterman-
                          Gotoh  algorithm.   A   general-purpose
                          alignment  algorithm.  As this is local
                          alignment, any subsequence of the query
                          and  target  sequence may appear in the
                          alignment.
          affine:overlap  This type of alignment finds  the  best
                          overlap  between  the query and target.
                          The overlap alignment must include  the
                          start  of  the  query or target and the
                          end  of  the  query   or   the   target
                          sequence, to align sequences which
                          overlap at the ends, or in the
                          mid-section of a longer sequence.  This
                          is the type of alignment frequently
                          used in assembly algorithms.
          est2genome      This   model   is   similar   to    the
                          affine:local   model,   but   it   also
                          includes intron modelling on the target
                          sequence  to allow alignment of spliced
                          to unspliced coding sequences for  both
                          forward  and  reversed  genes.  This is
                          similar to the alignment models used in
                          programs such as EST_GENOME and sim4.
          ner             NERs  are  non-equivalenced  regions  -
                          large  regions  in  both  the query and
                          target which  are  not  aligned.   This
                          model  can  be  used for protein align-
                          ments where  strongly  conserved  helix
                          regions  will  be  aligned,  but weakly
                          conserved loop regions are not.   Simi-
                          larly, this model could be used to look
                          for co-linearly  conserved  regions  in
                          comparison of genomic sequences.
          protein2dna     This model compares a protein  sequence
                          to  a  DNA  sequence, incorporating all
                          the appropriate gaps and frameshifts.
          protein2genome  This model allows alignment of  a  pro-
                          tein sequence to genomic DNA.   This is
                          similar to the protein2dna model,  with
                          the  addition  of  modelling of introns
                          and intron phases.  This model is
                          similar to those used by genewise.
          coding2coding   This   model   is   similar   to    the
                          ungapped:trans  model, except that gaps
                          and frameshifts  are  allowed.   It  is
                          similar to a gapped tblastx search.
          coding2genome   This  is  similar  to  the   est2genome
                          model,  except  that the query sequence
                          is translated during comparison, allow-
                          ing a more sensitive comparison.
          genome2genome   This   model   is   similar   to    the
                          coding2coding model, except introns are
                          modelled on both sequences.

          The short names u, u:t, a:g, a:b, a:l, a:o, e2g, ner,
          p2d, p2g, c2c, c2g and g2g can also be used for
          specifying models.

     -s | --score <threshold>
          This is the overall score threshold.   Alignments  will
          not  be  reported  below this threshold.  For heuristic
          alignments, the higher this threshold,  the  less  time
          the analysis will take.

     --percent <percentage>
          Report only alignments scoring at least this percentage
          of the maximal score for each query.  eg. use --percent
          90 to report alignments with at least 90% of the maximal
          score obtainable for that query.  This option is useful not
          only because it reduces the  spurious  matches  in  the
          output,  but because it generates query-specific thres-
          holds (unlike --score ) for a set of queries of differ-
          ing  lengths, and will also speed up the search consid-
          erably.  NB. with this option, it is possible to have a
          cDNA  match  its  corresponding gene exactly, yet still
          score less than 100%, due to the addition of the intron
          penalty  scores,  hence  this  option must be used with
          caution.
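          The query-specific threshold can be illustrated with a
          short Python sketch.  How exonerate derives the maximal
          score for a query is not shown here; it is passed in as
          a number, and the function names are hypothetical.

```python
def percent_cutoff(maximal_score, percent):
    """Lowest score reported for a query under a given --percent value."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return maximal_score * percent / 100.0


def passes(score, maximal_score, percent):
    return score >= percent_cutoff(maximal_score, percent)


# Two queries of different lengths get different thresholds.
print(percent_cutoff(500, 90))   # 450.0
print(percent_cutoff(2000, 90))  # 1800.0
print(passes(470, 500, 90))      # True
```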

     --showalignment <boolean>
          Show the alignments in a human readable form.

     -S | --showsugar <boolean>
          Display "sugar" output for ungapped alignments.   Sugar
          is  Simple  UnGapped  Alignment  Report, which displays
          ungapped  alignments  one-per-line.   The  sugar   line
          starts  with  the  string  "sugar:" for easy extraction
          from the output, and is followed by the following 9
          fields in the order below:

          query_id        Query identifier
          query_start     Query position at alignment start
          query_end       Query position at alignment end
          query_strand    Strand of query matched
          target_id       |
          target_start    | the same 4 fields
          target_end      | for the target sequence
          target_strand   |
          score           The raw alignment score
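          A sugar line can be split into the nine fields listed
          above with a few lines of Python.  The sketch assumes the
          fields are separated by whitespace and that identifiers
          contain none; the sample line is invented, not real
          output.

```python
from typing import NamedTuple


class SugarLine(NamedTuple):
    query_id: str
    query_start: int
    query_end: int
    query_strand: str
    target_id: str
    target_start: int
    target_end: int
    target_strand: str
    score: int


def parse_sugar(line):
    """Parse one 'sugar:' line into its nine fields."""
    tag, *fields = line.split()
    if tag != "sugar:" or len(fields) != 9:
        raise ValueError("not a sugar line: %r" % line)
    q_id, q_start, q_end, q_strand, t_id, t_start, t_end, t_strand, score = fields
    return SugarLine(q_id, int(q_start), int(q_end), q_strand,
                     t_id, int(t_start), int(t_end), t_strand, int(score))


record = parse_sugar("sugar: est1 0 120 + chr21 5000 5120 + 600")
print(record.target_id, record.score)  # chr21 600
```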

     --showcigar <boolean>
          Show the alignments in "cigar" format.  Cigar is a Com-
          pact   Idiosyncratic  Gapped  Alignment  Report,  which
          displays gapped alignments  one-per-line.   The  format
          starts  with  the  same  9  fields as sugar output (see
          above), and is followed  by  a  series  of  <operation,
          length>  pairs  where operation is one of match, insert
          or delete, and the length describes the number of times
          this operation is repeated.
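          The pairs following the sugar fields can be read in the
          same way.  This Python sketch assumes the line starts
          with "cigar:", that fields are whitespace separated, and
          that the operations are written as the single letters M,
          I and D; the sample line is invented.

```python
def parse_cigar(line):
    """Return (sugar_fields, [(operation, length), ...]) for one cigar line."""
    tag, *fields = line.split()
    if tag != "cigar:" or len(fields) < 9 or (len(fields) - 9) % 2:
        raise ValueError("not a cigar line: %r" % line)
    sugar_fields, rest = fields[:9], fields[9:]
    pairs = []
    for i in range(0, len(rest), 2):
        operation, length = rest[i], int(rest[i + 1])
        if operation not in ("M", "I", "D"):  # assumed letter codes
            raise ValueError("unexpected operation: %r" % operation)
        pairs.append((operation, length))
    return sugar_fields, pairs


sugar_fields, pairs = parse_cigar("cigar: q1 0 10 + t1 5 16 + 40 M 5 D 1 M 5")
print(pairs)  # [('M', 5), ('D', 1), ('M', 5)]
```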

     --showvulgar <boolean>
          Shows the alignments in  "vulgar"  format.   Vulgar  is
          Verbose  Useful  Labelled Gapped Alignment Report.  This
          format also starts with the same 9 fields as sugar out-
          put (see above), and is followed by a series of <label,
          query_length, target_length> triplets.  The  label  may
          be one of the following:

          M    Match
          G    Gap
          N    Non-equivalenced region
          5    5' splice site
          3    3' splice site
          I    Intron
          S    Split codon
          F    Frameshift
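          A vulgar line can be parsed into its triplets as
          sketched below in Python.  The sketch assumes the line
          starts with "vulgar:" and that fields are whitespace
          separated; the sample line is invented for illustration.

```python
VULGAR_LABELS = {"M", "G", "N", "5", "3", "I", "S", "F"}


def parse_vulgar(line):
    """Return (sugar_fields, [(label, query_length, target_length), ...])."""
    tag, *fields = line.split()
    if tag != "vulgar:" or len(fields) < 9 or (len(fields) - 9) % 3:
        raise ValueError("not a vulgar line: %r" % line)
    sugar_fields, rest = fields[:9], fields[9:]
    triplets = []
    for i in range(0, len(rest), 3):
        label = rest[i]
        if label not in VULGAR_LABELS:
            raise ValueError("unexpected label: %r" % label)
        triplets.append((label, int(rest[i + 1]), int(rest[i + 2])))
    return sugar_fields, triplets


line = "vulgar: est1 0 12 + chr1 100 212 + 50 M 6 6 5 0 2 I 0 96 3 0 2 M 6 6"
sugar_fields, triplets = parse_vulgar(line)
# Total query and target lengths covered by the triplets.
print(sum(t[1] for t in triplets), sum(t[2] for t in triplets))  # 12 112
```

          In this invented example the totals equal the query and
          target spans given in the sugar fields (12 and 112).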

     --showquerygff <boolean>
          Report GFF output for features on the  query  sequence.
          See   http://www.sanger.ac.uk/Software/formats/GFF  for
          more information.

     --showtargetgff <boolean>
          Report GFF output for features on the target sequence.

     --ryo <format>
          Roll-your-own output format.  This allows specification
          of  a printf-esque format line which is used to specify
          which information to include in the output, and how  it
          is  to be shown.  The format field may contain the fol-
          lowing fields:

          %[qt][idlsSt]
               For    either    {query,target},    report     the
               {id,definition,length,sequence,Strand,type}
               Sequences are  reported  in  a  fasta-format  like
               block (no headers).
          %[qt]a[bels]
               For either {query,target} region which  occurs  in
               the          alignment,         report         the
               {begin,end,length,sequence}
          %s   The raw score
          %r   The rank (in results from a bestn search)
          %m   Model name
          %e[tism]
               Equivalenced {total,id,similarity,mismatches} (ie.
               %em == (%et - %ei))
          %p[is]
               Percent {id,similarity} over the equivalenced por-
               tions  of  the  alignment.  (ie. %pi == 100*(%ei /
               %et))
          %g   Gene orientation ('+' = forward,  '-'  =  reverse,
               '.' = unknown)
          %S   Sugar block (the 9 fields  used  in  sugar  output
               (see above))
          %C   Cigar block (the fields of a cigar line after  the
               sugar portion)
          %V   Vulgar block (the fields of a  vulgar  line  after
               the sugar portion)
          %%   Expands to a percentage sign (%)
          \n   Newline
          \t   Tab
          \\   Expands to a backslash (\)
          \{   Open curly brace
          \}   Close curly brace
          {    Begin per-transition output section
          }    End per-transition output section
          %P[qt][sabe]
               Per-transition    output    for     {query,target}
               {sequence,advance,begin,end}
          %P[nsl]
               Per-transition output for {name,score,label}

     This option is very useful and flexible.   For  example,  to
     report  all the sections of query sequences which feature in
     alignments in fasta format, use:

     --ryo ">%qi %qd\n%qas\n"

     To output all the symbols and scores in  an  alignment,  try
     something like:

     --ryo "%V{%Pqs %Pts %Ps\n}"
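     When exonerate is driven from a script, a tab-separated --ryo
     format is convenient to parse.  The following Python sketch
     is an illustration under assumptions: the leading RYO tag,
     the choice of fields and the helper name are hypothetical,
     and the sample text is invented rather than real output.

```python
# A raw string keeps \t and \n as literal backslash sequences, so that
# exonerate (not Python) expands them as described above.
RYO_FORMAT = r"RYO\t%qi\t%ti\t%s\t%pi\n"
ryo_arguments = ["--ryo", RYO_FORMAT]


def parse_ryo_rows(output_text):
    """Collect rows produced by RYO_FORMAT, skipping any other lines."""
    rows = []
    for line in output_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "RYO" or len(fields) != 5:
            continue
        rows.append({
            "query_id": fields[1],
            "target_id": fields[2],
            "score": int(fields[3]),
            "percent_id": float(fields[4]),
        })
    return rows


sample = "some other output line\nRYO\test1\tchr21\t600\t98.50\n"
print(parse_ryo_rows(sample))
```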

     -n | --bestn <number>
          Report the best N results for each query.  (Only results
          scoring better than the score threshold will be
          reported).  The option reduces the amount of output
          generated, and also allows exonerate to speed up the
          search.

     -S | --subopt <boolean>
          This option allows  for  the  reporting  of  (Waterman-
          Eggert  style)  suboptimal  alignments.   (It  is on by
          default.)  All suboptimal (ie. non-intersecting) align-
          ments will be reported for each pair of sequences scor-
          ing at least the threshold provided by --score.
          When this option is used  with  exhaustive  alignments,
          several full quadratic time passes will be required, so
          the running time will be considerably increased.

     -g | --gappedextension <boolean>
          Causes a gapped extension stage  to  be  performed  ie.
          dynamic  programming  is  applied in arbitrarily shaped
          and dynamically sized regions  surrounding  HSP  seeds.
          The  extension  threshold  is  controlled  by  the same
          options, --dnahspdropoff and --proteinhspdropoff as for
          normal HSP extension.

          Although  often  slower  than  BSDP,  gapped  extension
          improves  sensitivity  with  weak,  gap-rich alignments
          such as during cross-species comparison.

          NB. This option is not yet supported within the code
          generation framework, so will be slow for large scale
          analyses (to be fixed in 1.1).


VITERBI ALGORITHM OPTIONS
     -C | --compiled <boolean>
          This option allows  disabling  of  generated  code  for
          dynamic programming.  It is mainly used during develop-
          ment of exonerate.  When set to FALSE, an "interpreted"
          version  of  the  dynamic programming implementation is
          used, which is much slower.

     -D | --dpmemory <Mb>
          The  exhaustive  alignment  traceback  routines  use  a
          Hughey-style  reduced  memory  technique.   This option
          specifies how much memory will be used for this.   Gen-
          erally,  the  more memory is permitted here, the faster
          the alignments will be produced.

HEURISTIC OPTIONS
     --terminalrangeint
     --terminalrangeext
     --joinrangeint
     --joinrangeext
     --spanrangeint
     --spanrangeext
          These options are used to specify the size of the  sub-
          alignment  regions  to  which  DP is applied around the
          ends of the HSPs.  This can be at the HSP ends  (termi-
          nal  range), between HSPs (join range), or between HSPs
          which may be connected by a large  region  such  as  an
          intron  or non-equivalenced region (span range).  These
          ranges can be specified for a number  of  matches  back
          onto  the  HSP  (internal  range)  or  out from the HSP
          (external range).

     --refine <strategy>
          Force  exonerate  to  refine  alignments  generated  by
          heuristics   using   dynamic  programming  over  larger
          regions.  This takes more time, but improves the  qual-
          ity of the final alignments.

          The strategies available for refinement are:

          none The default - no refinement is used.
          full An exhaustive alignment  is  calculated  from  the
               pair of sequences in their entirety.
          region
               DP is applied just to the region of the  sequences
               covered by the heuristic alignment.

     --refineboundary <size>
          Specify an extra boundary to be included in the  region
          subject to alignment during refinement by region.

BSDP OPTIONS
     --joinfilter <limit>
          (experimental)

          Only consider this number of SARs for joining HSPs
          together.  The SARs with the highest potential for
          appearing in a high-scoring alignment are considered.
          This option is useful for limiting time and memory usage
          when searching unmasked data with repetitive sequences,
          but should not be set too low, as valid matches may be
          ignored.  Something like --joinfilter 32 seems to work
          well.

UNGAPPED MODEL OPTIONS
     -d | --dnasubmat <name>
          Specify the substitution matrix to be used for DNA
          comparison.  This should be a path to a substitution
          matrix in the same format as that which is used by blast.

     -p | --proteinsubmat <name>
          Specify the substitution matrix to be used for protein
          comparison.  (Both DNA and protein substitution
          matrices are required for some types of analysis).  The
          use  of  the  special names, nucleic, blosum62, pam250,
          edit  or  identity  will  cause  built-in  substitution
          matrices to be used.

AFFINE MODEL OPTIONS
     -o | --gapopen <penalty>
          This is the gap open penalty.

     -e | --gapextend <penalty>
          This is the gap extension penalty.

NER OPTIONS
     --minner <length>
          Minimum NER length allowed.

     --maxner <length>
          Maximum NER  length  allowed.   NB.  this  option  only
          affects heuristic alignments.

     --neropen <penalty>
          Penalty for opening a non-equivalenced region.

INTRON MODELLING OPTIONS
     --forcegtag <boolean>
          Only allow splice sites at gt....ag sites (or ct....ac
          sites when the gene is reversed).  With this restriction
          in place, the splice site prediction scores  are  still
          used and allow tie breaking when there is more than one
          possible splice site.
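          The restriction itself amounts to a simple check on the
          first and last two bases of a candidate intron.  A
          Python sketch, with a hypothetical function name and
          ignoring the splice site scores mentioned above:

```python
def is_gtag_intron(intron, gene_forward=True):
    """True if the intron starts and ends with the canonical dinucleotides."""
    bases = intron.lower()
    if len(bases) < 4:
        return False
    if gene_forward:
        return bases.startswith("gt") and bases.endswith("ag")
    # On the opposite strand gt....ag reads as ct....ac.
    return bases.startswith("ct") and bases.endswith("ac")


print(is_gtag_intron("GTAAGTCCCTTTTCAG"))                      # True
print(is_gtag_intron("CTGAAAAGGGACTTAC", gene_forward=False))  # True
print(is_gtag_intron("GCAAGTCCCTTTTCAG"))                      # False
```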

     --minintron <length>
          Minimum intron length  limit.   NB.  this  option  only
          affects heuristic alignments.  This is not a hard limit
          - it only affects size of introns which are sought dur-
          ing heuristic alignment.

     --maxintron <length>
          Maximum intron length limit.  See notes above for
          --minintron.

     --intronpenalty <penalty>
          Penalty for introduction of an intron.

FRAMESHIFT MODELLING OPTIONS
     -f | --frameshift <penalty>
          The penalty for the inclusion of  a  frameshift  in  an
          alignment.

ALPHABET OPTIONS
     --useaatla <boolean>
          Use three-letter abbreviations for AA names.  ie.  when
          displaying alignments, "Met" is used instead of " M ".

HSP CREATION OPTIONS
     --hspfilter <threshold>
          Use aggressive HSP  filtering  to  speed  up  heuristic
          searches.   The  threshold specifies the number of HSPs
          centred about a  point  in  the  query  which  will  be
          stored.  Any lower scoring HSPs will be discarded.

PAIRWISE COMPARISON OPTIONS
     --softmaskquery <boolean>
          Indicate that the query is softmasked.  See description
          below for --softmasktarget.

     --softmasktarget <boolean>
          Indicate that the target is  softmasked.   In  a  soft-
          masked  sequence file, instead of masking regions by Ns
          or Xs they are masked by putting those regions in lower
          case  (and  with unmasked regions in upper case).  This
          option allows the masking to be ignored by  some  parts
          of the program, combining the speed of searching masked
          data with sensitivity of searching unmasked data.   The
          utility  fastasoftmask,  which  is  supplied  with
          exonerate can be used for producing softmasked sequence
          from conventionally masked sequence.  This is an exper-
          imental option to handle speed problems caused by  some
          sequences.  A value of about 100 seems to work well.

     --dnahspdropoff <score>

     --proteinhspdropoff <score>
          The amount by which an HSP score  will  be  allowed  to
          degrade during HSP extension.  Separate thresholds can
          be set for DNA and protein comparisons.
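          The idea of a dropoff can be illustrated by a simple
          ungapped extension in Python.  This is a generic sketch
          of drop-off limited extension, not exonerate's own
          implementation; the match and mismatch scores and the
          exact stopping rule are assumptions.

```python
def extend_right(query, target, q_pos, t_pos, dropoff, match=5, mismatch=-4):
    """Extend an ungapped match rightwards; return (best_score, best_length)."""
    score = best_score = 0
    length = best_length = 0
    while q_pos + length < len(query) and t_pos + length < len(target):
        same = query[q_pos + length] == target[t_pos + length]
        score += match if same else mismatch
        length += 1
        if score > best_score:
            best_score, best_length = score, length
        elif best_score - score > dropoff:
            # The score has degraded by more than the dropoff: stop.
            break
    return best_score, best_length


print(extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0, dropoff=10))  # (40, 8)
print(extend_right("AAAACAAAA", "AAAAGAAAA", 0, 0, dropoff=10))        # (36, 9)
```

          A larger dropoff lets the extension cross longer
          stretches of mismatches at the cost of more work.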

     --dnahspthreshold <score>

     --proteinhspthreshold <score>
          The HSP score thresholds.  An HSP must score  at  least
          this  much  before  it  will  be reported or be used in
          preparation of a heuristic alignment.

ALIGNMENT OPTIONS
     --alignmentwidth <width>
          Width of alignment display.  The default is 80.

     --forwardcoordinates <boolean>
          By default, all coordinates are reported on the forward
          strand.   Setting  this  option to false reverts to the
          old behaviour (pre-0.8.3)  whereby  alignments  on  the
          reverse  complement  of  a  sequence are reported using
          coordinates on the reverse complement.

SUB-ALIGNMENT REGION OPTIONS
     --quality <percent>
          This option excludes HSPs from  BSDP  when  their  com-
          ponents  outside  of  the  SARs fall below this quality
          threshold.


STRATEGIES FOR SPEED
     Keep all data on local disks.

     Apply  the  highest  acceptable  score  thresholds  using  a
     combination of --score, --percent and --bestn.

     Repeat mask and dust the genomic (target) sequence.   (Soft-
     mask these sequences and use --softmasktarget).

     Increase the --fsmmemory option to allow more  query  multi-
     plexing.

     If you are compiling exonerate yourself, see the README file
     supplied  with  the  source code for details of compile-time
     optimisations.

STRATEGIES FOR SENSITIVITY
     Not documented yet.

     Increase the word neighbourhood.  Decrease  the  HSP  thres-
     hold.  Increase the SAR ranges.  Run exhaustively.

ENVIRONMENT
     Not documented yet.

EXAMPLES
     exonerate cdna.fasta genomic.fasta
          This is the simplest way in which exonerate may be used.  By
          default, an ungapped alignment model will be used.

     exonerate --exhaustive cdna.fasta genomic.masked.fasta
          Exhaustively align cdnas  to  genomic  sequence.   This
          will  be  much,  much  slower, but more accurate.  This
          option causes exonerate to behave like EST_GENOME.

     exonerate  --exhaustive  --model  affine:local   query.fasta
     target.fasta
          If the  affine:local  model  is  used  with  exhaustive
          alignment, you have the Smith-Waterman algorithm.

     exonerate --exhaustive --model  affine:global  protein.fasta
     protein.fasta
          Switch to a  global  model,  and  you  have  Needleman-
          Wunsch.

     exonerate  --wordthreshold  1  --gapped  no  --showhsp   yes
     protein.fasta genome.fasta
          Generate ungapped Protein:DNA alignments

     exonerate --model coding2coding --score  1000  --bigseq  yes
     --proteinhspthreshold 90 chr21.fa chr22.fa
          Perform quick-and-dirty translated  pairwise  alignment
          of two very large DNA sequences.

     Many similar combinations should work.  Try them out.

VERSION
     This  documentation  accompanies  version   1.0.0   of   the
     exonerate package.

AUTHOR
     Guy St.C. Slater.  <guy@ebi.ac.uk>.  See  the  AUTHORS  file
     accompanying the source code for a list of contributors.

AVAILABILITY
     This source code for  the  exonerate  package  is  available
     under the terms of the GNU lesser public licence.

     Please see the file COPYING which was distributed with  this
     package,   or   http://www.fsf.org/copyleft/lesser.html  for
     details.

     This package has been developed as part of the ensembl  pro-
     ject.   Please see http://www.ensembl.org/ for more informa-
     tion.

SEE ALSO
     blast(1L)

Allowed combinations

Some programs have a limitation in combinations between query-target sequence types (aa: amino acid sequence, nt: nucleic acid sequence).

  • BLAST: aa-aa, aa-nt, nt-aa, nt-nt
  • FASTA: aa-aa, aa-nt, nt-aa, nt-nt
  • SSEARCH: aa-aa, nt-nt
  • EXONERATE: aa-aa, aa-nt, nt-aa, nt-nt
  • TRANS: aa-nt, nt-aa, nt-nt
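
A script that submits searches could check a query-target combination against the list above before running anything. The following Python sketch simply encodes that list; the table and function names are hypothetical and not part of the service.

```python
ALL_PAIRS = {("aa", "aa"), ("aa", "nt"), ("nt", "aa"), ("nt", "nt")}

# Allowed (query type, target type) pairs, copied from the list above.
ALLOWED_COMBINATIONS = {
    "BLAST": ALL_PAIRS,
    "FASTA": ALL_PAIRS,
    "SSEARCH": {("aa", "aa"), ("nt", "nt")},
    "EXONERATE": ALL_PAIRS,
    "TRANS": {("aa", "nt"), ("nt", "aa"), ("nt", "nt")},
}


def is_allowed(program, query_type, target_type):
    """True if the program accepts this query-target combination."""
    try:
        allowed = ALLOWED_COMBINATIONS[program.upper()]
    except KeyError:
        raise ValueError("unknown program: %r" % program)
    return (query_type, target_type) in allowed


print(is_allowed("ssearch", "aa", "nt"))    # False
print(is_allowed("exonerate", "aa", "nt"))  # True
```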

The University of Tokyo The Institute of Medical Science

Copyright©2005-2012 Human Genome Center