SSS - Sequence Similarity Search service

HGC Sequence Similarity Search service manual

Execute several sequence similarity search programs against various biological sequence databases supported at Human Genome Center.

Manual pages

(Updated December, 2003)


                        COPYRIGHT NOTICE

Copyright 1988, 1991, 1992, 1994, 1995, 1996, 1999 by William R.
Pearson and the University of Virginia.  All rights reserved. The
FASTA program and documentation may not be sold or incorporated
into a commercial product, in whole or in part, without written
consent of William R. Pearson and the University of Virginia.
For further information regarding permission for use or
reproduction, please contact: David Hudson, Assistant Provost for
Research, University of Virginia, P.O. Box 9025, Charlottesville,
VA 22906-9025, (434) 924-6853

The FASTA program package

Introduction

     This documentation describes version 3 of the FASTA
program package (see W. R. Pearson and D. J. Lipman (1988),
"Improved Tools for Biological Sequence Analysis", PNAS
85:2444-2448 (Pearson and Lipman, 1988); W. R. Pearson (1996)
"Effective protein sequence comparison" Meth. Enzymol.
266:227-258 (Pearson, 1996); Pearson et al. (1997) Genomics
46:24-36 (Zhang et al., 1997); Pearson (1999) Meth. in
Molecular Biology 132:185-219 (Pearson, 2000)).  Version 3 of the
FASTA package contains many programs for searching DNA and
protein databases and one program (prss3) for evaluating
statistical significance from randomly shuffled sequences.
Several additional analysis programs, including programs that
produce local alignments, are available as part of version 2 of
the FASTA package, which is still available.

     This document is divided into three sections: (1) A summary
overview of the programs in the FASTA3 package; (2) A guide to
installing the programs and databases; (3) A guide to using the
FASTA programs. The revision history of the programs can be found
in the readme.v30..v34 files. The programs are easy to use, so
if you are using them on a machine that is administered by
someone else, you can skip section (2) and focus on (1) and (3)
to learn how to use the programs. If you are installing the
programs on your own machine, you will need to read section (2)
carefully.

1.  An overview of the FASTA programs

     Although there are a large number of programs in this
package, they belong to three groups: (1) "Conventional" Library
search programs: FASTA3, FASTX3, FASTY3, TFASTA3, TFASTX3,
TFASTY3, SSEARCH3; (2) Programs for searching with short
fragments: FASTS3, FASTF3, TFASTS3, TFASTF3; (3) Statistical
significance: PRSS3.  Programs whose names start with "fast"
search protein databases, while "tfast" programs search
translated DNA databases.  Table I gives a brief description of
the programs.


            Table I. Comparison programs in the FASTA3 package

---------------------------------------------------------------------------
fasta3             Compare a protein sequence to a protein sequence
                   database or a DNA sequence to a DNA sequence database
                   using the FASTA algorithm (Pearson and Lipman, 1988;
                   Pearson, 1996).  Search speed and selectivity are
                   controlled with the ktup (word size) parameter.  For
                   protein comparisons, ktup=2 by default; ktup=1 is more
                   sensitive but slower.  For DNA comparisons, ktup=6 by
                   default; ktup=3 or ktup=4 provides higher sensitivity;
                   ktup=1 should be used for oligonucleotides (DNA query
                   lengths < 20).

ssearch3           Compare a protein sequence to a protein sequence
                   database or a DNA sequence to a DNA sequence database
                   using the Smith-Waterman algorithm (Smith and
                   Waterman, 1981).  ssearch3 is about 10 times slower
                   than FASTA3, but is more sensitive for full-length
                   protein sequence comparison.

fastx3/fasty3      Compare a DNA sequence to a protein sequence database
                   by comparing the translated DNA sequence in three
                   frames and allowing gaps and frameshifts.  fastx3 uses
                   a simpler, faster algorithm for alignments that allows
                   frameshifts only between codons; fasty3 is slower but
                   produces better alignments with poor quality sequences
                   because frameshifts are allowed within codons.

tfastx3/tfasty3    Compare a protein sequence to a DNA sequence database,
                   calculating similarities with frameshifts to the
                   forward and reverse orientations.

tfasta3            Compare a protein sequence to a DNA sequence database,
                   calculating similarities (without frameshifts) to the
                   three forward and three reverse reading frames.
                   tfastx3 and tfasty3 are preferred because they
                   calculate similarity over frameshifts.

fastf3/tfastf3     Compare an ordered peptide mixture, as would be
                   obtained by Edman degradation of a CNBr cleavage of a
                   protein, against a protein (fastf) or DNA (tfastf)
                   database.

fasts3/tfasts3     Compare a set of short peptide fragments, as would be
                   obtained from mass-spec analysis of a protein, against
                   a protein (fasts) or DNA (tfasts) database.
---------------------------------------------------------------------------
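
Table I can be read as a lookup from the type of the query and the
type of the library to the programs that apply.  The following is a
minimal Python sketch of that lookup; the function name and the
"protein"/"dna" labels are illustrative assumptions and are not part
of the FASTA package.

```python
# Illustrative lookup built from Table I (conventional search programs only).
# The dictionary and function names are hypothetical helpers.
CONVENTIONAL = {
    ("protein", "protein"): ["fasta3", "ssearch3"],
    ("dna", "dna"): ["fasta3", "ssearch3"],
    ("dna", "protein"): ["fastx3", "fasty3"],
    ("protein", "dna"): ["tfastx3", "tfasty3", "tfasta3"],
}


def candidate_programs(query_type, library_type):
    """Return the Table I programs for a query/library type pair."""
    key = (query_type.lower(), library_type.lower())
    if key not in CONVENTIONAL:
        raise ValueError("types must be 'protein' or 'dna'")
    return list(CONVENTIONAL[key])
```

For example, candidate_programs("dna", "protein") lists fastx3 and
fasty3, the two programs that translate a DNA query.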

2.  Installing FASTA and the sequence databases

2.1.  Obtaining the libraries

     The FASTA program package does not include any protein or
DNA sequence libraries.  Protein databases are available on
CD-ROM from the PIR and EMBL (see below), or via anonymous FTP
from many different sources.  As this document is updated in the
fall of 1999, no DNA databases are available on CD-ROM from the
major sequence databases: Genbank at the National Center for
Biotechnology Information (www.ncbi.nlm.nih.gov and
ftp://ncbi.nlm.nih.gov) and EMBL at the European Bioinformatics
Institute (www.ebi.ac.uk).  However, the databases are available
via anonymous FTP from both sites.

2.1.1.  The GENBANK DNA sequence library

     Because of the large size of DNA databases, you will
probably want to keep DNA databases in only one, or possibly two,
formats.  The FASTA3 programs that search DNA databases (fasta3,
tfastx/y3, and tfasta3) can read DNA databases in Genbank flatfile
(not ASN.1), FASTA, GCG/compressed-binary, BLAST1.4 (pressdb),
and BLAST2.0 (formatdb) formats, as well as EMBL format.  If you
are also running the GCG suite of sequence analysis programs, you
should use GCG/compressed-binary format or BLAST2.0 format for
your fasta3 searches.  If not, BLAST2.0 is a good choice.  These
files are considerably more compact than Genbank flat files, and
are preferred.  The NCBI does not provide software for converting
from Genbank flat files to Blast2.0 DNA databases, but you can
use the Blast formatdb program to convert ASN.1 formatted Genbank
files, which are available from the NCBI ftp site.

     The NCBI also provides the nr, swissprot, and several EST
databases that are used by BLAST in FASTA format from:
ftp://ncbi.nlm.nih.gov/blast/db.  These databases are updated
nightly.

2.1.2.  The NBRF protein sequence library

     You can obtain the PIR protein sequence database (Barker et
al., 1998) from:

    National  Biomedical Research Foundation
    Georgetown  University  Medical  Center
    3900 Reservoir Rd, N.W.
    Washington, D.C. 20007

or via ftp from nbrf.georgetown.edu or from the NCBI
(ncbi.nlm.nih.gov/repository/PIR). The data in the ascii
directory is in PIR Codata format, which is not widely used.  I
recommend the PIR/VMS format data (libtype=5) in the vms
directory.

2.1.3.  The EBI/EMBL CD-ROM libraries

     The European Bioinformatics Institute (EBI) distributes both
the EMBL DNA database and the SwissProt database on CD-ROM
(Bairoch and Apweiler, 1996), and they are available from:

    EMBL-Outstation  European Bioinformatics Institute
    Wellcome Trust Genome Campus,
    Hinxton Hall
    Hinxton,
    Cambridge CB10 1SD
    United Kingdom
    Tel: +44 (0)1223 494444
    Fax: +44 (0)1223 494468
    Email: DATALIB@ebi.ac.uk

In addition, the SWISS-PROT protein sequence database is
available via anonymous FTP from
ftp://ftp.expasy.ch/databases/swiss-prot/ (also see
www.expasy.ch).

2.2.  Finding the libraries: FASTLIBS

     The major problem that most new users of the FASTA package
have is in setting up the program to find the databases and their
library type.  In general, if you cannot get fasta3 to read a
sequence database, it is likely that something is wrong with the
FASTLIBS file.  A common problem is that the database file is
found, but either no sequences are read, or an incorrect number
of entries is read.  This is almost always because the library
format (libtype) is incorrect.  Note that a type 5 file (PIR/VMS
format) can be read as a type 0 (default FASTA) format file, and
the number of entries will be correct, but the sequence lengths
will not.

     All the search programs in the FASTA3 package use the
environment variable FASTLIBS to find the protein and DNA
sequence libraries.  The FASTLIBS variable contains the name of a
file that has the actual filenames of the libraries.  The
fastlibs file included with the distribution is an example of
a file that can be referred to by FASTLIBS. To use the fastlibs
file, type:

    setenv FASTLIBS /usr/lib/fasta/fastgbs (BSD UNIX/csh)
    or
    export FASTLIBS=/usr/lib/fasta/fastgbs (SysV UNIX/ksh)

Then edit the fastlibs file to indicate where the protein and DNA
sequence libraries can be found.  If you have a hard disk and
your protein sequence library is kept in the file
/usr/lib/seq/aabank.lib and your Genbank DNA sequence library is
kept in the directory /usr/lib/genbank, then fastgbs might
contain:

    NBRF Protein$0P/usr/lib/seq/aabank.lib 0
    SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
    GB Primate$1P@/usr/lib/genbank/gpri.nam
    GB Rodent$1R@/usr/lib/genbank/grod.nam
    GB Mammal$1M@/usr/lib/genbank/gmammal.nam
    ^   1    ^^^^       4                   ^     ^
              23                             (5)

The first line of this file says that a copy of the NBRF protein
sequence database (which is a protein database) is kept in the
file /usr/lib/seq/aabank.lib and can be selected by typing "P" on
the command line or when the database menu is presented.

     Note that there are 4 or 5 fields in the lines in fastgbs.
The first field is the description of the library which will be
displayed by FASTA; it ends with a '$'.  The second field (1
character), is a 0 if the library is a protein library and 1 if
it is a DNA library.  The third field (1 character) is the
character to be typed to select the library.

     The fourth field is the name of the library file.  In the
example above, the /usr/lib/seq/aabank.lib file contains the
entire protein sequence library.  However the DNA library file
names are preceded by a '@', because these files (gpri.nam,
grod.nam, gmammal.nam) do not contain the sequences; instead they
contain the names of the files which contain the sequences.  This
is done because the GENBANK DNA database is broken down into a
large number of smaller files.  In order to search the entire
primate database, you must search more than a dozen files.

     In addition, an optional fifth field can be used to specify
the format of the library file.  Alternatively, you can specify
the library format in a file of file names (a file preceded by an
'@').  This field must be separated from the file name by a space
character (' ').  In the example above, the aabank.lib file is in
Pearson/FASTA format, while the swiss.seq file is in PIR/VMS
format (from the EMBL CD-ROM). Currently, FASTA can read the
following formats:

    0 Pearson/FASTA (>SEQID - comment/sequence)
    1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
    2 NBRF CODATA (ENTRY/SEQUENCE)
    3 EMBL/SWISS-PROT (ID/DE/SQ)
    4 Intelligenetics (;comment/SEQID/sequence)
    5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
    6 GCG (version 8.0) Unix Protein and DNA (compressed)
    11 NCBI Blast1.3.2 format  (unix only)
    12 NCBI Blast2.0 format  (unix only, fasta32t08 or later)
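
The field layout of a FASTLIBS entry described above can be
illustrated with a small parser.  This is only a sketch in Python
under the stated field rules; the function name and the returned
dictionary keys are assumptions made for illustration and do not
come from the FASTA source code.

```python
def parse_fastlibs_line(line):
    """Split one FASTLIBS entry into its 4 or 5 fields (illustrative sketch)."""
    # Field 1: description, terminated by '$'.
    description, sep, rest = line.strip().partition("$")
    if not sep or len(rest) < 3:
        raise ValueError("expected 'description$<type><key><file> [libtype]'")
    if rest[0] not in "01":
        raise ValueError("second field must be 0 (protein) or 1 (DNA)")
    # Field 4 (file name) and optional field 5 (format), separated by a space.
    parts = rest[2:].split()
    if not parts:
        raise ValueError("missing library file name")
    filename = parts[0]
    indirect = filename.startswith("@")  # '@' marks a file of file names
    if indirect:
        filename = filename[1:]
    return {
        "description": description,
        "is_dna": rest[0] == "1",        # field 2
        "key": rest[1],                  # field 3: selection character
        "filename": filename,
        "indirect": indirect,
        # None means no format was given on this line; the manual states
        # that format 0 (Pearson/FASTA) is then assumed.
        "libtype": int(parts[1]) if len(parts) > 1 else None,
    }
```

Applied to the "SWISS PROT 10" line of the example, the sketch
returns a protein library selected with "S" in format 5.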

In particular, this version will work with the EMBL and PIR VMS
formats that are distributed on the EMBL CD-ROM. The latter
format (PIR VMS) is much faster to search than EMBL format.  This
release also works with the protein and DNA database formats
created for the BLASTP and BLASTN programs by SETDB and PRESSDB
and with the new NCBI search format.  If a library format is not
specified, for example, because you are just comparing two
sequences, Pearson/FASTA (format 0) is used by default. To
specify a library type on the command line, add it to the library
filename and surround the filename and library type in quotes:

    fasta3 query.file "/seqdb/genbank/gbpri1.seq 1"

     You can specify a group of library files by putting a '@'
symbol before a file that contains a list of file names to be
searched.  For example, if @gmam.nam is in the fastgbs file, the
file "gmam.nam" might contain the lines:

    </seqdb/genbank
    gbpri1.seq 1
    gbpri2.seq 1
    gbpri3.seq 1
    gbpri4.seq 1
    gbrod.seq 1
    gbmam.seq 1

In this case, the line beginning with a '<' indicates the
directory the files will be found in.  The remaining lines name
the actual sequence files.  So the first sequence file to be
searched would be:

    /seqdb/genbank/gbpri1.seq

The notation "<PIRNAQ:" might be used under the VAX/VMS operating
system. Under UNIX, the trailing '/' is left off, so the library
directory might be written as "</usr/seqlib".

     The FASTA programs can search a database composed of
different files in different sequence formats.  For example, you
may wish to search the Genbank files (in GenBank flat file
format) and the EMBL DNA sequence database on CD-ROM.  To do
this, you simply list the names and filetypes of the files to be
searched in a file of filenames.  For example, to search the
mammalian portion of Genbank, the unannotated portion of Genbank,
and the unannotated portion of the EMBL library, you could use
the file:

    </usr/lib/DNA
    gbpri.seq 1
    #  (this '#' causes the program to display the size of the library)
    gbrod.seq 1
    ...
    gbmam.seq 1
    ...
    gbuna.seq 1
    ...
    unanno.seq 5
    #

     You do not need to include library format numbers if you
only use the Pearson/FASTA version of the PIR protein sequence
library.  If no library type is specified, the program assumes
that type 0 is being used.
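
The way a file of file names is expanded can be sketched as
follows.  This Python fragment is an illustration only: the
function name is hypothetical, it handles UNIX-style directories
(not the VAX/VMS "<PIRNAQ:" notation), and it simply skips '#'
lines rather than reporting library sizes.

```python
import posixpath


def expand_name_file(lines):
    """Turn the lines of a file of file names into (path, libtype) pairs."""
    directory = ""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            # '#' lines only ask the program to display the library size.
            continue
        if line.startswith("<"):
            # '<' gives the directory in which the files will be found.
            directory = line[1:]
            continue
        parts = line.split()
        # A missing library type means type 0 (Pearson/FASTA).
        libtype = int(parts[1]) if len(parts) > 1 else 0
        entries.append((posixpath.join(directory, parts[0]), libtype))
    return entries
```

With the gmam.nam example, the first pair produced is the path of
gbpri1.seq inside /seqdb/genbank together with library type 1.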

     Test the setup by running FASTA.  Enter the sequence file
'mgstm1.aa' when the program requests it (this file is included
with the programs).  The program should then ask you to select a
protein sequence library.  Alternatively, if you run the TFASTA
program and use the mgstm1.aa query sequence, the program should
show you a selection of DNA sequence libraries.  Once the fastgbs
file has been set up correctly, you can set FASTLIBS=fastgbs in
your AUTOEXEC.BAT file, and you will not need to remember where
the libraries are kept or how they are named.

3.  Using the FASTA Package

3.1.  Overview

     The FASTA sequence comparison programs all require similar
information: the name of a query sequence file, a library file,
and the ktup parameter.  All of the programs can accept arguments
on the command line, or they will prompt for the file names and
ktup value.

To use FASTA, simply type:

    FASTA
    and you will be prompted for :
         the name of the test sequence file
         the name of the library file
         and whether you want ktup = 1 or 2. (or 1 to 6 for DNA sequences)
         (ktup of 2 is about 5 times faster than ktup = 1)

The program can also be run by typing

    FASTA test.aa /lib/bigfile.lib ktup (1 or 2)

Included with the package are several test files.  To check to
make certain that everything is working, you can try:

    fasta musplfm.aa prot_test.lib
    and
    tfastx mgstm1.aa gst.nlib

3.2.  Sequence files

     The fasta3 programs know about three kinds of sequence
files: (1) Plain sequence files, which contain nothing but
sequence residues and can only be used as query sequences. (2)
FASTA format files.  These are the same as plain sequence files,
except that each sequence is preceded by a comment line with a
'>' in the first column. (3) Distributed sequence libraries (this
is a broad class that includes the NBRF/PIR VMS and blocked ascii
formats, Genbank flat-file format, EMBL flat-file format, and
Intelligenetics format).  All of the files that you create should
be of type (1) or (2).  FASTA format files (ones with a '>' and
comment before the sequence) are preferred, because they can be
used as query or library sequence files by all of the programs.

     I have included several sample test files, *.aa and *.seq as
well as two small sequence libraries, prot_test.lib and gst.nlib.
The first line may begin with a '>' followed by a comment.  Spaces
and tabs (and anything else that is not an amino-acid code) are
ignored.

     Library files should have the form:

    >Sequence name and identifier
    A F A S Y T .... actual sequence.
    F S S       .... second line of sequence.
    >Next sequence name and identifier

This is often referred to as "FASTA" or "Pearson" format.  You can
build your own library by concatenating several sequence files.
Just be sure that each sequence is preceded by a line beginning
with a '>' with a sequence name.
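
The library layout above is simple enough to read with a few lines
of code.  The following Python sketch is not taken from the FASTA
package; the function name is hypothetical, and as a simplification
it keeps every alphabetic character rather than checking against
the amino-acid alphabet.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-format lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new '>' line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Keep letters only; spaces, tabs and digits are dropped.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    if header is not None:
        yield header, "".join(chunks)
```

Because each record carries its own '>' line, concatenating several
such files yields a file that the same reader handles unchanged.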

     The test file should not have lines longer than 120
characters, and sequences entered with word processors should use
a document mode, with normal carriage returns at the end of
lines.

     A different format is required to specify the ordered
peptide mixture for fastf3/tfastf3. For example:

    >mgstm1
    MGCEN,
    MIDYP,
    MLLAY,
    MLLGY

indicates M in the first position of all four peptides (as from
CNBr), G, I, L (twice) in the second position (first cycle),
C, D, L (twice) in the third position, etc.  The commas (,) are
required to indicate the number of fragments in the mixture, but
there should be no comma after the last residue.

     For the fasts3/tfasts3 program, the format is the same,
except that there is no requirement for the peptides to be the
same length.
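
The relation between the comma-separated peptides and the residues
seen at each position can be made concrete with a short Python
sketch.  The function names are hypothetical and the code only
illustrates the file layout described above; it is not how fastf3
itself reads the file.

```python
def parse_peptide_mixture(lines):
    """Return (name, peptides) from a fastf3/fasts3-style query file."""
    name = lines[0].lstrip(">").strip()
    body = "".join(line.strip() for line in lines[1:])
    if body.endswith(","):
        raise ValueError("no comma is allowed after the last residue")
    return name, body.split(",")


def residues_by_position(peptides):
    """For an ordered (fastf3) mixture, list the residues at each position."""
    if len({len(p) for p in peptides}) != 1:
        raise ValueError("an ordered mixture needs peptides of equal length")
    return ["".join(column) for column in zip(*peptides)]
```

For the mgstm1 example, the second position collects G, I, L and L,
one residue from each of the four peptides.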

4.  Statistical Significance

     All the programs in the FASTA3 package attempt to calculate
accurate estimates of the statistical significance of a match.
For fasta3, ssearch3, and fastx3/y3, these estimates are very
accurate (Pearson, 1998; Zhang et al., 1997).  Altschul et al.
(Altschul et al., 1994) provide an excellent review of the
statistics of local similarity scores.  Local sequence similarity
scores follow the extreme value distribution, so that P(s > x) =
1 - exp(-exp(-lambda(x-u))), where u = ln(Kmn)/lambda and m, n
are the lengths of the query and library sequences. This formula
can be rewritten as: 1 - exp(-Kmn exp(-lambda x)), which shows that
the average score for an unrelated library sequence increases
with the logarithm of the length of the library sequence.  The
fasta3 programs use simple linear regression against the log
of the library sequence length to calculate a normalized "z-
score" with mean 50, regardless of library sequence length, and
standard deviation 10. (Several other estimation methods are available with
the -z option.) These z-scores can then be used with the extreme
value distribution and the Poisson distribution (to account for
the fact that each library sequence comparison is an independent
test) to calculate the number of library sequences expected to
obtain a score greater than or equal to the score obtained in the
search.  The original idea and routines to do the linear
regression on library sequence length were provided by Phil
Green, U. Washington.
This version uses a slightly different strategy for fitting the
data than those originally provided by Dr. Green.
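
The two ways of writing the extreme value probability given above
can be checked numerically.  The Python sketch below evaluates
both; the function names are hypothetical, and any parameter
values passed in (lambda, K, m, n) are the caller's assumptions,
not values used by the FASTA programs.

```python
import math


def evd_pvalue(x, lam, K, m, n):
    """P(s > x) written as 1 - exp(-K*m*n * exp(-lam*x))."""
    # -expm1(y) equals 1 - exp(y) and stays accurate for tiny probabilities.
    return -math.expm1(-K * m * n * math.exp(-lam * x))


def evd_pvalue_location(x, lam, K, m, n):
    """The same probability written with the location u = ln(K*m*n)/lam."""
    u = math.log(K * m * n) / lam
    return -math.expm1(-math.exp(-lam * (x - u)))
```

Both functions return the same value for the same inputs, since
exp(lambda*u) equals Kmn; the probability falls as the score x
rises and grows with the sequence lengths m and n.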

     The expected number of sequences is plotted in the histogram
using an "*". Since the parameters for the extreme value
distribution are not calculated directly from the distribution of
similarity scores, the pattern of "*'s" in the histogram gives a
qualitative view of how well the statistical theory fits the
similarity scores calculated by the programs.  For fasta3, if
optimized scores are calculated for each sequence in the database
(the default), the agreement between the actual distribution of
"z-scores" and the expected distribution based on the length
dependence of the score and the extreme value distribution is
usually very good.  Likewise, the distribution of ssearch3 Smith-
Waterman scores typically agrees closely with the actual
distribution of "z-scores."  The agreement with unoptimized
scores, ktup=2, is often not very good, with too many high
scoring sequences and too few low scoring sequences compared with
the predicted relationship between sequence length and similarity
score.  In those cases, the expectation values may be
overestimates.

     With version 33t01, all the FASTA programs also report a
"bit" score, which is equivalent to the bit score reported by
BLAST2.  The FASTA33/BLAST2 bit score is calculated as: (lambda*S
- ln K)/ln 2, where S is the raw similarity score, lambda and K
are statistical parameters estimated from the distribution of
unrelated sequence similarity scores.  The statistical
significance of a given bit score depends on the lengths of the
query and library sequences and the size of the library, but a 1
bit increase in score corresponds to a 2-fold reduction in
expectation; a 10-bit increase implies 1000-fold lower
expectation, etc.
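
The bit score formula and the fold-change rule can be written out
directly.  This is a small Python sketch with hypothetical function
names; lambda and K must be supplied by the caller and are not
estimated here.

```python
import math


def bit_score(raw_score, lam, K):
    """Bit score as (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expectation_fold_change(bit_increase):
    """Factor by which the expectation drops for a given bit increase."""
    return 2.0 ** bit_increase
```

A 10-bit increase gives a factor of 2**10 = 1024, which is the
"1000-fold" figure quoted above in round numbers.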

     The statistical routines assume that the library contains a
large sample of unrelated sequences.  If this is not true, then
statistical parameters can be estimated by using the -z 11-15
options.  -z options greater than 10 calculate a shuffled
similarity score for each library sequence, in addition to the
unshuffled score, and estimate the statistical parameters from
the scores of the shuffled sequences.  If there are fewer than 20
sequences in the library, the statistical calculations are not
done.

     For protein searches, library sequences with E() values <
0.01 for searches of a 10,000 entry protein database are almost
always homologous. Frequently sequences with E() values from 1 to
10 are related as well, but unrelated sequences (1 to 10 per
search) will have scores in this range as well. Remember,
however, that these E() values also reflect differences between
the amino acid composition of the query sequence and that of the
"average" library sequence.  Thus, when searches are done with
query sequences with "biased" amino-acid composition, unrelated
sequences may have "significant" scores because of sequence bias.
PRSS3 can address this problem by calculating similarity scores
for random sequences with the same length and amino acid
composition.

5.  Options

     Command line options are available to change the scoring
parameters and output display. Command line options must precede
other program arguments, such as the query and library file
names.

5.1.  Command line options

-a   (fasta3, ssearch3 only) show both sequences in their
     entirety.

-A   force Smith-Waterman alignments for fasta3 DNA sequences.
     By default, only fasta3 protein sequence comparisons use
     Smith-Waterman alignments.

-B   Show normalized score as a z-score, rather than a bit-score
     in the list of best scores.

-b # Number of sequence scores to be shown on output.  In the
     absence of this option, fasta (and tfasta and ssearch)
     display all library sequences obtaining similarity scores
     with expectations less than 10.0 if optimized scores are
     used, or 2.0 if they are not. The -b option can limit the
     display further, but it will not cause additional sequences
     to be displayed.

-c # Threshold score for optimization (OPTCUT).  Set "-c 1" to
     optimize every sequence in a database.

-E # Limit the number of scores and alignments shown based on the
     expected number of scores.  Used to override the expectation
     value of 10.0 used by default.  When used with -Q, -E 2.0
     will show all library sequences with scores with an
     expectation value <= 2.0.

-d # Maximum number of alignments to be displayed.  Ignored if
     "-Q" is not used.

-f   Penalty for the first residue in a gap (-12 by default for
     proteins, -16 for DNA, -15 for FAST[XY]/TFAST[XY]).

-F # Limit the number of scores and alignments shown based on the
     expected number of scores. "-E #" sets the highest E()-value
     shown; "-F #" sets the lowest E()-value. Thus, "-F 0.0001"
     will not show any matches or alignments with E() < 0.0001.
     This allows one to skip over close relationships in searches
     for more distant relationships.

-g   Penalty for additional residues in a gap (-2 by default for
     proteins, -4 for DNA, -3 for FAST[XY]/TFAST[XY]).

-h   Penalty for frameshift (fastx3/y3, tfastx3/y3 only).

-H   Omit histogram.

-i   Invert (reverse complement) the query sequence if it is DNA.
     For tfasta3/x3/y3, search the reverse complement of the
     library sequence only.

-j # Penalty for frameshift within a codon (fasty3/tfasty3 only).

-l file
     Location of library menu file (FASTLIBS).

-L   Display more information about the library sequence in the
     alignment.

-M low-high
     Range of amino acid sequence lengths to be included in the
     search.

-m # Specify alignment type: 0, 1, 2, 3, 4, 5, 6, 9, 10

             -m 0        -m 1          -m 2          -m 3        -m 4
         MWRTCGPPYT   MWRTCGPPYT    MWRTCGPPYT                 MWRTCGPPYT
         ::..:: :::     xx  X       ..KS..Y...    MWKSCGYPYT   ----------
         MWKSCGYPYT   MWKSCGYPYT

     -m 5 provides a combination of -m 4 and -m 0. -m 6 provides
     -m 5 plus HTML formatting.

-m 9 provides coordinates and scores with the best score
     information.  A simple "-m 9" extends the normal best score
     information:

         The best scores are:                                      opt bits E(14548)
         XURTG4 glutathione transferase (EC 2.5.1.18) 4 -   ( 219) 1248 291.7 1.1e-79

     to include the additional information (on the same line,
     separated by a <tab>):

         %_id  %_gid   sw  alen  an0  ax0  pn0  px0  an1  ax1 pn1 px1 gapq gapl  fs
         0.771 0.771 1248  218    1  218    1  218    1  218    1  219   0   0   0
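
     The extra "-m 9" columns can be paired with their names for
     use in other programs.  The Python sketch below is an
     illustration with a hypothetical function name; it assumes
     the column order shown above and that every column other
     than the two "%" columns holds an integer, as in the example.

```python
# Column names in the order printed in the example above.
M9_FIELDS = ("%_id %_gid sw alen an0 ax0 pn0 px0 "
             "an1 ax1 pn1 px1 gapq gapl fs").split()


def parse_m9_fields(text):
    """Pair the extra -m 9 values with their column names."""
    # split() with no argument handles both tabs and runs of spaces.
    values = text.split()
    if len(values) != len(M9_FIELDS):
        raise ValueError("expected %d values, got %d"
                         % (len(M9_FIELDS), len(values)))
    return {name: (float(value) if name.startswith("%") else int(value))
            for name, value in zip(M9_FIELDS, values)}
```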

     -m 9c provides additional information: an encoded alignment
     string.  Thus:

                10        20        30        40        50          60         70
         GT8.7  NVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKL--GLDFPNLPYL-IDGSHKITQ
                :.::  . :: ::  .   .:::         : .:    ::.:   .: : ..:.. :::  :..:
         XURTG  NARGRMECIRWLLAAAGVEFDEK---------FIQSPEDLEKLKKDGNLMFDQVPMVEIDG-MKLAQ
                        20        30                 40        50        60

     would be encoded:

         =23+9=13-2=10-1=3+1=5

     The alignment encoding is with respect to the alignment, not
     the sequences.  The coordinate of the alignment is given
     earlier in the "-m 9c" line.
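
     The encoded string can be decoded by reading it as a series
     of operator/count pairs.  The Python sketch below is an
     interpretation inferred from the example above, not from the
     FASTA source: '=' is taken as a run of aligned columns, '+'
     as residues present only in the first (top) sequence, and
     '-' as residues present only in the second (bottom)
     sequence.  The function name is hypothetical.

```python
import re


def decode_alignment(encoding):
    """Return (alignment columns, first-seq residues, second-seq residues)."""
    ops = re.findall(r"([=+-])(\d+)", encoding)
    # Reject strings containing anything other than operator/count pairs.
    if "".join(op + count for op, count in ops) != encoding:
        raise ValueError("unrecognized alignment encoding: %r" % encoding)
    columns = first = second = 0
    for op, count in ops:
        n = int(count)
        columns += n
        if op in "=+":      # residues consumed from the first sequence
            first += n
        if op in "=-":      # residues consumed from the second sequence
            second += n
    return columns, first, second
```

     For the example string, this gives 67 alignment columns
     covering 64 residues of GT8.7 and 57 residues of XURTG,
     which matches the gap characters in the displayed alignment.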

-m 10
     -m 10 is a new, parseable format for use with other
     programs.  See the file "readme.v20u4" for a more complete
     description.

     As of version "fa34t23b2", it has become possible to combine
     independent "-m" options.  Thus, one can use "-m 1 -m 6 -m
     9".

-M low-high
     Include library sequences (proteins only) with lengths
     between low and high.

-n   Force the query sequence to be treated as a DNA sequence.
     This is particularly useful for query sequences that contain
     a large number of ambiguous residues, e.g. transcription
     factor binding sites.

-O   Send copy of results to "filename."  Helpful for
     environments without STDOUT (mostly for the Macintosh).

-o   Turn off default optimization of all scores greater than
     OPTCUT. Sort results by "initn" scores (reduces the accuracy
     of statistical estimates).

-p   Force query to be treated as protein sequence.

-Q,-q
     Quiet - does not prompt for any input.  Writes scores and
     alignments to the terminal or standard output file.

-r   Specify match/mismatch scores for DNA comparisons.  The
     default is "+5/-4". "+3/-2" can perform better in some
     cases.

-R file
     Save a results summary line for every sequence in the
     sequence library.  The summary line includes the sequence
     identifier, superfamily number (if available) position in
     the library, and the similarity scores calculated.  This
     option can be used to evaluate the sensitivity and
     selectivity of different search strategies (Pearson, 1995,
     Pearson, 1998).

-s file
     Specify the scoring matrix file.  fasta3 uses the same
     scoring matrices as Blast1.4/2.0.  Several scoring matrix
     files are included in the standard distribution.  For
     protein sequences: codaa.mat - based on minimum mutation
     matrix; idnaa.mat - identity matrix; pam250.mat - the PAM250
     matrix developed by Dayhoff et al. (Dayhoff et al., 1978);
     pam120.mat - a PAM120 matrix.  The default scoring matrix is
     BLOSUM50 ("-s BL50"). Other matrices available from within
     the program are: PAM250/"-s P250", PAM120/"-s P120",
     PAM40/"-s P40", PAM20/"-s P20", MDM10 - MDM40/"-s M10 - M40"
     (MDM are modern PAM matrices from Jones et al. (Jones et
     al., 1992),), BLOSUM50, 62, and 80/"-s BL50", "-s BL62", "-s
     BL80".

-S   Treat lower-case characters in the query or library
     sequences as "low-complexity" ("seg"-ed) residues.
     Traditionally, the "seg" program (Wootton and
     Federhen, 1993) is used to remove low complexity regions in
     protein sequences by replacing the residues with an "X".  When
     the "-S" option is used, the FASTA33 programs provide a
     potentially more informative approach.  With "-S", lower
     case characters in the query or database sequences are
     treated as "X"'s during the initial scan, but are treated as
     normal residues during the final alignment display.  Since
     statistical significance is calculated from the similarity
     score calculated during the library search, when the lower
     case residues are "X"'s, low complexity regions will not
     produce statistically significant matches.  However, if a
     significant alignment contains low complexity regions, their
     alignment is shown.  With "-S", lower case characters may be
     included in the alignment to indicate low complexity
     regions, and the final alignment score may be higher than
     the score obtained during the search.

     The pseg program can be used to produce databases (or query
     sequences) with lower case residues indicating low
     complexity regions using the command:

         pseg database.fasta -z 1 -q  > database.lc_seg

     (seg can also be used with some post processing, see
     readme.v33tx.)

-U   Treat the query sequence as an RNA sequence.  In addition to
     selecting a DNA/RNA alphabet, this option causes changes to
     the scoring matrix so that 'G:A' , 'T:C' or 'U:C' are scored
     as 'G:G'.

-V str
     It is now possible to specify some annotation characters
     that can be included (and will be ignored), in the query
     sequence file.  Thus, one might have a file with:
     "ACVS*ITRLFT?", where "*" and "?"  are used to indicate
     phosphorylation.  By giving the option -V '*?', those
     characters in the query will be moved to an "annotation
     string", and alignments that include the annotated residues
     will be highlighted with the appropriate character above the
     sequence (on the number line).

-w # Line length (width) = number (<200)

-W # Context length (default is 1/2 of line width -w) for
     alignment programs, like fasta and ssearch, that provide
     additional sequence context.

-x # Specify the penalty for a match to an 'X', independently of
     the PAM matrix.  Particularly useful for fastx3/fasty3,
     where termination codons are encoded as 'X'.

-X   Specifies offsets for the beginning of the query and library
     sequence.  For example, if you are comparing upstream
     regions for two genes, and the first sequence contains 500
     nt of upstream sequence while the second contains 300 nt of
     upstream sequence, you might try:

         fasta -X "-500 -300" seq1.nt seq2.nt

     If the -X option is not used, FASTA assumes numbering starts
     with 1.  (You should double check to be certain the negative
     numbering works properly.)

-y   Set the width of the band used for calculating "optimized"
     scores.  For proteins and ktup=2, the width is 16.  For
     proteins with ktup=1, the width is 32 by default.  For DNA
     the width is 16.

-z -1,0,1,2,3,4,5
     -z -1 turns off statistical calculations.  -z 0 estimates
     the significance of the match from the mean and standard
     deviation of the library scores, without correcting for
     library sequence length.  -z 1 (the default) uses a weighted
     regression of average score vs library sequence length; -z 2
     uses maximum likelihood estimates of Lambda and K; -z 3 uses
     Altschul-Gish parameters (Altschul and Gish, 1996); -z 4 and
     -z 5 use two variations on the -z 1 strategy.  -z 1 and -z 2
     are the best methods, in general.

-z 11,12,14,15
     estimate the statistical parameters from shuffled copies of
     each library sequence.  This doubles the time required for a
     search, but allows accurate statistics to be estimated for
     libraries comprised of a single protein family.

-Z db_size
     set the apparent size of the database to be used when
     calculating expectation E() values.  If you searched a
     database with 1,000 sequences, but would like to have the
     E()-values calculated in the context of a 100,000 sequence
     database, use '-Z 100000'.

-1   sort output by init1 score (for compatibility with FASTP -
     do not use).

-3   translate only three forward frames
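
The handling of annotation characters described under -V can be
sketched in a few lines of Python.  This is an illustration only,
not the code used by the FASTA programs: the function names are
hypothetical, and it assumes that each annotation character marks
the residue immediately before it.

```python
def split_annotations(raw_query, annotation_chars):
    """Separate annotation characters from residues.

    Returns (sequence, annotations).  `annotations` maps the 0-based
    index of a residue to the annotation character that followed it
    (assumption: an annotation marks the residue just before it).
    """
    sequence = []
    annotations = {}
    for ch in raw_query:
        if ch in annotation_chars:
            if sequence:
                annotations[len(sequence) - 1] = ch
        else:
            sequence.append(ch)
    return "".join(sequence), annotations


def annotation_line(sequence_length, annotations):
    """Build a line as long as the sequence, with marks over annotated residues."""
    return "".join(annotations.get(i, " ") for i in range(sequence_length))


# The example query from the -V description, with '*' and '?' as annotations.
seq, marks = split_annotations("ACVS*ITRLFT?", "*?")
print(annotation_line(len(seq), marks))
print(seq)
```

The first printed line plays the role of the "annotation string";
the second is the sequence that would actually be compared.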

For example:

    fasta -w 80 -a seq1.aa seq2.aa

would compare the sequence in seq1.aa to that in seq2.aa and
display the results with 80 residues on an output line, showing
all of the residues in both sequences.  Be sure to enter the
options before entering the file names, or just enter the options
on the command line, and the program will prompt for the file
names.
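
As a worked illustration of the -Z option described above, the
Python sketch below assumes that an E() value is proportional to
the number of library sequences, so changing the apparent database
size rescales it linearly.  The function name is hypothetical and
the starting E() value of 0.001 is made up for the example.

```python
def rescale_expectation(e_value, searched_size, apparent_size):
    """Rescale an E() value from the searched database size to an apparent size.

    Assumption: E() is proportional to the number of library sequences.
    """
    if searched_size <= 0 or apparent_size <= 0:
        raise ValueError("database sizes must be positive")
    return e_value * apparent_size / searched_size


# A match found in a 1,000 sequence database, reported as if the
# database held 100,000 sequences (the '-Z 100000' case above).
scaled = rescale_expectation(0.001, 1_000, 100_000)
print(scaled)
```

Under this assumption the expectation grows by the ratio of the
two database sizes, a factor of 100 in the example.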

     (November, 1997) In addition, it is now possible to provide
the fasta programs with the query sequence (fasta, fasty,
ssearch, tfastx), or two sequences (prss, lalign, plalign) from
the unix "stdin" stream.  This makes it much easier to set up
FASTA or PRSS WWW pages.  To specify that stdin be used, rather
than a file, the file name should be specified as '-' or '@' (the
latter file name makes it possible to specify a subset of the
sequence).  Thus:

    cat query.aa | fasta -q @:25-75 s

would take residues 25-75 from query.aa and search the 's'
library (see the discussion of FASTLIBS).
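
The subset notation used above can be illustrated with a short
Python sketch.  It is not the parser used by the FASTA programs;
the function names are hypothetical, and it assumes that residues
are counted from 1 and that both ends of the range are included.

```python
def parse_subset(file_spec):
    """Split 'name:start-end' into (name, start, end).

    start and end are None when no range is given.
    """
    name, colon, region = file_spec.partition(":")
    if not colon:
        return name, None, None
    start_text, dash, end_text = region.partition("-")
    if not dash:
        raise ValueError("expected a range written as start-end")
    return name, int(start_text), int(end_text)


def take_residues(sequence, start, end):
    """Return residues start..end, counted from 1 and inclusive at both ends."""
    return sequence[start - 1:end]


name, start, end = parse_subset("@:25-75")
query = "ACDEFGHIKLMNPQRSTVWY" * 5   # a made-up 100 residue query
print(name, start, end, len(take_residues(query, start, end)))
```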

5.2.  Environment variables

     Because the current version of the program allows the user
to set virtually every option on the command line (except the
ktup, which must be set as the third command line argument), only
the FASTLIBS environment variable is routinely used.

FASTLIBS
     specifies the location of the file which contains the list
     of library descriptions, locations, and library types (see
     section on finding library files).

6.  Frequently Asked Questions

 (1)   Which program should I use? See Table I.

 (2)   How do I search with both DNA strands with fasta3 and
       fastx3? With version 32 of the FASTA program package, all
       searches that use DNA queries (e.g. fasta3, fastx3/y3)
       examine both strands. To revert to earlier FASTA behavior
       - only looking at the forward or reverse strand - use -3
       to search only the forward strand and -i -3 to search only
       the reverse strand.

 (3)   When I search Genbank - the program reports: 0 residues in
       0 sequences.  This typically happens because the program
       does not know that you are searching a Genbank flatfile
       database and is looking for a FASTA format database.  Be
       certain to specify the library type ("1" for Genbank
       flatfile) with the database name.

 (4)   What is the difference between fastx3 and fasty3 (or
       tfastx3 and tfasty3)?  [t]fastx3 uses a simpler codon
       based model for alignments that does not allow frameshifts
       in some codon positions (see ref. (Zhang et al., 1997)).
       tfastx3 is about 30% faster, but tfasty3 can produce
       higher quality alignments in some cases.

 (5)   When I run fasta3 -q, I see little or no output, but I get
       lots of scores when I run interactively.  With the -q
       option, the number of high scores displayed is limited by
       the -E # cutoff, which is 10.0 for protein comparisons,
       2.0 for DNA comparisons, and 5.0 for translated
       DNA:protein comparisons.  In interactive mode (without
       -q), by default you see 20 high scores, regardless of E()
       value.

 (6)   What is ktup?  All of the programs with fast in their name
       use a computer science method called a lookup table to
       speed the search.  For proteins with ktup=2, this means
       that the program does not look at any sequence alignment
       that does not involve matching two adjacent identical
       residues in both sequences.  Likewise with DNA and ktup=6,
       the initial alignment of the sequences looks for 6
       identical adjacent nucleotides in both sequences.  Because
       it is less likely that two adjacent identical amino-acids
       will line up by chance in two unrelated proteins, this
       speeds up the comparison.  But very distantly related
       sequences may never have two identical residues in a row
       but will have single aligned identities.  In this case,
       ktup=1 may find alignments that ktup=2 misses.

 (7)   Sometimes, in the list of best scores, the same sequence
       is shown twice with exactly the same score.  Sometimes,
       the sequence is there twice, but the scores are slightly
       different. When any of the fasta3 programs searches a long
       sequence, it breaks the sequence up into overlapping
       pieces.  The length of the piece depends on the length of
       the query and the particular program being used (it can
       also be controlled with the -N #### option).  Since the
       pieces overlap by the length of the query sequence (or
       3*query_length for fastx/y3 and tfasta/x/y3), if the
       highest scoring alignment is at the end of one piece, it
       will be scored again at the beginning of the next piece.
       If the alignment is not completely included in the
       overlap region, one of the pieces will give a higher score
       than the other.  These duplications can be detected by
       looking at the coordinates of the alignment.  If either
       the beginning or end coordinate is identical in two
       alignments, the alignments are at least partially
       duplicates.
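
The overlapping pieces described in question (7), and the
coordinate check for duplicated alignments, can be sketched in
Python as follows.  This is only an illustration under simple
assumptions, not the code in the fasta3 programs: the function
names, the piece length of 400 and the overlap of 100 are all
hypothetical values chosen for the example.

```python
def overlapping_pieces(library_length, piece_length, overlap):
    """Return 1-based inclusive (start, end) windows covering a long sequence.

    Consecutive windows share `overlap` residues, so piece_length
    must be larger than overlap.
    """
    if piece_length <= overlap:
        raise ValueError("piece_length must be larger than overlap")
    pieces = []
    start = 1
    while True:
        end = min(start + piece_length - 1, library_length)
        pieces.append((start, end))
        if end == library_length:
            break
        # The next piece begins `overlap` residues before this one ends.
        start = end - overlap + 1
    return pieces


def partially_duplicated(alignment_a, alignment_b):
    """True if two (begin, end) alignments share a begin or an end coordinate."""
    return alignment_a[0] == alignment_b[0] or alignment_a[1] == alignment_b[1]


pieces = overlapping_pieces(1000, 400, 100)
print(pieces)
print(partially_duplicated((350, 400), (350, 420)))
```

An alignment that lies entirely inside a shared region, such as
residues 301-400 here, is scored in both pieces and reported twice
with the same score; one that extends past the end of the first
piece is cut short there and scores lower than in the next piece.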

As always, please inform me of bugs as soon as possible.

William R. Pearson
Department of Biochemistry
Jordan Hall Box 800733
U. of Virginia
Charlottesville, VA

wrp@virginia.EDU

7.  References

Altschul, S. F., Boguski, M. S., Gish, W., and Wootton, J. C.
(1994). Issues in searching molecular sequence databases. Nature
Genet. 6,119-129.

Altschul, S. F. and Gish, W. (1996). Local alignment statistics.
Methods Enzymol. 266,460-480.

Bairoch, A. and Apweiler, R. (1996). The Swiss-Prot protein
sequence data bank and its new supplement TrEMBL. Nucleic Acids
Res. 24,21-25.

Barker, W. C., Garavelli, J. S., Haft, D. H., Hunt, L. T.,
Marzec, C. R., Orcutt, B. C., Srinivasarao, G. Y., Yeh, L. S. L.,
Ledley, R. S., Mewes, H. W., Pfeiffer, F., and Tsugita, A.
(1998). The PIR-International Protein Sequence Database. Nucleic
Acids Res. 26,27-32.

Dayhoff, M., Schwartz, R. M., and Orcutt, B. C. (1978). A model
of evolutionary change in proteins. In Atlas of Protein Sequence
and Structure, vol. 5, supplement 3. M. Dayhoff, ed. (Silver
Spring, MD: National Biomedical Research Foundation), pp.
345-352.

Jones, D. T., Taylor, W. R., and Thornton, J. M. (1992). The
rapid generation of mutation data matrices from protein
sequences. Comp. Appl. Biosci. 8,275-282.

Pearson, W. R. (2000). Flexible similarity searching with the
FASTA3 program package. In Bioinformatics Methods and Protocols,
S. Misener and S. A. Krawetz, ed. (Totowa, NJ: Humana Press), pp.
185-219.

Pearson, W. R. and Lipman, D. J. (1988). Improved tools for
biological sequence comparison. Proc. Natl. Acad. Sci. USA
85,2444-2448.

Pearson, W. R. (1995). Comparison of methods for searching
protein sequence databases. Prot. Sci. 4,1145-1160.

Pearson, W. R. (1996). Effective protein sequence comparison.
Methods Enzymol. 266,227-258.

Pearson, W. R. (1998). Empirical statistical estimates for
sequence similarity searches. J. Mol. Biol. 276,71-84.

Smith, T. F. and Waterman, M. S. (1981). Identification of common
molecular subsequences. J. Mol. Biol. 147,195-197.

Wootton, J. C. and Federhen, S. (1993). Statistics of local
complexity in amino acid sequences and sequence databases.
Comput. Chem. 17,149-163.

Zhang, Z., Pearson, W. R., and Miller, W. (1997). Aligning a DNA
sequence with a protein sequence. J. Computational Biology
4,339-349.


Allowed combinations

Some programs restrict which combinations of query and target sequence types are allowed. Each combination below is written as query-target (aa: amino acid sequence, nt: nucleic acid sequence).

  • BLAST: aa-aa, aa-nt, nt-aa, nt-nt
  • FASTA: aa-aa, aa-nt, nt-aa, nt-nt
  • SSEARCH: aa-aa, nt-nt
  • EXONERATE: aa-aa, aa-nt, nt-aa, nt-nt
  • TRANS: aa-nt, nt-aa, nt-nt
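
The list above can be treated as a simple lookup table. The Python sketch below copies the combinations from the list and checks a request against them; the names ALLOWED and is_allowed are hypothetical and are not part of the service.

```python
# Allowed (query, target) sequence types per program, copied from the list above.
ALL_FOUR = {("aa", "aa"), ("aa", "nt"), ("nt", "aa"), ("nt", "nt")}
ALLOWED = {
    "BLAST": set(ALL_FOUR),
    "FASTA": set(ALL_FOUR),
    "SSEARCH": {("aa", "aa"), ("nt", "nt")},
    "EXONERATE": set(ALL_FOUR),
    "TRANS": {("aa", "nt"), ("nt", "aa"), ("nt", "nt")},
}


def is_allowed(program, query_type, target_type):
    """Return True if the program accepts this query-target combination."""
    return (query_type, target_type) in ALLOWED.get(program.upper(), set())


print(is_allowed("SSEARCH", "aa", "nt"))
print(is_allowed("FASTA", "nt", "aa"))
```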

The University of Tokyo The Institute of Medical Science

Copyright©2005-2012 Human Genome Center