0) Copyrights
Copyright (c) 1999-2002, by Mathieu Blanchette and Martin Tompa.
All rights reserved. Redistribution is not permitted without the
express written permission of the authors.
1) Description of the program
FootPrinter is a program that performs phylogenetic footprinting. It takes as input a set of unaligned homologous sequences from various species, together with a phylogenetic tree relating these species. It then searches for short regions of the sequences that are highly conserved, according to a parsimony criterion. The regions identified are good candidates for regulatory elements. By default, the program searches for regions that are well conserved across all of the input sequences, but this can be relaxed to allow finding regions conserved in only a subset of the species. The use of the program is described in [0]. The algorithms used by FootPrinter are described in [1], and examples of results it produced are discussed in [2].
The simplest way to use FootPrinter is to access it via our web
server. You may also want to download FootPrinter
to run it on your own machine.
For now, FootPrinter is only available for Unix/Linux. After downloading,
follow these instructions:
gunzip FootPrinter.tar.gz
tar -xvf FootPrinter.tar
cd FootPrinter2.0
make FootPrinter
When you run FootPrinter on your own machine, the following command line should be used:
FootPrinter <input_sequences> <phylogenetic_tree> [options...]
where:
1. Input sequences
FootPrinter takes as input a set of orthologous or paralogous sequences in
which you want to find motifs. The input sequences must all be listed in
the same file, in FASTA format. See example.fasta
for an example. If the sequence contains characters other than a,c,g
or t, any substring containing these characters will be ignored. The name
of each sequence (one word, after ">") must correspond to the name of
some species in the phylogenetic tree (see (2)). Each sequence must have
a name that is different from that of the other sequences. Notice that this
does not prevent you from having several sequences from the same species;
all you need to do is to name them differently (e.g. mouse1, mouse2...).
However, in this case, you will have to provide your own phylogenetic tree,
with mouse1, mouse2... as leaves.
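The naming rules above can be checked before running the program. The following Python sketch is only an illustration and is not part of FootPrinter: the function names read_fasta and usable_segments are hypothetical, the conversion to lower case is an assumption, and the reading of "ignored" as "split the sequence at every character other than a, c, g or t" is one plausible interpretation.

```python
import re

def read_fasta(text):
    """Parse FASTA text into {name: sequence}, using the first word after '>' as the name."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            if not words:
                raise ValueError("header line without a name")
            name = words[0]
            if name in records:
                # Each sequence must have a name different from all others.
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data before the first header")
        else:
            # Assumption: case is not significant, so normalise to lower case.
            records[name].append(line.lower())
    return {n: "".join(parts) for n, parts in records.items()}

def usable_segments(sequence):
    """Return the maximal runs made only of a, c, g and t."""
    return re.findall(r"[acgt]+", sequence)

example = ">mouse1 some description\nACGTNNACG\nTTA\n>mouse2\nacgtacgt\n"
seqs = read_fasta(example)
print(seqs)                             # {'mouse1': 'acgtnnacgtta', 'mouse2': 'acgtacgt'}
print(usable_segments(seqs["mouse1"]))  # ['acgt', 'acgtta']
```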
2. Phylogenetic tree
FootPrinter uses a phylogenetic tree to evaluate the conservation of each
motif. Motifs that have a low parsimony score on the given phylogenetic tree
are reported.
The distribution of FootPrinter comes with a default phylogenetic tree (the
file called tree_of_life), which contains several eukaryotic species. See
the file nice_tree_of_life
for a more readable format. If the sequences you are using come from species
listed in this tree and you have only one sequence per species, you do not
need to provide your own tree. Just make sure that the names of your sequences
correspond to the names of the species in the tree. If you need to provide
your own tree, it must be given in postfix notation (the standard
bracket format, also used by the PHYLIP package, among others). Only leaves
are labeled, with the name of the species. Species names followed by * will
match any name starting with the given name. In the downloadable version
of the program, branch lengths can be specified for some branches (branch
lengths are used only when the -losses option is used). The phylogeny can
contain more species than just those specified in the sequence file. In that
case, unused species will be removed from the tree and branch lengths will
be adjusted accordingly. All text that follows // on the same line
will be considered as comments and will be ignored. See this example.
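To check that every sequence name has a matching leaf before running the program, something along the following lines may help. This is a hedged Python sketch, not FootPrinter's own parser: the helper names are hypothetical, and it assumes branch lengths are written in the usual bracket-format style ":length" after a node.

```python
import re

def leaf_labels(tree_text):
    """Return the leaf labels of a bracket-format tree."""
    # Everything after // on a line is a comment.
    lines = [line.split("//", 1)[0] for line in tree_text.splitlines()]
    text = "".join(lines)
    # Assumption: branch lengths appear as ':number' and can be dropped here.
    text = re.sub(r":\s*[0-9.eE+-]+", "", text)
    # Only leaves are labeled, so every remaining token is a leaf name.
    return [tok for tok in re.split(r"[(),;\s]+", text) if tok]

def matches(label, name):
    """A label ending in * matches any name starting with the text before the *."""
    if label.endswith("*"):
        return name.startswith(label[:-1])
    return label == name

def unmatched_names(names, labels):
    """Sequence names that correspond to no leaf of the tree."""
    return [n for n in names if not any(matches(l, n) for l in labels)]

tree = """((human, chimp*), // primates
(mouse:0.3, rat:0.3));"""
labels = leaf_labels(tree)
print(labels)                                                        # ['human', 'chimp*', 'mouse', 'rat']
print(unmatched_names(["human", "chimp2", "mouse", "cow"], labels))  # ['cow']
```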
3. Sequence type
Choose among "upstream", "downstream", or "other". The sequence type only
matters when options (10) and (11) are used. For upstream sequences, the
3' ends of the sequences are assumed to be aligned, whereas for downstream
or other sequences, the 5' ends are assumed to be aligned.
4. Motif size:
Specifies the size of the motifs sought. FootPrinter will report all regions
of the given size that have at most the parsimony score specified in (5).
Command line option: -size <integer>
Default value : 10.
Valid range: 4 to 16.
Typical values: 6 to 12.
5. Maximum parsimony score:
The maximum parsimony score allowed for the motifs. If the maximum parsimony
score allowed is small, FootPrinter will run quickly and report only the
most conserved motifs. We suggest starting with a small parsimony score
and increasing it until the desired number of motifs is reported. This
option is used only if the -losses option is not. If the -losses option is
used, the <losses.config> file contains this information. If the motif loss
cost option (-loss_cost) or the subregion change cost option
(-position_change_cost) is used, the maximum parsimony threshold applies to
the sum of the parsimony score, the loss cost, and the subregion change cost.
Command line option: -max_mutations <float>
Default value: 2.
Valid range: 0 to 20.
Typical values: 0 to 6, depending on the motif size and on the diversity of
the input sequences.
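To make the notion of parsimony score concrete, the Python sketch below counts the minimum number of substitutions needed to explain one given set of motif instances on a tree, using Fitch's classic small-parsimony algorithm column by column. It is an illustration only and not the algorithm FootPrinter uses to search the sequences (see [1] for that): it assumes a binary tree written as nested tuples, one instance per species, and unit cost per substitution. The tree and the instances are made up.

```python
def fitch_column(tree, chars):
    """Return (possible ancestral states, substitutions) for one motif position."""
    if isinstance(tree, str):            # a leaf: its state is fixed
        return {chars[tree]}, 0
    left, right = tree                   # assumption: every internal node is binary
    left_states, left_cost = fitch_column(left, chars)
    right_states, right_cost = fitch_column(right, chars)
    common = left_states & right_states
    if common:
        return common, left_cost + right_cost
    # No shared state: one substitution is needed on a branch below this node.
    return left_states | right_states, left_cost + right_cost + 1

def parsimony_score(tree, instances):
    """Sum the per-position substitution counts over the whole motif."""
    length = len(next(iter(instances.values())))
    total = 0
    for i in range(length):
        column = {species: seq[i] for species, seq in instances.items()}
        total += fitch_column(tree, column)[1]
    return total

tree = (("human", "chimp"), ("mouse", "rat"))
instances = {"human": "acgtac", "chimp": "acgtac", "mouse": "acgtcc", "rat": "aagtcc"}
print(parsimony_score(tree, instances))  # 2
```

With a maximum parsimony score of 2, this set of instances would qualify; with a maximum of 1, it would not.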
6. Maximum number of mutations per branch:
Allows at most a fixed number of mutations per branch of the tree. Setting
this number to a small constant like one or two has a number of advantages:
it considerably reduces the running time, and reduces the incidence of some
types of spurious motifs.
Command line option: -max_mutations_per_branch <float>
Default: Not used.
Valid range : 0 to 20.
Typical values: 1 to 2.
7. Allow regulatory element losses:
FootPrinter can find regulatory elements even if some
species do not contain those regulatory elements. When this option is used,
FootPrinter first approximates the length of each branch of your tree,
and then compares the parsimony score of a motif to the length of the tree
spanned by the species containing that motif.
For the downloadable version of FootPrinter, this option is subsumed by option (8).
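The "length of the tree spanned by the species containing a motif" can be read as the total length of the branches that connect those species. The Python sketch below computes that quantity under this reading. It is an illustration, not FootPrinter's code: the tree encoding (label or list of children, paired with the length of the branch above the node), the function name and the branch lengths are all made up.

```python
def spanned_length(tree, selected):
    """Total branch length of the smallest subtree connecting the selected leaves."""
    selected = set(selected)
    total_selected = len(selected)
    length = 0.0

    def visit(node):
        nonlocal length
        label_or_children, branch = node
        if isinstance(label_or_children, str):
            count = 1 if label_or_children in selected else 0
        else:
            count = sum(visit(child) for child in label_or_children)
        # The branch above this node lies between two selected leaves exactly
        # when some, but not all, of the selected leaves are below it.
        if 0 < count < total_selected:
            length += branch
        return count

    visit(tree)
    return length

# Each node is (leaf name or list of children, length of the branch above it).
tree = ([([("human", 0.25), ("chimp", 0.25)], 0.5),
         ([("mouse", 0.5), ("rat", 0.5)], 1.0)], 0.0)
print(spanned_length(tree, ["human", "chimp"]))                 # 0.5
print(spanned_length(tree, ["human", "mouse"]))                 # 2.25
print(spanned_length(tree, ["human", "chimp", "mouse", "rat"])) # 3.0
```

A motif found only in human and chimp spans a much shorter tree than one found in human and mouse, so the same parsimony score is less remarkable in the first case.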
8. Spanned tree significance level:
When regulatory element losses are allowed, the parsimony score of a motif
is compared to the length of the tree spanned by the species containing the
motif.
For the web version of FootPrinter, we have precomputed these lengths so
that the motifs reported are statistically significant. You have the choice
of three significance levels: "Very significant" will only report motifs
that are the most significant, while "Significant" and "Somewhat significant"
will report more motifs.
Default: "Significant".
If you are running FootPrinter on your own machine, you specify
the constraints on the parsimony score and span of the motifs to be reported.
These constraints are specified in a file like example.config. This file contains a list of
pairs of the form (0,span_0) (1,span_1) (2,span_2)... (n,span_n), which
specifies that you want to have reported all motifs with parsimony score 0 that span
a tree of length at least span_0, all those with parsimony score 1 spanning
a tree of length at least span_1, etc. For this option to be used, you need
to either (i) provide branch lengths with the phylogenetic tree that you
give as input, or (ii) use the -compute_branch_lengths option, which will
estimate these branch lengths from your sequences. A set of *.config files
is included with the package. These files give span constraints so that
the motifs reported are statistically significant. For a motif of size X,
three files are provided: universalXloose.config, universalX.config and
universalXtight.config, which report motifs that are somewhat
significant, significant or very significant, corresponding approximately
to p-values of 0.2, 0.1 and 0.05 respectively. These files should be used
only when the -compute_branch_lengths option is used.
These files are provided only for your convenience and may under- or over-estimate
the statistical significance of motifs in your data set.
Command line option: -losses <example.config>
Default: Not used.
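The effect of such a list of (score, span) pairs can be sketched in a few lines of Python. This is only an illustration of the rule described above: the function names are hypothetical, the threshold values are invented, and the real *.config files may contain details (comments, line layout) that this simple parser does not handle.

```python
import re

def parse_span_config(text):
    """Parse '(score, span)' pairs into {parsimony score: minimum span}."""
    pairs = re.findall(r"\(\s*(\d+)\s*,\s*([0-9.]+)\s*\)", text)
    return {int(score): float(span) for score, span in pairs}

def is_reported(score, span, thresholds):
    """A motif is reported if its score is listed and its span reaches the minimum."""
    return score in thresholds and span >= thresholds[score]

config_text = "(0,0.8) (1,1.5) (2, 2.4)"   # made-up values, for illustration only
thresholds = parse_span_config(config_text)
print(thresholds)
print(is_reported(1, 1.6, thresholds))   # True: score 1 needs a span of at least 1.5
print(is_reported(2, 2.0, thresholds))   # False: score 2 needs a span of at least 2.4
print(is_reported(3, 10.0, thresholds))  # False: score 3 is not listed
```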
9. Motif loss cost:
When regulatory element losses are allowed, a cost can be given to losing
a particular motif along some branch of the tree.
Command line option: -loss_cost <float>
Valid range: 0 to 20.
Typical values: 0 to 2.
Default: Not used.
10. Subregion size:
In some cases, you may want to penalize motifs whose position in the sequence
varies too much. This is done using this option, together with option (11).
The subregion size option divides the input sequences into subregions of the
given size, and penalizes motifs whose position (subregion) varies too much
from species to species. The cost of changing subregion is given by option
(11).
Command line option: -subregion_size <integer>
Valid range: 1 to +infinity.
Typical values: 20 to 200 (in general, I set it to about one tenth of the
sequence length).
Default: Not used.
11. Subregion change cost:
Cost for changing subregion. See option (10).
Command line option: -position_change_cost <float>
Default: Not used.
Valid range: 0 to 20.
Typical values: 0 to 2.
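The interaction between the subregion size and the sequence type (3) can be pictured as follows. This Python sketch is an assumption about how positions map to subregions, written only to show why the choice of aligned end matters; the function name, the 0-based positions and the exact boundaries are hypothetical, and FootPrinter may count differently.

```python
def subregion_index(position, sequence_length, subregion_size, sequence_type="upstream"):
    """Index of the subregion containing a 0-based position (counted from the 5' end)."""
    if sequence_type == "upstream":
        # Upstream sequences are aligned at their 3' ends, so measure from there.
        offset = sequence_length - 1 - position
    else:
        # Downstream and other sequences are aligned at their 5' ends.
        offset = position
    return offset // subregion_size

# The same motif, 50 bases before the 3' end of two sequences of different length.
for seq_type in ("upstream", "downstream"):
    a = subregion_index(950, 1000, 100, seq_type)
    b = subregion_index(550, 600, 100, seq_type)
    print(seq_type, a, b)
# upstream 0 0      (same subregion, so no change of subregion)
# downstream 9 5    (different subregions, so a change would be penalized)
```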
Options specific to the downloadable version of FootPrinter.
In the downloadable version of FootPrinter, various filtering strategies (options
(12), (13), (14)) can be used to greatly improve the running time and space.
These options can only be used if regulatory element losses are not allowed.
12. Triplet filtering
Performs a pre-filtering step that removes from consideration
any substring that doesn't have a sufficiently good pair of matching substrings
in some pair of the other input sequences. This is often very effective at
reducing the memory used by the program. However, in some cases (e.g. too large
a maximum parsimony score), it may greatly increase the running time. Still,
in general, it is worth using this option. N.B.: The solutions returned
will be the same whether this option is used or not. This option cannot be
used when motif losses are allowed.
Command line option: -triplet_filtering
Default: Not used.
13. Pair filtering
Same as triplet filtering, but looks only for one match
per other sequence. This filtering takes less time, but is much less stringent.
This option cannot be used when motif losses are allowed.
Command line option: -pair_filtering
Default: Not used.
14. Post-filtering
When used in conjunction with the triplet filtering
option, this often significantly speeds up the program, while still
guaranteeing optimal results.
Command line option: -post_filtering
Default: Not used.
15. Insertion and deletion cost (Not fully supported yet)
If this option is specified, insertions and deletions will be allowed in
the motifs sought, at the given cost. N.B.: This slows down the program
by a factor of 10 to 20!
Command line option: -indel_cost <float>
Default: Not used.
Valid range: 1 to 5.
Typical values: 1 to 5.
16. Inversion cost (Not fully supported yet)
This option allows for motifs to undergo inversions, at the given cost.
Command line option: -inversion_cost <x>
Default: not used.
Valid range: 1 to 5.
17. No .*** output.
Command line option:
-no_html : Doesn't produce the html files.
-no_interactive_html: Doesn't produce the (numerous) files needed to make
the html output interactive.
-no_ps: Doesn't produce the postscript files.
-no_txt: Doesn't produce the text files.
18. Computation details
Command line option: -details: Shows the details of the computation.
Five types of output are produced.
1) Standard output
Standard output produces information on the input sequences and options used.
If the -details option is used, this is where those details appear.
2) HTML output
The HTML output called *.main.html is the best way to view the results of
FootPrinter. The file contains a graphical representation of the motifs found.
Motifs are highlighted in colors. The number of mutations of a motif is shown
through the size of the font: the larger the font, the fewer mutations
the motif contains. When you move the mouse over the highlighted regions,
the score, position, evolutionary span and significance score of the region
appear in the lower left corner of the browser. A high significance score
means that the motif is unexpectedly well conserved. Corresponding motifs
in different sequences are highlighted with the same color. Notice that a
motif can be present several times in the same sequence, in which case it
will be highlighted with the same color. If a large number of motifs are
found, the color representation can be a bit messy. When you click on the colored
regions of the sequences, the position of those regions is highlighted
in the graphical representation. Moreover, if regulatory element losses
are allowed, the tree spanned by the species containing the motif is highlighted.
To come back to the original representation, click the motif again. Only
one motif at a time can be highlighted. On the right of the screen, a list
of the motifs found is available. When you click on a motif in the left window,
the corresponding motif instances are displayed in the right window.
3) Motif list output
The list of all motifs found is available in the right frame of the browser.
This list is perhaps more suitable to computational parsing than the main
HTML output is. For each motif, we report its parsimony score and (when motif
losses are allowed) its span and approximate statistical significance.
Notice that in this list, overlapping motifs are not merged, so the list
can be rather long. Sometimes, a motif can have more than one instance
in a given sequence. This simply means that both instances can be used to
obtain the prescribed parsimony score.
4) PostScript output
Another way to look at the output is through the postscript file called *.seq.ps,
a postscript view of the html sequence file. This file can be printed (whereas
there seem to be some problems with printing the HTML file). The file *.orders.ps
shows an abstract view of the set of motifs found. Here, each motif is labeled
with one letter. This output is useful to visualize the order in which motifs
occur in the input sequences.
5) Text output
The text output file *.seq.txt contains almost the same information
as the HTML file, but in a simpler representation. The first line below each
sequence gives the number of mutations found in the best
motif overlapping that position. The second line below the sequence gives
the identity of the motif (the corresponding motif in other sequences
will have the same identity number). The third line gives the parsimony
score.
FootPrinter example.fasta tree_of_life -max_mutations 2 -size 12 -triplet_filtering -post_filtering
FootPrinter example.fasta tree_of_life -max_mutations 1 -size 10 -details -pair_filtering
FootPrinter example.fasta tree_of_life -max_mutations 3 -size 14 -max_mutations_per_branch 1 -details
FootPrinter example.fasta tree_of_life -size 10 -losses universal10tight.config -compute_branch_lengths -details -max_mutations 1 -no_html_interactive
FootPrinter example.fasta tree_of_life -size 10 -details -subregion_size 100 -position_change_cost 1 -max_mutations 2 -max_mutations_per_branch 1 -sequence_type downstream
Notice the difference between the previous result and
FootPrinter example.fasta tree_of_life -size 10 -details -subregion_size 100 -position_change_cost 1 -max_mutations 2 -max_mutations_per_branch 1 -sequence_type upstream
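When many runs with different parameters are needed, it can be convenient to assemble command lines like the ones above from a script. The Python sketch below only builds the argument list and does not run anything. The helper footprinter_command is hypothetical; it simply turns keyword names into the option names documented in this file, so it performs no validation of names or ranges.

```python
import shlex

def footprinter_command(sequences, tree, **options):
    """Build a FootPrinter argument list from documented option names.

    size=12 becomes '-size 12'; a value of True adds a bare flag such as
    '-triplet_filtering'; False or None leaves the option out.
    """
    args = ["FootPrinter", sequences, tree]
    for name, value in options.items():
        if value is True:
            args.append(f"-{name}")
        elif value is not False and value is not None:
            args.extend([f"-{name}", str(value)])
    return args

cmd = footprinter_command(
    "example.fasta", "tree_of_life",
    max_mutations=2, size=12, triplet_filtering=True, post_filtering=True,
)
print(shlex.join(cmd))
# FootPrinter example.fasta tree_of_life -max_mutations 2 -size 12 -triplet_filtering -post_filtering
```

The resulting list can be passed to subprocess.run once FootPrinter has been built and is on the search path.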
The program FootPrinter is based on the following papers:
[0] "FootPrinter: a program designed for phylogenetic footprinting", by Mathieu Blanchette and Martin Tompa. Nucleic Acids Research, vol. 31, no. 13, 2003, 3840-3842.
[1] "Algorithms for phylogenetic footprinting", by Mathieu Blanchette, Benno Schwikowski, and Martin Tompa. Journal of Computational Biology 9(2):211-223. 2002.
[2] "Discovery of Regulatory elements by a computation method for phylogenetic footprinting", by Mathieu Blanchette and Martin Tompa. Genome Research 12(5):739-748. 2002.
[3] "An exact algorithm for finding motifs in orthologous sequences from multiple species", by Mathieu Blanchette, Benno Schwikowski, and Martin Tompa. Eighth International Conference on Intelligent Systems for Molecular Biology, La Jolla, USA, 37-45. 2001.
[4] "Algorithms for phylogenetic footprinting", by Mathieu Blanchette. RECOMB 2001, Montreal, Canada. 2001.
Mathieu Blanchette
blanchem@cs.washington.edu
Department of Computer Science
University of Washington