FootPrinter 2.0 Manual





0) Copyrights

Copyright (c) 1999-2002, by Mathieu Blanchette and Martin Tompa.
All rights reserved.  Redistribution is not permitted without the  express written permission of the authors.
 

1) Description of the program

FootPrinter is a program that performs phylogenetic footprinting. It takes as input a set of unaligned homologous sequences from various species, together with a phylogenetic tree relating these species. It then searches for short regions of the sequences that are highly conserved, according to a parsimony criterion.  The regions identified are good candidates for regulatory elements. By default, the program searches for regions that are well conserved across all of the input sequences, but this can be relaxed to allow finding regions conserved in only a subset of the species. The use of the program is described in [0]. The algorithms used by FootPrinter are described in [1], and examples of results it produced are discussed in [2].



2) Installation

The simplest way to use FootPrinter is to access it via our web server. You may also want to download FootPrinter to run it on your own machine.
For now, FootPrinter is only available for Unix/Linux. After downloading, follow these instructions:

gunzip FootPrinter.tar.gz
tar -xvf FootPrinter.tar
cd FootPrinter2.0
make FootPrinter



3) Input

When you run FootPrinter on your own machine, the following command line should be used:

FootPrinter <input_sequences> <phylogenetic_tree> [options...]

where:

1. Input sequences
FootPrinter takes as input a set of orthologous or paralogous sequences in which you want to find motifs. The input sequences must all be listed in the same file, in FASTA format. See example.fasta for an example. If a sequence contains characters other than a, c, g or t, any substring containing these characters will be ignored. The name of each sequence (one word, after ">") must correspond to the name of some species in the phylogenetic tree (see (2)). Each sequence must have a name that is different from that of the other sequences. Notice that this does not prevent you from having several sequences from the same species; all you need to do is to name them differently (e.g. mouse1, mouse2...). However, in this case, you will have to provide your own phylogenetic tree, with mouse1, mouse2... as leaves.
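The naming rules above can be checked before running the program. The following is a minimal sketch, not part of FootPrinter: the function names are hypothetical, and treating upper-case letters like lower-case ones is an assumption, since the manual only mentions a, c, g and t.

```python
def check_fasta_names(fasta_text):
    """Collect sequence names (first word after '>') and report naming problems."""
    names, problems = [], []
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence line, not a header
        words = line[1:].split()
        if not words:
            problems.append("header without a name")
            continue
        name = words[0]
        if name in names:
            problems.append("duplicate name: " + name)
        names.append(name)
    return names, problems


def count_non_acgt(sequence):
    """Count characters other than a, c, g, t (case-insensitive: an assumption)."""
    return sum(1 for ch in sequence.lower() if ch not in "acgt")
```

For example, a file with two sequences both named mouse1 would be reported as a duplicate, and renaming the second one to mouse2 resolves it.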

2. Phylogenetic tree
FootPrinter uses a phylogenetic tree to evaluate the conservation of each motif. Motifs that have a low parsimony score on the given phylogenetic tree are reported.
The distribution of FootPrinter comes with a default phylogenetic tree (the file called tree_of_life), which contains several eukaryotic species. See the file nice_tree_of_life for a more readable format. If the sequences you are using come from species listed in this tree and you have only one sequence per species, you do not need to provide your own tree. Just make sure that the names of your sequences correspond to the names of the species in the tree. If you need to provide your own tree, it must be given in postfix notation (the standard bracket format, also used by the PHYLIP package, among others). Only leaves are labeled, with the name of the species. A species name followed by * will match any name starting with the given name. In the downloadable version of the program, branch lengths can be specified for some branches (branch lengths are used only when the -losses option is used). The phylogeny can contain more species than just those specified in the sequence file. In that case, unused species will be removed from the tree and branch lengths will be adjusted accordingly. All text that follows // on the same line is treated as a comment and ignored. See the file tree_of_life for an example.
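To see which sequence names have no matching leaf in a tree, something like the following sketch may help. It is not FootPrinter's own parser: the function names are hypothetical, and it assumes that branch lengths are written in the usual PHYLIP style (":" followed by a number) and that leaf labels contain no spaces.

```python
import re


def leaf_labels(tree_text):
    """Extract leaf labels from a bracket-format tree (only leaves are labeled)."""
    # Drop everything that follows // on each line (comments).
    text = "\n".join(line.split("//", 1)[0] for line in tree_text.splitlines())
    # Remove branch lengths such as ":0.12" (assumed PHYLIP-style syntax).
    text = re.sub(r":\s*[0-9.eE+-]+", "", text)
    # Whatever remains between brackets and commas is a leaf label.
    return re.findall(r"[^\s(),;:]+", text)


def label_matches(label, sequence_name):
    """A label ending in * matches any name starting with the given name."""
    if label.endswith("*"):
        return sequence_name.startswith(label[:-1])
    return label == sequence_name


def unmatched_sequences(tree_text, sequence_names):
    """Return the sequence names that match no leaf of the tree."""
    labels = leaf_labels(tree_text)
    return [name for name in sequence_names
            if not any(label_matches(label, name) for label in labels)]
```

With the toy tree ((human,chimp),(mouse*,rat:0.3):0.1); the names mouse1 and mouse2 both match the leaf mouse*, whereas a sequence named cow would be reported as unmatched.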

3. Sequence type
Choose among "upstream", "downstream", or "other". The sequence type only matters when options (10) and (11) are used. For upstream sequences, the 3' ends of the sequences are assumed to be aligned, whereas for downstream or other sequences, the 5' ends are assumed to be aligned.
Command line option: -sequence_type <type>

4. Motif size:
Specifies the size of the motifs sought. FootPrinter will report all regions of the given size that have at most the parsimony score specified in (5).
Command line option: -size <integer>
Default value: 10.
Valid range: 4 to 16.
Typical values: 6 to 12.

5. Maximum Parsimony score:
The maximum parsimony score allowed for the motifs. If the maximum parsimony score allowed is small, FootPrinter will run quickly and report only the most conserved motifs. We suggest starting with a small parsimony score and increasing it until the desired number of motifs is reported. This option is used only if the -losses option is not. If the -losses option is used, the <losses.config> file contains this information. If the -loss_cost option or the -position_change_cost option is used, the maximum parsimony threshold actually applies to the sum of the actual parsimony score, the loss cost, and the subregion change cost.
Command line option: -max_mutations <float>
Default value: 2.
Valid range: 0 to 20.
Typical values: 0 to 6, depending on the motif size and on the diversity of the input sequences.
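To give a feeling for what a parsimony score measures, here is a small sketch that scores one fixed choice of motif instances on a binary tree, using Fitch's small-parsimony algorithm column by column. It is an illustration only, not FootPrinter's algorithm (which also searches over all possible instances, see [1]). The nested-tuple tree encoding and the function names are assumptions made for this example, and only substitutions are counted.

```python
def fitch_column(node, column_of):
    """Fitch's algorithm for one motif position: return (state_set, mutations)."""
    if isinstance(node, str):          # leaf: a species name
        return {column_of[node]}, 0
    left, right = node                 # internal node: a pair of subtrees
    left_set, left_cost = fitch_column(left, column_of)
    right_set, right_cost = fitch_column(right, column_of)
    common = left_set & right_set
    if common:                         # children agree: no extra mutation
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, instances):
    """Sum the minimum number of substitutions over all motif positions."""
    size = len(next(iter(instances.values())))
    total = 0
    for i in range(size):
        column = {species: kmer[i] for species, kmer in instances.items()}
        total += fitch_column(tree, column)[1]
    return total


tree = (("human", "chimp"), ("mouse", "rat"))
instances = {"human": "acgt", "chimp": "acgt", "mouse": "acga", "rat": "acca"}
print(parsimony_score(tree, instances))  # prints 2
```

In this toy case the instances need two substitutions on the tree, so they would be within a maximum parsimony score of 2 but not of 1.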

6. Maximum number of mutations per branch:
Allows at most a fixed number of mutations per branch of the tree. Setting this number to a small constant like one or two has a number of advantages: it considerably reduces the running time, and reduces the incidence of some types of spurious motifs.
Command line option: -max_mutations_per_branch <float>
Default: Not used.
Valid range: 0 to 20.
Typical values: 1 to 2.

7. Allow regulatory element losses:
FootPrinter can find regulatory elements even if some species do not contain those regulatory elements. When this option is used, FootPrinter first approximates the length of each branch of your tree, and then compares the parsimony score of a motif to the length of the tree spanned by the species containing that motif.

For the downloadable version of FootPrinter, this option is subsumed by option (8).

8. Spanned tree significance level:
When regulatory element losses are allowed, the parsimony score of a motif is compared to the length of the tree spanned by the species containing the motif.

For the web version of FootPrinter, we have precomputed these lengths so that the motifs reported are statistically significant. You have the choice of three significance levels: "Very significant" will only report motifs that are the most significant, while "Significant" and "Somewhat significant" will report more motifs.
Default: "Significant".

If you are running FootPrinter on your own machine, you get to specify the constraints on the parsimony score and span of the motifs to be reported. These constraints are specified in a file like example.config. This file contains a list of pairs of the form (0,span_0) (1,span_1) (2, span_2)... (n, span_n), which specifies that you want to have reported all motifs with parsimony score 0 that span a tree of length at least span_0, all those with parsimony score 1 spanning a tree of length at least span_1, etc. For this option to be used, you need to either (i) provide branch lengths with the phylogenetic tree that you give as input, or (ii) use the -compute_branch_lengths option, which will estimate these branch lengths from your sequences. A set of *.config files is included with the package. These files give span constraints so that the motifs reported are statistically significant. For a motif of size X, three files are provided: universalXloose.config, universalX.config and universalXtight.config, which will report motifs that are somewhat significant, significant or very significant, approximately corresponding to p-values of 0.2, 0.1 and 0.05 respectively. These files should be used only when the -compute_branch_lengths option is used.
These files are provided only for your convenience and may under- or over-estimate the statistical significance of motifs in your data set.
Command line option: -losses <example.config>
Default: Not used.
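The meaning of the (score, span) pairs can be sketched as follows. This is a hypothetical reader for the pair list described above, not the program's own parser; the function names and the threshold values in the usage note are made up for illustration, and it assumes that a score with no pair in the file is never reported.

```python
import re


def parse_span_config(text):
    """Read pairs of the form (score,span) into a {score: minimum_span} table."""
    pairs = re.findall(r"\(\s*(\d+)\s*,\s*([0-9.]+)\s*\)", text)
    return {int(score): float(span) for score, span in pairs}


def is_reported(score, span, thresholds):
    """A motif passes if its span reaches the minimum span listed for its score."""
    return score in thresholds and span >= thresholds[score]
```

For instance, with the made-up constraints (0,3.5) (1, 6) (2, 9.25), a motif with parsimony score 1 spanning a tree of length 6 would be reported, while a motif with score 2 spanning a tree of length 8 would not.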

9. Motif loss cost:
When regulatory element losses are allowed, a cost can be given to losing a particular motif along some branch of the tree.
Command line option: -loss_cost <float>
Valid range: 0 to 20.
Typical values: 0 to 2.
Default: Not used.

10. Subregion size:
In some cases, you may want to penalize motifs whose position in the sequence varies too much. This is done using this option, together with option (11).
The subregion size option divides the input sequences into subregions of the given size, and penalizes motifs whose position (subregion) varies too much from species to species. The cost of changing subregion is given by option (11).
Command line option: -subregion_size <integer>
Valid range: 1 to +infinity.
Typical values: 20 to 200 (in general, I set it to about one tenth of the sequence length).
Default: Not used.
 

11. Subregion change cost:
Cost for changing subregion. See option (10).
Command line option: -position_change_cost <float>
Default: Not used.
Valid range: 0 to 20.
Typical values: 0 to 2.
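The interplay between subregions and the sequence type (3) can be pictured with the sketch below. It is only an illustration under stated assumptions: positions are taken to be 0-based offsets from the 5' end, subregions are numbered from the aligned end, and the function name is hypothetical. How FootPrinter actually numbers subregions and charges the change cost is not specified here.

```python
def subregion_index(position, sequence_length, subregion_size, sequence_type="other"):
    """Index of the subregion containing a position, counted from the aligned end."""
    if sequence_type == "upstream":
        # 3' ends are aligned: measure the distance from the end of the sequence.
        offset = sequence_length - 1 - position
    else:
        # "downstream" or "other": 5' ends are aligned.
        offset = position
    return offset // subregion_size
```

Consider a motif lying 50 bases before the 3' end of two sequences of lengths 500 and 300 (positions 450 and 250), with a subregion size of 100. Treated as upstream sequences, both instances fall in subregion 0; treated as downstream sequences, they fall in subregions 4 and 2, so a position change cost would apply under this reading.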

Options specific to the downloadable version of FootPrinter.
In the downloadable version of FootPrinter, various filtering strategies (options (12), (13), (14)) can be used to greatly improve the running time and space. These options can only be used if regulatory element losses are not allowed.

12. Triplet filtering
Performs a pre-filtering step that removes from consideration any substring that doesn't have a sufficiently good pair of matching substrings in some pair of the other input sequences. This is often very effective at reducing the memory used by the program. However, in some cases (e.g. too large a maximum parsimony score), it may greatly increase the running time. Still, in general, it is worth using this option. N.B.: The solutions returned will be the same whether this option is used or not. This option cannot be used when motif losses are allowed.
Command line option: -triplet_filtering
Default: Not used.

13. Pair filtering
Same as triplet filtering, but looks only for one match per other sequence. This filtering takes less time, but is much less stringent. This option cannot be used when motif losses are allowed.
Command line option: -pair_filtering
Default: Not used.

14. Post-filtering
When used in conjunction with the triplet filtering option, this often significantly speeds up the program, while still guaranteeing optimal results.
Command line option: -post_filtering
Default: Not used.

15. Insertion and deletion cost (Not fully supported yet)
If this option is specified, insertions and deletions will be allowed in the motifs sought, at the given cost. N.B.: This slows down the program by a factor of 10 to 20!
Command line option: -indel_cost <float>
Default: Not used.
Valid range: 1 to 5.
Typical values: 1 to 5.

16. Inversion cost (Not fully supported yet)
This option allows for motifs to undergo inversions, at the given cost.
Command line option: -inversion_cost <x>
Default: Not used.
Valid range: 1 to 5.

17. No .*** output
Command line options:
-no_html: Doesn't produce the HTML files.
-no_interactive_html: Doesn't produce the (numerous) files needed to make the HTML output interactive.
-no_ps: Doesn't produce the PostScript files.
-no_txt: Doesn't produce the text files.

18. Computation details
Command line option: -details: Shows the details of the computation.



4) The output

Five types of output are produced.

1) Standard output
Standard output produces information on the input sequences and options used. If the -details option is used, this is where those details appear.

2) HTML output
The HTML output called *.main.html is the best way to view the results of FootPrinter. The file contains a graphical representation of the motifs found. Motifs are highlighted in colors. The number of mutations of a motif is shown through the size of the font: the larger the font, the fewer mutations the motif contains. When you move the mouse over the highlighted regions, the score, position, evolutionary span and significance score of the region appear in the lower left corner of the browser. A high significance score means that the motif is unexpectedly well conserved. Corresponding motifs in different sequences are highlighted with the same color. Notice that a motif can be present several times in the same sequence, in which case all its occurrences will be highlighted with the same color. If a large number of motifs are found, the color representation can be a bit messy. When you click on the colored regions of the sequences, the position of those regions is highlighted in the graphical representation. Moreover, if regulatory element losses are allowed, the tree spanned by the species containing the motif is highlighted. To come back to the original representation, click the motif again. Only one motif at a time can be highlighted. On the right of the screen, a list of the motifs found is available. When you click on a motif in the left window, the corresponding motif instances are displayed in the right window.

3) Motif list output

The list of all motifs found is available in the right frame of the browser. This list is perhaps more suitable for computational parsing than the main HTML output is. For each motif, we report its parsimony score and (when motif losses are allowed) its span and approximate statistical significance. Notice that in this list, overlapping motifs are not merged, so the list can be rather long. Sometimes, a motif can contain more than one instance in a given sequence. This simply means that either instance can be used to obtain the prescribed parsimony score.

4) PostScript output
Another way to look at the output is through the PostScript file called *.seq.ps, a PostScript view of the HTML sequence file. This file can be printed (whereas there seem to be some problems with printing the HTML file). The file *.orders.ps shows an abstract view of the set of motifs found. Here, each motif is labeled with one letter. This output is useful to visualize the order in which motifs occur in the input sequences.

5) Text output
The text output file *.seq.txt contains almost the same information as the HTML file, but in a simpler representation. The first line below each sequence gives the number of mutations found in the best motif overlapping that position. The second line below the sequence gives the identity of the motif (the corresponding motif in other sequences will have the same identity number). The third line gives the parsimony score.



5) Examples of typical usage:

FootPrinter example.fasta tree_of_life -max_mutations 2 -size 12 -triplet_filtering -post_filtering

FootPrinter example.fasta tree_of_life -max_mutations 1 -size 10 -details -pair_filtering

FootPrinter example.fasta tree_of_life -max_mutations 3 -size 14 -max_mutations_per_branch 1 -details

FootPrinter example.fasta tree_of_life -size 10 -losses universal10tight.config -compute_branch_lengths -details -max_mutations 1 -no_html_interactive

FootPrinter example.fasta tree_of_life -size 10 -details -subregion_size 100 -position_change_cost 1 -max_mutations 2 -max_mutations_per_branch 1 -sequence_type downstream
Notice the difference between the previous result and
FootPrinter example.fasta tree_of_life -size 10 -details -subregion_size 100 -position_change_cost 1 -max_mutations 2 -max_mutations_per_branch 1 -sequence_type upstream
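
When FootPrinter is run from a script, the command lines above can be assembled programmatically. The sketch below is a hypothetical helper, not part of the distribution: it only uses options documented in section 3, checks the valid ranges given there for -size and -max_mutations, and assumes that the FootPrinter executable is on your PATH if the result is passed to subprocess.run.

```python
def footprinter_command(sequences, tree, size=10, max_mutations=2, extra_flags=()):
    """Build a FootPrinter argument list, checking the documented valid ranges."""
    if not 4 <= size <= 16:
        raise ValueError("motif size must be between 4 and 16")
    if not 0 <= max_mutations <= 20:
        raise ValueError("maximum parsimony score must be between 0 and 20")
    command = ["FootPrinter", sequences, tree,
               "-size", str(size), "-max_mutations", str(max_mutations)]
    command.extend(extra_flags)  # e.g. ["-triplet_filtering", "-post_filtering"]
    return command


# To actually run the program (not done here):
#   import subprocess
#   subprocess.run(footprinter_command("example.fasta", "tree_of_life"), check=True)
```

Building the list separately from running it makes it easy to print the exact command that was used alongside the results.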



6) Too many motifs, too few motifs
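What should you do if FootPrinter finds
A) Too many motifs: decrease the maximum parsimony score (-max_mutations), limit the number of mutations per branch (-max_mutations_per_branch), or, if motif losses are allowed, use a tighter *.config file (e.g. universalXtight.config).
B) Too few motifs: increase the maximum parsimony score (-max_mutations) until the desired number of motifs is reported, or allow motif losses (-losses) with a looser *.config file (e.g. universalXloose.config).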

7) Running time and memory requirements
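The running time and space of FootPrinter depend crucially on the values of certain input parameters. In general, the more mutations are allowed, the longer it will take to find all solutions. If FootPrinter takes too much time or space, try the following: decrease the maximum parsimony score (-max_mutations); limit the number of mutations per branch (-max_mutations_per_branch 1 or 2); if motif losses are not allowed, use -triplet_filtering together with -post_filtering, or use -pair_filtering; and avoid -indel_cost, which slows down the program by a factor of 10 to 20.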

8) References

The program FootPrinter is based on the following papers:

[0] "FootPrinter: a program designed for phylogenetic footprinting", by Mathieu Blanchette and Martin Tompa. Nucleic Acids Research, vol. 31, no. 13, 2003, 3840-3842.

[1] "Algorithms for phylogenetic footprinting", by Mathieu Blanchette, Benno Schwikowski, and Martin Tompa. Journal of Computational Biology 9(2):211-223. 2002.

[2] "Discovery of Regulatory elements by a computational method for phylogenetic footprinting", by Mathieu Blanchette and Martin Tompa. Genome Research 12(5):739-748. 2002.

[3] "An exact algorithm for finding motifs in orthologous sequences from multiple species", by Mathieu Blanchette, Benno Schwikowski, and Martin Tompa,
Eighth International Conference on Intelligent Systems for Molecular Biology, La Jolla, USA, 37-45. 2001.

[4] "Algorithms for phylogenetic footprinting", by Mathieu Blanchette, RECOMB 2001, Montreal, Canada. 2001.



FootPrinter is still under development, and any comments from users are most welcome.
 

Mathieu Blanchette
blanchem@cs.washington.edu
Department of Computer Science
University of Washington