******************************************************* * * * G A R D E N I A * * * ******************************************************* 1. WHAT IS IT ? Gardenia is a program that compares RNA secondary structures, without pseudoknots. The resulting alignment takes into account both the primary structure and the secondary structure. 2. HOW TO INSTALL IT ? First you must extract the .zip archive file. It will create a main directory called 'gardenia/' and three subdirectories: headers/ contains the header files of the source code sample/ contains a list of sample RNA sequences src/ contains the source code and the make file Go to 'src' directory, and type 'make'. The compilation operation creates an executable file named 'gardenia'. The program is written in ANSI C, and you need a C compiler to build it. The name of the C compiler may be modified in the 'makefile' file that is in the 'src' directory. 3. HOW TO RUN IT ? 3.1 Synopsis ------------ gardenia inputFile [-S scoreFile] [-o outFile] [-M matrixFile] [-lefghcnms] [--fasta] [--eq] Input sequences should be given in one file. Gardenia recognizes bracket-dot format, which is a fasta-like format. The secondary structure accompanying the sequence is described in the third line: Each base pair is denoted by a pair of brackets, and each free base is denoted by a dot. > sequence 1 AUCGGCGGACGCC .(....)((..)) > sequence 2 CGGCCAAUCCGGCUU (((.....))).... By default, the multiple alignment is displayed in Clustal format, and the scoring scheme comes from the 'score.txt' file. See Section 4.3 to have more information on the syntax of this file. More information on the multiple alignment algorithm is available in Section 4.1. Gardenia also offers a series of pairwise alignment algorithms. By default, pairwise alignment uses affine gap penalties. Alternative algorithms are described in Section 4.2. 3.2 Options ----------- General options: to be used both for multiple or pairwise alignment. -S specify the score scheme in an external file. If no file name is given, the file 'score.txt' in used. See Section 4.3 to have more information on the syntax of this file. -c add constraints coming from the primary structure in the alignment --fasta output in fasta format -o specify the output file in which to save the alignment -M use RIBOSUM scoring matrices. -s similar structures. This option should be used when the input sequences are similar. It speeds up the alignment algorithm. --eq This option allows you to give external constraints that will guide the alignment process. For that, each sequence in the input file shoud contain a line with a one character identifier for each position of the sequence, or a blank character when there is no constraint. > sequence 1 AUCGGCGGACGCC .(....)((..)) # # RR Constraints correspond approximately to equivalence classes. A position with constraint # can be matched only with a position of constraint # or with a position with no constraint. All these options are mutually compatible. Additional options: for pairwise alignment only. -e select the edit distance method for comparing the structures (see algorithm #2 below) -f select the edit distance method with full edit operations for comparing the structures (see algorithm #3 below) -g select the tree alignment method for comparing the structures (see algorithm #4 below) -h select the tree alignment method with full edit operation for comparing the structures (see algorithm #5 below) -l local comparison -m multiple alignment (no affine gap penalty) -n compute only the value of the optimal score, not the mapping. These options are valid provided that your input file contains exactly two sequences. They are all compatible with general options. Options -e, -h, -g, -h and -m are mutually incompatible. Option -l is compatible with options -e, -f, -g, -h and -m, but not yet fully implemented. 4 FURTHER EXPLANATION ON THE COMPARISON METHODS AND THE SCORING SYSTEM 4.1 Multiple alignment ---------------------- The method relies on the dynamic programming algorithm for pairwise comparison, as described in Algorithm #1 below. RNA sequences are encoded as arc-annotated sequences, and a multiple alignment for a set of arc-annotated sequences is a nested common supersequence. The edit scheme incorporates evolutionary operations concerning free bases (base substitution, base deletion) and base pairs (arc-mismatch, arc-removing, arc-breaking, arc-altering), originally defined in [6]. We take a heuristic approach and use a progressive method. It starts with constructing all pairwise alignments to determine the degree of similarity for each pair of sequences. Then it combines sequences into a multiple alignment by an ascending hierarchical clustering. Pairwise alignment of supersequences rely on the same algorithm as paiwise alignments for arc-annotated sequences. This is made possible because supersequences can be viewed as a nested arc-annotated sequences on an extended alphabet. The score of one node is its SP (sum-of-pairs) score. 4.2 Pairwise alignment ---------------------- Gardenia provides seven different algorithms for comparing two RNA structures given as input. We briefly present the distinctive features of each of them. Algorithm #0 - Alignment of arc-annotated sequences with affine gap penalty. This algorithm searches for the optimal nested common superstructure of the two structures given as input, such as described in [1]. This is our favorite method. It can explicitly deal with all edit operations: base-match, base-mismatch, base-deletion, arc-match, arc-mismatch, arc-removing, arc-breaking and arc-altering, and is still very fast. Affine gap costs are also allowed (which is not described in [1]). Concerning edit operation on free bases, gap penalties apply in he same way as with sequence alignment. Concerning edit operation on base pairings, gap opening penalty applies when the general shape of the structure is modified by the edit operation with the creation of a new helix. See below. Gap opening: creation of a new helix (((...))) (((...))) --------- ......... Gap extension: modification of an existing helix (((...))) (((...))) (((...))) (((...))) -((...))- .((...)). ((-...-)) ((.....)) Algorithm #0.1 - Local alignment of arc-annotated sequences(option -l). This is a variation of algorithm #0 that performs local comparison instead of global comparison. The best local alignment is the best alignment between any two closed subsequences of the input sequences. Option -l is also compatible with Algorithms #1, #2, #3, #4 and #5. Algorithm #1 - Alignment of arc-annotated sequences with linear gap penalty (option -m) This is the same algorithm as algorithm #0, but without affine gap cost (so this is exactly algorithm described in [1]). This algorithm is also used for multiple alignment construction. Algorithm #2 - Usual tree edit distance (option -e) Here, RNA structures are represented as ordered trees: each base pair is an internal node, and each free base is a leaf. This model is described in further detail in [2]. It authorizes only base-match, base-mismatch, base-deletion, arc-match, arc-mismatch and arc-removing operations. The implementation of the edit distance algorithm is based on Zhang-Shasha's algorithm [3] using the edit graph structure introduced in [4]. Algorithm #3 - Tree edit distance with enriched tree encoding (option -f) The difference with algorithm #2 lies in the construction of the tree for the RNA sequence. Each base pair is encoded by three nodes: an inner node for the arc and two leaves for the nucleotides (the same encoding is used in [4]). This model allows one to simulate arc-breaking operations (it consists in deleting the inner node), and arc-altering operations (it consists in deleting it and one of its children). With edit edistance, the method may be slow, and gives permissive alignments. Algorithm #4 - Tree alignment (option -g) This algorithm is based on the tree alignment, such as described in [6]. Algorithm #5 - Tree alignment with enriched tree encoding (option -h) This last method combines the tree alignment algorithm (#4) with the tree encoding of algorithm #3. This solution is very similar to RNAforester [5]. [1] Alignment of RNA secondary structures. G. Blin, A. Denise, S. Dulucq, C. Herrbach, H. Touzet, IEEE/ACM Transactions on Computational Biology and Bioinformatics (2008) [2] Computing similarity between RNA structures. B. Ma, L. Wang,and K. Zhang, Theoretical Computer Sciences 276(1-2): 111-132 (2002) [3] Simple fast algorithms for the editing distance between trees and related problems. K. Zhang and D. Shasha,SIAM Journal of Computing, 18(6)} :1245-1262 (1989) [4] Comparing similar ordered trees in linear-time. H. Touzet, Journal of Discrete Algorithms, to appear, available on WWW [5] Local Similarity in RNA Secondary Structures. M. Höchsmann, T. Töller, R. Giegerich, S. Kurtz, IEEE Bioinformatics Conference 2003 (CSB 2003): 159-168 [6] Alignment of trees - an alternative to tree edit. T. Jiang, L. Wang, K.Zhang Theoretical Computer Science, 143-1: 137-148 (1995) 4.3 Scoring system ------------------- For all methods, you can specify your own score scheme in an external file with -S option. This file should contain one line per edit operation, in the following order. base-match base-mismatch base-deletion arc-match arc-mismatch(1) - one base renaming arc-mismatch(2) - two base renamings arc-removing arc-altering(1) - no base renaming arc-altering(2) - one base renaming arc-breaking(1) - no base renaming arc-breaking(2) - one base renaming arc-breaking(3) - two base renamings gap-opening Look at the 'score.txt' file in the 'src' directory to see an example. Only algorithms #0 and #0.1 use all edit operations with arbitratry score penalties. Multiple alignment and algorithm #1 do not require any value for gap opening, and algorithms #2,#3,#4 and #5 do not require values for arc-altering and arc-breaking operations. Indeed, algorithms #2 and #4 do not involve such operations. For algorithms #3 and #5, scores are mutually dependent. They fulfill the following relations: arc-removing = arc-breaking(1) + 2 base-deletion arc-altering = arc-breaking(1) + 1 base-deletion+ 1 base-match/mismatch arc-breaking(2)= arc-breaking(1) + 1 base-mismatch arc-breaking(3)= arc-breaking(1) + 2 base-mismatch So be careful when using it. For pairwise alignment, if you find it complicated, use algorithm #0 for global comparison or #0.1 for local comparison. ------------------------------------------------------------------------------- Thank you for using Gardenia. For questions or for bug reports, please contact Helene Touzet (helene.touzet@univ-lille1.fr)