Comparative Genomics-seq, user manual
Table of contents
- Introduction
- Getting started
- Command Line Interface
- Graphical User Interface
- Sample data
- Load data
- Load the target sequence
- Load other sequences
- Provide an output directory
- Sequence parameters
- Run analysis
- Preprocessing sequences
- Alignments
- Conserved regions
- RNA structures
- Viewing results
- References
1 Introduction
CG-seq is a software pipeline to identify noncoding RNAs in
a genomic sequence by comparative analysis. It takes as input a
genomic sequence (called the target sequence) and a set of other
sequences coming from a variety of species to be compared against
the target sequences.
The algorithm of CG-seq proceeds in four steps.
- Preprocessing. Sequences are preprocessed to mask CDSs, or to
remove redundancy between strains coming from the same species (optional).
- Alignment. The target sequence is compared to all other sequences
to detect similar sequences across species.
- Conserved regions. Pairwise alignments are
combined into clusters of significantly conserved regions.
- RNA structure. Conserved regions are investigated by
inspection of evolutionary patterns to select sequences exhibiting a
conserved consensus secondary structures.
2 Getting started
2.1 Command Line Interface
To run the command line interface, you have to invoque the CG-seq.pl
script located at the root of the CG-seq directory.
Usage
perl CG-seq.pl <target file> <species1..speciesN> -o
output_directory [options]
Required parameters
- target file: genomic sequence to process, in
FASTA format.
- species1..speciesN: files and/or directories
containing sequequences that will be compared to
the target sequence to find conserved regions..
- output_directory: directory where result files
will be saved.
Options
- --all: execute all program tools (Default)
- --mask-regions=CONF: mask CDSs and/or known ncRNAs in the target sequence.
This option is effective only if a GENBANK file is provided with the FASTA file.
- --mask-redundancy=CONF: clean up redundant
regions between several strains of a same species. Sequences are considered as soon as they are placed in the same directory.
- --yass=CONF: execute the pairwise alignment between the target
sequence and each other sequences using YASS.
- --blast=CONF: execute the pairwise alignment between
the target sequence and each other sequences using BLAST.
- --conserved=CONF: execute CGseqcore program to find
conserved regions from alignments.
- --carnac=CONF: execute Carnac program to predict secondary
structure of conserved regions found.
- --rnaz=CONF: execute RNAz program to predict secondary
structure of conserved regions found.
- --quiet: do not print any messages in STDOUT.
- -h,--help: print the script usage and help.
All tools are provided with default configuration files (located
in the directory "cfg"), that provide default option
values. If you wish to use alternative values, you have to
create your own configuration file and invocate it through the
command line (for example --yass=MY_CONF_FILE). The
syntax of configuration files is as follows. Each option is
first decribed, then its default value is given. Lines that
begin with the character "#" are considered as
comments.
Example: gap opening/extension costs (yass.conf)
#################################
# Gap opening/extension costs
#########################################################
-G -16,-4
Parameters and options are described in further details in Sections Load data, Run analysis and Viewing results.
2.2 Graphical User Interface
The graphical user interface is available under the jar
file CG-seq.jar.
To launch the graphical user interface you can double click the jar file, or type
java -jar CG-seq.jar in command prompt.
For Unix users, if you need to recompile the jar file, a Makefile is available
in the "gui" directory to do so.
The meaning of buttons is a follows.
Access this contextual help
Upload a file or a directory
Cancel
2.3 Sample data
Sample data can be found in directory "sample".
Example of command:
CG-seq.pl sample/Yersinia_pestis/NC_003143.fna
sample/Yersinia_frederiksenii/ sample/Yersinia_enterocolitica/
-o test_output
--mask-redundancy --mask-regions --yass --conserved --carnac
The target sequence is sample/Yersinia_pestis/NC_003143.fna. This sequence is compared to Yersinia frederiksenii and Yersinia enterocolitica. --mask-redundancy and --mask-regions are for step 1, --yass is for step 2, --conserved is for step 3, --carnac is for step 4. All tools are used with default configuration files.
3 Load data
3.1 Load the target sequence
Command line parameter: <target>
The target sequence is the sequence on which you want to identify
noncoding RNAs. You have to upload a FASTA file containing the
sequence. Additionally, if you want to mask functional annotated regions (CDSs, ncRNAs) in your FASTA file, a
GENBANK annotation file is required. This annotation file must be suffixed by
".gb" or ".gbk" and must be located in the same directory as the FASTA file.
Example for Unix
/home/user/target_sequence/target.fna: FASTA file.
/home/user/target_sequence/target.gbk: GENBANK annotation file.
Example for Windows
C:\Documents and Settings\user\target_sequence\target.fna: FASTA file.
C:\Documents and Settings\user\target_sequence\target.gbk: GENBANK annotation
file.
3.2 Load other sequences
Command line parameter:
<species1 .. speciesN>
Other sequences are sequences against which the target sequence will
be compared. Ideally, sequences are gathered into species. For each species,
you can specify a FASTA file, or a directory containing one or several FASTA
files (chromosome, plasmids, several strains), and one or several GENBANK
files. Every FASTA file that will be found under a directory will be
considered as part of the same species. FASTA files must be suffixed by
".fna", ".fa", ".fasta", ".seq" or ".dna" to be recognized as FASTA files.
For each FASTA file, the GENBANK file must have the same name, with a different suffix: ".gb" or ".gbk".
The number of species should range between 1 and 10 (or even more than 10).
If possible, provide at least three species.
On the command line interface, species must be separated by a blank space " ".
With the graphical user
interface, you can select several directories at one time by keeping pressing
the "CTRL" button while clicking on the wanted directories. Each time you
choose a directory, it will be added to the previously selected ones. You can also
provide them direclty in the text area on the form. Note that you have to
separate each directory with a comma ",".
3.3 Provide an output directory
Command line parameter: -o output_directory
All result files will be stored in the output directory. See Section Viewing results for further information on all genenerated files.
If the given directory does not exist, it is created. If it already exists, files will be overwritten.
3.4 Sequence parameters
Option in CGseqcore.conf: -p
This parameter setting allows you to give a weight to each species
that is to compared to the target sequence.
This weight corresponds rougly to the expected probability to observe an alignment
between the species and the target sequence at a given position.
Proposed values are
- mandatory: Only conserved regions showing at least one alignment
between the target sequence and this species will be reported (probability=1)
- prohibited: Only conserved regions showing no alignment between the target
sequence and this species will be reported (probability=0)
- default: Probability=0.8
Advanced parameter setting allows you to define exactly the expected
probability of each species.
Sequence parameters (cgseqcore.conf)
Probabilities (-p):
you must provide one value per species, between 0 and 1. Values are
separated by commas.
4 Run analysis
As said before, CG-seq analysis features four main steps.
- preprocessing sequences
- alignments
- conserved regions
- RNA structures
For each step, one or
several program tool is available. You can select as many tools as you
wish. All choices are compatible.
Each tool comes with default parameter values. You can modify these
values by clicking on the software configuration button. This
opens a new form that allows you to set parameters for tools that have
been previously selected. You have two possiblities:
- Change a selection of basic parameters using the form.
- Change parameters directly in the configuration file with
the Advanced button. It will open a new window containing the content
of the configuration file. This way, more advanced options that are not part
of the basic parameter form are accessible and can be modified.
Both ways, once your changes are done, you can save them with the Save
button. You can also cancel them with the Cancel button. The Reset to
default button will restore original values (but you have to save them if
you want to reuse these default values).
4.1 Preprocessing sequences
Mask known CDSs/ Mask known RNAs
Command line option: --mask_regions=CONF
When selected, known coding regions and/or known noncoding RNAs are masked
in all sequences (target sequences or other sequences) for which a
GENBANK file is provided. For each genome, the GENBANK file should be
placed in the same directory as the FASTA file and have the same name, except that it should be suffixed by ".gb" or ".gbk".
It will be automatically detected when uploading the files.
CDSs corresponds to features annotated as CDS in the GENBANK file, and noncoding RNAs correspond to features annotated as tRNA, rRNA and misc_RNA in the GENBANK file.
Annotated regions options (mask_regions.conf)
Masking mode (--mode):This option has three values:
0 for CDSs, 1 for ncRNAs, 2 for both.
Default value is 0.
Clean up redundancy
Command line option: --mask_redundancy=CONF
This option allows you
to eliminate redundant parts between several sequences (such as different strains) within a same species.
Redundancy is defined as two identical regions between two sequences that have
an identity percentage greater than the selected threshold (default 98%) and a length greater than the
minmal length
(default 100 bp).
For each species, only one copy of the redundant local sequence is kept. This option works only
if all genomes of a given species are located in the same directory.
Redundancy options (mask_redundancy.conf)
Minimum identity (--id): Minimum identity percentage to consider a redundant
region. Default value is 98%.
Minimum length (--length): Minimum length to consider a redundant
region. Default value is 50 bp.
4.2 Alignments
In the second step, the method performs pairwise alignment between the
target sequence and the other sequences. By default, CG-seq comes
with the YASS software. It is also designed to run with BLAST.
YASS
Command line option: --yass=CONF
YASS is a homology search tool based on local sequence similarity. It relies on
spaced seeds, that have proven to achieve high sensitivity.
YASS options (yass.conf)
Like most of the heuristic DNA local alignment software, YASS uses
seeds to detect potential similarity regions, and then tries to
extend them to actual alignments. It has the distinctive feature of
using multiple transition-constrained spaced seeds to ensure a good
sensitivity/selectivity trade-off. That enables to search more
fuzzy conserved sequences, as noncoding RNAs.
Spaced seed is a non-contiguous pattern, that allows for some
errors. Transition mutations are purine to purine [A<->G]
or pyrimidine to pyrimidine [C<->T].
Filter low complexity regions in the query sequence (-e): This
function masks off segments of the query sequence that have low
compositional complexity. Filtering can eliminate statistically
significant but biologically uninteresting reports from the
YASS output. Filtering is only applied to the query sequence,
not to database sequences. Default value is yes.
Seed Pattern(s) (-p): You can specify the seed pattern used in
the search step. The "#" symbol is for matches, that is positions
without error. The "@" symbol is for matches or transitions, and
the "_" is for any possibility; match, transition or transversion.
The comma "," can be used as a seed separator. The program speed
depends on the weight of each pattern (number of # + 1/2 number of
@): decreasing the weight increases sensitivity, but slows down the
search.
Seed hit criterion (-c): you can specify the number of seeds
that have to match the query sequence. Default value is double.
Match/transversion/transition/undetermined symbol scores (-C): This is
simply the scoring system used to assess the quality of a pairwise
alignment. Default values are 5 for a match,-4 for a transversion,
-3 for a transition and -4 for undetermined symbols.
Gap costs (-G): This is the affine penalty scores associated to
gaps in the score of a pairwise alignment. Default values are -16
for a gap opening and -4 for a gap extension.
E-value threshold (-E): This setting
specifies the statistical significance threshold for reporting
alignments against database sequences. For example, when this
parameter is set to 5, it means that 5 such matches are expected to
be found merely by chance, according to the stochastic model of
Karlin and Altschul (1990). Lower E-value thresholds are more
stringent, leading to fewer chance matches being reported.
BLAST
Command line option: --blast=CONF
BLAST searches for high scoring sequence alignments between the
query sequence and sequences in the database using a heuristic
approach that approximates the Smith-Waterman algorithm for
local alignment. The main idea of BLAST is that there are often
high-scoring segment pairs (HSP) contained in a statistically
significant alignment. BLAST first search for these HSPs, then extend
them to construct valid alignments.
Blast options (blast.conf)
Filter low complexity regions in the target sequence (-F):
This function masks off segments of the target sequence that have low
compositional complexity, as determined by the DUST program of
Tatusov and Lipman. Filtering can eliminate statistically significant
but biologically uninteresting reports from the BLAST output.
Filtering is only applied to the query sequence, not to database
sequences. Default value is yes.
When this option is selected, the complementary option
"Mask for lookup table" is automatically set to true. It means that
no HSPs are found based upon low-complexity sequence, but the BLAST
extensions are performed without masking and so they can be extended
through low-complexity sequence.
Word size (-W): This is the size of the
HSPs (or k-mers) used in BLAST heuristics to identify potential
high scoring alignments. One normally regulates the sensitivity
and speed of the search by increasing or decreasing the
word size. The lower is this value, the more sensitive is the
algorithm. The higher is this value, the faster is the
algorithm. This value should ranges betwen 7 and 30. Default value is 11.
Match/mismatch/gap opening/gap extension scores (-r, -q, -G, -E): This is
simply the scoring system used to assess the quality of a
pairwise alignment. In BLAST, there are internal dependencies
between these score parameters. So they are available in a pull
down menu that shows the allowed reward/penalty pairs and their
associated gap existence and gap extension penalties. Default
values are 2 for a match, -3 for a mismatch, 5 for a gap
opening, and 2 for a gap extension.
E-value threshold (-e): This setting specifies
the statistical significance threshold for reporting alignments
against database sequences. For example, when this parameter is
set to 5, it means that 5 such matches are expected to be found
merely by chance, according to the stochastic model of Karlin
and Altschul (1990). Lower E-value thresholds are more
stringent, leading to fewer chance matches being reported. Default value is 0.001.
4.3 Conserved regions
Command line option:
--conserved=CONF
Once all pairwise alignments are built, CG-seq searches for regions of the target sequence
whose conservation is supported by a significantly high number of pairwise alignments.
For that, each position of the target sequence is assigned a score depending
on the number of alignments and of the weight of each sequence as specified previously
in the sequence parameters form.
CG-seq extracts all local regions that exhibit a high score.
Those regions are called conserved regions.
Once conserved regions have been identified, it is possible to refine the selection
according to the length, the identity percentage, the number of sequences,...
CG-seq options (CGseqcore.conf)
Minimum identity threshold (-i): Minimum identity percentage (1-100%) that
the target sequence region and the other sequence region must have to be part
of a conserved region. Default value is 60%.
Maximum identity threshold (-I): Maximum identity in percent (1-100%) that
the target sequence region and the other sequence region must have to be part
of a conserved region. Default value is 100%.
Minimum length (-l): Minimum length of a conserved region. Default value is
30 base pairs.
Maximum length (-L): Maximum length of a conserved region. Default value is
1000 base pairs.
Selection mode (-M): Selecting mode of other sequences to be part of a
conserved region. Three modes can be chosen :
- Star: every other sequence must share all given
properties (identity, length etc.) with the region of the target sequence to be part of
the final conserved set.
- Naïve clique: every other sequence must share all given
properties with the target region. After that selection, each other region
is include in a graph clique composed of the target region and every other
region that have previously been included and must share all properties with
all sequences inside the clique. If the wanted number of sequences is
reached, the region is saved. If one sequence fails before the number of
sequences is reached, the region is canceled. Other sequences are added to
the graph clique by their respective identity with the target region.
- Maximal clique: every pair of sequences (target and other sequences) must
share all properties.
Note that maximal clique is time-consuming. It requires more execution time than other modes. Default
value is Star.
Maximum number of sequences by set (-m): the number of sequences wanted in
each conserved region, target sequence included. This value is useful because
RNA prediction tools usually have difficulties to manage a large number of sequences.
For example, RNAz cannot deal with more than 6 sequences in each conserved region.
Conserved regions containing more than the specified number of sequences
are modified as follows: sequences are sorted by decreasing identity value against the target sequence, and only the
best sequences are retained.
Default value is 6.
4.4 RNA structures
The last step of the analysis consists in investigating conserved
regions to search for these that show a potential consensus secondary
structures For this task, GC-seq invocates the CARNAC software. You
can also use the RNAz software, from the Vienna Package. Both tools
share the same philosophy: They predict structurally conserved and
thermodynamically stable RNA secondary structures that are supported
by compensatory mutations.
Carnac
Command line option:
--carnac=CONF
Carnac is a tool for the prediction of conserved secondary
structure elements of a family of homologous non-coding RNAs that can
can successfully handle low similarity sequences. It combines three
features: energy minimization, phylogenetic comparison and sequence
conservation. It relies on a Sankoff-like folding algorithm that
simultaneously infers a potential consensus structure and align the
sequences. As a consequence, the method does not require any prior
multiple sequence alignment, and appears to be robust to noisy
datasets.
Carnac options (carnac.conf)
Take GC percent into account (-A): When this option is
selected (default), CARNAC uses variable energy thresholds for selecting
thermodynamically stable helices selection according to the
avearge GC percent of the involved sequence.
RNAz
Command line option:
--rnaz=CONF
RNAz is a program for predicting structurally conserved and
thermodynamically stable RNA secondary structures in multiple sequence
alignments. Here multiple alignments are built with ClustalW.
RNAz options(rnaz.conf)
Probability cut off (-p): This is the probability of the sequence
of being a ncRNA, as computed by the SVM. Results are
displayed only for sequences having a classification probability
higher than this value. Sequences with high probablity cut off
(e.g.>0.9) show strong evidence for a structural RNA. Default
value is 0.7.
RNAzwindow options (rnazwindow.conf)
Window size (-w) and Step size (-s) options:
The RNAz algorithm works globally, i.e. the given alignment is
scored as a whole. For long alignments (e.g alignment of a whole
chromosome), this is neither computationally tractable nor
biologically meaningful. Therefore, long alignments are scanned in
overlapping windows. Alignments longer than 300 will be
analyzed in sliding overlapping windows of the given window
size. The step size value specifies the number of shifted
positions. Default values are 120 and 40, respectively.
5. Viewing results
Once the computation is completed,
result files are stored in the output directory
that has been specified originally.
Results are available both as a HTML page and as a CSV file.
HTML page:
You can access the results by clicking the button See results in browser in the graphical interface, or by opening directly the results.html file with your browser. For each predicted noncoding RNA, the following files are available:
- .fna: raw sequence
- .out: secondary structure in bracket-dot format
- .svg: 2D drawing of the predicted secondary structure (RNAplot)
By default, prediction is given for the same strand as the input traget sequence.
Files with suffix reverse correspond to the reverse complement sequence.
CSV file:
This file contains the same information as the HTML page. It can be opened and modified with any spreadsheet, such as Excel.
The output directory also contains several subdirectories for all generated files during the
execution:
- alignments: alignment files from Yass and/or
Blast software.
- conserved_regions: MultiFASTA files of conserved
regions found by CGseqcore software from pairwise
alignments.
- mask_regions_genomes: masked FASTA files if the
masking step of CDSs and/or RNAs has been selected.
- mask_redundancy_genomes: FASTA files where redundant
regions have been cleaned up (if the step has been selected).
- RNA_predicted: contains all result files of every conserved
regions that have been predicted as RNA. These files are multifasta files,
multiple alignment files (ClustalW), result files from Carnac and/or RNAz and image
file, generated with RNAplot.
- structure_inference: all generated files from Carnac
and/or RNAz.
- warnings.txt: text file containing all warnings issued by CG-seq
during the execution.
6. References
Washietl S., Hofacker I.L., Stadler P.F.
Fast and reliable prediction of noncoding RNAs
Proc. Natl. Acad. Sci. U.S.A. 102, 2454-2459, 2005
Noe L., Kucherov G.
YASS: enhancing the sensitivity of DNA similarity search
Nucleic Acids Research, 33(2):W540-W543, 2005
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman,
D.J. (1990) "Basic local alignment search tool."
J. Mol. Biol. 215:403-410.
NCBI BLAST: a better web interface Mark Johnson, Irena
Zaretskaya, Yan Raytselis, Yuri Merezhuk, Scott McGinnis, and
Thomas L. Madden Nucleic Acids Res. 2008 July 1; 36(Web Server
issue): W5-W9.
CARNAC: folding families of non coding RNAs
Touzet H. and Perriquet O.
Nucleic Acids Research 142, 2004