YASS : genomic similarity search tool

Options

The most commonly used command line parameters are listed below :

-r 0 to consider only the forward reading of the first sequence,
-r 1 to consider only the reverse complement of the first sequence,
-r 2 to consider both forward and reverse complement of the first sequence (default).
-d 1 to display quick alignments together with some statistics (see the output description),
-d 2 to display BLAST tabular format,
-d 3 to display a very short and easily parsable output.
Other output formats can be generated from -d 1 by using the yass2blast.pl conversion tool: it produces "blast with alignments" output, as well as axt or multi-fasta.
-o <filename> to select the output file, default is the standard output (stdout).
-S <number> to select the sequence from the multi-fasta query file (i.e. the first file : see the input description).
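
As an illustration, the parameters above can be assembled into a command line from a script. The sketch below only builds the argument list and does not run anything. It assumes that the executable is named yass and that options are given before the input files; the helper name build_yass_command and the file names are hypothetical.

```python
def build_yass_command(query, text=None, strand=2, display=1, output=None):
    """Build an argument list for a YASS run (hypothetical helper).

    Assumption: the executable is called `yass` and options precede
    the input file names.
    """
    if strand not in (0, 1, 2):
        raise ValueError("-r expects 0 (forward), 1 (reverse complement) or 2 (both)")
    argv = ["yass", "-r", str(strand), "-d", str(display)]
    if output is not None:
        argv += ["-o", output]   # without -o, results go to stdout
    argv.append(query)           # first file: the query
    if text is not None:
        argv.append(text)        # second file: the text (database)
    return argv


# Hypothetical file names, BLAST tabular output written to hits.tab.
print(" ".join(build_yass_command("query.fa", "genome.fa", display=2, output="hits.tab")))
```

The resulting list can be passed as-is to subprocess.run if the tool is installed.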

The scoring system can be selected by:

-M <number> to use proposed scoring systems (it ranges from 0 to 3 and selects high to low identity matrices).
-C 1,-3 to choose a +1/-3 scoring system. The -C (...) command-line parameter can be used with 2, 3, 4 or 16 numbers; if you give:
- 2 parameters : you fix the match reward and mismatch penalty,
- 3 parameters : match reward, transversion penalty, transition penalty,
- 4 parameters : match reward, transversion penalty, transition penalty, other penalty (non ACGTU letters),
- 16 parameters : gives a complete DNA matrix (ACGT order).
  example -C 4,-6,-2,-6,-6,5,-7,-2,-2,-7,5,-6,-6,-2,-6,4
Note that the 2, 3 and 4 parameter versions are subject to bias correction (the scoring system is adjusted for AT/GC rich sequences), whereas the 16 parameter version is never modified.
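
To make the different -C forms concrete, here is a minimal sketch that expands a 2, 3, 4 or 16 value list into a 4×4 matrix in ACGT order. It is only an illustration: the function names are hypothetical, the 16 values are assumed to be read row by row, and no bias correction is applied.

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def is_transition(a, b):
    """True for A<->G or C<->T substitutions."""
    return a != b and ((a in PURINES) == (b in PURINES))


def expand_scores(params):
    """Expand a -C style list into a 4x4 matrix (rows and columns in ACGT order).

    2 values : match, mismatch
    3 values : match, transversion, transition
    4 values : as with 3; the 4th value concerns non ACGTU letters,
               which a 4x4 matrix cannot show
    16 values: full matrix, assumed to be given row by row
    No bias correction is applied in this sketch.
    """
    params = list(params)
    if len(params) == 16:
        return [params[4 * i:4 * i + 4] for i in range(4)]
    if len(params) == 2:
        match, transversion = params
        transition = transversion
    elif len(params) in (3, 4):
        match, transversion, transition = params[:3]
    else:
        raise ValueError("expected 2, 3, 4 or 16 values")
    return [[match if a == b else transition if is_transition(a, b) else transversion
             for b in BASES] for a in BASES]


for row in expand_scores([1, -3]):
    print(row)
```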
-G -10,-4 to choose -10,-4 penalty for gap opening/extension.
-X <number> to set the Xdrop threshold score (larger values increase the running time but give better results).
-L 1.1,0.33 to force (Gumbel law) Lambda and K parameters.

You can increase/limit the number of alignments in the result :

-O <number> to limit the number of alignments in the output (default : 10000).
-E 1e-3 to set the Expectation value threshold : an alignment whose score is too low to reach this E-value is discarded.
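
The relation between score, Lambda, K and E-value can be sketched as follows, assuming YASS follows the usual Karlin-Altschul statistics with a search space equal to the product of the two sequence lengths. This is an assumption, not a description of the actual implementation, although it gives about 3.08e-21 for the first alignment of the -d 1 sample shown in the Output section. The function names are hypothetical.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalise a raw score with the Gumbel parameters Lambda and K."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, len1, len2):
    """Expected number of chance alignments reaching this bit score."""
    return len1 * len2 * 2.0 ** (-bits)


def passes_threshold(bits, len1, len2, max_evalue=1e-3):
    """Mimic the -E filter; 1e-3 mirrors the example above, not a known default."""
    return e_value(bits, len1, len2) <= max_evalue


# Bit score and sequence lengths taken from the sample output ("MC58" / "IX").
print(e_value(107.998, 2272351, 439885))
```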

Advanced Options

-s to specify how to sort the results
The default is -s 70 (query sorted, query/text blocks sorted by score) : YASS takes each fasta chunk of the multi-fasta query file in turn; then, each time this query chunk has a good hit against a text chunk, all the hits of this query chunk against this text chunk are listed as one block, inside which hits are sorted by score. Note that this behavior is similar to BLAST.
More details:

In what follows, x is a digit from 0 to 8 that selects the sort criterion : "sort by score" (0), "sort by entropy" (1), "sort by mutual information" (2), "sort by both entropy and score" (3), "sort by end position on the 1st file" (4), "sort by end position on the 2nd file" (5), "sort by alignment % id" (6), "sort by 1st file sequence % id" (7), "sort by 2nd file sequence % id" (8).

For any -s value < 40, results are simply sorted accordingly :
- -s x (x ≤ 8) : by the criterion x alone,
- -s 1x : by the query rank first (blocks are then defined), then inside each block by the criterion x : text rank order can thus be heavily mixed, but query rank is kept,
- -s 2x : by the text rank first (blocks are then defined), then inside each block by the criterion x : query rank order can thus be heavily mixed, but text rank is kept,
- -s 3x : by the query rank first, then by the text rank for each query rank (blocks are then defined), and finally inside each block by the criterion x : it is similar to a concatenation of the results where each query fasta chunk, then each text fasta chunk, are processed in a double-loop with at most n × m blocks defined.
Otherwise, any -s value ≥ 40 calls a block post-sorting method; the criterion x is used both for ordering :
1. hits inside any block,
2. blocks (or sets of blocks) among themselves, according to their respective first hit.
Post-sorting can be applied to blocks, or to sets of blocks, defined by the query only, the text only, or both :
- -s 4x : post-sorting blocks using x on the query, when blocks have been defined per query chunk : this is equivalent to 1x where the query chunk rank order is replaced by the best hit ordering of each query chunk. Note that the text chunk rank order is kept for each query chunk result.
- -s 5x : post-sorting blocks using x on the text, when blocks have been defined per text chunk : this is equivalent to 2x where the text chunk rank order is replaced by the best hit ordering of each text chunk. Note that the query chunk rank order is kept for each text chunk result.
or, more subtly (the sorting is asymmetric here, in the sense that there is no text/query equivalent for a given query/text sorting) :
- -s 6x (equivalent to 4x) : post-sorting sets of blocks using x on the query, when blocks have been defined per query & text chunk : this keeps the text rank order per query result, but the queries are ordered by the x criterion on the best hit. This behaves like 4x {it would be different if two different criteria x and y were used for sorting inside blocks and for post-processing ... but this is not yet implemented :-)}
- -s 7x : post-sorting sets of blocks using x on the text (per query), when blocks have been defined by the query/text rank : this keeps the query rank order, and is similar to BLAST,
- -s 8x : post-sorting sets of blocks using x on the query/text, when blocks have been defined by the query/text rank : this lists blocks of hits according to their best hits, whatever the query rank or text rank (useful for full search) : this is equivalent to 3x where the query/text rank orders are replaced by the best hit block ordering.
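
The two-digit encoding of -s can be summarised by a small decoder. This is only a reading aid built from the list above: the function name and the wording of the labels are hypothetical, and the code does not reproduce the sorting itself.

```python
CRITERIA = {
    0: "score",
    1: "entropy",
    2: "mutual information",
    3: "both entropy and score",
    4: "end position on the 1st file",
    5: "end position on the 2nd file",
    6: "alignment % id",
    7: "1st file sequence % id",
    8: "2nd file sequence % id",
}

BLOCK_MODES = {
    0: "no blocks",
    1: "blocks per query chunk, in query rank order",
    2: "blocks per text chunk, in text rank order",
    3: "blocks per query/text pair, in query then text rank order",
    4: "blocks per query chunk, post-sorted by best hit",
    5: "blocks per text chunk, post-sorted by best hit",
    6: "blocks per query/text pair, queries post-sorted by best hit",
    7: "blocks per query/text pair, query rank kept, texts post-sorted",
    8: "blocks per query/text pair, post-sorted by best hit",
}


def decode_sort(value):
    """Split an -s value into (block mode, criterion x, post-sorting flag)."""
    tens, x = divmod(value, 10)
    if tens not in BLOCK_MODES or x not in CRITERIA:
        raise ValueError("unsupported -s value: %r" % value)
    # 4x to 8x are the block post-sorting modes in the list above.
    return BLOCK_MODES[tens], CRITERIA[x], tens >= 4


print(decode_sort(70))  # the default value
```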

-p to specify a seed Pattern (for example -p ##@_#@#__#_###)

# match
@ match or transition
_ match or any mismatch (joker)

Note that the program speed depends on the weight of each pattern (number of # + ½ number of @) : decreasing the weight increases sensitivity, but slows down the search.
Examples of seed patterns proposed :

PatternHunter  :  ###--##-#--###
Mandala        :  ###---#-#--##-##
                  ###--#-##-### 
                  ####-##-### 
YASS           :  #@##-#--##--#@# 
                  ###@-#-#--#@-##
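
The weight formula given above (number of # + ½ number of @) can be checked on these patterns with a few lines of code. The function name is hypothetical, and the sketch treats both _ and - as jokers, since the examples above are written with -.

```python
def seed_weight(pattern):
    """Weight of a seed: number of '#' plus half the number of '@'."""
    unknown = set(pattern) - set("#@_-")
    if unknown:
        raise ValueError("unexpected seed symbols: %s" % "".join(sorted(unknown)))
    return pattern.count("#") + 0.5 * pattern.count("@")


for name, seed in [("PatternHunter", "###--##-#--###"),
                   ("YASS", "#@##-#--##--#@#"),
                   ("YASS", "###@-#-#--#@-##")]:
    print(name, seed, seed_weight(seed))
```

With this definition, all the patterns listed above have weight 9.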

Some tools (namely IEDERA and HEDERA) are provided to design transition constrained seeds. Seeds used on the web interface are given in this text file

Example :


TGGGCGCATGGCCTAGCCTGGATAGGGCAACAGCCTCCTAAGCTGTAGATCAGGGGTCCAAATCCCCTTGCGCCCGCTCATAACCT
||.|:.|:|:||::|| ||||||||:|||.||||||:|||||:|||.|:||||||||:|:||||||:|:|.:|.|:| ||||||||
TGTGTCCGTAGCTCAG-CTGGATAGAGCATCAGCCTTCTAAGTTGTTGGTCAGGGGTTCGAATCCCTTCGGACACAC-CATAACCT
                                 ###-#@-##@##         ###-#@-##@##
                     ###--#-#--#-###               ###--#-#--#-###
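
The meaning of the seed symbols can be illustrated by testing a pattern against two ungapped sequences of equal length, position by position. This is a naive sketch for illustration only (YASS itself relies on indexing, not on this kind of scan); the function names are hypothetical and - is accepted as a joker in addition to _.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def seed_hits_at(pattern, seq1, seq2, offset):
    """True if the seed matches the ungapped pair seq1/seq2 at this offset."""
    if set(pattern) - set("#@_-"):
        raise ValueError("unexpected seed symbol in %r" % pattern)
    if offset < 0 or offset + len(pattern) > min(len(seq1), len(seq2)):
        return False
    for i, symbol in enumerate(pattern):
        a, b = seq1[offset + i], seq2[offset + i]
        if symbol in "_-":          # joker: anything is accepted
            continue
        if a == b:                  # a match satisfies both '#' and '@'
            continue
        if symbol == "@" and (a, b) in TRANSITIONS:
            continue                # '@' also accepts a transition
        return False
    return True


def seed_hit_positions(pattern, seq1, seq2):
    """All offsets where the seed hits."""
    last = min(len(seq1), len(seq2)) - len(pattern)
    return [o for o in range(last + 1) if seed_hits_at(pattern, seq1, seq2, o)]


# Position 4 carries an A/G transition: '@' tolerates it, '#' does not.
print(seed_hit_positions("#@#", "ACGTACGT", "ACGTGCGT"))
print(seed_hit_positions("###", "ACGTACGT", "ACGTGCGT"))
```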

-c <int> Select single (1) hit criterion, or double (2) hit criterion : one single hit of a seed is needed in the first case to detect an alignment, whereas two are necessary in the second case. Note that the single hit criterion is slower, but more sensitive.
-T <int> Forbid aligning regions that are too close to each other on the same sequence (e.g. tandem repeats) [valid only when comparing a single-sequence file against itself]
Other parameters are explained in the README file.

Input

Provide either 1 or 2 nucleic sequences. Note that if only one DNA sequence is given, it is compared to itself. YASS input is in either (Multi)Fasta or Plain text format :

Fasta file example :

>gi|26245917|ref|NC_004431.1| Escherichia coli CFT073, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACA...

A multi-fasta file given as the second filename parameter is treated as a database (and is thus taken as a whole). The first file parameter can also be a multi-fasta file, in which case all of its sequences are considered (YASS v1.14) : use -S to select the one to process.
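
For reference, a (multi-)fasta or plain text file can be split into its sequences with a short reader such as the one below. This is a generic sketch, not the parser used by YASS; the function name is hypothetical, and a plain text input is returned as a single sequence with an empty header.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            if header is None:
                header = ""          # plain text input: no header line
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = [">seq1 first chunk", "ACGTAC", "GTTA", ">seq2 second chunk", "GGCC"]
for rank, (name, sequence) in enumerate(read_fasta(example), start=1):
    print(rank, name, len(sequence))
```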

Output

The option -d 0 shows the positions, length and E-value of each alignment.

The option -d 1 gives complete alignments in the following format :

*(969588-969895)(124765-125072) Ev: 3.08503e-21 s: 308/308 f
* "MC58" (2272351 bp) / "IX" (439885 bp)
* score = 346 : bitscore = 108.00
* mutations per triplet 29, 71, 34 (1.04e-07) | ts : 64 tv : 70 | entropy : 4.90003

  |969590   |969600   |969610   |969620   |969630   |969640   |969650   |969660 
ACTTATGTTCCTCTGCGCCATATGGGCGAAGGCATGGGCGAGTTCCTGGTTATCGACTCCATTTTGAACGAAGAAGCCGT
|:|||:::.||.||:.||..|.||::::|.|:.|.||..||.|||.:.:||::.:|:|||.:.||:..:||:.|.|.:||
ATTTACACACCGCTAAGCACTCTGAATAATGAAAAGGCAGACTTCACCATTGCAAATTCCTCGTTATCTGAGTACGGTGT
     |124770   |124780   |124790   |124800   |124810   |124820   |124830        

  |969670   |969680   |969690   |969700   |969710   |969720   |969730   |969740 
GATGGCGTTCGAGTACGGCTTTGCCTGCTCCGCACCTGACAAACTGACCATTTGGGAAGCTCAATTCGGTGACTTCGCCA
:||||..|||||:||:||:|.|.|.:...||.|.||.||:.|.||:::|||.|||||:|||||||||||||||||:||.|
AATGGGTTTCGAATATGGTTATTCGCTAACCTCCCCAGATTATCTAGTCATGTGGGAGGCTCAATTCGGTGACTTTGCAA
     |124850   |124860   |124870   |124880   |124890   |124900   |124910        

  |969750   |969760   |969770   |969780   |969790   |969800   |969810   |969820 
ACGGCGCGCAAGTGACTATTGACCAATTCCTGTCTTCAGGCGAAACCAAGTGGGGTCGTTTGTGCGGTCTGACTACCATC
|::..||:||:||.|:||||||||||||:.|..|:...||:|||...||:|||::.|:.:..|.:|||:|:::|.:..::
ATACAGCACAGGTTATTATTGACCAATTTATTGCCGGTGGTGAACAAAAATGGAAGCAACGCTCTGGTTTAGTTTTGTCT
     |124930   |124940   |124950   |124960   |124970   |124980   |124990        

  |969830   |969840   |969850   |969860   |969870   |969880  
CTGCCGCACGGCTACGACGGTCAAGGCCCCGAGCACTCTTCTGCACGCGTAGAACGTTGGTTGCAACT
:|:||.||:||:||:||:||:||:||.||.||:||:||.||||...|..|:|||.|.|..||||||||
TTACCCCATGGTTATGATGGCCAGGGGCCAGAACATTCGTCTGGTAGATTGGAAAGATTCTTGCAACT
     |125010   |125020   |125030   |125040   |125050   |125060

The yass2blast.pl script converts yass -d 1 output into BLAST full alignment output (or into axt/fasta format) :

 Score = 108 bits (346), Expect = 3.08503e-21
 Identities = 174/308 (56%)
 Strand = Plus / Plus


Query: 969588    ACTTATGTTCCTCTGCGCCATATGGGCGAAGGCATGGGCGAGTTCCTGGTTATCGACTCC 969647
                 | |||    || ||  ||  | ||    | |  | ||  || |||    ||    | |||
Sbjct: 124765    ATTTACACACCGCTAAGCACTCTGAATAATGAAAAGGCAGACTTCACCATTGCAAATTCC 124824


Query: 969648    ATTTTGAACGAAGAAGCCGTGATGGCGTTCGAGTACGGCTTTGCCTGCTCCGCACCTGAC 969707
                    ||    ||  | |  || ||||  ||||| || || | | |     || | || || 
Sbjct: 124825    TCGTTATCTGAGTACGGTGTAATGGGTTTCGAATATGGTTATTCGCTAACCTCCCCAGAT 124884


Query: 969708    AAACTGACCATTTGGGAAGCTCAATTCGGTGACTTCGCCAACGGCGCGCAAGTGACTATT 969767
                  | ||   ||| ||||| ||||||||||||||||| || ||    || || || | ||||
Sbjct: 124885    TATCTAGTCATGTGGGAGGCTCAATTCGGTGACTTTGCAAATACAGCACAGGTTATTATT 124944


Query: 969768    GACCAATTCCTGTCTTCAGGCGAAACCAAGTGGGGTCGTTTGTGCGGTCTGACTACCATC 969827
                 ||||||||  |  |    || |||   || |||   |     |  ||| |   |      
Sbjct: 124945    GACCAATTTATTGCCGGTGGTGAACAAAAATGGAAGCAACGCTCTGGTTTAGTTTTGTCT 125004


Query: 969828    CTGCCGCACGGCTACGACGGTCAAGGCCCCGAGCACTCTTCTGCACGCGTAGAACGTTGG 969887
                  | || || || || || || || || || || || || ||||   |  | ||| | |  
Sbjct: 125005    TTACCCCATGGTTATGATGGCCAGGGGCCAGAACATTCGTCTGGTAGATTGGAAAGATTC 125064


Query: 969888    TTGCAACT 969895
                 ||||||||
Sbjct: 125065    TTGCAACT 125072

The option -d 2 produces BLAST tabular output, so that existing BLAST output parsers can be applied :

MC58        IX   56.49   308     134     0       969588  969895  124765  125072  3.1e-21 108
MC58        IX   60.84   263     93      10      751895  752157  213618  213880  2.1e-19 102
MC58        IX   59.06   276     100     13      752399  752665  214119  214394  1.9e-15 88.8
MC58        IX   65.52   145     50      0       752066  752210  213789  213933  7e-13   80.2
MC58        IX   58.71   201     83      0       1684840 1685040 423430  423230  4.7e-12 77.5
MC58        IX   58.29   199     77      6       968315  968513  123477  123675  9.4e-09 66.5
MC58        IX   63.72   113     41      0       968988  969100  124159  124271  2.8e-07 61.6
MC58        IX   71.76   85      22      2       773143  773225  370394  370478  2.3e-05 55.2
MC58        IX   71.76   85      22      2       1499839 1499921 370394  370478  2.3e-05 55.2
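
Such lines can be read with a few lines of code. The sketch below assumes the 12 standard BLAST tabular columns (query, subject, % identity, alignment length, mismatches, gap openings, query start/end, subject start/end, E-value, bit score) and sequence names without whitespace; the names BlastHit, parse_blast_tabular and strand are hypothetical.

```python
from collections import namedtuple

BlastHit = namedtuple(
    "BlastHit",
    "query subject pident length mismatch gapopen "
    "qstart qend sstart send evalue bitscore",
)


def parse_blast_tabular(lines):
    """Parse whitespace-separated BLAST tabular lines into BlastHit records."""
    hits = []
    for line in lines:
        fields = line.split()
        if line.startswith("#") or len(fields) != 12:
            continue  # skip comments and malformed lines
        hits.append(BlastHit(
            fields[0], fields[1], float(fields[2]),
            *[int(v) for v in fields[3:10]],
            float(fields[10]), float(fields[11]),
        ))
    return hits


def strand(hit):
    """'r' when the subject coordinates are decreasing, else 'f'."""
    return "r" if hit.sstart > hit.send else "f"


sample = [
    "MC58  IX  56.49  308  134  0  969588  969895  124765  125072  3.1e-21  108",
    "MC58  IX  58.71  201  83   0  1684840 1685040 423430  423230  4.7e-12  77.5",
]
for hit in parse_blast_tabular(sample):
    print(hit.qstart, hit.qend, strand(hit), hit.evalue)
```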

The option -d 3 produces a very light and easily parsable output :

969588  969895  124765  125072  308     308     f       107.998 3.08503e-21
751895  752157  213618  213880  263     263     f       101.899 2.11464e-19
752399  752665  214119  214394  267     276     f       88.7859 1.87322e-15
752066  752210  213789  213933  145     145     f       80.2473 6.96558e-13
1684840 1685040 423230  423430  201     201     r       77.5028 4.66817e-12
968315  968513  123477  123675  199     199     f       66.5246 9.41687e-09
968988  969100  124159  124271  113     113     f       61.6454 2.77133e-07
773143  773225  370394  370478  83      85      f       55.2415 2.34674e-05
1499839 1499921 370394  370478  83      85      f       55.2415 2.34674e-05
154025  154071  317763  317809  47      47      f       52.8019 0.000127308
968703  968848  123877  124022  146     146     f       52.497  0.000157273
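
A parser for this format fits in a few lines. The column meanings used below (start and end on the 1st file, start and end on the 2nd file, the two aligned lengths, strand, bit score, E-value) are inferred by comparing these lines with the -d 1 sample above and are not an official specification; the names LightHit and parse_light_line are hypothetical.

```python
from collections import namedtuple

# Column names are inferred from the samples, not taken from a specification.
LightHit = namedtuple(
    "LightHit", "start1 end1 start2 end2 size1 size2 strand bitscore evalue"
)


def parse_light_line(line):
    """Parse one line of -d 3 output into a LightHit record."""
    f = line.split()
    if len(f) != 9:
        raise ValueError("expected 9 columns, got %d" % len(f))
    return LightHit(int(f[0]), int(f[1]), int(f[2]), int(f[3]),
                    int(f[4]), int(f[5]), f[6], float(f[7]), float(f[8]))


sample_lines = [
    "969588  969895  124765  125072  308     308     f       107.998 3.08503e-21",
    "752399  752665  214119  214394  267     276     f       88.7859 1.87322e-15",
    "1684840 1685040 423230  423430  201     201     r       77.5028 4.66817e-12",
]
light_hits = [parse_light_line(line) for line in sample_lines]

# Keep only hits below an arbitrary E-value cutoff chosen for the example.
strong = [h for h in light_hits if h.evalue <= 1e-13]
print(len(strong), "of", len(light_hits), "hits kept")
```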

The option -d 4 produces a BED output and the -d 5 produces a PSL output

Advanced parameters

Window range (min,max) : used to group several gapped alignments into larger alignments with a higher score. Processing starts with a small sliding window (checking the alignments that enter and leave it, and keeping the best score of each group), and is then repeated with larger windows (geometric "x increase").
Window incr: indicates how the window increases (geometric "x increase").
Indels: maximal indel rate in alignments (used for distance statistics between seed hits, and also before and after extremal hits).
Mutation: maximal mutation rate in alignments (used for distance statistics between seed hits, and also before and after extremal hits).
Entropy filter on triplet: used to avoid aligning low complexity regions (this is a post-processing filter, based on the alignment and not on the sequence, for historical and experimental reasons).
