Options
The most commonly used command-line parameters are:
- -r 0 to consider only the forward reading of the first sequence,
- -r 1 to consider only the reverse complement of the first sequence,
- -r 2 to consider both forward and reverse complement of the first sequence (default).
- -d 1 to display quick alignments together with some statistics (see the output description),
- -d 2 to display BLAST tabular format,
- -d 3 to display a very short, easily parsable output.
- Other output formats can be generated from -d 1 with the yass2blast.pl conversion tool, which produces "blast with alignments", axt, or multi-fasta output.
- -o <filename> to select the output file; the default is the standard output (stdout).
- -S <number> to select the sequence from the multi-fasta query file (i.e. the first file: see the input description).
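As a hedged illustration of how these options combine, the following Python sketch assembles an argument list for a YASS run. The executable name `yass`, the helper name `build_yass_command` and the choice to place options before the file names are assumptions made for this example; check them against your installation.

```python
def build_yass_command(query, text=None, reverse=2, display=1, output=None):
    """Build an argument list for a YASS run (hypothetical helper).

    query   : first file (the query)
    text    : optional second file; if omitted, the query is compared to itself
    reverse : value passed to -r (0, 1 or 2)
    display : value passed to -d
    output  : optional file name passed to -o
    """
    if reverse not in (0, 1, 2):
        raise ValueError("-r expects 0, 1 or 2")
    cmd = ["yass", "-r", str(reverse), "-d", str(display)]
    if output is not None:
        cmd += ["-o", output]
    cmd.append(query)
    if text is not None:
        cmd.append(text)
    return cmd


print(build_yass_command("query.fa", "genome.fa", display=2, output="hits.tsv"))
```

The resulting list can be handed to `subprocess.run` once the executable name and paths have been verified.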
The scoring system can be selected by:
- -M <number> to use proposed scoring systems (it ranges from 0 to 3 and selects high to low identity matrices).
- -C 1,-3 to choose a +1/-3 scoring system. The -C command-line parameter accepts 2, 3, 4 or 16 comma-separated numbers. If you give:
- 2 parameters : you fix the match reward and mismatch penalty,
- 3 parameters : match reward, transversion penalty, transition penalty,
- 4 parameters : match reward, transversion penalty, transition penalty, other penalty (non ACGTU letters),
- 16 parameters : gives a complete DNA matrix (ACGT order).
Example: -C 4,-6,-2,-6,-6,5,-7,-2,-2,-7,5,-6,-6,-2,-6,4
- -G -10,-4 to set the gap opening and gap extension penalties to -10 and -4.
- -X <number> to set the Xdrop threshold score (large values increase the running time but give better results).
- -L 1.1,0.33 to force the Lambda and K parameters of the Gumbel law.
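To make the different -C forms concrete, here is a minimal Python sketch that expands a -C argument into a 4×4 matrix in ACGT order. The function name `expand_scoring` is hypothetical, and the sketch only follows the parameter order described above (match, transversion, transition); it is not taken from the YASS source. The fourth value of the 4-parameter form applies to non-ACGTU letters and therefore does not appear in the matrix.

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def is_transition(a, b):
    """A transition swaps two purines (A<->G) or two pyrimidines (C<->T)."""
    return a != b and (a in PURINES) == (b in PURINES)


def expand_scoring(spec):
    """Expand a -C style string into a 4x4 score matrix (rows/columns in ACGT order)."""
    values = [int(v) for v in spec.split(",")]
    if len(values) == 16:
        # Complete matrix given row by row.
        return [values[4 * i:4 * i + 4] for i in range(4)]
    if len(values) == 2:
        match, transversion = values
        transition = transversion  # a single mismatch penalty for both kinds
    elif len(values) in (3, 4):
        match, transversion, transition = values[:3]
    else:
        raise ValueError("expected 2, 3, 4 or 16 numbers")
    return [
        [match if a == b else transition if is_transition(a, b) else transversion
         for b in BASES]
        for a in BASES
    ]


for row in expand_scoring("4,-6,-2"):
    print(row)
```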
You can increase/limit the number of alignments in the result :
- -O <number> to limit the number of alignments in the output (default: 10000).
- -E 1e-3 to set the E-value threshold: an alignment whose score is too low to reach this E-value is discarded.
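The link between score, E-value and the Lambda and K parameters can be sketched with the standard Karlin-Altschul relations, E = K·m·n·exp(-λ·S) and E = m·n·2^(-bitscore), where m and n are the two sequence lengths. That YASS applies exactly these formulas is an assumption of this example; it is at least consistent with the sample hit shown in the Output section (bit score 107.998, sequences of 2272351 bp and 439885 bp, E-value 3.08503e-21). The function names are hypothetical, and the λ and K values simply reuse the -L 1.1,0.33 example.

```python
import math


def evalue_from_bitscore(bitscore, m, n):
    """E-value of a hit with the given bit score between sequences of length m and n."""
    return m * n * 2.0 ** (-bitscore)


def evalue_from_raw_score(score, m, n, lam, k):
    """E-value of a raw score under a Gumbel law with parameters lam (lambda) and k (K)."""
    return k * m * n * math.exp(-lam * score)


def min_raw_score(evalue, m, n, lam, k):
    """Smallest raw score whose E-value does not exceed the given threshold."""
    return (math.log(k * m * n) - math.log(evalue)) / lam


m, n = 2272351, 439885
# Should be close to the 3.08503e-21 reported for the sample hit.
print(evalue_from_bitscore(107.998, m, n))
# Raw score needed to pass -E 1e-3 with the illustrative values lambda = 1.1, K = 0.33.
print(min_raw_score(1e-3, m, n, lam=1.1, k=0.33))
```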
Advanced Options
- -s to specify how to sort the results
The default is -s 70 (query sorted, query/text blocks, by score). YASS first takes each fasta chunk of the multi-fasta query file; then, each time this chunk has a good hit against a text chunk, it lists all the hits of this query chunk against this text chunk as one block, in which hits are sorted by score. Note that this behavior is similar to BLAST.
More details:
Let x be a value ≤ 8 with the following meaning: "sort by score" (0), "sort by entropy" (1), "sort by mutual information" (2), "sort by both entropy and score" (3), "sort by end position on the 1st file" (4), "sort by end position on the 2nd file" (5), "sort by alignment % id" (6), "sort by 1st file sequence % id" (7), "sort by 2nd file sequence % id" (8).
For any -s value < 50, results are simply sorted accordingly :
- -s x (x ≤ 8) : by the criterion x alone,
- -s 1x : by the query rank first (blocks are then defined), then inside each block by the criterion x : text rank order can thus be heavily mixed, but query rank is kept,
- -s 2x : by the text rank first (blocks are then defined), then inside each block by the criterion x : query rank order can thus be heavily mixed, but text rank is kept,
- -s 3x : by the query rank first, then by the text rank for each query rank, (blocks are then defined), and then finally inside each block by the criterion x : it is similar to a concatenation of the results where each query fasta chunk, then each text fasta chunk, are processed in a double-loop with at most n × m blocks defined.
Otherwise, any -s value ≥ 50 calls a block post-sorting method; the criterion x is then used to order both:
- the hits inside each block,
- the blocks (or sets of blocks) themselves, which are post-sorted according to their respective first hit.
Post-sorting can be applied to blocks, or to sets of blocks, defined by the query only, the text only, or both:
- -s 4x : post-sorting blocks using x on the query, when blocks have been defined per query chunk : this is equivalent to 1x where the query chunk rank order is replaced by the best hit ordering of each query chunk. Note that the text chunk rank order is kept for each query chunk result.
- -s 5x : post-sorting blocks using x on the text, when blocks have been defined per text chunk : this is equivalent to 2x where the text chunk rank order is replaced by the best hit ordering of each text chunk. Note that the query chunk rank order is kept for each text chunk result.
or, more tricky (the sorting is asymmetric here, in the sense that there is no text/query equivalent for a given query/text sorting):
- -s 6x (equivalent to 4x) : post-sorting sets of blocks using x on the query, when blocks have been defined per query & text chunk : this keeps the text rank order per query result, but the queries are ordered by the x criterion on the best hit. This behaves like 4x {it would be different if two different criteria x and y were used for sorting inside blocks and for post-processing ... but this is not yet implemented :-)}
- -s 7x : post-sorting sets of blocks using x on the text (per query), when blocks have been defined by the query/text rank : this keeps the query rank order, and is similar to BLAST
- -s 8x : post-sorting sets of blocks using x on the query/text, when blocks have been defined by the query/text rank : this lists blocks of hits according to their best hits, whatever the query rank or text rank (useful for full search) : this is equivalent to 3x where the query/text rank orders are replaced by the best hit block ordering
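The two-digit structure of the -s value described above (tens digit for the block mode, units digit x for the criterion) can be decoded with a short Python sketch. The function name `decode_sort_option` and the wording of the labels are choices made for this illustration; the labels only summarize the descriptions given in this section.

```python
CRITERIA = {
    0: "score",
    1: "entropy",
    2: "mutual information",
    3: "entropy and score",
    4: "end position on the 1st file",
    5: "end position on the 2nd file",
    6: "alignment % id",
    7: "1st file sequence % id",
    8: "2nd file sequence % id",
}

BLOCK_MODES = {
    0: "no blocks, criterion alone",
    1: "blocks per query chunk, in query rank order",
    2: "blocks per text chunk, in text rank order",
    3: "blocks per query/text pair, in query then text rank order",
    4: "blocks per query chunk, post-sorted by best hit",
    5: "blocks per text chunk, post-sorted by best hit",
    6: "blocks per query/text pair, queries post-sorted by best hit",
    7: "blocks per query/text pair, texts post-sorted per query",
    8: "blocks per query/text pair, post-sorted by best hit",
}


def decode_sort_option(value):
    """Split an -s value into (block mode description, criterion description)."""
    mode, criterion = divmod(value, 10)
    if mode not in BLOCK_MODES or criterion not in CRITERIA:
        raise ValueError(f"unsupported -s value: {value}")
    return BLOCK_MODES[mode], CRITERIA[criterion]


print(decode_sort_option(70))  # the default
print(decode_sort_option(16))
```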
- -p to specify a seed Pattern (for example -p ##@_#@#__#_###)
- # match
- @ match or transition
- _ match or any mismatch (joker)
Examples of proposed seed patterns:
- PatternHunter : ###--##-#--###
- Mandala : ###---#-#--##-## ###--#-##-### ####-##-###
- YASS : #@##-#--##--#@# ###@-#-#--#@-##
Example :
TGGGCGCATGGCCTAGCCTGGATAGGGCAACAGCCTCCTAAGCTGTAGATCAGGGGTCCAAATCCCCTTGCGCCCGCTCATAACCT
||.|:.|:|:||::|| ||||||||:|||.||||||:|||||:|||.|:||||||||:|:||||||:|:|.:|.|:| ||||||||
TGTGTCCGTAGCTCAG-CTGGATAGAGCATCAGCCTTCTAAGTTGTTGGTCAGGGGTTCGAATCCCTTCGGACACAC-CATAACCT
###-#@-##@## ###-#@-##@## ###--#-#--#-### ###--#-#--#-###
- -c <int> to select the single (1) or double (2) hit criterion : a single seed hit is enough to detect an alignment in the first case, whereas two are needed in the second. Note that the single hit criterion is slower, but more sensitive.
- -T <int> to forbid aligning regions that are too close to each other on the same sequence (e.g. tandem repeats) [valid only when comparing a single-sequence file against itself]
- Other parameters are explained in the README file
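The seed symbols described for -p can be illustrated with a minimal Python sketch that tests whether two aligned, ungapped windows satisfy a pattern. The function name `seed_matches` is hypothetical, and treating "-" as a joker in the same way as "_" (as the listed example patterns suggest) is an assumption; the sketch also assumes uppercase ACGT letters.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a, b):
    """True for A<->G or C<->T substitutions."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)


def seed_matches(pattern, s1, s2):
    """Check an ungapped pair of windows against a seed pattern.

    '#' requires a match, '@' accepts a match or a transition,
    '_' (and, by assumption, '-') accepts anything.
    """
    if not (len(pattern) == len(s1) == len(s2)):
        raise ValueError("pattern and windows must have the same length")
    for symbol, a, b in zip(pattern, s1, s2):
        if symbol == "#":
            if a != b:
                return False
        elif symbol == "@":
            if a != b and not is_transition(a, b):
                return False
        elif symbol not in "_-":
            raise ValueError(f"unknown seed symbol: {symbol!r}")
    return True


# Twelve columns taken from the example alignment above (one C/T transition).
print(seed_matches("###-#@-##@##", "CAGCCTCCTAAG", "CAGCCTTCTAAG"))
```

In the sample windows the only difference is a C/T transition that falls on a joker position, so the seed accepts the pair.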
Input
Choose either 1 or 2 nucleic sequences. Note that if only one DNA sequence is selected, it is compared to itself. YASS accepts input in (multi-)fasta or plain text format :
Fasta file example :
>gi|26245917|ref|NC_004431.1| Escherichia coli CFT073, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACA...
A multi-fasta file is treated as a database (and thus taken as a whole) when given as the second filename parameter. The first file parameter can also be a multi-fasta file, in which case all of its sequences are considered (YASS v1.14): use -S to select the one to process.
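To show what a fasta "chunk" is in practice, the following Python sketch splits (multi-)fasta or plain text input into (header, sequence) records. The function name `read_fasta` is hypothetical and the parsing is deliberately simple; it is not the reader used by YASS.

```python
def read_fasta(lines):
    """Split fasta lines into (header, sequence) records.

    Input without any '>' header line is returned as a single record
    with an empty header, which covers the plain text case.
    """
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            if header is None:
                header = ""  # plain text input: no header line
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 first chunk
AGCTTTTCATTCTGACTGCA
ACGGGCAATATG
>seq2 second chunk
TTCTGAACTGG
"""
for header, sequence in read_fasta(example.splitlines()):
    print(header, len(sequence))
```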
Output
The option -d 0 shows alignment positions, lengths and E-values.
The option -d 1 gives complete alignments in the following format :
*(969588-969895)(124765-125072) Ev: 3.08503e-21 s: 308/308 f
* "MC58" (2272351 bp) / "IX" (439885 bp)
* score = 346 : bitscore = 108.00
* mutations per triplet 29, 71, 34 (1.04e-07) | ts : 64 tv : 70 | entropy : 4.90003

  |969590   |969600   |969610   |969620   |969630   |969640   |969650   |969660
ACTTATGTTCCTCTGCGCCATATGGGCGAAGGCATGGGCGAGTTCCTGGTTATCGACTCCATTTTGAACGAAGAAGCCGT
|:|||:::.||.||:.||..|.||::::|.|:.|.||..||.|||.:.:||::.:|:|||.:.||:..:||:.|.|.:||
ATTTACACACCGCTAAGCACTCTGAATAATGAAAAGGCAGACTTCACCATTGCAAATTCCTCGTTATCTGAGTACGGTGT
     |124770   |124780   |124790   |124800   |124810   |124820   |124830

  |969670   |969680   |969690   |969700   |969710   |969720   |969730   |969740
GATGGCGTTCGAGTACGGCTTTGCCTGCTCCGCACCTGACAAACTGACCATTTGGGAAGCTCAATTCGGTGACTTCGCCA
:||||..|||||:||:||:|.|.|.:...||.|.||.||:.|.||:::|||.|||||:|||||||||||||||||:||.|
AATGGGTTTCGAATATGGTTATTCGCTAACCTCCCCAGATTATCTAGTCATGTGGGAGGCTCAATTCGGTGACTTTGCAA
     |124850   |124860   |124870   |124880   |124890   |124900   |124910

  |969750   |969760   |969770   |969780   |969790   |969800   |969810   |969820
ACGGCGCGCAAGTGACTATTGACCAATTCCTGTCTTCAGGCGAAACCAAGTGGGGTCGTTTGTGCGGTCTGACTACCATC
|::..||:||:||.|:||||||||||||:.|..|:...||:|||...||:|||::.|:.:..|.:|||:|:::|.:..::
ATACAGCACAGGTTATTATTGACCAATTTATTGCCGGTGGTGAACAAAAATGGAAGCAACGCTCTGGTTTAGTTTTGTCT
     |124930   |124940   |124950   |124960   |124970   |124980   |124990

  |969830   |969840   |969850   |969860   |969870   |969880
CTGCCGCACGGCTACGACGGTCAAGGCCCCGAGCACTCTTCTGCACGCGTAGAACGTTGGTTGCAACT
:|:||.||:||:||:||:||:||:||.||.||:||:||.||||...|..|:|||.|.|..||||||||
TTACCCCATGGTTATGATGGCCAGGGGCCAGAACATTCGTCTGGTAGATTGGAAAGATTCTTGCAACT
     |125010   |125020   |125030   |125040   |125050   |125060

The yass2blast.pl script converts yass -d 1 output into BLAST full alignment output (or into axt/fasta format):

 Score = 108 bits (346), Expect = 3.08503e-21
 Identities = 174/308 (56%)
 Strand = Plus / Plus

Query: 969588 ACTTATGTTCCTCTGCGCCATATGGGCGAAGGCATGGGCGAGTTCCTGGTTATCGACTCC 969647
              | |||    || ||  ||  | ||    | |  | ||  || |||    ||    | |||
Sbjct: 124765 ATTTACACACCGCTAAGCACTCTGAATAATGAAAAGGCAGACTTCACCATTGCAAATTCC 124824

Query: 969648 ATTTTGAACGAAGAAGCCGTGATGGCGTTCGAGTACGGCTTTGCCTGCTCCGCACCTGAC 969707
                 ||    ||  | |  || ||||  ||||| || || | | |     || | || ||
Sbjct: 124825 TCGTTATCTGAGTACGGTGTAATGGGTTTCGAATATGGTTATTCGCTAACCTCCCCAGAT 124884

Query: 969708 AAACTGACCATTTGGGAAGCTCAATTCGGTGACTTCGCCAACGGCGCGCAAGTGACTATT 969767
               | ||   ||| ||||| ||||||||||||||||| || ||    || || || | ||||
Sbjct: 124885 TATCTAGTCATGTGGGAGGCTCAATTCGGTGACTTTGCAAATACAGCACAGGTTATTATT 124944

Query: 969768 GACCAATTCCTGTCTTCAGGCGAAACCAAGTGGGGTCGTTTGTGCGGTCTGACTACCATC 969827
              ||||||||  |  |    || |||   || |||   |     |  ||| |   |
Sbjct: 124945 GACCAATTTATTGCCGGTGGTGAACAAAAATGGAAGCAACGCTCTGGTTTAGTTTTGTCT 125004

Query: 969828 CTGCCGCACGGCTACGACGGTCAAGGCCCCGAGCACTCTTCTGCACGCGTAGAACGTTGG 969887
               | || || || || || || || || || || || || ||||   |  | ||| | |
Sbjct: 125005 TTACCCCATGGTTATGATGGCCAGGGGCCAGAACATTCGTCTGGTAGATTGGAAAGATTC 125064

Query: 969888 TTGCAACT 969895
              ||||||||
Sbjct: 125065 TTGCAACT 125072
The option -d 2 produces BLAST tabular output, which can be processed with existing BLAST output parsers :
MC58 IX 56.49 308 134 0 969588 969895 124765 125072 3.1e-21 108
MC58 IX 60.84 263 93 10 751895 752157 213618 213880 2.1e-19 102
MC58 IX 59.06 276 100 13 752399 752665 214119 214394 1.9e-15 88.8
MC58 IX 65.52 145 50 0 752066 752210 213789 213933 7e-13 80.2
MC58 IX 58.71 201 83 0 1684840 1685040 423430 423230 4.7e-12 77.5
MC58 IX 58.29 199 77 6 968315 968513 123477 123675 9.4e-09 66.5
MC58 IX 63.72 113 41 0 968988 969100 124159 124271 2.8e-07 61.6
MC58 IX 71.76 85 22 2 773143 773225 370394 370478 2.3e-05 55.2
MC58 IX 71.76 85 22 2 1499839 1499921 370394 370478 2.3e-05 55.2
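Since this output follows the usual twelve-column BLAST tabular layout (query, subject, % identity, alignment length, mismatches, gap openings, query start/end, subject start/end, E-value, bit score), a line can be read with a few lines of Python. The names `BlastHit` and `parse_blast_tabular` are hypothetical; the sketch assumes whitespace-separated fields as in the sample above.

```python
from collections import namedtuple

BlastHit = namedtuple(
    "BlastHit",
    "query subject pident length mismatch gapopen "
    "qstart qend sstart send evalue bitscore",
)


def parse_blast_tabular(line):
    """Parse one line of 12-column BLAST tabular output."""
    f = line.split()
    if len(f) != 12:
        raise ValueError(f"expected 12 fields, got {len(f)}")
    return BlastHit(
        f[0], f[1], float(f[2]),
        *map(int, f[3:10]),          # length ... send
        float(f[10]), float(f[11]),  # evalue, bitscore
    )


hit = parse_blast_tabular(
    "MC58 IX 58.71 201 83 0 1684840 1685040 423430 423230 4.7e-12 77.5"
)
# In BLAST tabular output, a subject start greater than the subject end
# denotes a hit on the reverse strand.
print(hit.query, hit.subject, hit.evalue, "reverse" if hit.sstart > hit.send else "forward")
```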
The option -d 3 produces a very light and easily parsable output :
969588 969895 124765 125072 308 308 f 107.998 3.08503e-21
751895 752157 213618 213880 263 263 f 101.899 2.11464e-19
752399 752665 214119 214394 267 276 f 88.7859 1.87322e-15
752066 752210 213789 213933 145 145 f 80.2473 6.96558e-13
1684840 1685040 423230 423430 201 201 r 77.5028 4.66817e-12
968315 968513 123477 123675 199 199 f 66.5246 9.41687e-09
968988 969100 124159 124271 113 113 f 61.6454 2.77133e-07
773143 773225 370394 370478 83 85 f 55.2415 2.34674e-05
1499839 1499921 370394 370478 83 85 f 55.2415 2.34674e-05
154025 154071 317763 317809 47 47 f 52.8019 0.000127308
968703 968848 123877 124022 146 146 f 52.497 0.000157273
The option -d 4 produces BED output, and the option -d 5 produces PSL output.
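A hedged sketch of how this light format might be consumed from Python follows. The column names (start and end on each sequence, the two lengths, strand, bit score, E-value) are inferred from the sample lines above rather than taken from a format specification, and the names `LightHit`, `parse_light_line` and `keep_significant` are hypothetical.

```python
from collections import namedtuple

# Column meanings are inferred from the sample output, not from a specification.
LightHit = namedtuple(
    "LightHit",
    "start1 end1 start2 end2 len1 len2 strand bitscore evalue",
)


def parse_light_line(line):
    """Parse one 9-field line of the -d 3 style output."""
    f = line.split()
    if len(f) != 9:
        raise ValueError(f"expected 9 fields, got {len(f)}")
    return LightHit(*map(int, f[:6]), f[6], float(f[7]), float(f[8]))


def keep_significant(lines, max_evalue):
    """Return the hits whose E-value does not exceed max_evalue."""
    hits = [parse_light_line(line) for line in lines if line.strip()]
    return [hit for hit in hits if hit.evalue <= max_evalue]


sample = [
    "969588 969895 124765 125072 308 308 f 107.998 3.08503e-21",
    "1684840 1685040 423230 423430 201 201 r 77.5028 4.66817e-12",
    "154025 154071 317763 317809 47 47 f 52.8019 0.000127308",
]
for hit in keep_significant(sample, max_evalue=1e-5):
    print(hit.start1, hit.end1, hit.strand, hit.evalue)
```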
Advanced parameters
- Window range (min,max) : used to group several gapped alignments into larger alignments with a higher score. Processing starts with a small sliding window (the alignments entering and leaving the window are checked, and the best score of each group is kept), then continues with larger windows whose size grows geometrically ("x increase").
- Window incr: indicates how the window size increases (geometric "x increase").
- Indels: maximal indel rate in alignments (used for distance statistics between seed hits, and also before and after extremal hits).
- Mutation: maximal mutation rate in alignments (used for distance statistics between seed hits, and also before and after extremal hits).
- Entropy filter on triplet: used to suppress alignments of low complexity regions (this is a post-processing filter, based on the alignment and not on the sequence, for historical and experimental reasons).
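The geometric window growth mentioned for the window range and increment can be pictured with a small Python sketch. The function name `window_sizes`, the example values, and the choice to stop at the last size not exceeding the maximum are all assumptions made for illustration; the actual grouping logic of YASS is not reproduced here.

```python
def window_sizes(min_size, max_size, factor):
    """List window sizes from min_size up to max_size, multiplied by factor each step."""
    if min_size <= 0 or max_size < min_size or factor <= 1:
        raise ValueError("need 0 < min_size <= max_size and factor > 1")
    sizes = []
    size = float(min_size)
    while size <= max_size:
        sizes.append(int(size))
        size *= factor
    return sizes


# Illustrative values only.
print(window_sizes(100, 1000, 2))
```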