This page contains experimental datasets, results, and associated scripts related to the paper Seed design framework for mapping SOLiD reads, presented at RECOMB 2010. The content is divided into two sections, Lossless seeds and Lossy seeds, which relate to Section 4 (Experiments and Discussion) of the paper. Results are provided for read length 34, i.e. SOLiD reads of 35 colors.
For both settings, seeds have been designed to be applied either at any position of the read, or at a restricted set of positions. Both spaced seeds and symmetric spaced seeds have been designed.
Note: this work has since been extended in the ABI paper Designing efficient spaced seeds for SOLiD read mapping, in which the additional seed designs of section 1.2, together with software experiments using the SToRM read mapper, were carried out.
Lossless seeds have the ability to detect all the possible mutational patterns with a given error threshold. For SOLiD reads, one can distinguish mismatches of two types: SNPs (i.e. bona fide nucleotide differences) and color errors (i.e. misreadings in the sequencing process). The error threshold is then specified with respect to these two parameters. For example, 2err_1snp corresponds to the case of two color errors and one SNP.
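The difference between the two mismatch types can be illustrated with the standard SOLiD two-base encoding, in which each color encodes a pair of adjacent nucleotides. The sketch below is only an illustration and is not taken from the paper's scripts: the sequences are made up, and the helper names `to_colors` and `diff_positions` are hypothetical. It shows that an isolated SNP changes two adjacent colors, whereas a color error changes a single one.

```python
# Standard 2-bit nucleotide codes: the color of a dinucleotide is the XOR
# of the codes of its two bases (0: AA/CC/GG/TT, 1: AC/GT, 2: AG/CT, 3: AT/CG).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(seq):
    """Encode a nucleotide string as the list of colors between adjacent bases."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]


def diff_positions(x, y):
    """Return the positions at which two equal-length color lists differ."""
    return [i for i, (a, b) in enumerate(zip(x, y)) if a != b]


# Made-up example sequences (assumption, for illustration only).
reference = "ACGGTCA"
snp_read = "ACGATCA"  # G -> A substitution at nucleotide index 3

ref_colors = to_colors(reference)
snp_colors = to_colors(snp_read)

# An interior SNP changes the two colors that involve the substituted base.
snp_diffs = diff_positions(ref_colors, snp_colors)

# A color error (misreading) changes one color and leaves its neighbors intact.
err_colors = list(ref_colors)
err_colors[1] ^= 2
err_diffs = diff_positions(ref_colors, err_colors)

print("SNP changes colors at:", snp_diffs)
print("Color error changes colors at:", err_diffs)
```

This is why the two error types are counted separately in the threshold: a seed must tolerate a pair of adjacent color changes for each SNP, but only a single change for each color error.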
Automata input files : (file format is described here)
Seed output files : (file format is described here)
Numbers in table entries stand for the number of read positions the seed(s) can be applied to. In the case of several seeds, the number corresponds to the total number of positions for all seeds. The specific positions for each seed can be seen by clicking on the entry. An asterisk indicates the unrestricted position set (i.e. the seed(s) can be applied to any read position).
A more complex setup makes it possible to account for indels, in addition to SNPs and color errors, when verifying the lossless property. We have chosen a cost for each error type (e.g. cost(color error) = 2, cost(SNP) = 3, cost(indel) = 4) and a maximal cost threshold to define the alignments that belong to the target lossless set. For instance, with the costs above, a cost threshold of 7 allows at most 1 SNP and 2 color errors, or at most 1 indel and 1 SNP, as well as several other less costly combinations.
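The cost arithmetic can be checked by enumerating every combination of error counts whose total cost stays within the threshold. The following is a minimal sketch using the example costs quoted above; the function names `allowed_combinations` and `maximal` are hypothetical and are not part of the scripts distributed on this page.

```python
def allowed_combinations(cost_err=2, cost_snp=3, cost_indel=4, threshold=7):
    """List (color errors, SNPs, indels) triples whose total cost is <= threshold."""
    combos = []
    for n_err in range(threshold // cost_err + 1):
        for n_snp in range(threshold // cost_snp + 1):
            for n_ind in range(threshold // cost_indel + 1):
                total = n_err * cost_err + n_snp * cost_snp + n_ind * cost_indel
                if total <= threshold:
                    combos.append((n_err, n_snp, n_ind))
    return combos


def maximal(combos):
    """Keep only the combinations not dominated by another allowed combination."""
    def dominated(c):
        return any(
            other != c and all(x >= y for x, y in zip(other, c))
            for other in combos
        )
    return [c for c in combos if not dominated(c)]


combos = allowed_combinations()
for n_err, n_snp, n_ind in maximal(combos):
    print(f"{n_err} color error(s), {n_snp} SNP(s), {n_ind} indel(s)")
```

With these costs, the maximal combinations include 2 color errors with 1 SNP (cost 7) and 1 indel with 1 SNP (cost 7), which are the two cases named in the text; every other allowed combination either is dominated by a maximal one or is one of the remaining maximal cases with a lower total cost.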
In a similar way to section 1.1, lossless seeds designed in this framework have the ability to detect all the possible mutational patterns with this given error threshold.
Automata input files : (file format is described here)
err-snp-ind_2-3-4_thr_7 : [txt]
Seed output files : (file format is described here)
Numbers in table entries stand for the number of read positions the seed(s) can be applied to. In the case of several seeds, the number corresponds to the total number of positions for all seeds. The specific positions for each seed can be seen by clicking on the entry. An asterisk indicates the unrestricted position set (i.e. the seed(s) can be applied to any read position).
err-snp-ind_2-3-4_thr_7 : [dir] (Dec 2012)
Lossy seeds are designed to detect most of the frequent mutational patterns found on SOLiD reads. In other words, such seeds have the capacity to detect a very high fraction of read mappings typically occurring for SOLiD reads. The sensitivity of lossy seeds is ensured using probabilistic models of read alignments.
Two models are combined to distinguish mismatches caused by SNPs from those caused by color errors. Each of these models is represented by a probabilistic automaton (see the paper) generated by a Perl script ([pl]).
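To make the notion of seed sensitivity concrete, the sketch below computes it exactly for a toy model. It is not the model used on this page: instead of the combined SNP and color-error automata, it assumes that each alignment position is a match independently with probability `p_match`, and it enumerates all alignments by brute force, which is practical only for short lengths. The seed notation (`#` for a position that must match, `-` for a don't-care position) and the function names are assumptions made for the illustration.

```python
from itertools import product


def seed_hits(seed, alignment):
    """Return True if the seed matches at some offset of the alignment.

    `alignment` is a string of '1' (match) and '0' (mismatch);
    `seed` is a string of '#' (must match) and '-' (don't care).
    """
    span = len(seed)
    for start in range(len(alignment) - span + 1):
        if all(alignment[start + k] == "1"
               for k, symbol in enumerate(seed) if symbol == "#"):
            return True
    return False


def sensitivity(seed, length, p_match):
    """Exact hit probability under an i.i.d. match model (toy assumption)."""
    total = 0.0
    for bits in product("01", repeat=length):
        alignment = "".join(bits)
        if seed_hits(seed, alignment):
            matches = alignment.count("1")
            total += p_match ** matches * (1 - p_match) ** (length - matches)
    return total


# Hypothetical seed and parameters, chosen only for the example.
print(sensitivity("##-#", 12, 0.9))
```

A lossy seed design searches for the seed, or set of seeds, that maximizes this kind of probability, but under the richer automaton-based model described above rather than under independent matches.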
Lossy seeds have been designed using the following automaton (iedera automaton file [txt]) and script (iedera bash script [sh]).
snp0.01_id0.15_ps0.01_pe0.10_ef0.02_er0.25 : [pl]
Automata input files : (file format is described here)
3362states_snp0.01_id0.15_ps0.01_pe0.10_ef0.02_er0.25 : [txt]
Seed output files : (file format is described here)
3362states_snp0.01_id0.15_ps0.01_pe0.10_ef0.02_er0.25 : [dir] (Dec 2012)
unrestricted position set:
64 positions:
32 positions:
16 positions:
10 positions: