Seed design framework for mapping SOLiD reads

[for double-hit results, see this webpage]

This page contains experimental datasets, results and associated scripts related to the paper "Seed design framework for mapping SOLiD reads", presented at RECOMB 2010. The content is divided into the sections Lossless seeds and Lossy seeds, which correspond to Section 4 (Experiments and Discussion) of the paper. Results are provided for read length 34, i.e. SOLiD reads of 35 colors.

For both settings, seeds have been designed to be applied either at any position of the read, or at a restricted set of positions. Both spaced seeds and symmetric spaced seeds have been designed.

Note: this work has since been extended in the ABI paper "Designing efficient spaced seeds for SOLiD read mapping", which adds further seed designs for section 1.2 together with software experiments using the SToRM read mapper.

1. Lossless seeds

1.1 Lossless seeds with respect to SNPs and color errors

2 color errors, 1 SNP

Lossless seeds detect all possible mutational patterns within a given error threshold. For SOLiD reads, one can distinguish two types of mismatches: SNPs (i.e. bona fide nucleotide differences) and color errors (i.e. misreadings in the sequencing process). The error threshold is therefore specified with respect to these two parameters. For example, 2err_1snp corresponds to the case of two color errors and one SNP.
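To make the lossless property concrete for a setting such as 2err_1snp, the following brute-force sketch enumerates every placement of the allowed errors on a read and checks that a seed still hits. It is only an illustration, not the design procedure of the paper or of the tools listed below. The function names seed_hits and is_lossless are hypothetical. The sketch assumes a single seed that may be applied at any position, written as a string of 1 (must match) and 0 (don't care), and it models a SNP as two adjacent color mismatches, which is how an isolated SNP inside a read appears in SOLiD two-base encoding.

```python
from itertools import combinations


def seed_hits(seed, mismatches, read_len):
    """Return True if the seed matches at some position of the read.

    seed       -- string over {'1', '0'}: '1' = must match, '0' = don't care
    mismatches -- set of mismatching color positions (0-based)
    """
    must_match = [i for i, c in enumerate(seed) if c == "1"]
    for start in range(read_len - len(seed) + 1):
        if all(start + off not in mismatches for off in must_match):
            return True
    return False


def is_lossless(seed, read_len, n_color_errors, n_snps):
    """Brute-force check over all placements of color errors and SNPs.

    Assumption of this sketch: a SNP starting at color s mismatches
    colors s and s + 1; overlapping errors are kept as mismatches.
    """
    for errs in combinations(range(read_len), n_color_errors):
        for snps in combinations(range(read_len - 1), n_snps):
            mismatches = set(errs)
            for s in snps:
                mismatches.update((s, s + 1))
            if not seed_hits(seed, mismatches, read_len):
                return False  # this mutational pattern is missed
    return True


# Contiguous seeds on a read of 34 colors, setting 2err_1snp.
print(is_lossless("1" * 8, 34, 2, 1))  # True
print(is_lossless("1" * 9, 34, 2, 1))  # False
```

Under these assumptions, a contiguous seed of weight 8 is lossless while weight 9 is not: for instance, color errors at positions 8 and 26 plus a SNP covering positions 16 and 17 leave no run of 9 matching colors. Spaced seeds and position-restricted seeds, as designed on this page, would need the same kind of check with their own masks and allowed positions.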

Tools :

Data & Results :

1.2 Lossless seeds with respect to SNPs, color errors and indels

SNP, color error and indel sub-automata

A more complex setup makes it possible to account for indels, in addition to SNPs and color errors, when verifying the lossless property. We assign a cost to each error type (e.g. cost(color error) = 2, cost(SNP) = 3, cost(indel) = 4) and fix a maximal cost threshold that defines which alignments belong to the target lossless set. For instance, with the costs above, a cost threshold of 7 allows at most 1 SNP and 2 color errors, or at most 1 indel and 1 SNP, as well as several less costly combinations.

As in section 1.1, lossless seeds designed in this framework detect all possible mutational patterns within the given cost threshold.
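The cost model above can be explored with a short sketch that lists every combination of error counts whose total cost stays within the threshold. The costs and the threshold of 7 are the example values given in the text; the names COSTS, combinations_within and maximal_combinations are hypothetical and only serve this illustration.

```python
from itertools import product

# Example costs taken from the text above.
COSTS = {"color_error": 2, "snp": 3, "indel": 4}


def combinations_within(threshold, costs=COSTS):
    """List (counts, total_cost) for every error combination within threshold."""
    names = list(costs)
    ranges = [range(threshold // costs[n] + 1) for n in names]
    result = []
    for counts in product(*ranges):
        total = sum(c * costs[n] for c, n in zip(counts, names))
        if total <= threshold:
            result.append((dict(zip(names, counts)), total))
    return result


def maximal_combinations(threshold, costs=COSTS):
    """Keep only combinations to which no further error can be added."""
    cheapest = min(costs.values())
    return [
        (counts, total)
        for counts, total in combinations_within(threshold, costs)
        if total + cheapest > threshold
    ]


for counts, total in maximal_combinations(7):
    print(counts, "cost", total)
```

With a threshold of 7, the maximal combinations include 2 color errors with 1 SNP and 1 indel with 1 SNP (both of cost 7), which are the two cases cited in the text, alongside cheaper ones such as 3 color errors (cost 6).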

Tools :

Data & Results :

2. Lossy seeds

Lossy seeds are designed to detect most of the frequent mutational patterns found in SOLiD reads. In other words, such seeds detect a very high fraction of the read mappings that typically occur for SOLiD reads. The sensitivity of lossy seeds is evaluated using probabilistic models of read alignments.
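To illustrate what seed sensitivity means under a probabilistic model, the sketch below computes the probability that a spaced seed hits a read at least once. It deliberately uses a much simpler model than the one on this page: every color matches independently with the same probability p_match, with no distinction between SNPs and color errors. The function name sensitivity and this i.i.d. model are assumptions of the sketch, not the model or the software used for the results below.

```python
def sensitivity(seed, read_len, p_match):
    """Probability that the seed hits at least once on a read.

    seed     -- string over {'1', '0'}: '1' = must match, '0' = don't care
    p_match  -- probability that a color matches (i.i.d. assumption)

    Dynamic programming over the last len(seed) - 1 match/mismatch
    outcomes, tracking only the probability mass with no hit so far.
    """
    span = len(seed)
    # Bit j of a window is the outcome j steps back (bit 0 = most recent).
    seed_mask = sum(1 << (span - 1 - i) for i, c in enumerate(seed) if c == "1")
    keep_mask = (1 << (span - 1)) - 1
    no_hit = {0: 1.0}
    hit_prob = 0.0
    for pos in range(read_len):
        nxt = {}
        for window, prob in no_hit.items():
            for bit, p in ((1, p_match), (0, 1.0 - p_match)):
                full = (window << 1) | bit
                if pos + 1 >= span and (full & seed_mask) == seed_mask:
                    hit_prob += prob * p  # first hit ends here
                else:
                    key = full & keep_mask
                    nxt[key] = nxt.get(key, 0.0) + prob * p
        no_hit = nxt
    return hit_prob


# Compare a contiguous and a spaced seed of the same weight (hypothetical seeds).
print(sensitivity("11111111", 34, 0.9))
print(sensitivity("1101101011", 34, 0.9))
```

Replacing the i.i.d. model with the combined SNP and color-error automata described below changes the transition probabilities but not the general idea of accumulating the probability of a first hit.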

SNP x Read error

Two models are combined to distinguish mismatches caused by SNPs from those caused by color errors. Each of these models is represented by a probabilistic automaton (see the paper) generated by a Perl script ([pl]).

Lossy seeds have been designed using the following automaton (iedera automaton file [txt]) and script (iedera bash script [sh]).

Tools :

Data & Results :