detecting optical duplicates #106
Also related discussion here.
Note: much of the Cython dedup code is "dark-yellow", i.e. not fully optimized. In particular, some operations are too tightly interconnected with numpy operations, which in turn use Python objects. Optimizing the Cython code would require too many changes to the existing code and does not seem worth the effort, given that we have a parallel working solution in scipy.
I think in the end we don't need to decide whether each particular duplicate is optical, clustering, or PCR; we just need to estimate the total number of PCR duplicates as part of the stats... Anything fancier can be done as a custom analysis of the dups file, if one so desires.
Okay, sounds good! Thank you, Ilya!
co-authored with @Phlya:
One of the key functions of deduplication is the estimation of library depth. To do this properly, it is crucial to distinguish between PCR duplicates and optical duplicates, since the fraction of PCR duplicates increases with sequencing depth, while the rate of optical duplication is constant and independent of sequencing depth. Currently, pairtools dedup does not distinguish between the two types of duplicates, which can result in a drastic underestimation of library depth.
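For context, here is a sketch of why this matters. One standard model of library complexity is the Lander-Waterman equation (used, e.g., by Picard's EstimateLibraryComplexity; whether pairtools would adopt exactly this estimator is a separate question):

$$C = X\left(1 - e^{-N/X}\right)$$

where $N$ is the number of read pairs sequenced (after excluding optical/clustering duplicates), $C$ is the number of distinct pairs observed, and $X$ is the library size, solved for numerically. Counting optical duplicates as PCR duplicates lowers the apparent unique fraction $C/N$ and therefore deflates the estimate of $X$, producing exactly the underestimation described above.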
We can detect optical duplicates by relying on the fact that such reads are located in close physical proximity on the sequencing chip. These locations are recorded in the readIDs in the FASTQ file, so we could potentially modify the dedup code to take this information into account. Unfortunately, the current dedup code is rigid and does not allow easy modification, for several reasons. First, it is written in Cython, which makes it harder to modify. Second, readIDs are variable-length strings, which are not very Cython-friendly. Finally, the specific algorithm chosen for deduplication does not return any information on duplicated pairs, but only a boolean per molecule saying whether it is duplicated or not. After a long discussion, we concluded that there is no easy way to modify the current code to enable optical duplicate detection.
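To illustrate what "taking readIDs into account" would involve, here is a minimal sketch, assuming the standard Illumina (CASAVA 1.8+) read name layout `instrument:run:flowcell:lane:tile:x:y`. The helper names and the 100-pixel threshold (Picard's default for optical duplicates) are illustrative, not existing pairtools API:

```python
def parse_tile_xy(read_id):
    """Return (lane, tile, x, y) parsed from an Illumina readID.

    Assumes the CASAVA 1.8+ layout instrument:run:flowcell:lane:tile:x:y;
    the split()[0] drops any comment after the first whitespace.
    """
    fields = read_id.split()[0].split(":")
    return int(fields[3]), int(fields[4]), int(fields[5]), int(fields[6])


def looks_optical(read_id1, read_id2, max_dist=100):
    """Flag two duplicates as candidate optical duplicates if they come
    from the same lane and tile and sit within max_dist pixels of each
    other. 100 px mirrors Picard's default; the right threshold depends
    on the sequencer."""
    lane1, tile1, x1, y1 = parse_tile_xy(read_id1)
    lane2, tile2, x2, y2 = parse_tile_xy(read_id2)
    return (
        lane1 == lane2
        and tile1 == tile2
        and abs(x1 - x2) <= max_dist
        and abs(y1 - y2) <= max_dist
    )
```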
The two alternatives are (a) re-writing the dedup code in Cython or (b) re-purposing existing algorithms for deduplication. We believe that (b) is possible and reasonable. Specifically, we can use scipy.spatial.cKDTree to detect neighbours in chunks of pairs. This implementation is very lightweight, only ~20 lines of very transparent code: https://gist.github.com/golobor/5daba27411671bb6d497046af649cec9 . The speed seems very reasonable, around 10 minutes for 1 billion pairs, which is negligible compared to other steps like mapping, parsing, and sorting. This code will then be easy to modify to analyse readIDs within duplicate clusters and to detect optical duplicates. As a bonus, this pure Python code will also be easy to extend, e.g. to add extra criteria for duplicate detection or for picking the "representative" molecule from a duplicate cluster.
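For reference, a minimal sketch of the cKDTree approach (not the gist verbatim): it assumes pairs have already been chunked and restricted to one chrom1/chrom2/strand combination upstream, and `max_mismatch` plays the role of dedup's positional tolerance:

```python
import numpy as np
from scipy.spatial import cKDTree


def mark_duplicates(pos1, pos2, max_mismatch=3):
    """Mark all but one pair in each cluster of near-identical pairs.

    pos1, pos2 : integer arrays of the two mapped positions of each pair,
                 assumed already restricted to a single chrom1/chrom2/strand
                 combination (as dedup would do per chunk).
    Returns a boolean mask: True = duplicate.
    """
    coords = np.vstack([pos1, pos2]).T
    tree = cKDTree(coords)
    # Chebyshev metric (p=inf): both positions within max_mismatch.
    matches = tree.query_pairs(r=max_mismatch, p=np.inf, output_type="ndarray")
    dup = np.zeros(len(coords), dtype=bool)
    # For each matched index pair, keep the lower index as the
    # "representative" and mark the higher one as a duplicate.
    dup[matches.max(axis=1)] = True
    return dup


# Toy usage: the first two pairs are within 3 bp of each other on both sides.
pos1 = np.array([100, 101, 500])
pos2 = np.array([2000, 2002, 9000])
print(mark_duplicates(pos1, pos2))  # [False  True False]
```

Because the tree query returns the actual matched index pairs (not just a per-molecule boolean), the duplicate clusters stay accessible, which is exactly the hook needed to compare readIDs within a cluster and split optical from PCR duplicates.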