detecting optical duplicates #106
Also related discussion here.
Note: much of the Cython dedup code is "dark-yellow", i.e. not fully optimized. In particular, some operations are too tightly interconnected with numpy operations, which in turn use Python objects. Optimizing the Cython code would require too many changes to the existing code and does not seem worth the effort, given that we have a parallel working solution in scipy.
I think in the end we don't need to decide whether each particular duplicate is optical, clustering, or PCR; we just need to estimate the total number of PCR duplicates as part of the stats... Anything fancier can be done as a custom analysis of the dups file, if one so desires.
Okay, sounds good! Thank you, Ilya!
co-authored with @Phlya:
One of the key functions of deduplication is the estimation of library depth. To do this properly, it is crucial to distinguish between PCR duplicates and optical duplicates, since the fraction of PCR duplicates increases with sequencing depth, while the rate of optical duplication is constant and independent of sequencing depth. Currently, pairtools dedup does not distinguish between the two types of duplicates, which can result in a drastic underestimation of library depth.
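For context, here is a sketch of why this matters. One standard model of library complexity is the Lander-Waterman equation (used, e.g., by Picard's EstimateLibraryComplexity; whether pairtools would adopt exactly this estimator is a separate question):

$$C = X\left(1 - e^{-N/X}\right)$$

where $N$ is the number of read pairs sequenced (after excluding optical/clustering duplicates), $C$ is the number of distinct pairs observed, and $X$ is the library size, solved for numerically. Counting optical duplicates as PCR duplicates lowers the apparent unique fraction $C/N$ and therefore deflates the estimate of $X$, producing exactly the underestimation described above.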
We can detect optical duplicates by relying on the fact that such reads are located in close physical proximity on the sequencing chip. These locations are recorded in the readIDs in the FASTQ file, so we could potentially modify the dedup code to take this information into account. Unfortunately, the current dedup code is rigid and does not allow easy modification, for several reasons. First, it is written in Cython, which makes it harder to modify. Second, readIDs are variable-length strings, which are not very Cython-friendly. Finally, the specific algorithm chosen for deduplication does not return any information on duplicated pairs, but only a boolean per molecule saying whether it is duplicated or not. After a long discussion, we concluded that there is no easy way to modify the current code to enable optical duplicate detection.
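To illustrate what "taking readIDs into account" would involve, here is a minimal sketch, assuming the standard Illumina (CASAVA 1.8+) read name layout `instrument:run:flowcell:lane:tile:x:y`. The helper names and the 100-pixel threshold (Picard's default for optical duplicates) are illustrative, not existing pairtools API:

```python
def parse_tile_xy(read_id):
    """Return (lane, tile, x, y) parsed from an Illumina readID.

    Assumes the CASAVA 1.8+ layout instrument:run:flowcell:lane:tile:x:y;
    the split()[0] drops any comment after the first whitespace.
    """
    fields = read_id.split()[0].split(":")
    return int(fields[3]), int(fields[4]), int(fields[5]), int(fields[6])


def looks_optical(read_id1, read_id2, max_dist=100):
    """Flag two duplicates as candidate optical duplicates if they come
    from the same lane and tile and sit within max_dist pixels of each
    other. 100 px mirrors Picard's default; the right threshold depends
    on the sequencer."""
    lane1, tile1, x1, y1 = parse_tile_xy(read_id1)
    lane2, tile2, x2, y2 = parse_tile_xy(read_id2)
    return (
        lane1 == lane2
        and tile1 == tile2
        and abs(x1 - x2) <= max_dist
        and abs(y1 - y2) <= max_dist
    )
```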
The two alternatives are (a) re-writing the dedup code in Cython or (b) re-purposing existing algorithms for deduplication. We believe that (b) is possible and reasonable. Specifically, we can use scipy.spatial.cKDTree to detect neighbours in chunks of pairs. This implementation is very lightweight, only ~20 lines of very transparent code: https://gist.github.com/golobor/5daba27411671bb6d497046af649cec9 . The speed seems very reasonable, around 10 minutes for 1 billion pairs, which is negligible compared to other steps like mapping, parsing, and sorting. This code will then be easy to modify to analyse readIDs within duplicate clusters and to detect optical duplicates. As a bonus, this pure Python code will also be easy to extend, e.g. to add extra criteria for duplicate detection or for picking the "representative" molecule from a duplicate cluster.
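For reference, a minimal sketch of the cKDTree approach (not the gist verbatim): it assumes pairs have already been chunked and restricted to one chrom1/chrom2/strand combination upstream, and `max_mismatch` plays the role of dedup's positional tolerance:

```python
import numpy as np
from scipy.spatial import cKDTree


def mark_duplicates(pos1, pos2, max_mismatch=3):
    """Mark all but one pair in each cluster of near-identical pairs.

    pos1, pos2 : integer arrays of the two mapped positions of each pair,
                 assumed already restricted to a single chrom1/chrom2/strand
                 combination (as dedup would do per chunk).
    Returns a boolean mask: True = duplicate.
    """
    coords = np.vstack([pos1, pos2]).T
    tree = cKDTree(coords)
    # Chebyshev metric (p=inf): both positions within max_mismatch.
    matches = tree.query_pairs(r=max_mismatch, p=np.inf, output_type="ndarray")
    dup = np.zeros(len(coords), dtype=bool)
    # For each matched index pair, keep the lower index as the
    # "representative" and mark the higher one as a duplicate.
    dup[matches.max(axis=1)] = True
    return dup


# Toy usage: the first two pairs are within 3 bp of each other on both sides.
pos1 = np.array([100, 101, 500])
pos2 = np.array([2000, 2002, 9000])
print(mark_duplicates(pos1, pos2))  # [False  True False]
```

Because the tree query returns the actual matched index pairs (not just a per-molecule boolean), the duplicate clusters stay accessible, which is exactly the hook needed to compare readIDs within a cluster and split optical from PCR duplicates.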