cutoff or threshold for determining similarity #13
Comments
Dear Jose @PlantHealth-Analytics, Thank you for your interest and your question. For a quick answer, please jump directly to 2) under 2. Differentiating genomes. I will answer your question in two parts with more detail. First, I will describe the thresholding mechanism used to determine the best mapping position for a read. Second, I will explain our efforts related to differentiating/classifying closely related genomes.
Lines 492 to 497 in d3956ea
1) The ratio of the mean score of all chains to the score of the best chain: Line 485 in d3956ea
2) The ratio of the mean MAPQ of all chains to the MAPQ of the best chain: Line 484 in d3956ea
3) The ratio of the MAPQ of the best chain to 30 (half of the maximum MAPQ, which is 60): Line 483 in d3956ea
These ratios (metrics) are summed in a weighted manner to calculate Line 487 in d3956ea
For each metric, we assign a corresponding weight to make the thresholding decision (similar to a simple perceptron). The default values of these weights are not shown in the help message (since this is an advanced feature), but you can still modify them using the arguments in the comments here (default values are also provided here): Lines 80 to 82 in d3956ea
Please note that the weights should sum to 1 for the thresholding mechanism to work effectively.
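The weighted thresholding described above can be sketched roughly as follows. This is only an illustrative sketch, not the RawHash2 implementation: the function name, the parameter names, and the example weights are hypothetical, the "best chain" is assumed to be the one with the highest score, and the sketch stops at the weighted sum because the threshold value and the direction of the comparison are not stated here.

```python
from statistics import mean

MAX_MAPQ = 60  # maximum MAPQ, as stated in the explanation above


def weighted_confidence(scores, mapqs, w_score, w_mapq, w_best):
    """Combine the three ratios into a single weighted sum.

    scores, mapqs: per-chain score and MAPQ for one read (same order).
    w_score, w_mapq, w_best: hypothetical weights; they should sum to 1.
    """
    if not scores or len(scores) != len(mapqs):
        raise ValueError("scores and mapqs must be non-empty and equal in length")
    if abs((w_score + w_mapq + w_best) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")

    # Assumption: the best chain is the one with the highest chaining score.
    best = max(range(len(scores)), key=scores.__getitem__)
    best_score, best_mapq = scores[best], mapqs[best]

    # 1) mean score of all chains / score of the best chain
    r_score = mean(scores) / best_score if best_score > 0 else 0.0
    # 2) mean MAPQ of all chains / MAPQ of the best chain
    r_mapq = mean(mapqs) / best_mapq if best_mapq > 0 else 0.0
    # 3) MAPQ of the best chain / 30 (half of the maximum MAPQ)
    r_best = best_mapq / (MAX_MAPQ / 2)

    return w_score * r_score + w_mapq * r_mapq + w_best * r_best


# Two chains for one read, with made-up weights (not the RawHash2 defaults).
value = weighted_confidence([100, 50], [60, 30], 0.5, 0.25, 0.25)
```

The resulting value would then be compared against a cutoff to accept or reject the best mapping position; the actual defaults and comparison are in the source lines referenced above.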
2. Differentiating genomes

1) The mapping procedure is still at the base level. It is computationally expensive as it requires identifying chains (a costly operation) based on seed matches. To differentiate genomes, it should be possible to perform cheaper computations (e.g., seed voting). I tried this but did not succeed on the first attempt. My approach was to differentiate distant genomes by assigning a read to the genome with the highest seed match count. However, more sophisticated filters/selection strategies are likely needed.

2) I focused on differentiating distant genomes, not closely related ones. I am not entirely sure how well RawHash2 performs in such cases. What we do know is that performing DTW alignment after mapping significantly improves the confidence of relative abundance estimation. Although DTW from our earlier RawAlign work has been integrated into RawHash, it needs further optimization for RawHash2. You may want to integrate alignment and evaluate its performance. I am happy to help if you would like to explore this further.

Thanks,
Dear Can, Thanks for the detailed explanation and for sharing the logic behind your thresholding mechanism and genome differentiation strategies. I agree that tuning or modifying the parameters of the metrics might offer a pathway to better differentiate closely related genomes. Yes, resolving such genomes is indeed a challenging task. Looking forward to staying in touch! Best regards,
Thank you for providing this tool! I am curious if there is an established cutoff or threshold for determining similarity during analysis. Specifically, have you tested how well the tool can discriminate between closely related genomes in real-time? For example, have there been evaluations using raw signals to differentiate such genomes effectively?
Thanks. Jose C. Huguet-Tapia