Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cutoff or threshold for determining similarity #13

Open
PlantHealth-Analytics opened this issue Nov 18, 2024 · 2 comments
Open

cutoff or threshold for determining similarity #13

PlantHealth-Analytics opened this issue Nov 18, 2024 · 2 comments

Comments

@PlantHealth-Analytics
Copy link

PlantHealth-Analytics commented Nov 18, 2024

Thank you for providing this tool! I am curious if there is an established cutoff or threshold for determining similarity during analysis. Specifically, have you tested how well the tool can discriminate between closely related genomes in real-time? For example, have there been evaluations using raw signals to differentiate such genomes effectively?
Thanks. Jose C. Huguet-Tapia

@canfirtina
Copy link
Member

canfirtina commented Nov 24, 2024

Dear Jose @PlantHealth-Analytics,

Thank you for your interest and your question. For a quick answer, please jump directly to 2) under 2. Differentiating genomes.

I will answer your question in two parts with more detail. First, I will describe the thresholding mechanism used to determine the best mapping position for a read. Second, I will explain our efforts related to differentiating/classifying closely related genomes.

  1. Thresholding mechanism. RawHash2 identifies the best mapping position for a read by generating a weighted score based on several metrics from the best chain (i.e., the chain with the highest score). If the weighted score exceeds a certain threshold, the read is considered mapped to the position covered by the best chain:

RawHash/src/rmap.cpp

Lines 492 to 497 in d3956ea

if (weighted_sum >= opt->w_threshold || (opt->flag&RI_M_ALL_CHAINS && reg0->creg[ic].score >= opt->min_chaining_score2)) {
reg0->n_maps++;
reg0->maps = (ri_map_t*)ri_krealloc(0, reg0->maps, reg0->n_maps*sizeof(ri_map_t));
reg0->maps[reg0->n_maps-1].c_id = ic;
// fprintf(stderr, "Aligned ic: %d\n", ic);
}

weighted_sum is a weighted summation of several metrics. Currently, we calculate:

1) The ratio of the mean score of all chains to the score of the best chain:

r_bestmc = (bestC > 0)?(1.0f - (meanC/bestC)):0.0f; if(r_bestmc < 0) r_bestmc = 0.0f;

2) The ratio of the mean MAPQ of all chains to the MAPQ of the best chain:

r_bestmq = (bestQ > 0)?(1.0f - (meanQ/bestQ)):0.0f; if(r_bestmq < 0) r_bestmq = 0.0f;

3) The ratio of the MAPQ of the best chain to 30 (half of the maximum MAPQ, which is 60):

r_bestq = (bestQ > 0)?(bestQ/30.0f):0.0f; if(r_bestq > 1) r_bestq = 1.0f;

These ratios (metrics) are summed in a weighted manner to calculate weighted_sum:

weighted_sum = opt->w_bestq*r_bestq + opt->w_bestmq*r_bestmq + opt->w_bestmc*r_bestmc;

For each metric, we assign a corresponding weight to make the thresholding decision (similar to a simple perceptron). The default values of these weights are not shown in the help message (since this is an advanced feature), but you can still modify them using the arguments in the comments here (default values are also provided here):

RawHash/src/roptions.c

Lines 80 to 82 in d3956ea

opt->w_bestmq=0.05f; //--w-bestmq
opt->w_bestmc=0.6f; //--w-bestmc
opt->w_bestq=0.35f; //--w-bestq

Please note that the weights should sum to 1 for the thresholding mechanism to work effectively.

  1. Differentiating genomes. We evaluated relative abundance estimation based on the genomes that reads map to. The Euclidean distance between RawHash2's estimations and the ground truth is the smallest compared to previous tools. However, there are two potential caveats:

1) The mapping procedure is still at the base level. It is computationally expensive as it requires identifying chains (a costly operation) based on seed matches. To differentiate genomes, it should be possible to perform cheaper computations (e.g., seed voting). I tried this but did not succeed on the first attempt. My approach was to differentiate distant genomes by assigning a read to the genome with the highest seed match count. However, more sophisticated filters/selection strategies are likely needed.

2) I focused on differentiating distant genomes, not closely related ones. I am not entirely sure how well RawHash2 performs in such cases. What we do know is that performing DTW alignment after mapping significantly improves the confidence of relative abundance estimation. Although DTW from our earlier RawAlign work has been integrated into RawHash, it needs further optimization for RawHash2. You may want to integrate alignment and evaluate its performance.

I am happy to help if you would like to explore this further.

Thanks,
Can

@PlantHealth-Analytics
Copy link
Author

Dear Can,

Thanks for the detailed explanation and for sharing the logic behind your thresholding mechanism and genome differentiation strategies. I agree that tuning or modifying the parameters of the metrics might offer a pathway to better differentiate closely related genomes. Yes, resolving such genomes is indeed a challenging task.
I have raw signals from closely related bacterial genomes that could be useful for testing. I could run some analyses using RawHash2 and explore potential parameter adjustments to see if we can improve differentiation in these cases. Once I have results, I’ll be happy to share them with you and discuss further steps.

Looking forward to staying in touch!

Best regards,
Jose

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants