Partially solved in ba7aeae, where we do the following:
limit the number of positive samples to 500 (sub-sample if there are more);
limit the number of negative samples to 1,000 (sub-sample if there are more);
sub-sample the negative samples if there are more than 20 times as many negative as positive samples; this limit is reduced to 3 times if there is only one sibling (line 346);
ensure that there are at least 5 times more negative than positive samples. If there are fewer, we pick additional ones from outside the siblings: we randomly choose 5 positive samples, find the most similar samples outside of the siblings, and add those to the negative samples that we already have. Note (line 363): at kingdom level it is not possible to add samples from outside the siblings (so possible_neg = 0).
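The rules listed above can be sketched roughly as follows. This is only an illustration and not the code from ba7aeae: the function name `balance_training_set`, the constants, the order in which the rules are applied, and the `find_similar_outside` callback (standing in for the similarity search outside the siblings) are all assumptions made for the example.

```python
import random

# Thresholds taken from the list above; the constant names are hypothetical.
MAX_POS = 500              # cap on positive samples
MAX_NEG = 1000             # cap on negative samples
MAX_RATIO = 20             # max negatives per positive
MAX_RATIO_ONE_SIBLING = 3  # max negatives per positive with a single sibling
MIN_RATIO = 5              # wanted minimum of negatives per positive
N_SEEDS = 5                # positives used to search outside the siblings


def subsample(items, k, rng):
    """Return at most k items, chosen at random without replacement."""
    items = list(items)
    return items if len(items) <= k else rng.sample(items, k)


def balance_training_set(positives, negatives, n_siblings,
                         find_similar_outside=None, rng=None):
    """Apply the sampling rules in the order they are listed (an assumption).

    find_similar_outside(seeds, n) is a hypothetical callback returning up to
    n genes outside the siblings that are most similar to the seed positives.
    Pass None at kingdom level, where nothing exists outside the siblings.
    """
    rng = rng or random.Random()
    pos = subsample(positives, MAX_POS, rng)
    neg = subsample(negatives, MAX_NEG, rng)

    # Trim negatives that exceed the allowed negative/positive ratio.
    max_ratio = MAX_RATIO_ONE_SIBLING if n_siblings == 1 else MAX_RATIO
    neg = subsample(neg, max_ratio * len(pos), rng)

    # Top up from outside the siblings if there are too few negatives.
    missing = MIN_RATIO * len(pos) - len(neg)
    if missing > 0 and find_similar_outside is not None:
        seeds = subsample(pos, N_SEEDS, rng)
        neg.extend(list(find_similar_outside(seeds, missing))[:missing])
    return pos, neg
```

With 3 positive and 33,000 negative samples and more than one sibling, this sketch keeps all 3 positives and 60 negatives (capped to 1,000 first, then to 20 times the positives).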
At the moment, when we train a node, we take all possible genes from the positive and negative classes. This can result in an unbalanced training set, for example one where we have 3 positive classes and 33k negative classes.
We need to improve the function find_training_genes in create_db.py.