Have more balanced classes for training #8

AlessioMilanese · 2020-04-17T10:00:19Z

At the moment, when we train a node, we take all possible genes from the positive and negative classes. This can result in an unbalanced training set, for example:

[2020-04-16 18:17:28,615]    TRAIN:"1729712 Candidatus Fermentibacteria":Find genes
[2020-04-16 18:17:28,639]       SEL_GENES:"1729712 Candidatus Fermentibacteria": 3 positive, 33086 negative
[2020-04-16 18:17:28,639]          TRAIN:"1729712 Candidatus Fermentibacteria":Train classifier

where we have 3 positive genes and about 33k negative genes.

We need to improve the function find_training_genes in create_db.py.


AlessioMilanese · 2020-04-21T22:13:41Z

Partially solved in ba7aeae, where we do the following:

- Limit the number of positive samples to 500 (sub-sample if there are more).
- Limit the number of negative samples to 1,000 (sub-sample if there are more).
- Sub-sample the negative samples if there are more than 20 times as many negative as positive samples; this limit is reduced to 3 times if there is only one sibling (line 346).
- We want at least 5 times as many negative as positive samples. If there are fewer, we pick additional ones from outside the siblings: we randomly choose 5 positive samples, find the most similar samples outside the siblings, and add those to the negative samples we already have. Note (line 363): at kingdom level it is not possible to add samples from outside the siblings (and possible_neg = 0).
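A minimal sketch of how the limits listed above could fit together, for discussion only. It is not the code from ba7aeae: the function name `balance_training_genes`, the helper `subsample`, the constant names, the `n_siblings` argument and the order in which the rules are applied are all assumptions. The similarity search outside the siblings is left out; the sketch only reports how many extra negatives would be needed to reach the 5x target.

```python
import random

# Limits taken from the comment above; the constant names are hypothetical.
MAX_POS = 500                # cap on positive samples
MAX_NEG = 1000               # cap on negative samples
MAX_RATIO = 20               # at most 20x as many negatives as positives
MAX_RATIO_ONE_SIBLING = 3    # stricter cap when there is only one sibling
MIN_RATIO = 5                # target: at least 5x as many negatives


def subsample(items, limit, rng):
    """Return at most `limit` items, drawn at random without replacement."""
    items = list(items)
    if len(items) <= limit:
        return items
    return rng.sample(items, limit)


def balance_training_genes(positive, negative, n_siblings, seed=0):
    """Apply the caps and report the shortfall of negative samples.

    Returns (pos, neg, missing), where `missing` is the number of negatives
    that would have to come from outside the siblings to reach MIN_RATIO.
    """
    rng = random.Random(seed)
    pos = subsample(positive, MAX_POS, rng)
    neg = subsample(negative, MAX_NEG, rng)

    # Cap the negative/positive ratio (assumed to apply after the hard caps).
    ratio = MAX_RATIO_ONE_SIBLING if n_siblings == 1 else MAX_RATIO
    neg = subsample(neg, ratio * len(pos), rng)

    # How many negatives are still needed to reach the 5x target.
    missing = max(0, MIN_RATIO * len(pos) - len(neg))
    return pos, neg, missing


if __name__ == "__main__":
    # Counts from the log above: 3 positive and 33086 negative genes.
    pos, neg, missing = balance_training_genes(range(3), range(33086), n_siblings=4)
    print(len(pos), len(neg), missing)
```

Under this reading, a node with a single sibling always reports a shortfall, because the 3x cap is below the 5x target, so negatives from outside the siblings would always be requested in that case. Whether the commit behaves this way is not confirmed here.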

Can we do better?

AlessioMilanese mentioned this issue Apr 17, 2020

Classifier with too many sequences #7

Closed

AlessioMilanese self-assigned this Apr 17, 2020

AlessioMilanese added the enhancement label Apr 17, 2020
