24 improved classifier #30
Conversation
Force-pushed from 0af9820 to 77c2691
# make training data
skills_augment = oversample_skills_wordnet(skills)
balanced_augment_training = skills_augment + [
    (train, label) for train, label in zip(X_train, y_train) if label == 0
]
it seems like you just add all the non-skills data here rather than taking a sample (as the name `balanced_augment_training` suggests)? was this what you meant?
same in experiment no. 7
oh I see you sort it out in experiment 8!
yeah, the variable name is bad though!! I tried oversampling and balancing
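For anyone following along, here is a rough sketch of the kind of balancing this thread lands on: oversample the skill (label 1) sentences with WordNet synonym replacement, then take a sample of the non-skill (label 0) sentences rather than appending all of them. This is a hypothetical reconstruction, not the PR's actual code: the body of `oversample_skills_wordnet`, the use of nlpaug's `SynonymAug`, and the sampling strategy are assumptions, and `X_train`/`y_train` are taken from the excerpt above.

```python
import random

import nlpaug.augmenter.word as naw


def oversample_skills_wordnet(skill_sentences, n_aug=1):
    """Hypothetical reconstruction: create extra positive (label 1) examples
    via WordNet synonym replacement (requires the nltk wordnet corpus)."""
    aug = naw.SynonymAug(aug_src="wordnet")
    augmented = []
    for sent in skill_sentences:
        for _ in range(n_aug):
            out = aug.augment(sent)
            # newer nlpaug versions return a list, older ones a string
            augmented.append(out[0] if isinstance(out, list) else out)
    return [(sent, 1) for sent in skill_sentences] + [(sent, 1) for sent in augmented]


random.seed(42)
skills = [sent for sent, label in zip(X_train, y_train) if label == 1]
non_skills = [(sent, 0) for sent, label in zip(X_train, y_train) if label == 0]

# Oversample the 1 class, then *sample* the 0 class instead of keeping all of it
skills_augment = oversample_skills_wordnet(skills)
non_skills_sample = random.sample(non_skills, min(len(skills_augment), len(non_skills)))

balanced_augment_training = skills_augment + non_skills_sample
random.shuffle(balanced_augment_training)
```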
stops = stopwords.words("english")
aug = naw.ContextualWordEmbsAug(aug_min=1, stopwords=stops)
augmented_embed_skill_sents = []
for index, train in enumerate(skills):
I'm not sure that this will be causing a bug, but note that your input variable name is `skills_data`, not `skills`
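For reference, this is roughly how the loop might read once completed and pointed at the function's actual input (`skills_data` rather than `skills`). A sketch only: the model choice and the handling of `augment()`'s return type are assumptions, not necessarily what the PR does.

```python
import nlpaug.augmenter.word as naw
from nltk.corpus import stopwords  # requires nltk's "stopwords" corpus to be downloaded

stops = stopwords.words("english")
# BERT-based contextual substitution; model_path is an assumption here
aug = naw.ContextualWordEmbsAug(
    model_path="bert-base-uncased", action="substitute", aug_min=1, stopwords=stops
)

augmented_embed_skill_sents = []
for index, train in enumerate(skills_data):  # note: skills_data, not skills
    out = aug.augment(train)
    # newer nlpaug versions return a list, older ones a string
    augmented_embed_skill_sents.append(out[0] if isinstance(out, list) else out)
```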
# %% [markdown]
# ## Experiment No. 7 - Balance training data - use contextual word embeddings to oversample 1 class
this is a cool idea!
I've updated your scaled-up sentence classifier to reflect the new pipeline! The current pipeline gives a precision of 0.92 for the positive class and 0.90 for the negative class, and a recall of 0.73 for the positive class and 0.97 for the negative class. This is for the new, updated training data where I modified some incorrect labels. I would take this with a slight pinch of salt as the training labels aren't perfect. Qualitatively, I took a look at mislabelled sentences where the true label was 1 and the model predicted 0: while some were certainly skills, many of them also contained 'characteristics' - I give some examples in
Can you let me know if:
Thanks so much Liz!
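On the qualitative check of mislabelled sentences mentioned above (true label 1, predicted 0), a minimal sketch of how that kind of inspection could be done; the variable names are placeholders rather than the PR's actual code.

```python
# X_test: test sentences, y_test: true labels, y_pred: model predictions
false_negatives = [
    sent
    for sent, true, pred in zip(X_test, y_test, y_pred)
    if true == 1 and pred == 0
]

# Eyeball a handful to judge whether they are genuine skills,
# 'characteristics', or labelling noise
for sent in false_negatives[:10]:
    print(sent)
```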
Force-pushed from 7af6b6c to acba78d
Review threads (resolved):
- skills_taxonomy_v2/pipeline/sentence_classifier/predict_sentence_class.py
- skills_taxonomy_v2/pipeline/sentence_classifier/predict_sentence_class.py (outdated)
- skills_taxonomy_v2/pipeline/sentence_classifier/predict_sentence_class.py (outdated)
- skills_taxonomy_v2/pipeline/sentence_classifier/sentence_classifier.md (outdated)
From `2021.07.09`:

This will run predictions on a random sample of 10 of the 686 data files. The outputs of this yielded 5,823,903 skill sentences from the 1,000,000 job adverts.
Do you have these numbers for the 2021.08.16.yaml yet?
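For context on the "random sample of 10 of the 686 data files" mentioned above, the selection could be as simple as the sketch below; `all_data_file_keys` is a placeholder name, not something from the PR.

```python
import random

random.seed(42)  # fix the seed so the sample of files is reproducible
# all_data_file_keys: list of the 686 data file keys to predict on
sample_files = random.sample(all_data_file_keys, 10)
```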
requirements.txt (outdated)
@@ -17,6 +17,9 @@ networkx
 gensim
 bokeh
 umap-learn
+nlpaug
+nltk
+Sklearn
sklearn is already in the requirements
whoops - deleted!!
@india-kerle looks good, but see some of my comments as I think there may be some bugs. I also think we should store the models and data in S3 rather than in GitHub, so if you agree perhaps we could remove them from this pull request? Just about to try and run the scripts (I pressed to submit my comments accidentally before I'd finished reviewing!)
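On keeping models and training data out of the repo, a minimal sketch of pushing an artefact to S3 with boto3; the local path, bucket name and S3 key below are made up for illustration.

```python
import boto3

s3 = boto3.client("s3")

# Upload a trained model instead of committing it to git
s3.upload_file(
    "outputs/sentence_classifier/2021_08_16.pkl",  # local path (illustrative)
    "my-project-bucket",                           # bucket name (made up)
    "models/sentence_classifier/2021_08_16.pkl",   # S3 key (illustrative)
)
```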
### Sentence classifier - `2021.08.16.yaml`

`python skills_taxonomy_v2/pipeline/sentence_classifier/sentence_classifier.py --yaml_file_name 2021.08.16`
when I ran this I got
precision recall f1-score support
0 0.99 1.00 1.00 2299
1 1.00 0.98 0.99 931
accuracy 0.99 3230
macro avg 1.00 0.99 0.99 3230
weighted avg 0.99 0.99 0.99 3230
and
precision recall f1-score support
0 0.90 0.97 0.93 406
1 0.91 0.73 0.81 164
accuracy 0.90 570
macro avg 0.90 0.85 0.87 570
weighted avg 0.90 0.90 0.90 570
So it seems like maybe the data I have (`/inputs/new_training_data/final_training_data.pickle`) is smaller than the version you have?
hmmmm - lemme check as the final training set should be larger
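One quick way to compare the two versions of the training data would be something like the sketch below. The path is taken from the comment above; that the pickle holds a list of (sentence, label) pairs is an assumption.

```python
import pickle
from collections import Counter

with open("inputs/new_training_data/final_training_data.pickle", "rb") as f:
    training_data = pickle.load(f)

# Assuming a list of (sentence, label) pairs
print(f"Total examples: {len(training_data)}")
print("Label counts:", Counter(label for _, label in training_data))
```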
# Output file name
output_dir = params["output_dir"]
file_name = os.path.join(output_dir, yaml_file_name.replace(".", "_"))

# Run flow
training_data = load_training_data(training_data_file)

training_data = load_training_data("final_training_data")
I think this might work better as a parameter in the config with the extension included rather than hardcoded here
good idea! no longer hardcoded
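A sketch of what "a parameter in the config with the extension included" could look like: read the yaml, then look the file name up from the loaded params instead of hardcoding it. The key name `training_data_file_name` and the flat config layout are illustrative, not necessarily what the PR ended up with; `yaml_file_name` and `load_training_data` come from the excerpt above.

```python
import yaml

with open(f"{yaml_file_name}.yaml", "r") as f:
    params = yaml.safe_load(f)

# "training_data_file_name" is an illustrative key, extension included in the config
training_data = load_training_data(params["training_data_file_name"])
```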
@india-kerle don't forget to upload the code to create the training data (or any processing you did)? Edit: this would be fine in another PR - e.g. the code to create the training data and a document about how to create it / how it was created
Force-pushed from d5e0169 to 7967c85
@lizgzil interested in your ideas on improving the model, but I've gone through the sentence splitting and training data - I think the assumption that sentences that were not labelled as skills were therefore not skills might have been wrong. It looks like the 1-class sentences are well labelled but the 0-class sentences aren't.
Experiments are in `improve_classifier.ipynb` and `improve_classifier.py`.
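Given the suspicion that the 0-class labels are noisy, one lightweight check (a sketch only, with placeholder names) is to pull a small random sample of 0-labelled sentences for manual review:

```python
import random

random.seed(0)
# training_data assumed to be a list of (sentence, label) pairs
zero_class = [sent for sent, label in training_data if label == 0]

# Review a manageable sample by hand to see whether they really are non-skills
for sent in random.sample(zero_class, min(20, len(zero_class))):
    print(sent)
```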