
24 improved classifier #30

Merged
merged 60 commits into from Aug 25, 2021
Conversation

india-kerle (Contributor) commented Jul 20, 2021

@lizgzil interested in your ideas on improving the model. I've gone through the sentence splitting and training data, and I think the assumption that sentences that were not labelled as skills were therefore not skills may have been wrong. The 1-class sentences look well labelled, but the 0-class sentences don't.

Experiments are in improve_classifier.ipynb and improve_classifier.py.

@india-kerle india-kerle force-pushed the 24_improved_classifier branch from 0af9820 to 77c2691 Compare August 3, 2021 15:45
@india-kerle india-kerle marked this pull request as ready for review August 6, 2021 07:58
```python
# make training data
skills_augment = oversample_skills_wordnet(skills)
balanced_augment_training = skills_augment + [
    (train, label) for train, label in zip(X_train, y_train) if label == 0
]
```
Contributor:

It seems like you just add all the non-skills data here rather than taking a sample (as the name balanced_augment_training suggests)? Is this what you meant?

Contributor:

same in experiment no. 7

Contributor:

oh I see you sort it out in experiment 8!

Contributor Author:

Yeah, the variable name is bad though! I tried both oversampling and balancing.
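The balancing discussed in this thread (taking a sample of the 0 class rather than adding all of it) could be sketched as below. This is a minimal, self-contained illustration, not the notebook's actual code: `balance_training_data` is a hypothetical helper, and `oversample_skills_wordnet` output is stood in for by a plain list of `(sentence, 1)` pairs.

```python
import random

def balance_training_data(augmented_skills, X_train, y_train, seed=42):
    """Undersample the 0 class so it matches the (augmented) 1 class in size.

    `augmented_skills` is assumed to be a list of (sentence, 1) pairs,
    e.g. the output of a WordNet-based oversampler.
    """
    negatives = [(x, y) for x, y in zip(X_train, y_train) if y == 0]
    random.seed(seed)
    sampled = random.sample(negatives, min(len(augmented_skills), len(negatives)))
    balanced = augmented_skills + sampled
    random.shuffle(balanced)  # avoid all positives appearing first
    return balanced

# toy usage with made-up sentences
skills = [("communicate effectively", 1)] * 3
X = ["we are based in London", "competitive salary", "free lunch", "apply now"]
y = [0, 0, 0, 0]
data = balance_training_data(skills, X, y)
```

With three positives and four negatives, this yields six examples, three per class.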

```python
stops = stopwords.words("english")
aug = naw.ContextualWordEmbsAug(aug_min=1, stopwords=stops)
augmented_embed_skill_sents = []
for index, train in enumerate(skills):
```
Contributor:

I'm not sure that this is causing a bug, but note that your input variable name is skills_data, not skills.



```python
# %% [markdown]
# ## Experiment No. 7 - Balance training data - use contextual word embeddings to oversample 1 class
```
Contributor:

this is a cool idea!

india-kerle (Contributor Author):

@lizgzil

I've updated your scaled up sentence classifier to reflect the new pipeline!

The current pipeline results in a precision score of 0.92 for the positive class and 0.90 for the negative class. It results in a recall score of 0.73 for the positive class and 0.97 for the negative class. This is for the new, updated training data where I modified some incorrect labels.

I would take this with a slight pinch of salt, as the training labels aren't perfect. Qualitatively, I took a look at misclassified sentences where the true label was 1 and the model predicted 0: while some were certainly skills, many also contained 'characteristics' (I give some examples in sentence_classifier.md). I also took a look at sentences the model labelled as skills that were manually labelled 0, and those also seem rather edge-case-y!
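The per-class precision and recall figures quoted above come from comparing predictions against the (imperfect) manual labels. A minimal sketch of how those two numbers are computed per class, with toy labels rather than the real evaluation set:

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class, from paired true/predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# toy example: 4 true skill sentences, 3 found, 1 missed, 1 false positive
y_true = [1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0]
p, r = precision_recall(y_true, y_pred)  # 0.75, 0.75
```

The gap between a high negative-class recall and a lower positive-class recall, as reported here, typically means the model misses some true skill sentences rather than over-predicting them.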

  1. pipeline/sentence_classifier/sentence_classifier.py - modified to accommodate the new pipeline + utils
  2. pipeline/sentence_classifier/utils.py - for cleaning, splitting and generating 'verb' features
  3. pipeline/sentence_classifier/predict_sentence_class.py - barely modified

Can you let me know if:

  1. There are problems running the code
  2. The code could be better written
  3. There are any style issues with docstrings, etc.

Thanks so much Liz!

@india-kerle india-kerle force-pushed the 24_improved_classifier branch from 7af6b6c to acba78d Compare August 20, 2021 12:05
Comment on lines 234 to 236
From `2021.07.09`:

This will run predictions on a random sample of 10 of the 686 data files. The outputs of this yielded 5,823,903 skill sentences from the 1,000,000 job adverts.
Contributor:

Do you have these numbers for the 2021.08.16.yaml yet?

requirements.txt
```diff
@@ -17,6 +17,9 @@ networkx
 gensim
 bokeh
 umap-learn
+nlpaug
+nltk
+Sklearn
```
Contributor:

sklearn is already in the requirements

Contributor Author:

whoops - deleted!!

lizgzil (Contributor) commented Aug 20, 2021

@india-kerle looks good, but see some of my comments, as I think there may be some bugs. I also think we should store the models and data in S3, not in GitHub, so if you agree, perhaps we could remove them from this pull request?

Just about to try and run the scripts (I pressed to submit my comments accidentally before I'd finished reviewing!)


### Sentence classifier - `2021.08.16.yaml`

`python skills_taxonomy_v2/pipeline/sentence_classifier/sentence_classifier.py --yaml_file_name 2021.08.16`
Contributor:

when I ran this I got

```
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      2299
           1       1.00      0.98      0.99       931

    accuracy                           0.99      3230
   macro avg       1.00      0.99      0.99      3230
weighted avg       0.99      0.99      0.99      3230
```

and

```
              precision    recall  f1-score   support

           0       0.90      0.97      0.93       406
           1       0.91      0.73      0.81       164

    accuracy                           0.90       570
   macro avg       0.90      0.85      0.87       570
weighted avg       0.90      0.90      0.90       570
```

So seems like maybe the data I have (/inputs/new_training_data/final_training_data.pickle) is smaller than the version you have?

Contributor Author:

hmmmm - lemme check as the final training set should be larger


```python
# Output file name
output_dir = params["output_dir"]
file_name = os.path.join(output_dir, yaml_file_name.replace(".", "_"))

# Run flow
training_data = load_training_data(training_data_file)

training_data = load_training_data("final_training_data")
```
Contributor:

I think this might work better as a parameter in the config with the extension included rather than hardcoded here

Contributor Author:

good idea! no longer hardcoded
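The config-driven version suggested in this thread might look roughly like the sketch below. The `training_data_file` key and the shape of `params` are assumptions for illustration, not the repo's actual config schema:

```python
# `params` would normally be loaded from the run's yaml config
# (e.g. 2021.08.16.yaml); shown here as a plain dict for illustration.
params = {
    "output_dir": "outputs/sentence_classifier/",
    "training_data_file": "final_training_data.pickle",  # extension included
}

def get_training_data_path(params):
    """Read the training data filename from the config instead of hardcoding it."""
    return params["training_data_file"]

path = get_training_data_path(params)
```

Keeping the filename (with extension) in the config means reruns against a different training set only touch the yaml, not the pipeline code.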

lizgzil (Contributor) commented Aug 20, 2021

@india-kerle don't forget to upload the code to create the training data (or any processing you did)?

edit: this would be fine in another PR - e.g. the code to create the training data and a document about how to create it/how it was created

@india-kerle india-kerle force-pushed the 24_improved_classifier branch from d5e0169 to 7967c85 Compare August 25, 2021 18:43
@india-kerle india-kerle reopened this Aug 25, 2021
@india-kerle india-kerle merged commit e7f7ed7 into dev Aug 25, 2021