
24 improved classifier #30

Merged
merged 60 commits into from Aug 25, 2021
Conversation

india-kerle (Contributor) commented Jul 20, 2021

@lizgzil interested in your ideas on improving the model. I've gone through the sentence splitting and training data, and I think the assumption that sentences that were not labelled as skills were therefore not skills may have been wrong. The 1-class sentences look well labelled, but the 0-class sentences don't.

Experiments are in improve_classifier.ipynb and improve_classifier.py.

@india-kerle india-kerle force-pushed the 24_improved_classifier branch from 0af9820 to 77c2691 Compare August 3, 2021 15:45
@india-kerle india-kerle marked this pull request as ready for review August 6, 2021 07:58
```python
# make training data
skills_augment = oversample_skills_wordnet(skills)
balanced_augment_training = skills_augment + [
    (train, label) for train, label in zip(X_train, y_train) if label == 0
]
```
Contributor:

It seems like you just add all the non-skills data here rather than taking a sample (as the name balanced_augment_training suggests)? Is this what you meant?

Contributor:

same in experiment no. 7

Contributor:

oh I see you sort it out in experiment 8!

Contributor Author:

Yeah, the variable name is bad though! I tried both oversampling and balancing.
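The balancing discussed in this thread (taking a sample of the 0 class rather than adding all of it) could be sketched as below. This is a minimal, self-contained illustration, not the notebook's actual code: `balance_training_data` is a hypothetical helper, and `oversample_skills_wordnet` output is stood in for by a plain list of `(sentence, 1)` pairs.

```python
import random

def balance_training_data(augmented_skills, X_train, y_train, seed=42):
    """Undersample the 0 class so it matches the (augmented) 1 class in size.

    `augmented_skills` is assumed to be a list of (sentence, 1) pairs,
    e.g. the output of a WordNet-based oversampler.
    """
    negatives = [(x, y) for x, y in zip(X_train, y_train) if y == 0]
    random.seed(seed)
    sampled = random.sample(negatives, min(len(augmented_skills), len(negatives)))
    balanced = augmented_skills + sampled
    random.shuffle(balanced)  # avoid all positives appearing first
    return balanced

# toy usage with made-up sentences
skills = [("communicate effectively", 1)] * 3
X = ["we are based in London", "competitive salary", "free lunch", "apply now"]
y = [0, 0, 0, 0]
data = balance_training_data(skills, X, y)
```

With three positives and four negatives, this yields six examples, three per class.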

```python
stops = stopwords.words("english")
aug = naw.ContextualWordEmbsAug(aug_min=1, stopwords=stops)
augmented_embed_skill_sents = []
for index, train in enumerate(skills):
```
Contributor:

I'm not sure that this is causing a bug, but note that your input variable name is skills_data, not skills.



```python
# %% [markdown]
# ## Experiment No. 7 - Balance training data - use contextual word embeddings to oversample 1 class
```
Contributor:

this is a cool idea!

india-kerle (Contributor Author):

@lizgzil

I've updated your scaled up sentence classifier to reflect the new pipeline!

The current pipeline results in a precision score of 0.92 for the positive class and 0.90 for the negative class. It results in a recall score of 0.73 for the positive class and 0.97 for the negative class. This is for the new, updated training data where I modified some incorrect labels.

I would take this with a slight pinch of salt, as the training labels aren't perfect. Qualitatively, I took a look at misclassified sentences where the true label was 1 and the model predicted 0: while some were certainly skills, many also contained 'characteristics' (I give some examples in sentence_classifier.md). I also took a look at sentences the model labelled as skills that were manually labelled 0, and those also seem rather edge-case-y!
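The per-class precision and recall figures quoted above come from comparing predictions against the (imperfect) manual labels. A minimal sketch of how those two numbers are computed per class, with toy labels rather than the real evaluation set:

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class, from paired true/predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# toy example: 4 true skill sentences, 3 found, 1 missed, 1 false positive
y_true = [1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0]
p, r = precision_recall(y_true, y_pred)  # 0.75, 0.75
```

The gap between a high negative-class recall and a lower positive-class recall, as reported here, typically means the model misses some true skill sentences rather than over-predicting them.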

  1. pipeline/sentence_classifier/sentence_classifier.py - modified to accommodate the new pipeline + utils
  2. pipeline/sentence_classifier/utils.py - for cleaning, splitting and generating 'verb' features
  3. pipeline/sentence_classifier/predict_sentence_class.py - barely modified

Can you let me know if:

  1. There are problems running the code
  2. The code could be better written
  3. There are any style issues with docstrings, etc.

Thanks so much Liz!

@india-kerle india-kerle force-pushed the 24_improved_classifier branch from 7af6b6c to acba78d Compare August 20, 2021 12:05
Comment on lines 234 to 236
From `2021.07.09`:

This will run predictions on a random sample of 10 of the 686 data files. The outputs of this yielded 5,823,903 skill sentences from the 1,000,000 job adverts.
Contributor:

Do you have these numbers for the 2021.08.16.yaml yet?

requirements.txt
```diff
@@ -17,6 +17,9 @@ networkx
 gensim
 bokeh
 umap-learn
+nlpaug
+nltk
+Sklearn
```
Contributor:

sklearn is already in the requirements

Contributor Author:

whoops - deleted!!

lizgzil (Contributor) commented Aug 20, 2021

@india-kerle looks good, but see some of my comments, as I think there may be some bugs. I also think we should store the models and data in S3, not in GitHub, so if you agree, perhaps we could remove them from this pull request?

Just about to try and run the scripts (I pressed to submit my comments accidentally before I'd finished reviewing!)


### Sentence classifier - `2021.08.16.yaml`

`python skills_taxonomy_v2/pipeline/sentence_classifier/sentence_classifier.py --yaml_file_name 2021.08.16`
Contributor:

when I ran this I got

```
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      2299
           1       1.00      0.98      0.99       931

    accuracy                           0.99      3230
   macro avg       1.00      0.99      0.99      3230
weighted avg       0.99      0.99      0.99      3230
```

and

```
              precision    recall  f1-score   support

           0       0.90      0.97      0.93       406
           1       0.91      0.73      0.81       164

    accuracy                           0.90       570
   macro avg       0.90      0.85      0.87       570
weighted avg       0.90      0.90      0.90       570
```

So seems like maybe the data I have (/inputs/new_training_data/final_training_data.pickle) is smaller than the version you have?

Contributor Author:

hmmmm - lemme check as the final training set should be larger


```python
# Output file name
output_dir = params["output_dir"]
file_name = os.path.join(output_dir, yaml_file_name.replace(".", "_"))

# Run flow
training_data = load_training_data(training_data_file)

training_data = load_training_data("final_training_data")
```
Contributor:

I think this might work better as a parameter in the config with the extension included rather than hardcoded here

Contributor Author:

good idea! no longer hardcoded
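The config-driven version suggested in this thread might look roughly like the sketch below. The `training_data_file` key and the shape of `params` are assumptions for illustration, not the repo's actual config schema:

```python
# `params` would normally be loaded from the run's yaml config
# (e.g. 2021.08.16.yaml); shown here as a plain dict for illustration.
params = {
    "output_dir": "outputs/sentence_classifier/",
    "training_data_file": "final_training_data.pickle",  # extension included
}

def get_training_data_path(params):
    """Read the training data filename from the config instead of hardcoding it."""
    return params["training_data_file"]

path = get_training_data_path(params)
```

Keeping the filename (with extension) in the config means reruns against a different training set only touch the yaml, not the pipeline code.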

lizgzil (Contributor) commented Aug 20, 2021

@india-kerle don't forget to upload the code to create the training data (or any processing you did)?

edit: this would be fine in another PR - e.g. the code to create the training data and a document about how to create it/how it was created

@india-kerle india-kerle force-pushed the 24_improved_classifier branch from d5e0169 to 7967c85 Compare August 25, 2021 18:43
@india-kerle india-kerle reopened this Aug 25, 2021
@india-kerle india-kerle merged commit e7f7ed7 into dev Aug 25, 2021