Scaling up extracting skills pipeline #31
Conversation
Commits:
… its in a stop word
…only include if in vocab
…gs using sentence_transformers library
…bedding clustering

Force-pushed from 79471e4 to a90f36c

…date parameters for skills extraction when its on 10k data points
thanks @lizgzil for the pull request! I left a few comments - let me know if you have any additional questions or want me to go over anything specific in more detail.
@@ -0,0 +1,100 @@
import re
could be helpful to have a description of this file as well!
- Isn't a proper noun/number/quite a few other word types
- Isn't a word with numbers in (these are always garbage)
"""
not_skills_words = [
maybe make this a txt file and save it in inputs, so you can add to the list as well and it won't be unruly in the code?
good idea!
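A minimal sketch of how that could look (the file path inputs/not_skills_words.txt and the loader function below are hypothetical, not taken from this PR):

```python
from pathlib import Path

def load_not_skills_words(path: str = "inputs/not_skills_words.txt") -> set:
    """Load the custom stop-word list from a text file, one word per line."""
    # Hypothetical location - keeping the list in inputs/ means it can grow
    # over time without cluttering the pipeline code.
    return {line.strip() for line in Path(path).read_text().splitlines() if line.strip()}

not_skills_words = load_not_skills_words()
```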
@@ -0,0 +1,275 @@
import logging
same as above - a small description could be helpful
This is done by running:

```
python -i skills_taxonomy_v2/pipeline/skills_extraction/get_sentence_embeddings.py --config_path 'skills_taxonomy_v2/config/skills_extraction/2021.08.02.yaml'
```
this code runs! although it might be good to add python-Levenshtein to the requirements.txt, because I keep getting warning messages to pip install it in order to suppress the message.
oh good to know!
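For reference, the fix the reviewer describes is a one-line addition (whether to pin a version is an open choice, not specified in the thread):

```
# requirements.txt
python-Levenshtein
```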
@@ -0,0 +1,82 @@
# Skills Extraction

The aim of this pipeline is to extract skills from job adverts. There are 3 steps:
really helpful description!!
""" | ||
Not an elegant solution - but if the data is too large (i think >5GB) you get | ||
an error: | ||
botocore.exceptions.ClientError: An error occurred (EntityTooLarge) when |
this is annoying!! maybe worth talking to someone in data engineering, e.g. Joel or Jack, to see if they've dealt with this before and have a more elegant solution?
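One possibly more elegant route, offered here only as a sketch (not something discussed in this thread): the EntityTooLarge error comes from S3's ~5GB limit on a single PUT request, and boto3's managed transfer API avoids it by switching to multipart uploads above a size threshold. The bucket and key names below are placeholders.

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Above this threshold boto3 splits the object into parts automatically,
# sidestepping the single-PUT EntityTooLarge limit.
config = TransferConfig(multipart_threshold=1024 ** 3)  # 1 GiB

# Placeholder bucket/key - substitute the pipeline's actual S3 paths.
s3.upload_file(
    "sentences_data.json", "my-bucket", "outputs/sentences_data.json", Config=config
)
```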
… words in a text file, add python-levenshtein to requirements
thanks for the review @india-kerle! I've made your changes in the latest commit. Good call on the custom stop words going in their own txt file.
This PR adds the 3-step pipeline in pipeline/skills_extraction/: getting BERT sentence embeddings from the skill sentences (step 1) and using them to extract 'TK' skills (step 2). It also creates an ESCO-TK skills mapper dictionary (step 3). Previously, in PR #27, I found skill sentences in 10 random TextKernel files - this resulted in around 6 million sentences.
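To illustrate what steps 1 and 2 involve, a rough sketch (the model name, clustering algorithm, and parameter values are assumptions for illustration, not the PR's actual configuration, which lives in the config YAML):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

# Step 1: embed skill sentences with a BERT-based sentence encoder.
# "all-MiniLM-L6-v2" is a placeholder model name.
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["experience with python and sql", "strong communication skills"]
embeddings = model.encode(sentences)

# Step 2: cluster the embeddings; each cluster becomes a candidate 'TK' skill.
# eps/min_samples are illustrative, not tuned values.
cluster_labels = DBSCAN(eps=1.0, min_samples=1).fit_predict(embeddings)
print(dict(zip(sentences, cluster_labels)))
```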
Documents which explain what this PR adds:
Files to review:

The notebooks/ file path can be ignored - the notebooks are experimental and anything important has been refactored out into various scripts.

Checklist:
- I have refactored my code out of notebooks/
- I have run pre-commit and addressed any issues not automatically fixed
- I have merged any new changes from dev
- I have documented the code, including appropriate READMEs
- I have explained the feature in this PR or in output/reports/