Scaling up extracting skills pipeline #31
Conversation
Commits:
… its in a stop word
…only include if in vocab
…gs using sentence_transformers library
…bedding clustering

Force-pushed from 79471e4 to a90f36c

…date parameters for skills extraction when its on 10k data points
thanks @lizgzil for the pull request! I left a few comments - let me know if you have any additional questions or want me to go over anything specific in more detail.
@@ -0,0 +1,100 @@
import re
could be helpful to have a description of this file as well!
- Isn't a proper noun/number/quite a few other word types
- Isn't a word with numbers in (these are always garbage)
"""
not_skills_words = [
maybe make this a txt file and save it in inputs, so you can add to the list as well and it won't be unruly in the code?
good idea!
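A minimal sketch of how that could look (the file path inputs/not_skills_words.txt and the loader function below are hypothetical, not taken from this PR):

```python
from pathlib import Path

def load_not_skills_words(path: str = "inputs/not_skills_words.txt") -> set:
    """Load the custom stop-word list from a text file, one word per line."""
    # Hypothetical location - keeping the list in inputs/ means it can grow
    # over time without cluttering the pipeline code.
    return {line.strip() for line in Path(path).read_text().splitlines() if line.strip()}

not_skills_words = load_not_skills_words()
```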
@@ -0,0 +1,275 @@
import logging
same as above - a small description could be helpful
This is done by running:

```
python -i skills_taxonomy_v2/pipeline/skills_extraction/get_sentence_embeddings.py --config_path 'skills_taxonomy_v2/config/skills_extraction/2021.08.02.yaml'
```
this code runs! although it might be good to add python-Levenshtein to the requirements.txt, because I keep getting warning messages to pip install it in order to suppress the message.
oh good to know!
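For reference, the fix the reviewer describes is a one-line addition (whether to pin a version is an open choice, not specified in the thread):

```
# requirements.txt
python-Levenshtein
```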
@@ -0,0 +1,82 @@
# Skills Extraction

The aim of this pipeline is to extract skills from job adverts. There are 3 steps:
really helpful description!!
""" | ||
Not an elegant solution - but if the data is too large (i think >5GB) you get | ||
an error: | ||
botocore.exceptions.ClientError: An error occurred (EntityTooLarge) when |
this is annoying!! maybe worth talking to someone in data engineering, e.g. Joel or Jack, to see if they've dealt with this before and have a more elegant solution?
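One possibly more elegant route, offered here only as a sketch (not something discussed in this thread): the EntityTooLarge error comes from S3's ~5GB limit on a single PUT request, and boto3's managed transfer API avoids it by switching to multipart uploads above a size threshold. The bucket and key names below are placeholders.

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Above this threshold boto3 splits the object into parts automatically,
# sidestepping the single-PUT EntityTooLarge limit.
config = TransferConfig(multipart_threshold=1024 ** 3)  # 1 GiB

# Placeholder bucket/key - substitute the pipeline's actual S3 paths.
s3.upload_file(
    "sentences_data.json", "my-bucket", "outputs/sentences_data.json", Config=config
)
```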
… words in a text file, add python-levenshtein to requirements
thanks for the review @india-kerle! I've made your changes in the latest commit. Good call on the custom stop words going in their own txt file.
This PR adds the 3-step pipeline in pipeline/skills_extraction/: getting BERT sentence embeddings from the skill sentences (step 1) and using them to extract 'TK' skills (step 2). It also creates an ESCO-TK skills mapper dictionary (step 3). Previously, in PR #27, I found skill sentences in 10 random TextKernel files - this resulted in around 6 million sentences.
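To illustrate what steps 1 and 2 involve, a rough sketch (the model name, clustering algorithm, and parameter values are assumptions for illustration, not the PR's actual configuration, which lives in the config YAML):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

# Step 1: embed skill sentences with a BERT-based sentence encoder.
# "all-MiniLM-L6-v2" is a placeholder model name.
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["experience with python and sql", "strong communication skills"]
embeddings = model.encode(sentences)

# Step 2: cluster the embeddings; each cluster becomes a candidate 'TK' skill.
# eps/min_samples are illustrative, not tuned values.
cluster_labels = DBSCAN(eps=1.0, min_samples=1).fit_predict(embeddings)
print(dict(zip(sentences, cluster_labels)))
```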
Documents which explain what this PR adds:
Files to review:

The notebooks/ file path can be ignored - the notebooks are experimental and anything important has been refactored out into various scripts.

Checklist:
- I have refactored my code out of notebooks/
- I have run pre-commit and addressed any issues not automatically fixed
- I have merged any new changes from dev
- I have documented the code, including appropriate READMEs
- I have explained the feature in this PR or in output/reports/