Scaling up extracting skills pipeline #31

Merged
merged 38 commits into from
Aug 20, 2021

@lizgzil lizgzil commented Jul 21, 2021


This PR adds the 3-step pipeline in pipeline/skills_extraction/ - getting BERT sentence embeddings from the skill sentences (step 1) and using them to extract 'TK' skills (step 2). It also creates an ESCO-TK skills mapper dictionary (step 3).

Previously, in PR #27, I found skill sentences in 10 random TextKernel files, which resulted in around 6 million sentences.

Documents which explain what this PR adds:

Files to review:

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained the feature in this PR or (better) in output/reports/
  • I have requested a code review

@lizgzil lizgzil force-pushed the scale-up-extract-skills branch from 79471e4 to a90f36c Compare July 29, 2021 16:10
@lizgzil lizgzil mentioned this pull request Aug 2, 2021
@lizgzil lizgzil marked this pull request as ready for review August 4, 2021 09:31

@india-kerle india-kerle left a comment

thanks @lizgzil for the pull request! I left a few comments - let me know if you have any additional questions or want me to go over anything specific in more detail.

@@ -0,0 +1,100 @@
import re

could be helpful to have a description of this file as well!

- Isn't a proper noun/number/quite a few other word types
- Isn't a word with numbers in (these are always garbage)
"""
not_skills_words = [

maybe make this a txt file and save it in inputs so you can add to the list as well and it won't be unruly in the code?

@lizgzil replied:

good idea!
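
A minimal sketch of the suggestion above, assuming a plain-text file with one word per line. The function names, the file layout, and the `#` comment convention are hypothetical and not taken from this PR; the digit check mirrors the docstring's rule that words containing numbers are discarded.

```python
import re
from pathlib import Path


def load_word_list(path):
    """Read one word per line, skipping blank lines and '#' comment lines."""
    words = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        word = line.strip().lower()
        if word and not word.startswith("#"):
            words.add(word)
    return words


def keep_candidate_words(tokens, not_skills_words):
    """Drop tokens that are in the custom list or that contain a digit."""
    return [
        token
        for token in tokens
        if token.lower() not in not_skills_words
        and not re.search(r"\d", token)
    ]
```

With the list in its own file, new words can be added without touching the code, and the same file can be reused by other pipeline steps.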

@@ -0,0 +1,275 @@
import logging

same as here - a small description could be helpful

This is done by running:

```
python -i skills_taxonomy_v2/pipeline/skills_extraction/get_sentence_embeddings.py --config_path 'skills_taxonomy_v2/config/skills_extraction/2021.08.02.yaml'
```

this code runs! although might be good to add python-Levenshtein to the requirements.txt because i keep getting warning messages to pip install it in order to suppress the message.

@lizgzil replied:

oh good to know!

@@ -0,0 +1,82 @@
# Skills Extraction

The aim of this pipeline is to extract skills from job adverts. There are 3 steps:

really helpful description!!

"""
Not an elegant solution - but if the data is too large (i think >5GB) you get
an error:
botocore.exceptions.ClientError: An error occurred (EntityTooLarge) when

this is annoying!! maybe worth talking to someone in data engineering i.e. Joel, Jack to see if they've dealt with this before and have a more elegant solution?
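
One possible alternative, offered as a sketch and not as what this PR does: S3 rejects a single PUT request larger than 5 GB, which matches the `EntityTooLarge` error quoted above, while boto3's managed transfer method `upload_fileobj` switches to multipart upload for large payloads. The function name `upload_json_gz`, the gzipped-JSON format, and the bucket and key values are assumptions made for illustration.

```python
import gzip
import io
import json


def upload_json_gz(s3_client, data, bucket, key):
    """Serialise `data` as gzipped JSON and upload it with a managed transfer.

    `s3_client` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
        gz.write(json.dumps(data).encode("utf-8"))
    buffer.seek(0)  # rewind so the upload reads from the start
    # upload_fileobj splits large payloads into multipart uploads,
    # so the single-request size limit does not apply.
    s3_client.upload_fileobj(Fileobj=buffer, Bucket=bucket, Key=key)
```

This still holds the whole payload in memory, so for very large outputs writing to a temporary file and calling `upload_file` may be a better fit.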

… words in a text file, add python-levenshtein to requirements
@lizgzil

lizgzil commented Aug 20, 2021

thanks for the review @india-kerle ! I've made your changes in the latest commit. Good call on the custom stop words going in their own txt file.

@lizgzil lizgzil merged commit 178c07b into dev Aug 20, 2021
@lizgzil lizgzil deleted the scale-up-extract-skills branch August 20, 2021 10:26