70 better naming #76

Merged: 28 commits merged into dev from 70_better_naming, Dec 10, 2021
Conversation

@india-kerle (Contributor) commented Nov 9, 2021

Hey Liz, I modified the files to work with the current skills pipeline. As a result, I've added the following scripts/READMEs for review in two folders:

  1. skills_taxonomy_v2/analysis/skills_extraction/:

    • Skill Naming Experiments.md - a summary of the types of experiments that were run for improving skill cluster naming, including key phrase generation experiments, more text cleaning and merging skill clusters. Good to read for context.
    • notebooks/better_skills_naming.py - code accompanying the experiments discussed in the README.md. Does NOT include code reflecting the call last Friday.
  2. skills_taxonomy_v2/pipeline/skills_extraction/:

    • skills_naming.py - slightly modified to accommodate updated code
    • skills_naming_utils.py - the main script to check out!

To run skill cluster naming, run:

python -i skills_taxonomy_v2/pipeline/skills_extraction/skills_naming.py --config_path 'skills_taxonomy_v2/config/skills_extraction/2021.12.07.yaml'

I'd focus mostly on skills_naming_utils.py - I've modified every function slightly to a) improve text cleaning and b) generate candidate ngrams per skill cluster. Thanks for your eyes on this!
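For a feel of the per-cluster approach, here is a minimal sketch of generating candidate ngrams from a single cluster's own sentences. It assumes gensim's Phrases as the collocation detector (the min_count and threshold parameters seen later in this PR suggest it, but that is an inference), and every name here is illustrative:

    from gensim.models import Phrases

    def candidate_ngrams_for_cluster(cluster_sentences, min_count=2, threshold=10):
        # Illustrative only: learn collocations from this one cluster's own
        # vocabulary, so the resulting labels stay "local" to the cluster.
        tokenised = [sentence.split() for sentence in cluster_sentences]
        bigram = Phrases(tokenised, min_count=min_count, threshold=threshold)
        candidates = set()
        for sentence in tokenised:
            for token in bigram[sentence]:
                if "_" in token:  # gensim joins detected phrases with "_"
                    candidates.add(token.replace("_", " "))
        return sorted(candidates)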

Also, I had some issues locally with Numba for the function get_output_config_stamped(), so I downgraded it in requirements.txt.

Cheers!

@india-kerle india-kerle requested a review from lizgzil November 9, 2021 13:54
@india-kerle india-kerle linked an issue (3 tasks) Nov 9, 2021 that may be closed by this pull request
@india-kerle india-kerle removed the request for review from lizgzil November 10, 2021 09:49
@india-kerle india-kerle requested review from lizgzil and removed request for georgerichardson and VolcanoBlue13 December 7, 2021 11:37
@lizgzil (Contributor) left a comment:

Looks good @india-kerle :)! I had a few comments on being on the safe side with some bits of the code - not sure if they are bugs, but they could be. Also, I was a bit confused as to whether you are merging the skills or not. Was that something you just did in the experiments? I don't see it in skills_naming_utils.py.

Strange about the Numba issues with get_output_config_stamped - this function is so basic I am surprised it is causing any issues :/

I will try to run this shortly and have a look at the names, but I just wanted to feed back these bits now.


1. More text cleaning - including getting rid of job-specific language, singularising terms, and getting rid of duplicate candidate ngram phrases, e.g. 'day day'
2. Generating candidate ngrams _per_ cluster - candidate ngrams were generated using each cluster's available vocabulary, hopefully creating more 'local' skill labels. Minimum descriptions were also used as cluster labels in the event that candidate ngrams were not generated.
3. Merging skill clusters - if both the skill cluster name AND the cluster centroid were very close in semantic space, skills were merged together (see the sketch after this thread).
@lizgzil (Contributor):
above you say that you aren't merging skills - is this step different somehow?
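
(For context, the merging criterion described in item 3 could look roughly like this sketch. It is a hypothetical reconstruction rather than code from this PR; name_embeds, centroids and the 0.9 thresholds are all invented.)

    from sklearn.metrics.pairwise import cosine_similarity

    def find_merge_pairs(name_embeds, centroids, name_thresh=0.9, centroid_thresh=0.9):
        # Flag cluster pairs whose name embeddings AND centroids are both close.
        name_sim = cosine_similarity(name_embeds)  # shape (n_clusters, n_clusters)
        centroid_sim = cosine_similarity(centroids)
        pairs = []
        for i in range(name_sim.shape[0]):
            for j in range(i + 1, name_sim.shape[0]):
                # merge only when BOTH signals agree
                if name_sim[i, j] > name_thresh and centroid_sim[i, j] > centroid_thresh:
                    pairs.append((i, j))
        return pairs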

Comment on lines 27 to 29
new_skill_sentences_path: "outputs/skills_extraction/extracted_skills/2021.11.05_sentences_skills_data.json"
new_skills_path: "outputs/skills_extraction/extracted_skills/2021.11.05_skills_data.json"
new_skills_embeds_path: "outputs/skills_extraction/reduced_embeddings/"
@lizgzil (Contributor):

I don't think these need to be prefixed with the new_ bit, unless I'm missing something?

["sentence id", "Cluster number predicted"]
]
merged_sents_embeds = pd.merge(
skills_embeds_df, skill_sentences_df, on="sentence id"
@lizgzil (Contributor):

For this, I've been doing on=['job id', 'sentence id'], since this combo is unique for each sentence in a particular job advert. Although it's true that the sentence embedding should be the same if the sentence is the same (regardless of which job advert it is from), I think there is potentially a problem with duplicated rows in the merge if you don't use on=['job id', 'sentence id']. Perhaps it makes no difference, but just to be on the safe side. I haven't investigated this, so I may be talking crap!
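
(A toy example of why the two-key merge is safer; the column names come from the snippet above, the data is invented:)

    import pandas as pd

    # The same "sentence id" can occur under two different job ids, so merging
    # on "sentence id" alone would cross-join those rows (4 rows instead of 2).
    skills_embeds_df = pd.DataFrame({
        "job id": ["a", "b"],
        "sentence id": [1, 1],
        "embedding": [[0.1, 0.2], [0.3, 0.4]],
    })
    skill_sentences_df = pd.DataFrame({
        "job id": ["a", "b"],
        "sentence id": [1, 1],
        "Cluster number predicted": [5, 7],
    })

    merged = pd.merge(skills_embeds_df, skill_sentences_df, on=["job id", "sentence id"])
    assert len(merged) == 2  # one row per sentence per job advert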

]

skills_df = pd.DataFrame(skills).T
skills_df["Mean embedding"] = sent_cluster_embeds.values()
@lizgzil (Contributor):

I guess this relies on the row order of skills_df being the same skill-number order as sent_cluster_embeds. Not sure if this is always going to be the case, so I'd just do something like:

skills_df["Mean embedding"] = skills_df["Skill number"].apply(lambda x: sent_cluster_embeds[x])

(You may need to create skills_df["Skill number"] from the index?)
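
(A self-contained version of that suggestion, with invented data; the point is that the lookup is keyed by skill number rather than by row position:)

    import pandas as pd

    skills = {0: {"Skill name": "excel"}, 1: {"Skill name": "teamwork"}}  # invented
    sent_cluster_embeds = {0: [0.1, 0.2], 1: [0.3, 0.4]}                  # invented

    skills_df = pd.DataFrame(skills).T
    skills_df["Skill number"] = skills_df.index
    # Look up each skill's mean embedding by its number, not by row order
    skills_df["Mean embedding"] = skills_df["Skill number"].apply(
        lambda x: sent_cluster_embeds[x]
    )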

Comment on lines 49 to 64

    sentence_embeds = {}
    all_sentence_embeds = []

    for i in tqdm(range(0, 8)):  # 8 files
        new_sentences_dict = load_s3_data(
            s3, bucket_name, new_skills_embeds_path + f"sentences_data_{i}.json"
        )
        all_sentence_embeds.append(new_sentences_dict)

    # https://stackoverflow.com/questions/57340332/how-do-you-combine-lists-of-multiple-dictionaries-in-python
    for k in all_sentence_embeds[0].keys():
        sentence_embeds[k] = sum(
            [skills_dict[k] for skills_dict in all_sentence_embeds], []
        )

@lizgzil (Contributor):

You may be able to use a defaultdict? Something like:

    from collections import defaultdict

    sentence_embeds = defaultdict(list)
    for i in tqdm(range(0, 8)):  # 8 files
        new_sentences_dict = load_s3_data(
            s3, bucket_name, new_skills_embeds_path + f"sentences_data_{i}.json"
        )
        for k, v in new_sentences_dict.items():
            sentence_embeds[k].extend(v)  # extend, not append, to match the sum(..., []) concatenation

Also, to note, instead of hardcoding the range(0, 8) bit, I have done it this way:

    reduced_embeddings_paths = get_s3_data_paths(
        s3,
        bucket_name,
        new_skills_embeds_path,
        file_types=["*sentences_data_*.json"],
    )

    sentences_data = pd.DataFrame()
    for reduced_embeddings_path in tqdm(reduced_embeddings_paths):
        sentences_data_i = load_s3_data(
            s3, bucket_name, reduced_embeddings_path  # load each file, not the directory prefix
        )
        ...


@india-kerle (Contributor, Author) commented Dec 8, 2021

Thanks @lizgzil! Yeah, I experimented with merging using data from the first batch, but didn't do any merging with the new skill clusters, based on our previous conversations. The merging comments are now just in the experiments.md bit! I'll make the changes you suggested in the meantime.

@lizgzil (Contributor) commented Dec 8, 2021

@india-kerle I was thinking it'd be nice if the output name of skills_naming.py was a bit more clearly linked to the input skills file version.

At the moment skills_naming.py inputs extracted_skills/2021.11.05_skills_data.json and outputs extracted_skills/2021.12.07_skills_data.json, so it's not clear that the second one is just the first one with the names added.

So I'd suggest skills_data_output_path = params["new_skills_path"].split('.json')[0] + '_named.json'
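
(With the paths above, that would give, for example:)

    params = {"new_skills_path": "outputs/skills_extraction/extracted_skills/2021.11.05_skills_data.json"}
    skills_data_output_path = params["new_skills_path"].split('.json')[0] + '_named.json'
    # -> "outputs/skills_extraction/extracted_skills/2021.11.05_skills_data_named.json"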

@@ -100,18 +167,22 @@ def clean_cluster_descriptions(sentences_data):
        replace_ngrams(sentence, ngram_words) for sentence in cluster_docs_cleaned
    ]

-    cluster_descriptions[cluster_num] = cluster_docs_clean
+    cluster_descriptions[cluster_number] = cluster_docs_clean
@lizgzil (Contributor):

I think this is overwriting on every iteration of the for loop, as cluster_number is just one number.
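
(A reduction of the issue with invented values: because the dict key never changes inside the loop, only the last iteration survives.)

    cluster_descriptions = {}
    cluster_number = 3  # fixed before the loop

    for cluster_num, cluster_docs_clean in enumerate(["docs a", "docs b"]):
        cluster_descriptions[cluster_number] = cluster_docs_clean  # key never varies

    print(cluster_descriptions)  # {3: 'docs b'} - earlier iterations are lost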



-def clean_cluster_descriptions(sentences_data):
+def clean_cluster_description(sentences, cluster_number):
@lizgzil (Contributor):

I don't think the clean_cluster_description or get_clean_ngrams functions need cluster_number as an input at all, since they only process one skill at a time anyway.

Comment on lines 272 to 273

    candidate_ngrams, cluster_descriptions = get_clean_ngrams(
        sents, skills_num, ngram, min_count, threshold
@lizgzil (Contributor):

I think if you remove skills_num as an input to get_clean_ngrams, then the output cluster_descriptions can just be a list rather than a one-key dictionary (so later you can do "Texts": cluster_descriptions rather than "Texts": cluster_descriptions[skills_num]).
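
(A sketch of the suggested shape; the body is stubbed and all data here is invented, only the signature and return type matter:)

    def get_clean_ngrams(sents, ngram, min_count, threshold):
        # Hypothetical refactor: no skills_num argument, one cluster per call
        cluster_descriptions = [s.lower().strip() for s in sents]  # a plain list
        candidate_ngrams = []  # ngram generation stubbed out for the sketch
        return candidate_ngrams, cluster_descriptions

    candidate_ngrams, cluster_descriptions = get_clean_ngrams(
        ["Experience with Excel", "Advanced Excel skills"],
        ngram=2, min_count=1, threshold=10,
    )
    row = {"Texts": cluster_descriptions}  # no cluster_descriptions[skills_num] needed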

@india-kerle india-kerle merged commit ef386a8 into dev Dec 10, 2021
@india-kerle india-kerle deleted the 70_better_naming branch December 10, 2021 18:37
Successfully merging this pull request may close these issues.

Better naming of skills