70 better naming #76

Merged: 28 commits merged into dev from 70_better_naming, Dec 10, 2021
Conversation

@india-kerle (Contributor) commented Nov 9, 2021

Hey Liz, I modified the files to work with the current skills pipeline. As a result, I've added the following scripts/READMEs for review in two folders:

  1. skills_taxonomy_v2/analysis/skills_extraction/:

    • Skill Naming Experiments.md - a summary of the types of experiments that were run for improving skill cluster naming, including key phrase generation experiments, more text cleaning and merging skill clusters. Good to read for context.
    • notebooks/better_skills_naming.py - code accompanying the experiments discussed in the README.md. Does NOT include code reflecting the call last Friday.
  2. skills_taxonomy_v2/pipeline/skills_extraction/:

    • skills_naming.py - slightly modified to accommodate updated code
    • skills_naming_utils.py - the main script to check out!

To run skill cluster naming, run:

python -i skills_taxonomy_v2/pipeline/skills_extraction/skills_naming.py --config_path 'skills_taxonomy_v2/config/skills_extraction/2021.12.07.yaml'

I'd focus mostly on skills_naming_utils.py - I've modified every function slightly to a) improve text cleaning and b) generate candidate ngrams per skill cluster. Thanks for your eyes on this!
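For a feel of the per-cluster approach, here is a minimal sketch of generating candidate ngrams from a single cluster's own sentences. It assumes gensim's Phrases as the collocation detector (the min_count and threshold parameters seen later in this PR suggest it, but that is an inference), and every name here is illustrative:

    from gensim.models import Phrases

    def candidate_ngrams_for_cluster(cluster_sentences, min_count=2, threshold=10):
        # Illustrative only: learn collocations from this one cluster's own
        # vocabulary, so the resulting labels stay "local" to the cluster.
        tokenised = [sentence.split() for sentence in cluster_sentences]
        bigram = Phrases(tokenised, min_count=min_count, threshold=threshold)
        candidates = set()
        for sentence in tokenised:
            for token in bigram[sentence]:
                if "_" in token:  # gensim joins detected phrases with "_"
                    candidates.add(token.replace("_", " "))
        return sorted(candidates)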

Also, I had some issues locally with Numba for the function get_output_config_stamped(), so I downgraded it in requirements.txt.

Cheers!

@india-kerle india-kerle requested a review from lizgzil November 9, 2021 13:54
@india-kerle india-kerle linked an issue (3 tasks) Nov 9, 2021 that may be closed by this pull request
@india-kerle india-kerle removed the request for review from lizgzil November 10, 2021 09:49
@india-kerle india-kerle requested review from lizgzil and removed request for georgerichardson and VolcanoBlue13 December 7, 2021 11:37
@lizgzil (Contributor) left a comment:

Looks good @india-kerle :)! I had a few comments on being on the safe side with some bits of the code - not sure if they are bugs, but they could be. Also, I was a bit confused as to whether you are merging the skills or not. Was that something you just did in the experiments? I don't see it in skills_naming_utils.py.

Strange about the Numba issues with get_output_config_stamped - this function is so basic I am surprised it is causing any issues :/

I will try to run this shortly and have a look at the names, but I just wanted to feed back these bits now.


1. More text cleaning - including getting rid of job-specific language, singularising terms, and getting rid of duplicate candidate ngram phrases, e.g. 'day day'
2. Generating candidate ngrams _per_ cluster - candidate ngrams were generated using each cluster's available vocabulary, hopefully creating more 'local' skill labels. Minimum descriptions were also used as cluster labels in the event that candidate ngrams were not generated.
3. Merging skill clusters - if both the skill cluster name AND the cluster centroid were very close in semantic space, skills were merged together (see the sketch after this thread).
@lizgzil (Contributor):
above you say that you aren't merging skills - is this step different somehow?
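
(For context, the merging criterion described in item 3 could look roughly like this sketch. It is a hypothetical reconstruction rather than code from this PR; name_embeds, centroids and the 0.9 thresholds are all invented.)

    from sklearn.metrics.pairwise import cosine_similarity

    def find_merge_pairs(name_embeds, centroids, name_thresh=0.9, centroid_thresh=0.9):
        # Flag cluster pairs whose name embeddings AND centroids are both close.
        name_sim = cosine_similarity(name_embeds)  # shape (n_clusters, n_clusters)
        centroid_sim = cosine_similarity(centroids)
        pairs = []
        for i in range(name_sim.shape[0]):
            for j in range(i + 1, name_sim.shape[0]):
                # merge only when BOTH signals agree
                if name_sim[i, j] > name_thresh and centroid_sim[i, j] > centroid_thresh:
                    pairs.append((i, j))
        return pairs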

Comment on lines 27 to 29
new_skill_sentences_path: "outputs/skills_extraction/extracted_skills/2021.11.05_sentences_skills_data.json"
new_skills_path: "outputs/skills_extraction/extracted_skills/2021.11.05_skills_data.json"
new_skills_embeds_path: "outputs/skills_extraction/reduced_embeddings/"
@lizgzil (Contributor):

I don't think these need to be prefixed with the new_ bit, unless I'm missing something?

["sentence id", "Cluster number predicted"]
]
merged_sents_embeds = pd.merge(
skills_embeds_df, skill_sentences_df, on="sentence id"
@lizgzil (Contributor):

For this, I've been doing on=['job id', 'sentence id'], since this combo is unique for each sentence in a particular job advert. Although it's true that the sentence embedding should be the same if the sentence is the same (regardless of which job advert it is from), I think there is potentially a problem with duplicated rows in the merge if you don't use on=['job id', 'sentence id']. Perhaps it makes no difference, but just to be on the safe side. I haven't investigated this, so I may be talking crap!
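
(A toy example of why the two-key merge is safer; the column names come from the snippet above, the data is invented:)

    import pandas as pd

    # The same "sentence id" can occur under two different job ids, so merging
    # on "sentence id" alone would cross-join those rows (4 rows instead of 2).
    skills_embeds_df = pd.DataFrame({
        "job id": ["a", "b"],
        "sentence id": [1, 1],
        "embedding": [[0.1, 0.2], [0.3, 0.4]],
    })
    skill_sentences_df = pd.DataFrame({
        "job id": ["a", "b"],
        "sentence id": [1, 1],
        "Cluster number predicted": [5, 7],
    })

    merged = pd.merge(skills_embeds_df, skill_sentences_df, on=["job id", "sentence id"])
    assert len(merged) == 2  # one row per sentence per job advert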

]

skills_df = pd.DataFrame(skills).T
skills_df["Mean embedding"] = sent_cluster_embeds.values()
@lizgzil (Contributor):

I guess this relies on the row order of skills_df being the same skill-number order as sent_cluster_embeds. Not sure if this is always going to be the case, so I'd just do something like:

skills_df["Mean embedding"] = skills_df["Skill number"].apply(lambda x: sent_cluster_embeds[x])

(You may need to create skills_df["Skill number"] from the index?)
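
(A self-contained version of that suggestion, with invented data; the point is that the lookup is keyed by skill number rather than by row position:)

    import pandas as pd

    skills = {0: {"Skill name": "excel"}, 1: {"Skill name": "teamwork"}}  # invented
    sent_cluster_embeds = {0: [0.1, 0.2], 1: [0.3, 0.4]}                  # invented

    skills_df = pd.DataFrame(skills).T
    skills_df["Skill number"] = skills_df.index
    # Look up each skill's mean embedding by its number, not by row order
    skills_df["Mean embedding"] = skills_df["Skill number"].apply(
        lambda x: sent_cluster_embeds[x]
    )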

Comment on lines 49 to 64

    sentence_embeds = {}
    all_sentence_embeds = []

    for i in tqdm(range(0, 8)):  # 8 files
        new_sentences_dict = load_s3_data(
            s3, bucket_name, new_skills_embeds_path + f"sentences_data_{i}.json"
        )
        all_sentence_embeds.append(new_sentences_dict)

    # https://stackoverflow.com/questions/57340332/how-do-you-combine-lists-of-multiple-dictionaries-in-python
    for k in all_sentence_embeds[0].keys():
        sentence_embeds[k] = sum(
            [skills_dict[k] for skills_dict in all_sentence_embeds], []
        )

@lizgzil (Contributor):

You may be able to use a defaultdict? Something like:

    from collections import defaultdict

    sentence_embeds = defaultdict(list)
    for i in tqdm(range(0, 8)):  # 8 files
        new_sentences_dict = load_s3_data(
            s3, bucket_name, new_skills_embeds_path + f"sentences_data_{i}.json"
        )
        for k, v in new_sentences_dict.items():
            sentence_embeds[k].extend(v)  # extend, not append, to match the sum(..., []) concatenation

Also, to note, instead of hardcoding the range(0, 8) bit, I have done it this way:

    reduced_embeddings_paths = get_s3_data_paths(
        s3,
        bucket_name,
        new_skills_embeds_path,
        file_types=["*sentences_data_*.json"],
    )

    sentences_data = pd.DataFrame()
    for reduced_embeddings_path in tqdm(reduced_embeddings_paths):
        sentences_data_i = load_s3_data(
            s3, bucket_name, reduced_embeddings_path  # load each file, not the directory prefix
        )
        ...


@india-kerle (Contributor, Author) commented Dec 8, 2021

Thanks @lizgzil! Yeah, I experimented with merging using data from the first batch, but didn't do any merging with the new skill clusters, based on our previous conversations. The merging comments are now just in the experiments.md bit! I'll make the changes you suggested in the meantime.

@lizgzil (Contributor) commented Dec 8, 2021

@india-kerle I was thinking it'd be nice if the output name of skills_naming.py was a bit more clearly linked to the input skills file version.

At the moment skills_naming.py inputs extracted_skills/2021.11.05_skills_data.json and outputs extracted_skills/2021.12.07_skills_data.json, so it's not clear that the second one is just the first one with the names added.

So I'd suggest skills_data_output_path = params["new_skills_path"].split('.json')[0] + '_named.json'
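
(With the paths above, that would give, for example:)

    params = {"new_skills_path": "outputs/skills_extraction/extracted_skills/2021.11.05_skills_data.json"}
    skills_data_output_path = params["new_skills_path"].split('.json')[0] + '_named.json'
    # -> "outputs/skills_extraction/extracted_skills/2021.11.05_skills_data_named.json"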

@@ -100,18 +167,22 @@ def clean_cluster_descriptions(sentences_data):
        replace_ngrams(sentence, ngram_words) for sentence in cluster_docs_cleaned
    ]

-    cluster_descriptions[cluster_num] = cluster_docs_clean
+    cluster_descriptions[cluster_number] = cluster_docs_clean
@lizgzil (Contributor):

I think this is overwriting on every iteration of the for loop, as cluster_number is just one number.
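
(A reduction of the issue with invented values: because the dict key never changes inside the loop, only the last iteration survives.)

    cluster_descriptions = {}
    cluster_number = 3  # fixed before the loop

    for cluster_num, cluster_docs_clean in enumerate(["docs a", "docs b"]):
        cluster_descriptions[cluster_number] = cluster_docs_clean  # key never varies

    print(cluster_descriptions)  # {3: 'docs b'} - earlier iterations are lost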



-def clean_cluster_descriptions(sentences_data):
+def clean_cluster_description(sentences, cluster_number):
@lizgzil (Contributor):

I don't think the clean_cluster_description or get_clean_ngrams functions need cluster_number as an input at all, since they only process one skill at a time anyway.

Comment on lines 272 to 273

    candidate_ngrams, cluster_descriptions = get_clean_ngrams(
        sents, skills_num, ngram, min_count, threshold
@lizgzil (Contributor):

I think if you remove skills_num as an input to get_clean_ngrams, then the output cluster_descriptions can just be a list rather than a one-key dictionary (so later you can do "Texts": cluster_descriptions rather than "Texts": cluster_descriptions[skills_num]).
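
(A sketch of the suggested shape; the body is stubbed and all data here is invented, only the signature and return type matter:)

    def get_clean_ngrams(sents, ngram, min_count, threshold):
        # Hypothetical refactor: no skills_num argument, one cluster per call
        cluster_descriptions = [s.lower().strip() for s in sents]  # a plain list
        candidate_ngrams = []  # ngram generation stubbed out for the sketch
        return candidate_ngrams, cluster_descriptions

    candidate_ngrams, cluster_descriptions = get_clean_ngrams(
        ["Experience with Excel", "Advanced Excel skills"],
        ngram=2, min_count=1, threshold=10,
    )
    row = {"Texts": cluster_descriptions}  # no cluster_descriptions[skills_num] needed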

@india-kerle india-kerle merged commit ef386a8 into dev Dec 10, 2021
@india-kerle india-kerle deleted the 70_better_naming branch December 10, 2021 18:37
Successfully merging this pull request may close these issues.

Better naming of skills