70 better naming #76
Conversation
Looks good @india-kerle :)! I had a few comments on being on the safe side with some bits of the code - not sure if they are bugs, but they could be. Also, I was a bit confused as to whether you are merging the skills or not - was that something you just did in the experiments? I don't see it in skills_naming_utils.py.
Strange about the numba issues with get_output_config_stamped - this function is so basic I am surprised it is causing any issues :/
I will try to run this shortly and have a look at the names, but I just wanted to feed back these bits now.
1. More text cleaning - incl. getting rid of job specific language, singularising terms, getting rid of duplicate candidate ngram phrases i.e. 'day day'
2. Generating candidate ngrams _per_ cluster - candidate ngrams were generated using each cluster's available vocabulary, hopefully creating more 'local' skill labels. Minimum descriptions were also used as cluster labels in the event that candidate ngrams were not generated.
3. Merging skill clusters - if both the skill cluster name AND the cluster centroid were very close in semantic space, skills were merged together.
above you say that you aren't merging skills - is this step different somehow?
new_skill_sentences_path: "outputs/skills_extraction/extracted_skills/2021.11.05_sentences_skills_data.json"
new_skills_path: "outputs/skills_extraction/extracted_skills/2021.11.05_skills_data.json"
new_skills_embeds_path: "outputs/skills_extraction/reduced_embeddings/"
I don't think these need to be prefixed with the new_ bit, unless I'm missing something?
["sentence id", "Cluster number predicted"] | ||
] | ||
merged_sents_embeds = pd.merge( | ||
skills_embeds_df, skill_sentences_df, on="sentence id" |
For this, I've been doing on=['job id', 'sentence id'], since this combo is unique for each sentence in a particular job advert. Although it's true that the sentence embedding should be the same if the sentence is the same (regardless of which job advert it is from), I think there is potentially a problem with duplicated rows in the merge if you don't use on=['job id', 'sentence id']. Perhaps it makes no difference, but just to be on the safe side. I haven't investigated this, so I may be talking crap!
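A minimal sketch of the difference, using hypothetical toy frames in place of skills_embeds_df and skill_sentences_df:

import pandas as pd

# Hypothetical toy data: the same "sentence id" appears in two different job adverts.
skills_embeds_df = pd.DataFrame(
    {
        "job id": ["job_1", "job_2"],
        "sentence id": [0, 0],
        "embedding": [[0.1, 0.2], [0.3, 0.4]],
    }
)
skill_sentences_df = pd.DataFrame(
    {
        "job id": ["job_1", "job_2"],
        "sentence id": [0, 0],
        "Cluster number predicted": [5, 7],
    }
)

# Merging on the (job id, sentence id) pair keeps one row per sentence per advert (2 rows);
# merging on "sentence id" alone would match every duplicated id against every other,
# giving 4 rows here.
merged_sents_embeds = pd.merge(
    skills_embeds_df, skill_sentences_df, on=["job id", "sentence id"]
)
print(len(merged_sents_embeds))  # 2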
skills_df = pd.DataFrame(skills).T
skills_df["Mean embedding"] = sent_cluster_embeds.values()
I guess this is reliant on the order of skill_df being in the same skill number order as sent_cluster_embeds. I'm not sure if this is always going to be the case, so I'd just do something like:

skills_df["Mean embedding"] = skills_df["Skill number"].apply(lambda x: sent_cluster_embeds[x])

(You may need to create skills_df["Skill number"] with the index?)
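A minimal sketch of that suggestion, with hypothetical stand-ins for skills and sent_cluster_embeds (sent_cluster_embeds is assumed here to be a dict keyed by skill number):

import pandas as pd

# Hypothetical data: mean embeddings keyed by skill number, deliberately out of order.
skills = {0: {"Skill name": "excel"}, 1: {"Skill name": "teamwork"}}
sent_cluster_embeds = {1: [0.3, 0.4], 0: [0.1, 0.2]}

skills_df = pd.DataFrame(skills).T

# Make the skill number an explicit column (after the transpose it is the index)...
skills_df["Skill number"] = skills_df.index

# ...then look each embedding up by skill number instead of relying on dict ordering.
skills_df["Mean embedding"] = skills_df["Skill number"].apply(
    lambda x: sent_cluster_embeds[x]
)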
sentence_embeds = {}
all_sentence_embeds = []

for i in tqdm(range(0, 8)):  # 8 files
    new_sentences_dict = load_s3_data(
        s3, bucket_name, new_skills_embeds_path + f"sentences_data_{i}.json"
    )
    all_sentence_embeds.append(new_sentences_dict)

# https://stackoverflow.com/questions/57340332/how-do-you-combine-lists-of-multiple-dictionaries-in-python
for k in all_sentence_embeds[0].keys():
    sentence_embeds[k] = sum(
        [skills_dict[k] for skills_dict in all_sentence_embeds], []
    )
You may be able to use a defaultdict? Something like:

from collections import defaultdict

sentence_embeds = defaultdict(list)
for i in tqdm(range(0, 8)):  # 8 files
    new_sentences_dict = load_s3_data(
        s3, bucket_name, new_skills_embeds_path + f"sentences_data_{i}.json"
    )
    for k, v in new_sentences_dict.items():
        # each value is a list, so extend keeps the combined list flat
        # (matching the sum(..., []) behaviour above) rather than nesting lists
        sentence_embeds[k].extend(v)

Also to note, instead of hardcoding the range(0, 8) bit, I have done it this way:

reduced_embeddings_paths = get_s3_data_paths(
    s3,
    bucket_name,
    new_skills_embeds_path,
    file_types=["*sentences_data_*.json"],
)

sentences_data = pd.DataFrame()
for reduced_embeddings_path in tqdm(reduced_embeddings_paths):
    sentences_data_i = load_s3_data(
        s3, bucket_name, reduced_embeddings_path  # load each file found, not the prefix
    )
    ...
Thanks @lizgzil! Yeah, I experimented with merging with data from the first batch, but didn't do any merging with the new skill clusters based on our previous conversations. The merging comments are now just in the experiments.md bit! I'll make the changes you suggested in the meantime.
@india-kerle I was thinking it'd be nice if the output name of [...]. At the moment [...]. So I'd suggest [...]
@@ -100,18 +167,22 @@ def clean_cluster_descriptions(sentences_data):
             replace_ngrams(sentence, ngram_words) for sentence in cluster_docs_cleaned
         ]

-        cluster_descriptions[cluster_num] = cluster_docs_clean
+        cluster_descriptions[cluster_number] = cluster_docs_clean
I think this is overwriting every iteration of the for loop, as cluster_number is just one number?
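A tiny hypothetical repro of that concern, just to illustrate why a fixed key means only the last iteration's value survives:

# Hypothetical minimal example: cluster_number stays fixed while the loop runs,
# so every iteration writes to the same key.
cluster_descriptions = {}
cluster_number = 3
for cluster_docs_clean in (["skill a"], ["skill b"], ["skill c"]):
    cluster_descriptions[cluster_number] = cluster_docs_clean

print(cluster_descriptions)  # {3: ['skill c']} - earlier iterations are lost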
-def clean_cluster_descriptions(sentences_data):
+def clean_cluster_description(sentences, cluster_number):
I don't think the clean_cluster_description or get_clean_ngrams functions need cluster_number as an input at all, since they only process one skill at a time anyway.
candidate_ngrams, cluster_descriptions = get_clean_ngrams(
    sents, skills_num, ngram, min_count, threshold
I think if you remove skills_num as an input to get_clean_ngrams, then the output cluster_descriptions can just be a list rather than a one-key dictionary (so later you can do "Texts": cluster_descriptions rather than "Texts": cluster_descriptions[skills_num]).
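A rough sketch of how that refactor could look - the function bodies below are hypothetical stand-ins; only the simplified signature and the list-valued cluster_descriptions are the point:

from typing import List, Tuple


def clean_cluster_description(sentence: str) -> str:
    # Hypothetical stand-in for the real cleaning logic.
    return sentence.lower().strip()


def get_clean_ngrams(
    sents: List[str], ngram: int, min_count: int, threshold: float
) -> Tuple[List[str], List[str]]:
    # Operates on one skill cluster at a time, so no skills_num argument is needed
    # and the cleaned descriptions come back as a plain list.
    cluster_descriptions = [clean_cluster_description(s) for s in sents]
    # ... candidate ngram generation over cluster_descriptions would go here ...
    candidate_ngrams: List[str] = []
    return candidate_ngrams, cluster_descriptions


candidate_ngrams, cluster_descriptions = get_clean_ngrams(
    ["Uses Excel daily", "Builds Excel dashboards"], ngram=2, min_count=1, threshold=0.5
)
skill_data = {"Texts": cluster_descriptions}  # a plain list, no [skills_num] lookup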
Hey Liz - I modified the files to work with the current skills pipeline. As a result, I've added the following scripts/READMEs for review in two folders:

skills_taxonomy_v2/analysis/skills_extraction/:
- Skill Naming Experiments.md - a summary of the types of experiments that were run for improving skill cluster naming, incl. key phrase generation experiments, more text cleaning and merging skill clusters. Good to read for context.
- notebooks/better_skills_naming.py - code accompanying the experiments discussed in the README.md - does NOT include code reflected from the call last Friday.

skills_taxonomy_v2/pipeline/skills_extraction/:
- skills_naming.py - (slightly modified to accommodate updated code)
- skills_naming_utils.py - the main script to check out!

To run skill cluster naming, run:

python -i skills_taxonomy_v2/pipeline/skills_extraction/skills_naming.py --config_path 'skills_taxonomy_v2/config/skills_extraction/2021.12.07.yaml'

I'd focus mostly on skills_naming_utils.py - I've modified every function slightly to a) improve text cleaning and b) generate candidate ngrams per skill cluster. Thanks for your eyes on this!

Also, I had some issues locally with Numba for the function get_output_config_stamped(), so I downgraded it in requirements.txt.

Cheers!