Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* Correct mistake in skills extraction readme
* Add a small script to calculate mean embeddings for each skill
* Update new config for skills taxonomy
* Update build_taxonomy.py to have the option of not creating a level D, or of handling cases where skill names might not exist. Also separate out loading of data and include an option for the new format of skills data
* Add function in build taxonomy utils to append clustering with manual interventions
* Add JSON for manual naming of level A groups and manual grouping of level B skills
* Add other analysis bits to readme
* Add new manual mapper dict from consultation with India and George. Also update some of the manual mapper functions
* Edits to the naming process: silence some annoying nltk logs, add logging to skills_naming_utils, use the same config file as the newest skill extraction, fix a bug in clean_cluster_description, and clean out some unnecessary processing bits
* The embeddings loading wasn't loading all of them for some reason, so changed how this is done. Also added some safety bits in get_skill_info, which now saves out every 100 skills rather than waiting a whole day until saving
* Remove unnecessary imports from naming skills
* Re-add numba and remove unnecessary imports
* Fix index bug in cluster embeddings
* Add some naming fixes to the building of the taxonomy and outputting: no duplicate names, also output centroid
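The commit message mentions that get_skill_info now saves out every 100 skills instead of only at the end of a long run. The diff for that function is not shown here, so the following is only a minimal sketch of the general pattern. The helper name `process_with_checkpoints`, its parameters, and the use of a local JSON file instead of S3 are all assumptions made for illustration.

```python
import json
import os
import tempfile


def process_with_checkpoints(items, process_fn, output_path, every=100):
    """Apply process_fn to each value in items, saving partial results periodically.

    Hypothetical helper: it writes to a local JSON file rather than S3.
    """
    results = {}
    for i, (key, value) in enumerate(items.items(), start=1):
        results[str(key)] = process_fn(value)
        # Save after every `every` items, and once more after the final item
        if i % every == 0 or i == len(items):
            with open(output_path, "w") as f:
                json.dump(results, f)
    return results


# Toy usage: 250 fake "skills", each processed by upper-casing its text
skills = {n: f"skill text {n}" for n in range(250)}
checkpoint_path = os.path.join(tempfile.mkdtemp(), "skill_info_partial.json")
named = process_with_checkpoints(skills, str.upper, checkpoint_path, every=100)
```

With this shape, an interrupted run loses at most the last `every - 1` results, at the cost of rewriting the whole file on each checkpoint.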
Showing 18 changed files with 734 additions and 367 deletions.
@@ -28,4 +28,4 @@ geopandas
 rtree
 urllib3
 shapely
-pattern
+pattern
44 changes: 0 additions & 44 deletions
skills_taxonomy_v2/config/skills_extraction/2021.11.09.yaml
This file was deleted.
46 changes: 0 additions & 46 deletions
skills_taxonomy_v2/config/skills_extraction/2021.12.07.yaml
This file was deleted.
@@ -0,0 +1,18 @@
flows:
  build_taxonomy_flow:
    params:
      reduced_embeddings_dir: "outputs/skills_extraction/reduced_embeddings/"
      clustered_sentences_path: "outputs/skills_extraction/extracted_skills/2021.11.05_sentences_skills_data.json"
      skills_data_path: "outputs/skills_extraction/extracted_skills/2021.11.05_skills_data.json"
      skills_names_data_path: "outputs/skills_extraction/extracted_skills/2021.11.05_skills_data_named.json"
      cluster_column_name: "Cluster number predicted"
      embedding_column_name: "embedding"
      skills_data_texts_name: "Sentences"
      level_c_n: 250
      level_b_n: 60
      k_means_max_iter: 5000
      check_low_siloutte_b: False
      create_level_d: False
      level_names_tfidif_n: 3
      level_a_manual_clusters_path: "skills_taxonomy_v2/utils/2021.12.20_level_a_mapper_dict.json"
      output_dir: "outputs/skills_taxonomy/"
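How the repository reads this config is not shown in this diff. As a rough sketch, the parameters could be pulled out with PyYAML as below. The nesting of `params` under `flows` and `build_taxonomy_flow` is assumed from the key order, the config text is a trimmed inline copy, and PyYAML is assumed to be installed.

```python
import yaml  # PyYAML, assumed to be installed

# A trimmed copy of the config, inlined so the sketch is self-contained.
# The nesting shown here is an assumption based on the order of the keys.
CONFIG_TEXT = """
flows:
  build_taxonomy_flow:
    params:
      level_c_n: 250
      level_b_n: 60
      k_means_max_iter: 5000
      create_level_d: False
      output_dir: "outputs/skills_taxonomy/"
"""

config = yaml.safe_load(CONFIG_TEXT)
params = config["flows"]["build_taxonomy_flow"]["params"]

# A bare False in YAML is parsed as a Python bool, not the string "False"
if not params["create_level_d"]:
    print("Skipping level D; building", params["level_b_n"], "level B groups")
```

Because `create_level_d` arrives as a real boolean, it can be used directly in a condition without string comparison.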
42 changes: 42 additions & 0 deletions
skills_taxonomy_v2/pipeline/skills_extraction/skills_naming_embeddings.py
@@ -0,0 +1,42 @@
"""
For the skill naming we need the mean embedding for each skill.
"""
import pandas as pd
import numpy as np
import boto3
from tqdm import tqdm

from collections import defaultdict

from skills_taxonomy_v2.getters.s3_data import get_s3_data_paths, save_to_s3, load_s3_data
from skills_taxonomy_v2 import BUCKET_NAME

s3 = boto3.resource("s3")

# Load skills
# The sentences ID + cluster num
sentence_embs = load_s3_data(s3, BUCKET_NAME, "outputs/skills_extraction/extracted_skills/2021.11.05_sentences_skills_data.json")
sentence_embs = pd.DataFrame(sentence_embs)
sentence_embs = sentence_embs[sentence_embs["Cluster number predicted"] >= 0]

# Load embeddings
sentence_embeddings_dirs = get_s3_data_paths(
    s3, BUCKET_NAME, 'outputs/skills_extraction/word_embeddings/data/2021.11.05', file_types=["*embeddings.json"])

skill_embeddings = defaultdict(list)
for embedding_dir in tqdm(sentence_embeddings_dirs):
    sentence_embeddings = load_s3_data(s3, BUCKET_NAME, embedding_dir)
    sentence_embeddings_df = pd.DataFrame(sentence_embeddings)
    temp_merge = pd.merge(sentence_embs, sentence_embeddings_df, how="inner", left_on=['job id', 'sentence id'], right_on=[0,1])
    for skill_num, embeddings in temp_merge.groupby('Cluster number predicted'):
        skill_embeddings[skill_num].extend(embeddings[3].tolist())

# Get mean embedding for each skill number
print("Getting mean embeddings")
mean_skill_embeddings = {}
for skill_num, embeddings_list in skill_embeddings.items():
    mean_skill_embeddings[skill_num] = np.mean(embeddings_list, axis=0).tolist()

# Save out
print("Saving mean embeddings")
save_to_s3(s3, BUCKET_NAME, mean_skill_embeddings, 'outputs/skills_extraction/extracted_skills/2021.11.05_skill_mean_embeddings.json')
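The merge-then-average logic of this script can be tried without S3 access. The sketch below runs the same steps on toy in-memory data. The row layout `[job id, sentence id, text, embedding]` is an assumption (the script only shows that columns 0, 1 and 3 are used), as is the meaning of negative cluster numbers as "no skill assigned". Unlike the script, the keys are cast to plain `int` here.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

# Toy stand-in for the sentences data: the skill cluster of each sentence.
# Negative cluster numbers are dropped, mirroring the >= 0 filter in the script.
sentence_skills = pd.DataFrame(
    {
        "job id": ["a", "a", "b", "b"],
        "sentence id": [0, 1, 0, 1],
        "Cluster number predicted": [0, 1, 0, -1],
    }
)
sentence_skills = sentence_skills[sentence_skills["Cluster number predicted"] >= 0]

# Toy stand-in for the embedding files; each row is assumed to be
# [job id, sentence id, text, embedding]
embedding_batches = [
    [["a", 0, "use excel", [1.0, 0.0]], ["a", 1, "drive a van", [0.0, 2.0]]],
    [["b", 0, "excel skills", [3.0, 4.0]], ["b", 1, "misc", [9.0, 9.0]]],
]

skill_embeddings = defaultdict(list)
for batch in embedding_batches:
    batch_df = pd.DataFrame(batch)  # columns are the integers 0 to 3
    merged = pd.merge(
        sentence_skills,
        batch_df,
        how="inner",
        left_on=["job id", "sentence id"],
        right_on=[0, 1],
    )
    # Collect every embedding that belongs to each skill in this batch
    for skill_num, group in merged.groupby("Cluster number predicted"):
        skill_embeddings[int(skill_num)].extend(group[3].tolist())

# Average the collected embeddings element-wise for each skill
mean_skill_embeddings = {
    skill_num: np.mean(embeddings, axis=0).tolist()
    for skill_num, embeddings in skill_embeddings.items()
}
print(mean_skill_embeddings)  # {0: [2.0, 2.0], 1: [0.0, 2.0]}
```

Accumulating raw embeddings per skill across batches and averaging once at the end gives the true mean over all sentences, which averaging per batch would not.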