Sample tk data #73
Conversation
… sampled data dates as compared to all data, and finally some tweaks to the predict sentence class script to process this new form of data
with Pool(4) as pool:  # 4 cpus
    partial_split_sentence = partial(split_sentence, nlp=nlp, min_length=30)
    split_sentence_pool_output = pool.map(partial_split_sentence, data)
logger.info(f"Splitting sentences took {time.time() - start_time} seconds")
this takes 5 minutes
now down to <5 seconds thanks to comments by @jaklinger
if sentences:
    logger.info(f"Transforming skill sentences ...")
    sentences_vec = sent_classifier.transform(sentences)
this takes about 10 minutes
First note (another coming I think, but have a meeting now!):
Second note, also applying to the sentence processing: quite often there is an overhead in creating threads, so rather than doing 10000 operations over 4 cores in 2500 threads, you can do 4 x 2500 operations over 4 cores in 4 threads. In practice, a more efficient way to do this is by splitting the data into chunks and then flattening the output. Potentially here you will make a saving of another factor of 10 on the sentence splitting:

def split_sentence_over_chunk(chunk, nlp, min_length):
    partial_split_sentence = partial(split_sentence, nlp=nlp, min_length=min_length)
    return list(map(partial_split_sentence, chunk))

def make_chunks(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

...

with Pool(4) as pool:  # 4 cpus
    chunks = make_chunks(data, 1000)  # chunks of 1000 sentences
    partial_split_sentence = partial(split_sentence_over_chunk, nlp=nlp, min_length=30)
    # NB the output will be a list of lists, so make sure to flatten after this!
    split_sentence_pool_output = pool.map(partial_split_sentence, chunks)
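To keep the downstream code unchanged, the per-chunk results can then be flattened back into the shape the original un-chunked pool.map call produced. A minimal sketch, assuming split_sentence_pool_output is the list of per-chunk lists from above:

from itertools import chain

# flatten one level: list of chunks -> flat list of per-job-advert results
split_sentences = list(chain.from_iterable(split_sentence_pool_output))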
General comment: you could get a speed-up of around 100x by switching the pipeline to metaflow + batch. I suspect that this would take just a couple of hours to run for your whole dataset, so even if it took 5 days to write it would still be worth it, not taking into account additional development cycles of batches of 5 days 😄

from metaflow import FlowSpec, batch, step
from sentence_transformers import SentenceTransformer

class SentenceFlow(FlowSpec):
    @step
    def start(self):
        self.file_names = job_ad_file_names
        self.next(self.process_sentences, foreach="file_names")

    @batch()
    @step
    def process_sentences(self):
        self.file_name = self.input
        sentence_data = get_sentences(self.file_name)  # a list of dicts
        # foreach needs a concrete list, so materialise the chunks
        self.chunks = list(make_chunks(sentence_data, 1000))
        self.next(self.embedding_chunks, foreach="chunks")

    @batch()
    @step
    def embedding_chunks(self):
        # save on memory with while/pop
        texts, ids = [], []
        while self.input:
            row = self.input.pop(0)
            texts.append(row['text'])
            ids.append(row['ids'])
        bert_model = SentenceTransformer(bert_model_name)
        bert_model.max_seq_length = 512
        vecs = bert_model.encode(texts)
        self.data = list(zip(ids, vecs))
        self.next(self.join_embedding_chunks)

    @step
    def join_embedding_chunks(self, inputs):
        self.data = []
        for input in inputs:
            self.data += input.data
        self.next(self.join_files)  # join the outer foreach over file_names

... etc ...
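A hedged sketch of what the elided steps might look like (join_files and end are hypothetical names; each Metaflow foreach split needs a matching join step, and every flow finishes with an end step):

    @step
    def join_files(self, inputs):
        # join the outer foreach over file_names, concatenating each file's data
        self.data = []
        for input in inputs:
            self.data += input.data
        self.next(self.end)

    @step
    def end(self):
        pass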
whoa! ok this was much better. Went from 25 secs to 3 secs (on 100 job adverts)
…hunking of data when it's being used
…ta for splitting, change to 27.yaml as default
After making some changes, the code actually just took 4.5 days to run
…ill sentences to the relevant READMEs
Fixing #68
get_tk_sample.py
This samples 5 million job adverts randomly. Thus, when the skill sentence predictions are output, although in total there are more of them, each individual file contains less data. Previously, 100 random files were selected and the first 10k job adverts from each were processed; now there is data from 647 of these files, with a random selection from each (which is fewer than 10k).
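A minimal sketch of how such a sample might be drawn, assuming a hypothetical get_job_ids helper and a job_ad_file_names list (this is not the actual get_tk_sample.py implementation):

import random
from collections import defaultdict

random.seed(42)

# collect (file_name, job_id) pairs across every data file, then sample 5 million
all_job_ids = [
    (file_name, job_id)
    for file_name in job_ad_file_names
    for job_id in get_job_ids(file_name)  # hypothetical helper
]
sampled = random.sample(all_job_ids, 5_000_000)

# group the sample by file so each file's predictions can be output independently
sample_by_file = defaultdict(list)
for file_name, job_id in sampled:
    sample_by_file[file_name].append(job_id)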
Timing
Each file goes through the same algorithm independently, and takes roughly 15-20 minutes. These are the timings for each step of the algorithm for 1 data file out of 647:
9871 job adverts were in the "historical/2020/2020-03-11/jobs_2.110.jsonl.gz" file.
We would expect an average of 5,000,000/647 = 7,728 of the sampled job adverts in each file, so this file seems to have a particularly large sample of job adverts in it.
There will be different numbers of sentences in each job advert, but scaling that up means roughly 20 mins × 647 ≈ 9 days, or 20 mins × (5,000,000/9,871) ≈ 7 days.
Target areas for speeding up!
The biggest sticking point is transforming the sentences using the pre-trained BERT model (even when using multiprocessing), i.e. in this function:
skills-taxonomy-v2/skills_taxonomy_v2/pipeline/sentence_classifier/sentence_classifier.py
Line 76 in 830be30
So could this step be done better in order to speed up that area of the pipeline?
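One possible direction, sketched under the assumption that the transform step ultimately calls SentenceTransformer.encode (the model name and batch size here are illustrative, not the repo's actual settings): batching the encode call and running it on a GPU are usually the biggest levers.

import torch
from sentence_transformers import SentenceTransformer

bert_model = SentenceTransformer("bert-base-nli-mean-tokens")  # assumed model name
bert_model.max_seq_length = 512
sentences_vec = bert_model.encode(
    sentences,
    batch_size=64,  # larger batches amortise per-call overhead
    show_progress_bar=False,
    device="cuda" if torch.cuda.is_available() else "cpu",  # a GPU gives the largest win
)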
The second biggest time lag is splitting the text up into sentences. This is done via the split_sentence function; in this PR this function is called here.
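For reference, a minimal sketch of what a split_sentence helper of this shape might look like (the input field name and return format are assumptions; the actual implementation lives in the repo):

def split_sentence(job_advert, nlp, min_length=30):
    # split one job advert's text into sentences, dropping very short ones
    doc = nlp(job_advert["full_text"])  # "full_text" key is an assumption
    return [
        sent.text.strip()
        for sent in doc.sents
        if len(sent.text.strip()) >= min_length
    ]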