Vectorize asynchronously with ThreadPoolExecutor #56
Conversation
Force-pushed from 7fdc3d4 to b8974d1
vectorizer.py (Outdated)

@@ -52,8 +55,10 @@ def __init__(self, model_path: str, cuda_support: bool, cuda_core: str, cuda_per
        self.tokenizer = self.model_delegate.create_tokenizer(model_path)
+       self.executor = ThreadPoolExecutor(max_workers=1)
As I understand it, the `ThreadPoolExecutor` is meant to speed up vectorization by running requests in parallel. I have a couple of questions regarding this approach:

- Why is `max_workers` set to `1`? This means only one thread can be active at a time. Doesn't this break the parallelism that we want to achieve? Why not leave the value set to `None`?
- Why threads? The vectorization process is CPU bound and threads are meant for IO-bound tasks, so why not use processes?
- Previously the vectorization process was synchronous, so in effect vectors were generated only one at a time. I left `max_workers` at 1 to mimic this; a side effect is that GPU memory usage when CUDA is enabled is more predictable. I'll run some experiments with other values though, it could lead to significant speedups!
- Good shout. I tend to use `ThreadPoolExecutor` over `ProcessPoolExecutor` since it technically has less overhead (and with 1 worker the difference is negligible), but I'll try both for the many-workers case.
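For reference, a minimal sketch of that kind of experiment, entirely illustrative and not the project's code: it times a batch of dummy submissions while varying the executor class and `max_workers`. The sleep-based `fake_vectorize` stands in for real model inference, so it won't surface the pickling problems that real model state can cause with processes (see below).

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def fake_vectorize(text):
    # Stand-in for the real model call; the sleep simulates per-request latency.
    time.sleep(0.1)
    return len(text)

def benchmark(executor_cls, max_workers, n_requests=32):
    start = time.time()
    with executor_cls(max_workers=max_workers) as executor:
        list(executor.map(fake_vectorize, ["some text"] * n_requests))
    return time.time() - start

if __name__ == "__main__":
    for cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        for workers in (1, 4, None):  # None = the library's default pool size
            elapsed = benchmark(cls, workers)
            print(f"{cls.__name__} max_workers={workers}: {elapsed:.2f}s for 32 requests")
```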
Trying `ProcessPoolExecutor`, I ran into a couple of issues:
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/alex/repos/forked-t2v-transformers-models/app.py", line 66, in read_item
vector = await vec.vectorize(item.text, item.config)
File "/home/alex/repos/forked-t2v-transformers-models/vectorizer.py", line 108, in vectorize
return await asyncio.wrap_future(self.executor.submit(self._vectorize, text, config))
File "/usr/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
obj = _ForkingPickler.dumps(obj)
File "/usr/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/home/alex/repos/t2v-transformers-models/env/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 261, in reduce_tensor
File "/home/alex/repos/t2v-transformers-models/env/lib/python3.10/site-packages/torch/storage.py", line 920, in _share_cuda_
RuntimeError: unable to open shared memory object </torch_49047_1633709960_558> in read-write mode: Too many open files (24)
INFO: 127.0.0.1:42060 - "POST /vectors HTTP/1.1" 500 Internal Server Error
and
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/alex/repos/forked-t2v-transformers-models/app.py", line 66, in read_item
vector = await vec.vectorize(item.text, item.config)
File "/home/alex/repos/forked-t2v-transformers-models/vectorizer.py", line 108, in vectorize
return await asyncio.wrap_future(self.executor.submit(self._vectorize, text, config))
File "/usr/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
obj = _ForkingPickler.dumps(obj)
File "/usr/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
TypeError: cannot pickle '_thread.lock' object
However, ThreadPoolExecutor handles multiple workers quite nicely.
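For what it's worth, that second error is easy to reproduce in isolation: submitting a bound method to a `ProcessPoolExecutor` pickles the whole instance, so any un-picklable state it holds (locks, CUDA tensors) blows up. A minimal sketch with a hypothetical stand-in class, not the project's `Vectorizer`:

```python
import threading
from concurrent.futures import ProcessPoolExecutor

class FakeVectorizer:
    def __init__(self):
        # Stands in for the un-picklable state a loaded model drags along
        # (locks, CUDA tensors, file handles, ...).
        self._lock = threading.Lock()

    def _vectorize(self, text):
        return [float(len(text))]

if __name__ == "__main__":
    vec = FakeVectorizer()
    with ProcessPoolExecutor(max_workers=2) as executor:
        future = executor.submit(vec._vectorize, "hello")
        # future.result() raises TypeError: cannot pickle '_thread.lock' object,
        # because pickling the bound method pickles `vec` and its lock with it.
        print(future.result())
```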
`ThreadPoolExecutor` with `max_workers=None` reaches up to ~70 requests/sec, about 3x the ~23 req/s I was seeing with a single worker.
Can you change it to `max_workers=None`? Then we can merge it.
Done, I removed the kwarg altogether since the default is None.
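For anyone reading along, here is what leaving `max_workers` unset (or `None`) means in practice; a small sketch that peeks at a private attribute purely for illustration:

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Since Python 3.8, an unset/None max_workers gives min(32, os.cpu_count() + 4)
# worker threads, so the pool scales with the machine instead of being pinned to 1.
executor = ThreadPoolExecutor()
print(executor._max_workers)               # private attribute, shown only for illustration
print(min(32, (os.cpu_count() or 1) + 4))  # the documented formula
executor.shutdown()
```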
> Maybe we could just run […]

Running multiple workers right now will instantiate multiple models, which is great for speed, but will increase GPU memory usage; see my […]
Force-pushed from e392275 to 4527559
Over the past few months I've noticed that my k8s-deployed Weaviate instance (relying on t2v-transformers) has randomly been responding with 500s. After some digging I found that the transformers-inference pod was being killed due to failing liveness and readiness probes, which causes the pod to be restarted.

After taking a look at the code, it seems that the "async" vectorizer was not actually asynchronous, and each vectorization request could block the entire app. To fix this, I've wrapped each vectorization request in a ThreadPoolExecutor future. This should strictly improve the availability of the liveness and readiness probes and alleviate the restarting issue. On top of that, the executor lets requests overlap instead of blocking one another, which gives about a 3x speed boost on my machine 🚀
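Roughly, the shape of the change looks like the sketch below. The `await asyncio.wrap_future(self.executor.submit(...))` call mirrors the real one visible in the tracebacks above; the class scaffolding and the `_vectorize` body are simplified placeholders, and the web-framework side is omitted.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

class Vectorizer:
    """Illustrative stand-in; the real class loads a transformers model."""

    def __init__(self):
        # Default pool size, per the review discussion above.
        self.executor = ThreadPoolExecutor()

    def _vectorize(self, text, config=None):
        # Blocking model inference would happen here.
        return [float(len(text))]

    async def vectorize(self, text, config=None):
        # Run the blocking call in the pool and await it without blocking the
        # event loop, so liveness/readiness probes keep being answered.
        return await asyncio.wrap_future(
            self.executor.submit(self._vectorize, text, config)
        )

async def main():
    vec = Vectorizer()
    print(await vec.vectorize("hello world"))

if __name__ == "__main__":
    asyncio.run(main())
```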
A screenshot of my stress test before:
And after:
And here is a link to the stress test I used, in case anyone wants to try it out themselves.