
Vectorize asynchronously with ThreadPoolExecutor #56

Merged

Conversation

@alexcannan (Contributor) commented Apr 21, 2023

Over the past few months I've noticed that my k8s-deployed weaviate instance (relying on t2v-transformers) has randomly been responding with 500s. After some digging I found that the transformers-inference pod was being killed due to failing liveness and readiness probes, which causes the node to be restarted. After taking a look at the code, it seems that the "async" vectorizer was not actually asynchronous, and each vectorization request could block the entire app. To fix this, I've wrapped each vectorization request in a ThreadPoolExecutor future. This should strictly improve the availability of the liveness and readiness probes and alleviate the restarting issue. On top of that, the executor lets concurrent requests overlap their IO, which gives about a 3x speed boost on my machine 🚀
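
The core of the change is to submit the blocking model call to an executor and await the resulting future from the async request handler, so the event loop stays free to answer the probes. A minimal sketch of the pattern, with _vectorize standing in for the blocking inference call (simplified, not the exact diff):

import asyncio
from concurrent.futures import ThreadPoolExecutor

class Vectorizer:
    def __init__(self):
        # A single worker mirrors the previously synchronous behavior:
        # one vectorization at a time, but off the event loop.
        self.executor = ThreadPoolExecutor(max_workers=1)

    def _vectorize(self, text, config):
        # Blocking transformer inference happens here.
        ...

    async def vectorize(self, text, config):
        # Submit the blocking call to the thread pool and await it without
        # blocking the event loop (or the liveness/readiness endpoints).
        return await asyncio.wrap_future(
            self.executor.submit(self._vectorize, text, config)
        )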

A screenshot of my stress test before:
[screenshot: t2v-stresstest]

And after:
[screenshot: 2023-04-21_09-27]

And here is a link to the stress test I used, in case anyone wants to try it out themselves.
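
The stress-test script itself isn't reproduced in this thread, so here is a hypothetical stand-in for anyone who wants something to start from. It assumes the container is listening locally and exposes POST /vectors with a JSON body containing a "text" field, as the request logs later in the thread suggest; adjust the URL and payload for your deployment.

import asyncio
import time

import httpx

URL = "http://localhost:8080/vectors"  # assumed local deployment
CONCURRENCY = 50
TOTAL_REQUESTS = 500

async def send_one(client, payload):
    resp = await client.post(URL, json=payload)
    resp.raise_for_status()

async def main():
    payload = {"text": "the quick brown fox jumps over the lazy dog"}
    limits = httpx.Limits(max_connections=CONCURRENCY)
    start = time.perf_counter()
    async with httpx.AsyncClient(limits=limits, timeout=30.0) as client:
        await asyncio.gather(*(send_one(client, payload) for _ in range(TOTAL_REQUESTS)))
    elapsed = time.perf_counter() - start
    print(f"{TOTAL_REQUESTS / elapsed:.1f} requests/sec")

if __name__ == "__main__":
    asyncio.run(main())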

@alexcannan alexcannan force-pushed the feature/liveness-readiness-availability branch from 7fdc3d4 to b8974d1 Compare April 21, 2023 14:25
vectorizer.py (Outdated)
@@ -52,8 +55,10 @@ def __init__(self, model_path: str, cuda_support: bool, cuda_core: str, cuda_per

self.tokenizer = self.model_delegate.create_tokenizer(model_path)

self.executor = ThreadPoolExecutor(max_workers=1)
@StefanBogdan (Member) commented Apr 21, 2023

As I understand it, the ThreadPoolExecutor is meant to speed up the vectorization process by running requests in parallel. I have a couple of questions regarding this approach:

  1. Why is max_workers set to 1? This means only one thread can be active at a time. Doesn't this break the parallelism that we want to achieve? Why not leave the value set to None?
  2. Why threads? The vectorization process is CPU bound and threads are meant for IO-bound tasks, so why not use processes?

@alexcannan (Contributor, Author) replied:

  1. Previously the vectorization process was synchronous, so in theory vectors were generated only one at a time. I left it at 1 to mimic this; a side effect is that GPU memory usage when CUDA is enabled is more predictable. I'll run some experiments with other values though, since it could lead to significant speedups!

  2. Good shout. I tend to use ThreadPoolExecutor over ProcessPoolExecutor since it's technically less overhead (and with 1 worker the difference is negligible), but I'll try both for the many-workers case.
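
For what it's worth, the experiment discussed here is essentially a one-line swap in the constructor; a sketch of the two options (not the actual diff), which also hints at why the process-based variant runs into pickling trouble:

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

class Vectorizer:
    def __init__(self, use_processes: bool = False):
        if use_processes:
            # Process pool: executor.submit(self._vectorize, ...) has to pickle
            # the bound method, i.e. self along with the torch model and any
            # CUDA tensors, and ship it to the worker process -- which is
            # exactly what fails in the tracebacks below.
            self.executor = ProcessPoolExecutor()
        else:
            # Thread pool: workers share the process, so the model, tokenizer
            # and CUDA tensors are used in place with no serialization.
            self.executor = ThreadPoolExecutor()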

@alexcannan (Contributor, Author) replied:

Trying ProcessPoolExecutor, I see a couple of issues:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/alex/repos/forked-t2v-transformers-models/app.py", line 66, in read_item
    vector = await vec.vectorize(item.text, item.config)
  File "/home/alex/repos/forked-t2v-transformers-models/vectorizer.py", line 108, in vectorize
    return await asyncio.wrap_future(self.executor.submit(self._vectorize, text, config))
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/alex/repos/t2v-transformers-models/env/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 261, in reduce_tensor
  File "/home/alex/repos/t2v-transformers-models/env/lib/python3.10/site-packages/torch/storage.py", line 920, in _share_cuda_
RuntimeError: unable to open shared memory object </torch_49047_1633709960_558> in read-write mode: Too many open files (24)
INFO:     127.0.0.1:42060 - "POST /vectors HTTP/1.1" 500 Internal Server Error

and

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/alex/repos/forked-t2v-transformers-models/app.py", line 66, in read_item
    vector = await vec.vectorize(item.text, item.config)
  File "/home/alex/repos/forked-t2v-transformers-models/vectorizer.py", line 108, in vectorize
    return await asyncio.wrap_future(self.executor.submit(self._vectorize, text, config))
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
TypeError: cannot pickle '_thread.lock' object

However, ThreadPoolExecutor handles multiple workers quite nicely.

@alexcannan (Contributor, Author) replied:

ThreadPoolExecutor with max_workers=None reaches up to ~70 requests/sec, roughly 3x the ~23 requests/sec I was seeing with a single worker.

@StefanBogdan (Member) replied:

Can you change to max_workers=None? Then we can merge it.

@alexcannan (Contributor, Author) replied:

Done, I removed the kwarg altogether since the default is None.
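
With the kwarg removed, the constructor line is just the bare default; a minimal illustration (not quoted from the final diff):

from concurrent.futures import ThreadPoolExecutor

class Vectorizer:
    def __init__(self):
        # With max_workers=None (the default), Python 3.8+ sizes the pool to
        # min(32, os.cpu_count() + 4) worker threads.
        self.executor = ThreadPoolExecutor()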

@StefanBogdan (Member) commented:

Maybe we could just run uvicorn with multiple workers: https://www.uvicorn.org/settings/#production

@alexcannan (Contributor, Author) commented Apr 21, 2023

Maybe we could just run uvicorn with multiple workers: https://www.uvicorn.org/settings/#production

Running multiple workers right now will instantiate multiple models, which is great for speed but will increase GPU memory usage; see my nvidia-smi output with --workers 4:

alex@hq3 ~> nvidia-smi
Fri Apr 21 11:12:54 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:08:00.0  On |                  N/A |
| 51%   72C    P2   343W / 350W |   9962MiB / 24576MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3291      G   /usr/lib/xorg/Xorg                442MiB |
|    0   N/A  N/A      3442      G   /usr/bin/gnome-shell               66MiB |
|    0   N/A  N/A      4425      G   ...veSuggestionsOnlyOnDemand       82MiB |
|    0   N/A  N/A      6918      G   ...RendererForSitePerProcess      122MiB |
|    0   N/A  N/A      8204      G   ...RendererForSitePerProcess       26MiB |
|    0   N/A  N/A      9777      G   ...9/usr/lib/firefox/firefox      232MiB |
|    0   N/A  N/A     50710      C   ...rs-models/env/bin/python3     2158MiB |
|    0   N/A  N/A     50711      C   ...rs-models/env/bin/python3     2224MiB |
|    0   N/A  N/A     50712      C   ...rs-models/env/bin/python3     2292MiB |
|    0   N/A  N/A     50713      C   ...rs-models/env/bin/python3     2308MiB |
+-----------------------------------------------------------------------------+
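
For completeness, a sketch of the multi-worker setup under discussion; the app:app import string and the port are assumptions for illustration, not taken from the repo:

# CLI equivalent: uvicorn app:app --workers 4
# Each uvicorn worker is a separate process, so each one loads its own copy of
# the model -- hence the four ~2.2 GiB python3 processes in the output above.
import uvicorn

if __name__ == "__main__":
    # With workers > 1, the application must be passed as an import string.
    uvicorn.run("app:app", host="0.0.0.0", port=8080, workers=4)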

@alexcannan alexcannan force-pushed the feature/liveness-readiness-availability branch from e392275 to 4527559 Compare April 21, 2023 15:19
@StefanBogdan StefanBogdan merged commit 64b36bd into weaviate:main Apr 21, 2023
@alexcannan alexcannan deleted the feature/liveness-readiness-availability branch July 26, 2023 15:10