
Vectorize asynchronously with ThreadPoolExecutor #56

Merged

Conversation

@alexcannan (Contributor) commented Apr 21, 2023

Over the past few months I've noticed that my k8s-deployed weaviate instance (relying on t2v-transformers) has randomly been responding with 500s. After some digging I found that the transformers-inference pod was being killed due to failing liveness and readiness probes, which causes the node to be restarted. After taking a look at the code, it seems that the "async" vectorizer was not actually asynchronous, and each vectorization request could block the entire app. To fix this, I've wrapped each vectorization request in a ThreadPoolExecutor future. This should strictly improve the availability of the liveness and readiness probes and alleviate the restarting issue. On top of that, the executor lets concurrent requests overlap their IO, which gives about a 3x speed boost on my machine 🚀
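
The core of the change is to submit the blocking model call to an executor and await the resulting future from the async request handler, so the event loop stays free to answer the probes. A minimal sketch of the pattern, with _vectorize standing in for the blocking inference call (simplified, not the exact diff):

import asyncio
from concurrent.futures import ThreadPoolExecutor

class Vectorizer:
    def __init__(self):
        # A single worker mirrors the previously synchronous behavior:
        # one vectorization at a time, but off the event loop.
        self.executor = ThreadPoolExecutor(max_workers=1)

    def _vectorize(self, text, config):
        # Blocking transformer inference happens here.
        ...

    async def vectorize(self, text, config):
        # Submit the blocking call to the thread pool and await it without
        # blocking the event loop (or the liveness/readiness endpoints).
        return await asyncio.wrap_future(
            self.executor.submit(self._vectorize, text, config)
        )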

A screenshot of my stress test before:
[screenshot: t2v-stresstest]

And after:
[screenshot: 2023-04-21_09-27]

And here is a link to the stress test I used, in case anyone wants to try it out themselves.
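
The stress-test script itself isn't reproduced in this thread, so here is a hypothetical stand-in for anyone who wants something to start from. It assumes the container is listening locally and exposes POST /vectors with a JSON body containing a "text" field, as the request logs later in the thread suggest; adjust the URL and payload for your deployment.

import asyncio
import time

import httpx

URL = "http://localhost:8080/vectors"  # assumed local deployment
CONCURRENCY = 50
TOTAL_REQUESTS = 500

async def send_one(client, payload):
    resp = await client.post(URL, json=payload)
    resp.raise_for_status()

async def main():
    payload = {"text": "the quick brown fox jumps over the lazy dog"}
    limits = httpx.Limits(max_connections=CONCURRENCY)
    start = time.perf_counter()
    async with httpx.AsyncClient(limits=limits, timeout=30.0) as client:
        await asyncio.gather(*(send_one(client, payload) for _ in range(TOTAL_REQUESTS)))
    elapsed = time.perf_counter() - start
    print(f"{TOTAL_REQUESTS / elapsed:.1f} requests/sec")

if __name__ == "__main__":
    asyncio.run(main())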

@alexcannan alexcannan force-pushed the feature/liveness-readiness-availability branch from 7fdc3d4 to b8974d1 Compare April 21, 2023 14:25
vectorizer.py (Outdated)
@@ -52,8 +55,10 @@ def __init__(self, model_path: str, cuda_support: bool, cuda_core: str, cuda_per

self.tokenizer = self.model_delegate.create_tokenizer(model_path)

self.executor = ThreadPoolExecutor(max_workers=1)
@StefanBogdan (Member) commented Apr 21, 2023

As I understand it, the ThreadPoolExecutor is meant to speed up the vectorization process by running requests in parallel. I have a couple of questions regarding this approach:

  1. Why is max_workers set to 1? This means only one thread can be active at a time. Doesn't this break the parallelism that we want to achieve? Why not leave the value set to None?
  2. Why threads? The vectorization process is CPU bound and threads are meant for IO-bound tasks, so why not use processes?

@alexcannan (Contributor, Author) replied:

  1. Previously the vectorization process was synchronous, so in theory vectors were generated only one at a time. I left it at 1 to mimic this; a side effect is that GPU memory usage when CUDA is enabled is more predictable. I'll run some experiments with other values though, since it could lead to significant speedups!

  2. Good shout. I tend to use ThreadPoolExecutor over ProcessPoolExecutor since it's technically less overhead (and with 1 worker the difference is negligible), but I'll try both for the many-workers case.
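
For what it's worth, the experiment discussed here is essentially a one-line swap in the constructor; a sketch of the two options (not the actual diff), which also hints at why the process-based variant runs into pickling trouble:

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

class Vectorizer:
    def __init__(self, use_processes: bool = False):
        if use_processes:
            # Process pool: executor.submit(self._vectorize, ...) has to pickle
            # the bound method, i.e. self along with the torch model and any
            # CUDA tensors, and ship it to the worker process -- which is
            # exactly what fails in the tracebacks below.
            self.executor = ProcessPoolExecutor()
        else:
            # Thread pool: workers share the process, so the model, tokenizer
            # and CUDA tensors are used in place with no serialization.
            self.executor = ThreadPoolExecutor()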

@alexcannan (Contributor, Author) replied:

Trying ProcessPoolExecutor, I see a couple of issues:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/alex/repos/forked-t2v-transformers-models/app.py", line 66, in read_item
    vector = await vec.vectorize(item.text, item.config)
  File "/home/alex/repos/forked-t2v-transformers-models/vectorizer.py", line 108, in vectorize
    return await asyncio.wrap_future(self.executor.submit(self._vectorize, text, config))
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/alex/repos/t2v-transformers-models/env/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 261, in reduce_tensor
  File "/home/alex/repos/t2v-transformers-models/env/lib/python3.10/site-packages/torch/storage.py", line 920, in _share_cuda_
RuntimeError: unable to open shared memory object </torch_49047_1633709960_558> in read-write mode: Too many open files (24)
INFO:     127.0.0.1:42060 - "POST /vectors HTTP/1.1" 500 Internal Server Error

and

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/alex/repos/forked-t2v-transformers-models/app.py", line 66, in read_item
    vector = await vec.vectorize(item.text, item.config)
  File "/home/alex/repos/forked-t2v-transformers-models/vectorizer.py", line 108, in vectorize
    return await asyncio.wrap_future(self.executor.submit(self._vectorize, text, config))
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
TypeError: cannot pickle '_thread.lock' object

However, ThreadPoolExecutor handles multiple workers quite nicely.

@alexcannan (Contributor, Author) replied:

ThreadPoolExecutor with max_workers=None reaches up to ~70 requests/sec, roughly 3x the ~23 requests/sec I was seeing with a single worker.

@StefanBogdan (Member) replied:

Can you change to max_workers=None? Then we can merge it.

@alexcannan (Contributor, Author) replied:

Done, I removed the kwarg altogether since the default is None.
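
With the kwarg removed, the constructor line is just the bare default; a minimal illustration (not quoted from the final diff):

from concurrent.futures import ThreadPoolExecutor

class Vectorizer:
    def __init__(self):
        # With max_workers=None (the default), Python 3.8+ sizes the pool to
        # min(32, os.cpu_count() + 4) worker threads.
        self.executor = ThreadPoolExecutor()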

@StefanBogdan (Member) commented:

Maybe we could just run uvicorn with multiple workers: https://www.uvicorn.org/settings/#production

@alexcannan (Contributor, Author) commented Apr 21, 2023

Maybe we could just run uvicorn with multiple workers: https://www.uvicorn.org/settings/#production

Running multiple workers right now will instantiate multiple models, which is great for speed but will increase GPU memory usage; see my nvidia-smi output with --workers 4:

alex@hq3 ~> nvidia-smi
Fri Apr 21 11:12:54 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:08:00.0  On |                  N/A |
| 51%   72C    P2   343W / 350W |   9962MiB / 24576MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3291      G   /usr/lib/xorg/Xorg                442MiB |
|    0   N/A  N/A      3442      G   /usr/bin/gnome-shell               66MiB |
|    0   N/A  N/A      4425      G   ...veSuggestionsOnlyOnDemand       82MiB |
|    0   N/A  N/A      6918      G   ...RendererForSitePerProcess      122MiB |
|    0   N/A  N/A      8204      G   ...RendererForSitePerProcess       26MiB |
|    0   N/A  N/A      9777      G   ...9/usr/lib/firefox/firefox      232MiB |
|    0   N/A  N/A     50710      C   ...rs-models/env/bin/python3     2158MiB |
|    0   N/A  N/A     50711      C   ...rs-models/env/bin/python3     2224MiB |
|    0   N/A  N/A     50712      C   ...rs-models/env/bin/python3     2292MiB |
|    0   N/A  N/A     50713      C   ...rs-models/env/bin/python3     2308MiB |
+-----------------------------------------------------------------------------+
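
For completeness, a sketch of the multi-worker setup under discussion; the app:app import string and the port are assumptions for illustration, not taken from the repo:

# CLI equivalent: uvicorn app:app --workers 4
# Each uvicorn worker is a separate process, so each one loads its own copy of
# the model -- hence the four ~2.2 GiB python3 processes in the output above.
import uvicorn

if __name__ == "__main__":
    # With workers > 1, the application must be passed as an import string.
    uvicorn.run("app:app", host="0.0.0.0", port=8080, workers=4)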

@alexcannan alexcannan force-pushed the feature/liveness-readiness-availability branch from e392275 to 4527559 Compare April 21, 2023 15:19
@StefanBogdan StefanBogdan merged commit 64b36bd into weaviate:main Apr 21, 2023
@alexcannan alexcannan deleted the feature/liveness-readiness-availability branch July 26, 2023 15:10