diff --git a/README.md b/README.md
index 5a9f51abd..53fd07430 100644
--- a/README.md
+++ b/README.md
@@ -34,6 +34,7 @@ The project is production-oriented and comes with [backward compatibility guaran
 * **Lightweight on disk**
   Quantization can make the models 4 times smaller on disk with minimal accuracy loss.
 * **Simple integration**
   The project has few dependencies and exposes simple APIs in [Python](https://opennmt.net/CTranslate2/python/overview.html) and C++ to cover most integration needs.
 * **Configurable and interactive decoding**
   [Advanced decoding features](https://opennmt.net/CTranslate2/decoding.html) allow autocompleting a partial sequence and returning alternatives at a specific location in the sequence.
+* **Support tensor parallelism for distributed inference**
 Some of these features are difficult to achieve with standard deep learning frameworks and are the motivation for this project.
diff --git a/docs/parallel.md b/docs/parallel.md
index 604fea122..053e10a0e 100644
--- a/docs/parallel.md
+++ b/docs/parallel.md
@@ -42,8 +42,43 @@ Parallelization with multiple Python threads is possible because all computation
 ```
 
 ## Model and tensor parallelism
 
+Models such as [`Translator`](python/ctranslate2.Translator.rst) and [`Generator`](python/ctranslate2.Generator.rst) can be split across multiple GPUs.
+This is very useful when the model is too big to be loaded on a single GPU.
-These types of parallelism are not yet implemented in CTranslate2.
+
+```python
+translator = ctranslate2.Translator(model_path, device="cuda", tensor_parallel=True)
+```
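+
+For example, the `script` passed to `mpirun` in the setup steps below could be a minimal translation script along these lines (a sketch: the model path and input tokens are placeholders, and the input is assumed to be pre-tokenized):
+
+```python
+import ctranslate2
+
+# The model is split across the GPUs visible to this group of processes.
+translator = ctranslate2.Translator("ende_ctranslate2/", device="cuda", tensor_parallel=True)
+
+# Translate a pre-tokenized batch and print the best hypothesis.
+results = translator.translate_batch([["▁Hello", "▁world", "!"]])
+print(results[0].hypotheses[0])
+```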
+
+Set up the environment:
+* Install [Open MPI](https://www.open-mpi.org/)
+* Configure Open MPI by creating a configuration file such as ``hostfile``:
+```bash
+[ip address or dns] slots=nbGPU1
+[other ip address or dns] slots=nbGPU2
+```
+* Run the application with multiple processes to use tensor parallelism (a rank-handling sketch follows this list):
+```bash
+mpirun -np nbGPUExpected -hostfile hostfile python3 script
+```
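+
+Every process started by `mpirun` runs the same script, so output may be duplicated. One possible guard is to act only in the master process; this sketch assumes Open MPI, which sets the `OMPI_COMM_WORLD_RANK` environment variable for each process it launches:
+
+```python
+import os
+
+# Rank 0 is the first process started by mpirun.
+if int(os.environ.get("OMPI_COMM_WORLD_RANK", 0)) == 0:
+    print("master process: print or save the results here")
+```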
+
+If you want to run tensor parallelism across multiple machines, additional configuration is needed:
+* Make sure the master and the slave machines can connect to each other over SSH with public key authentication
+* Export all necessary environment variables from the master to the slaves, as in the example below:
+```bash
+mpirun -x VIRTUAL_ENV_PROMPT -x PATH -x VIRTUAL_ENV -x _ -x LD_LIBRARY_PATH -np nbGPUExpected -hostfile hostfile python3 script
+```
+
+See the [Open MPI documentation](https://www.open-mpi.org/doc/) for more information.
+
+```{note}
+Running a model in tensor parallel mode on a single machine can boost performance, but sharing a model between
+multiple machines may be slower because of the network latency between them.
+```
+
+```{note}
+In tensor parallel mode, `inter_threads` is still supported to run multiple workers. However, `device_index` no longer has
+any effect, because tensor parallel mode only considers the GPUs available on the system and the number of GPUs requested.
+```
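+
+For instance, a loader combining both options could look like this (hypothetical model path; `inter_threads=2` is an arbitrary example value):
+
+```python
+import ctranslate2
+
+# device_index is omitted on purpose: it is ignored in tensor parallel mode.
+translator = ctranslate2.Translator(
+    "ende_ctranslate2/", device="cuda", tensor_parallel=True, inter_threads=2
+)
+```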
 
 ## Asynchronous execution