GPU Inference #25

tpoisonooo · 2023-07-06T08:39:02Z

llama.onnx is primarily meant for understanding LLMs and converting them to run on NPUs.

If you are looking for inference on NVIDIA GPUs, we have released lmdeploy at https://github.com/InternLM/lmdeploy.

It supports:

- Models similar to llama ranging from 7B to 100B in size, available in huggingface or meta format
- Configurable batch_size and quantization, faster than some other implementations
- Tensor parallelism, allowing you to run a 65B model on multiple 3090 GPUs
- Interacting with WebUI or using command line interface
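
For readers new to the term, here is a minimal sketch of the idea behind tensor parallelism. It is not lmdeploy's implementation. It assumes a single linear layer whose weight matrix is split column-wise across simulated devices, and the helper names (`matmul`, `split_columns`, `column_parallel_linear`) are hypothetical, written in plain Python for illustration only.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(weight, num_shards):
    """Split a weight matrix column-wise into num_shards equal-width shards."""
    cols = len(weight[0])
    assert cols % num_shards == 0, "columns must divide evenly across shards"
    width = cols // num_shards
    return [[row[i * width:(i + 1) * width] for row in weight]
            for i in range(num_shards)]


def column_parallel_linear(x, weight, num_shards):
    """Compute x @ weight with each simulated device holding one column shard."""
    shards = split_columns(weight, num_shards)
    # Each device multiplies the full input by its own slice of the weight.
    partials = [matmul(x, shard) for shard in shards]
    # Joining the partial outputs row by row stands in for an all-gather.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
weight = [[1, 0, 2, 1], [0, 1, 1, 3]]
y_parallel = column_parallel_linear(x, weight, num_shards=2)
y_reference = matmul(x, weight)
```

Each shard holds only a fraction of the weight, which is why a model too large for one GPU can in principle be spread across several. Real implementations also have to handle communication between the devices, which this sketch leaves out.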


tpoisonooo · 2023-07-06T08:43:32Z

tpoisonooo · 2023-07-06T08:50:28Z

yiliu30 · 2023-08-08T15:00:11Z

> Tensor parallelism

Nice work! Can tensor parallelism be implemented using both Torch and ONNX models?
