This directory contains instructions for deploying models produced by the RTX AI Toolkit fine-tuning process across the following workflows.
As seen in the LLM Finetuning tutorial, the LlamaFactory app can export either LoRA adapters or merged HF checkpoints. Here, we will see how to optimize both of these options for deployment across a variety of platforms.
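The two export artifacts are related by a simple weight merge: a merged HF checkpoint is the base model with the LoRA adapter weights folded in. The sketch below shows this relationship using Hugging Face `transformers` and `peft`; the model name and paths are placeholders, not outputs of any specific tutorial step.

```python
# Minimal sketch: turning a LoRA adapter export into a merged HF checkpoint.
# Model name and directory paths are placeholders (assumptions).
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = PeftModel.from_pretrained(base, "output/lora_adapter")  # LoRA adapter export
merged = model.merge_and_unload()                               # fold adapter weights into the base model
merged.save_pretrained("output/merged_checkpoint")              # merged HF checkpoint export
```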
RTX AI Toolkit supports the following deployment workflows for the fine-tuned LLMs:
Quantized (on-device) inference:
| Platform | LoRA Adapter | Merged checkpoint |
|---|---|---|
| TensorRT-LLM | ✅ | ✅ |
| llama.cpp | ✅ | |
| ONNX Runtime - DML | ✅ | |
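As one example of the on-device path, the sketch below loads a quantized GGUF model with the `llama-cpp-python` bindings and applies a fine-tuned LoRA adapter at load time. The file paths, quantization level, and prior conversion of the adapter to GGUF are assumptions, not outputs of the steps above.

```python
# Minimal sketch: quantized on-device inference with llama-cpp-python.
# Paths are placeholders; the GGUF files are assumed to have been produced
# by llama.cpp's conversion scripts from the fine-tuned model/adapter.
from llama_cpp import Llama

llm = Llama(
    model_path="models/base-model-q4_k_m.gguf",  # quantized base model (assumed path)
    lora_path="models/finetuned-adapter.gguf",   # LoRA adapter converted to GGUF (assumed path)
    n_gpu_layers=-1,                             # offload all layers to the GPU
    n_ctx=4096,
)

out = llm("### Instruction: Summarize the RTX AI Toolkit.\n### Response:", max_tokens=128)
print(out["choices"][0]["text"])
```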
FP16 (cloud) inference:
| Platform | LoRA Adapter | Merged checkpoint |
|---|---|---|
| vLLM | ✅ | ✅ |
| NIMs | ✅ | |
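For the FP16 cloud path, the sketch below runs a fine-tuned LoRA adapter on top of its base model with vLLM's offline inference API. The base model name and adapter path are placeholders; to serve a merged checkpoint instead, pass its directory as `model` and drop the LoRA arguments.

```python
# Minimal sketch: FP16 inference with vLLM, applying a LoRA adapter at request time.
# Model name and adapter path are placeholders (assumptions).
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_lora=True, dtype="float16")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["Explain what a LoRA adapter is."],
    params,
    lora_request=LoRARequest("finetuned-adapter", 1, "path/to/lora_adapter"),  # adapter dir from the LlamaFactory export (assumed)
)
print(outputs[0].outputs[0].text)
```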