This directory contains instructions for deploying models produced by the RTX AI Toolkit fine-tuning process across the following workflows.
As seen in the LLM Finetuning tutorial, the LlamaFactory app can export either LoRA adapters or merged HF checkpoints. Here, we will see how to optimize both of these options for deployment across a variety of platforms.
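The two export artifacts are related by a simple weight merge: a merged HF checkpoint is the base model with the LoRA adapter weights folded in. The sketch below shows this relationship using Hugging Face `transformers` and `peft`; the model name and paths are placeholders, not outputs of any specific tutorial step.

```python
# Minimal sketch: turning a LoRA adapter export into a merged HF checkpoint.
# Model name and directory paths are placeholders (assumptions).
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
model = PeftModel.from_pretrained(base, "output/lora_adapter")  # LoRA adapter export
merged = model.merge_and_unload()                               # fold adapter weights into the base model
merged.save_pretrained("output/merged_checkpoint")              # merged HF checkpoint export
```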
RTX AI Toolkit supports the following deployment workflows for the fine-tuned LLMs:
Quantized (on-device) inference:
| Platform | LoRA Adapter | Merged checkpoint |
|---|---|---|
| TensorRT-LLM | ✅ | ✅ |
| llama.cpp | ✅ | |
| ONNX Runtime - DML | ✅ | |
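As one example of the on-device path, the sketch below loads a quantized GGUF model with the `llama-cpp-python` bindings and applies a fine-tuned LoRA adapter at load time. The file paths, quantization level, and prior conversion of the adapter to GGUF are assumptions, not outputs of the steps above.

```python
# Minimal sketch: quantized on-device inference with llama-cpp-python.
# Paths are placeholders; the GGUF files are assumed to have been produced
# by llama.cpp's conversion scripts from the fine-tuned model/adapter.
from llama_cpp import Llama

llm = Llama(
    model_path="models/base-model-q4_k_m.gguf",  # quantized base model (assumed path)
    lora_path="models/finetuned-adapter.gguf",   # LoRA adapter converted to GGUF (assumed path)
    n_gpu_layers=-1,                             # offload all layers to the GPU
    n_ctx=4096,
)

out = llm("### Instruction: Summarize the RTX AI Toolkit.\n### Response:", max_tokens=128)
print(out["choices"][0]["text"])
```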
FP16 (cloud) inference:
| Platform | LoRA Adapter | Merged checkpoint |
|---|---|---|
| vLLM | ✅ | ✅ |
| NIMs | ✅ | |
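For the FP16 cloud path, the sketch below runs a fine-tuned LoRA adapter on top of its base model with vLLM's offline inference API. The base model name and adapter path are placeholders; to serve a merged checkpoint instead, pass its directory as `model` and drop the LoRA arguments.

```python
# Minimal sketch: FP16 inference with vLLM, applying a LoRA adapter at request time.
# Model name and adapter path are placeholders (assumptions).
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_lora=True, dtype="float16")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["Explain what a LoRA adapter is."],
    params,
    lora_request=LoRARequest("finetuned-adapter", 1, "path/to/lora_adapter"),  # adapter dir from the LlamaFactory export (assumed)
)
print(outputs[0].outputs[0].text)
```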