
# RTX AI Toolkit Model Deployment Workflow

This directory contains instructions for deploying models produced by the RTX AI Toolkit fine-tuning process across the workflows listed below.

As seen in the LLM Finetuning tutorial, the LlamaFactory app can export either LoRA adapters or merged Hugging Face (HF) checkpoints. Here, we show how to optimize both options for deployment across a variety of platforms; a sketch of the adapter-merge step follows below.
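
For reference, merging a LoRA adapter into its base model can also be done directly with the PEFT library. This is a minimal illustrative sketch, not part of the toolkit itself; the base model ID and the adapter/output paths are placeholders you would substitute with your own.

```python
# Minimal sketch: merge a LoRA adapter into a base HF checkpoint with PEFT.
# Paths and model ID below are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed base model
adapter_dir = "path/to/lora_adapter"             # LoRA export from LlamaFactory
output_dir = "path/to/merged_checkpoint"

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
model = PeftModel.from_pretrained(base, adapter_dir)

# Fold the LoRA weights into the base weights, yielding a plain HF model.
merged = model.merge_and_unload()
merged.save_pretrained(output_dir)
AutoTokenizer.from_pretrained(base_id).save_pretrained(output_dir)
```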

RTX AI Toolkit supports the following deployment workflows for the fine-tuned LLMs:

## Quantized (on-device) inference

| Platform           | LoRA Adapter | Merged checkpoint |
| ------------------ | ------------ | ----------------- |
| TensorRT-LLM       |              |                   |
| llama.cpp          |              |                   |
| ONNX Runtime - DML |              |                   |
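
As one concrete example of on-device inference, a merged checkpoint that has been converted and quantized to GGUF can be loaded with the llama-cpp-python bindings for llama.cpp. This sketch is illustrative only: the GGUF path is a placeholder, and it assumes the conversion and quantization steps have already been run with llama.cpp's tooling.

```python
# Minimal sketch: run a quantized GGUF checkpoint on-device with
# llama-cpp-python (Python bindings for llama.cpp).
# The model path is a placeholder for a GGUF file produced from the
# merged HF checkpoint by llama.cpp's conversion/quantization tools.
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/merged_checkpoint.Q4_K_M.gguf",  # placeholder
    n_gpu_layers=-1,  # offload all layers to the GPU if available
    n_ctx=4096,       # context window
)

out = llm("Summarize what a LoRA adapter is in one sentence.", max_tokens=128)
print(out["choices"][0]["text"])
```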

## FP16 (cloud) inference

| Platform | LoRA Adapter | Merged checkpoint |
| -------- | ------------ | ----------------- |
| vLLM     |              |                   |
| NIMs     |              |                   |
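
For FP16 serving, vLLM can either load the merged checkpoint directly or serve the base model with the LoRA adapter applied per request. The sketch below is illustrative, assuming a vLLM build with multi-LoRA support; the model ID and adapter path are placeholders.

```python
# Minimal sketch: FP16 inference with vLLM, applying a LoRA adapter to the
# base model at request time. Model ID and adapter path are placeholders.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed base model
    dtype="float16",
    enable_lora=True,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(
    ["Summarize what a LoRA adapter is in one sentence."],
    params,
    lora_request=LoRARequest("finetune", 1, "path/to/lora_adapter"),
)
print(outputs[0].outputs[0].text)
```

Loading a merged checkpoint instead is simpler still: pass its directory as `model=` and drop `enable_lora` and the `lora_request` argument.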