This repo contains the source code for the experiments in our paper
Riemannian Preconditioned LoRA for Fine-Tuning Foundation Models
Fangzhao Zhang, Mert Pilanci
Paper: https://arxiv.org/abs/2402.02347
In this work we study the enhancement of the Low Rank Adaptation (LoRA) fine-tuning procedure by introducing a Riemannian preconditioner in its optimization step. Specifically, we introduce an $r \times r$ preconditioner in each gradient step, where $r$ is the LoRA rank; it requires only a small change to existing optimizer code and adds virtually no storage or runtime overhead. Concretely, in each iteration, let $g_A$ and $g_B$ denote the stochastic gradients of the LoRA factors $A \in \mathbb{R}^{r \times n}$ and $B \in \mathbb{R}^{m \times r}$ (so that $\Delta W = BA$); the preconditioned update is

$$A \leftarrow A - \eta\,(B^\top B)^{-1} g_A, \qquad B \leftarrow B - \eta\, g_B\,(A A^\top)^{-1},$$

where $\eta$ is the learning rate.
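As a minimal illustration of this update, the sketch below applies the $r \times r$ preconditioners to the raw LoRA gradients in PyTorch. It is only a sketch under the notation above: the function name, the damping term `eps`, and the plain SGD base step are illustrative choices, not the optimizer implementation shipped in this repository.

```python
import torch

def riemannian_preconditioned_step(A, B, lr=1e-4, eps=1e-8):
    """One preconditioned SGD step for LoRA factors (illustrative sketch).

    A: (r, n) tensor, B: (m, r) tensor, so that delta_W = B @ A.
    The r x r preconditioners (B^T B)^{-1} and (A A^T)^{-1} rescale the raw
    gradients before the update. A small damping term eps * I keeps the
    inverses well conditioned; the exact damping used in the repo may differ.
    """
    r = A.shape[0]
    I = torch.eye(r, device=A.device, dtype=A.dtype)
    with torch.no_grad():
        g_A, g_B = A.grad, B.grad
        # Left-multiply g_A by (B^T B)^{-1}; right-multiply g_B by (A A^T)^{-1}.
        pre_A = torch.linalg.solve(B.T @ B + eps * I, g_A)
        pre_B = torch.linalg.solve(A @ A.T + eps * I, g_B.T).T
        A -= lr * pre_A
        B -= lr * pre_B
```

In the repository, the preconditioning is carried out inside the optimizer code provided with each experiment; the sketch above only illustrates the shape of the update. See the per-experiment directories listed below.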
In this project, we experiment with GPT-2 fine-tuning, Mistral 7B fine-tuning, Mix-of-Show fine-tuning, and custom diffusion fine-tuning:
- GPT-2 Fine-Tuning (see GPT-2/ for experiment code.)
- Mistral 7B Fine-Tuning (see Mistral-7B/ for experiment code.)
- Mix-of-Show Fine-Tuning (see Mix-of-Show/ for experiment code.)
- Custom Diffusion Fine-Tuning (see Object_Generation/ for experiment code.)
See the Parameter Reference in each section for the parameter choices used in each experiment. See the Runtime Experiment in GPT-2/ for runtime experiment details.
Please contact us or post an issue if you have any questions.
- Fangzhao Zhang ([email protected])
This work has been heavily influenced by recent developments in low-rank matrix optimization and parameter-efficient fine-tuning (PEFT) research. We cite several important references here; a more complete reference list is presented in our paper. Moreover, our experimental code is mainly built on the following repositories: LoRA (Hu et al., 2022), Mix-of-Show (Gu et al., 2023), and Custom Diffusion.
@article{tong2021accelerating,
title={Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent},
author={Tian Tong and Cong Ma and Yuejie Chi},
journal={arXiv preprint arXiv:2005.08898},
year={2021}
}
@inproceedings{hu2022lora,
title={Lo{RA}: Low-Rank Adaptation of Large Language Models},
author={Edward J Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=nZeVKeeFYf9}
}
@article{gu2023mixofshow,
title={Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models},
author={Gu, Yuchao and Wang, Xintao and Wu, Jay Zhangjie and Shi, Yujun and Chen, Yunpeng and Fan, Zihan and Xiao, Wuyou and Zhao, Rui and Chang, Shuning and Wu, Weijia and Ge, Yixiao and Shan, Ying and Shou, Mike Zheng},
journal={arXiv preprint arXiv:2305.18292},
year={2023}
}
@misc{zhang2024riemannian,
title={Riemannian Preconditioned LoRA for Fine-Tuning Foundation Models},
author={Fangzhao Zhang and Mert Pilanci},
year={2024},
eprint={2402.02347},
archivePrefix={arXiv},
primaryClass={cs.LG}
}