Update doc
bofenghuang committed Oct 20, 2023
1 parent 72b96b5 commit eb1a666
Showing 3 changed files with 12 additions and 10 deletions.
6 changes: 1 addition & 5 deletions README.md
@@ -125,11 +125,7 @@ More information can be found in [vigogne/data](docs/data.md).

## Training

For efficient LLM fine-tuning, we utilize a technique called [low-rank adaptation (LoRA)](https://arxiv.org/abs/2106.09685) from 🤗 Hugging Face's [PEFT](https://github.com/huggingface/peft) library. This approach involves freezing the base model's weights and introducing a small number of learnable parameters.

Additionally, for practitioners without access to GPUs with ample memory, it's advisable to consider quantizing certain computations to either 8-bit or 4-bit precision using [LLM.int8()](https://arxiv.org/abs/2208.07339) or [QLoRA](https://arxiv.org/abs/2305.14314). Be aware that this might lead to a minor reduction in speed compared to fp16 or bf16 versions.

We highly recommend the utilization of tools such as [DeepSpeed](https://github.com/microsoft/DeepSpeed) or [FSDP](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api), particularly when engaged in distributed learning scenarios. When dealing with long sequences, [FlashAttention](https://arxiv.org/abs/2307.08691) becomes crucial to speed up training and reduce memory usage.
The Vigogne models were mostly instruction fine-tuned from other foundation models.

More information can be found in [vigogne/training](docs/training.md).

8 changes: 7 additions & 1 deletion docs/inference.md
@@ -2,7 +2,13 @@

This repository offers multiple options for inference and deployment, including Google Colab notebooks, Gradio demos, [FastChat](https://github.com/lm-sys/FastChat), and [vLLM](https://vllm.ai). It also offers guidance on conducting experiments using [llama.cpp](https://github.com/ggerganov/llama.cpp) on your personal computer.

Thanks to the contributions by [TheBloke](https://huggingface.co/TheBloke), some of the Vigogne models have been quantized to [GGML](https://github.com/ggerganov/ggml) format (compatible with [llama.cpp](https://github.com/ggerganov/llama.cpp), [text-generation-webui](https://github.com/oobabooga/text-generation-webui), [ctransformers](https://github.com/marella/ctransformers), etc.) and [GPTQ](https://github.com/IST-DASLab/gptq) format (compatible with [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)). These formats facilitate testing and development. You can find these models on the [Hugging Face Hub](https://huggingface.co/models?sort=trending&search=TheBloke+vigogne).
## Quantized Models

Quantized versions of several Vigogne models are generously provided by [TheBloke](https://huggingface.co/TheBloke)!

These versions facilitate testing and development with various popular frameworks, including [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [vLLM](https://github.com/vllm-project/vllm), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [llama.cpp](https://github.com/ggerganov/llama.cpp), [text-generation-webui](https://github.com/oobabooga/text-generation-webui), and more.

You can find these models on the [Hugging Face Hub](https://huggingface.co/models?search=TheBloke/vigo).
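
As a quick smoke test, a GPTQ-quantized checkpoint can be loaded directly with 🤗 Transformers. The following is a minimal sketch: the repository id is illustrative, and the `auto-gptq` and `optimum` packages are assumed to be installed.

```python
# Minimal sketch: load a GPTQ-quantized Vigogne checkpoint from the Hub and generate a reply.
# The repo id below is illustrative; pick any of TheBloke's GPTQ repos.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Vigogne-2-7B-Instruct-GPTQ"  # illustrative repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Expliquez la différence entre LoRA et QLoRA."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```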

## Google Colab Notebook

8 changes: 4 additions & 4 deletions docs/training.md
@@ -2,12 +2,12 @@

## Supervised Fine-tuning

For efficient LLM fine-tuning, we utilize a technique called [low-rank adaptation (LoRA)](https://arxiv.org/abs/2106.09685) from 🤗 Hugging Face's [PEFT](https://github.com/huggingface/peft) library. This approach involves freezing the base model's weights and introducing a small number of learnable parameters.
For efficient LLM fine-tuning, we use [low-rank adaptation (LoRA)](https://arxiv.org/abs/2106.09685) from 🤗 Hugging Face's [PEFT](https://github.com/huggingface/peft) library. This involves freezing the base model's parameters and introducing a small number of learnable parameters.
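
As a rough illustration, wrapping a base model with LoRA adapters via PEFT looks like the sketch below; the base model and hyperparameters are placeholders, not the exact values used for Vigogne.

```python
# Minimal sketch: attach LoRA adapters to a frozen base model with PEFT.
# Model name and hyperparameters are illustrative only.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling applied to the LoRA update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA parameters remain trainable
```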

Additionally, for practitioners without access to GPUs with ample memory, it's advisable to consider quantizing certain computations to either 8-bit or 4-bit precision using [LLM.int8()](https://arxiv.org/abs/2208.07339) or [QLoRA](https://arxiv.org/abs/2305.14314). Be aware that this might lead to a minor reduction in speed compared to fp16 or bf16 versions.
For those with limited GPU memory, it's recommended to quantize certain computations to 8-bit or 4-bit precision using [LLM.int8()](https://arxiv.org/abs/2208.07339) or [QLoRA](https://arxiv.org/abs/2305.14314). Note that this might result in a slight training slowdown compared to the fp16 or bf16 versions.
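
A minimal QLoRA-style sketch of 4-bit loading with `bitsandbytes` is shown below; the base model and settings are illustrative, not necessarily the configuration used here.

```python
# Minimal sketch: load the base model in 4-bit (NF4) before attaching LoRA adapters.
# Requires the bitsandbytes package; model name and settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the de-quantized matmuls
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # illustrative base model
    quantization_config=bnb_config,
    device_map="auto",
)
```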

We highly recommend the utilization of tools such as [DeepSpeed](https://github.com/microsoft/DeepSpeed) or [FSDP](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api), particularly when engaged in distributed learning scenarios. When dealing with long sequences, [FlashAttention](https://arxiv.org/abs/2307.08691) becomes crucial to speed up training and reduce memory usage.
Tools like [DeepSpeed](https://github.com/microsoft/DeepSpeed) or [FSDP](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api) are highly recommended for distributed learning. [FlashAttention](https://arxiv.org/abs/2307.08691) is essential for speeding up training and reducing memory usage with long sequences.
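
For long sequences, FlashAttention can typically be enabled when loading the model; a minimal sketch follows (the exact argument name depends on your `transformers` version, and the base model is a placeholder).

```python
# Minimal sketch: load a model with FlashAttention-2 enabled.
# Requires the flash-attn package and a supported GPU; argument name varies across transformers versions.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",               # illustrative base model
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # older versions use use_flash_attention_2=True
)
```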

More examples can be found in [examples](https://github.com/bofenghuang/vigogne/blob/main/examples/train).

Since version 2.2, I've refactored the training code, integrating specific elements inspired by the excellent training framework [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl). Thanks to the Axolotl team for their contributions to the open-source community! The primary motivation behind maintaining my own framework is to have full control over the entire training process and customize it to my specific needs. I highly recommend using Axolotl for additional features.
*Since version 2.2, I've refactored the training code, integrating specific elements inspired by the excellent training framework [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl). Thanks to the Axolotl team for their contributions to the open-source community! The primary motivation behind maintaining my own framework is to have full control over the entire training process and customize it to my specific needs. I highly recommend using Axolotl for additional features.*
