Skip to content
This repository has been archived by the owner on Jan 24, 2024. It is now read-only.

Allow loading model with BitsAndBytes 4bit quantization, PEFT LoRA adapters. #203

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

idoru
Copy link

@idoru idoru commented Jun 4, 2023

Also supports loading PEFT LoRA adapters with MODEL_PEFT=true.
For detail on 4bit quantization options, see: https://huggingface.co/blog/4bit-transformers-bitsandbytes

Implements #202

@idoru idoru force-pushed the 4bit-quantization branch from 008f8f0 to a368310 Compare June 4, 2023 23:45
@idoru idoru changed the title Allow loading model with 4bit quantization. Allow loading model with BitsAndBytes 4bit quantization, PEFT LoRA adapters. Jun 4, 2023
@LoopControl
Copy link

LoopControl commented Jun 5, 2023

It might be better to split the QLora stuff from the Peft Lora adapter support.

Qlora/4bit requires latest/git-master version of transformers, accelerate, and such (and I don't see that listed in the requirements.txt on this PR).

Lora-adapter support should be possible without bleeding edge versions of transformers though so that'd be great to get merged in first.

@idoru
Copy link
Author

idoru commented Jun 5, 2023

Thanks for the review! I'm very new to working on Python codebases, so haven't fully got the hang of the dependency management workflows and gotchas. I'll split them as you suggested, and fix the requirements.

@peakji
Copy link
Member

peakji commented Jun 9, 2023

Huggingface finally released QLoRa-supported versions of transformers and accelerate, which allows us to add basic 4-bit quantization support in #209.

Maybe you can simplify this PR to include only PEFT stuffs? Of course it would also be easier if you want to add more detailed options for 4-bit quantization, as dependencies are no longer an issue.

@idoru idoru force-pushed the 4bit-quantization branch from a0299fa to 4877353 Compare June 14, 2023 20:44
@idoru idoru force-pushed the 4bit-quantization branch from 4877353 to ef138a0 Compare June 14, 2023 20:46
@idoru
Copy link
Author

idoru commented Jun 15, 2023

Hi, thanks for the feedback. I've updated the PR now. Tested with my very amateur QLoRA model with the following:

MODEL_TRUST_REMOTE_CODE=true \
MODEL_LOAD_IN_4BIT=true \
MODEL_4BIT_QUANT_TYPE=nf4 \
MODEL_4BIT_DOUBLE_QUANT=true \
MODEL_PEFT=true \
MODEL=idoru/falcon-40b-nf4dq-chat-oasst1-2epoch-v2 \
PORT=8080 \
python -m basaran

@codecov-commenter
Copy link

codecov-commenter commented Jun 15, 2023

Codecov Report

Patch coverage: 36.84% and project coverage change: -2.61 ⚠️

Comparison is base (1677491) 94.29% compared to head (33a37c1) 91.69%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #203      +/-   ##
==========================================
- Coverage   94.29%   91.69%   -2.61%     
==========================================
  Files           7        7              
  Lines         333      349      +16     
==========================================
+ Hits          314      320       +6     
- Misses         19       29      +10     
Impacted Files Coverage Δ
basaran/model.py 83.52% <25.00%> (-5.01%) ⬇️
basaran/__init__.py 96.87% <100.00%> (+0.32%) ⬆️

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@peakji peakji added the enhancement New feature or request label Jun 16, 2023
)
from peft import (
PeftConfig,
PeftModel
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that PeftModel is not being used. Are you sure that PEFT is working correctly? (The GitHub actions environment does not have a GPU for testing)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops 😅 well that explains why I wasn't seeming to get any results from my LoRA fine tunings.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I added loading. But it only works with 4bit with dev version of peft. If loading 4bit with peft 0.3.0 then it will error on inference.

load_in_4bit=True,
bnb_4bit_quant_type=quant_type,
bnb_4bit_use_double_quant=double_quant,
bnb_4bit_compute_dtype=torch.bfloat16,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoding bnb_4bit_compute_dtype to bfloat16 is indeed a reasonable choice. But similarly, can't we use the recommended configuration for bnb_4bit_quant_type and bnb_4bit_use_double_quant in most scenarios? In fact, I prefer to keep only the load_in_4bit option to reduce user confusion. What do you think?

Reference: https://huggingface.co/blog/4bit-transformers-bitsandbytes#advanced-usage

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree it's a lot more configuration options. I just was not sure how much people are playing around with different options, so I put them all in!

I think hardcoding "nf4" might be reasonable for the 4bit quant type as the QLoRA literature recommends this, even though the default for BitsAndBytesConfig is "fp4".

double_quant seems to only be suggested if memory constrained. Maybe the 0.4 bits are not worth it? If you think so, then I can hardcode to False and remove the optional config?

Finally, with bnb_4bit_compute_dtype I was not very sure of the tradeoffs - while QLoRA supposedly uses bfloat16 the default for BitsAndBytesConfig is float32. The reference seems to suggest bfloat16 it's for faster training, I thought it might be the same for inference, but that's not explicitly called out. Maybe it's only a memory saving benefit for inference??? So is this decision a good one? 🤔

Sorry for all the questions, I'm still trying to level up on ML code, and once again appreciate the feedback!

basaran/model.py Show resolved Hide resolved
@@ -12,3 +12,5 @@ safetensors~=0.3.1
torch>=1.12.1
transformers[sentencepiece]~=4.30.1
waitress~=2.1.2
peft~=0.3.0
scipy~=1.10.1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The missing dependency is a bug in bitsandbytes: bitsandbytes-foundation/bitsandbytes#426

Instead of specifying the version of the indirect dependency, I suggest waiting for bitsandbytes to fix the issue in version 0.39.0.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out. It did feel a bit weird that I had to add this!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants