Allow loading model with BitsAndBytes 4bit quantization, PEFT LoRA adapters. #203
base: master
Conversation
It might be better to split the QLoRA stuff from the PEFT LoRA adapter support. QLoRA/4-bit requires the latest/git-master versions of transformers, accelerate, and such (and I don't see that listed in the requirements). LoRA adapter support should be possible without bleeding-edge versions of transformers, though, so it'd be great to get that merged in first.
Thanks for the review! I'm very new to working on Python codebases, so I haven't fully got the hang of the dependency-management workflows and gotchas. I'll split them as you suggested and fix the requirements.
Hugging Face finally released QLoRA-supporting versions of transformers and accelerate, which allows us to add basic 4-bit quantization support in #209. Maybe you can simplify this PR to include only the PEFT stuff? Of course, it would also be easier now if you want to add more detailed options for 4-bit quantization, as dependencies are no longer an issue.
Hi, thanks for the feedback. I've updated the PR now. Tested with my very amateur QLoRA model with the following:
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## master #203 +/- ##
==========================================
- Coverage 94.29% 91.69% -2.61%
==========================================
Files 7 7
Lines 333 349 +16
==========================================
+ Hits 314 320 +6
- Misses 19 29 +10
☔ View full report in Codecov by Sentry.
)
from peft import (
    PeftConfig,
    PeftModel
It seems that PeftModel is not being used. Are you sure that PEFT is working correctly? (The GitHub Actions environment does not have a GPU for testing.)
Oops 😅 well, that explains why I didn't seem to be getting any results from my LoRA fine-tunings.
So I added the PeftModel loading. It only works in 4-bit with the dev version of peft, though; loading 4-bit with peft 0.3.0 errors at inference time.
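(For illustration, not code from this PR: a minimal sketch of loading a LoRA adapter on top of a 4-bit base model with the peft API; the adapter path and option values are placeholders.)

```python
import torch
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

adapter_path = "path/to/lora-adapter"  # placeholder: PEFT adapter directory or hub id

# The adapter's config records which base model it was trained on.
peft_config = PeftConfig.from_pretrained(adapter_path)

# Load the base model in 4-bit (needs bitsandbytes; a recent peft for 4-bit inference).
base_model = AutoModelForCausalLM.from_pretrained(
    peft_config.base_model_name_or_path,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(peft_config.base_model_name_or_path)

# Wrap the base model with the LoRA adapter weights.
model = PeftModel.from_pretrained(base_model, adapter_path)
```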
load_in_4bit=True,
bnb_4bit_quant_type=quant_type,
bnb_4bit_use_double_quant=double_quant,
bnb_4bit_compute_dtype=torch.bfloat16,
Hardcoding bnb_4bit_compute_dtype to bfloat16 is indeed a reasonable choice. But similarly, can't we use the recommended configuration for bnb_4bit_quant_type and bnb_4bit_use_double_quant in most scenarios? In fact, I prefer to keep only the load_in_4bit option to reduce user confusion. What do you think?
Reference: https://huggingface.co/blog/4bit-transformers-bitsandbytes#advanced-usage
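(For illustration, not code from this PR: a minimal sketch of what keeping only a load_in_4bit toggle might look like, with the other settings hardcoded to the values the linked blog post recommends; the helper name is hypothetical.)

```python
import torch
from transformers import BitsAndBytesConfig


def make_quantization_config(load_in_4bit: bool):
    """Return a BitsAndBytesConfig with the recommended 4-bit settings, or None when disabled."""
    if not load_in_4bit:
        return None
    return BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # QLoRA-recommended quant type
        bnb_4bit_use_double_quant=True,         # saves roughly 0.4 bits per parameter
        bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 rather than the fp32 default
    )
```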
Yes, I agree it adds a lot of configuration options. I just wasn't sure how much people are playing around with the different options, so I put them all in!
I think hardcoding "nf4" might be reasonable for the 4-bit quant type, as the QLoRA literature recommends it, even though the default for BitsAndBytesConfig is "fp4".
double_quant seems to be suggested only when memory-constrained. Maybe the 0.4 bits it saves are not worth it? If you think so, I can hardcode it to False and remove the optional config.
Finally, with bnb_4bit_compute_dtype I was not very sure of the tradeoffs: while QLoRA supposedly uses bfloat16, the default for BitsAndBytesConfig is float32. The reference seems to suggest bfloat16 is for faster training; I thought the same might hold for inference, but that's not explicitly called out. Maybe it's only a memory-saving benefit for inference? So is this decision a good one? 🤔
Sorry for all the questions, I'm still trying to level up on ML code, and once again I appreciate the feedback!
@@ -12,3 +12,5 @@ safetensors~=0.3.1
 torch>=1.12.1
 transformers[sentencepiece]~=4.30.1
 waitress~=2.1.2
+peft~=0.3.0
+scipy~=1.10.1
The missing dependency is due to a bug in bitsandbytes: bitsandbytes-foundation/bitsandbytes#426
Instead of pinning the version of this indirect dependency, I suggest waiting for bitsandbytes to fix the issue in version 0.39.0.
Thanks for pointing this out. It did feel a bit weird that I had to add this!
Also supports loading PEFT LoRA adapters with MODEL_PEFT=true.
For details on the 4-bit quantization options, see: https://huggingface.co/blog/4bit-transformers-bitsandbytes
Implements #202
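(For illustration, not code from this PR: a rough sketch of how a MODEL_PEFT=true toggle could drive the loading path; the environment-variable handling and the function name are assumptions about the surrounding code.)

```python
import os

from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM


def load_model(name_or_path: str):
    """Load either a plain model or a PEFT adapter, depending on MODEL_PEFT."""
    # Assumed convention: MODEL_PEFT=true means name_or_path points at a PEFT adapter.
    if os.environ.get("MODEL_PEFT", "").lower() == "true":
        peft_config = PeftConfig.from_pretrained(name_or_path)
        base = AutoModelForCausalLM.from_pretrained(peft_config.base_model_name_or_path)
        return PeftModel.from_pretrained(base, name_or_path)
    return AutoModelForCausalLM.from_pretrained(name_or_path)
```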