Add support for quantized OPT models and refactor #295
Diff (quantized-model loader module):

```diff
@@ -7,28 +7,20 @@
 import modules.shared as shared
 
 sys.path.insert(0, str(Path("repositories/GPTQ-for-LLaMa")))
-from llama import load_quant
 
 
-# 4-bit LLaMA
-def load_quantized_LLaMA(model_name):
-    if shared.args.load_in_4bit:
-        bits = 4
+def load_quantized(model_name, model_type):
+    if model_type == 'llama':
+        from llama import load_quant
+    elif model_type == 'opt':
+        from opt import load_quant
     else:
-        bits = shared.args.gptq_bits
+        print("Unknown pre-quantized model type specified. Only 'llama' and 'opt' are supported")
+        exit()
 
     path_to_model = Path(f'models/{model_name}')
-    pt_model = ''
-    if path_to_model.name.lower().startswith('llama-7b'):
-        pt_model = f'llama-7b-{bits}bit.pt'
-    elif path_to_model.name.lower().startswith('llama-13b'):
-        pt_model = f'llama-13b-{bits}bit.pt'
-    elif path_to_model.name.lower().startswith('llama-30b'):
-        pt_model = f'llama-30b-{bits}bit.pt'
-    elif path_to_model.name.lower().startswith('llama-65b'):
-        pt_model = f'llama-65b-{bits}bit.pt'
-    else:
-        pt_model = f'{model_name}-{bits}bit.pt'
+    pt_model = f'{model_name}-{shared.args.gptq_bits}bit.pt'
 
     # Try to find the .pt both in models/ and in the subfolder
     pt_path = None
```

Review comments on the removed `llama-7b` / `llama-13b` / `llama-30b` / `llama-65b` block:

- Removing this block breaks compatibility with folder names like …
- I moved the code back.
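For context, the "Try to find the .pt both in models/ and in the subfolder" step that follows this hunk amounts to checking two candidate locations for the checkpoint. The sketch below illustrates that lookup under those assumptions; the function and variable names are hypothetical, not the repository's actual code:

```python
from pathlib import Path

def find_pt_file(model_name, pt_model):
    # Look for the quantized checkpoint both directly under models/
    # and inside the model's own folder, e.g. models/<model_name>/.
    candidates = [Path('models') / pt_model,
                  Path('models') / model_name / pt_model]
    for candidate in candidates:
        if candidate.exists():
            return candidate
    return None  # the caller prints "Could not find ..." and exits
```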
```diff
@@ -40,7 +32,7 @@ def load_quantized_LLaMA(model_name):
         print(f"Could not find {pt_model}, exiting...")
         exit()
 
-    model = load_quant(path_to_model, str(pt_path), bits)
+    model = load_quant(path_to_model, str(pt_path), shared.args.gptq_bits)
 
     # Multiple GPUs or GPU+CPU
     if shared.args.gpu_memory:
```
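With the new signature, the model-loading code would presumably dispatch on the new flags along these lines. This is a minimal sketch of the call site, assuming the module path and surrounding loader function; it is not the PR's exact wiring:

```python
import modules.shared as shared
from modules.quant_loader import load_quantized  # module path assumed for illustration

def load_model(model_name):
    # --gptq-bits > 0 selects the pre-quantized (GPTQ) code path;
    # --gptq-model-type picks the loader ('llama' or 'opt').
    if shared.args.gptq_bits > 0:
        return load_quantized(model_name, shared.args.gptq_model_type)
    # ...otherwise fall through to the regular FP16 / 8-bit loading path
```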
Diff (command-line argument definitions):

```diff
@@ -68,8 +68,8 @@ def str2bool(v):
 parser.add_argument('--cai-chat', action='store_true', help='Launch the web UI in chat mode with a style similar to Character.AI\'s. If the file img_bot.png or img_bot.jpg exists in the same folder as server.py, this image will be used as the bot\'s profile picture. Similarly, img_me.png or img_me.jpg will be used as your profile picture.')
 parser.add_argument('--cpu', action='store_true', help='Use the CPU to generate text.')
 parser.add_argument('--load-in-8bit', action='store_true', help='Load the model with 8-bit precision.')
-parser.add_argument('--load-in-4bit', action='store_true', help='Load the model with 4-bit precision. Currently only works with LLaMA.')
-parser.add_argument('--gptq-bits', type=int, default=0, help='Load a pre-quantized model with specified precision. 2, 3, 4 and 8bit are supported. Currently only works with LLaMA.')
+parser.add_argument('--gptq-bits', type=int, default=0, help='Load a pre-quantized model with specified precision. 2, 3, 4 and 8bit are supported. Currently only works with LLaMA and OPT.')
+parser.add_argument('--gptq-model-type', type=str, default='llama', help='Model type of pre-quantized model. Currently only LLaMa and OPT are supported.')
 parser.add_argument('--bf16', action='store_true', help='Load the model with bfloat16 precision. Requires NVIDIA Ampere GPU.')
 parser.add_argument('--auto-devices', action='store_true', help='Automatically split the model across the available GPU(s) and CPU.')
 parser.add_argument('--disk', action='store_true', help='If the model is too large for your GPU(s) and CPU combined, send the remaining layers to the disk.')
```

Review thread on the new `--gptq-model-type` argument:

- Maybe we could infer the model type from the model name?
- @oobabooga Maybe do as you described, but keep this argument as a fallback if the folder name has no prefix? What do you think about it?
- That sounds like a good idea (making this argument optional).
- I implemented this. Tested it myself with different folder names and it works (but I recommend you check it too).
- So here is the logic: …
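The explanation after "So here is the logic" was cut off in this capture. Based on the discussion above (infer the type from the folder-name prefix, fall back to `--gptq-model-type` when there is no recognizable prefix), the behavior would look roughly like the sketch below; the function name and the exact prefix checks are assumptions, not the PR's actual code:

```python
import modules.shared as shared

def infer_gptq_model_type(model_name):
    # Guess the loader from the folder-name prefix, e.g. 'llama-7b-4bit'
    # or 'opt-6.7b-4bit'; fall back to the --gptq-model-type argument
    # when the name has no recognizable prefix.
    name = model_name.lower()
    if name.startswith('llama'):
        return 'llama'
    if name.startswith('opt'):
        return 'opt'
    return shared.args.gptq_model_type
```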
Review thread on removing `--load-in-4bit`:

- I'm reluctant to remove `--load-in-4bit` because that will certainly cause confusion, but I guess we can do it and move on.
- It's up to you, but I think it's bad practice to keep multiple arguments that do the same thing.
- Also, this `--load-in-4bit` argument was confusing because it works differently from `--load-in-8bit`.
- You have convinced me, let's ditch `--load-in-4bit`.