Error when using convert_hf_to_gguf.py for Q5_K_L #11088

JohnConnor123 · 2025-01-05T13:49:48Z

JohnConnor123
Jan 5, 2025

I needed Q5_K_L quantization (I saw it in Bartowski's lists). I decided to run the script with quantization, but 2/3 of the Bartowski quantizations were not found (see traceback below). Tell me how I can get Q5_K_L quantization. Googling didn't help(
P.s. My code: https://pastebin.com/RNzru1nQ and error:

Quantization start: ['python', '../llama.cpp/convert_hf_to_gguf.py', '--outfile', 'distilbert/distilbert-base-uncased-q5_k_l.gguf', '--outtype', 'q5_k_l', 'distilbert/distilbert-base-uncased']
usage: convert_hf_to_gguf.py [-h] [--vocab-only] [--outfile OUTFILE] [--outtype {f32,f16,bf16,q8_0,tq1_0,tq2_0,auto}]
                             [--bigendian] [--use-temp-file] [--no-lazy] [--model-name MODEL_NAME] [--verbose]
                             [--split-max-tensors SPLIT_MAX_TENSORS] [--split-max-size SPLIT_MAX_SIZE] [--dry-run]
                             [--no-tensor-first-split] [--metadata METADATA]
                             model
convert_hf_to_gguf.py: error: argument --outtype: invalid choice: 'q5_k_l' (choose from 'f32', 'f16', 'bf16', 'q8_0', 'tq1_0', 'tq2_0', 'auto')
Error while performing Q5_K_L quantization: Command '['python', '../llama.cpp/convert_hf_to_gguf.py', '--outfile', 'distilbert/distilbert-base-uncased-q5_k_l.gguf', '--outtype', 'q5_k_l', 'distilbert/distilbert-base-uncased']' returned non-zero exit status 2.

danbev · 2025-01-05T14:56:22Z

danbev
Jan 5, 2025
Collaborator

I think what needs to be done is to first convert to one of the supported types using convert_hf_to_gguf.py, and then use llama-quantize to quantize the converted model to one of the supported quantizations types:

 $ ./build/bin/llama-quantize --help
usage: ./build/bin/llama-quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type] [--token-embedding-type] [--override-kv] model-f32.gguf [model-quant.gguf] type [nthreads]

  --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
  --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
  --pure: Disable k-quant mixtures and quantize all tensors to the same type
  --imatrix file_name: use data in file_name as importance matrix for quant optimizations
  --include-weights tensor_name: use importance matrix for this/these tensor(s)
  --exclude-weights tensor_name: use importance matrix for this/these tensor(s)
  --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor
  --token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor
  --keep-split: will generate quantized model in the same shards as input
  --override-kv KEY=TYPE:VALUE
      Advanced option to override model metadata by key in the quantized model. May be specified multiple times.
Note: --include-weights and --exclude-weights cannot be used together

Allowed quantization types:
   2  or  Q4_0    :  4.34G, +0.4685 ppl @ Llama-3-8B
   3  or  Q4_1    :  4.78G, +0.4511 ppl @ Llama-3-8B
   8  or  Q5_0    :  5.21G, +0.1316 ppl @ Llama-3-8B
   9  or  Q5_1    :  5.65G, +0.1062 ppl @ Llama-3-8B
  19  or  IQ2_XXS :  2.06 bpw quantization
  20  or  IQ2_XS  :  2.31 bpw quantization
  28  or  IQ2_S   :  2.5  bpw quantization
  29  or  IQ2_M   :  2.7  bpw quantization
  24  or  IQ1_S   :  1.56 bpw quantization
  31  or  IQ1_M   :  1.75 bpw quantization
  36  or  TQ1_0   :  1.69 bpw ternarization
  37  or  TQ2_0   :  2.06 bpw ternarization
  10  or  Q2_K    :  2.96G, +3.5199 ppl @ Llama-3-8B
  21  or  Q2_K_S  :  2.96G, +3.1836 ppl @ Llama-3-8B
  23  or  IQ3_XXS :  3.06 bpw quantization
  26  or  IQ3_S   :  3.44 bpw quantization
  27  or  IQ3_M   :  3.66 bpw quantization mix
  12  or  Q3_K    : alias for Q3_K_M
  22  or  IQ3_XS  :  3.3 bpw quantization
  11  or  Q3_K_S  :  3.41G, +1.6321 ppl @ Llama-3-8B
  12  or  Q3_K_M  :  3.74G, +0.6569 ppl @ Llama-3-8B
  13  or  Q3_K_L  :  4.03G, +0.5562 ppl @ Llama-3-8B
  25  or  IQ4_NL  :  4.50 bpw non-linear quantization
  30  or  IQ4_XS  :  4.25 bpw non-linear quantization
  15  or  Q4_K    : alias for Q4_K_M
  14  or  Q4_K_S  :  4.37G, +0.2689 ppl @ Llama-3-8B
  15  or  Q4_K_M  :  4.58G, +0.1754 ppl @ Llama-3-8B
  17  or  Q5_K    : alias for Q5_K_M
  16  or  Q5_K_S  :  5.21G, +0.1049 ppl @ Llama-3-8B
  17  or  Q5_K_M  :  5.33G, +0.0569 ppl @ Llama-3-8B
  18  or  Q6_K    :  6.14G, +0.0217 ppl @ Llama-3-8B
   7  or  Q8_0    :  7.96G, +0.0026 ppl @ Llama-3-8B
   1  or  F16     : 14.00G, +0.0020 ppl @ Mistral-7B
  32  or  BF16    : 14.00G, -0.0050 ppl @ Mistral-7B
   0  or  F32     : 26.00G              @ 7B
          COPY    : only copy tensors, no quantizing

1 reply

JohnConnor123 Jan 5, 2025
Author

Thanks for reply! Btw, can I directly feed the model from hf_hub to the input? or do I need to at least do q8 quantization, saving it in gguf format?
P.s. I just want to reduce the amount of extra steps of using the console in my .py file

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error when using convert_hf_to_gguf.py for Q5_K_L #11088

{{title}}

Replies: 1 comment 1 reply

{{title}}

{{title}}

Select a reply

Error when using convert_hf_to_gguf.py for Q5_K_L #11088

JohnConnor123 Jan 5, 2025

Replies: 1 comment · 1 reply

danbev Jan 5, 2025 Collaborator

JohnConnor123 Jan 5, 2025 Author

JohnConnor123
Jan 5, 2025

Replies: 1 comment 1 reply

danbev
Jan 5, 2025
Collaborator

JohnConnor123 Jan 5, 2025
Author