convert-hf : support bfloat16 conversion #7158
Conversation
Can try to add a small test to ci/run.sh that exercises the conversion. The test can download a small Mamba model and run a short ...
Based on your description, my assumption is that if the original weights are in bf16, you should convert with --outtype bf16 and then everything will just work a bit better. Otherwise, no other changes to the pipeline (from an end-user perspective) are needed. Is that correct? Or is there even some kind of auto-detection?
Yes, this is correct. I also just now found a way to make the bf16 conversion bit-exact identical to the ./quantize output from f32.
convert-hf-to-gguf.py (outdated)
@@ -2417,8 +2372,8 @@ def parse_args() -> argparse.Namespace:
         help="path to write to; default: based on input",
     )
     parser.add_argument(
-        "--outtype", type=str, choices=["f32", "f16"], default="f16",
-        help="output format - use f32 for float32, f16 for float16",
+        "--outtype", type=str, choices=["f32", "f16", "bf16"], default="f16",
Given most models come in bf16, wouldn't it make sense to set it as the default?
The conversion to bf16 is slightly slower and uses a bit more RAM than f16 conversion, due to the lack of native Numpy support, so I didn't change the default.
I'll see if I can auto-detect whether the model contains bf16 tensors (but it will most likely be too complicated). Otherwise, it does make sense to set bf16 as the default if it's widely used.
My concern with bf16 as the default is that f16 -> bf16 is more lossy than bf16 -> f16, since 3 bits of the mantissa are always lost in f16 -> bf16, while bf16 -> f16 only turns some very-close-to-zero values into zero, and big values get turned into inf (but such big values are usually not in model weights, see #7075 (comment)).
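To make the precision/range trade-off concrete, here is a small Numpy illustration of my own (not code from this PR), using plain bit manipulation since Numpy has no native bf16; the specific values are just examples:

```python
import numpy as np

def f32_to_bf16(a: np.ndarray) -> np.ndarray:
    # Truncate float32 values to bf16 by keeping only the top 16 bits
    # (ignoring round-to-nearest for simplicity).
    bits = a.astype(np.float32).view(np.uint32) & np.uint32(0xffff0000)
    return bits.view(np.float32)

x = np.array([1.0009765625], dtype=np.float32)  # 1 + 2**-10, exactly representable in f16
print(x.astype(np.float16))    # [1.001]  f16 keeps all 10 mantissa bits
print(f32_to_bf16(x))          # [1.]     bf16 keeps only 7, so the low mantissa bits are lost

big = np.array([70000.0], dtype=np.float32)     # well within bf16's 8-bit exponent range
print(f32_to_bf16(big))        # [69632.] survives, just with a coarser mantissa
print(big.astype(np.float16))  # [inf]    exceeds f16's ~65504 maximum (Numpy may warn)
```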
We only have rudimentary CPU-only bf16 support in ggml, so f16 is better for now.
(left comment on wrong account)
I'll see if I can auto-detect whether the model contains bf16 tensors (but it will most likely be too complicated). Otherwise, it does make sense to set bf16 as default if it's widely used.
@compilade could we not attempt to read it from config.json? It should have a torch_dtype in it.
@ggerganov, when you say CPU-only, I assume you're referring to inference, since all conversion and quantization are currently CPU-only?
could we not attempt to read from config.json? it should have a torch_dtype in it

@bartowski1182 yes, but not all models define that field, so I think a second guess based on the type of the first tensor in the model will sometimes be necessary. Some models also define one type in config.json and use another type in the model files. For example, https://huggingface.co/pansophic/rocket-3B has F16 tensors, but defines torch_dtype as bfloat16 in config.json.
Would you still like some kind of --outtype auto-f16 based on the model content even if f16 is kept as the default --outtype otherwise (due to its (slightly) faster conversion and more complete backend support)?
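For illustration, detection along those lines could look roughly like the sketch below; the helper name and exact fallback rules are assumptions for the sake of discussion, not the code this PR ends up with:

```python
import json
from pathlib import Path

import torch

def guess_output_type(model_dir: Path, first_tensor: torch.Tensor) -> str:
    # Hypothetical helper: prefer torch_dtype from config.json, but fall back to
    # the dtype of the first tensor, since not every model defines the field
    # (and some declare a dtype that doesn't match the actual model files).
    config = json.loads((model_dir / "config.json").read_text(encoding="utf-8"))
    declared = config.get("torch_dtype")  # e.g. "float16", "bfloat16", or absent
    if declared == "float16":
        return "f16"
    if declared in ("bfloat16", "float32"):
        return "bf16"  # keep the wider exponent range
    # No usable torch_dtype: guess from the first tensor instead.
    return "f16" if first_tensor.dtype == torch.float16 else "bf16"
```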
Can you explain why bf16 is only used for CPU? Will there be GPU support in the future?
@htdung167 This PR is only about conversion, and the convert script has always been CPU-only.
Inference (text generation) with bf16 was worked on by @jart in #6412. A relevant comment from there regarding future GPU support would be #6412 (comment).
I think '--outtype auto' might be fine, since at its core that's what it's doing
@bartowski1182 I agree. I've thought about this more, and I'll change this to auto instead of auto-f16. I don't think there will be an auto-anything-else anyway.
Is it possible to figure out the auto-chosen output type earlier and use it for naming the outfile?
Yes, this is already possible by checking the logs. It would also be possible to do automatically, but not everyone has the same naming conventions, so maybe the right way to do this would be with a .format() pattern? For example, --outfile llama-3-8b-instruct-{ftype}.gguf, or --outfile llama-3-8b-instruct-{outtype}.gguf, or --outfile llama-3-8b-instruct-{}.gguf. Not sure which to support (all?), but it should be clearer than %s. It would also be possible to allow using {FTYPE} or {OUTTYPE} for upper-cased type names.
I see you've used fp16 and fp32 in the past, but this will use f16 and f32, respectively, for these type names.
(EDIT: this is now implemented in e0af2df)
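For reference, .format()-based templating can cover all of these spellings at once. This is a hedged sketch of the idea (the helper name and exact behaviour are illustrative, not necessarily what e0af2df does):

```python
def fill_templated_filename(filename: str, output_type: str) -> str:
    # Substitute the chosen output type into an --outfile template such as
    # "llama-3-8b-instruct-{ftype}.gguf"; unused keys are simply ignored by str.format.
    lower = output_type.lower()  # e.g. "bf16"
    upper = output_type.upper()  # e.g. "BF16"
    return filename.format(lower,
                           outtype=lower, ftype=lower,
                           OUTTYPE=upper, FTYPE=upper)

print(fill_templated_filename("llama-3-8b-instruct-{ftype}.gguf", "bf16"))
# llama-3-8b-instruct-bf16.gguf
```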
Yeah, I've made the transition to f16/f32; it was an oversight on my part to name them 'fp'.
A format option would be amazing.
Should auto simply try to be as lossless as possible? Like, if the model is originally in f32, make the output f32? Or should it always select a 16-bit type? (Currently bf16 is selected for f32 models.)
My vote would be for compressing even if it was originally f32. If the person converting wants f32, they'll specify it; otherwise, presumably they're always converting with the intention of quantizing, where it won't matter.
"Do no harm" should be the default.
The quantization version was missing.
* convert-hf : don't round bf16 NANs
* convert-hf : save some memory with np.int16 intermediate bf16 weights
* convert-hf : more closely match llama.cpp with which weights to keep in f32
A reason for this to exist is for model quantizers who want an initial GGUF with the most fidelity to the original model while still using a 16-bit float type instead of 32-bit floats.
# same as ggml_compute_fp32_to_bf16 in ggml-impl.h
def np_fp32_to_bf16(n: np.ndarray):
    # force nan to quiet
    n = np.where((n & 0x7fffffff) > 0x7f800000, (n & 0xffff0000) | (64 << 16), n)
Everything looks good. This is the only line that bugs me. Not sure about the (64 << 16) for the logical OR. I'm knee-deep in a dataset, so my mental state is not 100% there right now. Could be nothing.
It's doing the equivalent of the corresponding lines of ggml_compute_fp32_to_bf16 in ggml-impl.h. The (64 << 16) came from adapting this while postponing the right shift until after rounding, and the (n & 0xffff0000) is there to avoid rounding up the NANs by setting the low bits to zero. It's a reflex from when programming in C/C++, I guess.
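Putting the pieces together, a full Numpy version of the conversion could look roughly like this. It is a reconstruction for illustration based on the behaviour described in this PR (quiet NaNs, subnormals flushed to zero, round to nearest even, 16-bit integer intermediate), not necessarily the exact merged code:

```python
import numpy as np

def np_fp32_to_bf16_sketch(fp32: np.ndarray) -> np.ndarray:
    # Work on the raw float32 bit patterns.
    n = fp32.astype(np.float32).view(np.uint32)
    # Force NaNs to quiet NaNs: clear the low 16 bits (so the rounding step
    # below cannot round them up) and set the quiet bit.
    n = np.where((n & 0x7fffffff) > 0x7f800000, (n & 0xffff0000) | (64 << 16), n)
    # Flush subnormals to (signed) zero.
    n = np.where((n & 0x7f800000) == 0, n & 0x80000000, n)
    # Round to nearest even, then keep the high 16 bits as the bf16 value.
    n = (n + (0x7fff + ((n >> 16) & 1))) >> 16
    # 16-bit intermediate; the PR mentions np.int16, uint16 is used here only
    # to keep the raw bits unsigned. Either way only the bit pattern matters.
    return n.astype(np.uint16)
```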
* convert-hf : rename --outtype auto-f16 to --outtype auto
This is looking really good. What are the next steps to get this merged? I can do some testing if that's what is needed.
Honestly, it's pretty much ready. But the newly-added ...
Conversion performance (speed-wise) for ...
If there is no objection, I would like to merge this at 2024-05-11 15:00 UTC.
I assume any slowdown in converting to bf16 is made up for by the speed of quantizing bf16 instead of f32. Actually, on that subject: since we can't run inference with bf16 on GPU, can we make an imatrix for bf16 with GPU? Looking forward to it! I may pull it to try out in the interim.
Yet.
@jart love the hint, haha. Yeah, I figured it's coming, but in the meantime I'm curious how it works: is GPU inference support required for imatrix on GPU?
As a follow-up to #7075 and #6412, this introduces proper lazy bfloat16 conversion in convert-hf-to-gguf.py with Numpy.
Numpy does not yet support bfloat16, but this is still possible.
This implementation, like the one in ggml-impl.h, makes nan quiet, flushes subnormals to zero, and rounds to the nearest even value. This means bf16 tensor data made with convert-hf-to-gguf.py should match exactly what ./quantize produces from f32 models.

Summary of changes
- Added gguf-py/gguf/lazy.py for this, which defines the LazyMeta metaclass and the LazyBase base class, which is used by both LazyNumpyTensor and LazyTorchTensor. (A toy illustration of the lazy-tensor idea follows this list.)
  - LazyTorchTensor is still defined in convert-hf-to-gguf.py to avoid a torch dependency in gguf-py.
  - A deque of lazy tensors is used per expression graph.
- bfloat16 conversion support
  - Added LlamaFileType in gguf-py/gguf/constants.py to get the correct ftype values as in llama_ftype from llama.h.
    - It is not named GGMLFileType because it's probably best to reserve this name for an enum analogous to ggml_ftype.
- --outtype auto to choose the highest-fidelity 16-bit floating point type according to the type of the first loaded tensor.
  - Uses f16 if the first tensor has dtype torch.float16, and uses bf16 otherwise, so that torch.float32 and torch.bfloat16 tensors keep their range.
- --outfile name templating
  - e.g. python3 convert-hf-to-gguf.py --outfile path/to/llama-3-8b-instruct-{ftype}.gguf --outtype auto ./path/to/Meta-Llama-3-8B-Instruct/, and still get the automatically-chosen output type in the name.
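As a toy illustration of the lazy-tensor idea mentioned above (a simplified sketch for intuition only, not the actual LazyMeta/LazyBase implementation in gguf-py/gguf/lazy.py, which wraps the full Numpy and PyTorch APIs):

```python
import numpy as np

class ToyLazyTensor:
    # Minimal stand-in: record how to produce the data instead of producing it now.
    def __init__(self, thunk, shape, dtype):
        self._thunk = thunk
        self.shape = shape
        self.dtype = np.dtype(dtype)

    @classmethod
    def from_eager(cls, array: np.ndarray) -> "ToyLazyTensor":
        return cls(lambda: array, array.shape, array.dtype)

    def astype(self, dtype) -> "ToyLazyTensor":
        # Build a new node in the expression graph; nothing is computed yet.
        return ToyLazyTensor(lambda: self.materialize().astype(dtype), self.shape, dtype)

    def materialize(self) -> np.ndarray:
        # Only here does the actual work (and memory use) happen,
        # e.g. when the tensor is finally written to the GGUF file.
        return self._thunk()

lazy = ToyLazyTensor.from_eager(np.ones((2, 3), dtype=np.float32)).astype(np.float16)
print(lazy.dtype, lazy.shape)  # metadata is known without computing anything
print(lazy.materialize())      # data is computed on demand
```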
Testing

Note: the checksum of a model converted with
$ python3 convert-hf-to-gguf.py --outfile ./models/ggml-model-bf16.gguf --outtype bf16 ./path/to/model_dir/
and one converted with ... SHOULD EXACTLY MATCH (as of 95930da).

- bf16 and f32 quantized to bf16 (with ./quantize) result in the same output at --temp 1 given the same seed with https://huggingface.co/delphi-suite/v0-mamba-100k. Model made with --no-lazy matches the checksum of the lazily-converted model. (sha256sum: 272030278cb50b8b1eece85e175e1681c2aeabc430330a38380d2e441099a996)
- bf16 compared to ./quantize output from f32 (sha256sum: 560d1f957d9cb77ddcbbe558cbbb394d18af0500f402be25cb6b2292a3b3cc8a)
- bf16 compared to ./quantize output from f32 (sha256sum: d6e6e1977ffa9cc365d425c107d7a79770b9425cab9db04d695e092fedd00d72)

(relevant for at least @jart, @teleprint-me, @bartowski1182)