Support for different data types (float16, float32) #93
Conversation
Problem:
- Only float32 is currently supported, which requires 2x the memory and is slower to execute, as inference is memory bound.

Solution:
- Add the ability to export the model to float16.
- Add support for inference with float16.

Note:
- It's quite hard to do this generically in pure C (without templates), so to avoid adding too much complexity a compilation option has been chosen. Ideally this would be a run-time pick based on the value stored in the config, but that requires additional complexity which I wanted to avoid; that can still be explored with a proper solution.
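For context, a minimal sketch of what the compile-time option could look like, assuming the type is passed as a `DTYPE` macro on the compiler command line; the build flags and function signature below are illustrative, not the exact diff:

```c
/* Sketch only: pick the weight type at build time, e.g.
 *   cc -O3 -DDTYPE=_Float16 run.c -o run   (fp16, needs _Float16 support)
 *   cc -O3 run.c -o run                    (default: fp32)
 */
#ifndef DTYPE
#define DTYPE float
#endif

/* Inference is memory bound, so halving the element size roughly halves
 * the bytes streamed per output element. Accumulate in float regardless. */
void matmul(float *out, const DTYPE *x, const DTYPE *w, int n, int d) {
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += (float)w[i * n + j] * (float)x[j];
        }
        out[i] = val;
    }
}
```

Whether activations and the KV cache also move to fp16 is a separate decision; the sketch only illustrates the build-time switch on the operand type.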
@krzysztof-jusiak Hey there, what approach would be used for BF16? Thanks!
For clang,
@krzysztof-jusiak: Hi, do you think it makes sense to just move from float32 to float16? It seems float16 is always faster than float32. Using a single data type would be simpler both in the code and in the number of modelxxx.bin files (already 3, plus llama 7B and counting...)
```diff
-The export will take ~10 minutes or so and generate a 26GB file (the weights of the 7B model in float32) called `llama2_7b.bin` in the current directory. It has been [reported](https://github.com/karpathy/llama2.c/pull/85) that despite efforts, the 13B export currently doesn't work for unknown reaons (accepting PRs for fix). We can run the model as normal:
+The export will take ~1 minute or so and generate a 26GB file (the weights of the 7B model in float32) called `llama2_7b.bin` in the current directory. It has been [reported](https://github.com/karpathy/llama2.c/pull/85) that despite efforts, the 13B export currently doesn't work for unknown reaons (accepting PRs for fix). We can run the model as normal:
```
reaons -> reasons
@xefoci7612 I think there is a need for different data types as they provide different accuracy and differ in size, etc. Ideally we would operate on max-precision floating point only, but that's not possible due to performance/size limitations. However, smaller models or better hardware allow the use of higher-precision versions. Since inference is memory bound, quantization is usually also applied for faster speeds, which compresses the model even more, to for example 4 bits, but that comes with accuracy trade-offs which I don't think can be made by default for everyone.
No, float16 is not always faster than float32. If the processor does not natively support FP16 arithmetic, then it will be emulated in software.
will def take a look here. one thing to be careful with is that if you want to inference in fp16 you must train in fp16 (with gradient scalers), and not in fp32 or bf16. Otherwise the range of activations can overflow.
@krzysztof-jusiak what are the benefits of fp16?
also my understanding is this would invalidate all our previous checkpoints because they don't contain dtype in the config. Which is on me for having chosen a dumb serialization :D
Thanks for the hint about the training, it's a really valid point. Regarding the benefits of fp16, it's mainly performance: the weights are smaller, and since the computations are memory bound the speedup should be noticeable. That said, I've not noticed a huge difference between fp16 and fp32 on my machine, which doesn't add up with my previous experiments with other LLMs; I'd expect a somewhat bigger difference.
Backwards compatibility can be maintained by assuming that a missing dtype implies fp32.
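As a rough illustration of that fallback (the header layout here is purely hypothetical; the current checkpoint format stores no dtype at all), a loader could treat a legacy-sized header as fp32:

```c
#include <stddef.h>

/* Hypothetical sketch: old checkpoints carry no dtype byte in the header,
 * so a header that only contains the legacy config fields is read as fp32. */
typedef enum { DT_F32 = 0, DT_F16 = 1 } DType;

DType dtype_from_header(const unsigned char *header, size_t header_bytes,
                        size_t legacy_header_bytes) {
    if (header_bytes <= legacy_header_bytes) {
        return DT_F32; /* missing dtype implies fp32 */
    }
    /* assumed encoding: one byte right after the legacy config fields */
    return header[legacy_header_bytes] == DT_F16 ? DT_F16 : DT_F32;
}
```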
```diff
@@ -17,6 +17,10 @@ Then run with:
 #include <fcntl.h>
 #include <sys/mman.h>
 
+#ifndef DTYPE
```
It's better to typedef dtype rather than spreading DTYPE macro all over.
```diff
-#ifndef DTYPE
+#ifdef DTYPE
+typedef DTYPE dtype;
+#else
+typedef float dtype;
+#endif
```
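To illustrate what the suggestion buys (not part of the actual diff; the struct and signature below are made up for the example), the rest of the code can then refer to `dtype` without ever repeating the macro:

```c
/* The typedef from the suggestion above, repeated here so the
 * snippet is self-contained. */
#ifdef DTYPE
typedef DTYPE dtype;
#else
typedef float dtype;
#endif

/* Illustrative only: signatures and weight structs stay macro-free. */
typedef struct {
    dtype *wq; /* query projection weights */
    dtype *wk; /* key projection weights */
    dtype *wv; /* value projection weights */
} AttentionWeights;

void rmsnorm(float *out, const dtype *x, const dtype *weight, int size);
```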
Ok I took a look but I think the required changes are a little bit too yucky, and we're not seeing strong evidence of much better results. I do like that the files would have been smaller by ~half... Anyway I'm leaning to not include this atm but ty. If someone can demonstrate a solid higher throughput I think it's worth revisiting.