Support for different data types (float16, float32) #93
Conversation
Problem:
- Only float32 is currently supported, which requires 2x the memory and is slower to execute, as inference is memory bound.

Solution:
- Add the ability to export the model to float16.
- Add support for inference with float16.

Note:
- It's quite hard to do this generically in pure C (without templates), so to avoid adding too much complexity a compilation option has been chosen. Ideally this would be a run-time pick based on the value stored in the config, but that requires additional complexity which I wanted to avoid; that can still be explored with a proper solution.
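For context, a minimal sketch of what the compile-time option could look like, assuming the type is passed as a `DTYPE` macro on the compiler command line; the build flags and function signature below are illustrative, not the exact diff:

```c
/* Sketch only: pick the weight type at build time, e.g.
 *   cc -O3 -DDTYPE=_Float16 run.c -o run   (fp16, needs _Float16 support)
 *   cc -O3 run.c -o run                    (default: fp32)
 */
#ifndef DTYPE
#define DTYPE float
#endif

/* Inference is memory bound, so halving the element size roughly halves
 * the bytes streamed per output element. Accumulate in float regardless. */
void matmul(float *out, const DTYPE *x, const DTYPE *w, int n, int d) {
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += (float)w[i * n + j] * (float)x[j];
        }
        out[i] = val;
    }
}
```

Whether activations and the KV cache also move to fp16 is a separate decision; the sketch only illustrates the build-time switch on the operand type.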
@krzysztof-jusiak Hey there, what approach would be used for BF16? Thanks!
For clang,
@krzysztof-jusiak: Hi, do you think it makes sense to just move from float32 to float16? It seems float16 is always faster than float32. Using a single data type would be simpler both in the code and in the number of modelxxx.bin files (already 3, plus llama 7B and counting...)
```diff
-The export will take ~10 minutes or so and generate a 26GB file (the weights of the 7B model in float32) called `llama2_7b.bin` in the current directory. It has been [reported](https://github.com/karpathy/llama2.c/pull/85) that despite efforts, the 13B export currently doesn't work for unknown reaons (accepting PRs for fix). We can run the model as normal:
+The export will take ~1 minute or so and generate a 26GB file (the weights of the 7B model in float32) called `llama2_7b.bin` in the current directory. It has been [reported](https://github.com/karpathy/llama2.c/pull/85) that despite efforts, the 13B export currently doesn't work for unknown reaons (accepting PRs for fix). We can run the model as normal:
```
reaons -> reasons
@xefoci7612 I think there is a need for different data types as they provide different accuracy and differ in size, etc. Ideally we would operate on max-precision floating point only, but that's not possible due to performance/size limitations. However, smaller models or better hardware allow the use of higher-precision versions. Since inference is memory bound, quantization is usually also applied for faster speeds, which compresses the model even more, to for example 4 bits, but that comes with accuracy trade-offs which I don't think can be made by default for everyone.
No, float16 is not always faster than float32. If the processor does not natively support FP16 arithmetic, then it will be emulated in software.
will def take a look here. one thing to be careful with is that if you want to inference in fp16 you must train in fp16 (with gradient scalers), and not in fp32 or bf16. Otherwise the range of activations can overflow.
@krzysztof-jusiak what are the benefits of fp16?
also my understanding is this would invalidate all our previous checkpoints because they don't contain dtype in the config. Which is on me for having chosen a dumb serialization :D
Thanks for the hint about the training, it's a really valid point. Regarding the benefits of fp16, it's mainly performance: the weights are smaller, and since the computations are memory bound the speedup should be noticeable. That said, I've not noticed a huge difference between fp16 and fp32 on my machine, which doesn't add up with my previous experiments with other LLMs; I'd expect a somewhat bigger difference.
Backwards compatibility can be maintained by assuming that a missing dtype implies fp32.
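As a rough illustration of that fallback (the header layout here is purely hypothetical; the current checkpoint format stores no dtype at all), a loader could treat a legacy-sized header as fp32:

```c
#include <stddef.h>

/* Hypothetical sketch: old checkpoints carry no dtype byte in the header,
 * so a header that only contains the legacy config fields is read as fp32. */
typedef enum { DT_F32 = 0, DT_F16 = 1 } DType;

DType dtype_from_header(const unsigned char *header, size_t header_bytes,
                        size_t legacy_header_bytes) {
    if (header_bytes <= legacy_header_bytes) {
        return DT_F32; /* missing dtype implies fp32 */
    }
    /* assumed encoding: one byte right after the legacy config fields */
    return header[legacy_header_bytes] == DT_F16 ? DT_F16 : DT_F32;
}
```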
```diff
@@ -17,6 +17,10 @@ Then run with:
 #include <fcntl.h>
 #include <sys/mman.h>
 
+#ifndef DTYPE
```
It's better to typedef dtype rather than spreading DTYPE macro all over.
```diff
-#ifndef DTYPE
+#ifdef DTYPE
+typedef DTYPE dtype;
+#else
+typedef float dtype;
+#endif
```
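To illustrate what the suggestion buys (not part of the actual diff; the struct and signature below are made up for the example), the rest of the code can then refer to `dtype` without ever repeating the macro:

```c
/* The typedef from the suggestion above, repeated here so the
 * snippet is self-contained. */
#ifdef DTYPE
typedef DTYPE dtype;
#else
typedef float dtype;
#endif

/* Illustrative only: signatures and weight structs stay macro-free. */
typedef struct {
    dtype *wq; /* query projection weights */
    dtype *wk; /* key projection weights */
    dtype *wv; /* value projection weights */
} AttentionWeights;

void rmsnorm(float *out, const dtype *x, const dtype *weight, int size);
```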
Ok I took a look but I think the required changes are a little bit too yucky, and we're not seeing strong evidence of much better results. I do like that the files would have been smaller by ~half... Anyway I'm leaning to not include this atm but ty. If someone can demonstrate a solid higher throughput I think it's worth revisiting.