
Why divert from the default GGML versions? #5

Open
LLukas22 opened this issue Jun 17, 2023 · 2 comments

Comments

@LLukas22

The introduction of extra version numbers (7 and 40) adds complexity to the ggml ecosystem.

Basically this could also be solved by simply reading the file magic: if it's a GGML file, don't read the version and disable mmap; if the magic is GGJT, read the version (as this format is versioned) and enable mmap.

This would also allow the creation of Falcon-7B ggjt files with mmap support.
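
A minimal sketch of that dispatch, assuming the usual llama.cpp magic constants (`ggml` for the legacy unversioned format, `ggjt` for the versioned, mmap-capable one); `detect_format` and `file_format` are illustrative names, not ggllm.cpp's actual API:

```cpp
#include <cstdint>
#include <cstdio>

// Magic values as used by llama.cpp-family loaders ("ggml" and "ggjt" read as a u32).
constexpr uint32_t FILE_MAGIC_GGML = 0x67676d6c; // legacy unversioned format, no mmap
constexpr uint32_t FILE_MAGIC_GGJT = 0x67676a74; // versioned format, mmap-capable

struct file_format {
    bool     versioned = false;
    bool     use_mmap  = false;
    uint32_t version   = 0;
};

// Hypothetical helper: decide how to load a file purely from its magic.
// Returns false on an unknown magic or a short read.
bool detect_format(std::FILE * f, file_format & out) {
    uint32_t magic = 0;
    if (std::fread(&magic, sizeof(magic), 1, f) != 1) {
        return false;
    }
    if (magic == FILE_MAGIC_GGML) {
        // Legacy GGML: no version field follows the magic, keep mmap off.
        out.versioned = false;
        out.use_mmap  = false;
        out.version   = 0;
        return true;
    }
    if (magic == FILE_MAGIC_GGJT) {
        // GGJT: a version field follows the magic and the format supports mmap.
        if (std::fread(&out.version, sizeof(out.version), 1, f) != 1) {
            return false;
        }
        out.versioned = true;
        out.use_mmap  = true;
        return true;
    }
    return false;
}
```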

@cmp-nct
Owner

cmp-nct commented Jun 17, 2023

That originated from the first falcon demo examples; I did not want to break compatibility with existing models.

I agree that the magic and versioning of ggml binaries should be used more, but it is exactly the same situation in llama.cpp, which uses the layer count as the model indicator, not the magic.

So both ggllm.cpp and llama.cpp use the layer count as the primary indicator of the model.
ggllm.cpp only looks at "version" if the layer count does not match, so it's not a big deal.
Leaving the version in for now shouldn't matter much; it does not break anything, and there might still be changes to the conversion.
Long term we can remove the version again. The layer count check is also not needed to determine the model type; we could just use the weight names or the KV head count as the indicator.
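
A sketch of that kind of heuristic, based on the published Falcon configurations (7B: 32 layers, 1 KV head; 40B: 60 layers, 8 KV heads) rather than on ggllm.cpp's actual loader; `detect_falcon_type` is a hypothetical helper:

```cpp
#include <cstdint>

enum class falcon_type { FALCON_7B, FALCON_40B, UNKNOWN };

// Illustrative heuristic: pick the model variant from hyperparameters read from the
// file header, preferring the KV head count over the layer count or any "version" field.
falcon_type detect_falcon_type(uint32_t n_layer, uint32_t n_head_kv) {
    // Falcon-40B uses grouped KV heads (8); Falcon-7B uses multi-query attention (1 KV head).
    if (n_head_kv == 8) return falcon_type::FALCON_40B;
    if (n_head_kv == 1) return falcon_type::FALCON_7B;

    // Fall back to the layer count (32 layers for 7B, 60 for 40B).
    if (n_layer == 32) return falcon_type::FALCON_7B;
    if (n_layer == 60) return falcon_type::FALCON_40B;

    return falcon_type::UNKNOWN;
}
```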

Regarding mmap and ggjt support: we already have that.

  1. Create the basic model using the Python script -> this produces an ugly ggml V0 binary.
  2. Use falcon_quantize to convert that V0 binary into a 3, 4, 5, 6, 8, 16, or 32 bit ggml v3 binary (ggjt/mmap).

@LLukas22
Author

I get your point about not needing to use the version number to determine the model type. However, in my project where I'm integrating Falcon support into rustformers/llm, the unusual version numbers cause issues. We have a universal GGML/GGJT loader in place that manages all loading tasks, built to work with the GGML and llama.cpp repos. With this setup, version numbers like 7 and 40 aren't recognized as valid.

I could create and quantize my own Falcon models in the v3 GGJT format, but that would result in various models online that are only compatible with certain libraries. That's something I'd rather avoid.

Maybe the new V4 file format will get implemented soon, and we can sidestep the issue of having a fragmented ecosystem.

maddes8cht pushed a commit to maddes8cht/ggllm.cpp.PR that referenced this issue Aug 29, 2023
* use hipblas based on cublas
* Update Makefile for the Cuda kernels
* Expand arch list and make it overrideable
* Fix multi GPU on multiple amd architectures with rocblas_initialize() (cmp-nct#5)
* add hipBLAS to README
* new build arg LLAMA_CUDA_MMQ_Y
* fix half2 decomposition
* Add intrinsics polyfills for AMD
* AMD assembly optimized __dp4a
* Allow overriding CC_TURING
* use "ROCm" instead of "CUDA"
* ignore all build dirs
* Add Dockerfiles
* fix llama-bench
* fix -nommq help for non CUDA/HIP

---------

Co-authored-by: YellowRoseCx <[email protected]>
Co-authored-by: ardfork <[email protected]>
Co-authored-by: funnbot <[email protected]>
Co-authored-by: Engininja2 <[email protected]>
Co-authored-by: Kerfuffle <[email protected]>
Co-authored-by: jammm <[email protected]>
Co-authored-by: jdecourval <[email protected]>