
Why divert from the default GGML versions? #5

Open
LLukas22 opened this issue Jun 17, 2023 · 2 comments

Comments

@LLukas22

The introduction of extra version numbers (7 and 40) adds complexity to the ggml ecosystem.

Basically this could also be solved by simply reading the file magic: if it's a GGML file, don't read the version and disable mmap; if the magic is GGJT, read the version (as this format is versioned) and enable mmap.

This would also allow the creation of Falcon-7B ggjt files with mmap support.
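
A minimal sketch of that dispatch, assuming the usual llama.cpp magic constants (`ggml` for the legacy unversioned format, `ggjt` for the versioned, mmap-capable one); `detect_format` and `file_format` are illustrative names, not ggllm.cpp's actual API:

```cpp
#include <cstdint>
#include <cstdio>

// Magic values as used by llama.cpp-family loaders ("ggml" and "ggjt" read as a u32).
constexpr uint32_t FILE_MAGIC_GGML = 0x67676d6c; // legacy unversioned format, no mmap
constexpr uint32_t FILE_MAGIC_GGJT = 0x67676a74; // versioned format, mmap-capable

struct file_format {
    bool     versioned = false;
    bool     use_mmap  = false;
    uint32_t version   = 0;
};

// Hypothetical helper: decide how to load a file purely from its magic.
// Returns false on an unknown magic or a short read.
bool detect_format(std::FILE * f, file_format & out) {
    uint32_t magic = 0;
    if (std::fread(&magic, sizeof(magic), 1, f) != 1) {
        return false;
    }
    if (magic == FILE_MAGIC_GGML) {
        // Legacy GGML: no version field follows the magic, keep mmap off.
        out.versioned = false;
        out.use_mmap  = false;
        out.version   = 0;
        return true;
    }
    if (magic == FILE_MAGIC_GGJT) {
        // GGJT: a version field follows the magic and the format supports mmap.
        if (std::fread(&out.version, sizeof(out.version), 1, f) != 1) {
            return false;
        }
        out.versioned = true;
        out.use_mmap  = true;
        return true;
    }
    return false;
}
```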

@cmp-nct
Owner

cmp-nct commented Jun 17, 2023

That originated from the first falcon demo examples; I did not want to break compatibility with existing models.

I agree that the magic and versioning of ggml binaries should be used more, but it is exactly the same situation in llama.cpp, which uses the layer count as the model indicator, not the magic.

So both ggllm.cpp and llama.cpp use the layer count as the primary indicator of the model.
ggllm.cpp only looks at "version" if the layer count does not match, so it's not a big deal.
Leaving the version in for now shouldn't matter much; it does not break anything, and there might still be changes to the conversion.
Long term we can remove the version again. The layer count check is also not needed to determine the model type; we could just use the weight names or the KV head count as the indicator.
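
A sketch of that kind of heuristic, based on the published Falcon configurations (7B: 32 layers, 1 KV head; 40B: 60 layers, 8 KV heads) rather than on ggllm.cpp's actual loader; `detect_falcon_type` is a hypothetical helper:

```cpp
#include <cstdint>

enum class falcon_type { FALCON_7B, FALCON_40B, UNKNOWN };

// Illustrative heuristic: pick the model variant from hyperparameters read from the
// file header, preferring the KV head count over the layer count or any "version" field.
falcon_type detect_falcon_type(uint32_t n_layer, uint32_t n_head_kv) {
    // Falcon-40B uses grouped KV heads (8); Falcon-7B uses multi-query attention (1 KV head).
    if (n_head_kv == 8) return falcon_type::FALCON_40B;
    if (n_head_kv == 1) return falcon_type::FALCON_7B;

    // Fall back to the layer count (32 layers for 7B, 60 for 40B).
    if (n_layer == 32) return falcon_type::FALCON_7B;
    if (n_layer == 60) return falcon_type::FALCON_40B;

    return falcon_type::UNKNOWN;
}
```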

Regarding mmap and ggjt support: we already have that.

  1. Create the basic model using the Python script -> this produces an ugly ggml V0 binary.
  2. Use falcon_quantize to convert that V0 binary into a 3, 4, 5, 6, 8, 16, or 32 bit ggml v3 binary (ggjt/mmap).

@LLukas22
Author

I get your point about not needing to use the version number to determine the model type. However, in my project where I'm integrating Falcon support into rustformers/llm, the unusual version numbers cause issues. We have a universal GGML/GGJT loader in place that manages all loading tasks, built to work with the GGML and llama.cpp repos. With this setup, version numbers like 7 and 40 aren't recognized as valid.

I could create and quantize my own Falcon models in the v3 GGJT format, but that would result in various models online that are only compatible with certain libraries. That's something I'd rather avoid.

Maybe the new V4 file format will get implemented soon, and we can sidestep the issue of having a fragmented ecosystem.

maddes8cht pushed a commit to maddes8cht/ggllm.cpp.PR that referenced this issue Aug 29, 2023
* use hipblas based on cublas
* Update Makefile for the Cuda kernels
* Expand arch list and make it overrideable
* Fix multi GPU on multiple amd architectures with rocblas_initialize() (cmp-nct#5)
* add hipBLAS to README
* new build arg LLAMA_CUDA_MMQ_Y
* fix half2 decomposition
* Add intrinsics polyfills for AMD
* AMD assembly optimized __dp4a
* Allow overriding CC_TURING
* use "ROCm" instead of "CUDA"
* ignore all build dirs
* Add Dockerfiles
* fix llama-bench
* fix -nommq help for non CUDA/HIP

---------

Co-authored-by: YellowRoseCx <[email protected]>
Co-authored-by: ardfork <[email protected]>
Co-authored-by: funnbot <[email protected]>
Co-authored-by: Engininja2 <[email protected]>
Co-authored-by: Kerfuffle <[email protected]>
Co-authored-by: jammm <[email protected]>
Co-authored-by: jdecourval <[email protected]>