Add date and commit hash to gguf metadata #2728

Open · wants to merge 4 commits into master
Conversation

@akawrykow (Contributor) commented Aug 23, 2023

There was some discussion on #2707 about including some extra metadata in the .gguf files to more easily figure out when and how the file was generated.

This PR adds:

  • Date/time in ISO format
  • The current git commit hash (short format)
    • Under the hood we are calling git rev-parse --short HEAD to get this

We are also printing these fields alongside general.name, etc.
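For reference, a minimal Python sketch of that approach (the helper name and the skip-on-failure fallback are illustrative, not the PR's exact code):

```python
import subprocess
from datetime import datetime, timezone
from typing import Optional

def get_short_commit_hash() -> Optional[str]:
    """Return the short git commit hash, or None if git is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None

# ISO-8601 timestamp in UTC, e.g. "2023-08-23T17:05:00+00:00"
creation_date = datetime.now(timezone.utc).isoformat()
commit_hash = get_short_commit_hash()
```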

To test the change, I ran:

python3 convert.py ./models/7B/ --ctx 2048

and then cracked open the .gguf file in Notepad++ and saw the corresponding values:

[screenshot: gguf metadata]

and saw the values were logged when running inference:

[screenshot: inference log output]

@slaren (Collaborator) commented Aug 23, 2023

I think it's a good idea to add this metadata, but convert.py should not depend on git to work.

@akawrykow (Contributor, author)

@slaren fair enough. Open to suggestions for some other way of generating a 'commit hash'. Unless you just mean adding some extra safety to how we're generating it now, and leaving it as 'N/A' when git isn't set up

@KerfuffleV2 (Collaborator)

Unless you just mean adding some extra safety to how we're generating it now

That sounds like a reasonable approach to me. If fetching the hash fails, though, I'd suggest just not adding the field rather than adding it with a value that doesn't convey any information.

I'm also not sure it should be a general field. Just for example, suppose I write a different tool to convert models. Do I put my commit hash in that field? If I do, the field kind of becomes worthless, because you have no way to differentiate between a commit hash from the official convert.py and one from kerfufflev2converter1000.py.

Maybe instead of a commit hash it should just be a fairly free-form general.generated_by field carrying information about what tool was used and its version.
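For illustration, writing such a field with gguf-py's GGUFWriter might look like this (the key name general.generated_by is the proposal above, not an existing spec key):

```python
import gguf

writer = gguf.GGUFWriter("out.gguf", arch="llama")
# Free-form provenance string: tool name plus whatever version info it has,
# so third-party converters can identify themselves unambiguously.
writer.add_string("general.generated_by", "convert.py (llama.cpp, commit abc1234)")
```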

@monatis (Collaborator) commented Aug 23, 2023

I'm preparing a gguf writing utility as a pip-installable package. Maybe adding the package version makes more sense?

@KerfuffleV2 (Collaborator)

It just occurred to me that this is going to make validating models with SHA256 or whatever a lot harder. The actual data for the model could be 100% the same, but if two models were generated one second apart they'd have a different hash.

Previously you could use something like a SHA256 to verify the content of a model; if stuff like the date/build is baked in, you can only verify that a file is a copy of one specific generated model. Cases where that matters include developers verifying that their changes don't functionally change the result, troubleshooting user issues, etc.

@akawrykow (Contributor, author)

@KerfuffleV2 what if we added some utility script for generating this hash? It could load the .gguf file and pick out the fields/data that make sense to include, rather than taking a hash of the whole file.
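A sketch of what such a utility could do, assuming a reader API along the lines of gguf-py's later GGUFReader (which postdates this discussion):

```python
import hashlib
import gguf  # assumes gguf-py's GGUFReader, added after this discussion

def content_hash(path: str) -> str:
    """Hash only the tensor payload, skipping volatile metadata like date/commit."""
    reader = gguf.GGUFReader(path)
    digest = hashlib.sha256()
    # Sort by name so the digest is independent of on-disk tensor order.
    for tensor in sorted(reader.tensors, key=lambda t: t.name):
        digest.update(tensor.name.encode("utf-8"))
        digest.update(tensor.data.tobytes())
    return digest.hexdigest()
```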

@KerfuffleV2 (Collaborator)

what if we added some utility script for generating this hash?

Well, it's better than nothing. :) I think there's still a disadvantage though because you have to have people generate this specific special type of "hash". HF includes a SHA256 with files published there, but you won't be able to use that or look there to find the hash that has to be used for comparing GGUF files.

@akawrykow (Contributor, author)

I'm preparing a gguf writing utility as a pip-installable package. Maybe adding the package version makes more sense?

@monatis hmm but in theory this gguf utility could stay the same while the model contents change due to differences in convert.py or the quantization. I think we would want to know which llama.cpp commit produced the model, so we would know what the state of the conversion/quantization looked like at the time.

@akawrykow (Contributor, author) commented Aug 23, 2023

HF includes a SHA256 with files published there, but you won't be able to use that or look there to find the hash that has to be used for comparing GGUF files.

@KerfuffleV2 I'm a total noob in this space, but are people already publishing .gguf files on HF? It looks like HF has a bunch of cool stuff, like being able to run inference directly on their site:

[screenshot: HF hosted inference widget]

Do we already have a ggml integration on HF? If not, maybe there is a hook there for defining this model hash.

@KerfuffleV2 (Collaborator)

HF's stuff doesn't really support GGML/GGUF at all. The hosted inference only works with models that can run via Transformers.

I'd be surprised if they ever allow stuff like hosted inference of GGUF models.

@akawrykow (Contributor, author)

@KerfuffleV2 I briefly skimmed this and it kind of sounded like it was possible: https://huggingface.co/docs/hub/models-adding-libraries#integrate-your-library-with-the-hub

@Green-Sky (Collaborator)

what if we added some utility script for generating this hash?

Well, it's better than nothing. :) I think there's still a disadvantage though because you have to have people generate this specific special type of "hash". HF includes a SHA256 with files published there, but you won't be able to use that or look there to find the hash that has to be used for comparing GGUF files.

Actually, why not include the hash of the tensors as metadata? That way a standalone tool can validate the file itself.
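A sketch of that idea at write time (the per-tensor key naming here is hypothetical):

```python
import hashlib
import numpy as np

def tensor_hash_metadata(tensors: dict) -> dict:
    """Map each tensor name to a SHA256 of its raw data, e.g. for storage
    under hypothetical metadata keys like 'hash.sha256.<tensor name>'."""
    return {
        f"hash.sha256.{name}":
            hashlib.sha256(np.ascontiguousarray(data).tobytes()).hexdigest()
        for name, data in tensors.items()
    }
```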

@Green-Sky (Collaborator) commented Aug 23, 2023

[...] it just should be a fairly free-form general.generated_by field that could have information about what tool was used and its version.

A possible (if artificial) chain of gguf files could be: first convert, then quantize, maybe quantize again, then apply a LoRA, then another, then requantize. We might be interested in tracking the chain of previous file(s), e.g. by hash or something similar.

edit: and then the history of a LoRA, or a merge (avg) of two or more models, would make it a tree history... oh no, I think I rediscovered git.

@KerfuffleV2 (Collaborator)

I briefly skimmed this and it kind of sounded like it was possible

I guess I might have been too pessimistic there. I'm almost positive there's currently no interface for hosted inference with GGML/GGUF though.

@monatis (Collaborator) commented Aug 24, 2023

@akawrykow

hmm but in theory this gguf utility could stay the same while the model contents change due to differences in convert.py or the quantization.

Ah yes. Maybe a hash of the conversion script file itself? :) Its advantage over a commit hash is that the output file stays the same unless the conversion script is updated, and the model file can still be validated by SHA256.
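A minimal sketch of that, assuming the conversion script hashes its own source file:

```python
import hashlib

# Stable across runs until the script itself changes, unlike a timestamp.
with open(__file__, "rb") as f:
    script_hash = hashlib.sha256(f.read()).hexdigest()
```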

@monatis (Collaborator) commented Aug 24, 2023

@akawrykow and @KerfuffleV2

Some info about HF services / products, so we're all on the same page:

  • Hosted Inference is a free, rate-limited service by HF that allows use of selected Transformers models over an HTTP API.
  • Inference Endpoints is a paid service that allows deployment of any Transformers model or custom container on HF infrastructure for dedicated / private use.
  • HF Hub is an artifact hosting service on top of Git and Git LFS that powers huggingface.co/models. There is a separate Python package, a dependency of the main transformers package, that is used to download models from and upload them to this service. It handles caching models on the user's machine etc., and it is possible to develop your own model hub on top of it; there's no need for involvement from HF staff. Later, it could back a function like, for example, gguf_model_from_pretrained(const char * repo_path, enum llama_ftype): if the model is not found in the cache dir, it is automatically downloaded and cached (see the sketch below). I don't think it will be adopted by Gerganov, though; it is more suitable for downstream repos like rustformers etc.
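For context, the huggingface_hub download-and-cache primitive that such a wrapper would sit on looks roughly like this (repo id and filename are illustrative):

```python
from huggingface_hub import hf_hub_download

# Downloads the file on first use and serves it from the local cache afterwards.
path = hf_hub_download(
    repo_id="someuser/some-llama-gguf",  # illustrative repo id
    filename="model.Q4_K_M.gguf",        # illustrative filename
)
```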

@rankaiyx (Contributor) commented Dec 8, 2023

I think this is one way to do it:
The conversion program calculates a hash of the valid content and adds that hash, plus the generation time, to the metadata.
The inference program gains a switch that enables a hash check when loading the model: it recalculates the hash of the valid content and compares it with the hash recorded in the metadata.
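A sketch of that load-time check, reusing the content-hash recipe from the earlier sketch (the metadata key name is hypothetical):

```python
import hashlib
import gguf  # assumes gguf-py's GGUFReader

def verify_model(path: str, stored_hash: str) -> bool:
    """Recompute the hash of the valid content and compare it to the value
    recorded in metadata (e.g. a hypothetical 'general.content_hash' key)."""
    digest = hashlib.sha256()
    for tensor in sorted(gguf.GGUFReader(path).tensors, key=lambda t: t.name):
        digest.update(tensor.name.encode("utf-8"))
        digest.update(tensor.data.tobytes())
    return digest.hexdigest() == stored_hash
```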

@mofosyne (Collaborator)

What's everyone's take on this idea? Is it still a good idea these days?

@mofosyne added the "obsolete?" label (Marker for potentially obsolete PR) on May 25, 2024