Add date and commit hash to gguf metadata #2728

Open · wants to merge 4 commits into master
Conversation

@akawrykow (Contributor) commented Aug 23, 2023

There was some discussion on #2707 about including some extra metadata in the .gguf files to more easily figure out when and how the file was generated.

This PR adds:

  • Date/time in ISO format
  • The current git commit hash (short format)
    • Under the hood we are calling git rev-parse --short HEAD to get this

We are also printing these fields alongside general.name, etc.
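For reference, a minimal Python sketch of that approach (the helper name and the skip-on-failure fallback are illustrative, not the PR's exact code):

```python
import subprocess
from datetime import datetime, timezone
from typing import Optional

def get_short_commit_hash() -> Optional[str]:
    """Return the short git commit hash, or None if git is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None

# ISO-8601 timestamp in UTC, e.g. "2023-08-23T17:05:00+00:00"
creation_date = datetime.now(timezone.utc).isoformat()
commit_hash = get_short_commit_hash()
```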

To test the change, I ran:

python3 convert.py ./models/7B/ --ctx 2048

and then cracked open the .gguf file in Notepad++ and saw the corresponding values:

[screenshot: gguf metadata]

and saw the values were logged when running inference:

[screenshot: inference log output]

@slaren (Collaborator) commented Aug 23, 2023

I think it's a good idea to add this metadata, but convert.py should not depend on git to work.

@akawrykow (Contributor, author)

@slaren fair enough. Open to suggestions for some other way of generating a 'commit hash'. Unless you just mean adding some extra safety to how we're generating it now, and leaving it as 'N/A' when git isn't set up

@KerfuffleV2 (Collaborator)

Unless you just mean adding some extra safety to how we're generating it now

That sounds like a reasonable approach to me. If fetching the hash fails, though, I'd suggest just not adding the field rather than adding it with a value that doesn't convey any information.

I'm also not sure it should be a general field. Just for example, suppose I write a different tool to convert models. Do I put my commit hash in that field? If I do, the field kind of becomes worthless, because you have no way to differentiate between a commit hash from the official convert.py and one from kerfufflev2converter1000.py.

Maybe instead of a commit hash it should just be a fairly free-form general.generated_by field carrying information about what tool was used and its version.
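For illustration, writing such a field with gguf-py's GGUFWriter might look like this (the key name general.generated_by is the proposal above, not an existing spec key):

```python
import gguf

writer = gguf.GGUFWriter("out.gguf", arch="llama")
# Free-form provenance string: tool name plus whatever version info it has,
# so third-party converters can identify themselves unambiguously.
writer.add_string("general.generated_by", "convert.py (llama.cpp, commit abc1234)")
```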

@monatis (Collaborator) commented Aug 23, 2023

I'm preparing a gguf writing utility as a pip-installable package. Maybe adding the package version makes more sense?

@KerfuffleV2 (Collaborator)

It just occurred to me that this is going to make validating models with SHA256 or whatever a lot harder. The actual data for the model could be 100% the same, but if two models were generated one second apart they'd have a different hash.

Previously you could use something like a SHA256 to verify the content of a model; if stuff like the date/build is baked in, you can only verify that a file is a copy of one specific generated model. Cases where that matters include developers verifying that their changes don't functionally change the result, troubleshooting user issues, etc.

@akawrykow (Contributor, author)

@KerfuffleV2 what if we added some utility script for generating this hash? It could load the .gguf file and pick out the fields/data that make sense to include, rather than taking a hash of the whole file.
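A sketch of what such a utility could do, assuming a reader API along the lines of gguf-py's later GGUFReader (which postdates this discussion):

```python
import hashlib
import gguf  # assumes gguf-py's GGUFReader, added after this discussion

def content_hash(path: str) -> str:
    """Hash only the tensor payload, skipping volatile metadata like date/commit."""
    reader = gguf.GGUFReader(path)
    digest = hashlib.sha256()
    # Sort by name so the digest is independent of on-disk tensor order.
    for tensor in sorted(reader.tensors, key=lambda t: t.name):
        digest.update(tensor.name.encode("utf-8"))
        digest.update(tensor.data.tobytes())
    return digest.hexdigest()
```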

@KerfuffleV2 (Collaborator)

what if we added some utility script for generating this hash?

Well, it's better than nothing. :) I think there's still a disadvantage though because you have to have people generate this specific special type of "hash". HF includes a SHA256 with files published there, but you won't be able to use that or look there to find the hash that has to be used for comparing GGUF files.

@akawrykow (Contributor, author)

I'm preparing a gguf writing utility as a pip-installable package. Maybe adding the package version makes more sense?

@monatis hmm but in theory this gguf utility could stay the same while the model contents change due to differences in convert.py or the quantization. I think we would want to know which llama.cpp commit produced the model, so we would know what the state of the conversion/quantization looked like at the time.

@akawrykow (Contributor, author) commented Aug 23, 2023

HF includes a SHA256 with files published there, but you won't be able to use that or look there to find the hash that has to be used for comparing GGUF files.

@KerfuffleV2 I'm a total noob in this space, but are people already publishing .gguf files on HF? It looks like HF has a bunch of cool stuff, like being able to run inference directly on their site:

[screenshot: HF hosted inference widget]

Do we already have a ggml integration on HF? If not, maybe there is a hook there for defining this model hash.

@KerfuffleV2 (Collaborator)

HF's stuff doesn't really support GGML/GGUF at all. The hosted inference only works with models that can run via Transformers.

I'd be surprised if they ever allow stuff like hosted inference of GGUF models.

@akawrykow (Contributor, author)

@KerfuffleV2 I briefly skimmed this and it kind of sounded like it was possible: https://huggingface.co/docs/hub/models-adding-libraries#integrate-your-library-with-the-hub

@Green-Sky (Collaborator)

what if we added some utility script for generating this hash?

Well, it's better than nothing. :) I think there's still a disadvantage though because you have to have people generate this specific special type of "hash". HF includes a SHA256 with files published there, but you won't be able to use that or look there to find the hash that has to be used for comparing GGUF files.

Actually, why not include the hash of the tensors as metadata? That way a standalone tool can validate the file itself.
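A sketch of that idea at write time (the per-tensor key naming here is hypothetical):

```python
import hashlib
import numpy as np

def tensor_hash_metadata(tensors: dict) -> dict:
    """Map each tensor name to a SHA256 of its raw data, e.g. for storage
    under hypothetical metadata keys like 'hash.sha256.<tensor name>'."""
    return {
        f"hash.sha256.{name}":
            hashlib.sha256(np.ascontiguousarray(data).tobytes()).hexdigest()
        for name, data in tensors.items()
    }
```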

@Green-Sky (Collaborator) commented Aug 23, 2023

[...] it just should be a fairly free-form general.generated_by field that could have information about what tool was used and its version.

A possible (if artificial) chain of gguf files could be: first convert, then quantize, maybe quantize again, then apply a LoRA, then another, then requantize. We might be interested in tracking the chain of previous file(s), e.g. by hash or something similar.

edit: and then the history of a LoRA, or a merge (avg) of two or more models, would make it a tree history... oh no, I think I rediscovered git.

@KerfuffleV2 (Collaborator)

I briefly skimmed this and it kind of sounded like it was possible

I guess I might have been too pessimistic there. I'm almost positive there's currently no interface for hosted inference with GGML/GGUF though.

@monatis (Collaborator) commented Aug 24, 2023

@akawrykow

hmm but in theory this gguf utility could stay the same while the model contents change due to differences in convert.py or the quantization.

Ah yes. Maybe a hash of the conversion script file itself? :) Its advantage over a commit hash is that the output file stays the same unless the conversion script is updated, and the model file can still be validated by SHA256.
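A minimal sketch of that, assuming the conversion script hashes its own source file:

```python
import hashlib

# Stable across runs until the script itself changes, unlike a timestamp.
with open(__file__, "rb") as f:
    script_hash = hashlib.sha256(f.read()).hexdigest()
```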

@monatis (Collaborator) commented Aug 24, 2023

@akawrykow and @KerfuffleV2

Some info about HF services / products, so we're all on the same page:

  • Hosted Inference is a free, rate-limited service by HF that allows use of selected Transformers models over an HTTP API.
  • Inference Endpoints is a paid service that allows deployment of any Transformers model or custom container on HF infrastructure for dedicated / private use.
  • HF Hub is an artifact hosting service on top of Git and Git LFS that powers huggingface.co/models. There is a separate Python package, a dependency of the main transformers package, that is used to download models from and upload them to this service. It handles caching models on the user's machine etc., and it is possible to develop your own model hub on top of it; there's no need for involvement from HF staff. Later, it could back a function like, for example, gguf_model_from_pretrained(const char * repo_path, enum llama_ftype): if the model is not found in the cache dir, it is automatically downloaded and cached (see the sketch below). I don't think it will be adopted by Gerganov, though; it is more suitable for downstream repos like rustformers etc.
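For context, the huggingface_hub download-and-cache primitive that such a wrapper would sit on looks roughly like this (repo id and filename are illustrative):

```python
from huggingface_hub import hf_hub_download

# Downloads the file on first use and serves it from the local cache afterwards.
path = hf_hub_download(
    repo_id="someuser/some-llama-gguf",  # illustrative repo id
    filename="model.Q4_K_M.gguf",        # illustrative filename
)
```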

@rankaiyx (Contributor) commented Dec 8, 2023

I think this is one way to do it:
The conversion program calculates a hash of the valid content and adds that hash, plus the generation time, to the metadata.
The inference program gains a switch that enables a hash check when loading the model: it recalculates the hash of the valid content and compares it with the hash recorded in the metadata.
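A sketch of that load-time check, reusing the content-hash recipe from the earlier sketch (the metadata key name is hypothetical):

```python
import hashlib
import gguf  # assumes gguf-py's GGUFReader

def verify_model(path: str, stored_hash: str) -> bool:
    """Recompute the hash of the valid content and compare it to the value
    recorded in metadata (e.g. a hypothetical 'general.content_hash' key)."""
    digest = hashlib.sha256()
    for tensor in sorted(gguf.GGUFReader(path).tensors, key=lambda t: t.name):
        digest.update(tensor.name.encode("utf-8"))
        digest.update(tensor.data.tobytes())
    return digest.hexdigest() == stored_hash
```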

@mofosyne (Collaborator)

What's everyone's take on this idea? Is it still a good idea these days?

@mofosyne added the "obsolete?" label (Marker for potentially obsolete PR) on May 25, 2024