ggml : unified file format #220
Comments
technically speaking, we also had a GGMFv1, the one before the memory mapped GGJTv1 |
there is also the new .ggml wip file, which contains the computation graph. 3b697a2 |
Wonderful, I thought I would go for safetensors but they are not really willing to extend the spec for quantized dtypes. Obviously, I wanted to avoid GGML because there was no spec. If there is a spec, I am all in for this. BTW: I would also be super-excited about the graph-saving/loading, I thought I would refactor my logic to be agnostic over Model vs. Graph, because both just need input and have output, so for inference, it shouldn't matter if I have a graph or a real model. |
This is the first step toward a unified LLM API and interface that would handle any supported architecture. |
255 makes more sense if you're going to use a byte to store the length. Unless you want to always add +1 to that value.
It might make more sense to make something like If you're going to go through a bunch of trouble designing a model container format, it seems like it would make sense to make it something that could just generally be used. That would also mean tools that manipulate it wouldn't really have to care if it was GGML or some other type of model.
and similar - Why not use the architecture as the base key? A different architecture of model isn't necessarily going to have hidden layers, rotary, etc. It might have its own stuff. Just as an example, RWKV models don't even have attention heads. So instead, you'd have Did I miss it or does this not really describe how/where the actual tensors get defined? I actually like the SafeTensors approach quite a bit where the metadata just defines position and length. The only thing I'd change from that is adding a requirement that the tensor data has to start aligned. You'd want to pick an alignment that is pretty future-proof and works for most CPU architectures and most types. What is it GGML uses internally, 64 bytes? That wastes a little space but it's not enough to really matter. |
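The alignment requirement suggested above could look roughly like the following. This is a minimal sketch, not anything GGML or SafeTensors actually does: the 64-byte value is just the figure floated in the comment, and the names ALIGNMENT, align_up and layout_tensors are hypothetical. As in the SafeTensors approach, the metadata records only where each tensor starts and how long it is.

```python
from __future__ import annotations

ALIGNMENT = 64  # assumption: the 64-byte figure mentioned above, not a confirmed GGML value


def align_up(offset: int, alignment: int = ALIGNMENT) -> int:
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def layout_tensors(sizes: dict[str, int], data_start: int = 0) -> dict[str, tuple[int, int]]:
    """Assign each tensor an aligned (offset, length) pair, in insertion order."""
    layout = {}
    cursor = data_start
    for name, length in sizes.items():
        cursor = align_up(cursor)        # skip padding so the tensor starts aligned
        layout[name] = (cursor, length)  # metadata only needs position and length
        cursor += length
    return layout


# Hypothetical tensor names and byte sizes, purely for illustration.
layout = layout_tensors({"tok_embeddings": 100, "output": 30})
```

Here the second tensor starts at offset 128 rather than 100, so 28 bytes are spent on padding, which is the small space cost the comment refers to.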
Yep! We already implement this in llm in the Rust world, but we'd love to see upstream support for this and to begin consolidating the various examples into a cohesive framework so that we can all benefit.
Sure. I wasn't thinking about using a byte for the length, but that's entirely reasonable.
I'm not opposed, but I'd like to see a motivating case first. I think this is most likely to be implemented by all parties if we can agree on a reasonable extension from the original format.
No particular reason. I saw a commonality and merged them; if people decide using the architecture as base key makes more sense, I'm happy to go with that.
Yeah, this just uses the current GGJTv3 scheme in the interest of minimising the amount of work required to migrate to the format. No opposition to moving to a ST-like format from me, but I also don't feel particularly strongly about it. What does everyone else think? |
Even better (the value sets the key):
|
@klosax i'd say the ggml magic would take care of that - ideally non-ggml formats shouldn't be using it as a container format. No need to over engineer it (my 2c). |
Could we also include some optional generation parameters that contain default values for some sampling parameters? Or would that be too specific? |
I would recommend including stuff that's mainly essential for loading the model - things that are required for proper functioning. Samplers are technically not even dependent on the model - user is free to do with the output logits as they please. |
Agree with sampling parameters not being essential (especially since you can use whatever sampler you want with whatever model.) That being said, that reminds me - it might be a good idea to include suggested prompt formats as one of the standardised config parameters. Feel free to 👍 or 👎 this post if you think that's too extra. |
Hmm I think that will be fine as an optional parameter, but not as a standard parameter. Standard params should be stuff that are required for loading correctly, like use_parallel_residual and quantization types. Also prompt formats may not even make sense universally (they're kind of an instruct model thing). I have a model trained on literature, it has no prompt format, it just spews out prose. I also have another model that just generates long sequences of increasing numbers. Likewise... base llama has no prompt format. |
Is it possible that something like this could be useful https://github.com/khonsulabs/pot? It seems like at a high-level this discussion revolves around the best way to construct a self-describing data format, which is a problem that I think has already been addressed to a certain extent. More ideas here: https://github.com/yasammez/nachricht#prior-art |
Yes. Sorry, to clarify, when I say "standard" I don't mean they should be included in all models. It's just that if you do add a prompt format, you should call it something we've declared here, so that things that expect it know what to look for. I'll go through the list of k-v pairs up there to clarify which ones are required and which ones are standardised-in-name but otherwise optional, but I'll wait for feedback on the rest of the proposal first.
Normally, yeah, I'd just use a self-descriptive standard format. However, GGML/llama.cpp aim to be as dependency-free as possible, so something moderately bespoke but not too complex is more likely to be accepted by the wider community. |
Adopting a format or specification doesn't necessarily mean taking on any new dependencies, and it would allow for greater focus to be placed on the "secret sauce", which I think will be the standardized key/value pairs and what they are meant to specify. |
I think this will be needed to run inference in instruction mode on any instruction-tuned model. A key stating which supported standardized prompt format to use may be enough. If the key is missing, no instruction mode inference will be available. |
I recommend extending safetensors. Only libggml needs to load the model correctly anyways. See original discussion here: rustformers/llm#143 What extensions we need
|
Considering that the safetensors project already answers the question Yet another format? I think this is unambiguously the right thing to do. |
You'd have to fork it to do that, they don't seem interested in extending it. Based on existing discussion, it seems like they want to lock it down and reduce its extensibility further by, for example, forbidding gaps between tensors (even though the format currently would allow it since the metadata only says where tensors start and their length). |
I disagree. The format itself is very simple. The huggingface parser is not that good, and we need to write the parser in C (for ggml) anyways. The safetensors format is just a format. If we get enough people to use our version, then our version becomes the "official" one. |
I agree with all of the above, but that's basically what I'd call "forking" it. Taking that project and basing another one on it that takes a different approach, has different requirements, sets different restrictions, etc. quick edit: Probably also should add: While I'm not really a fan of the direction they seem to have chosen, I personally wouldn't use the approach of forcefully trying to take control away. If it was me, I'd start with the SafeTensors format but call it something different. |
I agree with Kerfuffle that that would be a non-ideal turn of events and would likely alienate an ecosystem that we should stay on good terms with.
In any case, I'd like to request that we keep discussion about switching formats or fundamentally changing the structure of this format out of this issue. Feel free to open another issue. I'm looking for a solution that solves the immediate issues the ecosystem is encountering at the least cost possible; we're not trying to find the perfect solution here, but the one that enables the most reusability / functionality at the least cost. This format's not perfect by any means, but it's simple, easy to work with and understand (i.e. can be parsed from C without too much suffering), and more importantly: it powers an existing ecosystem with inertia. The more complicated we make this change and the more parties we involve, the harder it will be to actually make the change. Let's keep it on track. |
we can name this .safetensors-ggml or something. |
Why would JSON give a higher quality than the current layout? Some models don't have a tokenizer.json, Replit uses spiece.model. How should such vocab be handled? To support any vocab, maybe a key like |
This makes the tokenizer config less portable. The tokenizer file is usually loaded by an external library from a file. |
Good question. For context: I wasn't aware of the existence of other ways to store the tokenization data, and I'd have to look into it. Do you have any further information about it that I could look into?
Is encoding the only thing that can diverge? I'll admit I am not too across the nuances here - my understanding is that the HF models have their complex tokenizers, and then the Python conversion scripts load those in and extract (token, score) tuples that a GGML executor can use to tokenize a string, except it may not account for all of the complexities of the original tokenizer.
Yes, that's why it's optional. The (token, score) scheme can still be used, but I'd like for users to be able to use the original HF tokenizers out of the box if possible. |
I thoroughly support any effort to produce a new format which will be future-proof and will protect against any more breaking changes. I know it's probably not on the cards but what I would really love is if this change would eventually lead to llama.cpp being able to load any GGML model, like GPTJ, MPT, etc. If that's not being considered then at least if a standardised format would allow for non-compatible clients to inform the user that this is not a supported model then that would help a lot. The idea of using safetensors sounds smart, although if it is used I think it'd be ideal to change the name for this fork of safetensors. I really like the idea of an embedded prompt template. Users are asking more and more for prompt templates to be communicated. Having that in the format itself sounds like a great idea. I have a feature request of my own: multi-part files. It'd be really helpful if this change could bring back support for multi-part GGML files. safetensors would support that natively I guess. This would be useful because of the Hugging Face Hub limit of 50GB per file, which prevents uploading 65B q8_0 models unless they're uploaded eg as a multi-part ZIP, which is messy and extra work for uploader and user alike. I could also imagine that in the future we might see some new larger models - perhaps a Falcon 80B for example - which might similarly not be possible to upload in the higher quant sizes. Multi-part GGML would solve that neatly. Great work, hope this gets implemented! |
Replit is implemented here. Look at the conversion script. It needs a special tokenizer implemented in main.cpp. In the MPT example you can see what had to be done to correctly encode (in convert script) and decode (in main.cpp) the gpt-neox vocab. Maybe the vocabs that are not JSON could be converted to it when creating the gguf file? |
Awesome! Yeah, I figured you might have a stake in this 😂
Agreed, that would be ideal. I left the possibility of this open in the future section:
but I'm not sure how far along the cgraph export/import functionality is, or how stable it is. I figured we can add that as an extension once that's solidified a bit. I'd be happy just to have
100% agreed - we were bouncing around ST support a couple months ago for I'm not opposed to the use of safetensors (we're likely to support it in
Glad to hear it. Do you have any suggestions for what that might look like/what needs to be supported?
Aaahhh, I did think about this but I'm not sure about it. I feel like that's conflating a distribution concern with a deployment concern; do you think you'd still need this if it weren't for the HF limit? Would it be a significant improvement over uploading multipart ZIPs?
Ah... I see... they have a custom sentencepiece tokenizer. Yeah, not sure how to best handle that. @Narsil is that something |
If a major file format change is going to happen again the tokenizer configs for the models using huggingface When encoding they should, after doing the "pretokenizing" stage with the regex, merge bigrams in the order they occur in the merges list, which will not necessarily get the same result as just taking the longest matching token. The logic in minGPT's implementation of GPT2's tokenizer is a good reference: https://github.com/karpathy/minGPT/blob/master/mingpt/bpe.py#L95 Tokens added after "training", the ones in the "added_tokens" section of Lastly, to totally match the behavior of I have a C++ implementation of enough of that to correctly encode ChatML prompts as used by MPT-7B-Chat at https://github.com/apage43/bpe.cpp but it depends on ICU for two things, the unicode normalization, which might be possible to live without, and the pretokenizing regex being unicode-aware when splitting on "letter" characters, which is somewhat important for handling non-English text. |
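The difference between merging bigrams in merges-list order and simply taking the longest matching token can be shown on a toy example. This is a sketch under stated assumptions: it starts from individual characters rather than GPT-2's byte-level symbols, it skips the pretokenizing regex and unicode normalization entirely, and the merges list and vocabulary are made up. The merge loop follows the same idea as the minGPT reference linked above (repeatedly merge the lowest-ranked adjacent pair), but it is not that code.

```python
def bpe_encode(word, merges):
    """Apply BPE merges to one pretokenized word, lowest-ranked pair first."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    parts = list(word)  # toy assumption: start from single characters
    while len(parts) > 1:
        pairs = [(parts[i], parts[i + 1]) for i in range(len(parts) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no adjacent pair appears in the merges list
        merged = []
        i = 0
        while i < len(parts):
            if i + 1 < len(parts) and (parts[i], parts[i + 1]) == best:
                merged.append(parts[i] + parts[i + 1])
                i += 2
            else:
                merged.append(parts[i])
                i += 1
        parts = merged
    return parts


def longest_match_encode(word, vocab):
    """Greedy longest-prefix tokenization, for comparison."""
    out = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                out.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i]!r}")
    return out


# Made-up merges and vocabulary: ("b", "c") is learned before ("a", "b").
merges = [("b", "c"), ("a", "b")]
vocab = {"a", "b", "c", "bc", "ab"}

bpe_tokens = bpe_encode("abc", merges)
greedy_tokens = longest_match_encode("abc", vocab)
```

With these inputs the merge-order encoder joins "b" and "c" first, while the greedy encoder grabs "ab" first, so the two produce different token sequences for the same string.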
Sorry about the delay, I'll get to making the PR within the next few days 👍
Seems reasonable. I'll account for this in the spec. Given that this might be a requirement for Metal loading, should we align both the metadata and tensor data to some large predefined alignment to ensure the models are always loadable?
Ah, yeah, I suppose we could always use separate fields. That would handle the case where there aren't any scores, either. Great, I'll account for that in the spec. Should there be support for nested arrays (e.g.
Is that different? Unless I'm mistaken, wouldn't that just be an extended version of this format with more KVs and tensors? |
Indeed, they are not really that different, just different from the current model files. |
Okay, I've written a first pass at the spec at #302! Have at it there - please make any further suggestions against that PR, so that we have a unified document/vision that we can update. |
It does not hurt to be in the spec and supported in the future. |
I’m confused about the claim that |
I think the main point against using safetensors is its JSON usage,
Can you elaborate as to why this is problematic? |
Hi! (Big fan of your work!) We were considering using safetensors before (and may still do in the future), but there are a few issues. Fundamentally, it boils down to a few things:
I think that safetensors will be supported by executors in the future - including the necessary extensions for GGML use - but GGUF's designed to resolve the issues with the current format while still leaving room for rapid evolution. |
I think this would be the case, and honestly, if that's the only blocker, I'd be happy to do it. I also believe we could use safetensors, if:
I am probably missing a few things (unicode?), but I think we could just say that our format is a subset of the safetensors (to make the parsing simpler) |
Aye, I think it would be possible to make safetensors work with enough work. My suspicion, though, is that the amount of work is on par with defining our own format, and it'd come with two disadvantages:
I think safetensors support is an excellent idea, but I don't think we can/should make it the primary format for this ecosystem until the rate of development slows down and things can be more standardised. |
It might be reasonable to support reading safetensors in but it probably doesn't make sense as a format for storing quantized models unless the quantization formats become more standardized (at the very least not exclusively used by ggml) |
I think if we are a subset then everybody can read us but we can't read any other model, which is IMHO fine. Reading itself of course does not mean it will be useful, the client still needs to know what is in the file but that's how it is with every format except graph-dumps.
The question is how they are eager to define/relax the spec and participate in joint development. The main benefit of safetensors is that we are trying to be forward-compatible here and burning brain cycles even when JSON has already solved all of that. The metadata itself is enough to describe anything, the rest are just tensor bytes. That said, I personally don't care that much about the format, the most important thing is to get it done and supported ASAP. The fragmentation is crazy ATM. So feel free to just ignore whatever I said :) |
Regarding nested JSONs
The conversation about nested JSON support is about the metadata field, correct? Metadata is (by definition) auxiliary information that is non-essential. If there’s data stored as metadata that you need to have present to run a model, it seems like something is being misused more than anything else.
Regarding the Spec
What info is desired that isn’t found here? Is there a standard way for file specs to be written? I’m happy to write up the desired document (consulting with Nicolas of course) if that would be helpful.
Regarding Everything Else
My primary interest is in improving the interoperability of the open source ecosystem and reducing duplication of work. I came to this thread because I tweeted about how I was excited ST was gaining traction and someone replied with a list of complaints and linked to this thread. I’m very much not here to tell the ggml community how you all should prioritize things or what decisions you should make. If we can make ggml and ST happy at the same time instead of necessitating the creation of another format, that’s a big win in my book. If y’all decide that you have too divergent values from ST or that your library isn’t stable enough I understand.
P.S.: Are there leader(s) of this community? I would love to learn about how y’all’re organizing and managing the community, and if there’s anything that EleutherAI can either do to help or learn from you about. |
@StellaAthena I believe the leader is @ggerganov, who created |
We store the model configuration within the model (i.e. hyperparameters, model structure information, tokenizer). This is to allow single-file deployments of models, because the configuration rarely changes and users enjoy being able to download one file and get going with their existing executor. The proposal being made with regards to safetensors here is to store that configuration in the JSON metadata, to allow for the same kind of experience. This seems doable, but would require readers to be able to read JSON, which is harder with GGML's single-file C header. (Although this appears to be changing with the addition of more backends.) I can't comment on whether that would be a misuse of the metadata field, but other ST readers would, most likely, ignore the presence of this data.
That looks good to me. I've seen several different ST implementations, so I assume that this is sufficient. I assume this was missed by the person who raised the concern.
I'm sorry to hear that - we have absolutely no bad blood with safetensors, it's just not necessarily the right fit for our ecosystem at this moment due to our slightly different constraints.
I agree that we should unify the formats if possible. It's just that GGML moves very quickly - we had two quantization format breaks in two weeks the other month - and having full control of the format will allow us to maintain that velocity. My main concern with making safetensors work for us is that we risk breaking the wider ecosystem or creating a somewhat-incompatible fork of safetensors that's compatible on paper, but not necessarily in practice. (e.g. ggml safetensors models being all-in-one with custom quantization, while other safetensors models have their config in separate files with well-defined quantization). The quantization is really the major sticking point for me; our quantization formats are fickle and won't be compatible with the larger safetensors ecosystem, which will lead to a lot of user confusion. With that being said - my hope is that in a few months time, this format will be retired, and we're all on safetensors. My part of the ecosystem (rustformers/llm) is likely to support safetensors as a format to load before that, as adding additional dependencies is easy for us.
As mentioned, Georgi is the lead of the ecosystem and is the head of the newly-founded ggml.ai. I'm the primary maintainer of rustformers/llm, a Rust library that uses GGML to implement several architectures with a unified interface. Executors outside of Georgi's repositories (llm included) are somewhat of an organic development, and aren't necessarily organized. It's still early days for this ecosystem - as far as I know, we don't actually have a place to communicate synchronously - so there's a lot of work to be done. Georgi is currently the BDFL :-) |
I pretty much agree with everything that @philpax said. In the long run we will support and integrate with safetensors, but at the moment we are primarily focused on consolidating the work from various More specifically, the main goal atm is to complete the current roadmap which would lay a good foundation for the project. After that we will look into extending support and collaboration further.
The best way to help now is to help complete the roadmap. |
The idea of safetensors sounds very good to me.
Or better |
Note: The discussion about the file format is continued in PR #302. |
@philpax Thank you for such a great proposal. I have a few questions: I'm wondering, you said:
For a ggml/gguf user, is a convert-*.py script the only path for converting a custom model (as was recently done for the baichuan model in llama.cpp), or is there another place where I can put mappings/conversion logic? I see converters are placed in many repos now: Where is the main place for them? Maybe it would be worth designating a global registry-style repo or a template repo for them. PS: I would like to write some kind of happy path for contributors. Thank you in advance! |
Obsoletes #147, #150, ggerganov/llama.cpp#1575, ggerganov/llama.cpp#1590, rustformers/llm#143, and probably some other issues across some other repositories.
Please see the spec PR at #302; the following is left as-is so you can see the original proposal.
Current state of affairs
Overview
At present, there are two GGML file formats floating around for LLMs (and potentially other ggml-using projects, I haven't looked too much at the implementation of whisper):
Both of these formats share the same fundamental structure:
ftype that should describe the type of the majority of the tensors, and for GGML files, the quantization version encoded using a modulo in the ftype
We have more details on the format here: https://github.com/rustformers/llm/tree/main/crates/ggml#format
Drawbacks
Unfortunately, over the last few months, there are a few issues that have become apparent with the existing models:
GGJTv4/GGUF
Based on this, I'd like to propose a new format that's designed to be universal and addresses these issues. It is largely identical to GGJTv3, but makes one important difference: the hyperparameters are encoded as an array of key-value pairs that can be read in any order, and these hyperparameters are used to encode additional information about the model. A really important property I'd like to keep is single-file deployment: if I give you a GGUF file and you have a compatible executor, it should Just Work:tm without any additional conversion or extra files.
"Specification"
To quote from ggerganov/llama.cpp#1575 (comment):
Filling in some of the missing details:
Keys
Keys are ASCII lower_snake_case with dots for separation. Their length is stored before the key. They have a maximal length of 256 (open for debate, just a number I picked that seems like a reasonable upper bound).
This means that:
- vocabulary.hugging_face is a valid key
- vocabulary-hugging-face is not
- Vocabulary.HuggingFace is not
- vocabulary.hugging-face is not
I'd say we're looking at something like TOML keys without quotation.
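The key rules above can be checked with a short validator. This is a sketch under assumptions the proposal does not settle: digits are allowed inside segments, segments must be non-empty, and the 256 limit counts characters (equivalent to bytes for ASCII). The names MAX_KEY_LENGTH, KEY_PATTERN and is_valid_key are hypothetical.

```python
import re

MAX_KEY_LENGTH = 256  # the upper bound proposed above, explicitly open for debate

# One or more lower_snake_case segments separated by single dots.
# Assumption: digits are permitted and empty segments are not.
KEY_PATTERN = re.compile(r"[a-z0-9_]+(\.[a-z0-9_]+)*")


def is_valid_key(key: str) -> bool:
    """Return True if key is ASCII lower_snake_case with dot separators."""
    return len(key) <= MAX_KEY_LENGTH and KEY_PATTERN.fullmatch(key) is not None
```

Run against the four examples listed above, only vocabulary.hugging_face passes.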
Values
Values are one of the following types:
- U32: little-endian unsigned 32-bit integer
- I32: little-endian signed 32-bit integer (honestly not sure if this is necessary, I feel like a lot of the existing i32 use has been more just due to the use of int than anything)
- F32: IEEE754 32-bit floating point number
- String: UTF-8 string data, length prepended
- Bytes: Raw binary data with no specific meaning attached, length prepended
- Boolean: 1-byte value where 0 is false and 1 is true. Anything else is invalid. I considered making anything other than 0 true, but being strict on this will help detect misbehaving writers.
Standardized key-value pairs
This list is incomplete. Feel free to suggest additions. Where possible, I've tried to use the original names from the models to remove a layer of semantic confusion.
This is just from a quick appraisal of the models that llm supports. There are likely other fields that we can standardise ahead of time by looking at the HuggingFace config.
General
- general.architecture: String: describes what architecture this model implements. Values can include llama, mpt, gpt-neox, gpt-j, gpt-2, bloom, etc. (List more if you can think of them, and they're not just variants of existing architectures!)
- general.quantization_version: u32: version of quantization scheme
- general.file_type: String: type of the majority of the tensors in the file. This shouldn't have any semantic meaning and should be purely informational, hence the use of String.
- general.license: String: SPDX license of the model
- general.description: String: information about the model, including provenance
- general.original_model_url: String: path to the original model that this GGML file was created from
LLM
- llm.context_length: u32: size of the maximum supported context
- llm.hidden_size: u32: embedding layer size
- llm.num_hidden_layers: u32: number of hidden layers
- llm.num_rotary: u32: int(hparams["rotary_pct"]*(hparams["hidden_size"]//hparams["num_attention_heads"]))
- llm.use_parallel_residual: bool: whether or not the parallel residual logic should be used
- llm.max_seq_len: u32: Maximum sequence length
- llm.attention.num_heads: u32: number of attention heads
- llm.attention.alibi_bias_max: f32: The maximum bias to use for ALiBI
- llm.attention.clip_kqv: f32: not sure
Vocabulary
- vocabulary.embedded_size: u32: size of the embedded vocabulary. Zero if there is no embedded vocabulary.
- vocabulary.huggingface_tokenizer_json: String: the entirety of the HF tokenizer.json for a given model (e.g. https://huggingface.co/mosaicml/mpt-7b-instruct/blob/main/tokenizer.json). Optional, but highly recommended for best tokenization quality with supported executors.
Future
This is not something we should aim for in the MVP, but ggml now has support for exporting the computation graph. A sample computation graph could be embedded to allow an executor to run the model without having direct support for the architecture.
Migration
The existing migrations have been pretty messy for the ecosystem and for the community. We should try to avoid causing significant upset by providing a migration path.
My suggestion is to switch over all model implementations, including llama.cpp, to GGUF, but offer a very straightforward conversion utility that does not require Python and can convert GGML and GGJTv3 to GGUF with all required information.
If interested, we could also include support for GGJT v1 and v2 using ggerganov/llama.cpp#1504 (although the requantisation process is inherently lossy).
Hopefully, this is the last time we have to bite this bullet. Even if we make breaking changes (like quantization version) again, software consuming GGUF can intelligently decide what to do based on the available information in the hyperparameters.
New model architectures can use GGUF without any additional work, so no breaking changes should be necessary there, either.
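As a rough illustration of software "intelligently deciding what to do" from the hyperparameters, an executor could inspect the parsed key-value pairs before loading any tensors. This is a sketch only: the supported sets are made-up executor capabilities, check_model is a hypothetical name, and treating a missing general.quantization_version as acceptable is an assumption, since the proposal has not yet said which keys are required.

```python
# Hypothetical capabilities of some executor; not taken from any real project.
SUPPORTED_ARCHITECTURES = {"llama", "gpt-neox"}
SUPPORTED_QUANTIZATION_VERSIONS = {2}


def check_model(hparams: dict) -> str:
    """Return "ok" or a human-readable reason the model cannot be loaded."""
    arch = hparams.get("general.architecture")
    if arch is None:
        return "error: missing general.architecture"
    if arch not in SUPPORTED_ARCHITECTURES:
        return f"error: unsupported architecture '{arch}'"
    # Assumption: an absent quantization version is not treated as an error.
    version = hparams.get("general.quantization_version")
    if version is not None and version not in SUPPORTED_QUANTIZATION_VERSIONS:
        return f"error: quantization version {version} is not supported"
    return "ok"
```

An executor that does not implement a given architecture can then tell the user so up front, instead of failing partway through loading or producing garbage output.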
Conversion of Python models to GGUF
Ideally, all of the existing convert-h5-to-ggml.py and convert.py scripts can be entirely deprecated. Instead, there is one script that takes an arbitrary HuggingFace model and converts it to a compatible GGUF file. This vastly reduces the maintenance burden and makes it simpler to action changes across the ecosystem when necessary.
cc @ggerganov @LostRuins @KerfuffleV2 @LLukas22 @TheBloke @iacore @comex and others who work with GGML models