This repository has been archived by the owner on Jun 24, 2024. It is now read-only.

Model verification #249

Open
mhykes opened this issue May 19, 2023 · 4 comments
Labels
issue:enhancement (New feature or request) · meta:maintenance (Changes that will make it easier for us to maintain code)

Comments

@mhykes

mhykes commented May 19, 2023

Loading these models and being able to use them for inference or in a repl is very cool.
To make this broadly useful, we need a way to verify that models loaded by llm are acting the same way as they would if loaded via other libraries. Does anyone have a strategy they are using for this?

Specifically, I'm playing with ggml-model-bloomz-560m-f16.bin and the results are almost useful, but not quite. I'm trying to understand whether this is due to some issue with the model, with the CLI parameters I'm using (all defaults), or a bug in the way the BLOOM model is being loaded.

Rather than simply solve my specific problem, I'd like to see a general solution for answering this question for any model that can be loaded by llm. I'm willing to put some effort into that, if someone can point me in a good direction.

@danforbes
Contributor

Seems like #248 will probably help with this.

@mhykes
Author

mhykes commented May 19, 2023

Perplexity over a dataset might be useful for finding ideal CLI parameters and for selecting between models for a given task. Unless perplexity is directly comparable between this library and llama.cpp, though, it won't help with finding issues in this library.

Still, I'm looking forward to having perplexity available for this. Automated perplexity calculations across standard models and datasets should help identify any changes to the library that affect how models are evaluated.
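
As a rough illustration, a windowed perplexity calculation over a dataset with a Hugging Face torch reference model might look like the sketch below; the model name and window size are assumptions for illustration, not how the llm CLI computes it.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative reference model; any causal LM with a torch checkpoint works.
model_name = "bigscience/bloomz-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def perplexity(text: str, window: int = 512) -> float:
    """Perplexity = exp(mean negative log-likelihood over predicted tokens)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    nll_sum, n_tokens = 0.0, 0
    # Non-overlapping windows keep this simple; a strided evaluation would
    # preserve more context at window boundaries.
    for start in range(0, ids.size(1), window):
        chunk = ids[:, start : start + window]
        if chunk.size(1) < 2:
            break
        with torch.no_grad():
            # Passing the ids as labels makes the model return the mean NLL;
            # the one-token label shift happens inside the forward pass.
            out = model(chunk, labels=chunk)
        n = chunk.size(1) - 1  # number of predicted tokens in this window
        nll_sum += out.loss.item() * n
        n_tokens += n
    return math.exp(nll_sum / n_tokens)
```

Running that over a fixed corpus for each tracked model would give baseline numbers to diff on every change.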

@philpax
Collaborator

philpax commented May 20, 2023

We'd like to be able to automatically detect discrepancies, but this isn't straightforward: each library can have its own execution strategy and source of randomness, either of which might influence the results.
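
For instance, one way to take randomness out of the comparison (a sketch only, assuming both sides can be forced into greedy decoding; not something our current API is built around) is to diff greedy token streams and find the first divergence:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Greedy (argmax) decoding is deterministic, so two faithful implementations
# of the same model should emit identical token streams for the same prompt.
model_name = "bigscience/bloomz-560m"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def greedy_tokens(prompt: str, n_new: int = 32) -> list[int]:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=n_new, do_sample=False)
    return out[0, ids.size(1):].tolist()

reference = greedy_tokens("Translate to French: Hello, world!")
# `ggml_tokens` stands in for the token ids produced by llm (or llm_rs) with
# greedy sampling on the same prompt; capturing them is left as a placeholder.
ggml_tokens: list[int] = []
divergence = next(
    (i for i, (a, b) in enumerate(zip(reference, ggml_tokens)) if a != b), None
)
```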

Do you have any ideas as to how we might tackle this?


> Specifically, I'm playing with ggml-model-bloomz-560m-f16.bin and the results are almost useful, but not quite. I'm trying to understand whether this is due to some issue with the model, with the CLI parameters I'm using (all defaults), or a bug in the way the BLOOM model is being loaded.

Yeah, I think our BLOOM inference is suspect. (#228, but I suspect the problem goes deeper.) We'll be revising the implementation of each model soon to bring them in line with the upstream implementations in llama.cpp/ggml. (It's likely that GGML has made changes that alter the behaviour of the models, and we haven't kept up with the corresponding changes to our model implementations.)

> Unless perplexity is directly comparable between this library and llama.cpp, though, it won't help with finding issues in this library.

We now use the same perplexity implementation as llama.cpp, but our numbers aren't comparable. I suspect this is due to the same out-of-date implementation issue. You might be interested in following #257.

@philpax added the meta:maintenance and issue:enhancement labels May 20, 2023
@mhykes
Author

mhykes commented Jun 2, 2023

I was just investigating the llm_rs Python module and had an idea.

For any model that exists in both a torch version and a ggml version, give both versions the same prompt. Then measure each version's perplexity while evaluating both of the generated outputs.

I have not tested this, but I expect a model's perplexity on its own output to be fairly low. If so, the ggml and non-ggml versions of a model can be considered to match when the non-ggml model assigns similar perplexity to its own output and to the ggml model's output.
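
To make the idea concrete, here's an untested sketch of the cross-scoring step, assuming the torch side is a Hugging Face model; the model name and completions are placeholders:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloomz-560m"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def ppl(text: str) -> float:
    # Mean NLL over the sequence, exponentiated; labels shift internally.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

prompt = "Translate to French: Hello, world!"
torch_output = "..."  # placeholder: completion generated by the torch model
ggml_output = "..."   # placeholder: completion generated by llm on the same prompt

# If ppl(prompt + torch_output) and ppl(prompt + ggml_output) are close, the
# ggml port is plausibly evaluating the model the same way as the torch one.
```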

Is there a good way to get perplexity from non-ggml models?
