Model verification #249
Comments
Seems like this will probably help with #248.
Perplexity over a dataset might be useful for finding ideal CLI parameters and for selecting between models for a given task. Unless perplexity is directly comparable between this library and llama.cpp, though, it won't help find issues in the library itself. Still, I'm looking forward to having perplexity available for this. Automated perplexity calculations across standard models and datasets should help identify any changes in the library that affect the evaluation of the models (a sketch of such a check follows below).
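A minimal sketch of what such an automated check could look like, assuming the perplexity numbers themselves are produced elsewhere (e.g. by the CLI over a fixed dataset); the model names, baseline values, and tolerance are all illustrative:

```python
TOLERANCE = 0.05  # flag anything more than 5% away from its baseline

def check_regressions(baseline: dict, current: dict) -> list:
    """Return names of models whose perplexity drifted past the tolerance."""
    return [
        name
        for name, ppl in current.items()
        if name in baseline and abs(ppl - baseline[name]) / baseline[name] > TOLERANCE
    ]

# In practice the baseline would be loaded from a checked-in JSON file, and
# `current` would come from running the perplexity tool over a fixed dataset.
baseline = {"bloomz-560m-f16": 14.2}  # illustrative numbers only
current = {"bloomz-560m-f16": 15.9}
print(check_regressions(baseline, current))  # -> ['bloomz-560m-f16']
```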
We'd like to be able to automatically detect discrepancies, but this is not straightforward, as each library can have its own execution strategy and source of randomness that might influence the results. Do you have any ideas as to how we might tackle this?
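One possible starting point, offered as a sketch rather than a full answer: take sampling out of the equation by running both libraries with greedy (argmax) decoding on the same prompt, then diff the token streams. The commented-out wrappers below are hypothetical placeholders for each library's inference loop.

```python
def first_divergence(a: list, b: list):
    """Index of the first position where two token streams disagree, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one stream is a prefix of the other
    return None

# tokens_a = greedy_tokens_from_llm(prompt)        # hypothetical wrapper
# tokens_b = greedy_tokens_from_reference(prompt)  # hypothetical wrapper
# With the sampler effectively disabled, any divergence points at a numerical
# difference between the implementations rather than at randomness.
print(first_divergence([1, 2, 3, 4], [1, 2, 9, 4]))  # -> 2
```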
Yeah, I think our BLOOM inference is suspect (#228, but I suspect the problem goes deeper). We'll be revising the implementation of each model soon to bring them in line with the upstream implementations in llama.cpp/ggml. (It's likely that GGML has made changes that alter the models' behaviour, and we haven't kept up with the corresponding updates on our side.)
We now use the same perplexity implementation as llama.cpp, but our numbers still aren't comparable to theirs. I suspect this is due to the same out-of-date implementation issue. You might be interested in following #257.
I was just investigating the llm_rs Python module and had an idea: for any model that exists in both a torch version and a ggml version, give both versions the same prompt, then measure each version's perplexity over both of the generated results. I haven't tested this, but I'd expect a model's perplexity to be fairly low on its own output. If that holds, the ggml and non-ggml versions are producing well-matched results whenever the non-ggml model's perplexity is similar for its own output and for the ggml model's output (see the sketch below). Is there a good way to get perplexity from non-ggml models?
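For the non-ggml side, the usual Hugging Face recipe applies: a causal LM's perplexity over a text is the exponential of its mean per-token cross-entropy loss. A minimal sketch, assuming the torch model is bigscience/bloomz-560m and the two completions are supplied as plain strings (the placeholder strings below stand in for the actual generated outputs):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

model_id = "bigscience/bloomz-560m"  # the torch original of the ggml conversion
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

torch_output = "completion produced by the torch model"  # placeholder
ggml_output = "completion produced by the ggml model"    # placeholder

# If the two implementations agree, these numbers should be close.
print("PPL over torch output:", perplexity(model, tokenizer, torch_output))
print("PPL over ggml output: ", perplexity(model, tokenizer, ggml_output))
```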
Loading these models and being able to use them for inference or in a REPL is very cool.
To make this broadly useful, we need a way to verify that models loaded by llm are acting the same way as they would if loaded via other libraries. Does anyone have a strategy they are using for this?
Specifically, I'm playing with ggml-model-bloomz-560m-f16.bin and the results are almost useful, but not quite. I'm trying to understand whether this is due to an issue with the model itself, with the CLI parameters I'm using (all defaults), or with a bug in the way the BLOOM model is being loaded.
Rather than simply solving my specific problem, I'd like to see a general solution for answering this question for any model that can be loaded by llm. I'm willing to put some effort into that if someone can point me in a good direction.