Model verification #249
Comments
Seems like this will probably help with #248.
Perplexity over a dataset might be useful for finding ideal CLI parameters and for selecting between models for a given task. Unless perplexity is directly comparable between this library and llama.cpp, though, it won't help find issues in the library itself. Still, I'm looking forward to having perplexity available for this. Automated perplexity calculations across standard models and datasets should help identify any changes in the library that affect the evaluation of the models (a sketch of such a check follows below).
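A minimal sketch of what such an automated check could look like, assuming the perplexity numbers themselves are produced elsewhere (e.g. by the CLI over a fixed dataset); the model names, baseline values, and tolerance are all illustrative:

```python
TOLERANCE = 0.05  # flag anything more than 5% away from its baseline

def check_regressions(baseline: dict, current: dict) -> list:
    """Return names of models whose perplexity drifted past the tolerance."""
    return [
        name
        for name, ppl in current.items()
        if name in baseline and abs(ppl - baseline[name]) / baseline[name] > TOLERANCE
    ]

# In practice the baseline would be loaded from a checked-in JSON file, and
# `current` would come from running the perplexity tool over a fixed dataset.
baseline = {"bloomz-560m-f16": 14.2}  # illustrative numbers only
current = {"bloomz-560m-f16": 15.9}
print(check_regressions(baseline, current))  # -> ['bloomz-560m-f16']
```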
We'd like to be able to automatically detect discrepancies, but this is not straightforward, as each library can have its own execution strategy and source of randomness that might influence the results. Do you have any ideas as to how we might tackle this?
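One possible starting point, offered as a sketch rather than a full answer: take sampling out of the equation by running both libraries with greedy (argmax) decoding on the same prompt, then diff the token streams. The commented-out wrappers below are hypothetical placeholders for each library's inference loop.

```python
def first_divergence(a: list, b: list):
    """Index of the first position where two token streams disagree, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one stream is a prefix of the other
    return None

# tokens_a = greedy_tokens_from_llm(prompt)        # hypothetical wrapper
# tokens_b = greedy_tokens_from_reference(prompt)  # hypothetical wrapper
# With the sampler effectively disabled, any divergence points at a numerical
# difference between the implementations rather than at randomness.
print(first_divergence([1, 2, 3, 4], [1, 2, 9, 4]))  # -> 2
```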
Yeah, I think our BLOOM inference is suspect (#228, but I suspect the problem goes deeper). We'll be revising the implementation of each model soon to bring them in line with the upstream implementations in llama.cpp/ggml. (It's likely that GGML has made changes that alter the models' behaviour, and we haven't kept up with the corresponding updates on our side.)
We now use the same perplexity implementation as llama.cpp, but our numbers still aren't comparable to theirs. I suspect this is due to the same out-of-date implementation issue. You might be interested in following #257.
I was just investigating the llm_rs Python module and had an idea: for any model that exists in both a torch version and a ggml version, give both versions the same prompt, then measure each version's perplexity over both of the generated results. I haven't tested this, but I'd expect a model's perplexity to be fairly low on its own output. If that holds, the ggml and non-ggml versions are producing well-matched results whenever the non-ggml model's perplexity is similar for its own output and for the ggml model's output (see the sketch below). Is there a good way to get perplexity from non-ggml models?
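For the non-ggml side, the usual Hugging Face recipe applies: a causal LM's perplexity over a text is the exponential of its mean per-token cross-entropy loss. A minimal sketch, assuming the torch model is bigscience/bloomz-560m and the two completions are supplied as plain strings (the placeholder strings below stand in for the actual generated outputs):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str) -> float:
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

model_id = "bigscience/bloomz-560m"  # the torch original of the ggml conversion
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

torch_output = "completion produced by the torch model"  # placeholder
ggml_output = "completion produced by the ggml model"    # placeholder

# If the two implementations agree, these numbers should be close.
print("PPL over torch output:", perplexity(model, tokenizer, torch_output))
print("PPL over ggml output: ", perplexity(model, tokenizer, ggml_output))
```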
Loading these models and being able to use them for inference or in a REPL is very cool.
To make this broadly useful, we need a way to verify that models loaded by llm are acting the same way as they would if loaded via other libraries. Does anyone have a strategy they are using for this?
Specifically, I'm playing with ggml-model-bloomz-560m-f16.bin and the results are almost useful, but not quite. I'm trying to understand whether this is due to an issue with the model itself, with the CLI parameters I'm using (all defaults), or with a bug in the way the BLOOM model is being loaded.
Rather than simply solving my specific problem, I'd like to see a general solution for answering this question for any model that can be loaded by llm. I'm willing to put some effort into that if someone can point me in a good direction.