
Compute perplexity fails with too many tokens exception #385

Closed
maziyarpanahi opened this issue Mar 22, 2023 · 9 comments · Fixed by #390
Labels
bug Something isn't working

Comments

@maziyarpanahi

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

It is supposed to compute perplexity as in the original PR: #270

Current Behavior

However, it fails with the following exception:

llama_tokenize: too many tokens
libc++abi: terminating with uncaught exception of type std::length_error: vector

Environment and Context

  • macOS (M2 Max)
  • python3 --version: 3.8.16
  • make --version: i386-apple-darwin11.3.0
  • g++ --version: arm64-apple-darwin22.3.0

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. git pull
  2. make
  3. python3 convert-pth-to-ggml.py models/7B/ 1
  4. python3 quantize.py 7B
  5. ./main -m ./models/7B/ggml-model-q4_0.bin -t 4 -n 128 --perplexity -f ~/wikitext-2-raw/wiki.test.raw

Failure Logs

llama.cpp % ./main -m ./models/7B/ggml-model-q4_0.bin -t 4 -n 128 --perplexity -f ~/wikitext-2-raw/wiki.test.raw

main: seed = 1679472306
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size =   512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
llama_tokenize: too many tokens
libc++abi: terminating with uncaught exception of type std::length_error: vector
zsh: abort      ./main -m ./models/7B/ggml-model-q4_0.bin -t 4 -n 128 --perplexity -f
@ggerganov
Owner

ggerganov commented Mar 22, 2023

The llama_tokenize() in utils has to be fixed to support large texts.
See discussion here: #370 (comment)

Will take a look later if it is still not fixed
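
For context on what is going wrong: the wrapper in utils.cpp tokenizes into a fixed 8096-token buffer (see the diff quoted in the comment below), and the C-level llama_tokenize() signals overflow by printing the warning and returning a negative count. A minimal sketch of the failure mode, assuming the wrapper looked roughly like this at the time (headers added for completeness):

#include <string>
#include <vector>
#include "llama.h"   // llama_context, llama_token, C-level llama_tokenize()

std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos) {
    std::vector<llama_token> res(8096);  // fixed-size output buffer
    // wiki.test.raw needs far more than 8096 tokens, so the C API prints
    // "llama_tokenize: too many tokens" and returns a negative count.
    int n = llama_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
    // Passing that negative int to resize() converts it to an enormous size_t,
    // which throws std::length_error: vector and aborts the run.
    res.resize(n);
    return res;
}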

gjmulder added the bug label Mar 22, 2023
@gjmulder
Collaborator

FYI: I just completed a 222-chunk run with the 30B q4 model by taking the first 1404 lines of the wikitext test set:


llama.cpp$ git log | head -1
commit 353ec251a42491f5192c48561da4b444ef67f23c
llama.cpp/models/wikitext-2-raw$ head -1404 wiki.test.raw > wiki.test.raw.1404
./main -t 16 --perplexity -m models/30B/ggml-model-q4_0.bin -f ./models/wikitext-2-raw/wiki.test.raw.1404

@Green-Sky
Collaborator

Green-Sky commented Mar 22, 2023

edit: proper fix here #390

for a quick fix you can do:

diff --git a/utils.cpp b/utils.cpp
index 1679ae1..af822cc 100644
--- a/utils.cpp
+++ b/utils.cpp
@@ -148,6 +148,12 @@ std::string gpt_random_prompt(std::mt19937 & rng) {
 std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos) {
     std::vector<llama_token> res(8096);
     int n = llama_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
+    if (n < 0) {
+        res.resize(-n);
+        n = llama_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
+
+        assert(n >= 0);
+    }
     res.resize(n);

     return res;

However, this is not a good solution, since it invokes the tokenizer twice for a large file and prints the warning/error llama_tokenize: too many tokens on the first try.
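
One way to avoid the second pass entirely is to size the buffer from an upper bound up front: the tokenizer never produces more tokens than there are bytes of input, plus one slot for the optional BOS token. A rough sketch of that approach (whether #390 does exactly this is not shown here):

#include <cassert>
#include <string>
#include <vector>
#include "llama.h"

std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos) {
    // A token always covers at least one byte of input, so text.size() plus one
    // slot for the optional BOS token is a safe upper bound on the token count.
    std::vector<llama_token> res(text.size() + (int) add_bos);
    int n = llama_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
    assert(n >= 0);  // with the buffer sized this way the call should not fail
    res.resize(n);
    return res;
}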

@Green-Sky
Collaborator

> FYI: I just completed a 222-chunk run with the 30B q4 model by taking the first 1404 lines of the wikitext test set:
>
> llama.cpp$ git log | head -1
> commit 353ec251a42491f5192c48561da4b444ef67f23c
> llama.cpp/models/wikitext-2-raw$ head -1404 wiki.test.raw > wiki.test.raw.1404
> ./main -t 16 --perplexity -m models/30B/ggml-model-q4_0.bin -f ./models/wikitext-2-raw/wiki.test.raw.1404

can you provide the output too?

@gjmulder
Collaborator

Is this what you need? 30B_int4.txt

Memory usage for 30B / q4 as reported by atop -r was 44.2GB and was constant:

    PID       TID   MINFLT   MAJFLT   VSTEXT   VSLIBS    VDATA   VSTACK   LOCKSZ     VSIZE    RSIZE    PSIZE    VGROW    RGROW   SWAPSZ   RUID       EUID         MEM    CMD       1/17
2637698         -    47973        0   188.0K     3.4M    47.6G   132.0K     0.0K     47.6G    44.2G    44.2G       0B       0B       0B   mulderg    mulderg      35%    main

Currently running 65B / q4 and seeing a constant 79.2GB of memory usage. It is taking approx. twice as long as 30B, so full results tomorrow. 🐌 🐌

[plot: perp_vs_chunk (perplexity vs. chunk index)]

@Green-Sky
Collaborator

Green-Sky commented Mar 22, 2023

@gjmulder the graphs convey it way better than just the final number ❤️

edit: @gjmulder, don't just say q4; say q4_0 or q4_1. They make a significant difference.

@gjmulder
Collaborator

gjmulder commented Mar 22, 2023

Poor man's Weights & Biases using scp, awk, and R/ggplot2 🤣

[plot: perp_vs_chunk (perplexity vs. chunk index)]

@BadisG

BadisG commented Mar 22, 2023

It would be good to test perplexity with GPTQ quantization and compare it with the usual RTN quantization.
https://github.com/ggerganov/llama.cpp/blob/master/convert-gptq-to-ggml.py

@gjmulder
Collaborator

> It would be good to test perplexity with GPTQ quantization and compare it with the usual RTN quantization. https://github.com/ggerganov/llama.cpp/blob/master/convert-gptq-to-ggml.py

@BadisG see #129, which is becoming a catch-all issue for model quality.

Deadsg pushed a commit to Deadsg/llama.cpp that referenced this issue Dec 19, 2023
Update llama.py: Added how many input tokens in ValueError exception