
Compute perplexity fails with too many tokens exception #385

Closed
maziyarpanahi opened this issue Mar 22, 2023 · 9 comments · Fixed by #390
Labels
bug Something isn't working

Comments

@maziyarpanahi

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

It is supposed to compute perplexity as in the original PR: #270

Current Behavior

However, it fails with the following exception:

llama_tokenize: too many tokens
libc++abi: terminating with uncaught exception of type std::length_error: vector

Environment and Context

  • macOS (M2 Max)
  • python3 --version: 3.8.16
  • make --version: i386-apple-darwin11.3.0
  • g++ --version: arm64-apple-darwin22.3.0

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. git pull
  2. make
  3. python3 convert-pth-to-ggml.py models/7B/ 1
  4. python3 quantize.py 7B
  5. ./main -m ./models/7B/ggml-model-q4_0.bin -t 4 -n 128 --perplexity -f ~/wikitext-2-raw/wiki.test.raw

Failure Logs

llama.cpp % ./main -m ./models/7B/ggml-model-q4_0.bin -t 4 -n 128 --perplexity -f ~/wikitext-2-raw/wiki.test.raw

main: seed = 1679472306
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 4529.34 MB
llama_model_load: memory_size =   512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size =  4017.27 MB / num tensors = 291

system_info: n_threads = 4 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
llama_tokenize: too many tokens
libc++abi: terminating with uncaught exception of type std::length_error: vector
zsh: abort      ./main -m ./models/7B/ggml-model-q4_0.bin -t 4 -n 128 --perplexity -f
@ggerganov
Owner

ggerganov commented Mar 22, 2023

The llama_tokenize() in utils has to be fixed to support large texts.
See discussion here: #370 (comment)

Will take a look later if it is still not fixed
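
For context on what is going wrong: the wrapper in utils.cpp tokenizes into a fixed 8096-token buffer (see the diff quoted in the comment below), and the C-level llama_tokenize() signals overflow by printing the warning and returning a negative count. A minimal sketch of the failure mode, assuming the wrapper looked roughly like this at the time (headers added for completeness):

#include <string>
#include <vector>
#include "llama.h"   // llama_context, llama_token, C-level llama_tokenize()

std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos) {
    std::vector<llama_token> res(8096);  // fixed-size output buffer
    // wiki.test.raw needs far more than 8096 tokens, so the C API prints
    // "llama_tokenize: too many tokens" and returns a negative count.
    int n = llama_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
    // Passing that negative int to resize() converts it to an enormous size_t,
    // which throws std::length_error: vector and aborts the run.
    res.resize(n);
    return res;
}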

gjmulder added the bug label Mar 22, 2023
@gjmulder
Collaborator

FYI: I just completed a 222-chunk run with the 30B q4 model by taking the first 1404 lines of the wikitext test set:


llama.cpp$ git log | head -1
commit 353ec251a42491f5192c48561da4b444ef67f23c
llama.cpp/models/wikitext-2-raw$ head -1404 wiki.test.raw > wiki.test.raw.1404
./main -t 16 --perplexity -m models/30B/ggml-model-q4_0.bin -f ./models/wikitext-2-raw/wiki.test.raw.1404

@Green-Sky
Collaborator

Green-Sky commented Mar 22, 2023

edit: proper fix here #390

for a quick fix you can do:

diff --git a/utils.cpp b/utils.cpp
index 1679ae1..af822cc 100644
--- a/utils.cpp
+++ b/utils.cpp
@@ -148,6 +148,12 @@ std::string gpt_random_prompt(std::mt19937 & rng) {
 std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos) {
     std::vector<llama_token> res(8096);
     int n = llama_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
+    if (n < 0) {
+        res.resize(-n);
+        n = llama_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
+
+        assert(n >= 0);
+    }
     res.resize(n);

     return res;

However, this is not a good solution, since it invokes the tokenizer twice for a large file and prints the warning/error llama_tokenize: too many tokens on the first try.
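
One way to avoid the second pass entirely is to size the buffer from an upper bound up front: the tokenizer never produces more tokens than there are bytes of input, plus one slot for the optional BOS token. A rough sketch of that approach (whether #390 does exactly this is not shown here):

#include <cassert>
#include <string>
#include <vector>
#include "llama.h"

std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos) {
    // A token always covers at least one byte of input, so text.size() plus one
    // slot for the optional BOS token is a safe upper bound on the token count.
    std::vector<llama_token> res(text.size() + (int) add_bos);
    int n = llama_tokenize(ctx, text.c_str(), res.data(), res.size(), add_bos);
    assert(n >= 0);  // with the buffer sized this way the call should not fail
    res.resize(n);
    return res;
}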

@Green-Sky
Collaborator

> FYI: I just completed a 222-chunk run with the 30B q4 model by taking the first 1404 lines of the wikitext test set:
>
> llama.cpp$ git log | head -1
> commit 353ec251a42491f5192c48561da4b444ef67f23c
> llama.cpp/models/wikitext-2-raw$ head -1404 wiki.test.raw > wiki.test.raw.1404
> ./main -t 16 --perplexity -m models/30B/ggml-model-q4_0.bin -f ./models/wikitext-2-raw/wiki.test.raw.1404

can you provide the output too?

@gjmulder
Collaborator

Is this what you need? 30B_int4.txt

Memory usage for 30B / q4 as reported by atop -r was 44.2GB and was constant:

    PID       TID   MINFLT   MAJFLT   VSTEXT   VSLIBS    VDATA   VSTACK   LOCKSZ     VSIZE    RSIZE    PSIZE    VGROW    RGROW   SWAPSZ   RUID       EUID         MEM    CMD       1/17
2637698         -    47973        0   188.0K     3.4M    47.6G   132.0K     0.0K     47.6G    44.2G    44.2G       0B       0B       0B   mulderg    mulderg      35%    main

Currently running 65B / q4 and seeing a constant 79.2GB of memory usage. It is taking approx. twice as long as 30B, so full results tomorrow. 🐌 🐌

[plot: perp_vs_chunk (perplexity vs. chunk index)]

@Green-Sky
Collaborator

Green-Sky commented Mar 22, 2023

@gjmulder the graphs convey it way better than just the final number ❤️

edit: @gjmulder, don't just say q4; say q4_0 or q4_1. They make a significant difference.

@gjmulder
Collaborator

gjmulder commented Mar 22, 2023

Poor man's Weights & Biases using scp, awk, and R/ggplot2 🤣

[plot: perp_vs_chunk (perplexity vs. chunk index)]

@BadisG

BadisG commented Mar 22, 2023

It would be good to test perplexity with GPTQ quantization and compare it with the usual RTN quantization.
https://github.com/ggerganov/llama.cpp/blob/master/convert-gptq-to-ggml.py

@gjmulder
Collaborator

> It would be good to test perplexity with GPTQ quantization and compare it with the usual RTN quantization. https://github.com/ggerganov/llama.cpp/blob/master/convert-gptq-to-ggml.py

@BadisG see #129, which is becoming a catch-all issue for model quality.

Deadsg pushed a commit to Deadsg/llama.cpp that referenced this issue Dec 19, 2023
Update llama.py: Added how many input tokens in ValueError exception