
[User] ./server failed to eval #2122

Closed
4 tasks done
ghost opened this issue Jul 6, 2023 · 1 comment

Comments


ghost commented Jul 6, 2023

Prerequisites

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Hi, thanks for the continued effort with llama.cpp. I cloned the repo, then built with make as usual.

Expected Behavior

Run ./server without error messages. This issue was not present in #2009. Unfortunately, I'm receiving errors during inference with ./server at commit #2116. I'll test other builds.

Current Behavior

Errors during ./server inference:

llama_eval_internal: first token must be BOS
llama_eval: failed to eval

It's abrupt and cuts off the response in the middle of a sentence. Here's an example:

~/ollama (master)> ./server -m ~/wizardlm-7b-v1.0-uncensored.ggmlv3.q4_0.bin -t 4 -b 10

{"timestamp":1688607679,"level":"INFO","function":"main","line":1085,"message":"build info","build":796,"commit":"31cfbb1"}
{"timestamp":1688607679,"level":"INFO","function":"main","line":1090,"message":"

system info","n_threads":4,"total_threads":8,"system_info":"AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | "}

llama.cpp: loading model from /data/data/com.termux/files/home/wizardlm-7b-v1.0-uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 5407.72 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB

llama server listening at http://127.0.0.1:8080

{"timestamp":1688607679,"level":"INFO","function":"main","line":1305,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
{"timestamp":1688607685,"level":"INFO","function":"log_server_request","line":1058,"message":"request","remote_addr":"127.0.0.1","remote_port":37210,"status":200,"method":"GET","path":"/","params":{}}
{"timestamp":1688607685,"level":"INFO","function":"log_server_request","line":1058,"message":"request","remote_addr":"127.0.0.1","remote_port":37210,"status":200,"method":"GET","path":"/completion.js","params":{}}
{"timestamp":1688607685,"level":"INFO","function":"log_server_request","line":1058,"message":"request","remote_addr":"127.0.0.1","remote_port":37212,"status":200,"method":"GET","path":"/index.js","params":{}}
{"timestamp":1688607685,"level":"INFO","function":"log_server_request","line":1058,"message":"request","remote_addr":"127.0.0.1","remote_port":37210,"status":404,"method":"GET","path":"/favicon.ico","params":{}}

llama_print_timings:        load time =  2102.17 ms
llama_print_timings:      sample time =  3291.18 ms /   355 runs   (    9.27 ms per token,   107.86 tokens per second) 
llama_print_timings: prompt eval time = 10480.78 ms /    49 tokens (  213.89 ms per token,     4.68 tokens per second) 
llama_print_timings:        eval time = 124335.87 ms /   354 runs   (  351.23 ms per token,     2.85 tokens per second)
llama_print_timings:       total time = 138282.27 ms

{"timestamp":1688607964,"level":"INFO","function":"log_server_request","line":1058,"message":"request","remote_addr":"127.0.0.1","remote_port":37214,"status":200,"method":"POST","path":"/completion","params":{}}

llama_eval_internal: first token must be BOS
llama_eval: failed to eval
{"timestamp":1688608023,"level":"ERROR","function":"nextToken","line":360,"message":"failed to eval","n_eval":10,"n_past":0,"n_threads":4,"embd":" 
rare ingredients for potions, and even delved into dangerous dungeons filled with 
treacherous monsters. Along the way, she made friends with other creatures who shared her passion for knowledge and 
adventure, including dragons, unicorns, and even mermaids.\nAs time passed, Luna grew stronger both physically and mentally, 
becoming an extraordinary creature capable of performing incredible feats. And yet, 
despite all her newfound powers, she never forgot where she came from or the humble roots that first led her down this path. 
For Luna always remained true to her llama nature, using her abilities only for good and spreading joy wherever she went.\n
User: Thanks. Describe Lunas appearance please.\n
llama: As a young llama, Luna was adorable with soft brown fur, long eyelashes, and a friendly smile. But as she embarked on her 
journey towards greatness, her physical features began to change in mysterious ways. Her eyes 
became more intense, glowing like crystals themselves, while her body developed powerful 
muscles and a shimmering golden coat. She now stood taller than any ordinary ll"}

llama_print_timings:        load time =  2102.17 ms
llama_print_timings:      sample time =   936.31 ms /    93 runs   (   10.07 ms per token,    99.33 tokens per second)
llama_print_timings: prompt eval time =  4246.50 ms /    16 tokens (  265.41 ms per token,     3.77 tokens per second)
llama_print_timings:        eval time = 29930.84 ms /    92 runs   (  325.34 ms per token,     3.07 tokens per second)
llama_print_timings:       total time = 35164.16 ms

{"timestamp":1688608023,"level":"INFO","function":"log_server_request","line":1058,"message":"request","remote_addr":"127.0.0.1","remote_port":37216,"status":200,"method":"POST","path":"/completion","params":{}}
^C

Environment and Context

uname -a

Linux localhost 4.14.190-23725627-abG975WVLS8IWD1 #2 SMP PREEMPT Mon Apr 10 18:16:39 KST 2023 aarch64 Android
lscpu
Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 8
  On-line CPU(s) list:  0-7
Vendor ID:              Qualcomm
  Model name:           Kryo-4XX-Silver
    Model:              14
    Thread(s) per core: 1
    Core(s) per socket: 49
    Socket(s):          1
    Stepping:           0xd
    CPU(s) scaling MHz: 62%
    CPU max MHz:        1785.6000
    CPU min MHz:        300.0000
    BogoMIPS:           38.40
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
  Model name:           Kryo-4XX-Gold
    Model:              14
    Thread(s) per core: 1
    Core(s) per socket: 2
    Socket(s):          2
    Stepping:           0xd
    CPU(s) scaling MHz: 74%
    CPU max MHz:        2841.6001
    CPU min MHz:        710.4000
    BogoMIPS:           38.40
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Python 3.11.4
GNU Make 4.4.1
clang version 16.0.6
Target: aarch64-unknown-linux-android24
Thread model: posix
InstalledDir: /data/data/com.termux/files/usr/bin

Failure Information (for bugs)

llama_eval_internal: first token must be BOS
llama_eval: failed to eval

Steps to Reproduce

  1. git clone https://github.com/ggerganov/llama.cpp
  2. make
  3. ./server -m ~/wizardlm-7b-v1.0-uncensored.ggmlv3.q4_0.bin -t 4 -b 10
  4. Then interact with the model 2-3 times (for example via the web UI or the /completion endpoint; see the sketch below the commit info).
git log | head -1
commit 31cfbb1013a482e89c72146e2063ac4362becae7
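
For reference, one way to interact with the server from the command line (a sketch, assuming the /completion endpoint accepts a JSON body with "prompt" and "n_predict" fields as described in the server README):

curl --request POST \
     --url http://127.0.0.1:8080/completion \
     --header "Content-Type: application/json" \
     --data '{"prompt": "User: Hello, who are you?\nllama:", "n_predict": 128}'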

Thank you!


ghost commented Jul 6, 2023

For anyone hitting this like I did: make sure to start ./server with extra context (the -c / --ctx-size option).

These errors occur once the context is full.
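
For example (a sketch, assuming the -c / --ctx-size option to raise the context window above the default 512 tokens; pick a value the model supports):

./server -m ~/wizardlm-7b-v1.0-uncensored.ggmlv3.q4_0.bin -t 4 -b 10 -c 2048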

ghost closed this as completed Jul 6, 2023.