From d0e4fe0bbe98f6bbc34c4cbebc1392734eccf838 Mon Sep 17 00:00:00 2001
From: WordsHk
Date: Thu, 22 Feb 2024 15:46:15 +0800
Subject: [PATCH] Partially revert #4280 which broke `parallel`

I found that parallel (in examples/parallel) was unusable when -np > 1.
I bisected the issue down to d7b800b8bc490a221acbd83c575206a907f2f6e2.

I don't really understand anything about the kv-cache; I only know that the
change causes parallel to emit nonsense on my M2 Mac Studio (Apple M2 Max,
macOS 14.1.2 (23B92)). The comments around the changed line say kv_self.n is
a heuristic (and the commented-out lines nearby suggest other possible values
for the assignment), so I presume it shouldn't be a problem to remove the
GGML_PAD(). Empirically it seems to work fine. That said, the bug could well
run deeper; understanding the root cause is beyond me.

It apparently reproduces with various models, not only tinyllama, but
tinyllama is small, so it is the most convenient to reproduce with. While
tinyllama isn't known for the quality of its output, there is still an
obvious difference between the nonsense output and the normal output.

Reproduction:

`./parallel -c 99999 -n 30 -ns 10 -np 2 -m ~/Downloads/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf`

Before the fix (example bad output):

```
Input: If you could have any superpower, what would it be?
Response: . In the 812
```

After the fix (example expected output):

```
Input: If you could have any superpower, what would it be?
Response: I would choose the power of being able to control time. The power
```

----

After typing the above, I realized that with larger models the problem is
less apparent, but it still exists. For example, I tried mixtral:

`./parallel -c 99999 -n 50 -ns 50 -np 2 -m ~/Downloads/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf`

And one of the outputs was:

```
Input: Recommend some interesting books to read.
Response: I recommend the book "Surelyourecommend would suggest starting with the book "The Foundation for Self-Help by Dr. Micahelle myself to anywhere in the world"
```

The problem with the above response is obvious, but here's another whose
problem isn't so obvious at a glance:

```
Input: I want to learn how to play the piano.
Response: That's great! I could recommend a personalize piano lessons with a piano teacher. This will allow you to learn at your own pace. You can practice scales and chords,
```

Note that "a personalize piano lessons" is not grammatical English, a mistake
that mixtral should not make. I didn't notice any such errors when testing
with this patch applied.
---
 llama.cpp | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/llama.cpp b/llama.cpp
index 259f2a3a3ea00..eff7b558e6327 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -7864,7 +7864,7 @@ static int llama_decode_internal(
         // a heuristic, to avoid attending the full cache if it is not yet utilized
         // after enough generations, the benefit from this heuristic disappears
         // if we start defragmenting the cache, the benefit from this will be more important
-        kv_self.n = std::min((int32_t) cparams.n_ctx, std::max(32, GGML_PAD(llama_kv_cache_cell_max(kv_self), 32)));
+        kv_self.n = std::min((int32_t) cparams.n_ctx, std::max(32, llama_kv_cache_cell_max(kv_self)));
         //kv_self.n = llama_kv_cache_cell_max(kv_self);
         //printf("kv_self.n = %5d, kv_self.used = %5d, kv_self.head = %5d\n", kv_self.n, kv_self.used, kv_self.head);
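
For anyone else who hasn't looked at GGML_PAD before, here is a minimal
standalone sketch of what the old and new expressions compute. It is not part
of the patch: pad_to() is my stand-in for GGML_PAD (assuming it rounds its
argument up to the next multiple of n), and n_ctx and the cell-max values are
invented for illustration.

```
#include <algorithm>
#include <cstdio>

// Stand-in for GGML_PAD, assuming it rounds x up to the next multiple of n.
static int pad_to(int x, int n) { return (x + n - 1) / n * n; }

int main() {
    const int n_ctx   = 4096;          // hypothetical context size
    const int cases[] = {7, 45, 64};   // hypothetical llama_kv_cache_cell_max() results
    for (int cell_max : cases) {
        // old expression: pad cell_max up to a multiple of 32 before clamping
        int old_n = std::min(n_ctx, std::max(32, pad_to(cell_max, 32)));
        // new expression (this patch): clamp cell_max directly
        int new_n = std::min(n_ctx, std::max(32, cell_max));
        printf("cell_max=%2d  old kv_self.n=%2d  new kv_self.n=%2d\n",
               cell_max, old_n, new_n);
    }
    return 0;
}
```

The only arithmetic difference is that the old expression can extend
kv_self.n up to 31 cells past cell_max (45 -> 64 above); with the revert,
kv_self.n tracks cell_max exactly once it exceeds 32.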