
Compute perplexity over prompt #270

Merged 7 commits on Mar 21, 2023

Conversation

glinscott
Collaborator

@glinscott commented Mar 18, 2023

This adds an option to compute perplexity over the prompt input, similar to https://huggingface.co/docs/transformers/perplexity. It does so by chunking the prompt into non-overlapping chunks of the context-window size, running the forward pass on each chunk, and computing the softmax probability of the output logits for the last half of the context window, so that the model always has some context to predict the next token. Be warned: it is pretty slow, taking roughly 4 hours to complete wikitext-2 on a 32-core machine.
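
In pseudocode-ish C++, the scheme looks roughly like this (an illustrative sketch, not the actual PR code; the eval_probs hook stands in for the model forward pass):

#include <cmath>
#include <functional>
#include <vector>

// eval_probs(chunk) is assumed to run the model on `chunk` and return
// probs[i][v] = P(next token is v | chunk[0..i]).
double perplexity_over_prompt(
        const std::vector<int> & tokens, int n_ctx,
        const std::function<std::vector<std::vector<float>>(const std::vector<int> &)> & eval_probs) {
    double nll = 0.0;
    int count = 0;
    const int seq_count = (int) tokens.size() / n_ctx;
    for (int i = 0; i < seq_count; ++i) {
        const std::vector<int> chunk(tokens.begin() + i * n_ctx,
                                     tokens.begin() + (i + 1) * n_ctx);
        const auto probs = eval_probs(chunk);
        // score only the second half of the window, so every scored token
        // has at least n_ctx/2 tokens of context
        for (int j = n_ctx / 2; j < n_ctx - 1; ++j) {
            nll += -std::log(probs[j][chunk[j + 1]]);
            ++count;
        }
    }
    return std::exp(nll / count); // perplexity = exp(mean negative log-likelihood)
}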

Note: when doing prediction over large prompts, the default 10% expansion for the memory buffer is not sufficient - there is definitely a non-linear scaling factor in there somewhere.

Example:

  1. Download/extract: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-raw-v1.zip?ref=salesforce-research
  2. Run ./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw
  3. Output: perplexity: 6.5949 [655/655]

Some example runs at context 512:

5.5985 - 13B, q4_0
5.9565 - 7B, f16
6.3001 - 7B, q4_1
6.5949 - 7B, q4_0
6.5995 - 7B, q4_0, --memory_f16

Context 1024 runs:

5.9876 - 7B, q4_0, --memory_f16

These results show that the 16-bit version of the model is the best (lower perplexity is better), that the 4-bit quantization introduces a fair amount of error (though certainly not disastrous), and that the --memory_f16 flag is almost identical to the baseline 4-bit run.

Comparing to this article, where they compare 4-bit quantization to GPTQ: https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-int-3-and - the results are comparable, which is a good sign!
[image]

@glinscott
Collaborator Author

Got results for 7B, ctx=1024: perplexity: 11.4921 [57/57], so that seems promising.

@Green-Sky
Collaborator

This is indeed very CPU-time consuming. I had it running for 25 min and only got this far:
perplexity: 12.5934 [39/649] for 7B q4_0 ctx=512 (everything default)

@Green-Sky
Collaborator

Note: when doing prediction over large prompts, the default 10% expansion for the memory buffer is not sufficient - there is definitely a non-linear scaling factor in there somewhere.

this is likely related to #213

@ggerganov
Owner

Very useful work. I think this can be made significantly faster if we have the option for the eval method to return the logits even for the past tokens:

llama.cpp/main.cpp

Lines 733 to 735 in 7392f1c

// return result for just the last token
embd_w.resize(n_vocab);
memcpy(embd_w.data(), (float *) ggml_get_data(inpL) + (n_vocab*(N-1)), sizeof(float)*n_vocab);
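
A sketch of what that option could look like at this spot (`return_all_logits` is an illustrative name, not the actual implementation):

// hypothetical flag: return logits for every position, not just the last one
if (return_all_logits) {
    // embd_w then holds N rows of n_vocab logits, laid out by token position
    embd_w.resize(n_vocab*N);
    memcpy(embd_w.data(), (float *) ggml_get_data(inpL), sizeof(float)*n_vocab*N);
} else {
    // return result for just the last token
    embd_w.resize(n_vocab);
    memcpy(embd_w.data(), (float *) ggml_get_data(inpL) + (n_vocab*(N-1)), sizeof(float)*n_vocab);
}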

@bakkot
Contributor

bakkot commented Mar 19, 2023

@glinscott How is it you're seeing [x/114]? With the default context size (512), I'm seeing [x/649]. But from the code that should only depend on tokens.size() / params.n_ctx, and those should be constant across machines. Were you using a different dataset or something?

Anyway I ran it on the 7B FP16 model before your most recent commits (at commit e94bd9c), with

./main -m ./models/7B/ggml-model-f16.bin -n 128 -t 8 --perplexity -f ./wikitext-2-raw/wiki.test.raw

and got

perplexity: 10.4625 [649/649]

@glinscott
Collaborator Author

Very useful work. I think this can be significantly made faster if we have the option for the eval method to return the logits even for the past tokens:

Yes, thanks! I was prototyping this last night, just got it working (I think).

Current output with:

$ ./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw
...
perplexity: 13.0231 [39/114] 

So it's consistent with the old one, but much more accurate (256x more tokens scored per 512-token window).

@Green-Sky
Collaborator

Anyway I ran it on the 7B FP16 model before your most recent commits (at commit e94bd9c), with

./main -m ./models/7B/ggml-model-f16.bin -n 128 -t 8 --perplexity -f ./wikitext-2-raw/wiki.test.raw

and got

perplexity: 10.4625 [649/649]

I did exactly the same 🙈, but with a slightly different batch size.
$ ./main --perplexity -t 8 -c 512 -b 32 -f wikitext-2-raw/wiki.test.raw -m models/7B/ggml-model-f16.bin
perplexity: 10.4624 [649/649]

@Green-Sky
Collaborator

@glinscott How is it you're seeing [x/114]? With the default context size (512), I'm seeing [x/649]. But from the code that should only depend on tokens.size() / params.n_ctx, and those should be constant across machines. Were you using a different dataset or something?

@glinscott can you check your wikitext file is correct?

@glinscott
Collaborator Author

@glinscott How is it you're seeing [x/114]? With the default context size (512), I'm seeing [x/649]. But from the code that should only depend on tokens.size() / params.n_ctx, and those should be constant across machines. Were you using a different dataset or something?

There are a couple of possibilities. I get this error tokenizing:

failed to tokenize string at 1067123!

So, I assume it truncates the string there? Do other folks not see that?

Other possibility is my dataset is wrong, can someone double check? It's 1290590 bytes.

$ sha256sum wiki.test.raw 
173c87a53759e0201f33e0ccf978e510c2042d7f2cb78229d9a50d79b9e7dd08  wiki.test.raw

@Green-Sky
Collaborator

$ sha256sum wikitext-2-raw/wiki.test.raw
173c87a53759e0201f33e0ccf978e510c2042d7f2cb78229d9a50d79b9e7dd08  wikitext-2-raw/wiki.test.raw

hmm, so file hash checks out

@glinscott
Collaborator Author

One thing to note: I don't think params.n_batch has any effect yet - adding support for that shouldn't be too hard, though.

Can someone try adding this debugging printf() in?

--- a/main.cpp
+++ b/main.cpp
@@ -776,6 +776,7 @@ void perplexity(const gpt_vocab &vocab, const llama_model &model, const gpt_para
     int count = 0;
     double nll = 0.0;
     int seq_count = tokens.size() / params.n_ctx;
+    printf("params.prompt.size() = %d, tokens.size() = %d, params.n_ctx = %d, seq_count = %d\n", params.prompt.size(), tokens.size(), params.n_ctx, seq_count);

I get this:

params.prompt.size() = 1290589, tokens.size() = 58773, params.n_ctx = 512, seq_count = 114

@Green-Sky
Collaborator

you should check your model files - see #238

@Green-Sky
Collaborator

I get this:

params.prompt.size() = 1290589, tokens.size() = 58773, params.n_ctx = 512, seq_count = 114

For comparison, I get:

params.prompt.size() = 1290589, tokens.size() = 332762, params.n_ctx = 512, seq_count = 649

@bakkot
Contributor

bakkot commented Mar 19, 2023

Same results as @Green-Sky here. @glinscott I suspect you need to rebuild your models; you can check against the md5 hashes listed in #238. I don't see the "failed to tokenize string" message you report, either. I'm guessing you did the conversion before #79.

@glinscott
Collaborator Author

Sure enough, my model was busted! Ok, I see consistent results now :).

params.prompt.size() = 1290589, tokens.size() = 332762, params.n_ctx = 512, seq_count = 649
perplexity: 16.0483 [16/649] 22507 ms

Now, at 22 seconds per inference pass, it's ~4 hours to do wikitext-2. So it would be great to see if we can get representative results from a much smaller subset.

@glinscott
Collaborator Author

I'll do a run with:

$ ./main --perplexity -m models/7B/ggml-model-q4_0.bin -f wiki.test.raw

And log all the perplexities along the way. Once it starts to converge, that's probably a good sign we can cut the dataset off at that point. With the new method it hopefully converges much faster - but we will see!

perplexity: 8.4400 [1/649] 24456 ms   
perplexity: 11.4887 [2/649] 22491 ms   
perplexity: 12.5905 [3/649] 22476 ms   
perplexity: 13.7533 [4/649] 22608 ms   
perplexity: 13.9558 [5/649] 22577 ms   
perplexity: 13.6425 [6/649] 22604 ms   
perplexity: 13.8768 [7/649] 22590 ms 

@glinscott
Collaborator Author

I asked GPT4 for some stats advice, and it recommended:

If you only care about accuracy down to two decimal digits, then you can stop sampling when your confidence interval has a width less than or equal to 0.01.
This means that you need at least n = 38416 samples to achieve an accuracy of two decimal digits with 95% confidence.

We will see :). That's [150/649] in our setup (256 samples per evaluation); a quick sketch of where the 38416 comes from is after the numbers below. There are some big assumptions about the data being uniformly distributed in there which probably don't hold for wikitext-2, but it's probably still a reasonable guideline. The most recent results do look like they are converging nicely:

perplexity: 12.5374 [90/649] 22718 ms   
perplexity: 12.5553 [91/649] 22714 ms   
perplexity: 12.5719 [92/649] 22715 ms   
perplexity: 12.5630 [93/649] 22715 ms   
perplexity: 12.6198 [94/649] 22766 ms   
perplexity: 12.7071 [95/649] 22743 ms   
perplexity: 12.7707 [96/649] 22755 ms   
perplexity: 12.7453 [97/649] 22706 ms   
perplexity: 12.7235 [98/649] 22700 ms 
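
For reference, the 38416 figure corresponds to the normal-approximation sample-size formula with a 0.01 margin of error and an assumed standard deviation of 1:

n = (z * sigma / E)^2 = (1.96 * 1 / 0.01)^2 = 38416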

@bakkot
Contributor

bakkot commented Mar 20, 2023

I merged in #252 locally and am seeing a much better score: perplexity: 5.8149 [655/655]! That's a huge improvement, and much closer to the number reported in the "Int-4 is not enough" post.

This was done using the same model (7B FP16) and settings I used above (except with the model re-built to use the new tokenizer, of course), and without 91d71fe, so the numbers should be directly comparable to the ones I got above (10.4625).

I'll re-run both scenarios using the new logic in this branch but I expect very similar results. (Edit: yeah, pretty similar: 11.4675 before fixing the tokenizer, 5.9565 after.)

@glinscott
Collaborator Author

@bakkot - wow, that is an incredible delta. Interesting, so the tokens must be subtly off with the existing tokenizer?

@bakkot
Contributor

bakkot commented Mar 20, 2023

Not really subtly, as reported in e.g. #167. Honestly it's impressive that it does as well as it does with the broken tokenizer it's currently using.

@glinscott
Collaborator Author

Ok, well, for the 7B model at 4 bit quantization, the perplexity appears to be 12.2-12.9 or so. Doing a little bit of a random walk. Going to stop at 470 since I'm excited to try out #252 :).

[image]

@bakkot
Contributor

bakkot commented Mar 20, 2023

@glinscott Sidebar - the perplexity scores for chunks in wikitext aren't independent, because some articles are easier than others and there are multiple chunks per article. (So e.g. you might have ten chunks in a row from a really difficult article, each of which will raise the perplexity. With independent chunks that sort of consistent change in one direction would happen only very rarely.) That means perplexity isn't going to converge as fast as it should. So you might want to randomize the order in which chunks are processed, as in

// Different parts of the prompt are likely to vary in difficulty.
// For example, maybe the first half is easy to predict and the second half is hard.
// That will prevent scores from converging until the whole run finishes.
// So we randomize the order in which we consume each part of the prompt,
// so that the score converges towards the real value as rapidly as possible.
// (needs <algorithm>, <numeric>, and <random>)
std::mt19937 gen(0x67676d6c); // use a fixed seed so results are reproducible; this seed is `ggml` in hex
std::vector<int> indexes(seq_count);
std::iota(indexes.begin(), indexes.end(), 0); // chunk indices 0..seq_count-1, matching the original loop
std::shuffle(indexes.begin(), indexes.end(), gen);
for (int i : indexes) {

instead of

for (int i = 0; i < seq_count; ++i) {

I haven't run this code, nor tested if it actually makes a difference in how fast the scores converge, but I expect it or something like it should work.

@glinscott
Collaborator Author

Results for 4-bit quantization are looking great so far with #252 merged in as well!

perplexity: 6.5217 [62/655]
perplexity: 6.5569 [63/655]
perplexity: 6.5744 [64/655]
perplexity: 6.6235 [65/655]
perplexity: 6.6335 [66/655]
perplexity: 6.6522 [67/655]

@bakkot
Contributor

bakkot commented Mar 20, 2023

I captured the perplexity for each chunk separately (using 7B FP16, with #252 merged in).

From there I looked into how good the measurement would be if you used fewer chunks, assuming you consume the chunks in a random order. Keep in mind there's 655 chunks total. The (empirical) 90% confidence intervals for the difference from the final perplexity (for my specific conditions) after a specific number of chunks are:

  • 10 chunks: ±1.0730
  • 20 chunks: ±0.7697
  • 50 chunks: ±0.4812
  • 100 chunks: ±0.3279
  • 150 chunks: ±0.2549
  • 200 chunks: ±0.2104
  • 400 chunks: ±0.1122

Determining whether you can get away with fewer chunks will depend on the size of the effect you're looking at - e.g. the fixed tokenizer is obviously better after only 10 chunks, but confirming the presence of smaller effects (like from improved quantization strategies) will require significantly more.

code/data if you want to reproduce my results
// this is javascript
let data = [
  4.2337, 5.2897, 7.7770, 8.3287, 6.8390, 6.0943, 7.7364, 7.2585, 10.1792, 9.7338, 9.8226, 7.5937, 6.3308, 8.0002, 11.8844, 3.3917, 5.5408, 6.1756, 2.6730, 6.4901, 4.9989, 3.6592, 5.7682, 4.5201, 6.2160, 3.2217, 2.8567, 3.7494, 3.7945, 2.5883, 4.7733, 6.2793, 4.0443, 6.6725, 6.4370, 7.0745, 5.6611, 6.0521, 7.0657, 8.0699, 6.0984, 7.5405, 4.3730, 8.8372, 5.9219, 4.7395, 6.8133, 4.7350, 5.8450, 4.1329, 5.5502, 5.2692, 8.6583, 4.9914, 4.6868, 7.5662, 6.9880, 6.9894, 6.8970, 8.8414, 5.4384, 10.6731, 8.1942, 6.8570, 9.3563, 6.5627, 7.2757, 7.0825, 7.8798, 8.5397, 7.7570, 8.8057, 12.2151, 6.5003, 7.2832, 7.1812, 7.1461, 5.2082, 8.8034, 5.7541, 7.2228, 6.5905, 3.2219, 4.8862, 5.2106, 4.6112, 2.4795, 4.2595, 4.5617, 4.9153, 8.4723, 5.5482, 6.1128, 5.8297, 9.2492, 6.0519, 5.5583, 5.5216, 4.9173, 5.9582, 8.9768, 5.6014, 8.5170, 6.8875, 6.0951, 8.1004, 6.0354, 7.6947, 5.6168, 5.7427, 9.1345, 8.8376, 6.3986, 5.7434, 6.8633, 5.2115, 6.7495, 10.5116, 9.3441, 11.9780, 8.2422, 10.0067, 12.7040, 8.8324, 5.2965, 13.4408, 12.8634, 11.5266, 4.7939, 7.5777, 5.9655, 5.5261, 4.9038, 7.8649, 5.9049, 5.1198, 5.4877, 4.3806, 5.0965, 5.8914, 3.3561, 5.8583, 3.2323, 3.9742, 5.1125, 4.7900, 6.7743, 6.3185, 5.5245, 5.6687, 6.5638, 4.9464, 4.2488, 5.0675, 7.3592, 5.5228, 9.4368, 6.9210, 6.9797, 6.6831, 8.4606, 3.0650, 4.6591, 3.4063, 2.7900, 3.0231, 2.3005, 2.6896, 4.1826, 4.5053, 2.9034, 3.7563, 3.7867, 2.5532, 3.2104, 4.2681, 3.3105, 3.0264, 3.5613, 4.4102, 3.0667, 3.3960, 3.8231, 5.6702, 4.6170, 6.0197, 7.0675, 5.1326, 10.0308, 5.9919, 11.3845, 9.7865, 9.9764, 8.3787, 11.7139, 9.7893, 11.7055, 9.7135, 6.5766, 7.0163, 5.0125, 11.0156, 7.5948, 5.6769, 8.4561, 7.5776, 5.2701, 7.9725, 6.8910, 7.1792, 8.6991, 7.6900, 8.6591, 6.5381, 6.6024, 9.9117, 11.4651, 9.6110, 6.0322, 5.3760, 5.0621, 5.6246, 4.3323, 4.6806, 5.2827, 12.8015, 8.1204, 7.3919, 7.6432, 5.4063, 11.2815, 3.9873, 3.3158, 3.5056, 2.8041, 4.7094, 4.1956, 6.7119, 3.4211, 4.0789, 6.4766, 6.9613, 5.6383, 3.8569, 5.3274, 3.8636, 3.7660, 4.4742, 5.4093, 7.2289, 4.4956, 5.2353, 4.0107, 4.7802, 3.7488, 2.8184, 3.5604, 4.2093, 5.3541, 4.1740, 4.9184, 4.6309, 4.6749, 2.1799, 5.7219, 5.4113, 4.3672, 8.6913, 5.3731, 6.1470, 8.3038, 6.8235, 5.9549, 6.5837, 8.5758, 7.7327, 12.1389, 9.3534, 9.1320, 6.7431, 9.3347, 7.7855, 11.8079, 8.6349, 8.8769, 11.3166, 5.8538, 7.8667, 4.0560, 2.8534, 2.9460, 2.9278, 3.1373, 6.6050, 5.6842, 7.4505, 5.5637, 6.8299, 5.2548, 3.4957, 5.9363, 4.0149, 3.8561, 3.8802, 4.9512, 3.1070, 6.6027, 6.8806, 2.6353, 4.4386, 4.2173, 6.5665, 4.3896, 5.3577, 2.5667, 4.4052, 2.4796, 1.9780, 10.9267, 11.2068, 7.4261, 4.6996, 4.0354, 5.0048, 10.1574, 5.8825, 6.5496, 7.2039, 8.0570, 6.7768, 11.5410, 4.9996, 8.5831, 4.3073, 4.1795, 7.2409, 5.1631, 5.6205, 4.3670, 4.5893, 9.2200, 6.8801, 7.6852, 5.9022, 6.0188, 5.0642, 7.4118, 7.1476, 6.6982, 4.8392, 6.1443, 5.8701, 4.1545, 5.8907, 7.9460, 7.0058, 4.7597, 10.0613, 6.8521, 4.7857, 5.7337, 8.9369, 11.5146, 8.5051, 8.0402, 6.3870, 9.9484, 5.0987, 6.2364, 6.4576, 4.2600, 7.9318, 7.8497, 5.3683, 5.9516, 8.9665, 4.4904, 6.9869, 8.5304, 3.6020, 4.7592, 4.3036, 5.6554, 5.7098, 5.5246, 5.7023, 5.8297, 4.6599, 4.2254, 3.7789, 3.5960, 4.5255, 5.2527, 6.9731, 5.4062, 3.6407, 9.3482, 7.5259, 9.8064, 5.9531, 6.4362, 6.2962, 6.7262, 9.0811, 3.0848, 4.7268, 5.7033, 6.5912, 12.8079, 12.4113, 12.7754, 17.2818, 12.5417, 9.9668, 8.6653, 10.1074, 13.5201, 7.5909, 9.4968, 10.9255, 13.2899, 7.9184, 9.7576, 12.5443, 10.9062, 9.3986, 8.1765, 10.7153, 8.5812, 10.7370, 16.0474, 7.9101, 5.7778, 4.5653, 6.4762, 7.2687, 11.9263, 10.3103, 
4.8934, 5.7261, 4.2609, 5.4918, 6.6898, 6.2934, 5.3325, 7.2188, 7.5185, 8.2033, 5.1273, 6.5011, 4.5670, 2.2386, 3.2372, 3.9893, 6.5170, 8.6117, 7.0107, 5.1495, 6.3412, 11.4701, 4.9500, 5.4811, 8.1177, 5.5823, 4.9553, 3.3866, 6.1404, 5.9408, 7.0779, 6.2677, 4.2244, 8.4411, 4.0633, 6.6431, 3.7769, 6.8590, 3.4788, 5.5537, 9.3246, 8.5652, 6.9801, 4.2857, 4.3862, 7.0454, 5.1355, 3.8384, 5.9033, 5.0303, 4.1490, 4.9914, 4.7928, 3.8402, 4.8503, 5.2423, 5.7663, 4.4227, 3.8162, 5.2343, 4.2511, 2.8094, 3.4875, 6.0975, 5.7161, 2.9156, 7.1691, 6.3762, 3.6819, 4.2773, 5.7032, 7.9792, 8.8110, 8.0352, 7.0775, 10.1292, 3.7952, 5.5346, 6.5626, 5.8085, 7.7367, 7.3946, 6.6608, 7.5490, 6.3721, 9.8001, 7.8648, 6.5025, 7.0076, 3.8873, 6.2564, 3.9161, 5.4713, 8.9538, 7.3572, 5.1763, 7.2523, 3.7586, 4.9993, 9.2871, 6.6082, 8.3411, 6.1726, 6.6453, 6.9063, 6.6387, 5.0784, 6.4587, 4.1723, 3.9443, 6.0791, 4.6138, 4.4106, 4.9176, 4.3316, 4.9980, 4.5371, 5.7626, 7.3694, 4.2320, 5.8014, 5.9095, 6.1267, 4.9075, 5.7717, 8.9320, 7.2476, 5.8910, 4.9010, 6.3294, 5.2988, 7.7972, 6.2766, 6.5831, 6.0055, 4.2898, 5.6208, 5.9260, 5.2393, 5.0046, 6.2955, 3.2518, 4.2156, 5.3862, 6.4839, 6.1187, 2.9040, 3.1042, 6.2950, 9.5657, 9.8978, 8.0166, 7.3498, 5.3361, 4.3502, 6.6131, 4.7414, 8.2340, 4.8954, 4.4713, 7.3732, 5.7101, 5.1141, 6.5039, 7.8757, 6.4675, 8.4179, 7.3042, 5.0399, 4.3175, 6.4821, 8.5142, 5.0135, 7.7970, 4.1496, 3.6323, 2.8987, 7.8440, 3.2591, 3.6729, 3.4526, 1.4961, 2.9882, 5.0225, 6.9797, 6.2451, 6.0565, 5.2908, 7.4791, 6.0146, 5.6742, 8.2883, 10.7090, 10.6945, 5.1382, 8.5528, 6.3640, 4.1532, 4.1070, 7.6952, 4.2944, 6.5832, 6.0564, 11.9188, 7.4791, 6.7621, 4.8915, 9.0851, 3.8716, 6.5621, 6.0976, 8.9491, 10.5537, 6.6961, 9.0845, 3.0358, 5.5621,
];

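// combinePerp folds one more chunk's perplexity `a` into the running perplexity `acc`
// over `len` chunks: overall perplexity is exp(mean log-perplexity), so the average is taken in log space.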
function combinePerp(acc, a, len) {
  let avg = Math.log(acc);
  let sum = avg * len;
  sum += Math.log(a);
  avg = sum / (len + 1);
  return Math.exp(avg);
}

function chunksToRunningScores(chunks) {
  let intermediatePerpScores = [chunks[0]];
  for (let i = 1; i < chunks.length; ++i) {
    intermediatePerpScores[i] = combinePerp(intermediatePerpScores[i - 1], chunks[i], i);
  }
  return intermediatePerpScores;
}

function fyShuffle(data) {
  let out = [...data];
  let i = out.length;
  while (--i > 0) {
    let idx = Math.floor(Math.random() * (i + 1));
    [out[idx], out[i]] = [out[i], out[idx]];
  }
  return out;
}


let test_N = 20000;
let Ns = data.map(() => []);
for (let i = 0; i < test_N; ++i) {
  let randomized = fyShuffle(data);
  let running = chunksToRunningScores(randomized);
  for (let N = 1; N < data.length; ++N) {
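    // 5.9565 is the final perplexity over all 655 chunks for this 7B FP16 run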
    Ns[N].push(Math.abs(5.9565 - running[N]));
  }
}
Ns.forEach(x => x.sort((a, b) => a - b));
let idx = Math.floor(.9 * test_N); // 90% CI
for (let N = 1; N < Ns.length; ++N) {
  console.log(N, Ns[N][idx].toFixed(4));
}

@Green-Sky
Collaborator

Green-Sky commented Mar 22, 2023

Started a run of 7B q4_0 to verify it still works after the refactor, using my quickfix (#385 (comment)).

brb, ETA ~7.22 hours

result here: #406 (comment)

@alankila
Copy link

alankila commented Mar 22, 2023

I tested my approach now using objective perplexity. Unfortunately, the results are not much improved. I have the first 3 test results here with my Q4_1 improved quantization:

$ ./main --perplexity -m models/7B/ggml-model-q4_1.bin -f wikitext-2-raw/wiki.test.raw 
[1]4.4862,[2]4.9819,[3]5.8331,

The best reference for stock Q4_1 is results from @Green-Sky which were:

4.4880
4.9980
5.9143

I suppose it is fair to say that at least it didn't make things much worse, and it somewhat improved one of the poorer tests, which is #3. The code recalculates the min/max parameters of the quantization using the average of the values falling in bins 0 and 15, but rejects the result if the root mean square error is not improved by the new bounds (a rough sketch of the idea is at the end of this comment).

Edit: unfortunately, these results are also subject to the batching and threading parameters. They have to be specified, or the results will differ between otherwise identical runs. It seems I can replicate Green-Sky's results using -t 4 and -b 4 - at least the first three numbers are the same now. So when comparing results across implementations, it is necessary to have the same approximations occurring inside GGML. I produced the above results at the default -t 8 and -b 8 on my hardware, so they are not comparable. I think Q4_1 quantization improvements are not easily proven unless a large number of these tests are run, which takes a long time and slows down iterative development.

Edit 2: I got comparable results using -t 4 and -b 4; they are 4.4791, 4.9720, 5.8831. On top of the old code I reimplemented the simplest algorithm, which just averages the bin-0 and bin-15 values and quantizes with them, without checking whether this choice actually improves the quantization error. If I can get something that seems to improve by more than 0.02 or so, I may submit a pull request. Given that Q4_1 is only about 0.35 worse than f16, something like 0.1 may already be worth considering.

Edit 3: when introducing a check that the RMS quantization error is actually reduced before using the optimized parameters, the first three numbers become 4.5279, 4.9482, 5.8555. I have crap hardware for this task, though - barely enough memory to run the perplexity test, and iteration time is around 200 seconds, which is why the results are so few. I ran the "edit 2" test overnight until the 199th result in the morning, which looks like a good estimate for the final value, and it read 6.2773 there; f16 is around 6 at that point. I think this conditional use of the optimized parameters improves things by about 0.03, so I guess it might land around 6.25 if I could run it long enough.
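
Rough sketch of the refinement described above (my reconstruction from the description, not the actual patch; the 16-level range and the per-block handling are assumptions):

#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// RMS reconstruction error of one block quantized with the given bounds.
static double block_rms(const std::vector<float> & x, float vmin, float vmax) {
    const float scale = (vmax - vmin) / 15.0f;
    double err = 0.0;
    for (float v : x) {
        int q = scale > 0.0f ? (int) std::lround((v - vmin) / scale) : 0;
        q = std::clamp(q, 0, 15);
        const double r = vmin + q * scale;  // reconstructed value
        err += (v - r) * (v - r);
    }
    return std::sqrt(err / x.size());
}

// Recompute the bounds from the mean of the values that landed in the extreme
// bins (0 and 15), and keep them only if the RMS error actually improves.
static std::pair<float, float> refine_bounds(const std::vector<float> & x) {
    float vmin = *std::min_element(x.begin(), x.end());
    float vmax = *std::max_element(x.begin(), x.end());
    const float scale = (vmax - vmin) / 15.0f;

    double lo_sum = 0.0, hi_sum = 0.0;
    int lo_n = 0, hi_n = 0;
    for (float v : x) {
        int q = scale > 0.0f ? (int) std::lround((v - vmin) / scale) : 0;
        q = std::clamp(q, 0, 15);
        if (q == 0)  { lo_sum += v; ++lo_n; }
        if (q == 15) { hi_sum += v; ++hi_n; }
    }
    if (lo_n == 0 || hi_n == 0) {
        return {vmin, vmax};
    }
    const float new_min = (float) (lo_sum / lo_n);
    const float new_max = (float) (hi_sum / hi_n);

    if (block_rms(x, new_min, new_max) < block_rms(x, vmin, vmax)) {
        return {new_min, new_max};  // refined bounds reduce the RMS error
    }
    return {vmin, vmax};            // otherwise fall back to plain min/max
}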

@glinscott
Collaborator Author

glinscott commented Mar 22, 2023

Thanks @Green-Sky!

I finished a run of 13B q4_0 overnight, and it looks great - a significant improvement even vs 7B f16. I don't have enough RAM to run 13B f16 to compare, though. I'm a bit unsure how batch size is implemented; perhaps it could allow that to work. It might be good to test its impact on perplexity as well.

5.5985 - 13B, q4_0
5.9565 - 7B, f16
6.3001 - 7B, q4_1
6.5949 - 7B, q4_0
6.5995 - 7B, q4_0, --memory_f16

[image]

13B q4_0 raw data

4.0769 4.4621 5.3497 5.8519 6.0447 5.9598 6.0964 6.2027 6.4905 6.7087 6.8834 6.9285 6.8792 6.9671 7.1674 6.8171 6.7220 6.7009 6.3737 6.3470 6.2709 6.0945 6.0726 5.9819 5.9864 5.8302 5.6464 5.5507 5.4693 5.3187 5.2764 5.2890 5.2477 5.2937 5.3156 5.3393 5.3268 5.3235 5.3540 5.3967 5.4199 5.4603 5.4172 5.4553 5.4564 5.4264 5.4526 5.4301 5.4358 5.3985 5.4037 5.3947 5.4402 5.4290 5.4082 5.4348 5.4506 5.4722 5.4915 5.5310 5.5271 5.5827 5.6068 5.6159 5.6544 5.6547 5.6711 5.6852 5.7143 5.7453 5.7695 5.8068 5.8600 5.8669 5.8819 5.8954 5.9070 5.8932 5.9203 5.9150 5.9264 5.9268 5.8795 5.8658 5.8596 5.8412 5.7793 5.7415 5.7170 5.7060 5.7288 5.7241 5.7255 5.7242 5.7522 5.7506 5.7491 5.7465 5.7377 5.7326 5.7570 5.7509 5.7653 5.7727 5.7742 5.7895 5.7871 5.8011 5.7991 5.7952 5.8144 5.8324 5.8336 5.8314 5.8345 5.8213 5.8217 5.8448 5.8644 5.8955 5.9109 5.9343 5.9715 5.9905 5.9845 6.0218 6.0560 6.0856 6.0719 6.0811 6.0764 6.0713 6.0595 6.0685 6.0680 6.0586 6.0550 6.0407 6.0320 6.0321 6.0037 6.0004 5.9724 5.9553 5.9468 5.9351 5.9391 5.9411 5.9366 5.9354 5.9413 5.9345 5.9230 5.9158 5.9209 5.9184 5.9353 5.9383 5.9386 5.9416 5.9528 5.9259 5.9151 5.8924 5.8651 5.8405 5.8064 5.7782 5.7637 5.7552 5.7332 5.7192 5.7039 5.6751 5.6530 5.6389 5.6217 5.5997 5.5856 5.5775 5.5591 5.5430 5.5291 5.5277 5.5207 5.5225 5.5287 5.5261 5.5430 5.5440 5.5618 5.5747 5.5907 5.6016 5.6216 5.6340 5.6547 5.6679 5.6691 5.6700 5.6632 5.6784 5.6847 5.6816 5.6916 5.6961 5.6926 5.6981 5.7019 5.7068 5.7163 5.7228 5.7334 5.7373 5.7401 5.7529 5.7699 5.7841 5.7839 5.7797 5.7743 5.7748 5.7673 5.7601 5.7574 5.7782 5.7858 5.7930 5.8005 5.7958 5.8108 5.7986 5.7830 5.7689 5.7474 5.7424 5.7321 5.7347 5.7221 5.7126 5.7162 5.7174 5.7159 5.7066 5.7024 5.6914 5.6811 5.6740 5.6708 5.6744 5.6653 5.6603 5.6504 5.6451 5.6354 5.6186 5.6076 5.6004 5.5993 5.5911 5.5859 5.5802 5.5754 5.5562 5.5564 5.5530 5.5461 5.5535 5.5529 5.5541 5.5607 5.5645 5.5650 5.5658 5.5716 5.5781 5.5903 5.5983 5.6069 5.6104 5.6206 5.6255 5.6384 5.6480 5.6558 5.6684 5.6650 5.6703 5.6639 5.6492 5.6347 5.6206 5.6078 5.6087 5.6088 5.6136 5.6124 5.6148 5.6123 5.6032 5.6036 5.5971 5.5881 5.5801 5.5780 5.5671 5.5702 5.5716 5.5575 5.5539 5.5491 5.5506 5.5444 5.5428 5.5292 5.5256 5.5123 5.4958 5.5068 5.5182 5.5231 5.5192 5.5114 5.5096 5.5193 5.5214 5.5225 5.5263 5.5307 5.5325 5.5430 5.5392 5.5473 5.5419 5.5358 5.5379 5.5365 5.5370 5.5322 5.5289 5.5358 5.5391 5.5429 5.5432 5.5445 5.5429 5.5474 5.5512 5.5534 5.5515 5.5530 5.5534 5.5480 5.5487 5.5537 5.5568 5.5538 5.5625 5.5641 5.5606 5.5606 5.5674 5.5786 5.5840 5.5878 5.5887 5.5977 5.5949 5.5957 5.5973 5.5929 5.5975 5.6023 5.5999 5.5992 5.6055 5.6011 5.6029 5.6071 5.6003 5.5972 5.5926 5.5905 5.5902 5.5886 5.5871 5.5870 5.5838 5.5798 5.5742 5.5684 5.5650 5.5644 5.5677 5.5668 5.5613 5.5680 5.5720 5.5793 5.5779 5.5792 5.5808 5.5837 5.5893 5.5749 5.5707 5.5702 5.5714 5.5829 5.5923 5.6022 5.6166 5.6273 5.6341 5.6403 5.6480 5.6581 5.6610 5.6665 5.6748 5.6845 5.6881 5.6940 5.7036 5.7115 5.7178 5.7222 5.7299 5.7339 5.7405 5.7535 5.7569 5.7556 5.7517 5.7529 5.7557 5.7643 5.7717 5.7686 5.7680 5.7634 5.7616 5.7626 5.7642 5.7637 5.7653 5.7676 5.7711 5.7692 5.7697 5.7668 5.7525 5.7431 5.7383 5.7385 5.7425 5.7439 5.7426 5.7423 5.7503 5.7463 5.7431 5.7430 5.7425 5.7402 5.7327 5.7318 5.7304 5.7318 5.7309 5.7264 5.7279 5.7226 5.7215 5.7150 5.7138 5.7062 5.7040 5.7058 5.7084 5.7092 5.7042 5.6998 5.7011 5.6955 5.6892 5.6885 5.6864 5.6818 5.6790 5.6751 5.6687 5.6657 5.6642 5.6623 5.6585 5.6528 5.6511 5.6474 5.6390 5.6317 5.6308 5.6292 5.6210 5.6215 5.6220 5.6169 
5.6127 5.6129 5.6155 5.6203 5.6240 5.6265 5.6324 5.6283 5.6268 5.6266 5.6259 5.6280 5.6290 5.6301 5.6319 5.6323 5.6378 5.6407 5.6408 5.6423 5.6365 5.6374 5.6329 5.6322 5.6372 5.6400 5.6380 5.6406 5.6363 5.6345 5.6397 5.6408 5.6426 5.6426 5.6440 5.6461 5.6477 5.6462 5.6462 5.6430 5.6383 5.6385 5.6364 5.6334 5.6315 5.6278 5.6256 5.6231 5.6222 5.6243 5.6211 5.6213 5.6199 5.6202 5.6175 5.6170 5.6212 5.6223 5.6226 5.6206 5.6214 5.6196 5.6225 5.6238 5.6248 5.6252 5.6222 5.6207 5.6203 5.6186 5.6167 5.6168 5.6110 5.6080 5.6087 5.6092 5.6095 5.6032 5.5976 5.5980 5.6028 5.6078 5.6107 5.6125 5.6115 5.6080 5.6092 5.6076 5.6122 5.6098 5.6068 5.6094 5.6084 5.6076 5.6083 5.6108 5.6120 5.6141 5.6155 5.6139 5.6108 5.6116 5.6160 5.6146 5.6168 5.6136 5.6092 5.6029 5.6053 5.5998 5.5949 5.5901 5.5785 5.5731 5.5709 5.5722 5.5726 5.5733 5.5727 5.5758 5.5761 5.5767 5.5799 5.5849 5.5901 5.5888 5.5918 5.5917 5.5886 5.5851 5.5871 5.5842 5.5848 5.5852 5.5908 5.5929 5.5944 5.5933 5.5969 5.5916 5.5929 5.5933 5.5961 5.6005 5.6009 5.6047 5.5991 5.5985

@Andrey36652

@glinscott nice result. Could it be improved further with q4_1, down to around 5.21 perplexity like here: https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-int-3-and ?
By the way, I have an idle 5600X with 32GB of DDR4. Would it be worth using it for perplexity scoring, or would it be too slow?

@Green-Sky
Collaborator

@Andrey36652 any compute resource counts.

@Green-Sky
Collaborator

@Andrey36652

would be too slow

it really depends on how long you can let it compute.

@Andrey36652

@Green-Sky

@Andrey36652

would be too slow

it really depends on how long you can let it compute.

I think, 12-18 hours per day

@jasontitus

I'm running 65B q4_0 and seeing really nice results so far on my M1 Ultra - averaging 3.72 after 11 steps. I'll report when it is done in 15 hours or so. It uses 80GB of RAM.

[1]3.0602,[2]3.4535,[3]4.0805,[4]3.8289,[5]3.6680,[6]3.5852,[7]3.7126,[8]3.8242,[9]3.8991,[10]3.9340,[11]3.9331

@Green-Sky
Collaborator

@jasontitus i think @gjmulder is running 65B q4_0 right now already, could you do q4_1 instead?

@jasontitus

How do I generate a q4_1 quantization?

@Green-Sky
Collaborator

Green-Sky commented Mar 22, 2023

invoke ./quantize with 3
or edit the quantize.py and change the 2 to 3

$ ./quantize
usage: ./quantize model-f32.bin model-quant.bin type
  type = 2 - q4_0
  type = 3 - q4_1

edit: i am assuming you have the f16 model files.

@maziyarpanahi

invoke ./quantize with 3

Like this?

./quantize ./models/65B/ggml-model-f16.bin  ./models/65B/ggml-model-q4_0.bin 3

@Green-Sky
Collaborator

Green-Sky commented Mar 22, 2023

yea, but for each file individually, or edit the python script and run that

edit: editing the python script is more involved actually, since it will also name the file q4_0

@maziyarpanahi

I ran this for q4_0, so it's just a matter of changing the 2 to a 3:

./quantize ./models/65B/ggml-model-f16.bin  ./models/65B/ggml-model-q4_0.bin 2
./quantize ./models/65B/ggml-model-f16.bin.1  ./models/65B/ggml-model-q4_0.bin.1 2
./quantize ./models/65B/ggml-model-f16.bin.2  ./models/65B/ggml-model-q4_0.bin.2 2
./quantize ./models/65B/ggml-model-f16.bin.3  ./models/65B/ggml-model-q4_0.bin.3 2
./quantize ./models/65B/ggml-model-f16.bin.4  ./models/65B/ggml-model-q4_0.bin.4 2
./quantize ./models/65B/ggml-model-f16.bin.5  ./models/65B/ggml-model-q4_0.bin.5 2
./quantize ./models/65B/ggml-model-f16.bin.6  ./models/65B/ggml-model-q4_0.bin.6 2
./quantize ./models/65B/ggml-model-f16.bin.7  ./models/65B/ggml-model-q4_0.bin.7 2

@Green-Sky
Collaborator

Green-Sky commented Mar 22, 2023

yes, and change the output filename to q4_1, so you know which is which :)

@Green-Sky
Collaborator

@gjmulder can you chime in and tell us which 65B model you are testing right now?

@Green-Sky
Collaborator

Green-Sky commented Mar 22, 2023

I am moving this to a Discussion - too much chatter in a merged PR :)

edit: #406

@jasontitus

Generating the new 65B q4_1 set now. Each file is 5.7GB rather than 4.8GB - I assume that is expected?

15G models/65B/ggml-model-f16.bin.1
4.8G models/65B/ggml-model-q4_0.bin.1
5.7G models/65B/ggml-model-q4_1.bin.1

@glinscott
Collaborator Author

I'm running 65B q4_0 and seeing really nice results so far on my M1 Ultra - averaging 3.72 after 11 steps. I'll report when it is done in 15 hours or so. Uses 80GB of RAM.

[1]3.0602,[2]3.4535,[3]4.0805,[4]3.8289,[5]3.6680,[6]3.5852,[7]3.7126,[8]3.8242,[9]3.8991,[10]3.9340,[11]3.9331

@jasontitus - nice! To clarify though, the output is already averaged over the chunks so far - so the perplexity at this point is simply the latest value, 3.9331, averaged across the first 11 chunks (no need to average the printed numbers again).

@glinscott deleted the perplexity branch on March 22, 2023 at 19:04
@Green-Sky
Collaborator

Generating the new 65B q4_1 set now. Each file is 5.7GB rather than 4.8GB - I assume that is expected?

yes, the q4_1 quantization stores extra values per block, but produces better results.
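
For reference, the size difference lines up with the block layouts (assuming the 32-weight blocks in use at the time, with an fp32 scale for q4_0 and an extra fp32 offset for q4_1):

q4_0 block: 4 (fp32 scale)                + 16 (nibbles) = 20 bytes -> 0.625 bytes per weight
q4_1 block: 4 (fp32 scale) + 4 (fp32 min) + 16 (nibbles) = 24 bytes -> 0.75 bytes per weight

24 / 20 = 1.2, and 4.8 GB * 1.2 is about 5.76 GB, which matches the ~5.7 GB files.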

@blackhole89
Contributor

blackhole89 commented Mar 22, 2023

Great work! Evaluating q4_1 on an M1 Ultra might be a bit tough, since we don't have ARM NEON code for q4_1 inference yet (only AVX2 for Skylake-and-beyond Intel processors and compatible) and so it would fall back to the much, much slower scalar code.

For those machines that do have AVX2, for Q4_1, I recommend pulling in code from my branch, which should run a little faster (5-10% in my tests).

I've also been thinking about whether we shouldn't introduce a quantization format that matches GPTQ exactly, with a single shared scale and offset for each block. This has the potential to be much more performant, as the data fetched in each AVX2 accumulation step (16 bytes) could then be perfectly cache-aligned, memory usage would be almost 1/3 lower, and inference is largely memory-bound anyway. (@ggerganov, what would be a good name for that format? Q4_2? Q4_1_UNIFORM?)
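
A rough sketch of what such a block layout might look like (hypothetical name and a 128-weight group size chosen for illustration; not an actual ggml type):

#include <cstdint>

// One block of 128 4-bit weights sharing a single scale/offset pair, GPTQ-style.
// The nibble data is a multiple of 16 bytes, so each AVX2 load stays naturally aligned.
struct q4_uniform_block {
    float   scale;        // shared scale for the whole block
    float   offset;       // shared offset (zero point) for the whole block
    uint8_t qs[128 / 2];  // 4-bit weights, two per byte
};
// Per-weight cost: (4 + 4 + 64) / 128 = 0.5625 bytes, vs 24 / 32 = 0.75 bytes for q4_1
// (about 25% smaller at this group size, approaching 1/3 smaller as the group grows).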

@ggerganov
Owner

Q4_2 should be ok.
Do you have an idea how to determine the shared scale and offset efficiently?
I got the impression that it involves some very heavy linear algebra computation to implement GPTQ.
Btw, I posted an idea for quantization improvement here: #397

@blackhole89
Contributor

blackhole89 commented Mar 22, 2023

I think I understand the GPTQ algorithm a bit better after reading the original paper (but it's not like I have the hardware to run it myself).

Otherwise, your post in #397 actually subsumes most ideas I had - I was thinking it might be worth trying to do local gradient descent on the offset/range to minimize the squared error (equivalent to minimizing RMS, since sqrt is monotonic). The paper you linked is quite interesting too - I wonder if some of the simpler techniques mentioned in it/its references (such as picking the nth-percentile value rather than the max to determine the quantization range) are applicable in our case, since most of those tests seem to have been done with int8 but jointly quantizing a lot more values.

I was testing the new perplexity measure with my performance fork and was dismayed to see deteriorations on the order of 0.05 in the score on the first batch (no time to run more) when measuring perplexity on various source files from HEAD. After permuting some more operations in the code (operations that commute on \mathbb{R}), and among other things getting a small improvement relative to baseline with one ordering, I'm pretty sure there are some numerical stability issues we're neglecting there. With my ~n=3 samples (main.cpp, utils.cpp, utils.h) the differences are always directionally the same (variants that are bad on one are bad on all), which suggests some orderings systematically result in more big+small additions.
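
A tiny standalone illustration of the kind of order-dependence described above (unrelated to the actual model code, just the floating-point effect):

#include <cstdio>

int main() {
    const float big   = 1e8f;
    const float small = 1.0f;

    float a = big;
    for (int i = 0; i < 100; ++i) a += small;  // big first: each +1 is absorbed by rounding

    float b = 0.0f;
    for (int i = 0; i < 100; ++i) b += small;  // small values first: their sum survives
    b += big;

    std::printf("%.1f vs %.1f\n", a, b);       // the two orderings print different sums
    return 0;
}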

@gjmulder
Collaborator

I am moving this to a Discussion. too much chatter in an merged pr :)

edit: #406

There's issue #129 as well which I recently renamed. I'll refer to #406 and close #129.

@Green-Sky
Collaborator

There's issue #129 as well which I recently renamed. I'll refer to #406 and close #129.

ah i see ... well i think a discussion is a better fit.

@alankila

@blackhole89 It is worth running some number of those tests. It doesn't really mean much if the first batch's result is a bit worse or better; only the larger average is important. While the first result should already be the logarithmic average of 256 inferences, it still has a lot of noise. I am using the 3rd number of the output, and by that point the tests seem to behave as expected: e.g. if I know a quantization I am trying reduces RMS error, by that point the test results appear slightly better.

This is all a bit preliminary, though. I have only had so much time to play with improving the quantization, and my approach is currently quite boneheaded. I saw this link https://github.com/qwopqwop200/GPTQ-for-LLaMa with tables, and they had round-to-nearest, which I take to be similar to Q4_1 judging from the result, though I don't know the details of what batching was used, for example. It reported a mere 6.28 perplexity for round-to-nearest, and GPTQ was only 6.26 for the 7B model - barely better, with a 128-element block size. This doesn't really fit: the other papers that compare round-to-nearest against GPTQ show GPTQ improvements that come quite close to the f16 baseline.

@blackhole89
Contributor

blackhole89 commented Mar 23, 2023

@alankila I went and performed some tests on that (see this comment). Preliminarily, it seems to me that for quantizations and calculation tweaks to the same model, something like the first 30 batches is enough (though the <10 I did are indeed likely too few to detect improvements/deteriorations at the order of magnitude I see). However, comparing across models is much iffier (and in fact I get the sense that Wikitext itself is too short to really separate out quality differences at the scale we see: the last 5% of blocks produce unpredictable perplexity deviations of almost 0.1, so e.g. the "quality" ordering of two models with perplexity 5.1 and 5.3 resp. could be inverted by adding another 5% more text that's "like Wikitext").
