Use RMSNorm #173
Thanks for looking into this. (Lines 5325 to 5385 in 2d64715.)
Did I miss something in the RMSNorm implementation?
RMS norm does not need to compute the mean of the input elements. The implementation here has `v = x[i00] - mean` ... `sum2 += v*v`. It looks similar to Layer norm (but not exactly Layer norm). Maybe I missed some details. References:
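For illustration, a minimal sketch of the difference (the function names are illustrative, not the actual ggml kernels, and the learned per-channel scale the model applies afterwards is left out):

```c
#include <math.h>
#include <stddef.h>

// Layer norm: subtract the mean, then scale by the inverse standard deviation.
static void layer_norm_row(float * x, size_t n, float eps) {
    float mean = 0.0f;
    for (size_t i = 0; i < n; i++) mean += x[i];
    mean /= n;

    float sum2 = 0.0f;
    for (size_t i = 0; i < n; i++) {
        const float v = x[i] - mean;
        sum2 += v*v;
    }
    const float scale = 1.0f/sqrtf(sum2/n + eps);
    for (size_t i = 0; i < n; i++) x[i] = (x[i] - mean)*scale;
}

// RMS norm: no mean subtraction at all, scale by the inverse root mean square.
static void rms_norm_row(float * x, size_t n, float eps) {
    float sum2 = 0.0f;
    for (size_t i = 0; i < n; i++) sum2 += x[i]*x[i];
    const float scale = 1.0f/sqrtf(sum2/n + eps);
    for (size_t i = 0; i < n; i++) x[i] *= scale;
}
```

The snippet quoted above (`v = x[i00] - mean`) corresponds to the first variant; RMS norm should behave like the second.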
I think you are correct. We have to fix this.
I have some limited evidence (see what I posted in #193) that this might have led to a regression in text quality, at least at 13B. Reopening for now, because I think it would be good to gather evidence.
I tried to run 13B Q4_0 (i.e. not affected by my patch) with the RMS norm, and it also acted in a subpar way. In particular, "Allice" made a return; I've never seen it mangle the bot's name in that particular fashion with the original norm, but with RMS it's quite frequent.
Fixing the same seed, and resampling from the "What is the most important difference between London and Moscow?" point (i.e. after already importing the "Allice" mistake), without RMS I get the following far superior continuation:
The seed is 1678928825, running with -t 4 on q4_0 quantised 13B weights. Not sure how reproducible the results are across machines, though.
A few experiments with q4_1 13B and different languages, same seed and initial prompt as before. Old norm:
(My rating: German and Russian correct; Japanese starts off almost passable (though sounding like you just took it along for your trip to the repair shop rather than handing it in), but the second half is hallucinated, seemingly on the basis of reading "closed" as being about close proximity.)
New RMS code:
(German is like "I brought my computeress the repair shop", with incoherent gender/case structure; Russian is as good as before, arguably with a slightly better alternative for "repair shop"; Japanese amounts to "I went the computer to the repair bureau, but [indecipherable]".) The way in which it fails in the Japanese translation is actually quite fascinating. The first attempt seems like something that someone with minimal understanding of English and an active fantasy could make up based on recognising the word "close", and my best guess for the second one is that it wound up in "closing off your heart" concept space (the whole sentence has some vaguely religious colour to it, and the second clause seems to be trying to talk about some sort of failure of pity).
Let's revert the change in `main.cpp` (e.g. the 3 instances of `ggml_rms_norm` back to `ggml_norm`) if you think it gets obviously worse. We need some quantifiable quality test to catch this type of regression. Maybe add perplexity?
I think the worseness is usually more on the subtle end, and I haven't done enough of the reverse test (take a generation I'm unhappy with in the old norm mode and retry it in RMS), so it would be good if someone else could also contribute their thoughts. Also, I've only experimented with Q4_0/1 on the 13B model so far. I agree it would be very nice if we had some quality metrics to evaluate various tweaks. Are there benchmarks out there that would be easy to obtain and run on our output without being too slow?
If we had a Python interface (text input -> next word) for this, it would be much easier to perform quality tests. Most of the NLP toolkits and datasets are readily available in the Python world.
Wrapping it like that would be pretty easy, but we'd have to decide on the sampling parameters. Do we know what Meta used in their own evaluation?
They don't specify the exact details in the paper. One of the figures shows "training loss". We can just use a basic perplexity measurement on its training data, e.g. how well the model recites portions of Wikipedia.
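As a rough sketch, perplexity over a tokenized text is just the exponential of the average negative log-likelihood of each token given its prefix. Something like the following, where `model_logprob_of_next_token` is a hypothetical helper standing in for whatever evaluation call we end up exposing:

```c
#include <math.h>

// Hypothetical helper: returns log P(tokens[i] | tokens[0..i-1]) under the model.
extern double model_logprob_of_next_token(const int * tokens, int i);

// Perplexity over a token sequence: exp of the mean negative log-likelihood.
double perplexity(const int * tokens, int n_tokens) {
    if (n_tokens < 2) {
        return NAN; // need at least one predicted token
    }
    double nll = 0.0;
    for (int i = 1; i < n_tokens; i++) {
        nll += -model_logprob_of_next_token(tokens, i);
    }
    return exp(nll/(n_tokens - 1));
}
```

Lower is better, and it does not depend on any sampling parameters, which makes it a reasonable regression test.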
We also need to make sure the (non-quantized) FP16 gives a similar probability distribution to the PyTorch reference. That is also easy to check.
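One simple way to do that check, sketched here under the assumption that we can dump the full softmax output from both implementations for the same prompt:

```c
#include <math.h>

// Compare two probability distributions over the vocabulary.
// p = reference (e.g. PyTorch), q = ours (e.g. ggml FP16).
// Both arrays are assumed to already be softmax outputs summing to 1.
void compare_dists(const float * p, const float * q, int n_vocab,
                   double * out_kl, double * out_max_abs_diff) {
    double kl = 0.0, max_diff = 0.0;
    for (int i = 0; i < n_vocab; i++) {
        if (p[i] > 0.0f && q[i] > 0.0f) {
            kl += p[i]*log(p[i]/q[i]); // KL(p || q) contribution
        }
        const double d = fabs((double)p[i] - (double)q[i]);
        if (d > max_diff) max_diff = d;
    }
    *out_kl           = kl;
    *out_max_abs_diff = max_diff;
}
```

If the KL divergence and the max absolute difference stay tiny across a handful of prompts, the FP16 path is probably faithful to the reference.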
Could it be due to the different
FYI: I am observing this weird "swap to code gibberish" with consistently higher probability if the question (or the answer) contains a single quote ('). I have observed this both before and after RMS.
And btw, since people are now looking more in-depth into the codebase (which is awesome!), the RoPE computation is another place to look for potential mistakes.
RoPE is tricky and easy to get wrong. We need a lot of unit tests for operators. We have a reference implementation, so generating test data is not too hard.
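For reference, my reading of RoPE from the Python reference implementation, as an illustrative sketch to generate unit-test data against (not the ggml code itself): each consecutive pair of dimensions (2i, 2i+1) of a head is rotated by the angle pos * 10000^(-2i/head_dim).

```c
#include <math.h>

// Rotate one attention head of (even) dimension head_dim at position pos.
// Pair (2*i, 2*i + 1) is rotated by the angle pos * 10000^(-2*i/head_dim).
void rope_ref(float * x, int head_dim, int pos) {
    for (int i = 0; i < head_dim/2; i++) {
        const float theta = pos * powf(10000.0f, -2.0f*i/head_dim);
        const float c = cosf(theta);
        const float s = sinf(theta);
        const float x0 = x[2*i + 0];
        const float x1 = x[2*i + 1];
        x[2*i + 0] = x0*c - x1*s;
        x[2*i + 1] = x0*s + x1*c;
    }
}
```

Test vectors produced from the PyTorch reference could be compared against this (and against ggml's output) to make sure the pairing and the angles match.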
I'll try to look into rigging our output up with some benchmark once I get the time to. It might be hard to reproduce the exact conditions of their evaluation (with Wikipedia articles we wouldn't have the same weighting/mixture, and with the more standardised benchmarks I'm struggling to identify the exact prompts and parameters they used), but it's probably reasonable to assume that no fundamentally deleterious change like bad quantization or the use of a wrong norm should result in a spurious improvement of perplexity on most anything.
I think this is what is used in the Python code
@blackhole89 I will close this issue now. Please let us know if you make any progress with the benchmark, and open an issue if needed.
The original paper and the reference implementation [1] use RMS norm. However, llama.cpp uses `ggml_norm()`, which looks like Layer norm?
The differences between these may not be too obvious, because the mean is probably around 0. However, we should follow the original design.
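For reference, the two normalizations as I understand them (learned scale $\gamma$, bias omitted, small $\epsilon$ for stability):

$$\mathrm{LayerNorm}(x)_i = \gamma_i \, \frac{x_i - \mu}{\sqrt{\frac{1}{n}\sum_{j}(x_j - \mu)^2 + \epsilon}}, \qquad \mu = \frac{1}{n}\sum_{j} x_j$$

$$\mathrm{RMSNorm}(x)_i = \gamma_i \, \frac{x_i}{\sqrt{\frac{1}{n}\sum_{j} x_j^2 + \epsilon}}$$

They coincide exactly only when the mean $\mu$ is zero.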
[1] https://github.com/facebookresearch/llama/blob/main/llama/model.py