Discussion: how to apply this experiment to the llama2 70B model? #11
Comments
8x80GB GPUs would be enough for 7B models; however, I do not know if 70B would fit on the 4xA100 nodes... Pinging @jquesnelle and @conceptofmind. It all depends on how much effort we can put into writing the distributed training code (and how long we are willing to wait).
It can be done through proper parallelization. We were limited in what we could use at Stability AI due to both potential intellectual property constraints and a lack of compute. If those are adequately addressed through other sponsors, then we should be able to build a 70B model at longer context lengths (8k-128k) without any issues. I am currently communicating with LAION and Together. We should seek every possible grant available.
Any plans to implement YaRN in llama.cpp? I need to show a PoC to a potential PI for smaller models.
It could be built off of ggerganov/llama.cpp#2268, which was based on the code in this repo, but it was written before the paper came out and I haven't had a chance to read it.
YaRN is just like NTK-by-parts as in your implementation, but without the "gamma" factors (thus no more base change), plus an additional
https://github.com/jquesnelle/yarn/blob/master/scaled_rope/LlamaYaRNScaledRotaryEmbedding.py We've intentionally made YaRN as simple as possible to implement (by ablating everything that had a negligible effect after the finetune).
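For anyone porting this, here is a minimal pure-Python sketch of the frequency blending described above. It is a sketch under stated assumptions, not the code from the linked file: the function name `yarn_inv_freq`, its parameter names, and the defaults (`scale=16`, `original_max_pos=4096`, `beta_fast=32`, `beta_slow=1`) are illustrative choices. The attention scaling term `0.1 * ln(scale) + 1` is taken from the YaRN paper rather than from this thread.

```python
import math


def yarn_inv_freq(dim, base=10000.0, scale=16.0, original_max_pos=4096,
                  beta_fast=32.0, beta_slow=1.0):
    """Return (inv_freq, mscale) for a rotary embedding with head size `dim`.

    Hypothetical helper: names and defaults are assumptions for illustration.
    """

    def correction_dim(num_rotations):
        # Index of the frequency pair that completes `num_rotations` full
        # turns over the original context window.
        return (dim * math.log(original_max_pos / (num_rotations * 2 * math.pi))
                / (2 * math.log(base)))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim - 1)
    if low == high:
        high += 0.001  # avoid division by zero in the ramp

    inv_freq = []
    for i in range(dim // 2):
        extrapolated = 1.0 / base ** (2 * i / dim)   # original RoPE frequency
        interpolated = extrapolated / scale          # position-interpolated frequency
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)
        keep = 1.0 - ramp  # 1.0 keeps the original frequency, 0.0 interpolates fully
        inv_freq.append(interpolated * (1.0 - keep) + extrapolated * keep)

    # Attention scaling from the YaRN paper; only applied when extending context.
    mscale = 0.1 * math.log(scale) + 1.0 if scale > 1 else 1.0
    return inv_freq, mscale


if __name__ == "__main__":
    freqs, mscale = yarn_inv_freq(dim=128)
    print(len(freqs), freqs[0], freqs[-1], mscale)
```

High-frequency dimensions (small `i`) are left untouched, low-frequency dimensions are divided by `scale`, and the dimensions in between are blended linearly. In this sketch `mscale` would multiply the cos/sin tables, which is one common way to apply it.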
It's not easy finding a PI. Any good ideas for putting together a PowerPoint?
What is required to apply this method to the 70B-parameter version of the Llama 2 model?
On Reddit, I noticed you mentioned: "For training, these models barely fit in 128 80GB A100s using DeepSpeed and FA2"
Would the computer at OSC be enough? https://www.osc.edu/resources/technical_support/supercomputers/ascend
It has only 96 80GB A100 GPUs. Is that enough to contribute to the state of the art (SoTA)?
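A rough back-of-the-envelope check of the question above, as a sketch only. It assumes mixed-precision Adam at about 16 bytes per parameter (fp16 weights and gradients plus fp32 optimizer states, the accounting used in the ZeRO paper) and assumes those model states are sharded evenly across all GPUs. It ignores activations entirely, and activations grow with context length, so this gives a lower bound on memory rather than an answer. The function names are hypothetical.

```python
def model_state_gb(n_params, bytes_per_param=16):
    """Total memory for weights, gradients, and optimizer states, in GB.

    Assumes mixed-precision Adam: 2 (fp16 weights) + 2 (fp16 grads)
    + 12 (fp32 weights, momentum, variance) bytes per parameter.
    """
    return n_params * bytes_per_param / 1e9


def per_gpu_model_state_gb(n_params, n_gpus, bytes_per_param=16):
    """Per-GPU share of the model states, assuming they are sharded evenly."""
    return model_state_gb(n_params, bytes_per_param) / n_gpus


if __name__ == "__main__":
    n_params = 70e9   # Llama 2 70B
    n_gpus = 96       # GPU count quoted in the question
    gpu_mem_gb = 80   # per-GPU memory quoted in the question

    total = model_state_gb(n_params)
    share = per_gpu_model_state_gb(n_params, n_gpus)
    print(f"model states: {total:.0f} GB total, {share:.1f} GB per GPU")
    print(f"left per GPU for activations and buffers: {gpu_mem_gb - share:.1f} GB")
```

Under these assumptions the model states alone are not the bottleneck on 96 GPUs. Whether a long-context run fits depends on the activation memory at the target sequence length, which this sketch does not model.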