Integration of llama3.1 fixes #197
Comments
I will look into this today as well; see also:
I staged some changes on my local repo, and when the PR for optimum is finished, I will update my fork and make a PR to update the dependencies.
I created a fork and was able to get Llama 3.1 8B Instruct working. It reports that some of the token IDs are wrong, but inference appears to work correctly; see e.g. #199 (comment). However, I have not yet gotten Llama 3.1 405B FP8 working.
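A minimal sanity check, assuming the "wrong token ID" reports stem from Llama 3.1 declaring several EOS token IDs in its config (my guess; not confirmed in this thread):

```python
# Compare the tokenizer's EOS token against the model config. Llama 3.1
# Instruct lists multiple EOS token IDs in config.json, so code that
# expects a single integer may flag a mismatch even though generation
# still stops correctly.
from transformers import AutoConfig, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print("config.eos_token_id:   ", config.eos_token_id)     # a list for Llama 3.1 Instruct
print("tokenizer.eos_token_id:", tokenizer.eos_token_id)  # a single int
```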
Hi, this branch is for debugging. There was a push today (huggingface/optimum-habana#1163 (comment)); I will make a new Docker container based on it.
Thanks a lot! Can I ask when the new TGI Docker container will be ready? I may want to try that one directly.
I have fixed the dependencies and built the Docker container.
Great! Where can I find the ready Docker container? Is there a link on Docker Hub? Thanks a lot!
I just pushed it to endomorphosis/tgi_gaudi as per your request. Note: I have not yet fixed the quantization bug in huggingface/optimum-habana (a JSON configuration key mismatch), I have not yet validated whether I can quantize Llama 3.1 405B on a single node using parameter offloading, and I do not have multiple Gaudi machines to quantize Llama 405B for Habana; the Llama 3.1 405B FP8 Hugging Face repository will load its weights as BF16 right now. Please ask the OPEA team whether they can assist with the quantization effort, so that I can then try to add speculative decoding with Llama 3.1 8B as the draft model (see the sketch below).
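For illustration only, a sketch of that speculative-decoding idea using transformers' assisted generation, not the tgi-gaudi implementation; the model IDs are placeholders (70B stands in for 405B):

```python
# Speculative decoding: a small draft model proposes tokens that the
# large target model then verifies in a single forward pass.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"  # stand-in for the 405B target
draft_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"    # draft model (shares the tokenizer)

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(
    target_id, torch_dtype=torch.bfloat16, device_map="auto"
)
draft = AutoModelForCausalLM.from_pretrained(
    draft_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(target.device)
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```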
Thanks a lot for your Docker container, I will download it and have a look.
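A quick way to smoke-test the container once it is up, assuming it serves TGI on localhost:8080 (host and port are assumptions; adjust to your setup):

```python
# Send one generation request to a running TGI endpoint.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")
output = client.text_generation(
    "What is the capital of France?",
    max_new_tokens=32,
)
print(output)
```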
I haven't tested it in a while. I gave up on trying to get Llama 405B onto a single node because of the dependency problems that come with any method of quantization, but I assume that any half-precision models should work.
We tested Llama 3.1-8B and Llama 3.1-70B in BF16 and FP8; Llama 3.1-70B ran on 8 cards.
You shouldn't need 8 cards; two cards should be sufficient.
That could work; I just haven't tested it.
Quick question: when is an update to optimum-habana that includes huggingface/optimum-habana#1154 (the fix for rope_scaling in the Llama 3.1 family) planned?
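For context, the Llama 3.1 checkpoints ship a new rope_scaling schema in config.json, which is what the linked fix teaches the loader to accept. A sketch of the entry as published in the Llama 3.1 configs (values copied from the upstream config; treat as reference only):

```python
# The new "llama3" rope_scaling schema from the Llama 3.1 config.json.
# Older loaders that expect only {"type", "factor"} reject this dict,
# which is the failure huggingface/optimum-habana#1154 addresses.
rope_scaling = {
    "factor": 8.0,
    "low_freq_factor": 1.0,
    "high_freq_factor": 4.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3",
}
```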