🤖 Now run grok-1 with less than 🔲 420 G VRAM ⚡ #42

Closed
trholding opened this issue Mar 18, 2024 · 7 comments

@trholding

trholding commented Mar 18, 2024

Surprise! You can't run it on your average desktop or laptop!

Run grok-1 with less than 🔲 420 G VRAM

Run grok-1 on a Mac Studio with an M2 Ultra and 192 GB of unified RAM. See: llama.cpp grok-1 support (@ibab_ml on X).

You need a beefy machine to run grok-1

Grok-1 is a true mystical creature. Rumor has it that it lives in the cores of 8 GPUs and that the model must fit in VRAM.

This implies that you need a very beefy machine. A very, very beefy machine. So beefy...

How do you know if your machine is beefy or not?

Your machine is not beefy if it is not big: the bigger, the better; size matters! It has to sound like a jet engine when it thinks, and it should be hot to the touch most of the time.

It must also smell like burnt plastic at times. The more big iron, the heavier, and the heavier, the beefier! If you didn't pay a heavy price for it, say $100k+ plus an arm and a leg, then it is not beefy.

What are some of the working setups?

llama.cpp:
Mac
  • Mac Studio with an M2 Ultra
  • 192 GB of unified RAM
AMD
  • Threadripper 3955WX
  • 256 GB RAM
  • 0.5 tokens per second
This repo:
Intel + Nvidia
  • GPU: 8 x A100 80 GB
  • Total VRAM: 640 GB
  • CPU: 2 x Xeon 8480+
  • RAM: 1.5 TB

#168 (comment)

AMD
  • GPU: 8 x Instinct MI300X 190 GB
  • Total VRAM: 1520 GB

#130 (comment)

Other / Container / Cloud
  • GPU: 8 x A100 80 GB
  • Total VRAM: 640 GB
  • K8s cluster

#6 (comment)

What can you do about it?

Try the llama.cpp grok-1 support (ggerganov/llama.cpp#6204; see the refs below).

What are the other options?

  • Rent a GPU cloud instance with sufficient resources
  • Subscribe to Grok on X (twitter.com)
  • Study the blade, save up money
  • Get someone to cosplay as grok

What is the Answer to the Ultimate Question of Life, the Universe, and Everything?

#42

Ref:
#168 (comment)
#130 (comment)
#130 (comment)
#125 (comment)
#6 (comment)
ggerganov/llama.cpp#6204 (comment)
ggerganov/llama.cpp#6204 (comment)
ggerganov/llama.cpp#6204 (comment)

See: Discussion
Note: This issue has been edited entirely to elevate issue 42 to serve a much better cause. @xSetech, wouldn't you be tempted to pin this?
Edit: Corrected llama.cpp inaccuracies

@yarodevuci

@trholding and will it work with one GPU?

@trholding
Author

The model must fit in GPU memory. If a model is too large, as most cutting-edge models are, it is split into parts and the work is distributed across multiple GPUs. So a large model like this needs multiple GPUs.
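
For illustration only (not something this repo provides): libraries such as Hugging Face transformers with accelerate can shard a checkpoint across every visible GPU automatically. The repo id and per-device memory cap below are placeholders, and grok-1 itself would need a transformers-compatible checkpoint for a sketch like this to apply:

```python
# Hypothetical sketch: shard a large checkpoint across several GPUs with
# Hugging Face transformers + accelerate. The repo id and memory caps are
# placeholders, not a tested grok-1 recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/grok-1-hf"  # placeholder, not an official checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half-precision weights halve the footprint vs fp32
    device_map="auto",           # let accelerate split layers across all visible GPUs
    max_memory={i: "75GiB" for i in range(torch.cuda.device_count())},
)

prompt = "The answer to life, the universe and everything is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```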

@gardner

gardner commented Mar 18, 2024

A 4-bit quantized model would likely be at least 96 GB, so it might fit on four 24 GB cards.

@akumaburn

akumaburn commented Mar 18, 2024

The model must fit in GPU memory. If a model is too large, as most cutting-edge models are, it is split into parts and the work is distributed across multiple GPUs. So a large model like this needs multiple GPUs.

They can technically overflow into system RAM if running in OpenCL/CLBlast mode (slower, but it works).
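
As a rough sketch of that partial-offload idea, the llama-cpp-python bindings for llama.cpp expose an n_gpu_layers knob; the GGUF path and layer count below are placeholders, and whether the spillover path goes through CLBlast, CUDA, or plain CPU depends on how the library was built:

```python
# Hypothetical sketch: offload only part of the model to VRAM and leave the
# rest in system RAM, via the llama-cpp-python bindings for llama.cpp.
# The GGUF path and layer count are placeholders, not a tested grok-1 setup.
from llama_cpp import Llama

llm = Llama(
    model_path="./grok-1-q4.gguf",  # placeholder path to a quantized GGUF file
    n_gpu_layers=20,  # layers kept in VRAM; the remaining layers run from system RAM
    n_ctx=2048,       # modest context length keeps the KV cache small
)

out = llm("What do you need to run grok-1?", max_tokens=64)
print(out["choices"][0]["text"])
```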

@trholding changed the title from "GGML and llama.cpp support" to "🤖 Min HW Specs: 🔲 420 G+ VRAM • 🧠 8 GPUs • 💾 1337 G~ SSD ⚡" Mar 19, 2024
@AdaptiveStep

I would rather have Elon give us GPUs.

@surak

surak commented Mar 22, 2024

@trholding and will it work with one GPU?

No way. Not this model, even highly quantized. Unless it's a GH200 data center edition, which does have 96 GB of VRAM integrated with 480 GB of CPU RAM. Then MAYBE.

@trholding changed the title from "🤖 Min HW Specs: 🔲 420 G+ VRAM • 🧠 8 GPUs • 💾 1337 G~ SSD ⚡" to "🤖 Now run grok-1 with less than 🔲 420 G+ VRAM ⚡" Mar 23, 2024
@trholding changed the title from "🤖 Now run grok-1 with less than 🔲 420 G+ VRAM ⚡" to "🤖 Now run grok-1 with less than 🔲 420 G VRAM ⚡" Mar 23, 2024
@davidearlyoung

A 4-bit quantized model would likely be at least 96 GB, so it might fit on four 24 GB cards.

Someone may have figured it out:
https://huggingface.co/eastwind/grok-1-hf-4bit/tree/main

Looks to be about 90.2 GB on disk when adding up the safetensors shards from the mentioned Hugging Face eastwind repo. Not sure what would be needed to load it for inference; it will likely need more for overhead. I can't speak to memory usage or quality, since this is still beyond my hardware's capacity.
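
For anyone who wants to repeat that back-of-the-envelope check, a tiny sketch like the one below just sums the downloaded .safetensors shard sizes (the directory path is a placeholder):

```python
# Hypothetical sketch: sum the sizes of downloaded .safetensors shards to
# estimate a checkpoint's on-disk footprint. The directory path is a placeholder.
from pathlib import Path

shard_dir = Path("./grok-1-hf-4bit")  # placeholder: local clone of the HF repo
shards = sorted(shard_dir.glob("*.safetensors"))

total_bytes = sum(p.stat().st_size for p in shards)
print(f"{len(shards)} shards, {total_bytes / 1e9:.1f} GB on disk")
```

Whatever the on-disk number comes out to, loading for inference needs more on top of it for activations, the KV cache, and framework overhead.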
