Error when trying to make inference with a local model #896
Comments
It does work. I can't debug your environment, but something is wrong between where the local model lives and the path you are passing to --model-id.
Yes, all right. But when I launch

docker run --shm-size 1g -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:1.0.1 --model-id /data/myllama/ --disable-custom-kernels

I have this error:

/opt/conda/lib/python3.9/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 81, in serve
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 184, in serve
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 136, in serve_inner
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 117, in get_model
  File "/opt/conda/lib/python3.9/site-packages/transformers/configuration_utils.py", line 617, in get_config_dict
  File "/opt/conda/lib/python3.9/site-packages/transformers/configuration_utils.py", line 672, in _get_config_dict
  File "/opt/conda/lib/python3.9/site-packages/transformers/utils/hub.py", line 388, in cached_file
OSError: /data/myllama/ does not appear to have a file named config.json. Checkout 'https://huggingface.co//data/myllama//None' for available files.

The output model trained with autotrain contains these files:

user_dev@myllm:~/data/myllama$ ls

There isn't any config.json file.
As you can see, there is no config.json in that output.
There is an adapter_config.json though. Looks like OscarAGracia is trying to run a LoRA model, which should be supported by TGI since #762. LoRA training usually results in just a small set of adapter files (adapter_config.json plus the adapter weights), not a full model with its own config.json.
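For readers hitting the same error, here is a minimal sketch (my own, not an official TGI recipe) of the usual workaround: merge the LoRA adapter into its base model with peft and save a full model directory that --model-id can point at. The base model id and the merged output path below are assumptions.

```python
# Hedged workaround sketch: merge a LoRA/PEFT adapter into its base model so
# the resulting directory has config.json plus full weights for TGI to serve.
# The base model id and the merged output path are assumptions, not from this issue.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-hf"   # base model the adapter was trained on
adapter_dir = "/data/myllama"          # autotrain output: adapter_config.json + adapter weights
merged_dir = "/data/myllama-merged"    # hypothetical directory to pass as --model-id

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
model = PeftModel.from_pretrained(base, adapter_dir)
model = model.merge_and_unload()       # fold the LoRA deltas into the base weights

model.save_pretrained(merged_dir)      # writes config.json and weight shards
AutoTokenizer.from_pretrained(base_id).save_pretrained(merged_dir)
```

After a merge like this, pointing --model-id at the merged directory should avoid the missing config.json error, since the folder then looks like a regular Transformers model.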
Scratch that, it works: https://huggingface.co/mariiaponom/llama_7b_class_lora
@Narsil but if adapter_config.json contains a model name in base_model_name_or_path (e.g. meta-ai/llama-2-hf), then according to #762 it should download that model as the next step? Needing the (possibly huge) base model in the same repo leads to all kinds of production issues. No one will cache my copy of Llama 70B if everyone else in the cluster / cloud will download meta-llama/llama-2-hf. Storing merged copies is expensive, both in terms of boot times and cost: add $30-100 for every copy per month, depending on where you host. TGI is well suited for on-demand / cloud-burst style workloads, and people should be able to use it properly without being tied into a specific ecosystem, or into a dogma of single-model, long-running hosted cloud instances that are unaffordable for most TGI users out there.

Update:
I'll be working on supporting a more production-friendly version of TGI (loading models from S3, hot-swapping adapters, back to truly open source) via https://github.com/ohmytofu-ai/tgi-angryface
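On the base_model_name_or_path concern above: one low-tech workaround (my own sketch, with an assumed local path) is to rewrite that field in adapter_config.json so it points at a locally mounted copy of the base model; whether the PEFT loading path added in #762 resolves a local path there without re-downloading from the Hub is something to verify against your TGI version.

```python
# Hedged sketch: point the adapter at a local copy of the base model instead of
# a Hub id, so the (possibly huge) base is read from disk rather than downloaded.
# /data/llama-2-7b-hf is an assumed local directory, not a path from this issue.
import json
from pathlib import Path

adapter_cfg = Path("/data/myllama/adapter_config.json")
cfg = json.loads(adapter_cfg.read_text())
cfg["base_model_name_or_path"] = "/data/llama-2-7b-hf"  # local base model checkout
adapter_cfg.write_text(json.dumps(cfg, indent=2))
```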
Not really imo, and it certainly wasn't designed with that in mind. Supporting LoRA as a first-class citizen is something we have in mind: #907 (comment). @chris-aeviator I'm sorry if you feel angry or let down, but you're more than welcome to fork and add support if we cannot provide it. This is what open source is for!
Really? Something seems to be wrong there. EFS or S3 shouldn't cost nearly as much.
Re: S3. I don't want to mess around here or get emotional, though. What's holding adapters back? For single-GPU deployment with bitsandbytes 4-bit I can (as a POC) successfully load a model: https://github.com/ohmytofu-ai/tgi-angryface/blob/8028ebb93c58fe58673c80a16c482b3e1cad6807/server/text_generation_server/models/causal_lm.py#L489 The implementation took very little time (there might still be something wrong with it; I haven't tested thoroughly yet).
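For context, the general shape of such a POC (this is an illustrative sketch with plain transformers/peft, not the code from the linked fork): load the base model quantized to 4-bit with bitsandbytes and attach the LoRA adapter at runtime, without writing a fused copy to disk. The model id and adapter path are placeholders.

```python
# Illustrative sketch only, not the linked fork's implementation: 4-bit base
# model via bitsandbytes with the LoRA adapter attached at runtime (no merge).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",     # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "/data/myllama")  # adapter kept separate
model.eval()
```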
But... you shouldn't be paying egress costs for moving things around (unless you're running on a different cloud), so there's no bandwidth cost if you're using S3 from the same region in AWS. EFS should work too (not cheap either), so you can keep models there all the time.

Currently what happens is that the model loads the base, adds the adapter on top, saves that back to disk inside the container, and loads from it. The issue with the code you linked is that it only works for that one model implementation.

With the current method, everything comes for free (you just have to have a bit of disk on the host node to save the fused model). So there's a little bit of boot latency, but basically zero engineering maintenance cost (and you get all new features pretty much for free, without us having to think about peft).

As I said, LoRAs ARE interesting, but when we do them we're going to try to exploit them to the fullest, which means actually running multiple LoRAs with the same backbone on a single server. The engineering/maintenance cost will be high, but so will the benefit: ideally you could keep a single server running and serve pretty much all LoRAs on it.

I'm not sure everyone realizes it, but we're still a pretty small team (TGI is basically one person). There are plenty of things to do, and we're permanently thinking about which feature has the biggest impact; we just cannot deliver everything. Happy to continue the discussion about your use case though. We both agree ...
Appreciate you taking the time to draft a technical solution. However, the TL;DR is that 99% of people cannot do what you propose. As of August 2023, there is not a single cloud provider that will give you more than a handful of GPUs.
That's why more dynamic deployment schemes are essential for "the GPU poor" right now.
System Info
text-generation-inference
Information
Tasks
Reproduction
I have a local llama-2-7b-chat model trained with autotrain. My new model is stored in the /data/my-llama folder:
autotrain llm --train --project_name my-llama --model meta-llama/Llama-2-7b-hf --data_path . --use_peft --use_int4 --learning_rate 2e-4 --train_batch_size 2 --num_train_epochs 3 --trainer sft
Autotrain finished OK.
I deploy that model with text-generation-inference:
docker run --shm-size 1g -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:1.0.1 --model-id /data/ --disable-custom-kernels
but I have this error
ERROR download: text_generation_launcher: Download encountered an error: /opt/conda/lib/python3.9/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 147, in download_weights
local_pt_files = utils.weight_files(model_id, revision, ".bin")
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/hub.py", line 90, in weight_files
raise FileNotFoundError(
FileNotFoundError: No local weights found in /data/ with extension .bin
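A quick way to see what is going on here (a sketch of my own, not part of TGI): list what the mounted directory actually contains. The launcher's download step expects full weight files (.safetensors or .bin, plus config.json) directly under the path given to --model-id, while an autotrain PEFT run typically leaves only adapter files, and in this command the model also sits in a subfolder of /data/.

```python
# Sanity-check sketch: inspect the directory you intend to pass as --model-id.
from pathlib import Path

model_dir = Path("data/my-llama")  # the folder autotrain wrote to (host side)
print(sorted(p.name for p in model_dir.iterdir()))
print("has config.json:", (model_dir / "config.json").exists())
print("has full weights:",
      any(model_dir.glob("*.bin")) or any(model_dir.glob("*.safetensors")))
print("looks like a PEFT adapter:", (model_dir / "adapter_config.json").exists())
```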
Expected behavior
No error