
Error when trying to make inference with a local model #896

Closed · OscarAGracia opened this issue Aug 21, 2023 · 10 comments

OscarAGracia commented Aug 21, 2023

System Info

text-generation-inference

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

I have a local llama-2-7b-chat model trained with autotrain. My new model is stored in the /data/my-llama folder.

autotrain llm --train --project_name my-llama --model meta-llama/Llama-2-7b-hf --data_path . --use_peft --use_int4 --learning_rate 2e-4 --train_batch_size 2 --num_train_epochs 3 --trainer sft

Autotrain finished OK.

I deploy that model with text-generation-inference:
docker run --shm-size 1g -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:1.0.1 --model-id /data/ --disable-custom-kernels

but I get this error:

ERROR download: text_generation_launcher: Download encountered an error: /opt/conda/lib/python3.9/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
Traceback (most recent call last):

File "/opt/conda/bin/text-generation-server", line 8, in
sys.exit(app())

File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 147, in download_weights
local_pt_files = utils.weight_files(model_id, revision, ".bin")

File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/hub.py", line 90, in weight_files
raise FileNotFoundError(

FileNotFoundError: No local weights found in /data/ with extension .bin
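
For context on why this error appears: the download step only looks for weight files sitting directly inside the path passed as --model-id. A simplified sketch of that check (an illustration only, not the actual hub.py code):

from pathlib import Path

def local_weight_files(model_dir: str, extension: str = ".bin"):
    # Collect weight files that sit directly inside the model directory.
    files = list(Path(model_dir).glob(f"*{extension}"))
    if not files:
        raise FileNotFoundError(
            f"No local weights found in {model_dir} with extension {extension}"
        )
    return files

# Passing the mount root instead of the model sub-folder reproduces the error:
#   local_weight_files("/data/")          -> FileNotFoundError
#   local_weight_files("/data/my-llama")  -> weight files, if any are present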

Expected behavior

No error

Narsil (Collaborator) commented Aug 22, 2023

It does work.

I can't debug your environment, but something is wrong between where the local model lives and the -v $PWD/data:/data argument.

OscarAGracia (Author) commented Aug 22, 2023

Yes, all right. But when I launch

docker run --shm-size 1g -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:1.0.1 --model-id /data/myllama/ --disable-custom-kernels

I get this error:

/opt/conda/lib/python3.9/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
Traceback (most recent call last):

File "/opt/conda/bin/text-generation-server", line 8, in
sys.exit(app())

File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 81, in serve
server.serve(

File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 184, in serve
asyncio.run(

File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)

File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()

File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 136, in serve_inner
model = get_model(

File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/init.py", line 117, in get_model
config_dict, _ = PretrainedConfig.get_config_dict(

File "/opt/conda/lib/python3.9/site-packages/transformers/configuration_utils.py", line 617, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)

File "/opt/conda/lib/python3.9/site-packages/transformers/configuration_utils.py", line 672, in _get_config_dict
resolved_config_file = cached_file(

File "/opt/conda/lib/python3.9/site-packages/transformers/utils/hub.py", line 388, in cached_file
raise EnvironmentError(

OSError: /data/myllama/ does not appear to have a file named config.json. Checkout 'https://huggingface.co//data/myllama//None' for available files.
rank=0
2023-08-22T19:14:35.330308Z ERROR text_generation_launcher: Shard 0 failed to start
2023-08-22T19:14:35.330333Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

The output model trained with autotrain contains these files:

user_dev@myllm:~/data/myllama$ ls
README.md adapter_model.bin checkpoint-6 special_tokens_map.json tokenizer.model training_args.bin
adapter_config.json adapter_model.safetensors runs tokenizer.json tokenizer_config.json training_params.json

There isn't any config.json file.

Narsil (Collaborator) commented Aug 23, 2023

As you can see, there is no config.json.

Narsil closed this as completed Aug 23, 2023
loganlebanoff commented

There is an adapter_config.json though. It looks like OscarAGracia is trying to run a LoRA model, which should be supported by TGI as of #762. LoRA training usually produces just an adapter_model.bin and an adapter_config.json. Does this mean we have to manually add the base model's config.json to this folder as well?
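
For anyone hitting the same wall, a minimal workaround sketch: read base_model_name_or_path from adapter_config.json and drop the base model's config.json next to the adapter. This assumes the folder layout listed above and network access to the Hub, and that copying the base config is enough for this setup; it is not something the TGI docs prescribe.

import json
import shutil
from huggingface_hub import hf_hub_download

adapter_dir = "/data/myllama"  # the autotrain output folder listed above

# The adapter config records which base model it was trained on.
with open(f"{adapter_dir}/adapter_config.json") as f:
    base_id = json.load(f)["base_model_name_or_path"]

# Fetch the base model's config.json and place it next to the adapter so the
# server can resolve the model architecture from the local folder.
config_path = hf_hub_download(repo_id=base_id, filename="config.json")
shutil.copy(config_path, f"{adapter_dir}/config.json")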

Narsil (Collaborator) commented Aug 24, 2023

Yes, TGI will not work across repos; you need both the LoRA adapter and the original model in the same location, or a pre-fused model.

Scratch that, it works: https://huggingface.co/mariiaponom/llama_7b_class_lora
It does create copies though.

chris-aeviator commented Aug 26, 2023

@Narsil but if adapter_config.json contains a model name in base_model_name_or_path (e.g. meta-ai/llama-2-hf), then according to #762 it should download that model as the next step?

Needing the (possibly huge) base model in the same repo leads to all kinds of production issues. No one will cache my copy of Llama 70B when everyone else in the cluster/cloud downloads meta-llama/llama-2-hf. Storing merged copies is expensive, both in boot times and in cost: add $30-100 per copy per month, depending on where you host.

TGI is well suited for on-demand / cloud-burst workloads, and people should be able to use it properly without being tied to a specific ecosystem, or to a dogma of single-model, long-running hosted cloud instances that are unaffordable for most TGI users out there.

Update

Yes, TGI will not work across repos; you need both the LoRA adapter and the original model in the same location, or a pre-fused model.

I'll be working on supporting a more production-friendly version of TGI (loading models from S3, hot-swapping adapters, back to truly OS) via https://github.com/ohmytofu-ai/tgi-angryface

Narsil (Collaborator) commented Aug 28, 2023

on-demand/ cloud burst

Not really, imo, and it certainly wasn't designed with that in mind.
It's designed for running large LLMs on long-running servers and being as efficient as possible on those.
Everything else is a bonus.

Supporting LoRA as a first-class citizen is something we have in mind: #907 (comment)
However, supporting non-merged models right now is a lot of engineering work that isn't worth the effort in our opinion. (Supporting MANY LoRAs on the same deployment is much more attractive, for maybe a little more engineering.)
Cost-wise, it would be much more beneficial than the extra storage cost we're talking about here.

@chris-aeviator I'm sorry if you feel angry or let down, but you're more than welcome to fork and add support if we cannot provide it. That is what open source is for!

add $30-100 per copy per month,

Really? Something seems wrong there. EFS or S3 shouldn't cost nearly that much.

chris-aeviator commented

Re: S3
A very conservative 2 TB of egress loads my model about 12 times per month to a server location of my choice.
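
For a rough sense of scale, a back-of-the-envelope calculation (the checkpoint size and the per-GB egress price below are my own assumptions, not quoted AWS figures):

# Back-of-the-envelope egress cost; all figures are assumptions.
checkpoint_gb = 140          # ~70B params * 2 bytes (fp16)
loads_per_month = 12
egress_usd_per_gb = 0.09     # typical S3 internet-egress tier

monthly_gb = checkpoint_gb * loads_per_month      # ~1,680 GB, close to 2 TB
monthly_cost = monthly_gb * egress_usd_per_gb     # ~$150 per month
print(f"{monthly_gb} GB/month -> ${monthly_cost:.0f}/month")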

Don't want to mess around here or get emotional though.

What's holding adapters back? For a single-GPU deployment with bitsandbytes 4-bit I can (as a POC) successfully load a model, see https://github.com/ohmytofu-ai/tgi-angryface/blob/8028ebb93c58fe58673c80a16c482b3e1cad6807/server/text_generation_server/models/causal_lm.py#L489, and the implementation took very little time (something might still be wrong with it; I haven't tested thoroughly yet).
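
Roughly the kind of thing that POC does, as my own sketch with transformers, peft, and bitsandbytes (not the code from that link; the model id and adapter path are placeholders):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-hf"   # placeholder base model
adapter_dir = "/data/myllama"          # placeholder adapter folder

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
)
# Attach the LoRA adapter on top of the quantized base model.
model = PeftModel.from_pretrained(base, adapter_dir)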

Narsil (Collaborator) commented Aug 28, 2023

But... you shouldn't be moving data around and paying egress costs (unless you're running on a different cloud). There's no bandwidth cost if you're using S3 from the same region in AWS.

EFS should work (not cheap either), so you can keep models there all the time.
For a little more engineering work (not that much, but more maintenance) you can get NFS to work like EFS, and that would be quite cheap.

Currently what happens is that the model code loads the base, adds the adapter on top, saves the fused result back inside the Docker container, and loads from that.
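
Something along these lines happens at startup; a simplified sketch using peft's merge_and_unload, not the literal TGI implementation, with illustrative paths and model id:

from transformers import AutoModelForCausalLM
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-hf"   # read from adapter_config.json in practice
adapter_dir = "/data/myllama"          # illustrative adapter path
fused_dir = "/data/myllama-merged"     # illustrative output path

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
model = PeftModel.from_pretrained(base, adapter_dir)

# Fold the LoRA weights into the base weights and persist the result; the
# server then starts from the fused checkpoint like any regular model.
fused = model.merge_and_unload()
fused.save_pretrained(fused_dir, safe_serialization=True)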

The issue with the code you linked is that it works only for CausalLM models, not FlashCausalLM or Seq2SeqLM ones.
And this is where the engineering debt starts, because you also don't get TP sharding or GPTQ/bitsandbytes support.
The code you're calling is the "graceful degradation" path, not the main one.

With the current method, everything comes for free (you just need a bit of disk on the host node to save the fused model).
I just realized there was indeed an issue when deploying from Docker with the current code though: #935. It didn't trigger locally because, when the code was written, it created the fused version once and things worked afterwards, whereas in Docker everything is recreated at every boot (everything from the Hub is still cached in the /data folder).

So there's a little boot latency, but basically zero engineering maintenance cost (and you get all new features pretty much for free, without us having to think about peft).

As I said, LoRAs ARE interesting, but when we do support them we're going to try to exploit them to the fullest, which means actually running multiple LoRAs with the same backbone on a single server. The engineering/maintenance cost will be high, but so will the benefit. Ideally you could keep a single server running and serve pretty much all LoRAs on it.
It's still an idea for now, but we have brainstormed it a bit; it's just a matter of making it priority #1.
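
To make the multi-LoRA idea concrete, peft can already hold several adapters on one backbone and switch between them by name. A sketch only: the adapter names and paths are made up, and this is not something TGI ships today.

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Register several adapters on the same backbone ...
model = PeftModel.from_pretrained(base, "/adapters/customer-a", adapter_name="customer-a")
model.load_adapter("/adapters/customer-b", adapter_name="customer-b")

# ... then pick one per request instead of running one server per fine-tune.
model.set_adapter("customer-a")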

I'm not sure everyone realizes it, but we're still a pretty small team (TGI is basically one person). There's plenty to do; we're constantly weighing which feature has the biggest impact, and we just cannot deliver everything.

Happy to continue the discussion about your use case though. We both agree $30-100 is not a good cost for maintaining LoRAs.

chris-aeviator commented

Appreciate the time you took drafting a technical solution. However, the TL;DR is that 99% of people cannot do what you propose. As of August 2023, no single cloud provider will give you more than a handful of GPUs.

  • Lambda: no GPUs available
  • Runpod: some GPUs
  • AWS: a single A10G with the default quota. Each request for a quota increase, even if declined, costs around €13 in "business support fees"

That's why more dynamic deployment schemes are essential for "the GPU poor" right now.
