
[Bug] gorilla-openfunctions-v1-q4f16_1-MLC crashes on JIT lib build on cuda12.2 #2113

Closed
Sing-Li opened this issue Apr 10, 2024 · 7 comments
Labels: bug (Confirmed bugs)

@Sing-Li (Contributor) commented Apr 10, 2024

🐛 Bug

Trying to serve gorilla-openfunctions-v1 crashes during the initial JIT library build. The same happens with openfunctions-v2, in both the q4f16 and q4f32 quantizations.

To Reproduce

Steps to reproduce the behavior:

  1. Install the CUDA 12.2 nightly packages.
  2. Run mlc_llm serve HF://mlc-ai/gorilla-openfunctions-v1-q4f16_1-MLC
  3. The command crashes with the log attached below.
Found device: cuda:0
[2024-04-10 00:12:41] INFO auto_device.py:85: Not found device: rocm:0
[2024-04-10 00:12:42] INFO auto_device.py:85: Not found device: metal:0
[2024-04-10 00:12:43] INFO auto_device.py:85: Not found device: vulkan:0
[2024-04-10 00:12:44] INFO auto_device.py:85: Not found device: opencl:0
[2024-04-10 00:12:44] INFO auto_device.py:33: Using device: cuda:0
[2024-04-10 00:12:44] INFO chat_module.py:362: Downloading model from HuggingFace: HF://mlc-ai/gorilla-openfunctions-v1-q4f16_1-MLC
[2024-04-10 00:12:44] INFO download.py:40: [Git] Cloning https://huggingface.co/mlc-ai/gorilla-openfunctions-v1-q4f16_1-MLC.git to /tmp/tmpgtd319_l/tmp
[2024-04-10 00:12:45] INFO download.py:76: [Git LFS] Downloading 1 files with Git LFS: ['tokenizer.model']
[2024-04-10 00:12:45] INFO download.py:79: [Git LFS] Downloading tokenizer.model
100%|██████████| 1/1 [00:00<00:00,  1.83it/s]
[2024-04-10 00:12:47] INFO download.py:152: Downloaded https://huggingface.co/mlc-ai/gorilla-openfunctions-v1-q4f16_1-MLC/resolve/main/params_shard_1.bin to /tmp/tmpgtd319_l/tmp/params_shard_1.bin
...

[2024-04-10 00:13:29] INFO download.py:152: Downloaded https://huggingface.co/mlc-ai/gorilla-openfunctions-v1-q4f16_1-MLC/resolve/main/params_shard_112.bin to /tmp/tmpgtd319_l/tmp/params_shard_112.bin
100%|██████████| 115/115 [00:43<00:00,  2.62it/s]
[2024-04-10 00:13:29] INFO download.py:153: Moving /tmp/tmpgtd319_l/tmp to /root/.cache/mlc_llm/model_weights/mlc-ai/gorilla-openfunctions-v1-q4f16_1-MLC
Traceback (most recent call last):
  File "/usr/local/bin/mlc_llm", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/__main__.py", line 41, in main
    cli.main(sys.argv[2:])
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/cli/serve.py", line 75, in main
    serve(
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/interface/serve.py", line 42, in serve
    engine = async_engine.AsyncThreadedEngine(
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/serve/async_engine.py", line 274, in __init__
    ) = _process_model_args(models)
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/serve/engine.py", line 125, in _process_model_args
    model_args: List[Any] = sum(
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/serve/engine.py", line 126, in <genexpr>
    (_convert_model_info(model) for model in models),
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/serve/engine.py", line 101, in _convert_model_info
    assert isinstance(chat_config.conv_template, Conversation)
AssertionError

Expected behavior

Serving should work, as it does with Llama, Mistral, and Gemma.

Environment

  • Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): CUDA 12.2
  • Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 20.04 LTS
  • Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): 3060 12 GB
  • How you installed MLC-LLM (conda, source): nightly
  • How you installed TVM-Unity (pip, source): nightly prebuilt
  • Python version (e.g. 3.10): 3.11
  • GPU driver version (if applicable):
  • CUDA/cuDNN version (if applicable):
  • TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
  • Any other relevant information:

Additional context

@MasterJH5574 (Member)

Thank you @Sing-Li for reporting! This happens because the mlc-chat-config.json in the prebuilt weight repo was not updated.

I just updated the conv_template field (https://huggingface.co/mlc-ai/gorilla-openfunctions-v1-q4f16_1-MLC/commit/e83c4a2bbb4735c1ccde096dae0df635dd172310), and I think it should be good now. Would you mind trying again?
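For context, the failing assertion (assert isinstance(chat_config.conv_template, Conversation)) means serve now expects conv_template in mlc-chat-config.json to be a full conversation object rather than a bare template name. A minimal sketch of the difference, with illustrative field values only (not the exact contents of the updated repo):

```jsonc
// Older config style: a bare template name (this is what trips the isinstance check)
{
  "conv_template": "gorilla"
}

// Newer config style: an inline conversation object (keys shown here are illustrative)
{
  "conv_template": {
    "name": "gorilla",
    "system_template": "{system_message}",
    "system_message": "You are an AI programming assistant.",
    "roles": { "user": "USER", "assistant": "ASSISTANT" },
    "seps": ["\n"],
    "stop_str": ["</s>"]
  }
}
```

If a locally cached copy of the weights predates the config update, deleting the cached directory under ~/.cache/mlc_llm/model_weights and re-downloading should pick up the new config.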

@Sing-Li (Contributor, Author) commented Apr 10, 2024

Thank you @MasterJH5574. It works fine now. Closing the issue.

@Sing-Li Sing-Li closed this as completed Apr 10, 2024
@Sing-Li Sing-Li reopened this Apr 10, 2024
@Sing-Li (Contributor, Author) commented Apr 10, 2024

Sorry, @MasterJH5574, is it possible to update the configs for the other two gorilla openfunctions weights as well? 🙏

https://huggingface.co/mlc-ai/gorilla-openfunctions-v2-q4f32_1-MLC

https://huggingface.co/mlc-ai/gorilla-openfunctions-v2-q4f16_1-MLC

@MasterJH5574 (Member)

Hey @Sing-Li, sorry for the late reply. I just updated these two repositories. If I remember correctly, there may still be an output formatting issue with function calling for gorilla v2. Could you give it a try at your convenience and see how it goes?

@Sing-Li (Contributor, Author) commented Apr 16, 2024

Thanks @MasterJH5574

Test results:
gorilla-openfunctions-v2-q4f32_1

  • chat - seems to work
  • serve - I only have 12 GB of VRAM, and serve ran out of memory

gorilla-openfunctions-v2-q4f16_1

  • chat - crashes with the following dump:
[2024-04-16 04:10:14] INFO estimate_memory_usage.py:57: [Memory usage] Function `sampler_take_probs`: 0.00 MB
[2024-04-16 04:10:14] INFO estimate_memory_usage.py:57: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-04-16 04:10:14] INFO pipeline.py:50: Compiling external modules
[2024-04-16 04:10:14] INFO pipeline.py:50: Compilation complete! Exporting to disk
[2024-04-16 04:10:31] INFO model_metadata.py:96: Total memory usage: 4169.98 MB (Parameters: 3707.35 MB. KVCache: 0.00 MB. Temporary buffer: 462.62 MB)
[2024-04-16 04:10:31] INFO model_metadata.py:105: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
[2024-04-16 04:10:31] INFO compile.py:198: Generated: /tmp/tmphmrwlwhl/lib.so
[2024-04-16 04:10:31] INFO jit.py:98: Using compiled model lib: /root/.cache/mlc_llm/model_lib/5c413127c1217b4fc4779c7be427b220.so
[2024-04-16 04:10:32] INFO model_metadata.py:96: Total memory usage: 4169.98 MB (Parameters: 3707.35 MB. KVCache: 0.00 MB. Temporary buffer: 462.62 MB)
[2024-04-16 04:10:32] INFO model_metadata.py:105: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
You can use the following special commands:
 /help               print the special commands
 /exit               quit the cli
 /stats              print out the latest stats (token/sec)
 /reset              restart a fresh chat
 /set [overrides]    override settings in the generation config. For example,
                     `/set temperature=0.5;max_gen_len=100;stop=end,stop`
                     Note: Separate stop words in the `stop` option with commas (,).
 Multi-line input: Use escape+enter to start a new line.

Traceback (most recent call last):
 File "/usr/local/bin/mlc_llm", line 8, in <module>
   sys.exit(main())
 File "/usr/local/lib/python3.10/dist-packages/mlc_llm/__main__.py", line 37, in main
   cli.main(sys.argv[2:])
 File "/usr/local/lib/python3.10/dist-packages/mlc_llm/cli/chat.py", line 41, in main
   chat(
 File "/usr/local/lib/python3.10/dist-packages/mlc_llm/interface/chat.py", line 135, in chat
   cm._process_system_prompts()  # pylint: disable=protected-access
 File "/usr/local/lib/python3.10/dist-packages/mlc_llm/chat_module.py", line 1228, in _process_system_prompts
   self._process_system_prompts_func()
 File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
 File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
 File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
 File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
 File "/usr/local/lib/python3.10/dist-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
   raise py_err
tvm._ffi.base.TVMError: TVMError: Unsupported layout: 0

Running serve also crashes with the same error when a REST completion request comes in:

[2024-04-16 04:11:59] INFO auto_device.py:76: Found device: cuda:0
[2024-04-16 04:12:00] INFO auto_device.py:85: Not found device: rocm:0
[2024-04-16 04:12:01] INFO auto_device.py:85: Not found device: metal:0
[2024-04-16 04:12:02] INFO auto_device.py:85: Not found device: vulkan:0
[2024-04-16 04:12:03] INFO auto_device.py:85: Not found device: opencl:0
[2024-04-16 04:12:03] INFO auto_device.py:33: Using device: cuda:0
[2024-04-16 04:12:03] INFO chat_module.py:362: Downloading model from HuggingFace: HF://mlc-ai/gorilla-openfunctions-v2-q4f16_1-MLC
[2024-04-16 04:12:03] INFO download.py:131: Weights already downloaded: /root/.cache/mlc_llm/model_weights/mlc-ai/gorilla-openfunctions-v2-q4f16_1-MLC
[2024-04-16 04:12:03] INFO jit.py:35: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-04-16 04:12:03] INFO jit.py:117: Using cached model lib: /root/.cache/mlc_llm/model_lib/5c413127c1217b4fc4779c7be427b220.so
[2024-04-16 04:12:05] INFO engine_base.py:241: Estimated KVCacheConfig "max_total_sequence_length": 13445.
[2024-04-16 04:12:05] INFO engine_base.py:246: Estimated total single GPU memory usage: 10839.99 MB (Parameters: 3707.35 MB. KVCache: 6479.40 MB. Temporary buffer: 653.24 MB)
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Exception in thread Thread-1 (_background_loop):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/mlc_llm/serve/engine_base.py", line 602, in _background_loop
    self._ffi["run_background_loop"]()
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/usr/local/lib/python3.10/dist-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: TVMError: Unsupported layout: 0
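
For reference, the crash above is triggered by an ordinary OpenAI-style chat completion request against the local server; the payload below only illustrates the shape of such a request, not the exact one used:

```shell
curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "HF://mlc-ai/gorilla-openfunctions-v2-q4f16_1-MLC",
        "messages": [
          {"role": "user", "content": "What is the weather like in Boston?"}
        ]
      }'
```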

@MasterJH5574 (Member)

Thank you @Sing-Li for checking again. Issue #2121 (comment) also reports a similar error. We will look into it.

@MasterJH5574 (Member)

Hi @Sing-Li @ollmer, we have fixed this issue in the latest pip package. Please update the packages and try again, thank you!
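
For anyone hitting the same error: updating the nightly wheels typically looks like the command below (the package names assume the CUDA 12.2 nightly builds from the MLC wheel index):

```shell
python -m pip install --pre -U -f https://mlc.ai/wheels \
  mlc-llm-nightly-cu122 mlc-ai-nightly-cu122
```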

@tqchen tqchen closed this as completed May 11, 2024