
Profile memory usage #59

Closed · zhuohan123 opened this issue May 3, 2023 · 0 comments · Fixed by #81

@zhuohan123 (Member)

No description provided.

zhuohan123 self-assigned this on May 3, 2023
WoosukKwon added the P0 label on May 10, 2023
yukavio pushed a commit to yukavio/vllm that referenced this issue Jul 3, 2024
SUMMARY:
* update the whl generation workflow to add testing and `testmo` integration
* add a top-level "generate whls" workflow

TEST PLAN:
ran manually ...

---------

Co-authored-by: andy-neuma <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
alixiaodi mentioned this issue Aug 2, 2024
pi314ever pushed a commit to pi314ever/vllm that referenced this issue Jan 17, 2025
remove expert_max hard code (vllm-project#47)
vLLM-Ext: Full enabling of ALiBi (vllm-project#34)
Add version inference via setuptools-scm (vllm-project#58)
Revert "vLLM-Ext: Full enabling of ALiBi (vllm-project#34)" (vllm-project#59)
Remove punica_hpu.py from vllm_hpu_extension (vllm-project#66)
Removed previous (not-pipelined) pa implementation (vllm-project#72)
Add flag to enable running softmax in fp32 (vllm-project#71)
Update calibration readme link (vllm-project#73)
allow lm_head quantization in calibration process (vllm-project#65)
Pad to bmin if value is less (vllm-project#67)
Update pyproject.toml (HabanaAI#75)

---------

Co-authored-by: Michał Kuligowski <[email protected]>
maxdebayser pushed a commit to maxdebayser/vllm that referenced this issue Feb 13, 2025
This PR enables the Spyre tests to run as a GitHub Action.

I realized that the model we were using for the tests, `llama-194m`, is not
available on the HF Hub, but if we want to run the tests externally we need a
model that is available. I've replaced it with this one:
https://huggingface.co/JackFram/llama-160m

Note that I haven't actually changed the model name in the tests; for now I've
"hacked" it with a soft link in the Docker container. This is because there is
ongoing work to introduce environment variables to control the tests, and I
don't want to complicate things.
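
For illustration only, here is a minimal sketch of that kind of aliasing; the alias path is a hypothetical placeholder, not what the Docker image actually uses:

```python
# Hypothetical sketch: alias the old model name to a locally available model.
# The alias path below is a placeholder, not the path used in the real image.
import os
from huggingface_hub import snapshot_download

real_model = snapshot_download("JackFram/llama-160m")  # fetch into the local HF cache
alias_path = "/models/llama-194m"                      # name the tests still expect (assumed)

os.makedirs(os.path.dirname(alias_path), exist_ok=True)
if not os.path.islink(alias_path):
    os.symlink(real_model, alias_path)  # the old name now resolves to the new weights
```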

For this model I see some rather odd behaviour: the tokens produced by vLLM
and HF Transformers are identical, but the decoded text is slightly different
(the strings match up to a leading space). I don't think this difference is
related to Spyre, so I've changed the test to compare token ids instead.
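
As a rough illustration (not the actual Spyre test; the prompt, output length, and greedy settings are assumptions), a token-id comparison with a standard vLLM install could look like this:

```python
# Rough sketch of comparing token ids instead of decoded text (assumed setup).
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "JackFram/llama-160m"
prompt = "Hello, my name is"

# HF Transformers reference, greedy decoding
tok = AutoTokenizer.from_pretrained(MODEL)
hf_model = AutoModelForCausalLM.from_pretrained(MODEL)
enc = tok(prompt, return_tensors="pt")
hf_out = hf_model.generate(**enc, max_new_tokens=8, do_sample=False)
hf_new_ids = hf_out[0, enc.input_ids.shape[1]:].tolist()  # keep only generated tokens

# vLLM output, greedy decoding
llm = LLM(model=MODEL)
vllm_out = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=8))
vllm_ids = list(vllm_out[0].outputs[0].token_ids)

# Compare ids, not strings: decoded text can differ by a leading space.
assert vllm_ids == hf_new_ids
```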

---------

Signed-off-by: Thomas Parnell <[email protected]>