
[BFCL] Patch Generation Script for Locally Hosted OSS model #537

Merged
13 commits merged into ShishirPatil:main on Jul 24, 2024

Conversation

HuanzhiMao
Collaborator

This PR updates the `inference` method in the `oss_handler` to improve the generation task for locally hosted models. Previously, we used Ray for multi-node, single-GPU inference (pipeline parallel), which is uncommon since most setups are a single machine with multiple GPUs. That approach also led to out-of-memory errors for some large models.

In this PR:

  • Ray is removed and replaced with vLLM's built-in single-node multi-GPU inference method (tensor parallel); see the sketch below.
  • The `@torch.inference_mode()` decorator has been removed.
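
For reference, a minimal sketch of the vLLM tensor-parallel pattern the updated handler relies on; the model name, prompt, and sampling parameters below are illustrative placeholders, not the values used in `oss_handler`:

```python
# Minimal sketch (not the actual oss_handler code): single-node multi-GPU
# generation with vLLM's tensor parallelism.
from vllm import LLM, SamplingParams

# tensor_parallel_size shards each layer's weights across the GPUs of one
# machine (tensor parallel), replacing the previous Ray-based multi-node,
# single-GPU setup (pipeline parallel).
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model name
    tensor_parallel_size=4,                       # number of local GPUs
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=512)

prompts = ["What functions would you call to get the weather in Berkeley?"]  # placeholder

# vLLM runs its forward passes without gradient tracking internally, so no
# @torch.inference_mode() wrapper is needed around this call.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```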

@HuanzhiMao HuanzhiMao marked this pull request as ready for review July 21, 2024 04:56
HuanzhiMao added a commit to HuanzhiMao/gorilla that referenced this pull request Jul 21, 2024
Squashed commit of the following:

commit e65a108
Author: Huanzhi Mao <[email protected]>
Date:   Sat Jul 20 21:50:26 2024 -0700

    update README

commit 8034aed
Author: Huanzhi Mao <[email protected]>
Date:   Sat Jul 20 17:44:50 2024 -0700

    refactor glm_handler to simplify logic and apply fix

commit 83912f0
Author: Huanzhi Mao <[email protected]>
Date:   Sat Jul 20 17:31:33 2024 -0700

    polish process_input section

commit 7d08daf
Author: Huanzhi Mao <[email protected]>
Date:   Sat Jul 20 15:46:06 2024 -0700

    simplify _batch_generate logic; separate out process_input section

commit c5ac395
Author: Huanzhi Mao <[email protected]>
Date:   Sat Jul 20 15:27:42 2024 -0700

    remove outdated gemma model name

commit b59af2c
Author: Huanzhi Mao <[email protected]>
Date:   Sat Jul 20 14:32:23 2024 -0700

    revert, as vllm still requires ray

commit 7a275d7
Author: Huanzhi Mao <[email protected]>
Date:   Sat Jul 20 14:27:44 2024 -0700

    remove ray from requirements.txt

commit 0d1c478
Merge: 32c1ad4 7b230df
Author: Huanzhi (Hans) Mao <[email protected]>
Date:   Sat Jul 20 00:01:25 2024 -0700

    Merge branch 'main' into main

commit 32c1ad4
Author: Huanzhi Mao <[email protected]>
Date:   Fri Jul 19 23:36:42 2024 -0700

    remove ray; use vllm tensor_parallel_size

commit 5ff790e
Author: Huanzhi Mao <[email protected]>
Date:   Fri Jul 19 21:21:08 2024 -0700

    remove torch inference_mode
@HuanzhiMao
Collaborator Author

Sorry, this PR was accidentally closed.

@HuanzhiMao HuanzhiMao reopened this Jul 22, 2024
Collaborator

@CharlieJCJ CharlieJCJ left a comment


LGTM

@ShishirPatil ShishirPatil merged commit c0637bb into ShishirPatil:main Jul 24, 2024
aw632 pushed a commit to vinaybagade/gorilla that referenced this pull request Aug 22, 2024