feat: dataclass args for accelerated MoE tuning #390
Conversation
Thanks for making a pull request! 😃
Tested using the new flag on a Granite 3 3B MoE model; inference testing is up next.

Regular MoE tuning
Tested this branch without the flag.
Training logs:
Location:

Fast MoE
And with the flag.
Training logs:
Location:

Results
We see a 2.48x speedup.
for f in fields(dataclass):
    nested_type = type_hints[f.name]
    values = getattr(dataclass, f.name)
    if values is not None and not is_dataclass(values):
        values = nested_type(*values)
    if isinstance(values, Iterable) and not isinstance(values, (str, bytes)):
@willmj can you explain this fix? In this PR we are not doing anything much different than before, so why is this needed?
Since the only value in FastMoeConfig is an int, the previous implementation caused this error when it tried to unpack it:
Traceback (most recent call last):
File "/home/tuning/.local/lib/python3.11/site-packages/tuning/sft_trainer.py", line 601, in main
) = parse_arguments(parser, job_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tuning/.local/lib/python3.11/site-packages/tuning/sft_trainer.py", line 535, in parse_arguments
) = parser.parse_dict(json_config, allow_extra_keys=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tuning/.local/lib/python3.11/site-packages/transformers/hf_argparser.py", line 374, in parse_dict
obj = dtype(**inputs)
^^^^^^^^^^^^^^^
File "<string>", line 4, in __init__
File "/home/tuning/.local/lib/python3.11/site-packages/tuning/config/acceleration_configs/fast_moe.py", line 36, in __post_init__
ensure_nested_dataclasses_initialized(self)
File "/home/tuning/.local/lib/python3.11/site-packages/tuning/config/acceleration_configs/utils.py", line 34, in ensure_nested_dataclasses_initialized
values = nested_type(*values)
^^^^^^^^^^^^^^^^^^^^
TypeError: tuning.config.acceleration_configs.utils.parsable_dataclass.<locals>.ParsableDataclass() argument after * must be an iterable, not int
So this seemed like the best way to avoid that. If you have another solution, though, let me know.
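A minimal standalone repro of the failure mode, using a hypothetical nested dataclass (the names here are illustrative, not the real config classes):

from dataclasses import dataclass

@dataclass
class Inner:
    ep_degree: int = 1

value = 2  # a bare int, like the single value the fast_moe field holds
try:
    Inner(*value)  # the old unpacking path: TypeError, an int is not iterable
except TypeError as err:
    print(err)
print(Inner(value))  # passing the scalar through directly works: Inner(ep_degree=2)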
@willmj thanks for the explanation. This fix seems reasonable, but I would suggest putting comments in the code to explain it.
I'm wondering if it makes more sense to check whether the value is an int, so the else branch handles the more common case instead. You could also future-proof it by checking for int/float/bool.
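A hedged sketch of that suggestion, assuming the helper iterates over nested dataclass fields as in the snippet above (the function name and structure are illustrative, not the repository's actual utility):

from dataclasses import fields, is_dataclass
from typing import get_type_hints

def init_nested_dataclasses(config):
    # Resolve the declared type of each field on the parent dataclass.
    type_hints = get_type_hints(type(config))
    for f in fields(config):
        nested_type = type_hints[f.name]
        values = getattr(config, f.name)
        if values is None or is_dataclass(values):
            continue
        if isinstance(values, (int, float, bool, str)):
            # Scalar case first (e.g. the single ep_degree int): wrap it, do not unpack.
            values = nested_type(values)
        else:
            # Common case: an iterable of constructor arguments to unpack.
            values = nested_type(*values)
        setattr(config, f.name, values)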
After running checkpoint utils on the branch Fabian created for safetensors, vLLM inference ran as expected.
Post-processing completed with this script (thanks again Fabian!):
FastMOE model saved in:
Force-pushed from 443b6b5 to c746655
Force-pushed from eda08da to bd7e2ad
Thanks for adding the fast MoE plugin and flag, Will! A few clarifying questions...
@@ -760,6 +762,9 @@ Notes:
* Notes on Multipack
- works only for *multi-gpu*.
- currently only includes the version of *multipack* optimized for linear attention implementations like *flash-attn*.
* Notes of Fast MOE
- `--fast_moe` is an integer value that configures the amount of expert parallel sharding (ep_degree).
Is there a recommended value to use for this? I see the default is 1; would 1 work to configure the amount of expert parallel sharding, or does it need to be greater than 1?
From my understanding, 1 should work in every case and does configure the ep sharding. @fabianlim might be able to provide more insight.
Also, for my understanding: can this flag be used with any MoE model, and only for fine tuning (as described in the fms-acceleration docs)?
Is there a recommended value to use for this? I see the default is 1; would 1 work to configure the amount of expert parallel sharding, or does it need to be greater than 1?
If you use 1 there, you will switch to the scattermoe kernels without sharding. If it is greater than 1, it will be the kernels plus sharding. In both cases gains will be observed here; the ep in the names refers to the expert world size.
Also, for my understanding: can this flag be used with any MoE model, and only for fine tuning (as described in the fms-acceleration docs)?
No, sorry. Our MoE plugin requires a module swap, which means we require code that tells it how to swap the MoE module for any particular model. Currently 2 MoE models are supported with representative specs, and for additional model support we need to add a new spec, but hopefully it's a copy-and-paste effort if the architectures are not much different.
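To make what ep_degree controls concrete, here is a toy sketch (not the plugin's actual code) of how expert indices could be partitioned across an expert-parallel world size of ep_degree ranks:

def shard_experts(num_experts, ep_degree):
    # Assign expert indices evenly across ep_degree expert-parallel ranks.
    assert num_experts % ep_degree == 0, "experts must divide evenly across ranks"
    per_rank = num_experts // ep_degree
    return {rank: list(range(rank * per_rank, (rank + 1) * per_rank))
            for rank in range(ep_degree)}

print(shard_experts(8, 1))  # ep_degree=1: all experts stay on one rank (kernels only, no sharding)
print(shard_experts(8, 2))  # ep_degree=2: experts split across two expert-parallel ranks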
Small spelling error, but otherwise the changes look good 👍
Also, for my understanding: can this flag be used with any MoE model, and only for fine tuning (as described in the fms-acceleration docs)?
Yes, that's correct! LoRA activation will be coming soon, though.
Description of the change
This PR adds one dataclass argument to enable accelerated MoE for sft_trainer.py via the new fms-acceleration accelerated-moe plugin, and allows for accelerated MoE full fine-tuning with the --fast_moe flag. --fast_moe enables a technique to train Mixture of Experts (MoE) models in parallel instead of sequentially. With this flag, we expect a major speedup in training time and a decrease in memory usage on MoE models.
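As a hedged illustration only, the flag might appear in a JSON-style job config like the following; the fast_moe key comes from this PR, while the other keys are generic placeholders rather than the trainer's exact argument names:

job_config = {
    "model_name_or_path": "<path-to-moe-model>",       # placeholder
    "training_data_path": "<path-to-training-data>",   # placeholder
    "output_dir": "<output-dir>",                      # placeholder
    "fast_moe": 2,  # expert parallel sharding degree (ep_degree); 1 enables the kernels without sharding
}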
Related issue number
How to verify the PR
This PR is a work-in-progress and requires more testing, as well as the official release of fms-acceleration-moe. Verify by tuning an MoE model with fast_moe.
Was the PR tested