feat: move model loader functionality to augmentation #119

Merged · 7 commits · Jan 22, 2025

Conversation

@willmj (Collaborator) commented Jan 7, 2025

Description
Step 1 of 3 for enabling LoRA on ScatterMoE: move the model loader functionality into the augmentation step. This means the plugin no longer has to run standalone, so it can be combined with other plugins.
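
For context, a rough sketch of the shape of this change (class, method, and helper names here are illustrative, not the exact plugin API):

    # Illustrative only: this mirrors the idea of moving work out of a
    # model_loader hook and into an augmentation hook.
    from typing import Tuple


    class ScatterMoEPluginSketch:
        """Before: a model_loader hook loaded and sharded the model itself,
        forcing the plugin to run standalone. After: the sharding happens in
        augmentation, which receives an already-loaded model, so the plugin
        can be composed with others (e.g. a PEFT/LoRA plugin)."""

        def augmentation(self, model, train_args, modifiable_args: Tuple):
            model = self._shard_moe(model)  # placeholder for the ScatterMoE prep
            return model, modifiable_args

        def _shard_moe(self, model):
            # the actual expert sharding / kernel swapping would happen here
            return model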

Testing
Testing on fms-hf-tuning with the augmentation function instead of the model loader shows results similar to #390:

      {
          "model_name_or_path": "/ibm_dmf_lakehouse/models/base_training/shared/granite-3.0-3b-a800m-base/r240924a",
          "training_data_path": "/testing/tuning/input/cc_tone_sft_format_1000_train.json",
          "output_dir": "/testing/tuning/output/granite-3b-moe/ft/20240107_1014-tone-FAST",
          "save_model_dir": "/testing/tuning/output/granite-3b-moe/ft/20240107_1014-tone-FAST/save_model",
          "num_train_epochs": 10.0,
          "per_device_train_batch_size": 2,
          "gradient_accumulation_steps": 1,
          "learning_rate": 1e-5,
          "response_template": "\n### Response:",
          "dataset_text_field": "output",
          "fast_moe": 1
      }

Results:

{'loss': 0.834, 'grad_norm': 326.0, 'learning_rate': 9e-06, 'epoch': 1.0}
{'loss': 0.4279, 'grad_norm': 0.076171875, 'learning_rate': 8.000000000000001e-06, 'epoch': 2.0}
{'loss': 0.1377, 'grad_norm': 3.78125, 'learning_rate': 7e-06, 'epoch': 3.0}
{'loss': 0.0384, 'grad_norm': 0.81640625, 'learning_rate': 6e-06, 'epoch': 4.0}
{'loss': 0.0031, 'grad_norm': 0.003997802734375, 'learning_rate': 5e-06, 'epoch': 5.0}
{'loss': 0.0006, 'grad_norm': 0.002044677734375, 'learning_rate': 4.000000000000001e-06, 'epoch': 6.0}
{'loss': 0.0002, 'grad_norm': 0.0032196044921875, 'learning_rate': 3e-06, 'epoch': 7.0}
{'loss': 0.0001, 'grad_norm': 0.002288818359375, 'learning_rate': 2.0000000000000003e-06, 'epoch': 8.0}
{'loss': 0.0001, 'grad_norm': 0.0087890625, 'learning_rate': 1.0000000000000002e-06, 'epoch': 9.0}
{'loss': 0.0001, 'grad_norm': 0.0115966796875, 'learning_rate': 0.0, 'epoch': 10.0}
{'train_runtime': 2125.7018, 'train_samples_per_second': 4.704, 'train_steps_per_second': 2.352, 'train_loss': 0.14420232288464904, 'epoch': 10.0}                                                                                           

model location: /testing/tuning/output/granite-3b-moe/ft/20240107_1014-tone/save_model

@willmj willmj requested a review from fabianlim as a code owner January 7, 2025 19:56
willmj added 2 commits January 7, 2025 15:04
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
rank, world_size = 0, 1
if torch.distributed.is_initialized():
    world_size = torch.distributed.get_world_size()
    rank = torch.distributed.get_rank()

# shard the MOE, and store the component names, eventually needed
# to configure the FSDP
model_name = model.config.name_or_path
Contributor:
I would say add a check for the presence of name_or_path in model.config, and if it is not there, raise a ValueError explaining that for ScatterMoE we require a name_or_path in the config to point to the model.
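
A minimal sketch of the suggested guard (the error message wording is illustrative, not taken from this PR):

    model_name = getattr(model.config, "name_or_path", None)
    if not model_name:
        raise ValueError(
            "ScatterMoE requires `name_or_path` to be set in the model config "
            "so the plugin can locate the model checkpoint for sharding."
        )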

@fabianlim (Contributor) left a comment

LGTM

@fabianlim (Contributor)

Maybe before merging it would be good to test a multi-GPU run.

@willmj (Collaborator, Author) commented Jan 14, 2025

This change seems to not work on multi-GPU:
***** Running training *****
  Num examples = 1,000
  Num Epochs = 10
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 2,500
  Number of trainable parameters = 3,160,327,680
  0%|          | 0/2500 [00:00<?, ?it/s]We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
Detected flash_attn version: 2.6.3
{'loss': 0.6756, 'grad_norm': 114.0, 'learning_rate': 9e-06, 'epoch': 1.0}
 10%|█         | 250/2500 [01:59<15:02,  2.49it/s]/home/tuning/.local/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
/home/tuning/.local/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
ERROR:sft_trainer.py:Traceback (most recent call last):
  File "/app/fms-hf-tuning/tuning/sft_trainer.py", line 665, in main
    trainer, additional_train_info = train(
                                     ^^^^^^
  File "/app/fms-hf-tuning/tuning/sft_trainer.py", line 418, in train
    trainer.train(resume_from_checkpoint)
  File "/home/tuning/.local/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 450, in train
    output = super().train(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 2052, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 2487, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 2918, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 3012, in _save_checkpoint
    self._save_optimizer_and_scheduler(output_dir)
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 3150, in _save_optimizer_and_scheduler
    save_fsdp_optimizer(
  File "/app/fms-acceleration/plugins/accelerated-moe/src/fms_acceleration_moe/utils/checkpoint_utils.py", line 82, in save_fsdp_optimizer
    raise NotImplementedError(
NotImplementedError: Checkpointing for megablocks only enabled for sharded state dict.

Saving model checkpoint to /testing/tuning/output/granite-3b-moe/ft/20250114_1014-tone-FAST/checkpoint-250
Configuration saved in /testing/tuning/output/granite-3b-moe/ft/20250114_1014-tone-FAST/checkpoint-250/config.json
Configuration saved in /testing/tuning/output/granite-3b-moe/ft/20250114_1014-tone-FAST/checkpoint-250/generation_config.json
/home/tuning/.local/lib/python3.11/site-packages/safetensors/torch.py:17: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return tensor.storage().data_ptr()
ERROR:sft_trainer.py:Traceback (most recent call last):
  File "/home/tuning/.local/lib/python3.11/site-packages/safetensors/torch.py", line 13, in storage_ptr
    return tensor.untyped_storage().data_ptr()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Attempted to access the data pointer on an invalid python storage.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/fms-hf-tuning/tuning/sft_trainer.py", line 665, in main
    trainer, additional_train_info = train(
                                     ^^^^^^
  File "/app/fms-hf-tuning/tuning/sft_trainer.py", line 418, in train
    trainer.train(resume_from_checkpoint)
  File "/home/tuning/.local/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 450, in train
    output = super().train(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 2052, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 2487, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 2918, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 3008, in _save_checkpoint
    self.save_model(output_dir, _internal_call=True)
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 3605, in save_model
    self._save(output_dir, state_dict=state_dict)
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 3715, in _save
    self.accelerator.unwrap_model(self.model).save_pretrained(
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2700, in save_pretrained
    ptrs[id_tensor_storage(tensor)].append(name)
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/pytorch_utils.py", line 305, in id_tensor_storage
    unique_id = storage_ptr(tensor)
                ^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/safetensors/torch.py", line 17, in storage_ptr
    return tensor.storage().data_ptr()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/torch/storage.py", line 1011, in data_ptr
    return self._data_ptr()
           ^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/torch/storage.py", line 1015, in _data_ptr
    return self._untyped_storage.data_ptr()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Attempted to access the data pointer on an invalid python storage.

 10%|█         | 250/2500 [02:00<18:04,  2.07it/s]
W0114 15:34:38.379000 140066601367360 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1803 closing signal SIGTERM
E0114 15:34:39.045000 140066601367360 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 203) local_rank: 1 (pid: 1804) of binary: /usr/bin/python
ERROR:root:Traceback (most recent call last):
  File "/app/accelerate_launch.py", line 99, in main
    launch_command(args)
  File "/home/tuning/.local/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1161, in launch_command
    multi_gpu_launcher(args)
  File "/home/tuning/.local/lib/python3.11/site-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/tuning/.local/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/tuning/.local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tuning.sft_trainer FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-14_15:34:38
  host      : will-dev-sleep-pod-sft-trainer-2-gpu
  rank      : 1 (local_rank: 1)
  exitcode  : 203 (pid: 1804)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

@willmj (Collaborator, Author) commented Jan 21, 2025

fsdp_state_dict_type was set to FULL_STATE_DICT in accelerate_fsdp_defaults.yaml; after changing it to SHARDED_STATE_DICT, the multi-GPU run completed correctly for ep_degree of 1 and 2.
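
For reference, the relevant setting in the accelerate FSDP config (a minimal excerpt; surrounding keys omitted):

    fsdp_config:
      fsdp_state_dict_type: SHARDED_STATE_DICT  # previously FULL_STATE_DICT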

@fabianlim fabianlim merged commit 8787ca1 into foundation-model-stack:main Jan 22, 2025
7 checks passed