feat: move model loader functionality to augmentation #119

Merged · 7 commits · Jan 22, 2025

Conversation

@willmj (Collaborator) commented Jan 7, 2025

Description
Step 1 of 3 for enabling LoRA on ScatterMoE: move the model loader functionality into the augmentation step. This means the plugin no longer has to run standalone, so it can be combined with other plugins.
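
For context, a rough sketch of the shape of this change (class, method, and helper names here are illustrative, not the exact plugin API):

    # Illustrative only: this mirrors the idea of moving work out of a
    # model_loader hook and into an augmentation hook.
    from typing import Tuple


    class ScatterMoEPluginSketch:
        """Before: a model_loader hook loaded and sharded the model itself,
        forcing the plugin to run standalone. After: the sharding happens in
        augmentation, which receives an already-loaded model, so the plugin
        can be composed with others (e.g. a PEFT/LoRA plugin)."""

        def augmentation(self, model, train_args, modifiable_args: Tuple):
            model = self._shard_moe(model)  # placeholder for the ScatterMoE prep
            return model, modifiable_args

        def _shard_moe(self, model):
            # the actual expert sharding / kernel swapping would happen here
            return model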

Testing
Testing on fms-hf-tuning with the augmentation function instead of the model loader shows results similar to #390:

      {
          "model_name_or_path": "/ibm_dmf_lakehouse/models/base_training/shared/granite-3.0-3b-a800m-base/r240924a",
          "training_data_path": "/testing/tuning/input/cc_tone_sft_format_1000_train.json",
          "output_dir": "/testing/tuning/output/granite-3b-moe/ft/20240107_1014-tone-FAST",
          "save_model_dir": "/testing/tuning/output/granite-3b-moe/ft/20240107_1014-tone-FAST/save_model",
          "num_train_epochs": 10.0,
          "per_device_train_batch_size": 2,
          "gradient_accumulation_steps": 1,
          "learning_rate": 1e-5,
          "response_template": "\n### Response:",
          "dataset_text_field": "output",
          "fast_moe": 1
      }

Results:

{'loss': 0.834, 'grad_norm': 326.0, 'learning_rate': 9e-06, 'epoch': 1.0}
{'loss': 0.4279, 'grad_norm': 0.076171875, 'learning_rate': 8.000000000000001e-06, 'epoch': 2.0}
{'loss': 0.1377, 'grad_norm': 3.78125, 'learning_rate': 7e-06, 'epoch': 3.0}
{'loss': 0.0384, 'grad_norm': 0.81640625, 'learning_rate': 6e-06, 'epoch': 4.0}
{'loss': 0.0031, 'grad_norm': 0.003997802734375, 'learning_rate': 5e-06, 'epoch': 5.0}
{'loss': 0.0006, 'grad_norm': 0.002044677734375, 'learning_rate': 4.000000000000001e-06, 'epoch': 6.0}
{'loss': 0.0002, 'grad_norm': 0.0032196044921875, 'learning_rate': 3e-06, 'epoch': 7.0}
{'loss': 0.0001, 'grad_norm': 0.002288818359375, 'learning_rate': 2.0000000000000003e-06, 'epoch': 8.0}
{'loss': 0.0001, 'grad_norm': 0.0087890625, 'learning_rate': 1.0000000000000002e-06, 'epoch': 9.0}
{'loss': 0.0001, 'grad_norm': 0.0115966796875, 'learning_rate': 0.0, 'epoch': 10.0}
{'train_runtime': 2125.7018, 'train_samples_per_second': 4.704, 'train_steps_per_second': 2.352, 'train_loss': 0.14420232288464904, 'epoch': 10.0}                                                                                           

model location: /testing/tuning/output/granite-3b-moe/ft/20240107_1014-tone/save_model

@willmj willmj requested a review from fabianlim as a code owner January 7, 2025 19:56
willmj added 2 commits January 7, 2025 15:04
Signed-off-by: Will Johnson <[email protected]>
Signed-off-by: Will Johnson <[email protected]>
rank, world_size = 0, 1
if torch.distributed.is_initialized():
    world_size = torch.distributed.get_world_size()
    rank = torch.distributed.get_rank()

# shard the MOE, and store the component names, eventually needed
# to configure the FSDP
model_name = model.config.name_or_path
Contributor:
I would say add a check for the presence of name_or_path in model.config, and if it is not there, raise a ValueError explaining that for ScatterMoE we require a name_or_path in the config to point to the model.
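
A minimal sketch of the suggested guard (the error message wording is illustrative, not taken from this PR):

    model_name = getattr(model.config, "name_or_path", None)
    if not model_name:
        raise ValueError(
            "ScatterMoE requires `name_or_path` to be set in the model config "
            "so the plugin can locate the model checkpoint for sharding."
        )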

@fabianlim (Contributor) left a comment

LGTM

@fabianlim (Contributor)

Maybe before merging it would be good to test a multi-GPU run.

@willmj (Collaborator, Author) commented Jan 14, 2025

This change seems to not work on multi-GPU:
***** Running training *****
  Num examples = 1,000
  Num Epochs = 10
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 2,500
  Number of trainable parameters = 3,160,327,680
  0%|          | 0/2500 [00:00<?, ?it/s]We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
Detected flash_attn version: 2.6.3
{'loss': 0.6756, 'grad_norm': 114.0, 'learning_rate': 9e-06, 'epoch': 1.0}
 10%|█         | 250/2500 [01:59<15:02,  2.49it/s]/home/tuning/.local/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
/home/tuning/.local/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
ERROR:sft_trainer.py:Traceback (most recent call last):
  File "/app/fms-hf-tuning/tuning/sft_trainer.py", line 665, in main
    trainer, additional_train_info = train(
                                     ^^^^^^
  File "/app/fms-hf-tuning/tuning/sft_trainer.py", line 418, in train
    trainer.train(resume_from_checkpoint)
  File "/home/tuning/.local/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 450, in train
    output = super().train(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 2052, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 2487, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 2918, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 3012, in _save_checkpoint
    self._save_optimizer_and_scheduler(output_dir)
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 3150, in _save_optimizer_and_scheduler
    save_fsdp_optimizer(
  File "/app/fms-acceleration/plugins/accelerated-moe/src/fms_acceleration_moe/utils/checkpoint_utils.py", line 82, in save_fsdp_optimizer
    raise NotImplementedError(
NotImplementedError: Checkpointing for megablocks only enabled for sharded state dict.

Saving model checkpoint to /testing/tuning/output/granite-3b-moe/ft/20250114_1014-tone-FAST/checkpoint-250
Configuration saved in /testing/tuning/output/granite-3b-moe/ft/20250114_1014-tone-FAST/checkpoint-250/config.json
Configuration saved in /testing/tuning/output/granite-3b-moe/ft/20250114_1014-tone-FAST/checkpoint-250/generation_config.json
/home/tuning/.local/lib/python3.11/site-packages/safetensors/torch.py:17: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return tensor.storage().data_ptr()
ERROR:sft_trainer.py:Traceback (most recent call last):
  File "/home/tuning/.local/lib/python3.11/site-packages/safetensors/torch.py", line 13, in storage_ptr
    return tensor.untyped_storage().data_ptr()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Attempted to access the data pointer on an invalid python storage.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/fms-hf-tuning/tuning/sft_trainer.py", line 665, in main
    trainer, additional_train_info = train(
                                     ^^^^^^
  File "/app/fms-hf-tuning/tuning/sft_trainer.py", line 418, in train
    trainer.train(resume_from_checkpoint)
  File "/home/tuning/.local/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 450, in train
    output = super().train(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 2052, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 2487, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 2918, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 3008, in _save_checkpoint
    self.save_model(output_dir, _internal_call=True)
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 3605, in save_model
    self._save(output_dir, state_dict=state_dict)
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 3715, in _save
    self.accelerator.unwrap_model(self.model).save_pretrained(
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/modeling_utils.py", line 2700, in save_pretrained
    ptrs[id_tensor_storage(tensor)].append(name)
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/pytorch_utils.py", line 305, in id_tensor_storage
    unique_id = storage_ptr(tensor)
                ^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/safetensors/torch.py", line 17, in storage_ptr
    return tensor.storage().data_ptr()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/torch/storage.py", line 1011, in data_ptr
    return self._data_ptr()
           ^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/torch/storage.py", line 1015, in _data_ptr
    return self._untyped_storage.data_ptr()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Attempted to access the data pointer on an invalid python storage.

 10%|█         | 250/2500 [02:00<18:04,  2.07it/s]
W0114 15:34:38.379000 140066601367360 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1803 closing signal SIGTERM
E0114 15:34:39.045000 140066601367360 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 203) local_rank: 1 (pid: 1804) of binary: /usr/bin/python
ERROR:root:Traceback (most recent call last):
  File "/app/accelerate_launch.py", line 99, in main
    launch_command(args)
  File "/home/tuning/.local/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1161, in launch_command
    multi_gpu_launcher(args)
  File "/home/tuning/.local/lib/python3.11/site-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/tuning/.local/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/tuning/.local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tuning.sft_trainer FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-14_15:34:38
  host      : will-dev-sleep-pod-sft-trainer-2-gpu
  rank      : 1 (local_rank: 1)
  exitcode  : 203 (pid: 1804)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

@willmj (Collaborator, Author) commented Jan 21, 2025

fsdp_state_dict_type was set to FULL_STATE_DICT in accelerate_fsdp_defaults.yaml; after changing it to SHARDED_STATE_DICT, the multi-GPU run completed correctly for ep_degree of 1 and 2.
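
For reference, the relevant setting in the accelerate FSDP config (a minimal excerpt; surrounding keys omitted):

    fsdp_config:
      fsdp_state_dict_type: SHARDED_STATE_DICT  # previously FULL_STATE_DICT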

@fabianlim fabianlim merged commit 8787ca1 into foundation-model-stack:main Jan 22, 2025
7 checks passed