Hyperparameter tuning with Conversational AI Models #16878
Comments
Hey @ericharper, thanks for raising this! I just took a look at the issues, and it seems like there are two separate threads:
RE: distributed fine-tuning with Ray Lightning, it seems like there's some sort of serialization error for NeMo. Lightning requires the model to be instantiated before parallelization, and Ray will transfer the model through RPC, serializing it with cloudpickle. What's the fundamental reason for not being able to pickle? Is it some IO mechanism or underlying library? cc @amogkam
RE: hyperparameter tuning, users should be able to use Ray Tune without running into the pickle problem. Specifically, we require users to instantiate the model in the function, so you would never need to move it around:
    def train(hyperparameters):
        model = create_nemo_model(hyperparameters)
        return model.train()

    ray.tune.run(train, resources_per_trial={"gpu": 1})
I hope that gives a bit of clarity to answer your question! Happy to help out any way we can. |
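To illustrate why building the model inside the training function sidesteps the pickle problem, here is a minimal standard-library sketch. `FakeModel` is a hypothetical stand-in (not a NeMo class) that holds a thread lock, which `pickle` cannot serialize; the hyperparameter dict, by contrast, pickles fine, so only plain data has to cross process boundaries.

```python
import pickle
import threading


class FakeModel:
    """Hypothetical stand-in for a model that owns an unpicklable resource."""

    def __init__(self, lr: float) -> None:
        self.lr = lr
        self._lock = threading.Lock()  # thread locks cannot be pickled

    def train(self) -> dict:
        return {"lr": self.lr, "status": "done"}


def can_pickle(obj) -> bool:
    """Return True if the object survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


def train(hyperparameters: dict) -> dict:
    # The model is created here, inside the function, so it never has to be
    # serialized; only the plain hyperparameter dict is passed in.
    model = FakeModel(hyperparameters["lr"])
    return model.train()


hyperparameters = {"lr": 1e-3}
model_ok = can_pickle(FakeModel(1e-3))   # the model object itself fails
params_ok = can_pickle(hyperparameters)  # the plain dict succeeds
result = train(hyperparameters)
```

The same reasoning is assumed to carry over to a real model: whatever makes it unpicklable stays local to the worker that constructs it.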
Thank you @ericharper for opening this!
I tried to add the pickle.dumps(trainable) part but that did not work. |
@Amels404 do you mind posting your stack trace and also your training function? |
Thank you @richardliaw and @ericharper. My issue is related to BERT pretraining with ray_lightning and the BERTLMModel not being serializable apparently. I discussed this with @amogkam who suggested asking NeMo's team about it (#2376). |
Yes sure, @richard4912 The cfg is a yaml file containing the parameters of the model. ` def train(config): config = { ray.tune.run(train, config=config) |
I'm sharing the trace using a better format. Thank you!
|
Hmm, just to diagnose the issue, can you try commenting out
|
The main issue is that there's something that is not serializable in |
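One way to narrow down which part of an object is not serializable is to try pickling each attribute on its own. The helper below is a rough sketch using only the standard library; `find_unpicklable` and the `Trainable` class are hypothetical names, and the logging handler is just one example of an attribute that typically fails (it holds a lock and an open stream).

```python
import logging
import pickle


def find_unpicklable(obj) -> dict:
    """Return {attribute_name: error} for instance attributes that fail to pickle."""
    failures = {}
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as exc:
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


class Trainable:
    """Hypothetical object mixing plain values with a logging handler."""

    def __init__(self) -> None:
        self.lr = 1e-3
        self.name = "asr-model"
        # Handlers own an RLock and a stream, neither of which can be pickled.
        self.log_handler = logging.StreamHandler()


failures = find_unpicklable(Trainable())
bad_attributes = sorted(failures)
```

This only inspects top-level attributes; a nested offender would need the same check applied recursively.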
Thanks @richardliaw I've tried as you suggested, I think the issue was related to the logging part. Thank you!
2021-07-13 12:03:03,873 INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265
2021-07-13 12:03:05,755 WARNING function_runner.py:546 -- Function checkpointing is disabled. This may result in unexpected behavior when using checkpointing features or certain schedulers. To enable, set the train function arguments to be |
OK, so this error comes up because hydra_runner seems to fail inside your training job. Is the primary benefit of using Hydra to convert YAML -> a structured object? Hydra generally assumes it is a top level construct (i.e., put Tune inside the Hydra call). |
Thank you for your reply. I'm not sure if this is what you mean by putting Tune inside the Hydra call. I think I'm missing something, because I still have the same issue.
|
Hmm this one looks more promising to me. Can you post the stack trace? BTW, use three backticks to format your code: ``` |
Yes, sure, thanks for the note!
2021-07-15 16:01:30,268 INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265
2021-07-15 16:01:32,193 WARNING function_runner.py:546 -- Function checkpointing is disabled. This may result in unexpected behavior when using checkpointing features or certain schedulers. To enable, set the train function arguments to be |
@Amels404 what does |
Yes, sure @richardliaw, we use hydra_runner from the nemo package that you can find here: https://github.com/NVIDIA/NeMo/blob/v1.0.0rc1/nemo/core/config/hydra_runner.py and the imports in our file are like this:
|
OK thanks!
Can you tell me more about what this does? I see you have a |
Here's an attempt at getting you a bit farther:
|
@richardliaw I'm sorry for getting back to you late. I think my issue is how to incorporate the Ray config into the OmegaConf config (precisely, an omegaconf.dictconfig.DictConfig); examples of the NeMo config files are here: https://github.com/NVIDIA/NeMo/tree/main/examples/asr/conf I tried what you suggested, but it doesn't seem to work. Here is the trace, thanks!
|
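As a rough illustration of folding sampled hyperparameters into a nested config, the sketch below uses plain dicts and dotted keys. The key names (`model.optim.lr`, `model.train_ds.batch_size`) and the helper `apply_overrides` are assumptions made for the example, not taken from a specific NeMo config. With OmegaConf itself, `OmegaConf.update(cfg, "model.optim.lr", value)` plays a similar role.

```python
import copy


def apply_overrides(base: dict, overrides: dict) -> dict:
    """Return a copy of `base` with dotted-key overrides written into it."""
    merged = copy.deepcopy(base)  # leave the shared base config untouched
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = merged
        for part in parents:
            node = node.setdefault(part, {})  # create missing levels on the way
        node[leaf] = value
    return merged


# Hypothetical base config, shaped like a model section of a YAML file.
base_cfg = {
    "model": {
        "optim": {"name": "novograd", "lr": 0.01},
        "train_ds": {"batch_size": 32},
    }
}

# Values a tuner might sample for one trial.
sampled = {"model.optim.lr": 0.0018, "model.train_ds.batch_size": 128}

trial_cfg = apply_overrides(base_cfg, sampled)
```

Copying the base config per trial keeps concurrent trials from seeing each other's values.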
Awesome! I think we're close. It seems like we've overwritten Tune as a function. Could you try:
|
Yes, sure.
|
Hey! I posted a new function to run in the previous message. Could you try
that again?
…On Tue, Jul 27, 2021 at 3:35 AM Amels404 ***@***.***> wrote:
Yes, sure.
You can find also the print results of the vars and dir, I'm not sure if
this might be useful.
Thanks!
<function tune at 0x7fd826b67d90>
{'__wrapped__': <function tune at 0x7fd826b67d08>}
['__annotations__', '__call__', '__class__', '__closure__', '__code__', '__defaults__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__get__', '__getattribute__', '__globals__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__kwdefaults__', '__le__', '__lt__', '__module__', '__name__', '__ne__', '__new__', '__qualname__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__wrapped__']
{}
Traceback (most recent call last):
  File "aaj/asr/attempt.py", line 59, in <module>
    tune()
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/nemo/core/config/hydra_runner.py", line 103, in wrapper
    strict=None,
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/_internal/utils.py", line 347, in _run_hydra
    lambda: hydra.run(
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/_internal/utils.py", line 201, in run_and_report
    raise ex
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/_internal/utils.py", line 350, in <lambda>
    overrides=args.overrides,
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/_internal/hydra.py", line 112, in run
    configure_logging=with_log_configuration,
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/core/utils.py", line 125, in run_job
    ret.return_value = task_function(task_cfg)
  File "x/asr/attempt.py", line 53, in tune
    "lr": tune.loguniform(1e-4, 1e-1),
AttributeError: 'function' object has no attribute 'loguniform'
|
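The `AttributeError: 'function' object has no attribute 'loguniform'` above is what name shadowing looks like: a function called `tune` rebinds the name that previously pointed at the `tune` module. The sketch below reproduces the effect without Ray; the `SimpleNamespace` is a hypothetical stand-in for the module, and `tune_function` mirrors the renamed entry point that appears in the later trace.

```python
import types

# Hypothetical stand-in for `from ray import tune`.
tune = types.SimpleNamespace(
    loguniform=lambda low, high: ("loguniform", low, high)
)
tune_module = tune  # second reference, so the module-like object stays reachable


def tune():  # rebinding: the name `tune` now refers to this function
    return {"lr": tune.loguniform(1e-4, 1e-1)}


try:
    tune()
    error_message = ""
except AttributeError as exc:
    error_message = str(exc)


def tune_function():  # a distinct name avoids the clash
    return {"lr": tune_module.loguniform(1e-4, 1e-1)}


fixed = tune_function()
```

Renaming the function (or importing the module under another name) is enough; nothing about the search space itself needs to change.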
Sure! I only omitted the config part.
(pid=19669) 2021-07-27 17:20:54,823 ERROR function_runner.py:254 -- Runner Thread raised error.
(pid=19669) Traceback (most recent call last):
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=19669)     self._entrypoint()
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 316, in entrypoint
(pid=19669)     self._status_reporter.get_checkpoint())
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 581, in _trainable_func
(pid=19669)     output = fn()
(pid=19669)   File "aaj/asr/attempt.py", line 41, in train
(pid=19669)     model = create_nemo_model(config["hydraconfig"])
(pid=19669)   File "aaj/asr/attempt.py", line 36, in create_nemo_model
(pid=19669)     callbacks=callbacks)
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 41, in overwrite_by_env_vars
(pid=19669)     return fn(self, **kwargs)
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 359, in __init__
(pid=19669)     deterministic,
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 101, in on_trainer_init
(pid=19669)     self.trainer.data_parallel_device_ids = device_parser.parse_gpu_ids(self.trainer.gpus)
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/utilities/device_parser.py", line 78, in parse_gpu_ids
(pid=19669)     gpus = _sanitize_gpu_ids(gpus)
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
(pid=19669)     """)
(pid=19669) pytorch_lightning.utilities.exceptions.MisconfigurationException:
(pid=19669)     You requested GPUs: [0]
(pid=19669)     But your machine only has: []
(pid=19669)
(pid=19669)
(pid=19669) Exception in thread Thread-2:
(pid=19669) Traceback (most recent call last):
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/threading.py", line 916, in _bootstrap_inner
(pid=19669)     self.run()
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 267, in run
(pid=19669)     raise e
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=19669)     self._entrypoint()
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 316, in entrypoint
(pid=19669)     self._status_reporter.get_checkpoint())
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 581, in _trainable_func
(pid=19669)     output = fn()
(pid=19669)   File "aaj/asr/attempt.py", line 41, in train
(pid=19669)     model = create_nemo_model(config["hydraconfig"])
(pid=19669)   File "aaj/asr/attempt.py", line 36, in create_nemo_model
(pid=19669)     callbacks=callbacks)
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 41, in overwrite_by_env_vars
(pid=19669)     return fn(self, **kwargs)
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 359, in __init__
(pid=19669)     deterministic,
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 101, in on_trainer_init
(pid=19669)     self.trainer.data_parallel_device_ids = device_parser.parse_gpu_ids(self.trainer.gpus)
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/utilities/device_parser.py", line 78, in parse_gpu_ids
(pid=19669)     gpus = _sanitize_gpu_ids(gpus)
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
(pid=19669)     """)
(pid=19669) pytorch_lightning.utilities.exceptions.MisconfigurationException:
(pid=19669)     You requested GPUs: [0]
(pid=19669)     But your machine only has: []
(pid=19669)
(pid=19669)
2021-07-27 17:20:54,957 ERROR trial_runner.py:748 -- Trial train_9c4c3_00000: Error processing event.
Traceback (most recent call last):
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 718, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 688, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/worker.py", line 1494, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=19669, ip=192.168.88.54)
  File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 451, in ray._raylet.execute_task.function_executor
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/_private/function_manager.py", line 563, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/trainable.py", line 173, in train_buffered
    result = self.train()
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/trainable.py", line 232, in train
    result = self.step()
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 366, in step
    self._report_thread_runner_error(block=True)
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 513, in _report_thread_runner_error
    ("Trial raised an exception. Traceback:\n{}".format(err_tb_str)
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train_buffered() (pid=19669, ip=192.168.88.54)
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 248, in run
    self._entrypoint()
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 316, in entrypoint
    self._status_reporter.get_checkpoint())
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 581, in _trainable_func
    output = fn()
  File "aaj/asr/attempt.py", line 41, in train
    model = create_nemo_model(config["hydraconfig"])
  File "aaj/asr/attempt.py", line 36, in create_nemo_model
    callbacks=callbacks)
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 41, in overwrite_by_env_vars
    return fn(self, **kwargs)
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 359, in __init__
    deterministic,
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 101, in on_trainer_init
    self.trainer.data_parallel_device_ids = device_parser.parse_gpu_ids(self.trainer.gpus)
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/utilities/device_parser.py", line 78, in parse_gpu_ids
    gpus = _sanitize_gpu_ids(gpus)
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
    """)
pytorch_lightning.utilities.exceptions.MisconfigurationException:
    You requested GPUs: [0]
    But your machine only has: []
Result for train_9c4c3_00000:
  {}
== Status ==
Memory usage on this node: 3.7/15.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/12 CPUs, 0/2 GPUs, 0.0/8.05 GiB heap, 0.0/4.02 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/amel/ray_results/train_2021-07-27_17-20-51
Number of trials: 1/1 (1 ERROR)
+-------------------+----------+-------+--------------+------------+
| Trial name | status | loc | batch_size | lr |
|-------------------+----------+-------+--------------+------------|
| train_9c4c3_00000 | ERROR | | 128 | 0.00179969 |
+-------------------+----------+-------+--------------+------------+
Number of errored trials: 1
+-------------------+--------------+--------------------------------------------------------------------------------------------------------------------------------+
| Trial name | # failures | error file |
|-------------------+--------------+--------------------------------------------------------------------------------------------------------------------------------|
| train_9c4c3_00000 | 1 | /home/amel/ray_results/train_2021-07-27_17-20-51/train_9c4c3_00000_0_batch_size=128,lr=0.0017997_2021-07-27_17-20-51/error.txt |
+-------------------+--------------+--------------------------------------------------------------------------------------------------------------------------------+
Traceback (most recent call last):
  File "aaj/asr/attempt.py", line 55, in <module>
    tune_function()
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/nemo/core/config/hydra_runner.py", line 103, in wrapper
    strict=None,
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/hydra/_internal/utils.py", line 347, in _run_hydra
    lambda: hydra.run(
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/hydra/_internal/utils.py", line 201, in run_and_report
    raise ex
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/hydra/_internal/utils.py", line 350, in <lambda>
    overrides=args.overrides,
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/hydra/_internal/hydra.py", line 112, in run
    configure_logging=with_log_configuration,
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/hydra/core/utils.py", line 125, in run_job
    ret.return_value = task_function(task_cfg)
  File "aaj/asr/attempt.py", line 53, in tune_function
    ray.tune.run(train, config=config)
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/tune.py", line 543, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [train_1ce37_00000]) |
Looks like great progress! We're quite close:
try the above? Added a |
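The trace above reports `Resources requested: 0/12 CPUs, 0/2 GPUs` while the trial sees no GPUs at all. A plausible reading, stated here as an assumption, is that the trial was started without a GPU reservation, so the worker's `CUDA_VISIBLE_DEVICES` was left empty; passing `resources_per_trial={"gpu": 1}` to `ray.tune.run` is the usual way to reserve one. The standard-library sketch below (the helper names are hypothetical) checks what a process can see before a trainer is built, which gives a more readable failure than the one in the trace.

```python
import os
from typing import List, Mapping, Optional


def visible_gpu_ids(env: Optional[Mapping[str, str]] = None) -> Optional[List[str]]:
    """Return the ids listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # no restriction: every GPU on the machine is visible
    return [part.strip() for part in raw.split(",") if part.strip()]


def check_requested_gpus(requested: int,
                         env: Optional[Mapping[str, str]] = None) -> None:
    """Raise early if the process is restricted to fewer GPUs than requested."""
    visible = visible_gpu_ids(env)
    if visible is not None and len(visible) < requested:
        raise RuntimeError(
            f"requested {requested} GPU(s) but CUDA_VISIBLE_DEVICES exposes {visible}"
        )


# An empty variable means "no GPUs", matching the message in the trace.
try:
    check_requested_gpus(1, {"CUDA_VISIBLE_DEVICES": ""})
    empty_env_error = ""
except RuntimeError as exc:
    empty_env_error = str(exc)
```

Calling `check_requested_gpus` at the top of the training function would surface a missing GPU reservation before any model is created.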
It's working now! I just needed to return the model instead of model.train |
Closing this, as it looks like the workload is good. |
Hi,
We have multiple NeMo users that are interested in using Ray Tune with PyTorch Lightning. NeMo also uses PTL so it's a natural idea to leverage Ray Tune with NeMo: NVIDIA/NeMo#2442, NVIDIA/NeMo#2376
However, there is an issue with the model not being able to be pickled. Can Ray Tune + PTL be used with the large models that are used in Conversational AI?