[Release-1.9.1] [torch] Various improvements to torch.distributed.launch and torch.distributed.run (#60925) #64797

Merged
malfet merged 3 commits into pytorch:release/1.9 from malfet/cp-61294
Sep 10, 2021

@malfet malfet commented Sep 10, 2021

Summary:
Pull Request resolved: #60925

  • Set the default number of restarts for torch.distributed.launch to 0
  • Remove the unnecessary --use_env warning
  • Move the remaining --use_env warnings to torch.distributed.launch
  • Make default log level WARNING
  • Add new doc section around transitioning to torch.distributed.run
  • Make torch.distributed.launch not use error-propagation
  • Set the default events handler to null, which does not print events to the console
  • Add reference from torch.distributed.launch to torch.distributed.run
  • Set correct preexec function that sends SIGTERM to child processes when parent dies
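
For readers unfamiliar with the last item, here is a minimal, Linux-only sketch of a preexec function that asks the kernel to send SIGTERM to a child when its parent dies. It is an illustration of the general `prctl(PR_SET_PDEATHSIG)` technique, not the code from this PR; the names `set_pdeathsig` and `spawn_worker` are hypothetical.

```python
import ctypes
import signal
import subprocess
import sys

# Value of PR_SET_PDEATHSIG from <linux/prctl.h>.
PR_SET_PDEATHSIG = 1


def set_pdeathsig(sig=signal.SIGTERM):
    """Hypothetical helper: request `sig` when the parent process dies."""
    if not sys.platform.startswith("linux"):
        return  # prctl is Linux-specific; do nothing elsewhere
    # CDLL(None) exposes symbols already loaded in this process (libc included).
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.prctl(PR_SET_PDEATHSIG, int(sig), 0, 0, 0) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_PDEATHSIG) failed")


def spawn_worker(cmd):
    """Hypothetical helper: start `cmd` with the death signal armed."""
    # preexec_fn runs in the child between fork() and exec(); POSIX only.
    hook = set_pdeathsig if sys.platform.startswith("linux") else None
    return subprocess.Popen(cmd, preexec_fn=hook)
```

For example, `spawn_worker([sys.executable, "train.py"])` would start a worker that receives SIGTERM if the launcher process exits unexpectedly (`train.py` is a placeholder script name).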

Issues resolved:

#60716
#60754

Test Plan:
sandcastle

python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts
python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts

python -m torch.distributed.launch --nproc_per_node=4  --use_env --no_python  main.py -> produces error
python -m torch.distributed.launch --nproc_per_node=4  --use_env main.py -> no warning
python -m torch.distributed.launch --nproc_per_node=4  --no_python  main.py -> warning

Output of running torch.distributed.launch without --use_env:

$path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ('LOCAL_RANK')` instead.
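
A minimal sketch of the script-side change the warning asks for. Note that `os.environ` is a mapping, so it is read with a subscript or `.get`, not called as the warning text is written. The helper name `get_local_rank` and the fallback order (environment first, then `--local_rank`, then 0) are assumptions for illustration, not something this PR prescribes.

```python
import argparse
import os


def get_local_rank(argv=None, environ=None):
    """Hypothetical helper: resolve the local rank for this worker."""
    environ = os.environ if environ is None else environ

    # Still accept --local_rank so the script keeps working with launchers
    # that pass it as a command-line argument.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=None)
    args, _unknown = parser.parse_known_args(argv)

    # Prefer the environment variable, which is what the warning recommends.
    if "LOCAL_RANK" in environ:
        return int(environ["LOCAL_RANK"])
    if args.local_rank is not None:
        return args.local_rank
    return 0  # assumed default for a plain single-process run
```

Calling `get_local_rank()` with no arguments reads `sys.argv` and `os.environ`, so the same script can be started either way.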

New section:

{F628923078}

{F628974089}

Differential Revision: D29559553

nzw0301 and others added 3 commits September 9, 2021 19:54
Summary:
The current example code does not work. The correct one is like this: https://github.com/pytorch/pytorch/blob/cb7d813275a13a4233951e7cbcbb8351dbb0fd87/torch/distributed/run.py#L266

Pull Request resolved: pytorch#61127

Reviewed By: cbalioglu

Differential Revision: D29572003

Pulled By: mrshenli

fbshipit-source-id: 05b470230f3d70f8a6164edb5f92894a1112069f
Summary:
Pull Request resolved: pytorch#59152

Small change for https://fb.workplace.com/groups/319878845696681

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D28773682

Pulled By: H-Huang

fbshipit-source-id: acf82273e8622b7ffd3088d8d766bdf49273754c
….distributed.run` (pytorch#61294)

Summary:
Pull Request resolved: pytorch#61294

Pull Request resolved: pytorch#60925

* Set the default number of restarts for `torch.distributed.launch` to 0
* Remove the unnecessary `--use_env` warning
* Move the remaining `--use_env` warnings to `torch.distributed.launch`
* Make default log level WARNING
* Add new doc section around transitioning to `torch.distributed.run`
* Make `torch.distributed.launch` not use error-propagation
* Set the default events handler to `null`, which does not print events to the console
* Add reference from `torch.distributed.launch` to `torch.distributed.run`
* Set correct preexec function that sends SIGTERM to child processes when parent dies
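
As a rough illustration of the `null` events handler item, the sketch below uses the standard library's `logging.NullHandler` to discard event records unless a console destination is requested. The registry name `_EVENT_HANDLERS`, the function `get_events_logger`, and the logger naming are hypothetical and are not taken from the torchelastic source.

```python
import logging

# Hypothetical registry mapping a destination name to a handler class.
_EVENT_HANDLERS = {
    "console": logging.StreamHandler,
    "null": logging.NullHandler,
}


def get_events_logger(destination="null", name="events_sketch"):
    """Return a logger whose records go to the chosen destination."""
    logger = logging.getLogger(f"{name}.{destination}")
    logger.setLevel(logging.DEBUG)
    # Do not bubble records up to the root logger, otherwise a root
    # console handler would print them anyway.
    logger.propagate = False
    if not logger.handlers:
        handler_cls = _EVENT_HANDLERS.get(destination, logging.NullHandler)
        logger.addHandler(handler_cls())
    return logger
```

With the default destination, `get_events_logger().info("event")` produces no console output; passing `"console"` restores printing.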

Issues resolved:

pytorch#60716
pytorch#60754

Test Plan:
sandcastle

    python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts
    python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts

    python -m torch.distributed.launch --nproc_per_node=4  --use_env --no_python  main.py -> produces error
    python -m torch.distributed.launch --nproc_per_node=4  --use_env main.py -> no warning
    python -m torch.distributed.launch --nproc_per_node=4  --no_python  main.py -> warning

Output of running torch.distributed.launch without --use_env:

    $path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated
    and will be removed in future. Use torch.distributed.run.
    Note that --use_env is set by default in torch.distributed.run.
    If your script expects `--local_rank` argument to be set, please
    change it to read from `os.environ('LOCAL_RANK')` instead.

New section:

{F628923078}

{F628974089}

Reviewed By: cbalioglu

Differential Revision: D29559553

fbshipit-source-id: 03ed9ba638bf154354e1530ffc964688431edf6b

facebook-github-bot commented Sep 10, 2021

💊 CI failures summary and remediations

As of commit 8c1be47 (more details on the Dr. CI page):


  • 5/5 failures introduced in this PR

🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build Test tools / test (1/3)

Step: "Run tests"

2021-09-10T03:05:58.3956145Z test_translate_empty (test_translate_annotations.TestTranslateAnnotations) ... ok
2021-09-10T03:05:58.3957527Z test_translate_lao_tzu (test_translate_annotations.TestTranslateAnnotations) ... ok
2021-09-10T03:05:58.3960157Z test_translate_sparser (test_translate_annotations.TestTranslateAnnotations) ... ok
2021-09-10T03:05:58.3960964Z 
2021-09-10T03:05:58.3961511Z ======================================================================
2021-09-10T03:05:58.3962112Z FAIL: test_lint (test_actions_local_runner.TestEndToEnd)
2021-09-10T03:05:58.3963531Z ----------------------------------------------------------------------
2021-09-10T03:05:58.3964174Z Traceback (most recent call last):
2021-09-10T03:05:58.3966152Z   File "/home/runner/work/pytorch/pytorch/tools/test/test_actions_local_runner.py", line 86, in test_lint
2021-09-10T03:05:58.3967811Z     self.assertIn(line, stdout)
2021-09-10T03:05:58.3976892Z AssertionError: '✓ flake8' not found in "✓ mypy: Run autogen\nx flake8\n./torch/distributed/run.py:323:1: F811 redefinition of unused 'record' from line 318\n✓ quick-checks: Extract scripts from GitHub Actions workflows\n✓ quick-checks: Ensure canonical include\n✓ quick-checks: Ensure no unqualified noqa\n✓ quick-checks: Ensure no direct cub include\n✓ quick-checks: Ensure no unqualified type ignore\n✓ quick-checks: Ensure no tabs\n✓ quick-checks: Ensure no non-breaking spaces\n✓ quick-checks: Ensure no versionless Python shebangs\n✓ quick-checks: Ensure correct trailing newlines\n✓ quick-checks: Ensure no trailing spaces\n✓ quick-checks: Run ShellCheck\n✓ cmakelint: Run cmakelint\n./.github/scripts/generate_linux_ci_workflows.py\n/home/runner/work/pytorch/pytorch/.github/workflows/pytorch-linux-xenial-py3.6-gcc5.4.yml\n/home/runner/work/pytorch/pytorch/.github/workflows/pytorch-linux-xenial-cuda10.2-cudnn7-py3.6-gcc7.yml\nmake shellcheck-gha\nmake[1]: Entering directory '/home/runner/work/pytorch/pytorch'\ntools/extract_scripts.py --out=.shellcheck_generated_gha\ntools/run_shellcheck.sh .shellcheck_generated_gha\nmake[1]: Leaving directory '/home/runner/work/pytorch/pytorch'\n✓ mypy: Run mypy\nSuccess: no issues found in 80 source files\nSuccess: no issues found in 1343 source files\n"
2021-09-10T03:05:58.3986197Z 
2021-09-10T03:05:58.3987090Z ----------------------------------------------------------------------
2021-09-10T03:05:58.3987660Z Ran 42 tests in 298.026s
2021-09-10T03:05:58.3988126Z 
2021-09-10T03:05:58.3988498Z FAILED (failures=1)
2021-09-10T03:05:58.4543648Z ##[error]Process completed with exit code 1.
2021-09-10T03:05:58.4656386Z Post job cleanup.
2021-09-10T03:05:58.6191254Z [command]/usr/bin/git version
2021-09-10T03:05:58.6192573Z git version 2.33.0
2021-09-10T03:05:58.6198713Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand

See GitHub Actions build Linux CI (pytorch-linux-xenial-py3.6-gcc5.4) / build (2/3)

Step: "Build PyTorch"

2021-09-10T03:08:49.7402046Z + [[ pytorch-linux-xenial-py3.6-gcc5.4 != *rocm* ]]
2021-09-10T03:08:49.7402965Z + [[ pytorch-linux-xenial-py3.6-gcc5.4 != *xla* ]]
2021-09-10T03:08:49.7403648Z ++ git status --porcelain
2021-09-10T03:08:50.0731160Z + git_status='?? third_party/breakpad/
2021-09-10T03:08:50.0731798Z ?? third_party/cudnn_frontend/
2021-09-10T03:08:50.0732393Z ?? third_party/pocketfft/'
2021-09-10T03:08:50.0732876Z + [[ -n ?? third_party/breakpad/
2021-09-10T03:08:50.0733286Z ?? third_party/cudnn_frontend/
2021-09-10T03:08:50.0733679Z ?? third_party/pocketfft/ ]]
2021-09-10T03:08:50.0734672Z + echo 'Build left local git repository checkout dirty'
2021-09-10T03:08:50.0735231Z Build left local git repository checkout dirty
2021-09-10T03:08:50.0735811Z + echo 'git status --porcelain:'
2021-09-10T03:08:50.0736305Z git status --porcelain:
2021-09-10T03:08:50.0736786Z + echo '?? third_party/breakpad/
2021-09-10T03:08:50.0737202Z ?? third_party/cudnn_frontend/
2021-09-10T03:08:50.0737679Z ?? third_party/pocketfft/'
2021-09-10T03:08:50.0738072Z ?? third_party/breakpad/
2021-09-10T03:08:50.0738469Z ?? third_party/cudnn_frontend/
2021-09-10T03:08:50.0738859Z ?? third_party/pocketfft/
2021-09-10T03:08:50.0739203Z + exit 1
2021-09-10T03:08:50.0739479Z + cleanup

See CircleCI build pytorch_macos_10_13_py3_test (3/3)

Step: "Test"

Sep 10 03:46:01 frame #14: c10::ThreadPool::main_loop(unsigned long) + 569 (0x108c8f259 in libc10.dylib)
Sep 10 03:46:01 frame #15: void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, c10::ThreadPool::ThreadPool(int, int, std::__1::function<void ()>)::$_0> >(void*) + 67 (0x108c8f903 in libc10.dylib)
Sep 10 03:46:01 frame #16: _pthread_start + 148 (0x7fff71b73109 in libsystem_pthread.dylib)
Sep 10 03:46:01 frame #17: thread_start + 15 (0x7fff71b6eb8b in libsystem_pthread.dylib)
Sep 10 03:46:01 
Sep 10 03:46:01 ok (3.973s)
Sep 10 03:46:17   test_rpc_builtin_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (15.949s)
Sep 10 03:46:27   test_rpc_script_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (9.831s)
Sep 10 03:46:31   test_rref_to_here_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (4.088s)
Sep 10 03:46:39   test_udf_remote_message_delay_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (8.028s)
Sep 10 03:46:42   test_udf_remote_message_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:667] Received error while processing request type 261: falseINTERNAL ASSERT FAILED at "../torch/csrc/distributed/rpc/rref_context.cpp":387, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
Sep 10 03:46:42 Exception raised from getOwnerRRef at ../torch/csrc/distributed/rpc/rref_context.cpp:387 (most recent call first):
Sep 10 03:46:42 frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) + 98 (0x109a59542 in libc10.dylib)
Sep 10 03:46:42 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 106 (0x109a57d1a in libc10.dylib)
Sep 10 03:46:42 frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 64 (0x109a57f50 in libc10.dylib)
Sep 10 03:46:42 frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 1539 (0x11673b5d3 in libtorch_cpu.dylib)
Sep 10 03:46:42 frame #4: torch::distributed::rpc::RequestCallbackImpl::processPythonRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::__1::function<void (torch::distributed::rpc::Message)> const&, long long, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> > const&, std::__1::shared_ptr<torch::distributed::rpc::LazyStreamContext>) const + 151 (0x108e227f7 in libtorch_python.dylib)
Sep 10 03:46:42 frame #5: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, long long, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> > const&, std::__1::shared_ptr<torch::distributed::rpc::LazyStreamContext>) const + 814 (0x11672bece in libtorch_cpu.dylib)
Sep 10 03:46:42 frame #6: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, long long, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> > const&, std::__1::shared_ptr<torch::distributed::rpc::LazyStreamContext>) const + 64 (0x108e246b0 in libtorch_python.dylib)
Sep 10 03:46:42 frame #7: std::__1::__function::__func<torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&, std::__1::shared_ptr<torch::distributed::rpc::LazyStreamContext>) const::$_1, std::__1::allocator<torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&, std::__1::shared_ptr<torch::distributed::rpc::LazyStreamContext>) const::$_1>, void (c10::ivalue::Future&)>::operator()(c10::ivalue::Future&) + 286 (0x116730dfe in libtorch_cpu.dylib)
Sep 10 03:46:42 frame #8: c10::ivalue::Future::invokeCallback(std::__1::function<void (c10::ivalue::Future&)>) + 697 (0x1132bc659 in libtorch_cpu.dylib)

2 failures not recognized by patterns:

Job | Step
GitHub Actions Linux CI (pytorch-linux-xenial-py3.6-gcc5.4) / render_test_results | Download PyTorch Test Reports
GitHub Actions Lint / flake8-py3 | Fail if there were any warnings

This comment was automatically generated by Dr. CI.

@malfet malfet changed the title from "[torch] Various improvements to torch.distributed.launch and torch.distributed.run (#60925)" to "[Release-1.9.1] [torch] Various improvements to torch.distributed.launch and torch.distributed.run (#60925)" Sep 10, 2021
@malfet malfet merged commit 61e9e88 into pytorch:release/1.9 Sep 10, 2021
@malfet malfet deleted the malfet/cp-61294 branch September 10, 2021 03:03
Labels: cla signed, oncall: distributed
6 participants