
process_group_test - Enhance fault tolerance collective tests #109

Closed
allenwang28 opened this issue Feb 12, 2025 · 2 comments · Fixed by #113
Labels
enhancement New feature or request

Comments

allenwang28 (Contributor) commented Feb 12, 2025

Description

process_group_test is the test suite responsible for testing collectives.

Currently it supports basic correctness tests:

  1. running the collective with a single process group to ensure that it passes (and that the resulting tensors are sane), and
  2. running the collective with two process groups to ensure that it succeeds and that the numerics are correct.
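For illustration, here is a minimal sketch of the kind of single-process-group numeric check described above, written against plain torch.distributed rather than torchft's wrapper process groups; the init method, tensor shape, and values are arbitrary choices and not the real test code:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def _check_allreduce(rank: int, world_size: int, init_method: str) -> None:
    # Each rank contributes rank + 1; after a SUM allreduce every rank should
    # hold the same known total, which makes the numerics easy to verify.
    dist.init_process_group("gloo", init_method=init_method, rank=rank, world_size=world_size)
    t = torch.full((3,), float(rank + 1))
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    expected = torch.full((3,), float(sum(range(1, world_size + 1))))
    torch.testing.assert_close(t, expected)
    dist.destroy_process_group()

if __name__ == "__main__":
    # The file used for rendezvous must not exist before the run.
    mp.spawn(_check_allreduce, args=(2, "file:///tmp/pg_test_init"), nprocs=2)
```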

As mentioned in #108, @rohan-varma had two great suggestions:

  1. Split up sequential collective tests into individual tests, and
  2. Test for actual fault tolerance

These are valuable contributions and will require some restructuring / refactoring of the tests.

Collectives as individual tests

One idea is to define an explicit list of the supported collectives and parameterize the tests based on the backend. This is explicit, but process groups are expensive to spin up and tear down: in #103, this doubled the execution time of the test for even a limited number of collectives.
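A rough sketch of that parameterization, assuming pytest-style parametrization; _new_process_group and _run_collective are hypothetical helpers, not existing test utilities:

```python
import pytest
import torch

# Explicit list of supported collectives; extend as new wrappers are added.
COLLECTIVES = ["allreduce", "allgather", "broadcast", "reduce_scatter"]
BACKENDS = ["gloo", "nccl"]

@pytest.mark.parametrize("backend", BACKENDS)
@pytest.mark.parametrize("collective", COLLECTIVES)
def test_collective(backend: str, collective: str) -> None:
    if backend == "nccl" and not torch.cuda.is_available():
        pytest.skip("NCCL requires CUDA")
    # Hypothetical helpers: spin up a PG for the backend and dispatch to the
    # named collective, checking numerics inside.
    pg = _new_process_group(backend)
    _run_collective(pg, collective)
```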

Building on the above approach, the cost could be mitigated by creating the process groups once in a setUpClass method and running all of the tests against those shared process groups.
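A minimal sketch of that structure, assuming unittest; _create_process_groups and _run_collective are hypothetical helpers, and the shutdown() call assumes the wrapper PGs expose one:

```python
import unittest

class ProcessGroupCollectiveTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls) -> None:
        # Spin up the (expensive) process groups once for the whole class.
        cls.pgs = _create_process_groups(backend="gloo", world_size=2)

    @classmethod
    def tearDownClass(cls) -> None:
        for pg in cls.pgs:
            pg.shutdown()

    def test_allreduce(self) -> None:
        _run_collective(self.pgs, "allreduce")

    def test_broadcast(self) -> None:
        _run_collective(self.pgs, "broadcast")
```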

TestDistBackend and MultiProcessTestCase in PT-D may provide some pointers for doing something like this.

From @wconstab:

There is a test class called MultiProcContinuousTest defined in the same test utils file as MultiProcessTestCase that shares a PG across test instances. It requires having main defined differently for that test file and isn't currently compatible with having MultiProcessTestCase instances in the same file, but it is in use in a number of PT-D tests because it saves a lot of time.

Fault Tolerance

The collective tests currently only check for "wrapper correctness"; adding coverage for fault tolerant behavior would provide more confidence.

One idea to test actual fault tolerance: model the failure scenarios (e.g. the sender fails, or the sender succeeds but the receiver fails) and verify that an appropriate exception or timeout is surfaced to the user process. The collective is then retried and should succeed after the PG gets reconfigured.
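Assuming the setUpClass-style fixture sketched above, one hedged way to model this: inject_failure and reconfigure are hypothetical helpers, and the send/wait signatures are approximations of the wrapper API rather than the real one:

```python
from datetime import timedelta
import torch

def test_send_with_failed_receiver(self) -> None:
    pg = self.pgs[0]

    # Simulate "sender succeeds but receiver fails": inject_failure is a
    # hypothetical helper that tears down the peer rank mid-collective.
    with inject_failure(rank=1):
        with self.assertRaises(Exception):
            work = pg.send([torch.ones(3)], 1, 0)
            work.wait(timeout=timedelta(seconds=5))

    # After the PG is reconfigured (new store / world size), the same
    # collective should complete successfully; reconfigure() is hypothetical.
    reconfigure(pg, world_size=2)
    pg.send([torch.ones(3)], 1, 0).wait()
```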

d4l3k (Member) commented Feb 18, 2025

@allenwang28 I'm pretty open to any improvements in this area. MultiProcessTestCase does have some caveats that I don't love, and since things like ProcessGroupBaby already use a subprocess, we probably don't need to launch each test worker in its own subprocess. Maybe we can pull in the MultiThreadTestCase variant and tweak it for our use case.

There's some weird behavior with the patched unit test classes which I really don't love (e.g. you need to use the custom skip functions that set exit codes rather than the built-in ones).

I do like the idea of being able to reuse the subprocess in setUpClass -- it may still be best to launch multiple threads in each test, but we could pretty easily wrap that in a helper method that takes a function and manages the per-thread PGs + sets devices via torch.cuda.set_device.
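A minimal sketch of the helper described here: run a callable on one thread per rank, setting the device and building a per-thread PG for each. _make_pg is a hypothetical helper and the exact wiring is an assumption:

```python
import threading
import torch

def run_on_threads(fn, world_size: int) -> None:
    """Run fn(pg, rank, world_size) on one thread per rank, propagating errors."""
    errors = []

    def worker(rank: int) -> None:
        try:
            if torch.cuda.is_available():
                torch.cuda.set_device(rank % torch.cuda.device_count())
            # Hypothetical helper that builds the per-thread process group.
            pg = _make_pg(rank, world_size)
            fn(pg, rank, world_size)
        except Exception as e:  # surface worker failures to the main thread
            errors.append((rank, e))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if errors:
        raise AssertionError(f"worker threads failed: {errors}")
```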

d4l3k (Member) commented Feb 18, 2025

These improvements would also be helpful for the checkpointing transport tests, PGTransport in particular.

allenwang28 linked a pull request Feb 21, 2025 that will close this issue.