
process_group_test - Enhance fault tolerance collective tests #109

Closed
allenwang28 opened this issue Feb 12, 2025 · 2 comments · Fixed by #113
Labels
enhancement New feature or request

Comments

allenwang28 (Contributor) commented Feb 12, 2025

Description

process_group_test is the test suite responsible for testing collectives.

Currently it supports basic correctness tests:

  1. running the collective with a single process group to ensure that it passes (and that the resulting tensors are sane), and
  2. running the collective with two process groups to ensure that it succeeds and that the numerics are correct.
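For illustration, here is a minimal sketch of the kind of single-process-group numeric check described above, written against plain torch.distributed rather than torchft's wrapper process groups; the init method, tensor shape, and values are arbitrary choices and not the real test code:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def _check_allreduce(rank: int, world_size: int, init_method: str) -> None:
    # Each rank contributes rank + 1; after a SUM allreduce every rank should
    # hold the same known total, which makes the numerics easy to verify.
    dist.init_process_group("gloo", init_method=init_method, rank=rank, world_size=world_size)
    t = torch.full((3,), float(rank + 1))
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    expected = torch.full((3,), float(sum(range(1, world_size + 1))))
    torch.testing.assert_close(t, expected)
    dist.destroy_process_group()

if __name__ == "__main__":
    # The file used for rendezvous must not exist before the run.
    mp.spawn(_check_allreduce, args=(2, "file:///tmp/pg_test_init"), nprocs=2)
```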

As mentioned in #108, @rohan-varma had two great suggestions:

  1. Split up sequential collective tests into individual tests, and
  2. Test for actual fault tolerance

These are valuable contributions and will require some restructuring / refactoring of the tests.

Collectives as individual tests

One idea is to define an explicit list of the supported collectives and parameterize the tests based on the backend. This is explicit, but process groups are expensive to spin up and tear down: in #103, this doubled the execution time of the test for even a limited number of collectives.
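A rough sketch of that parameterization, assuming pytest-style parametrization; _new_process_group and _run_collective are hypothetical helpers, not existing test utilities:

```python
import pytest
import torch

# Explicit list of supported collectives; extend as new wrappers are added.
COLLECTIVES = ["allreduce", "allgather", "broadcast", "reduce_scatter"]
BACKENDS = ["gloo", "nccl"]

@pytest.mark.parametrize("backend", BACKENDS)
@pytest.mark.parametrize("collective", COLLECTIVES)
def test_collective(backend: str, collective: str) -> None:
    if backend == "nccl" and not torch.cuda.is_available():
        pytest.skip("NCCL requires CUDA")
    # Hypothetical helpers: spin up a PG for the backend and dispatch to the
    # named collective, checking numerics inside.
    pg = _new_process_group(backend)
    _run_collective(pg, collective)
```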

Building on the above approach, the cost could be mitigated by creating the process groups once in a setUpClass method and running all of the tests against those shared process groups.
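A minimal sketch of that structure, assuming unittest; _create_process_groups and _run_collective are hypothetical helpers, and the shutdown() call assumes the wrapper PGs expose one:

```python
import unittest

class ProcessGroupCollectiveTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls) -> None:
        # Spin up the (expensive) process groups once for the whole class.
        cls.pgs = _create_process_groups(backend="gloo", world_size=2)

    @classmethod
    def tearDownClass(cls) -> None:
        for pg in cls.pgs:
            pg.shutdown()

    def test_allreduce(self) -> None:
        _run_collective(self.pgs, "allreduce")

    def test_broadcast(self) -> None:
        _run_collective(self.pgs, "broadcast")
```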

TestDistBackend and MultiProcessTestCase in PT-D may provide some pointers for doing something like this.

From @wconstab:

There is a test class called MultiProcContinuousTest defined in the same test utils file as MultiProcessTestCase that shares a PG across test instances. It requires having main defined differently for that test file and isn't currently compatible with having MultiProcessTestCase instances in the same file, but it is in use in a number of PT-D tests because it saves a lot of time.

Fault Tolerance

The collective tests currently only check for "wrapper correctness"; adding coverage for fault tolerant behavior would provide more confidence.

One idea to test actual fault tolerance: model the failure scenarios (e.g. the sender fails, or the sender succeeds but the receiver fails) and verify that an appropriate exception or timeout is surfaced to the user process. The collective is then retried and should succeed after the PG gets reconfigured.
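Assuming the setUpClass-style fixture sketched above, one hedged way to model this: inject_failure and reconfigure are hypothetical helpers, and the send/wait signatures are approximations of the wrapper API rather than the real one:

```python
from datetime import timedelta
import torch

def test_send_with_failed_receiver(self) -> None:
    pg = self.pgs[0]

    # Simulate "sender succeeds but receiver fails": inject_failure is a
    # hypothetical helper that tears down the peer rank mid-collective.
    with inject_failure(rank=1):
        with self.assertRaises(Exception):
            work = pg.send([torch.ones(3)], 1, 0)
            work.wait(timeout=timedelta(seconds=5))

    # After the PG is reconfigured (new store / world size), the same
    # collective should complete successfully; reconfigure() is hypothetical.
    reconfigure(pg, world_size=2)
    pg.send([torch.ones(3)], 1, 0).wait()
```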

d4l3k (Member) commented Feb 18, 2025

@allenwang28 I'm pretty open to any improvements in this area. MultiProcessTestCase does have some caveats that I don't love, and since things like ProcessGroupBaby already use a subprocess, we probably don't need to launch each test worker in its own subprocess. Maybe we can pull in the MultiThreadTestCase variant and tweak it for our use case.

There's some weird behavior with the patched unit test classes which I really don't love (e.g. you need to use the custom skip functions that set exit codes rather than the built-in ones).

I do like the idea of being able to reuse the subprocess in setUpClass -- it may still be best to launch multiple threads in each test, but we could pretty easily wrap that in a helper method that takes a function and manages the per-thread PGs + sets devices via torch.cuda.set_device.
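A minimal sketch of the helper described here: run a callable on one thread per rank, setting the device and building a per-thread PG for each. _make_pg is a hypothetical helper and the exact wiring is an assumption:

```python
import threading
import torch

def run_on_threads(fn, world_size: int) -> None:
    """Run fn(pg, rank, world_size) on one thread per rank, propagating errors."""
    errors = []

    def worker(rank: int) -> None:
        try:
            if torch.cuda.is_available():
                torch.cuda.set_device(rank % torch.cuda.device_count())
            # Hypothetical helper that builds the per-thread process group.
            pg = _make_pg(rank, world_size)
            fn(pg, rank, world_size)
        except Exception as e:  # surface worker failures to the main thread
            errors.append((rank, e))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    if errors:
        raise AssertionError(f"worker threads failed: {errors}")
```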

d4l3k (Member) commented Feb 18, 2025

These improvements would also be helpful for the checkpointing transport tests, PGTransport in particular.

allenwang28 linked a pull request Feb 21, 2025 that will close this issue.