process_group_test is the test suite responsible for testing collectives.
Currently it supports basic correctness tests: 1. running the collective with a single process group to ensure that it passes (and that tensors are sane) and 2. running the collective with 2 process groups to ensure that it succeeds and that numerics are correct.
As mentioned in #108, @rohan-varma had two great suggestions:
1. Split up sequential collective tests into individual tests, and
2. Test for actual fault tolerance
These are valuable contributions and will require some restructuring / refactoring of the tests.
Collectives as individual tests
One idea is to define an explicit list of the supported collectives and parameterize the tests over that list for each backend. This is explicit, but one issue with the current approach is that process groups are expensive to spin up and tear down. In #103, this doubled the execution time of the test for a limited number of collectives.
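A minimal sketch of the per-collective parameterization, using only the standard library. The names `COLLECTIVES`, `BACKENDS`, and `run_collective` are hypothetical placeholders and not taken from the test suite; `run_collective` stands in for whatever actually runs and checks one collective on one backend.

```python
import unittest

# Hypothetical lists for illustration; the real suite would enumerate
# the collectives and backends it actually supports.
COLLECTIVES = ["allreduce", "allgather", "broadcast"]
BACKENDS = ["gloo", "nccl"]


def run_collective(backend: str, collective: str) -> bool:
    """Placeholder for running one collective on one backend and checking it."""
    return True


class CollectiveTest(unittest.TestCase):
    pass


def _make_test(backend: str, collective: str):
    # The factory binds backend/collective per test, avoiding the
    # late-binding closure pitfall of defining the function in the loop.
    def test(self: unittest.TestCase) -> None:
        self.assertTrue(run_collective(backend, collective))

    return test


# Attach one test method per (backend, collective) pair so that each one
# is reported separately and can fail independently.
for _backend in BACKENDS:
    for _collective in COLLECTIVES:
        setattr(
            CollectiveTest,
            f"test_{_collective}_{_backend}",
            _make_test(_backend, _collective),
        )
```

Generating named methods (rather than looping inside one test) keeps each collective individually selectable from the test runner, at the cost of whatever per-test setup the class performs.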
Building on the above approach, this cost could be mitigated by creating the process groups once in a setUpClass method and running all tests on those process groups.
TestDistBackend and MultiProcessTestCase in PT-D may provide some pointers for doing something like this.
From @wconstab: There is a test class called MultiProcContinuousTest, defined in the same test utils file as MultiProcessTestCase, that shares a PG across test instances. It requires main to be defined differently for that test file and currently isn't compatible with having MultiProcessTestCase instances inside the same file, but it is in use in a number of PT-D tests because it saves a lot of time.
Fault Tolerance
Collectives currently only test for "wrapper correctness", but adding in correctness for fault tolerant behaviors would provide more confidence.
One idea for testing the actual fault tolerance: model failure scenarios, e.g. the sender fails, or the sender succeeds but the receiver fails, and verify that an appropriate exception or timeout is returned to the user process. The collective is then retried and should succeed after the PG is reconfigured.
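The shape of such a test could look like the sketch below. Everything here is an assumption made for illustration: `FlakyProcessGroup`, its `fail_on` switch, `configure`, and `send_recv` are hypothetical and do not correspond to the project's API. The stand-in only models "inject a fault, observe the error, reconfigure, retry".

```python
import unittest
from typing import Optional


class FlakyProcessGroup:
    """Hypothetical process group stand-in with an injectable fault."""

    def __init__(self) -> None:
        self.fail_on: Optional[str] = None  # "send", "recv", or None
        self.configured = 0

    def configure(self) -> None:
        # Reconfiguring clears the injected fault, modelling a healthy group.
        self.fail_on = None
        self.configured += 1

    def send_recv(self, value: int) -> int:
        if self.fail_on == "send":
            raise RuntimeError("sender failed")
        if self.fail_on == "recv":
            raise TimeoutError("receiver did not respond")
        return value


class FaultToleranceTest(unittest.TestCase):
    def test_failure_then_recovery(self) -> None:
        scenarios = (("send", RuntimeError), ("recv", TimeoutError))
        for fault, expected in scenarios:
            with self.subTest(fault=fault):
                pg = FlakyProcessGroup()
                pg.fail_on = fault
                # The failure must surface to the caller as an exception.
                with self.assertRaises(expected):
                    pg.send_recv(7)
                # After reconfiguration, the retry is expected to succeed.
                pg.configure()
                self.assertEqual(pg.send_recv(7), 7)
```

In a real test the fault would have to be injected into an actual worker (for example by killing it), which is the part that requires the restructuring discussed in this issue.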
@allenwang28 I'm pretty open to any improvements in this area. MultiProcessTestCase does have some caveats that I don't love, and since we're already using a subprocess for things like ProcessGroupBaby, we probably don't need to launch each test worker in a subprocess. Maybe we can pull in the MultiThreadTestCase variant and tweak it for our use case.
There's some weird behavior with the patched unit test classes which I really don't love (e.g. you need to use the custom skip functions that set exit codes rather than the built-in ones).
I do like the idea of being able to reuse the subprocess in setUpClass. It may still be best to launch multiple threads in each test, but we could pretty easily wrap that in a helper method that takes a function, manages the per-thread PGs, and sets devices via torch.cuda.set_device.
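A rough sketch of what such a helper might look like, using only the standard library. The name `run_on_ranks` and its signature are hypothetical; the per-thread PG construction and the `torch.cuda.set_device` call are deliberately left out so the example runs without torch.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")


def run_on_ranks(fn: Callable[[int, int], T], world_size: int) -> list[T]:
    """Run fn(rank, world_size) on one thread per rank; return results by rank.

    A real helper would also create the per-thread process group and select
    the device for the rank before calling fn; both steps are omitted here.
    """
    # One worker per rank so that all ranks can run concurrently, which
    # collectives need in order to make progress.
    with ThreadPoolExecutor(max_workers=world_size) as pool:
        futures = [pool.submit(fn, rank, world_size) for rank in range(world_size)]
        # result() re-raises a worker's exception in the calling test thread.
        return [future.result() for future in futures]
```

A test body would then reduce to a call such as `run_on_ranks(check_allreduce, world_size=2)`, where `check_allreduce` is whatever per-rank function the test defines.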