Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bugfix][Disco] Handle NDArray larger than OS buffer for pipe #16992

Conversation

Lunderberg
Copy link
Contributor

Prior to this commit, using disco.Session methods to transfer NDArray instances to workers could raise an exception if the NDArray is larger than the buffer allocated by the OS for the controller/worker pipe. In these case, the first call to the Read method of tvm::support::Pipe would successfully return, but only with the initial bytes of the NDArray. Receiving the full NDArray requires repeatedly calling the POSIX read function.

This commit updates the Read and Write methods of tvm::support::Pipe to repeatedly call the underlying read/write methods, until the full NDArray has been transferred.

This commit does not add any unit tests, as the existing unit test tests/python/disco/test_ccl.py::test_attention[nccl-ProcessSession] requires this change to pass.

Prior to this commit, using `disco.Session` methods to transfer
`NDArray` instances to workers could raise an exception if the
`NDArray` is larger than the buffer allocated by the OS for the
controller/worker pipe.  In these case, the first call to the `Read`
method of `tvm::support::Pipe` would successfully return, but only
with the initial bytes of the `NDArray`.  Receiving the full `NDArray`
requires repeatedly calling the POSIX `read` function.

This commit updates the `Read` and `Write` methods of
`tvm::support::Pipe` to repeatedly call the underlying read/write
methods, until the full `NDArray` has been transferred.

This commit does not add any unit tests, as the existing unit test
`tests/python/disco/test_ccl.py::test_attention[nccl-ProcessSession]`
requires this change to pass.
@Lunderberg Lunderberg merged commit d9dbbc9 into apache:main May 14, 2024
19 checks passed
@Lunderberg Lunderberg deleted the bugfix_disco_transfer_larger_than_pipe_buffer branch May 14, 2024 14:39
@tqchen
Copy link
Member

tqchen commented May 14, 2024

Thanks @Lunderberg !
In this case, i think it is better to introduce a ReadAll and WriteAll function, in pairing with https://github.com/apache/tvm/blob/main/src/support/socket.h#L492, and then we call these functions instead

Just to keep Socket and pipe behavior consistent with each other, low-level read/write can remain partial while readall and writeall contains convenient method for the intended behavior

@Lunderberg
Copy link
Contributor Author

I agree. This is a stop-gap measure, as the longer-term fix will require updating the dmlc::Stream API to return size_t from Write, similar to Read. Without that API change, the calling scope cannot correctly implement a WriteAll method.

Lunderberg added a commit to Lunderberg/tvm that referenced this pull request May 14, 2024
This resolves a conflict between two recent changes.  In
apache#16989, reads of size zero are used
to identify hangups in `ProcessSession`.  In
apache#16992, reads of size zero are
treated as an error to avoid infinite loops while waiting for data to
be ready.

For a long-term resolution, the `dmlc::Stream` interface will need to
be updated, so that the `Write` method returns the number of bytes
written, just as the `Read` method currently does.  This will allow
the calling scope to verify the number of bytes received.
@Lunderberg
Copy link
Contributor Author

@tqchen The changes for the long-term fix are implemented in dmlc/dmlc-core#686, with TVM compatibility changes implemented in #16998. Once those changes land, we will be able to remove the stop-gap implementation, and provide ReadAll and WriteAll helper methods.

@tqchen
Copy link
Member

tqchen commented May 14, 2024

Ah I see, the Stream interface was actually created to be in analogy of FileStream interface(which do not have partial writes). After seeing your fix, i realized that this PR actually did the right thing given pipe inheritated from the stream interface

@tqchen
Copy link
Member

tqchen commented May 14, 2024

The socket interface was mainly meant to accomodate for the non-blocking case where full read is not always possible

@Lunderberg
Copy link
Contributor Author

Ah I see, the Stream interface was actually created to be in analogy of FileStream interface(which do not have partial writes).

Thank you, and that history makes a lot of sense. I'm thinking it would still be a good change (though low priority) to expose partial writes in dmlc::Stream interface, because the Stream interface is also used for pipes and TCP sockets.

The socket interface was mainly meant to accomodate for the non-blocking case where full read is not always possible

Good point. Looking into it, it looks like the Socket interface also includes functionality that isn't appropriate for a pipe, such as the address/port to listen on, accepting new connections, etc.

Lunderberg added a commit that referenced this pull request May 15, 2024
This resolves a conflict between two recent changes.  In
#16989, reads of size zero are used
to identify hangups in `ProcessSession`.  In
#16992, reads of size zero are
treated as an error to avoid infinite loops while waiting for data to
be ready.

For a long-term resolution, the `dmlc::Stream` interface will need to
be updated, so that the `Write` method returns the number of bytes
written, just as the `Read` method currently does.  This will allow
the calling scope to verify the number of bytes received.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants