We have done several tests to try and repro this, especially around the keepalive configuration on the host and for UCX.
At this stage we are getting 0 failures, but the system has since been rebooted and we are running a version of UCX that @evgeny-leksikov prepared. Our next step is to move to UCX 1.15 as released; we'll update here if anything changes. Unfortunately, none of the investigation we have done has yielded a root cause.
Our CI detected an issue that I didn't see while manually testing with UCX 1.14.0: NVIDIA/spark-rapids#7940
Essentially we are losing endpoints, and the only error we get in our listener is that there was a timeout.
This started to happen after we upgraded to UCX 1.14.0. The version we were using before was 1.12.1.
Any pointers on what may have changed in timeout (keepalive?) error handling between these versions would be great.