-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ofi: Improve provider filtering / matching #776
ofi: Improve provider filtering / matching #776
Conversation
@sunkuamzn and @shijin-aws do you mind looking at |
575e91c
to
ff429bb
Compare
The AWS Libfabric team pointed out that we were being way too loose with our provider matching when looking for rails, in a way that could result in us finding multiple provider entries for the same underlying NIC. This patch's primary goal is to improve the filtering logic for providers by doing a deeper match, as suggested by the team. This required writing yet another loop to pull entries out of a provider list, so at the same time, I refactored that code to have a single function that did the list management bits and simpler filtering functions for deciding what goes and what stays. This is marginally more readable, but should make adding additional provider list manipulations much easier. Signed-off-by: Brian Barrett <[email protected]>
ff429bb
to
4a4011a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fabric_attr->name
and mode
will catch the differences between efa
and efa-direct
See also: ofiwg/libfabric#10757 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See also: ofiwg/libfabric#10757
Yeah, that needs to be fixed in libfabric for sure. That shouldn't affect our filtering/rail counting though. You think it will?
We're not currently using FI_PROV_ATTR_ONLY, because we don't actually want to look at as "fastest NIC", since we're so constrained in what features we support. So I don't think that bug being fixed will change anything. And, obviously, our current thing works. |
It would be great if the logic for making a decision during init was to start by sorting a list of available providers by our preferences, and that ought to start with first forming a list of providers compiled into libfabric. But if that's broken, we can't really do that. |
The AWS Libfabric team pointed out that we were being way too loose with our provider matching when looking for rails, in a way that could result in us finding multiple provider entries for the same underlying NIC. This patch's primary goal is to improve the filtering logic for providers by doing a deeper match, as suggested by the team.
This required writing yet another loop to pull entries out of a provider list, so at the same time, I refactored that code to have a single function that did the list management bits and simpler filtering functions for deciding what goes and what stays. This is marginally more readable, but should make adding additional provider list manipulations much easier.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.