Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ofi: Improve provider filtering / matching #776

Merged
merged 1 commit into from
Feb 5, 2025

Conversation

bwbarrett
Copy link
Contributor

The AWS Libfabric team pointed out that we were being way too loose with our provider matching when looking for rails, in a way that could result in us finding multiple provider entries for the same underlying NIC. This patch's primary goal is to improve the filtering logic for providers by doing a deeper match, as suggested by the team.

This required writing yet another loop to pull entries out of a provider list, so at the same time, I refactored that code to have a single function that did the list management bits and simpler filtering functions for deciding what goes and what stays. This is marginally more readable, but should make adding additional provider list manipulations much easier.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@bwbarrett bwbarrett requested a review from a team as a code owner January 30, 2025 00:30
@bwbarrett
Copy link
Contributor Author

@sunkuamzn and @shijin-aws do you mind looking at provi_filter_by_match() and make sure that does what you expect?

@bwbarrett bwbarrett force-pushed the bugfix/better-fabric-comparisons branch from 575e91c to ff429bb Compare January 30, 2025 00:33
The AWS Libfabric team pointed out that we were being way too loose
with our provider matching when looking for rails, in a way that could
result in us finding multiple provider entries for the same underlying
NIC.  This patch's primary goal is to improve the filtering logic for
providers by doing a deeper match, as suggested by the team.

This required writing yet another loop to pull entries out of a
provider list, so at the same time, I refactored that code to
have a single function that did the list management bits and simpler
filtering functions for deciding what goes and what stays.  This is
marginally more readable, but should make adding additional provider
list manipulations much easier.

Signed-off-by: Brian Barrett <[email protected]>
@bwbarrett bwbarrett force-pushed the bugfix/better-fabric-comparisons branch from ff429bb to 4a4011a Compare January 30, 2025 05:00
Copy link
Contributor

@sunkuamzn sunkuamzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fabric_attr->name and mode will catch the differences between efa and efa-direct

@aws-nslick
Copy link
Contributor

See also: ofiwg/libfabric#10757

Copy link
Member

@rajachan rajachan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also: ofiwg/libfabric#10757

Yeah, that needs to be fixed in libfabric for sure. That shouldn't affect our filtering/rail counting though. You think it will?

@bwbarrett
Copy link
Contributor Author

We're not currently using FI_PROV_ATTR_ONLY, because we don't actually want to look at as "fastest NIC", since we're so constrained in what features we support. So I don't think that bug being fixed will change anything. And, obviously, our current thing works.

@bwbarrett bwbarrett merged commit 5f451d2 into aws:master Feb 5, 2025
23 checks passed
@bwbarrett bwbarrett deleted the bugfix/better-fabric-comparisons branch February 5, 2025 20:32
@aws-nslick
Copy link
Contributor

aws-nslick commented Feb 5, 2025

We're not currently using FI_PROV_ATTR_ONLY, because we don't actually want to look at as "fastest NIC", since we're so constrained in what features we support. So I don't think that bug being fixed will change anything. And, obviously, our current thing works.

It would be great if the logic for making a decision during init was to start by sorting a list of available providers by our preferences, and that ought to start with first forming a list of providers compiled into libfabric. But if that's broken, we can't really do that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants