Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address issues with existing determinism guarantees #886

Open
alliepiper opened this issue May 6, 2022 · 9 comments
Open

Address issues with existing determinism guarantees #886

alliepiper opened this issue May 6, 2022 · 9 comments
Labels
2024-2025 goal cub For all items related to CUB

Comments

@alliepiper
Copy link
Collaborator

alliepiper commented May 6, 2022

Overview

Some cub::Device* algorithms are/were documented to be run-to-run deterministic, but the implementations no longer fulfill that guarantee. This has been a major pain point for several users who were relying on this behavior. Their code does not behave as expected, and they have no way of fixing this without writing their own kernels.

An additional facet of this issue is that the existing non-deterministic algorithms are significantly faster than the former deterministic implementations, and the regression was accidentally introduced years ago specifically to provide better performance. Fixing this involves a trade-off between fixing functionally-regressed code and introducing performance regressions.

Both deterministic and non-deterministic implementations must continue to be available through public API for both sets of users.

Affected Algorithms

The following algorithms are known to violate current or former guarantees:

  • DeviceReduce::ReduceByKey is currently documented to provide run-to-run determinism, but doesn't.
  • DeviceScan::* used to provide the same broken guarantee, and the documentation was temporarily updated pending proper resolution of this issue.

These algorithms also provide such guarantees:

  • DeviceReduce::Reduce (and derived Sum, Min, Max, etc)
  • DeviceSegmentedReduce::Reduce

Requirements

  • Add explicit determinism tests for all affected algorithms that do/did provide determinism guarantees.
  • Update any algorithms that fail the new tests so that the deterministic behavior is restored.
  • Expose both the existing non-deterministic and the updated deterministic implementations via public API.
  • Document variants clearly.
  • Run new determinism tests in CI to guard against future regressions.

Open Questions

How to handle naming?

How should we spell the deterministic and non-deterministic APIs? The choice is between backward compatibility or a consistent, sensible API. I'm interested in hearing user feedback on which solution would be preferable:

Option 1: Prioritize backwards compatibility.

All existing algorithm names that promise(d) determinism have determinism restored and preserved.
Non-deterministic implementations of these algorithms will be introduced under a new name, probably prefixed/suffixed (suggestions welcome).
New deterministic APIs of existing non-deterministic algorithms will be introduced under a new name with a different prefix/suffix.

Pros:

  • Fixes functional regressions.

Cons:

  • Introduces performance regressions.
  • Inconsistent naming: The "basic" algorithm spelling will be deterministic for some algorithms but non-deterministic for others. Some algorithms will be suffixed when deterministic, other will be suffixed when non-deterministic.

Option 2: Prioritize consistency.

Algorithm implementations with no special guarantees are exposed through the basic spelling of the algorithm (e.g. Reduce).

Algorithms with determinism guarantees are prefixed (e.g. DeviceDeterministicReduce for run-to-run determinism on the same device, DeterministicReduce for consistently deterministic in all cases).

Pros:

  • Naming is consistent.
  • Matches how special requirements and guarantees are handled in other APIs (e.g. DeviceMergeSort has SortKeys, StableSortKeys, and SortKeysCopy).

Cons:

  • Introduces functional regressions as determinism guarantees on existing APIs are broken.

Scope

This issue is limited to testing and fixing pre-existing determinism guarantees in the explicitly listed algorithms. Requests for new or stronger determinism guarantees should go into new issues.

@dkolsen-pgi
Copy link
Collaborator

dkolsen-pgi commented May 7, 2022

I prefer option 2. The people who were depending on the determinism are already upset, and I am guessing that they will be (mostly) satisfied with option 2 (because CUB developers are unlikely to make the same mistake again when the function has "Deterministic" in its name). If option 1 is chosen, then a whole new group of users (who care about performance more than determinism) will become upset. Option 2 will result in a happier community (on average, if not everyone individually) and is a better long-term solution.

@maddyscientist
Copy link

I also prefer option 2, and completely agree with the points raised above by @dkolsen-pgi.

@leofang
Copy link
Member

leofang commented May 7, 2022

cc: @kmaehashi (CuPy switched to use CUB by default despite the reproducibility/determinism concern, cupy/cupy#4897)

@ekelsen
Copy link

ekelsen commented May 9, 2022

Glad to see this happening! Although not directly affected anymore, my preference would be Approach 2.

@duncanriach
Copy link

I also prefer option 2.

@lilohuang
Copy link

Option 2 :) thanks.

@bhack
Copy link

bhack commented Jul 10, 2022

Is HistogramEven currently deterministic or not?

@alliepiper
Copy link
Collaborator Author

Is HistogramEven currently deterministic or not?

Yes, assuming that your histogram output is an integral type that is either unsigned or won't overflow at runtime.

@gevtushenko
Copy link
Collaborator

Update on this issue. We incline towards having the option 2 with some changes. Since there are many forms of determinism (run-to-run, GPU-to-GPU, etc), having everything embedded into the name would explode the number of functions. Instead of encoding requirements in the name, we need an API that would allow users to provide functional requirements as a parameter.

Apart from determinism guarantees, we'd like to co-design this new API with the notion of operator complexity. Users provide quite complex operators to CUB. @elstehle provided an example of such operator:

constexpr int element_count = 5;
using element_t = int;
using item_t = cuda::std::array<element_t, element_count>;
constexpr item_t identity{0, 1, 2, 3, 4};

struct op_t
{
    __host__ __device__ __forceinline__
    item_t operator()(item_t lhs, item_t rhs)
    {
        item_t result{};
        for(int i=0; i < element_count; i++)
        {
            result[i] = rhs[lhs[i]];
        }
        return result;
    }
};

Having something along the "compute-optimal" vs "memory-optimal" requirement in the set of functional requirements alongside determinism requirements seems reasonable. Compute-optimal solution would not be allowed to perform extra invocations of the binary operator.

Since we have to implement a new deterministic scan algorithm anyways, we might experiment with implementing an algorithm that is both deterministic and work optimal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2024-2025 goal cub For all items related to CUB
Projects
Status: Todo
Development

No branches or pull requests