Address issues with existing determinism guarantees #886

alliepiper · 2022-05-06T21:56:55Z

Overview

Some cub::Device* algorithms are/were documented to be run-to-run deterministic, but the implementations no longer fulfill that guarantee. This has been a major pain point for several users who were relying on this behavior. Their code does not behave as expected, and they have no way of fixing this without writing their own kernels.

An additional facet of this issue is that the existing non-deterministic algorithms are significantly faster than the former deterministic implementations, and the regression was accidentally introduced years ago specifically to provide better performance. Fixing this involves a trade-off between fixing functionally-regressed code and introducing performance regressions.

Both deterministic and non-deterministic implementations must continue to be available through public API for both sets of users.

Affected Algorithms

The following algorithms are known to violate current or former guarantees:

DeviceReduce::ReduceByKey is currently documented to provide run-to-run determinism, but doesn't.
DeviceScan::* used to provide the same broken guarantee, and the documentation was temporarily updated pending proper resolution of this issue.

These algorithms also provide such guarantees:

DeviceReduce::Reduce (and derived Sum, Min, Max, etc)
DeviceSegmentedReduce::Reduce

Requirements

Add explicit determinism tests for all affected algorithms that do/did provide determinism guarantees.
Update any algorithms that fail the new tests so that the deterministic behavior is restored.
Expose both the existing non-deterministic and the updated deterministic implementations via public API.
Document variants clearly.
Run new determinism tests in CI to guard against future regressions.

Open Questions

How to handle naming?

How should we spell the deterministic and non-deterministic APIs? The choice is between backward compatibility or a consistent, sensible API. I'm interested in hearing user feedback on which solution would be preferable:

Option 1: Prioritize backwards compatibility.

All existing algorithm names that promise(d) determinism have determinism restored and preserved.
Non-deterministic implementations of these algorithms will be introduced under a new name, probably prefixed/suffixed (suggestions welcome).
New deterministic APIs of existing non-deterministic algorithms will be introduced under a new name with a different prefix/suffix.

Pros:

Fixes functional regressions.

Cons:

Introduces performance regressions.
Inconsistent naming: The "basic" algorithm spelling will be deterministic for some algorithms but non-deterministic for others. Some algorithms will be suffixed when deterministic, other will be suffixed when non-deterministic.

Option 2: Prioritize consistency.

Algorithm implementations with no special guarantees are exposed through the basic spelling of the algorithm (e.g. Reduce).

Algorithms with determinism guarantees are prefixed (e.g. DeviceDeterministicReduce for run-to-run determinism on the same device, DeterministicReduce for consistently deterministic in all cases).

Pros:

Naming is consistent.
Matches how special requirements and guarantees are handled in other APIs (e.g. DeviceMergeSort has SortKeys, StableSortKeys, and SortKeysCopy).

Cons:

Introduces functional regressions as determinism guarantees on existing APIs are broken.

Scope

This issue is limited to testing and fixing pre-existing determinism guarantees in the explicitly listed algorithms. Requests for new or stronger determinism guarantees should go into new issues.

The text was updated successfully, but these errors were encountered:

dkolsen-pgi · 2022-05-07T19:56:43Z

I prefer option 2. The people who were depending on the determinism are already upset, and I am guessing that they will be (mostly) satisfied with option 2 (because CUB developers are unlikely to make the same mistake again when the function has "Deterministic" in its name). If option 1 is chosen, then a whole new group of users (who care about performance more than determinism) will become upset. Option 2 will result in a happier community (on average, if not everyone individually) and is a better long-term solution.

maddyscientist · 2022-05-07T20:59:44Z

I also prefer option 2, and completely agree with the points raised above by @dkolsen-pgi.

leofang · 2022-05-07T23:04:58Z

cc: @kmaehashi (CuPy switched to use CUB by default despite the reproducibility/determinism concern, cupy/cupy#4897)

ekelsen · 2022-05-09T15:57:42Z

Glad to see this happening! Although not directly affected anymore, my preference would be Approach 2.

duncanriach · 2022-05-09T23:36:20Z

I also prefer option 2.

lilohuang · 2022-05-16T07:39:56Z

Option 2 :) thanks.

bhack · 2022-07-10T00:30:22Z

Is HistogramEven currently deterministic or not?

alliepiper · 2022-07-25T21:58:15Z

Is HistogramEven currently deterministic or not?

Yes, assuming that your histogram output is an integral type that is either unsigned or won't overflow at runtime.

gevtushenko · 2023-10-05T16:20:54Z

Update on this issue. We incline towards having the option 2 with some changes. Since there are many forms of determinism (run-to-run, GPU-to-GPU, etc), having everything embedded into the name would explode the number of functions. Instead of encoding requirements in the name, we need an API that would allow users to provide functional requirements as a parameter.

Apart from determinism guarantees, we'd like to co-design this new API with the notion of operator complexity. Users provide quite complex operators to CUB. @elstehle provided an example of such operator:

constexpr int element_count = 5;
using element_t = int;
using item_t = cuda::std::array<element_t, element_count>;
constexpr item_t identity{0, 1, 2, 3, 4};

struct op_t
{
    __host__ __device__ __forceinline__
    item_t operator()(item_t lhs, item_t rhs)
    {
        item_t result{};
        for(int i=0; i < element_count; i++)
        {
            result[i] = rhs[lhs[i]];
        }
        return result;
    }
};

Having something along the "compute-optimal" vs "memory-optimal" requirement in the set of functional requirements alongside determinism requirements seems reasonable. Compute-optimal solution would not be allowed to perform extra invocations of the binary operator.

Since we have to implement a new deterministic scan algorithm anyways, we might experiment with implementing an algorithm that is both deterministic and work optimal.

gevtushenko mentioned this issue Jul 18, 2022

Add test for determinism on device scan[FAILING!!!!] NVIDIA/cub#453

Closed

bhack mentioned this issue Jul 26, 2022

Change XLA bincount determinism string rationale tensorflow/tensorflow#56905

Merged

ptrblck mentioned this issue Dec 22, 2022

Feature Request: deterministic CUDA cumsum pytorch/pytorch#89492

Closed

jrhemstad added the cub For all items related to CUB label Feb 22, 2023

jarmak-nv assigned miscco Feb 23, 2023

Aidyn-A mentioned this issue May 31, 2023

[ATen][Native][CUDA] Embedding alertNotDeterministic pytorch/pytorch#102628

Closed

gevtushenko mentioned this issue Jul 24, 2023

[FEA]: Non-deterministic DeviceReduce #265

Open

1 task

github-project-automation bot added this to CCCL Nov 8, 2023

github-project-automation bot moved this to Todo in CCCL Nov 8, 2023

jarmak-nv transferred this issue from NVIDIA/cub Nov 8, 2023

miscco removed their assignment Dec 6, 2023

tianleiwu mentioned this issue Jan 4, 2024

ONNXRT produces different outputs with CUDA EP on A100 GPU when used with different hosts (x86 CPU and ARM CPU). microsoft/onnxruntime#18725

Closed

lilohuang mentioned this issue Jun 5, 2024

cuPy cumsum results are non-deterministic for floats cupy/cupy#8350

Open

GentikSolm mentioned this issue Aug 29, 2024

[Feature]: Proof of Work value ($5,000 Bounty) from Manifold Labs vllm-project/vllm#8003

Closed

1 task

jollylili added the 2024-2025 goal label Sep 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Address issues with existing determinism guarantees #886

Address issues with existing determinism guarantees #886

alliepiper commented May 6, 2022 •

edited

Loading

dkolsen-pgi commented May 7, 2022 •

edited

Loading

maddyscientist commented May 7, 2022

leofang commented May 7, 2022 •

edited

Loading

ekelsen commented May 9, 2022

duncanriach commented May 9, 2022

lilohuang commented May 16, 2022

bhack commented Jul 10, 2022

alliepiper commented Jul 25, 2022

gevtushenko commented Oct 5, 2023

Address issues with existing determinism guarantees #886

Address issues with existing determinism guarantees #886

Comments

alliepiper commented May 6, 2022 • edited Loading

Overview

Affected Algorithms

Requirements

Open Questions

How to handle naming?

Option 1: Prioritize backwards compatibility.

Option 2: Prioritize consistency.

Scope

dkolsen-pgi commented May 7, 2022 • edited Loading

maddyscientist commented May 7, 2022

leofang commented May 7, 2022 • edited Loading

ekelsen commented May 9, 2022

duncanriach commented May 9, 2022

lilohuang commented May 16, 2022

bhack commented Jul 10, 2022

alliepiper commented Jul 25, 2022

gevtushenko commented Oct 5, 2023

alliepiper commented May 6, 2022 •

edited

Loading

dkolsen-pgi commented May 7, 2022 •

edited

Loading

leofang commented May 7, 2022 •

edited

Loading