-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove 'sample' parameter from stats::mean API #2389
base: branch-25.02
Are you sure you want to change the base?
Conversation
… KahanBabushkaNeumaierSum to strided summation
… into kahan_neumeier_sum
@@ -27,10 +27,9 @@ namespace stats { | |||
namespace detail { | |||
|
|||
template <typename Type, typename IdxType = int> | |||
void mean( | |||
Type* mu, const Type* data, IdxType D, IdxType N, bool sample, bool rowMajor, cudaStream_t stream) | |||
void mean(Type* mu, const Type* data, IdxType D, IdxType N, bool rowMajor, cudaStream_t stream) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason for removing the ability to compute a sample-based mean vs a population-based mean? This feature seems useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is correct for some statistical metrics like stddev or var, but IIUC that theory does not hold for the mean. The sample mean is the closest we have to the population mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both sample and population mean shall have 1/N factor. That gives the unbiased estimate. (This is unlike stdev or variance where we need to use different factor for sample and population mean).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Malte for this PR. I agree that using 1/(N-1)
factor for sample mean
is not correct, we should have a single option that has 1/N
factor. Therefore it makes sense to remove the sample
parameter.
One should still consider that this is a breaking change, so we can consider to postpone this PR for 24.10, and issue a warning in 24.08 if the sample param is used.
This was the primary reason I pointed this out. We need to be careful about updates to public apis specifically, because we can only assume they are being used downstream (and we need to check). |
@mfoerste4 do you plan for this PR to be ready for 25.02 or should we push to 25.04? I just want to make sure we're keeping this on the roadmap but being realistic about the timeline. |
Thanks @cjnolet for the reminder, I somehow completely lost track of this one. I believe I never added the warning in the first place, so I will update this PR (aiming for 25.04) and prepare a small PR with a warning in case |
@cjnolet , in order to have this integrated for 25.02 I have re-added the 'old' API and marked it as deprecated. This way we can decide to remove it later while already allowing people to adjust to the new API. I believe this is the cleaner approach which also allows removal of the 'breaking' tag. |
This PR removes the sample-parameter from the
raft::stats::mean
API to prevent people from using it by accident when for example computing the mean for a sampled variance computation.This also invalidates some of the testcases. Within raft only test-code is affected by this change as the active usage of the sample parameter was already removed in #2381.
This PR is based on #2381 but was separated for tracking purposes.
Note that this requires adaption of downstream libraries using the API. I am aware of at least one occurrence in
cuml
.