lazily compute for null count (seems to help high-cardinality aggregation) #6155
Conversation
Thank you @Rachelint -- is there any chance we can get a benchmark showing the effect of this PR?
I ran ClickBench in DataFusion locally with this PR; q32 seems 1.10~1.15x faster. After profiling, I found it is due to:
```rust
let computed_null_count = self.buffer.len() - self.buffer.count_set_bits();
v.store(computed_null_count as i64, Ordering::Release);
```
I wonder if using `Ordering::Relaxed` (https://doc.rust-lang.org/std/sync/atomic/enum.Ordering.html#variant.Relaxed) would make this PR potentially faster. Specifically, it might make the generated code simpler.

Also, was there a reason to remove `#[inline]`? Maybe that could account for the slowdown 🤔
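For illustration, a minimal sketch of what the suggested read path could look like with `Ordering::Relaxed` and `#[inline]` applied (the sentinel name, helper shape, and closure parameter are assumptions for this sketch, not the PR's actual code):

```rust
use std::sync::atomic::{AtomicI64, Ordering};

// Hypothetical sentinel for "null count not computed yet"
const UNINITIALIZED_NULL_COUNT: i64 = -1;

// #[inline] keeps the fast path (one relaxed load plus a compare)
// visible to the optimizer at call sites
#[inline]
fn cached_null_count(cache: &AtomicI64, compute: impl FnOnce() -> usize) -> usize {
    let cached = cache.load(Ordering::Relaxed);
    if cached != UNINITIALIZED_NULL_COUNT {
        return cached as usize;
    }
    // Slow path: compute and publish; racing threads would store
    // the same value, so Relaxed ordering suffices
    let computed = compute();
    cache.store(computed as i64, Ordering::Relaxed);
    computed
}
```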
(They would be similar on x86, just affecting reordering?)
I agree -- I was thinking about ways to reduce the overhead of using atomic instructions, inlining, etc. -- it may not be a good idea.
The atomic here is just a load and a store instruction? But `count_set_bits()` may not be suitable for inlining (or is it just a `__builtin` bit-counting intrinsic)? Maybe letting the compiler decide is OK.
> I wonder if using `Ordering::Relaxed` (https://doc.rust-lang.org/std/sync/atomic/enum.Ordering.html#variant.Relaxed) would make this PR potentially faster. Specifically, it might make the generated code simpler.
>
> Also, was there a reason to remove `#[inline]`? Maybe that could account for the slowdown 🤔
I am not sure whether we should use `Relaxed` or `Acquire` + `Release`. I switched to `Acquire` + `Release` because I am worried about the situation where an `ArrayRef` is held by multiple threads and they call the `null_count` function concurrently; if we use `Relaxed`, the `count_set_bits` computation could be performed many times...

Another reason is that I found the strongest ordering, `SeqCst`, used in arrow-cpp: https://github.com/apache/arrow/blob/187197c369058f7d1377c1b161c469a9e4542caf/cpp/src/arrow/array/data.cc#L206-L218
> An ArrayRef is held by multiple threads and they call the null_count function concurrently; if we use Relaxed, the count_set_bits computation could be performed many times...
No memory ordering will prevent this; you'd need some mutual exclusion primitive if this was the goal. However, it doesn't really matter if two threads race to compute the null count, as they'll just set it to the same value. As we aren't using the write to synchronise other memory operations, Relaxed would be perfectly valid here.
I do think the proposal to simplify the state is worth considering, as it will potentially remove a conditional load
> > An ArrayRef is held by multiple threads and they call the null_count function concurrently; if we use Relaxed, the count_set_bits computation could be performed many times...
>
> No memory ordering will prevent this; you'd need some mutual exclusion primitive if this was the goal. However, it doesn't really matter if two threads race to compute the null count, as they'll just set it to the same value. As we aren't using the write to synchronise other memory operations, Relaxed would be perfectly valid here.
>
> I do think the proposal to simplify the state is worth considering, as it will potentially remove a conditional load
Sorry, maybe I didn't state it clearly. Is it possible that, due to compiler reordering or a CPU cache coherence problem, the actual execution order becomes:

```rust
if cached_null_count != UNINITIALIZED_NULL_COUNT {
    return cached_null_count as usize;
}
let cached_null_count = v.load(Ordering::Relaxed);
let computed_null_count = self.buffer.len() - self.buffer.count_set_bits();
```

so that the computation ends up being performed one more time? Although it doesn't actually affect correctness, and it may be rare.
No, all processors have coherent caches, and compilers respect single-threaded data dependencies. You can't observe two different values for the same memory location simultaneously. The ordering concerns whether operations performed on other memory locations are guaranteed to be visible following this operation; we aren't using this as a synchronisation primitive, and so this doesn't matter.
> No, all processors have coherent caches, and compilers respect single-threaded data dependencies. You can't observe two different values for the same memory location simultaneously. The ordering concerns whether operations performed on other memory locations are guaranteed to be visible following this operation; we aren't using this as a synchronisation primitive, and so this doesn't matter.
OK, thanks, I misunderstood it.
It may hurt the CPU cache too? But I know little about the details. Compared to Relaxed, Acq-Rel would just affect reordering on x86, but on a weakly ordered machine it might affect the cache and would require a memory barrier or other instructions.
```rust
#[derive(Debug)]
pub enum NullCount {
    Eager(usize),
    Lazy(AtomicI64),
}
```
I feel the state here is a bit complicated: we have three states, `Eager`, `Lazy (initialized)`, and `Lazy (uninitialized)`, and we use both an enum and a sentinel value to differentiate them.

I wonder if we can simplify this with just two states, uninitialized and initialized: when we try to read an uninitialized value, we compute the count and set it to the initialized state.
```rust
use std::sync::atomic::{AtomicI64, Ordering};

// Sentinel for "not yet computed" (a real null count is never negative)
const UNINIT: i64 = -1;

struct NullCount {
    val: AtomicI64,
}

impl NullCount {
    fn get(&self, compute: impl FnOnce() -> i64) -> i64 {
        let val = self.val.load(Ordering::Relaxed);
        if val == UNINIT {
            // Compute lazily and cache; racing threads store the same value
            let computed = compute();
            self.val.store(computed, Ordering::Relaxed);
            computed
        } else {
            val
        }
    }
}
```
This way we only have two states to manage, and we also keep `NullCount` at 8 bytes instead of the current 24 bytes, which might help with performance.
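A hypothetical usage of the sketch above (the literal values are illustrative; the closure stands in for the elided parameters in the original pseudocode):

```rust
fn main() {
    let count = NullCount { val: AtomicI64::new(UNINIT) };
    // First call computes and caches the value...
    assert_eq!(count.get(|| 42), 42);
    // ...subsequent calls return the cached value without recomputing
    assert_eq!(count.get(|| unreachable!()), 42);
}
```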
> I feel the state here is a bit complicated: we have three states, `Eager`, `Lazy (initialized)`, and `Lazy (uninitialized)`, and we use both an enum and a sentinel value to differentiate them. I wonder if we can simplify this with just two states, uninitialized and initialized: when we try to read an uninitialized value, we compute the count and set it to the initialized state.
>
> This way we only have two states to manage, and we also keep `NullCount` at 8 bytes instead of the current 24 bytes, which might help with performance.
I don't want to make it so complicated either... I implemented it with two states at the beginning. But that first version seemed to make some queries slower (maybe noise?), and I encountered a strange performance problem, maybe related to atomics, in another PR of mine. So, trying my best to avoid the possible cost of atomics, I refactored it to this.
> So, trying my best to avoid the possible cost of atomics
An atomic with `Ordering::Relaxed` has no extra cost compared to directly loading/storing the value, as shown here: https://godbolt.org/z/4oreernqh; they compile to the same instructions. The regressions are probably not from atomics.

`Relaxed` ordering is sufficient here; stricter ordering won't help with correctness and is slower. If we really want to prevent multiple threads from counting the null count, we would have to use a `Mutex` or `RwLock`, which is a bit overkill for our use case, imo.
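A minimal sketch of the kind of side-by-side comparison the godbolt link presumably makes (an assumption about its contents, not a copy of it); on x86-64 both functions compile to a plain `mov`:

```rust
use std::sync::atomic::{AtomicI64, Ordering};

// Plain, non-atomic load
pub fn plain_get(v: &i64) -> i64 {
    *v
}

// Relaxed atomic load: no fences, no lock prefix on x86-64
pub fn relaxed_get(v: &AtomicI64) -> i64 {
    v.load(Ordering::Relaxed)
}
```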
> > So, trying my best to avoid the possible cost of atomics
>
> An atomic with `Ordering::Relaxed` has no extra cost compared to directly loading/storing the value, as shown here: https://godbolt.org/z/4oreernqh; they compile to the same instructions. The regressions are probably not from atomics.
>
> `Relaxed` ordering is sufficient here; stricter ordering won't help with correctness and is slower. If we really want to prevent multiple threads from counting the null count, we would have to use a `Mutex` or `RwLock`, which is a bit overkill for our use case, imo.
😄 Thanks, I indeed don't know enough about atomics.
I think this is an interesting angle, but I do wonder if there is something fishy in what DataFusion is doing here. In particular, whilst lazily computing null masks will save cycles, array slicing is still a non-trivial operation. I wonder if the issue might be that the algorithm in DF is relying on slicing on a hot path where it would be better off tracking offsets and then using one of the selection kernels. I worry it is going to be hard to reason about the downstream implications of introducing an additional indirection for accessing the null buffer, and there might be less disruptive ways to get the same or better return in DF. Most of the kernels in arrow-rs make use of the null counts.
I would say there is a lot of room for improvement. What is happening is that, for high-cardinality aggregates, the output of the hash aggregate operation is currently one giant contiguous RecordBatch which is then sliced. There is more detail in apache/datafusion#9562 (and @JasonLi-cn was looking into improving it); however, it is tricky, as doing so would imply that the intermediate state of the group keys, the hash table, and the aggregates would need to be chunked. This isn't impossible, just non-trivial.
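For context, a sketch of the slicing pattern being described, using arrow's `RecordBatch::slice`; the chunking function and batch size are illustrative, not DataFusion's actual code:

```rust
use arrow_array::RecordBatch;

// Emit one giant contiguous batch as fixed-size, zero-copy slices.
// Every slice still carries its own null-count bookkeeping, which is
// where lazy computation was hoped to help.
fn emit_in_chunks(giant: &RecordBatch, batch_size: usize) -> Vec<RecordBatch> {
    (0..giant.num_rows())
        .step_by(batch_size)
        .map(|offset| {
            let len = batch_size.min(giant.num_rows() - offset);
            giant.slice(offset, len)
        })
        .collect()
}
```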
Right, so the question becomes whether the proposed optimisation is overfitting to a particular implementation that is known to be at least somewhat degenerate. Is this a general-purpose improvement, or just something that helps a single use case, potentially to the detriment of others?
Yes, I agree I am not yet convinced that this is a good general optimization for arrow-rs. I realize my last message was ambiguous about what I thought the best solution was. In particular, I think we would need to show this PR doesn't slow anything else down. There is at least one small cost: it increases the size of arrays by 8 bytes, I think. In general, I do think making the change in DataFusion is the better option, and we would pursue that change regardless of this particular PR in arrow.
In my opinion we should evaluate this change on its own merits (would it be good for all arrow-rs users?). Making DataFusion aggregates faster is a parallel track of work (where @Rachelint is also exploring other interesting ideas, such as apache/datafusion#11943) which, while related, is not a sufficient justification in itself for this work.
We should indeed evaluate whether it brings general benefits for arrow-rs. Maybe I can try to find the related issues about why
Running the arrow benchmarks (
Marking this as draft to signify it isn't waiting for more review. @Rachelint has some very promising other PRs downstream in DataFusion that might make an optimization such as this less relevant.
I'm going to close this, as it looks like the consensus is that we should make the proposed changes in DF first; we can always reopen this / create a new PR if we decide to continue with this down the line.
In fact @Rachelint found a better way (implement chunked storage at a lower level to avoid the need to slice at all).
Which issue does this PR close?
Closes #6146
Rationale for this change
See #6146
What changes are included in this PR?
Lazily compute the null count when it is expensive to calculate.
Are there any user-facing changes?
No.