Support limit, offset, top-k serial and parallel versions in codegen #1435
Conversation
The code looks good, but I haven't tried it out yet. I do have some questions (this one and some in the code):
- It seems like limit and offset are compile-time constants. That means the compiled plan cannot be reused for other limit/offset values. Is that correct?
@@ -417,7 +417,7 @@ void Aggregation::DoNullCheck(
      default: { break; }
    }
  }
  agg_is_null.ElseBlock("Agg.IfAggIsNotNull");
Is there a reason to remove the naming? I found all these names very helpful when looking at the IR. I know that all the string handling introduces some overhead; maybe we could drop them in Release mode.
Most students will probably use printfs to trace through the code. We still allow naming of the 'if' statements, and the IR tracks predecessors, which conveys the same information.
HANDLE_EXPLICIT_CALL_INST(peloton_sorter_sort,
                          peloton::codegen::util::Sorter::Sort)
HANDLE_EXPLICIT_CALL_INST(peloton_sorter_sortparallel,
                          peloton::codegen::util::Sorter::SortParallel)
HANDLE_EXPLICIT_CALL_INST(peloton_sorter_sorttopkparallel,
                          peloton::codegen::util::Sorter::SortTopKParallel)
I see you added more explicit call instructions for the bytecode interpreter. FYI: Once the interpreter is connected, it will print a debug message for every function call where no explicit call instruction exists, so it will be easy to find those.
  }

  // Generate loading code and save in cache
  val = load_func();
Where does the caching happen?
The loaded values are inserted into the cached_vars_ member variable that is probed earlier.
    codegen->CreateStore(next_limit, limit_count_ptr);
  } else {
    // Parallel mode. Atomically increment the count.
    next_limit = codegen->CreateAtomicRMW(
Do you know what this looks like in IR and what LLVM compiles it to? (I haven't looked at it yet; Google was not helpful.) Whatever it is, I have the feeling that the interpreter doesn't support it yet.
We generate LLVM's atomicrmw with sequential consistency. This may lower to different instructions depending on the architecture, but on the few Intel Skylake+ machines I tested, it uses lock xadd.
In the interpreter, you can use our peloton::atomic_add(), which boils down to a locked xadd.
      child_pipeline_(this, Pipeline::Parallelism::Serial) {
  // Aggregations happen serially (for now ...)
  pipeline.SetSerial();
      child_pipeline_(this, Pipeline::Parallelism::Flexible) {
Do I understand correctly: The child pipeline of the Orderby (before the hashing) can be parallel, but the order-by itself has to be serial because of the aggregations.
This says that the consumer portion of the order-by (i.e., the child pipeline) can be executed in parallel, and that the producer portion (i.e., the scanning of the sorter) will be serial. It says nothing about other operators in the pipeline.
src/codegen/util/sorter.cpp
Outdated
@@ -97,8 +131,11 @@ void Sorter::Sort() {

  timer.Stop();

  LOG_DEBUG("Sorted %zu tuples in %.2f ms", tuples_.size(),
            timer.GetDuration());
#ifndef NDEBUG
This should be #ifdef LOG_DEBUG_ENABLED
 * Sift down the element at the root of the heap while maintaining the heap
 * property.
 */
void HeapSiftDown();
What is the difference between this function and the STL's push_heap()?
In essence, they are the same. But push_heap alone doesn't give us the functionality we need: we need to remove the smallest (largest) element off the min (max) heap and insert a new element. We could use a pair of push_heap and pop_heap calls, but we've managed to achieve the same functionality with a single O(log n) operation.
@tcm-marcel Want to take a last crack at this? Don't want this to get stale.
Sorry this took so long. I think it is good to go!
@tcm-marcel I think this is stuck on the Jenkins mac build?
The Jenkins Mac build fails because no space is left on the hard drive.
@apavlo I am not familiar with the Jenkins Mac machine. But on Travis the Mac build succeeds.
@crd477 Can you check this?
Fixed with the quick fix of restarting the VM from a clean snapshot. The more complete fix of how to do this cleanly and regularly was still on my TODO list before I left for vacation last week, and I'll get back to it today.
The macOS Jenkins worker should be stable now. I grew the VM's disk and added a workspace cleanup cronjob as the other Jenkins workers have. |
…mu-db#1435)

* Initial limit commit
* Cleanup OrderByPlan
* Don't use SQL types for the result of the comparison when sorting
* Sorting for Top-K works. Probably slow.
* Optimize pop-then-push into single operation
* Fix lang::If to allow injection of custom then and else blocks
* Added early exit block to consumer context to allow operators to quick exit
* More comments
* Fast-exit limits
* Cleanup comments. Enabled parallel sorting.
* Fixed bytecode compilation. Added explicit call handlers for sorter to bytecode builder. Fixed parallel sorting top-k
* Set limit and offset to 0 when no limit/offset provided
* Fix offset
* Fix sort test in optimize to have well-defined order (some sort values were equal producing non-deterministic outputs). Fixed order by translator test to push a limit plan on top. This wasn't there before because we didn't need a limit
* Cleanup
* Fix limit without order-by clause
After #1385 was merged in, I resurrected LIMIT and OFFSET from my old branch, and implemented Top-K optimizations. These work for both serial and parallel execution, and are needed for TPC-C. Not quite ready for merging in, but folks can take a look at it.