Within-vector reductions #4628
Conversation
FYI @alexreinking: that cmake presubmit just caught me forgetting to add two tests to the CMakeLists, so it was worthwhile.

Is this ready to review? (I'm assuming not, from the failed buildbots.)

Triaging the failed buildbots now. Looks like the main issue is an llvm bug that I'll need to work around: https://bugs.llvm.org/show_bug.cgi?id=44976 But there's also a crash on arm that I need to figure out. Feel free to wait until I get the buildbots green - this isn't urgent.

Craig Topper jumped on my bug report, and so the llvm bug was just fixed in master. We just need to figure out what to do for earlier llvms. A workaround is probably still valuable if there's a simple one.

For now, you could just declare (and enforce) that this feature is only available in LLVM 11+.

Might be hard to do, because this isn't a peephole optimization that's breaking. It happens for vanilla codegen of a VectorReduce IR node, and these could appear for arbitrary reasons. I think I'll just try padding out the vectors to the next power of two for llvms < 11.
Actually, this bug triggers for just widening multiplication of non-power-of-two vector sizes:

Yikes. I guess nobody uses those. I'll have to inject a workaround into the Mul visitor.

FYI @alinas for the just-fixed llvm bug, in case you encounter something similar. It goes away if you turn on the sse4.1 feature flag, so I doubt this ever triggers when building for real systems.

Got it, thanks for the heads-up.

Something is wonky with this PR -- it's been showing a travis test as in progress for days now. Maybe due to the unresolved conflicts?

I'll try a merge with master.

This is now green and ready for review.

(I know normally we don't bother rebasing commits, but in this case, maybe consider doing so -- it's spread over 80 or so commits.)

Sure. There are a few things I can probably carve off into separate PRs too.
src/CodeGen_LLVM.cpp
Outdated
@@ -1472,6 +1472,14 @@ Value *CodeGen_LLVM::codegen(const Expr &e) {
    value = nullptr;
    e.accept(this);
    internal_assert(value) << "Codegen of an expr did not produce an llvm value\n";

    // Halide doesn't distinguish between scalars and vectors of size 1.
might be good to expand this comment: the code below is converting a vector of size 1 to a scalar?
src/CodeGen_LLVM.cpp
Outdated
#if LLVM_VERSION >= 90
        if (output_lanes == 1 &&
            // As of the release of llvm 10, the 64-bit experimental total
should these criteria be moved to "llvm_has_intrinsic"?
};  // namespace

class FindVectorizableExprsInAtomicNode : public IRMutator {
    Scope<> poisoned_names;
I probably know what this means but might be good to comment about the definition of "poisoned".
Yeah, with this amount of change, any easy carving will likely make our life simpler later (if we have to go back and debug injections).
(LGTM pending lots of nits)
Does the C++ backend need attention here? (Even if it's only injecting something that will deliberately fail?)
IMHO we're going to need some more detailed docs/examples (tutorial?) on using this as the details and limitations might not be immediately obvious.
    }
    break;
case VectorReduce::Mul:
    interval = Interval::everything();
Nit: aren't there some cases we could exploit here? e.g., what if the type is float and we know the bounds are [0.0, 1.0]?
src/CodeGen_ARM.cpp
Outdated
// lanes and double the bit-width.

int factor = op->value.type().lanes() / op->type.lanes();
if ((op->type.is_int() ||
style nit: this expr is really hard to read. Might be better to pull it out as a separate expr, perhaps even like:

    bool has_reduce = is_int || is_uint || is_float;
    if (!case1) has_reduce = false;  // reason why
    if (!case2) has_reduce = false;  // reason why

etc.
@@ -1100,6 +1256,10 @@ string CodeGen_ARM::mattrs() const {
        arch_flags = "+sve";
    }

    if (target.has_feature(Target::ARMDotProd)) {
Does this need to be gated by LLVM_VERSION?
llvm 8 has it, so we're good
src/CodeGen_LLVM.cpp
Outdated
#if LLVM_VERSION >= 90
        if (output_lanes == 1 &&
            // As of the release of llvm 10, the 64-bit experimental total
Do they work in trunk?
No
src/CodeGen_LLVM.cpp
Outdated
if (factor > 2 && ((factor & 1) == 0)) {
    // Factor the reduce into multiple stages. If we're going to
    // be widen the type by 4x or more we should also factor the
widening
src/VectorizeLoops.cpp
Outdated
// Even if the load is bad, maybe we can lift the index
IRMutator::visit(op);

// TODO: tuples?
this TODO could probably use a bit more meat on the bone
Hrm, that's something I genuinely forgot to finish. Will need to add a test.
src/VectorizeLoops.cpp
Outdated
@@ -1022,6 +1431,7 @@ class VectorizeLoops : public IRMutator {
    // Replace the var with a ramp within the body
    Expr for_var = Variable::make(Int(32), for_loop->name);
    Expr replacement = Ramp::make(for_loop->min, 1, extent->value);
    stmt = for_loop->body;
...is this a pre-existing bug?
No, I think this additional line is a no-op, because the next line doesn't use stmt and just clobbers it. I'll delete it.
@@ -20,6 +20,8 @@ WEAK CpuFeatures halide_get_cpu_features() {
    //     features.set_available(halide_target_feature_armv7s);
    // }

    // TODO: add runtime detection for ARMDotProd extension
Open an issue on this
if (!t.has_feature(Target::CUDACapability61)) {
    printf("Not running test because no cuda (with compute capability 6.1) is not enabled in the target\n");
    return 0;
}
suggestion: print the target here
@@ -187,7 +187,7 @@ class SimdOpCheck : public SimdOpCheckTest {

    // SSE 2

-    for (int w = 2; w <= 4; w++) {
+    for (int w : {2, 4}) {
I presume you intend to skip the w=3 case here
Yeah, LLVM does different things with different versions, and it's not a case worth testing anyway
Moving to top-level comment: not finished until I address the "TODO: tuples?" in VectorizeLoops.cpp.
Do we have a story in this pull request for hoisting the total reductions that typically follow these kinds of partial reduction operations outside the inner loop? Today, using this branch, I'm finding that Halide code like the following to perform the dot product:

    Func dot("dot");
    RDom rv(0, num_elems);
    dot() += cast(UInt(32), A_u8(rv)) * B_u8(rv);
    dot.update().atomic().vectorize(rv, 16);

produces assembly code like the following for AArch64, where the
Presumably the same is also true when generating Hexagon's

    Func dot("dot");
    Func partial("partial");
    Var x("x"), y("y"), k("k");

    // Define the function
    const int vec = 16;
    RDom rvi(0, vec, "rvi");
    RDom rvo(0, num_elems/vec, "rvo");
    partial(k) += cast(UInt(32), A_u8(vec * k + rvi)) * B_u8(vec * k + rvi);
    dot() += partial(rvo);

    // Schedule the function
    partial.update().atomic().vectorize(rvi);
    dot.update().atomic().vectorize(rvo, 4);
The scheduling directive for that is already in master: rfactor. It's a bit tricky to use, unfortunately. Here's how you'd use it for a vectorized dot product:
Thanks for the pointer towards rfactor. Ultimately, what we really want here is an

    // Define the function
    Func dot("dot");
    RDom r(0, num_elems, "r");
    dot() += cast(UInt(32), A(r)) * B(r);

    // Schedule the function
    Var k("k");
    RVar rvo("rvo"), rvoo("rvoo"), rvio("rvio"), rvii("rvii"), fused("fused");
    dot.update().split(r, rvo, rvii, 4).split(rvo, rvoo, rvio, 4);
    Func partial = dot.update().rfactor(rvio, k);
    partial.compute_root().vectorize(k);
    partial.update().fuse(rvii, k, fused).atomic().vectorize(fused);

Unfortunately, though, while this does produce a

Happy to move this discussion into a separate issue if that's helpful, as I don't want to drown out any actual review comments for the pull request, but I figure that getting this use case nailed down is probably quite important for this patch, and I'm curious to see if there's a good answer here.
Current state of the branch is a mess, because I discovered a case where atomic nodes were being incorrectly stripped and had to add them back everywhere temporarily, resulting in CAS loops. I temporarily hacked them back off to look at this case. Everything works out cleanly if you promise num_elems is a multiple of 16. When it's not, you get tail cases. One problem is that both of those splits are GuardWithIf, because they're splits of an RVar with unknown extent. You can make the inner one a RoundUp by reassociating the splits:

That gives you the hot loop:

If num_elems is not known to be a multiple of 16, you get tail cases and if statements, and the hot loop becomes:

Not sure why LLVM sees the need to spill the accumulator (q0). But in any case we shouldn't have this if statement. I'll see if I can get it to go away.

Found the problem. Will be fixed in the final version of the branch. We were generating if conditions of the form likely(likely(foo)), which messed up a bunch of stuff.
ok, ready for more review
const VectorReduce *e = expr.as<VectorReduce>();

compare_scalar(op->op, e->op);
// We've already compared types, so it's enough to compare the value
Is type comparison missing? There is a comment about it, but I only see operator type and value comparisons. Is it somehow implicit?
types get compared before this function is called in compare_expr
src/IRMatch.cpp
Outdated
void visit(const VectorReduce *op) override {
    const VectorReduce *e = expr.as<VectorReduce>();
    if (result && e && op->op == e->op) {
Similar question about the type here: don't we need to compare number of lanes as well?
I think we need to compare types here. Will fix.
src/IROperator.cpp
Outdated
@@ -463,6 +463,31 @@ Expr lossless_cast(Type t, Expr e) {
            return Expr();
        }
    }

    if (const VectorReduce *red = e.as<VectorReduce>()) {
Nit: 'red' is confusing name.
    return -1;
}

if (!checker.atomics) {
Should be if(checker.vector_reduces)
Expr rhs = cast(dst_type, in(x * reduce_factor + r));
Expr rhs2 = cast(dst_type, in(x * reduce_factor + r + 32));

if (op == 4 || op == 5) {
It would be much more readable to use named values for reduction operator types
These aren't reduction operator types, they're the test cases in the switch statement below. I'll add a comment.
src/IROperator.cpp
Outdated
const int factor = red->value.type().lanes() / red->type.lanes();
switch (red->op) {
case VectorReduce::Add:
    if (t.bits() >= 16 && factor < (1 << (t.bits() / 2))) {
I think a short comment would be really helpful here.
@@ -413,6 +423,10 @@ bool equal_helper(const BaseExprNode &a, const BaseExprNode &b) noexcept {
    case IRNodeType::Shuffle:
        return (equal_helper(((const Shuffle &)a).vectors, ((const Shuffle &)b).vectors) &&
                equal_helper(((const Shuffle &)a).indices, ((const Shuffle &)b).indices));
    case IRNodeType::VectorReduce:
        return (((const VectorReduce &)a).op == ((const VectorReduce &)b).op &&
Same here.
Hrm, at this point we already know the types match (it's checked outside equal_helper) but we don't know that the types of value match before recursively calling equal_helper, so I'd better check that
I think the cast handler has the same bug
};

// A ramp with the lanes repeated (e.g. <0 0 2 2 4 4 6 6>)
struct InterleavedRamp {
This looks like a ramp(broadcast(a, repetitions), broadcast(b, repetitions), lanes), could you please add a TODO, so I can update it after merging?
Done
@@ -941,6 +1029,171 @@ class VectorSubs : public IRMutator {
        return Allocate::make(op->name, op->type, op->memory_type, new_extents, op->condition, body, new_expr, op->free_function);
    }

    Stmt visit(const Atomic *op) override {
        // Recognize a few special cases that we can handle as within-vector reduction trees.
        do {
Always wanted to say that goto is more readable :)
I considered it, but having an explicit scope with breaks makes me more confident that RAII things are going to have the lifetime I expect
src/ModulusRemainder.cpp
Outdated
@@ -489,11 +490,17 @@ void ComputeModulusRemainder::visit(const Let *op) {
}

void ComputeModulusRemainder::visit(const Shuffle *op) {
-    // It's possible that scalar expressions are extracting a lane of a vector - don't fail in this case, but stop
+    // It's possible that scalar expressions are extracting a lane of
+    // a vector - don't faiql in this case, but stop
Fail is still misspelled.
woo!
Various ISAs have vector instructions that perform a full or partial reduction across the vector lanes, and we have had no good way to reach these. Of particular note is the instruction variously referred to as udot/dp4a/vrmpy, which does a 4 x int8_t dot product into a 32-bit result, all potentially within a wider vector. This instruction is important for quantized neural networks.
This PR reaches such ops by composing atomic() with vectorize(). These operations are viewed as atomic (i.e. semantics are serial, but unordered) reductions along the vectorized dimension.
I added a new IR node, VectorReduce, which reduces groups of adjacent lanes in a wide vector down to a narrower vector using some operator. This is equivalent to a binary operation tree on slices of the vector, but that's much harder to peephole match in backends. The reduction operators are our primitive IR nodes that happen to be associative and commutative (add, mul, min, max, and, or). This node is introduced by the vectorize_loops pass when it would otherwise emit a pattern of the flavor:

    f(ramp()/4) = f(ramp()/4) + some_vector;

or, for total reductions:

    f(broadcast()) = f(broadcast()) + some_vector;
Backends can specialize their handling of this node to hit the appropriate ops. It gets a little more complex than that, because some of these ops can also incorporate an existing accumulator. Much of the new logic in this PR is in the various backends. I added support for x86, arm, and cuda. I did not touch the hexagon backend, as I figured @pranavb-ca or @dsharletg would be better at handling that. For the arm udot/sdot instructions I added codegen that I believe is correct, but I have no way to test it (our buildbots don't have these instructions, which need a qualcomm 855 or similar). I added a feature flag to guard the emission of these instructions. Maybe @pranavb-ca or @dsharletg can sanity-check with a quick test on an 855?
For d3d and opencl there are intrinsics for this that should reach the same underlying instruction (dp4a on nvidia cards) but I didn't want to continue to drag on this PR. Those can be addressed later. Also related but not considered here is reductions along the warp via composing atomic() and gpu_lanes().
As a final note, atomic_vectorization is my favorite branch name to date.