
Within-vector reductions #4628

Merged: 7 commits into master on Jun 25, 2020
Conversation

@abadams (Member) commented Feb 19, 2020

Various ISAs have vector instructions that perform a full or partial reduction across the vector lanes, and we have had no good way to reach these. Of particular note is the instruction variously referred to as udot/dp4a/vrmpy, which does a 4 x int8_t dot product into a 32-bit result, all potentially within a wider vector. This instruction is important for quantized neural networks.
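To make those semantics concrete, here is a scalar reference sketch of that operation (illustrative only, not code from this PR); the real instructions apply this per 32-bit lane of a wider vector:

    #include <cstdint>

    // 4 x int8_t dot product accumulated into a 32-bit result,
    // i.e. the per-lane semantics of udot/dp4a/vrmpy.
    int32_t dp4a_reference(const int8_t a[4], const int8_t b[4], int32_t acc) {
        for (int i = 0; i < 4; i++) {
            acc += int32_t(a[i]) * int32_t(b[i]);
        }
        return acc;
    }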

This PR reaches such ops by composing atomic() with vectorize(). These operations are viewed as atomic (i.e. the semantics are serial but unordered) reductions along the vectorized dimension.
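For example, a dot product can be scheduled like this (a minimal sketch; the input Funcs A and B and the length num_elems are assumed, and a near-identical pattern appears later in this thread):

    Func dot;
    RDom r(0, num_elems);
    dot() += cast<uint32_t>(A(r)) * B(r);
    // atomic() asserts the update can be reordered; vectorizing the
    // reduction variable then exposes a within-vector reduction.
    dot.update().atomic().vectorize(r, 16);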

I added a new IR node, VectorReduce, which reduces groups of adjacent lanes in a wide vector down to a narrower vector using some operator. This is equivalent to a tree of binary operations on slices of the vector, but that form is much harder to peephole-match in backends. The reduction operators are our primitive IR nodes that happen to be associative and commutative (add, mul, min, max, and, or). This node is introduced by the vectorize_loops pass when it would otherwise emit a pattern of the flavor f(ramp()/4) = f(ramp()/4) + some_vector, or, for total reductions, f(broadcast()) = f(broadcast()) + some_vector.
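As a scalar model of the node's semantics (an illustrative sketch assuming the Add operator, not Halide's actual implementation):

    #include <cstdint>
    #include <vector>

    // VectorReduce with op == Add: an input of output_lanes * factor
    // lanes is reduced to output_lanes lanes by summing each group of
    // `factor` adjacent lanes.
    std::vector<int32_t> vector_reduce_add(const std::vector<int32_t> &value,
                                           int output_lanes) {
        const int factor = (int)value.size() / output_lanes;
        std::vector<int32_t> out(output_lanes, 0);
        for (int i = 0; i < output_lanes; i++) {
            for (int j = 0; j < factor; j++) {
                out[i] += value[i * factor + j];
            }
        }
        return out;
    }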

Backends can specialize their handling of this node to hit the appropriate ops. It gets a little more complex than that because some of these ops can also incorporate an existing accumulator. Much of the new logic in this PR is in various backends. I added support for x86, arm, and cuda. I did not touch the hexagon backend, as I figured @pranavb-ca or @dsharletg would be better at handling that. For the arm udot/sdot instructions I added codegen that I believe is correct but again have no way to test it (our buildbots don't have these instructions, which need a qualcomm 855 or similar). I added a feature flag to guard the emission of these instructions. Maybe @pranavb-ca or @dsharletg can sanity-check with a quick test on an 855?

For d3d and opencl there are intrinsics for this that should reach the same underlying instruction (dp4a on nvidia cards), but I didn't want to drag this PR out further; those can be addressed later. Also related, but not considered here, are reductions along the warp via composing atomic() and gpu_lanes().

As a final note, atomic_vectorization is my favorite branch name to date.

@abadams (Member Author) commented Feb 19, 2020

FYI @alexreinking that cmake presubmit just caught me forgetting to add two tests to the cmakelist, so it was worthwhile.

@steven-johnson (Contributor):

Is this ready to review? (I'm assuming not, given the failed buildbots.)

@abadams (Member Author) commented Feb 20, 2020

Triaging the failed buildbots now. Looks like the main issue is an llvm bug that I'll need to work around: https://bugs.llvm.org/show_bug.cgi?id=44976

But there's also a crash on arm that I need to figure out. Feel free to wait until I get the buildbots green - this isn't urgent.

@abadams (Member Author) commented Feb 20, 2020

Craig Topper jumped on my bug report and so the llvm bug was just fixed in master. We just need to figure out what to do for earlier llvms. A workaround is probably still valuable if there's a simple one.

@steven-johnson (Contributor):

For now, you could just declare (and enforce) that this feature is only available in LLVM 11+

@abadams (Member Author) commented Feb 20, 2020

Might be hard to do, because this isn't a peephole optimization that's breaking. It happens for vanilla codegen of a VectorReduce IR node, and these could appear for arbitrary reasons. I think I'll just try padding out the vectors to the next power of two for llvms < 11.
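A scalar model of that padding idea (illustrative names and logic, assuming zero-padding is safe for multiplication; not the actual CodeGen_LLVM change):

    #include <cstdint>
    #include <vector>

    // Widening multiply of a non-power-of-two number of lanes, done by
    // zero-padding both operands up to the next power-of-two lane count
    // and discarding the padding lanes afterwards.
    std::vector<int32_t> widening_mul_padded(std::vector<int16_t> a,
                                             std::vector<int16_t> b) {
        const size_t n = a.size();        // e.g. 12 lanes
        size_t padded = 1;
        while (padded < n) padded <<= 1;  // round up to 16
        a.resize(padded, 0);
        b.resize(padded, 0);
        std::vector<int32_t> out(padded);
        for (size_t i = 0; i < padded; i++) {
            out[i] = int32_t(a[i]) * int32_t(b[i]);
        }
        out.resize(n);  // keep only the original lanes
        return out;
    }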

@abadams (Member Author) commented Feb 20, 2020

Actually, this bug triggers for plain widening multiplication of non-power-of-two vector sizes:

    ImageParam in1(Int(16), 1), in2(Int(16), 1);
    Func f;
    Var x;
    f(x) = cast<int32_t>(in1(x)) * in2(x);
    f.vectorize(x, 12);
    f.compile_jit(Target("x86-64-linux"));

Yikes. I guess nobody uses those. I'll have to inject a workaround into the Mul visitor

@abadams (Member Author) commented Feb 20, 2020

FYI @alinas for the just-fixed llvm bug, in case you encounter something similar. It goes away if you turn on the sse4.1 feature flag, so I doubt this ever triggers when building for real systems.

@alinas (Contributor) commented Feb 20, 2020

Got it, thanks for the heads-up.

@steven-johnson (Contributor):

Something is wonky with this PR -- it's been showing a travis test as in progress for days now. Maybe due to the unresolved conflicts?

@abadams (Member Author) commented Mar 4, 2020

I'll try a merge with master

@abadams (Member Author) commented Mar 14, 2020

This is now green and ready for review

@steven-johnson (Contributor):

(I know normally we don't bother rebasing commits, but in this case, maybe consider doing so -- it's spread over 80 or so commits.)

@abadams (Member Author) commented Mar 16, 2020

Sure. There are a few things I can probably carve off into separate PRs too

@@ -1472,6 +1472,14 @@ Value *CodeGen_LLVM::codegen(const Expr &e) {
value = nullptr;
e.accept(this);
internal_assert(value) << "Codegen of an expr did not produce an llvm value\n";

// Halide doesn't distinguish between scalars and vectors of size 1.
Contributor:

might be good to expand this comment: the code below is converting a vector of size 1 to a scalar?


#if LLVM_VERSION >= 90
if (output_lanes == 1 &&
// As of the release of llvm 10, the 64-bit experimental total
Contributor:

should these criteria be moved to "llvm_has_intrinsic"?

}; // namespace

class FindVectorizableExprsInAtomicNode : public IRMutator {
Scope<> poisoned_names;
Contributor:

I probably know what this means but might be good to comment about the definition of "poisoned".

@steven-johnson (Contributor):

> Sure. There are a few things I can probably carve off into separate PRs too

Yeah, with this amount of change, any easy carving will likely make our life simpler later (if we have to go back and debug injections)

@steven-johnson (Contributor) left a review:

(LGTM pending lots of nits)

Does the C++ backend need attention here? (Even if it's only injecting something that will deliberately fail?)

IMHO we're going to need some more detailed docs/examples (tutorial?) on using this as the details and limitations might not be immediately obvious.

}
break;
case VectorReduce::Mul:
interval = Interval::everything();
Contributor:

Nit: aren't there some cases we could exploit here? e.g., what if the type is float and we know the bounds are [0.0, 1.0]?

// lanes and double the bit-width.

int factor = op->value.type().lanes() / op->type.lanes();
if ((op->type.is_int() ||
Contributor:

style nit: this expr is really hard to read. Might be better to pull out as a separate expr, perhaps even like

bool has_reduce = is_int || is_uint || is_float;
if (!case1) has_reduce = false; // reason why
if (!case2) has_reduce = false; // reason why

etc

@@ -1100,6 +1256,10 @@ string CodeGen_ARM::mattrs() const {
arch_flags = "+sve";
}

if (target.has_feature(Target::ARMDotProd)) {
Contributor:

Does this need to be gated by LLVM_VERSION?

Member Author:

llvm 8 has it, so we're good


#if LLVM_VERSION >= 90
if (output_lanes == 1 &&
// As of the release of llvm 10, the 64-bit experimental total
Contributor:

Do they work in trunk?

Member Author:

No


if (factor > 2 && ((factor & 1) == 0)) {
// Factor the reduce into multiple stages. If we're going to
// be widen the type by 4x or more we should also factor the
Contributor:

widening

// Even if the load is bad, maybe we can lift the index
IRMutator::visit(op);

// TODO: tuples?
Contributor:

this TODO could probably use a bit more meat on the bone

Member Author:

Hrm, that's something I genuinely forgot to finish. Will need to add a test.

@@ -1022,6 +1431,7 @@ class VectorizeLoops : public IRMutator {
// Replace the var with a ramp within the body
Expr for_var = Variable::make(Int(32), for_loop->name);
Expr replacement = Ramp::make(for_loop->min, 1, extent->value);
stmt = for_loop->body;
Contributor:

...is this a pre-existing bug?

Member Author:

No, I think this additional line is a no-op, because the next line doesn't use stmt and just clobbers it. I'll delete it.

@@ -20,6 +20,8 @@ WEAK CpuFeatures halide_get_cpu_features() {
// features.set_available(halide_target_feature_armv7s);
// }

// TODO: add runtime detection for ARMDotProd extension
Contributor:

Open an issue on this

if (!t.has_feature(Target::CUDACapability61)) {
printf("Not running test because no cuda (with compute capability 6.1) is not enabled in the target\n");
return 0;
}
Contributor:

suggestion: print the target here

@@ -187,7 +187,7 @@ class SimdOpCheck : public SimdOpCheckTest {

// SSE 2

-for (int w = 2; w <= 4; w++) {
+for (int w : {2, 4}) {
Contributor:

I presume you intend to skip the w=3 case here

Member Author:

Yeah, LLVM does different things with different versions, and it's not a case worth testing anyway

@abadams (Member Author) commented Mar 16, 2020

Moving to a top-level comment: not finished until I address the "TODO: tuples?" in VectorizeLoops.cpp.

@joesavage (Contributor) commented Jun 2, 2020

Do we have a story in this pull request for hoisting the total reductions that typically follow these kinds of partial reduction operations out of the inner loop? Today, using this branch, I'm finding that Halide code like the following, which performs a dot product:

Func dot("dot");
RDom rv(0, num_elems);
dot() += cast(UInt(32), A_u8(rv)) * B_u8(rv);
dot.update().atomic().vectorize(rv, 16);

produces assembly like the following for AArch64, where the movi vector initialization and the addv horizontal reduction are in the innermost loop (since the VectorReduce target type is a scalar):

ldr	q0, [x12], #16
ldr	q1, [x10], #16
movi	v2.2d, #0x0
subs	x9, x9, #0x1
udot	v2.4s, v0.16b, v1.16b
addv	s0, v2.4s
fmov	w13, s0
add	w8, w13, w8
b.ne	374 <dot+0x374>

Presumably the same is also true when generating Hexagon's vrmpy and the like, producing code that doesn't make the most of the accumulating capabilities of these instructions. Is there some way to get Halide to continually accumulate into a single vector using this atomic().vectorize() paradigm, or is it expected that these capabilities make up a separate unit of work and thus would likely come from a separate pull request? The following sequence, which feels like it perhaps should work, currently hits an internal error ("Only Serial and Parallel For nodes should survive down to codegen"):

Func dot("dot");
Func partial("partial");
Var x("x"), y("y"), k("k");

// Define the function
const int vec = 16;
RDom rvi(0, vec, "rvi");
RDom rvo(0, num_elems/vec, "rvo");
partial(k) += cast(UInt(32), A_u8(vec * k + rvi)) * B_u8(vec * k + rvi);
dot() += partial(rvo);

// Schedule the function
partial.update().atomic().vectorize(rvi);
dot.update().atomic().vectorize(rvo, 4);

@abadams (Member Author) commented Jun 2, 2020

The scheduling directive for that is already in master: rfactor. It's a bit tricky to use unfortunately. Here's how you'd use it for a vectorized dot product:

// Define the function
Func dot("dot");
RDom r(0, num_elems);
dot() += cast(UInt(32), A_u8(r)) * B_u8(r);

RVar rvo, rvi;
Var k;
const int vec = 16;
dot.update().split(r, rvo, rvi, vec);
// In partial, each value of rvi gets a separate accumulator indexed by the new pure variable k.
Func partial = dot.update().rfactor(rvi, k);
partial.compute_root().vectorize(k).update().vectorize(k);

@joesavage (Contributor):

Thanks for the pointer towards rfactor. Unfortunately, though, it looks like this particular case might be a bit more complicated. If we simply use rfactor on a reduction domain of extent 16 as suggested, this implies that we want to compute 16 independent partial sums, producing a VectorReduce node with source type uint8x16 and target type uint32x16. Since this reads 16 elements from each input buffer and produces 16 outputs, it isn't really a VectorReduce in the true sense, and thus doesn't generate UDOT.

Ultimately, what we really want here is an rfactor of extent four – where each independent accumulation contains an inner loop of size four – and then to vectorize over these loops with a factor of 16 to produce a VectorReduce with source type uint8x16 and target type uint32x4. I had a stab at doing this with the following:

// Define the function
Func dot("dot");
RDom r(0, num_elems, "r");
dot() += cast(UInt(32), A(r)) * B(r);

// Schedule the function
Var k("k");
RVar rvo("rvo"), rvoo("rvoo"), rvio("rvio"), rvii("rvii"), fused("fused");
dot.update().split(r, rvo, rvii, 4).split(rvo, rvoo, rvio, 4);
Func partial = dot.update().rfactor(rvio, k);
partial.compute_root().vectorize(k);
partial.update().fuse(rvii, k, fused).atomic().vectorize(fused);

Unfortunately, though, while this does produce a UDOT instruction, it seems to be doing some really weird stuff such that it definitely isn't generating the desired result. I need to look into the 'why' in more detail, but I'm seeing lots of weird if statements in the pseudocode for the inner loop as a result of the vectorization on fused for whatever reason.

Happy to move this discussion into a separate issue if that's helpful as I don't want to drown out any actual review comments for the pull request, but I figure that getting this use case nailed down is probably quite important for this patch and I'm curious to see if there's a good answer here.

@abadams (Member Author) commented Jun 3, 2020

Current state of the branch is a mess, because I discovered a case where atomic nodes were being incorrectly stripped and had to add them back everywhere temporarily, resulting in CAS loops. I temporarily hacked them back off to look at this case.

Everything works out cleanly if you promise num_elems is a multiple of 16. When it's not you get tail cases. One problem is that both of those splits are GuardWithIf, because they're splits of an RVar with unknown extent. You can make the inner one a RoundUp by reassociating the splits:


    Var k("k");
    RVar rvo("rvo"), rvi("rvi"), rvio("rvio"), rvii("rvii"), fused("fused");
    dot.update().split(r, rvo, rvi, 16).split(rvi, rvio, rvii, 4);
    Func partial = dot.update().rfactor(rvio, k);
    partial.compute_root().vectorize(k);
    partial.update().fuse(rvii, k, fused).atomic().vectorize(fused);

That gives you the hot loop:

.LBB0_2:                                // %"for dot_intm.s1.r$x.rvo"
                                        // =>This Inner Loop Header: Depth=1
	ldr	q1, [x10], #16
	ldr	q2, [x11], #16
	subs	x9, x9, #1              // =1
	udot	v0.4s, v1.16b, v2.16b
	b.ne	.LBB0_2

If num_elems is not known to be a multiple of 16, you get tail cases and if statements, and the hot loop becomes:

.LBB0_2:                                // %true_bb
                                        //   in Loop: Header=BB0_4 Depth=1
	lsl	x2, x15, #4
	sub	x3, x2, x11
	sub	x2, x2, x13
	ldr	q0, [sp]
	ldr	q1, [x8, x3]
	ldr	q2, [x10, x2]
	udot	v0.4s, v1.16b, v2.16b
	str	q0, [sp]
.LBB0_3:                                // %after_bb
                                        //   in Loop: Header=BB0_4 Depth=1
	add	x15, x15, #1            // =1
	cmp	x15, x16
	b.eq	.LBB0_37
.LBB0_4:                                // %"for dot_intm.s1.r$x.rvo"
	sxtw	x2, w15
	lsl	x3, x2, #4
	add	w2, w3, #16             // =16
	cmp	w2, w12
	b.le	.LBB0_2

Not sure why LLVM sees the need to spill the accumulator (q0). But in any case we shouldn't have this if statement. I'll see if I can get it to go away.

@abadams (Member Author) commented Jun 3, 2020

Found the problem. Will be fixed in the final version of the branch. We were generating if conditions of the form likely(likely(foo)), which messed up a bunch of stuff.

@abadams abadams force-pushed the atomic_vectorization branch from ad55e00 to 0cce2b5 Compare June 11, 2020 22:52
@abadams abadams force-pushed the atomic_vectorization branch from 54fbad6 to f63c800 Compare June 13, 2020 02:02
@abadams (Member Author) commented Jun 13, 2020

ok, ready for more review

const VectorReduce *e = expr.as<VectorReduce>();

compare_scalar(op->op, e->op);
// We've already compared types, so it's enough to compare the value
Member:

Is type comparison missing? There is a comment about it, but I only see operator type and value comparisons. Is it somehow implicit?

Member Author:

types get compared before this function is called in compare_expr

src/IRMatch.cpp (outdated)

void visit(const VectorReduce *op) override {
const VectorReduce *e = expr.as<VectorReduce>();
if (result && e && op->op == e->op) {
Member:

Similar question about the type here: don't we need to compare number of lanes as well?

Member Author:

I think we need to compare types here. Will fix.

@@ -463,6 +463,31 @@ Expr lossless_cast(Type t, Expr e) {
return Expr();
}
}

if (const VectorReduce *red = e.as<VectorReduce>()) {
Member:

Nit: 'red' is a confusing name.

return -1;
}

if (!checker.atomics) {
Member:

Should be if(checker.vector_reduces)

Expr rhs = cast(dst_type, in(x * reduce_factor + r));
Expr rhs2 = cast(dst_type, in(x * reduce_factor + r + 32));

if (op == 4 || op == 5) {
Member:

It would be much more readable to use named values for reduction operator types

Member Author:

These aren't reduction operator types, they're the test cases in the switch statement below. I'll add a comment.

const int factor = red->value.type().lanes() / red->type.lanes();
switch (red->op) {
case VectorReduce::Add:
if (t.bits() >= 16 && factor < (1 << (t.bits() / 2))) {
Member:

I think a short comment would be really helpful here.

@@ -413,6 +423,10 @@ bool equal_helper(const BaseExprNode &a, const BaseExprNode &b) noexcept {
case IRNodeType::Shuffle:
return (equal_helper(((const Shuffle &)a).vectors, ((const Shuffle &)b).vectors) &&
equal_helper(((const Shuffle &)a).indices, ((const Shuffle &)b).indices));
case IRNodeType::VectorReduce:
return (((const VectorReduce &)a).op == ((const VectorReduce &)b).op &&
Member:

Same here.

Member Author:

Hrm, at this point we already know the types match (it's checked outside equal_helper), but we don't know that the types of the values match before recursively calling equal_helper, so I'd better check that.

Member Author:

I think the cast handler has the same bug

};

// A ramp with the lanes repeated (e.g. <0 0 2 2 4 4 6 6>)
struct InterleavedRamp {
Member:

This looks like a ramp(broadcast(a, repetitions), broadcast(b, repetitions), lanes). Could you please add a TODO so I can update it after merging?

Member Author:

Done

@@ -941,6 +1029,171 @@ class VectorSubs : public IRMutator {
return Allocate::make(op->name, op->type, op->memory_type, new_extents, op->condition, body, new_expr, op->free_function);
}

Stmt visit(const Atomic *op) override {
// Recognize a few special cases that we can handle as within-vector reduction trees.
do {
Member:

Always wanted to say that goto is more readable :)

Member Author:

I considered it, but having an explicit scope with breaks makes me more confident that RAII things are going to have the lifetime I expect

@@ -489,11 +490,17 @@ void ComputeModulusRemainder::visit(const Let *op) {
}

void ComputeModulusRemainder::visit(const Shuffle *op) {
-// It's possible that scalar expressions are extracting a lane of a vector - don't fail in this case, but stop
+// It's possible that scalar expressions are extracting a lane of
+// a vector - don't faiql in this case, but stop
Member:

Fail is still misspelled.

@abadams abadams merged commit 95e8982 into master Jun 25, 2020
@abadams (Member Author) commented Jun 25, 2020

woo!

@abadams abadams deleted the atomic_vectorization branch July 8, 2020 20:44