[microNPU][1] Add affine analysis structures for the cascader #9458

Merged: 4 commits merged on Dec 1, 2021


mbaret (Contributor) commented Nov 5, 2021

RFC: apache/tvm-rfcs#37
Issue: #9429

The cascader relies heavily on being able to determine data dependencies between operators. This is so that it can calculate how stripes should be propagated through a cascade.

To do this, two data structures are defined: StripeConfig and Propagator. StripeConfig stores information for how a tensor should be broken up into stripes and executed. Propagator transforms a StripeConfig using an affine transform matrix, allowing an input StripeConfig for an operator to be determined by 'propagating' the output StripeConfig.
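To make the two structures concrete, here is a minimal C++ sketch. It is not the merged implementation: the field names follow the StripeConfig dumps later in this thread, while the homogeneous-matrix layout (one row per input axis plus a constant column) and the `Propagate` method are assumptions for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for StripeConfig.
struct StripeConfig {
  std::vector<int> shape;      // size of one stripe in each axis
  std::vector<int> extent;     // size of the whole region being striped
  std::vector<float> strides;  // distance between stripe starts (may be fractional)
  std::vector<int> order;      // iteration order of the axes
  std::vector<int> stripes;    // number of stripes in each axis
  std::vector<int> offset;     // start of the first stripe (may be negative)
};

// Hypothetical Propagator: maps an operator's output StripeConfig onto its input.
struct Propagator {
  // (in_dims + 1) x (out_dims + 1) homogeneous affine matrix; the last column
  // holds constant terms (assumed layout).
  std::vector<std::vector<float>> transform;
  std::vector<int> offset;  // extra shift added to the propagated offset

  StripeConfig Propagate(const StripeConfig& out) const {
    const std::size_t in_dims = transform.size() - 1;
    const std::size_t out_dims = transform[0].size() - 1;
    assert(offset.size() == in_dims && out.shape.size() == out_dims);
    StripeConfig in;
    for (std::size_t i = 0; i < in_dims; ++i) {
      const std::vector<float>& row = transform[i];
      // Shape and extent use the linear part plus the constant column; strides
      // use only the linear part; offset adds the propagator's own offset.
      float shape = row[out_dims], extent = row[out_dims], stride = 0.0f;
      float off = static_cast<float>(offset[i]);
      int order = 0, stripes = 0;
      for (std::size_t j = 0; j < out_dims; ++j) {
        shape += row[j] * out.shape[j];
        extent += row[j] * out.extent[j];
        stride += row[j] * out.strides[j];
        off += row[j] * out.offset[j];
        // Order and stripe counts are only re-routed, not scaled (assumes each
        // input axis depends on at most one output axis).
        if (row[j] != 0.0f) {
          order += out.order[j];
          stripes += out.stripes[j];
        }
      }
      in.shape.push_back(static_cast<int>(std::ceil(shape)));
      in.extent.push_back(static_cast<int>(std::ceil(extent)));
      in.strides.push_back(stride);  // kept fractional on purpose
      in.order.push_back(order);
      in.stripes.push_back(stripes);
      in.offset.push_back(static_cast<int>(std::floor(off)));
    }
    return in;
  }
};
```

For example, a 1D convolution with a 3-wide kernel and stride 2 could be described by the transform {{2, 1}, {0, 1}}: a 4-element output stripe then needs a 9-element input stripe (2*4 + 1), and input stripes start 8 elements apart. A 2D identity transform with offset {-1, -1} reproduces the T_in config from the padding discussion further down.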

By chaining together Propagators, we can analyse how data dependencies vary throughout a cascade and therefore calculate the memory requirements (and approximate the performance).
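Chaining Propagators can be pictured as multiplying their homogeneous matrices. The sketch below assumes the same hypothetical layout (constants in the last column) and ignores any ceil-rounding a real propagation step might apply between operators; the composed result matches step-by-step propagation exactly when every intermediate value is an integer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Homogeneous affine matrix; the last column holds constant terms (assumed layout).
using Matrix = std::vector<std::vector<float>>;

// Compose(a, b) applied to x equals a applied to (b applied to x).
Matrix Compose(const Matrix& a, const Matrix& b) {
  assert(!a.empty() && a[0].size() == b.size());
  Matrix c(a.size(), std::vector<float>(b[0].size(), 0.0f));
  for (std::size_t i = 0; i < a.size(); ++i)
    for (std::size_t k = 0; k < b.size(); ++k)
      for (std::size_t j = 0; j < b[0].size(); ++j)
        c[i][j] += a[i][k] * b[k][j];
  return c;
}

// Map a stripe shape through a transform: linear part plus the constant column.
std::vector<float> ApplyToShape(const Matrix& m, const std::vector<float>& shape) {
  assert(!m.empty() && m[0].size() == shape.size() + 1);
  std::vector<float> out(m.size() - 1, 0.0f);
  for (std::size_t i = 0; i + 1 < m.size(); ++i) {
    for (std::size_t j = 0; j < shape.size(); ++j) out[i] += m[i][j] * shape[j];
    out[i] += m[i][shape.size()];
  }
  return out;
}
```

With the 3-wide, stride-2 convolution transform {{2, 1}, {0, 1}}, two chained convolutions compose to 4x + 3, so a 4-element output stripe needs 19 input elements, the same as propagating one step at a time (4 -> 9 -> 19).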

mbaret (Contributor Author) commented Nov 5, 2021

mbaret changed the title [ETHOSU][1] Add affine analysis structures for the cascader [microNPU][1] Add affine analysis structures for the cascader Nov 8, 2021
mbaret (Contributor Author) commented Nov 9, 2021

also cc @csullivan

tqchen (Member) commented Nov 9, 2021

Minor note: not needed for now, but it may be helpful to take a look at https://github.com/apache/tvm/blob/main/include/tvm/arith/iter_affine_map.h to see if there are any utils that can be reused.

mbaret (Contributor Author) commented Nov 9, 2021

> minor note: not needed for now, but it may be helpful to take a look at https://github.com/apache/tvm/blob/main/include/tvm/arith/iter_affine_map.h to see if there is any utils that can be reused

I did spend a little bit of time looking at this after @junrushao1994 made me aware of it. It looks like it'd make for a good integration point for a 'v2' - especially once we upgrade to TensorIR. However, I think there are a few representational issues that would make it hard to directly leverage in the current design.

mbaret force-pushed the ethosu-cascader-1 branch 2 times, most recently from b878cb5 to c71c421 November 18, 2021 13:40
csullivan (Contributor) left a comment

Great stuff @mbaret, this is only a partial review but I had a couple of questions and comments. I'll hopefully be able to wrap the review up on Monday. Apologies for the slow turnaround, but thanks for the great contribution!

* error will compound when we need to multiply the strides by the number of
* stripes along a given axis.
*/
inline std::vector<float> GetStrides() const { return strides_; }
Contributor

Is there an example that we can write or link to in this docstring to illustrate the error accumulation that occurs from using ceildiv-style rounding? I didn't quite follow the current description. By fractional striding, are you trying to describe the case where the stripe shape dims are not divisors of the input shape dims?

Contributor Author

The best example of this is an upscale operation. If we have a 2x2 upscale and choose to stripe the output in 3x3 stripes, the input stripes will get a fractional stride of 3/2. We can't just ceil-round this; otherwise, the further along the axis we stride, the further the stripe positions drift from the truth. I'll add this to the docs.
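The drift is easy to see along a single axis. In this sketch (hypothetical helper names), output stripe k starts at row 3k, so the exact input start is floor(1.5k); rounding the stride up to 2 first gives 2k instead, and the gap grows by roughly half a row per stripe.

```cpp
#include <cassert>
#include <cmath>

// Input start index of stripe k when the fractional stride is kept.
int StartWithFractionalStride(int k, float stride) {
  assert(k >= 0 && stride > 0.0f);
  return static_cast<int>(std::floor(k * stride));
}

// Input start index of stripe k when the stride is ceil-rounded first.
int StartWithCeiledStride(int k, float stride) {
  assert(k >= 0 && stride > 0.0f);
  return k * static_cast<int>(std::ceil(stride));
}
```

With a stride of 1.5, stripe 4 starts at row 6 exactly but at row 8 with the rounded stride; by stripe 10 the two disagree by 5 rows (15 vs 20).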

Side note: I have considered re-expressing this as a rational number rather than a float, but I think that can be left as a future improvement.
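A rational stride, as suggested in the side note, avoids both ceil-rounding and float error when computing stripe starts. A minimal sketch, assuming a hypothetical `RationalStride` type that stores the stride as num/den in lowest terms and uses integer floor division:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical exact stride representation: num / den, kept in lowest terms.
struct RationalStride {
  std::int64_t num;
  std::int64_t den;

  RationalStride(std::int64_t n, std::int64_t d) : num(n), den(d) {
    assert(d > 0);
    const std::int64_t g = Gcd(n, d);
    num /= g;
    den /= g;
  }

  // Start index of stripe k: floor(k * num / den), computed in integers.
  std::int64_t StartOf(std::int64_t k) const {
    const std::int64_t p = k * num;
    return p >= 0 ? p / den : -((-p + den - 1) / den);
  }

 private:
  // Euclid's algorithm; b is always positive here.
  static std::int64_t Gcd(std::int64_t a, std::int64_t b) {
    if (a < 0) a = -a;
    while (b != 0) {
      const std::int64_t t = a % b;
      a = b;
      b = t;
    }
    return a;
  }
};
```

`RationalStride(3, 2).StartOf(4)` returns 6, matching the exact start in the 2x2 upscale example.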

*
* The size of that stripe in each axis is the 'shape'. The strides are how far
* you should move between stripes, so also (4, 4) for a simple non-overlapping
* tiling. However, we explore some overlapping scheduling options so shape != strides
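Along a single axis, stripe k covers [offset + k*stride, offset + k*stride + shape), so neighbouring stripes overlap whenever shape > strides. A small sketch with hypothetical helpers (integer strides only, for simplicity):

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: half-open [begin, end) interval covered by stripe k along one axis.
std::pair<int, int> StripeInterval(int k, int shape, int stride, int offset = 0) {
  assert(k >= 0 && shape > 0 && stride > 0);
  const int begin = offset + k * stride;
  return {begin, begin + shape};
}

// Number of elements shared by neighbouring stripes; 0 when shape <= stride.
int Overlap(int shape, int stride) {
  assert(shape > 0 && stride > 0);
  return shape > stride ? shape - stride : 0;
}
```

For the simple (4, 4) tiling, stripe 1 covers [4, 8) with no overlap. A 3-wide, stride-1 convolution fed by 4-element output stripes would instead need 6-element input stripes still spaced 4 apart, so neighbouring stripes share 2 elements.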
Contributor

Ping on earlier comment ⬆️: an example like one of these, where the stride value is non-integral, would be nice.

Contributor Author

I've added an example based on a 2x2 upscale.

*
* Finally, the 'offset' tells us where to start the first stripe. In this simple
* case the offset is just (0, 0), but in something like a padding operation we
* may want to start from a negative index, which is captured by the offset.
Contributor

A note about how negative indexing is handled for an operation would be helpful. For example, the stripe config between two operations Op1 and Op2 will depend on the padding needed by Op2, but will influence where Op1 writes its output in memory. Is that correct?

Contributor Author

I've updated the doc to use slice as the example here because I think that's easier to follow. Regarding the padding case, it's a bit of a challenge to explain without a diagram but I'll give it a go here.

Let's say we have an op A that represents a symmetric pad by 1 and two tensors T_in and T_out such that T_in -> A -> T_out. If T_in has shape (4, 4) then T_out will have shape (6, 6) after the padding. Now, we choose a StripeConfig for T_out which is equivalent to (2, 2) tiling:

StripeConfig T_out = 
{
  shape=[2, 2],
  extent=[6, 6],
  strides=[2, 2],
  order=[1, 2],
  stripes=[3, 3],
  offset=[0, 0],
}

The question then is, what StripeConfig will need to be produced when we propagate to T_in? Well, if we state that any part of a stripe that lies outside of the tensor bounds can be ignored, then the input StripeConfig should be this:

StripeConfig T_in =
{
  shape=[2, 2],
  extent=[6, 6],
  strides=[2, 2],
  order=[1, 2],
  stripes=[3, 3],
  offset=[-1, -1],
}

This will effectively 'overlay' the output StripeConfig on the (4, 4) input tensor T_in, but such that the outer 1-wide padding margin is always out-of-bounds (either <0 or >=4 in either axis). This way, reads will never be generated for the padding, but they will be for the 'interior'.
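The T_out -> T_in step and the resulting 'overlay' can be checked along one axis with a short sketch. It assumes integer strides and two hypothetical helpers: `PropagateThroughPad`, which only shifts the offset (a general Propagator could express this as an identity transform plus a per-axis offset of -1), and `ClippedStripe`, which drops the out-of-bounds part of each stripe.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Simplified stand-in for StripeConfig (integer strides only).
struct StripeConfig {
  std::vector<int> shape, extent, strides, order, stripes, offset;
};

// Hypothetical helper: propagate through a pad that adds pad_before[i]
// elements at the start of axis i. Every field except the offset carries over.
StripeConfig PropagateThroughPad(const StripeConfig& out, const std::vector<int>& pad_before) {
  assert(pad_before.size() == out.offset.size());
  StripeConfig in = out;
  for (std::size_t i = 0; i < in.offset.size(); ++i) in.offset[i] -= pad_before[i];
  return in;
}

// Hypothetical helper: half-open [begin, end) range covered by stripe k along
// `axis`, clipped to [0, dim). Clipped-away parts generate no reads.
std::pair<int, int> ClippedStripe(const StripeConfig& cfg, std::size_t axis, int k, int dim) {
  assert(axis < cfg.shape.size() && k >= 0 && dim > 0);
  const int begin = cfg.offset[axis] + k * cfg.strides[axis];
  const int end = begin + cfg.shape[axis];
  return {std::max(0, std::min(begin, dim)), std::max(0, std::min(end, dim))};
}
```

With the T_out config above and a 1-wide pad, the propagated offset is [-1, -1], and the three stripes along either axis of the (4, 4) T_in clip to [0, 1), [1, 3) and [3, 4): every interior element is read exactly once and the padding is never read.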

If this is still unclear - which I appreciate it might be - I can potentially produce a quick diagram.

manupak (Contributor) left a comment

Mostly data structure introductions at this point! LGTM.

mbaret (Contributor Author) commented Nov 30, 2021

ping @csullivan

NicolaLancellotti (Contributor) left a comment

LGTM

csullivan (Contributor) left a comment

LGTM, thanks @mbaret.

leandron merged commit 2275359 into apache:main Dec 1, 2021
leandron (Contributor) commented Dec 1, 2021

This is merged now, thanks @csullivan @mbaret @NicolaLancellotti @manupa-arm and @tqchen.

masahi pushed a commit to masahi/tvm that referenced this pull request Dec 1, 2021
…#9458)

* [ETHOSU][1] Add affine analysis structures for the cascader

The cascader relies heavily on being able to determine
data dependencies between operators. This is so that it
can calculate how stripes should be propagated through a
cascade.

To do this, two data structures are defined: StripeConfig
and Propagator. StripeConfig stores information for how a
tensor should be broken up into stripes and executed.
Propagator transforms a StripeConfig using an affine
transform matrix, allowing an input StripeConfig for an
operator to be determined by 'propagating' the output
StripeConfig.

By chaining together Propagators, we can analyse how
data dependencies vary throughout a cascade and therefore
calculate the memory requirements (and approximate the
performance).

Change-Id: If7176fea961c631be4a6c195303da536030d957b

* Add test guards

Change-Id: I1d7633e20daab33642fa5c4a12e474a4def4d8b8

* Address review comments

Change-Id: Iff5f1effa08e0628de91f5577487d0cecebec824

* Improve docs

Change-Id: I508809d8c1a08d231e3a9b0fd9b3f2639cc2f0e3
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 7, 2022
yangulei pushed a commit to yangulei/tvm that referenced this pull request Jan 11, 2022
yangulei pushed a commit to yangulei/tvm that referenced this pull request Jan 12, 2022
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022
6 participants