[USMP] Adding U4 usecase #65
Conversation
This commit adds a new usecase (U4) for the USMP that enables the user to place I/O tensors within the workspace buffer rather than providing them separately. This allows the space occupied by I/O tensors to be reused during inference. Change-Id: I622feca4f8762cf48ef66005bdff2cdb02df4d54
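For readers skimming the thread, here is a minimal sketch of the calling pattern U4 aims to enable. All concrete names and numbers (WORKSPACE_POOL_SIZE, the I/O offsets, the pool struct layout, and the stubbed tvmgen_my_model_run) are hypothetical stand-ins for what the generated tvmgen_my_model.h would provide; this is not the exact generated API.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for values the generated header would provide. */
#define WORKSPACE_POOL_SIZE 4096u
#define INPUT0_OFFSET 0u
#define OUTPUT0_OFFSET 1024u

/* Assumed shape of the generated pool struct: one pointer per workspace pool. */
struct tvmgen_my_model_workspace_pools {
  uint8_t *sram;
};

/* Stub for the generated entrypoint so the sketch compiles stand-alone. */
static int tvmgen_my_model_run(struct tvmgen_my_model_workspace_pools *pools) {
  (void)pools;
  return 0;
}

static uint8_t workspace_pool[WORKSPACE_POOL_SIZE];

int run_inference(void) {
  /* The I/O tensors live inside the workspace pool, so only the pool
   * pointer is handed to the entrypoint; no separate I/O buffers exist. */
  struct tvmgen_my_model_workspace_pools workspaces = {.sram = workspace_pool};

  /* Write the input directly into its planned slot in the workspace buffer. */
  uint8_t *input0 = workspace_pool + INPUT0_OFFSET;
  memset(input0, 0, 64); /* stands in for application-side input generation */

  /* Run inference; once the input has been consumed, its bytes are free to
   * be reused by intermediate tensors within this same pool.              */
  int status = tvmgen_my_model_run(&workspaces);

  /* Read the output from its planned slot in the same pool. */
  uint8_t *output0 = workspace_pool + OUTPUT0_OFFSET;
  (void)output0;
  return status;
}
```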
thanks @manupa-arm for the proposal! I added a couple questions
// the workspace buffer
GenerateInput(inputs.input0);
tvmgen_my_model_run(&workspaces);
right now we mark input nodes as "precious" (figuratively, this isn't a literal thing), and i don't think we re-use the memory. in other words, this line should be idempotent. I think this RFC seeks to change that, which is a perfectly reasonable thing to do, but i like that there is a PassOption to support this explicitly.

should this in fact be a PassOption? another way to do this is to annotate the relay program. the benefit of that is that if we ever started compiling multiple programs in a single tvm.relay.build call, we wouldn't have a singleton global PassOption which could apply differently to different programs or parameters.

also, if a user didn't particularly care about one input, but did care about another, it might be more helpful to mark this at the I/O tensor level. what are your thoughts?
Good points @areusch!
One thing I would like to make clear: the proposal here extends TVM's current status. Therefore, this change does not consider the following two scenarios:
A1. using tvm.relay.build(List[IRModule])
A2. Selective re-use of the space of some I/O tensors.
Thus, I would consider them out of scope for the extension proposed here.
However, I'd like to highlight why this work does not block such extensions in the future.
For A1,
If and when we want to do this, that RFC should discuss why we would ever want it. As of now I feel this over-stretches the API. Even if we wanted to do it, there is a larger issue to be solved: the PassConfig may need to be set separately for the different programs coupled in the same build call, e.g. we might want to vectorize one program but not the other.
For A2,
This is a perfectly valid extension upon this work.
Right now, we have kept the change self-contained within the USMP pass -- therefore using PassConfig.
I would still consider this route to be valuable because the user does not need to manually annotate all the I/O tensors, which would account for most usage of this feature.
One thing I find challenging is using such annotations before relay.build(...), following the agreement reached here: https://github.com/apache/tvm-rfcs/blob/main/rfcs/0029-migrating-to-irmodule-attributes.md, i.e. the annotations are applied inside relay.build(...).
Let's say we could overcome that hurdle; then all that is needed is to use the annotations in the CreateAllocateIO pass to filter which I/O tensors should be selected. Thus, one could simply extend the feature proposed here to support that derivative advanced feature.
I can add some text as future considerations, if you like. lmk.
i think this all makes sense to me
* The relay "main" function that describes the call order to operator PrimFuncs has to be described in TIR to be able to integrate the USMP into the respective executor codegen. However, we dont view this as a major problem as the relay "main" function could easily be lowered to TIR. | ||
|
||
* The U4 usecase will only be supported with [Embedded C Runtime Interface](https://discuss.tvm.apache.org/t/rfc-utvm-embedded-c-runtime-interface/9951/14). This is mainly because the nature of the requirement is associated with embedded usecases. However, the USMP changes here should be complimentary to support other runtime interfaces such as Module-based Model Runtime Interface's set_input and set_output in future. |
if we did implement this for e.g. the C++ runtime or a scenario where `DLTensor`s were passed (perhaps we have dynamic shapes), would we allocate `DLTensor` instances in whatever section of the workspace we reserve for them? conceptually we could just consider those `DLTensor`s as companion buffers which need to be live at the same time as the `data`.
Good point again!
This is generally applicable to the Embedded C Runtime Interface as well.
Another way to frame the suggestion is: why can't we treat the buffer space used by I/O tensors as separate workspace buffers?
I have considered this too :). The short answer is that it is sub-optimal to use disjoint workspace buffers rather than a single workspace buffer, because it does not allow partial overlaps of tensors across such buffers.
The workspace buffer is viewed as a spatio-temporal buffer and tensors are considered spatio-temporal allocations. Having disjoint workspace buffers creates a spatial barrier that could potentially lead to higher fragmentation. Therefore, out of the two options, we are prioritizing the better one here. Again, this could be another usecase if there is value for it.
Again, I can add a few lines for this as a future consideration.
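To put made-up numbers on the fragmentation argument (the tensor sizes and liveness below are purely illustrative, not taken from the RFC): a single spatio-temporal pool only needs the largest concurrent footprint, while disjoint I/O and intermediate pools must add their individual peaks, because a tensor cannot straddle the boundary between them.

```c
#include <stdio.h>

/* Hypothetical tensor sizes in bytes, with non-overlapping liveness:
 *   input0        : live until the first operator has consumed it
 *   intermediate0 : live after input0 is dead
 *   output0       : live after intermediate0 is dead                    */
enum { INPUT0 = 1000, INTERMEDIATE0 = 1500, OUTPUT0 = 800 };

static int max_of(int a, int b) { return a > b ? a : b; }

int main(void) {
  /* Single pool: any tensor may reuse bytes of tensors that are not live
   * at the same time, so the peak is the largest single footprint.      */
  int single_pool = max_of(max_of(INPUT0, INTERMEDIATE0), OUTPUT0); /* 1500 */

  /* Disjoint pools: intermediates can never spill into the bytes vacated
   * by the input (and vice versa), so the per-pool peaks add up.        */
  int io_pool = max_of(INPUT0, OUTPUT0);                            /* 1000 */
  int disjoint_pools = io_pool + INTERMEDIATE0;                     /* 2500 */

  printf("single pool: %d bytes, disjoint pools: %d bytes\n",
         single_pool, disjoint_pools);
  return 0;
}
```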
i agree it's probably sub-optimal to separate I/O buffers from intermediates, but what I mean here is: how would we extend the planner to also allocate memory for the `DLTensor` metadata inside the workspace buffer? the Executor currently has to allocate memory for those somewhere in the case where we need that metadata (e.g. --use-unpacked-api=false or we have dynamic shapes).
@areusch smart pointer (DLTensor) metadata is not something considered so far.
I admit that we need to account for a certain overhead for each buffer based on the calling convention used. Currently the USMP's only focus is planning for the .data of DLTensors -- however, I feel this is orthogonal to the proposal here and applicable generally to all usecases.
It is an extension of its own, with choices such as whether we do inline placement of such data or use a special buffer to hold it, etc.
> I admit that we need to account for a certain overhead

i guess the main question is: do you think that `DLTensor` data should be "overhead" in the sense that USMP should consider a buffer to need `sizeof(data) + sizeof(DLTensor)` bytes contiguously, or do you prefer to keep all the `data` in one contiguous block and the DLTensors elsewhere? the second option here seems more in line with the unpacked calling convention.
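A sketch of the two placements under discussion, assuming a reasonably recent dlpack/dlpack.h (with the DLDevice device field) on the include path; the pool size, offsets, and shapes are invented for illustration only:

```c
#include <dlpack/dlpack.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 4096u /* hypothetical workspace pool size */
static uint8_t pool[POOL_SIZE];

/* Option (a): plan sizeof(DLTensor) + data bytes contiguously per buffer,
 * with the header placed immediately before its data inside the pool
 * (alignment of the header would also need to be accounted for).        */
static DLTensor *tensor_at(size_t offset) {
  DLTensor *t = (DLTensor *)(void *)(pool + offset);
  t->data = pool + offset + sizeof(DLTensor); /* data follows the header */
  return t;
}

/* Option (b): only the data bytes live in the pool; the DLTensor structs
 * are kept elsewhere, here in a statically allocated table as just one
 * possible location.                                                     */
static int64_t input0_shape[] = {1, 224, 224, 3};
static DLTensor io_table[] = {{
    .data = pool + 0, /* hypothetical planned offset of input0's data */
    .device = {kDLCPU, 0},
    .ndim = 4,
    .dtype = {kDLInt, 8, 1}, /* int8 */
    .shape = input0_shape,
    .strides = NULL,
    .byte_offset = 0,
}};
```

Option (b) is the one the reply below leans towards: the data stays packed in the pool while the headers can live wherever the backend decides.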
I also feel the latter choice allows the backend (passes) to decide where this data should be held, while not constraining it to be placed in the same memory pool (and thus ultimately in the same memory).
Thanks @areusch for going through this.
Please see my responses.
thanks @manupa-arm , just one point to clarify now
Thanks @areusch!
I was out of office for a few days.
* The relay "main" function that describes the call order to operator PrimFuncs has to be described in TIR to be able to integrate the USMP into the respective executor codegen. However, we dont view this as a major problem as the relay "main" function could easily be lowered to TIR. | ||
|
||
* The U4 usecase will only be supported with [Embedded C Runtime Interface](https://discuss.tvm.apache.org/t/rfc-utvm-embedded-c-runtime-interface/9951/14). This is mainly because the nature of the requirement is associated with embedded usecases. However, the USMP changes here should be complimentary to support other runtime interfaces such as Module-based Model Runtime Interface's set_input and set_output in future. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@areusch smart pointer (DLTensor) metadata is not something considered so far.
I admit that we need to account for a certain overhead for each buffers based on the calling convention used. Currently the USMP only focus are about planning for .data of DLTensors -- however, I feel it is orthogonal to the proposal here and applicable generally to all usecases.
It is an extention of its own with choices such as whether we do inline placement of such data or do we use special buffer to hold them, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @manupa-arm ! i think this looks solid.
will let @Mousius see if he has any feedback and you guys can merge whenever
I have one minor typography comment but otherwise this looks good to me 😸
Co-authored-by: Christopher Sidebottom <[email protected]>
Thanks @Mousius for catching that! now fixed!
A friendly ping for reviews/merge...
@Mousius could you take another look when you get a minute?
LGTM 😸