
[USMP] Adding U4 usecase #65

Merged: 2 commits into apache:main, Apr 27, 2022

Conversation

manupak (Contributor) commented Mar 25, 2022

This commit adds a new usecase (U4) for the USMP that enables the user to place I/O tensors within the workspace instead of providing them separately.

This allows the space for I/O tensors to be utilized by the inference.
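To make the usecase concrete, here is a minimal, self-contained C sketch of the idea from the application's point of view, written under assumptions rather than against the real generated API: the application owns a single workspace pool, the memory planner places the I/O tensors at fixed offsets inside that pool, and no separate I/O buffers are allocated. All names, offsets, and sizes below (including the run-function stub) are illustrative placeholders.

#include <stdint.h>
#include <string.h>

/* Illustrative values only; in practice the memory planner would emit them. */
#define WORKSPACE_POOL_SIZE 4096
#define INPUT0_OFFSET       0
#define INPUT0_SIZE         784
#define OUTPUT0_OFFSET      1024
#define OUTPUT0_SIZE        10

static uint8_t workspace_pool[WORKSPACE_POOL_SIZE];

/* Stand-in for a generated entry point: it receives only the workspace pool
 * and reads/writes the I/O tensors at their planned offsets inside it. */
static int tvmgen_my_model_run_stub(uint8_t *pool) {
    /* ...generated operator calls would address pool + planned offsets... */
    memset(pool + OUTPUT0_OFFSET, 0, OUTPUT0_SIZE);
    return 0;
}

int run_inference(const uint8_t *input_data, uint8_t *output_data) {
    /* Write the input directly into the workspace pool at its planned offset;
     * no dedicated input buffer exists. */
    memcpy(workspace_pool + INPUT0_OFFSET, input_data, INPUT0_SIZE);

    int status = tvmgen_my_model_run_stub(workspace_pool);
    if (status != 0) {
        return status;
    }

    /* Read the output back out of the same pool; its space is then free to be
     * reused by intermediate tensors in the next inference. */
    memcpy(output_data, workspace_pool + OUTPUT0_OFFSET, OUTPUT0_SIZE);
    return 0;
}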

manupak requested a review from Mousius on March 25, 2022 13:08

manupak (Contributor, Author) commented Mar 25, 2022

cc: @Mousius @leandron @areusch

This commit adds a new usecase (U4) for the USMP that enables the user to place I/O tensors within the workspace instead of providing them separately.

This allows the space for I/O tensors to be utilized by the inference.

Change-Id: I622feca4f8762cf48ef66005bdff2cdb02df4d54
manupak commented Mar 29, 2022

A friendly ping for reviews as the implementation PR is also ready and green :) ! @Mousius @areusch

areusch (Contributor) left a review comment:

thanks @manupa-arm for the proposal! I added a couple questions.

Thread on the RFC example code:

// the workspace buffer
GenerateInput(inputs.input0)

tvmgen_my_model_run(&workspaces);
areusch:

right now we mark input nodes as "precious" (figuratively, this isn't a literal thing), and i don't think we re-use the memory. in other words, this line should be idempotent. I think this RFC seeks to change that, which is a perfectly reasonable thing to do but i like that there is a PassOption to support this explicitly. should this in fact be a PassOption? another way to do this is to annotate the relay program. the benefit of that is that if we ever started compiling multiple programs in a single tvm.relay.build call, we wouldn't have a singleton global PassOption which could apply differently to different programs or parameters. also, if a user didn't particularly care about one input, but did care about another, it might be more helpful to mark this at the I/O tensor level. what are your thoughts?

manupak (Contributor, Author) replied Mar 31, 2022:

Good points @areusch!

One thing I would like to make clear: the proposal here is an extension of TVM's current status. Therefore, this change does not consider the following two scenarios:

A1. using tvm.relay.build(List[IRModule])
A2. selective re-use of the space of only some I/O tensors.

Thus, I would consider them out of scope for the extension proposed here.

However, I'd like to highlight why this work does not conflict with, or block, such extensions in the future.

For A1,

If and when we want to do this, we should (hopefully in that RFC) discuss why we would ever want to do it. As of now I feel it over-stretches that API. Even if we wanted to do it, there is a larger issue to be solved: the PassConfig may need to be set separately for the different programs coupled into the build call, e.g. we might want to vectorize one program but not the other.

For A2,

This is a perfectly valid extension upon this work.

Right now, we have kept the change self-contained within the USMP pass, hence the use of PassConfig. I would still consider this route valuable because it spares the user from having to manually annotate all the I/O tensors, which would cover most usage of this feature.

One thing I find challenging is using such annotations before relay.build(...), given the agreement reached here: https://github.com/apache/tvm-rfcs/blob/main/rfcs/0029-migrating-to-irmodule-attributes.md, i.e. the annotations are applied inside relay.build(...).

Let's say we could overcome that hurdle; then all that is needed is to use the annotations in the CreateAllocateIO pass to filter which I/O tensors get selected. Thus, one could simply extend the feature proposed here to support that more advanced derivative feature.

I can add some text as future considerations, if you like. Let me know.

areusch:

i think this all makes sense to me

Thread on the RFC text:

* The relay "main" function that describes the call order to operator PrimFuncs has to be described in TIR to be able to integrate the USMP into the respective executor codegen. However, we don't view this as a major problem as the relay "main" function could easily be lowered to TIR.

* The U4 usecase will only be supported with [Embedded C Runtime Interface](https://discuss.tvm.apache.org/t/rfc-utvm-embedded-c-runtime-interface/9951/14). This is mainly because the nature of the requirement is associated with embedded usecases. However, the USMP changes here should be complementary to supporting other runtime interfaces, such as the Module-based Model Runtime Interface's set_input and set_output, in the future.
areusch:

if we did implement this for e.g. C++ runtime or a scenario where DLTensor were passed (perhaps we have dynamic shapes), would we allocate DLTensor instances in whatever section of workspace we reserve for them? conceptually we could just consider those DLTensors as companion buffers which need to be live at the same time as the data.

manupak replied Mar 31, 2022:

Good point again!

This is generally applicable to the Embedded C Runtime Interface as well.

Another way to frame the suggestion is: why can't we treat the buffer space used by I/O tensors as separate workspace buffers?

I have considered this too :). The short answer is that using disjoint workspace buffers is sub-optimal compared to a single workspace buffer, because it does not allow partial overlaps of tensors across those buffers.

The workspace buffer is viewed as a spatio-temporal buffer, and tensors are considered spatio-temporal allocations. Having disjoint workspace buffers creates a spatial barrier that could potentially lead to higher fragmentation. Therefore, out of the two options, we are prioritizing the better one here. Again, this could be another usecase if there is value in it.

Again, I can add a few lines for this as a future consideration.
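As a toy illustration of the fragmentation argument above (the numbers are made up, not taken from the RFC or the PR), consider an input of 100 bytes live during steps [0, 1], an intermediate of 80 bytes live during steps [1, 2], and an output of 100 bytes live during steps [2, 3]:

#include <stdio.h>

int main(void) {
    int input0 = 100;       /* bytes, live during steps [0, 1] */
    int intermediate = 80;  /* bytes, live during steps [1, 2] */
    int output0 = 100;      /* bytes, live during steps [2, 3] */

    /* Single shared workspace pool: only tensors that are live at the same
     * time need distinct space, so the peak is the largest concurrent set. */
    int live_01 = input0 + intermediate;   /* 180 */
    int live_12 = intermediate + output0;  /* 180 */
    int shared_peak = live_01 > live_12 ? live_01 : live_12;

    /* Disjoint I/O buffers plus a separate workspace: the I/O space can never
     * be reused for intermediates, so everything adds up. */
    int disjoint_total = input0 + output0 + intermediate;  /* 280 */

    printf("single pool peak : %d bytes\n", shared_peak);
    printf("disjoint buffers : %d bytes\n", disjoint_total);
    return 0;
}

With a single pool the planner only needs space for tensors that are live at the same time (180 bytes here), whereas disjoint I/O buffers force the full 280 bytes because the I/O space can never be reclaimed for intermediates.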

areusch:

i agree it's probably sub-optimal to separate I/O buffers from intermediates, but what I mean here is: how would we extend the planner to also allocate memory for the DLTensor metadata inside the workspace buffer? the Executor currently has to allocate memory for those somewhere in the case where we need that metadata (e.g. --use-unpacked-api=false or we have dynamic shapes).

manupak replied Apr 19, 2022:

@areusch, the DLTensor metadata is not something we have considered so far.

I admit that we need to account for a certain overhead for each buffer, based on the calling convention used. Currently the USMP only focuses on planning for the .data of DLTensors; however, I feel this is orthogonal to the proposal here and applies generally to all usecases.

It is an extension in its own right, with choices such as whether we place such data inline or use a special buffer to hold it, etc.

areusch:

> I admit that we need to account for a certain overhead

i guess the main question is: do you think that DLTensor data should be "overhead" in the sense that USMP should consider a buffer to need sizeof(data) + sizeof(DLTensor) bytes contiguously, or do you prefer to keep all the data in one contiguous block and the DLTensors elsewhere? the second option seems more in line with the unpacked calling convention.

manupak:

I also feel the latter choice allows the backend (passes) to decide where this data should be held, without constraining it to be placed in the same memory pool (and thus ultimately in the same memory).
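A rough sketch of that latter option, assuming DLPack's current DLTensor layout (field names may differ across dlpack versions): the tensor payload lives inside the planned workspace pool, while the DLTensor descriptor is kept elsewhere and only its .data member points into the pool. The pool size, offset, shape, and dtype are illustrative placeholders, not planner output.

#include <dlpack/dlpack.h>
#include <stdint.h>

#define POOL_SIZE     4096
#define INPUT0_OFFSET 0

/* Payloads live inside the planned workspace pool... */
static uint8_t workspace_pool[POOL_SIZE];

/* ...while the DLTensor metadata is kept elsewhere (here, static storage),
 * and only points into the pool. It could equally be placed in a different
 * memory pool from the payload data. */
static int64_t input0_shape[2] = {1, 784};

static DLTensor input0_tensor = {
    .data = &workspace_pool[INPUT0_OFFSET],
    .device = {kDLCPU, 0},
    .ndim = 2,
    .dtype = {kDLFloat, 32, 1},  /* float32, 1 lane */
    .shape = input0_shape,
    .strides = NULL,             /* contiguous layout */
    .byte_offset = 0,
};

Keeping the descriptor outside the pool means the planner continues to plan only the .data payloads, and a later extension could still decide where such descriptors should live.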

manupak left a review comment:

Thanks @areusch for going through this.

Please see my responses.


areusch commented Apr 8, 2022

thanks @manupa-arm, just one point to clarify now.

manupak left a review comment:

Thanks @areusch! I was out of office for a few days.


areusch (Contributor) left a review comment:

thanks @manupa-arm! i think this looks solid.

areusch commented Apr 21, 2022

will let @Mousius see if he has any feedback and you guys can merge whenever

Mousius (Member) left a review comment:

I have one minor typography comment but otherwise this looks good to me 😸

rfcs/0009_Unified_Static_Memory_Planning.md (review thread outdated; resolved)
manupak commented Apr 22, 2022

Thanks @Mousius for catching that! now fixed!

manupak commented Apr 25, 2022

A friendly ping for reviews/merge...

areusch commented Apr 26, 2022

@Mousius could you take another look when you get a minute?

Mousius left a review comment:

LGTM 😸

Mousius merged commit b9e246f into apache:main on Apr 27, 2022
3 participants