[USMP] Adding U4 usecase #65
Conversation
This commit adds a new usecase (U4) for the USMP that enables the user to place I/O tensors within the workspace buffer rather than providing them separately. This allows the space occupied by I/O tensors to be reused during inference. Change-Id: I622feca4f8762cf48ef66005bdff2cdb02df4d54
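For readers skimming the thread, here is a minimal sketch of the calling pattern U4 aims to enable. All concrete names and numbers (WORKSPACE_POOL_SIZE, the I/O offsets, the pool struct layout, and the stubbed tvmgen_my_model_run) are hypothetical stand-ins for what the generated tvmgen_my_model.h would provide; this is not the exact generated API.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for values the generated header would provide. */
#define WORKSPACE_POOL_SIZE 4096u
#define INPUT0_OFFSET 0u
#define OUTPUT0_OFFSET 1024u

/* Assumed shape of the generated pool struct: one pointer per workspace pool. */
struct tvmgen_my_model_workspace_pools {
  uint8_t *sram;
};

/* Stub for the generated entrypoint so the sketch compiles stand-alone. */
static int tvmgen_my_model_run(struct tvmgen_my_model_workspace_pools *pools) {
  (void)pools;
  return 0;
}

static uint8_t workspace_pool[WORKSPACE_POOL_SIZE];

int run_inference(void) {
  /* The I/O tensors live inside the workspace pool, so only the pool
   * pointer is handed to the entrypoint; no separate I/O buffers exist. */
  struct tvmgen_my_model_workspace_pools workspaces = {.sram = workspace_pool};

  /* Write the input directly into its planned slot in the workspace buffer. */
  uint8_t *input0 = workspace_pool + INPUT0_OFFSET;
  memset(input0, 0, 64); /* stands in for application-side input generation */

  /* Run inference; once the input has been consumed, its bytes are free to
   * be reused by intermediate tensors within this same pool.              */
  int status = tvmgen_my_model_run(&workspaces);

  /* Read the output from its planned slot in the same pool. */
  uint8_t *output0 = workspace_pool + OUTPUT0_OFFSET;
  (void)output0;
  return status;
}
```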
thanks @manupa-arm for the proposal! I added a couple questions
// the workspace buffer
GenerateInput(inputs.input0);
tvmgen_my_model_run(&workspaces);
right now we mark input nodes as "precious" (figuratively, this isn't a literal thing), and i don't think we re-use the memory. in other words, this line should be idempotent. I think this RFC seeks to change that, which is a perfectly reasonable thing to do, but i like that there is a PassOption to support this explicitly.

should this in fact be a PassOption? another way to do this is to annotate the relay program. the benefit of that is that if we ever started compiling multiple programs in a single tvm.relay.build call, we wouldn't have a singleton global PassOption which could apply differently to different programs or parameters.

also, if a user didn't particularly care about one input, but did care about another, it might be more helpful to mark this at the I/O tensor level. what are your thoughts?
Good points @areusch!
One thing I would like to make clear: the proposal here extends TVM's current status. Therefore, this change does not consider the following two scenarios:
A1. using tvm.relay.build(List[IRModule])
A2. Selective re-use of the space of some I/O tensors.
Thus, I would consider them out of scope for the extension proposed here.
However, I'd like to highlight why this work does not block such extensions in the future.
For A1,
If and when we want to do this, that RFC should discuss why we would ever want it. As of now I feel this over-stretches the API. Even if we wanted to do it, there is a larger issue to be solved: the PassConfig may need to be set separately for the different programs coupled in the same build call, e.g. we might want to vectorize one program but not the other.
For A2,
This is a perfectly valid extension upon this work.
Right now, we have kept the change self-contained within the USMP pass -- therefore using PassConfig.
I would still consider this route to be valuable because the user does not need to manually annotate all the I/O tensors, which would account for most usage of this feature.
One thing I find challenging is using such annotations before relay.build(...), following the agreement reached here: https://github.com/apache/tvm-rfcs/blob/main/rfcs/0029-migrating-to-irmodule-attributes.md, i.e. the annotations are applied inside relay.build(...).
Let's say we could overcome that hurdle; then all that is needed is to use the annotations in the CreateAllocateIO pass to filter which I/O tensors should be selected. Thus, one could simply extend the feature proposed here to support that derivative advanced feature.
I can add some text as future considerations, if you like. lmk.
i think this all makes sense to me
* The relay "main" function that describes the call order to operator PrimFuncs has to be described in TIR to be able to integrate the USMP into the respective executor codegen. However, we dont view this as a major problem as the relay "main" function could easily be lowered to TIR. | ||
|
||
* The U4 usecase will only be supported with [Embedded C Runtime Interface](https://discuss.tvm.apache.org/t/rfc-utvm-embedded-c-runtime-interface/9951/14). This is mainly because the nature of the requirement is associated with embedded usecases. However, the USMP changes here should be complimentary to support other runtime interfaces such as Module-based Model Runtime Interface's set_input and set_output in future. |
if we did implement this for e.g. the C++ runtime or a scenario where `DLTensor`s were passed (perhaps we have dynamic shapes), would we allocate `DLTensor` instances in whatever section of the workspace we reserve for them? conceptually we could just consider those `DLTensor`s as companion buffers which need to be live at the same time as the `data`.
Good point again!
This is generally applicable to the Embedded C Runtime Interface as well.
Another way to frame the suggestion is: why can't we treat the buffer space used by I/O tensors as separate workspace buffers?
I have considered this too :). The short answer is that it is sub-optimal to use disjoint workspace buffers rather than a single workspace buffer, because it does not allow partial overlaps of tensors across such buffers.
The workspace buffer is viewed as a spatio-temporal buffer and tensors are considered spatio-temporal allocations. Having disjoint workspace buffers creates a spatial barrier that could potentially lead to higher fragmentation. Therefore, out of the two options, we are prioritizing the better one here. Again, this could be another usecase if there is value for it.
Again, I can add a few lines for this as a future consideration.
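To put made-up numbers on the fragmentation argument (the tensor sizes and liveness below are purely illustrative, not taken from the RFC): a single spatio-temporal pool only needs the largest concurrent footprint, while disjoint I/O and intermediate pools must add their individual peaks, because a tensor cannot straddle the boundary between them.

```c
#include <stdio.h>

/* Hypothetical tensor sizes in bytes, with non-overlapping liveness:
 *   input0        : live until the first operator has consumed it
 *   intermediate0 : live after input0 is dead
 *   output0       : live after intermediate0 is dead                    */
enum { INPUT0 = 1000, INTERMEDIATE0 = 1500, OUTPUT0 = 800 };

static int max_of(int a, int b) { return a > b ? a : b; }

int main(void) {
  /* Single pool: any tensor may reuse bytes of tensors that are not live
   * at the same time, so the peak is the largest single footprint.      */
  int single_pool = max_of(max_of(INPUT0, INTERMEDIATE0), OUTPUT0); /* 1500 */

  /* Disjoint pools: intermediates can never spill into the bytes vacated
   * by the input (and vice versa), so the per-pool peaks add up.        */
  int io_pool = max_of(INPUT0, OUTPUT0);                            /* 1000 */
  int disjoint_pools = io_pool + INTERMEDIATE0;                     /* 2500 */

  printf("single pool: %d bytes, disjoint pools: %d bytes\n",
         single_pool, disjoint_pools);
  return 0;
}
```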
i agree it's probably sub-optimal to separate I/O buffers from intermediates, but what I mean here is: how would we extend the planner to also allocate memory for the `DLTensor` metadata inside the workspace buffer? the Executor currently has to allocate memory for those somewhere in the case where we need that metadata (e.g. --use-unpacked-api=false or we have dynamic shapes).
@areusch smart pointer (DLTensor) metadata is not something considered so far.
I admit that we need to account for a certain overhead for each buffer based on the calling convention used. Currently the USMP's only focus is planning for the .data of DLTensors -- however, I feel this is orthogonal to the proposal here and applicable generally to all usecases.
It is an extension of its own, with choices such as whether we do inline placement of such data or use a special buffer to hold it, etc.
> I admit that we need to account for a certain overhead

i guess the main question is: do you think that `DLTensor` data should be "overhead" in the sense that USMP should consider a buffer to need `sizeof(data) + sizeof(DLTensor)` bytes contiguously, or do you prefer to keep all the `data` in one contiguous block and the DLTensors elsewhere? the second option here seems more in line with the unpacked calling convention.
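A sketch of the two placements under discussion, assuming a reasonably recent dlpack/dlpack.h (with the DLDevice device field) on the include path; the pool size, offsets, and shapes are invented for illustration only:

```c
#include <dlpack/dlpack.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 4096u /* hypothetical workspace pool size */
static uint8_t pool[POOL_SIZE];

/* Option (a): plan sizeof(DLTensor) + data bytes contiguously per buffer,
 * with the header placed immediately before its data inside the pool
 * (alignment of the header would also need to be accounted for).        */
static DLTensor *tensor_at(size_t offset) {
  DLTensor *t = (DLTensor *)(void *)(pool + offset);
  t->data = pool + offset + sizeof(DLTensor); /* data follows the header */
  return t;
}

/* Option (b): only the data bytes live in the pool; the DLTensor structs
 * are kept elsewhere, here in a statically allocated table as just one
 * possible location.                                                     */
static int64_t input0_shape[] = {1, 224, 224, 3};
static DLTensor io_table[] = {{
    .data = pool + 0, /* hypothetical planned offset of input0's data */
    .device = {kDLCPU, 0},
    .ndim = 4,
    .dtype = {kDLInt, 8, 1}, /* int8 */
    .shape = input0_shape,
    .strides = NULL,
    .byte_offset = 0,
}};
```

Option (b) is the one the reply below leans towards: the data stays packed in the pool while the headers can live wherever the backend decides.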
I also feel the latter choice allows the backend (passes) to decide where this data should be held, while not constraining it to be placed in the same memory pool (and thus ultimately in the same memory).
Thanks @areusch for going through this.
Please see my responses.
thanks @manupa-arm , just one point to clarify now
Thanks @areusch!
I was out of office for a few days.
* The relay "main" function that describes the call order to operator PrimFuncs has to be described in TIR to be able to integrate the USMP into the respective executor codegen. However, we dont view this as a major problem as the relay "main" function could easily be lowered to TIR. | ||
|
||
* The U4 usecase will only be supported with [Embedded C Runtime Interface](https://discuss.tvm.apache.org/t/rfc-utvm-embedded-c-runtime-interface/9951/14). This is mainly because the nature of the requirement is associated with embedded usecases. However, the USMP changes here should be complimentary to support other runtime interfaces such as Module-based Model Runtime Interface's set_input and set_output in future. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@areusch smart pointer (DLTensor) metadata is not something considered so far.
I admit that we need to account for a certain overhead for each buffers based on the calling convention used. Currently the USMP only focus are about planning for .data of DLTensors -- however, I feel it is orthogonal to the proposal here and applicable generally to all usecases.
It is an extention of its own with choices such as whether we do inline placement of such data or do we use special buffer to hold them, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @manupa-arm ! i think this looks solid.
will let @Mousius see if he has any feedback and you guys can merge whenever
I have one minor typography comment but otherwise this looks good to me 😸
Co-authored-by: Christopher Sidebottom <[email protected]>
Thanks @Mousius for catching that! now fixed!
A friendly ping for reviews/merge...
@Mousius could you take another look when you get a minute?
LGTM 😸