# [USMP] Adding U4 usecase #65
```
tvmc compile my_model.tflite --executor=aot --output-format=mlf --target=c
```

```
    TVMExecute(&my_model, &inputs, &outputs, &context);
}
```
## U4 : User wants to write/read directly to the workspace buffer

This usecase allows the space used by the I/O tensors to be re-used by the inference.
### TVMC
```
tvmc compile my_model.tflite \
    --executor=aot \
    --target=c \
    --workspace-pools=sram \
    --pass-config tir.usmp.enable=1 \
    --pass-config tir.usmp.use_workspace_io=1
```
### Codegen'd Artifacts
```
// Codegen'd artifacts in metadata.c (lib0.c)

int32_t tvmgen_my_model_run(
    tvmgen_my_model_workspace_pools* workspace_pools
) {
    return my_model_main(workspace_pools->sram);
}

// Returns a handle pointing to the space inside the
// workspace pool where the input should be stored

tvmgen_my_model_inputs tvmgen_my_model_map_inputs(
    tvmgen_my_model_workspace_pools* workspace_pools
) {
    tvmgen_my_model_inputs inputs = {
        .input0 = &workspace_pools->sram[<INPUT0_OFFSET>],
    };
    return inputs;
}

// Returns a handle pointing to the space inside the
// workspace pool where the output is stored

tvmgen_my_model_outputs tvmgen_my_model_map_outputs(
    tvmgen_my_model_workspace_pools* workspace_pools
) {
    tvmgen_my_model_outputs outputs = {
        .output0 = &workspace_pools->sram[<OUTPUT0_OFFSET>],
    };
    return outputs;
}
```
```
// metadata.h

#define TVM_MY_MODEL_SRAM_WORKSPACE_BUFFER_SIZE xxxx

typedef struct {
    uint8_t* sram;
} tvmgen_my_model_workspace_pools;

typedef struct {
    uint8_t* input0;
} tvmgen_my_model_inputs;

typedef struct {
    uint8_t* output0;
} tvmgen_my_model_outputs;

tvmgen_my_model_inputs tvmgen_my_model_map_inputs(
    tvmgen_my_model_workspace_pools* workspace_pools
);

tvmgen_my_model_outputs tvmgen_my_model_map_outputs(
    tvmgen_my_model_workspace_pools* workspace_pools
);
```
### User Application
```
// The User Application
__attribute__((section("SRAM"), aligned(16))) static uint8_t workspace_buffer_sram[TVM_MY_MODEL_SRAM_WORKSPACE_BUFFER_SIZE];

int main(...) {
    ...
    tvmgen_my_model_workspace_pools workspaces = {
        .sram = workspace_buffer_sram,
    };
    tvmgen_my_model_inputs inputs =
        tvmgen_my_model_map_inputs(&workspaces);
    tvmgen_my_model_outputs outputs =
        tvmgen_my_model_map_outputs(&workspaces);

    // Generate the input tensor by passing the handle.
    // E.g. this could be a driver writing directly to
    // the workspace buffer.
    GenerateInput(inputs.input0);

    tvmgen_my_model_run(&workspaces);

    // A consumer can obtain the data by accessing the
    // struct outputs, whose handles point inside the
    // workspace buffer.
    ReadInferenceOutput(outputs.output0);
}
```
# Reference-level explanation

## Overview
This actual pass would traverse the full TIR program and construct BufferInfo objects:

```
<compute>
...
```
## The optional lowering changes to support U4

After Step 1, the I/O tensors will be bound as allocate nodes with a special annotation that keeps track of their offsets within the workspace pools. For example:

### Pre U4 IR transformation

```
__tvm_main__ = primfn(input1: handle, input2: handle, output1: handle, output2: handle) -> ()
  attr = {"global_symbol": "__tvm_main__", "runner_function": True}
  buffers = {output1_buffer_var: Buffer(output1_buffer_var_1: Pointer(global int16), int16, [452], []),
             output2_buffer_var: Buffer(output2_buffer_var_1: Pointer(global int16), int16, [452], []),
             input2_buffer_var: Buffer(input2_buffer_var_1: Pointer(global uint8), uint8, [150528], []),
             input1_buffer_var: Buffer(input1_buffer_var_1: Pointer(global uint8), uint8, [150528], [])}
  buffer_map = {input2: input2_buffer_var, input1: input1_buffer_var, output2: output2_buffer_var, output1: output1_buffer_var} {
  @tir.call_extern("tvmgen_default_fused_cast_subtract", input1_buffer_var_1, @tir.lookup_param("p0", dtype=handle), output1_buffer_var_1, dtype=int32)
  @tir.call_extern("tvmgen_default_fused_cast_subtract", input2_buffer_var_1, @tir.lookup_param("p1", dtype=handle), output2_buffer_var_1, dtype=int32)
}
```
### Post U4 IR transformation

```
@__tvm_main__ = primfn() -> ()
  attr = {"global_symbol": "__tvm_main__", "runner_function": True}
  buffers = {output2_buffer_var: Buffer(output2_buffer_var_1: Pointer(global int16), int16, [452], []),
             output1_buffer_var: Buffer(output1_buffer_var_1: Pointer(global int16), int16, [452], []),
             input2_buffer_var: Buffer(input2_buffer_var_1: Pointer(global uint8), uint8, [150528], []),
             input1_buffer_var: Buffer(input1_buffer_var_1: Pointer(global uint8), uint8, [150528], [])}
  buffer_map = {output1: handle: output1_buffer_var, input1: handle: input1_buffer_var, input2: handle: input2_buffer_var, output2: handle: output2_buffer_var} {
  allocate(output2_buffer_var_1, int16, [452]), storage_scope = global, annotations = {"output_tensor": "output2"});
  allocate(output1_buffer_var_1, int16, [452]), storage_scope = global, annotations = {"output_tensor": "output1"});
  allocate(input2_buffer_var_1, uint8, [150528]), storage_scope = global, annotations = {"input_tensor": "input2"});
  allocate(input1_buffer_var_1, uint8, [150528]), storage_scope = global, annotations = {"input_tensor": "input1"}) {
    @tir.call_extern("tvmgen_default_fused_cast_subtract", input1_buffer_var_1, @tir.lookup_param("p0", dtype=handle), output1_buffer_var_1, dtype=int32)
    @tir.call_extern("tvmgen_default_fused_cast_subtract", input2_buffer_var_1, @tir.lookup_param("p1", dtype=handle), output2_buffer_var_1, dtype=int32)
  }
}
```
Throughout the USMP lowering, the allocate nodes with such special annotations will be maintained as a `Map<String, PoolAllocation>`, where the key is the name of the I/O tensor and `PoolAllocation` captures the pool and the offset assigned to it by the USMP.
The above metadata will be used to produce the `tvmgen_<model_name>_map_inputs` and `tvmgen_<model_name>_map_outputs` functions in the metadata sources (see the guide-level explanation of U4).
# Code Structure
* src/tir/usmp/analysis/ -- this is where the analysis passes of USMP will live

NOTE : to support tir.constants generally, we'll be enhancing the bound relay.co
# Drawbacks

* The relay "main" function that describes the call order to operator PrimFuncs has to be described in TIR to be able to integrate the USMP into the respective executor codegen. However, we don't view this as a major problem, as the relay "main" function could easily be lowered to TIR.

* The U4 usecase will only be supported with the [Embedded C Runtime Interface](https://discuss.tvm.apache.org/t/rfc-utvm-embedded-c-runtime-interface/9951/14). This is mainly because the nature of the requirement is associated with embedded usecases. However, the USMP changes here should be complementary to supporting other runtime interfaces, such as the Module-based Model Runtime Interface's set_input and set_output, in the future.
# Review discussion

**@areusch:** If we did implement this for e.g. the C++ runtime or a scenario where …

**Author:** Good point again! This is generally applicable for the Embedded C Runtime Interface as well. Another way to frame the suggestion is to ask: why can't we treat the buffer space used by I/O tensors as separate workspace buffers? I have considered this too :). The short answer is that it is sub-optimal to use disjoint workspace buffers rather than a single workspace buffer, because it does not allow partial overlaps of tensors across such buffers. TL;DR: the workspace buffer is viewed as a spatio-temporal buffer and tensors are considered spatio-temporal allocations. Having disjoint workspace buffers creates a spatial barrier that could potentially lead to higher fragmentation. Therefore, out of the two options, we are prioritizing the better one here. Again, this could be another usecase if there is value for it; I can add a few lines for this as a future consideration.

**@areusch:** I agree it's probably sub-optimal to separate I/O buffers from intermediates, but what I mean here is: how would we extend the planner to also allocate memory for the …

**Author:** @areusch, smart pointer (DLTensor) metadata is not something considered so far. I admit that we need to account for a certain overhead for each buffer based on the calling convention used. Currently the USMP's only focus is planning for the .data of DLTensors; however, I feel this is orthogonal to the proposal here and applicable generally to all usecases. It is an extension of its own, with choices such as whether we do inline placement of such data or use a special buffer to hold it, etc.

**@areusch:** I guess the main question is: do you think that DLTensor data should be "overhead" in the sense that USMP should consider a buffer to need …

**Author:** I also feel the latter choice allows the backend (passes) to decide where this data should be held, while not constraining it to be placed in the same memory pool (and thus ultimately in the same memory).

**@areusch:** Right now we mark input nodes as "precious" (figuratively, this isn't a literal thing), and I don't think we re-use the memory; in other words, this line should be idempotent. I think this RFC seeks to change that, which is a perfectly reasonable thing to do, but I like that there is a PassOption to support this explicitly. Should this in fact be a PassOption? Another way to do this is to annotate the relay program. The benefit of that is that if we ever started compiling multiple programs in a single tvm.relay.build call, we wouldn't have a singleton global PassOption which could apply differently to different programs or parameters. Also, if a user didn't particularly care about one input but did care about another, it might be more helpful to mark this at the I/O tensor level. What are your thoughts?

**Author:** Good points @areusch! One thing I would like to make clear: the proposal here is to extend TVM's current status. Therefore, this change does not consider the following two scenarios:

A1. Using tvm.relay.build(List[IRModule]).

A2. Selective re-use of the space of some I/O tensors.

Thus, I would consider them out of scope for the extension proposed here. However, I'd like to highlight that this work is unrelated to, and does not block, such extensions in the future.

For A1: if and when we want to do this, we should (hopefully in that RFC) discuss why we would ever want to do it. As of now I feel this would be over-stretching that API. Even if we wanted to do it, there is a larger issue to be solved, because PassConfig would need to be set separately for the different programs coupled in the build call; e.g. we might not want to vectorize one program while we do want to vectorize the other.

For A2: this is a perfectly valid extension upon this work. Right now, we have kept the change self-contained within the USMP pass, hence using PassConfig. I would still consider this route valuable because the user does not need to manually annotate all the I/O tensors, which would account for most usage of this feature. One thing I find challenging would be using these annotations before relay.build(...), following the agreement reached here: https://github.com/apache/tvm-rfcs/blob/main/rfcs/0029-migrating-to-irmodule-attributes.md, i.e. the annotations are being done inside relay.build(...). Let's say we could overcome that hurdle; then all that would be needed to support it is to use the annotations to let the CreateAllocateIO pass select which I/O tensors should be chosen. Thus, one could simply extend the feature proposed here to support that derivative advanced feature. I can add some text as future considerations, if you like; lmk.

**@areusch:** I think this all makes sense to me.