[USMP] Adding U4 usecase #65

Merged 2 commits on Apr 27, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
152 changes: 151 additions & 1 deletion rfcs/0009_Unified_Static_Memory_Planning.md
@@ -349,6 +349,108 @@ tvmc compile my_model.tflite --executor=aot --output-format=mlf --target=c
TVMExecute(&my_model, &inputs, &outputs, &context);
}
```

## U4 : User wants to write/read directly to the workspace buffer

This usecase allows the space used by the I/O tensors to be re-used during inference: the I/O tensors are placed inside a user-provided workspace pool, so the application writes inputs to, and reads outputs from, the workspace buffer directly.
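
For contrast, here is a minimal sketch of the default (non-U4) flow, in which the application owns dedicated I/O buffers; the buffer sizes and the exact `run` signature are assumptions for illustration, not generated code:

```
// Default flow (sketch): the application owns separate input/output
// buffers in addition to the workspace pool, so their bytes stay
// reserved for the whole inference and cannot be re-used by
// intermediate tensors.
static uint8_t input0_buffer[INPUT0_SIZE];    // size assumed
static uint8_t output0_buffer[OUTPUT0_SIZE];  // size assumed
static uint8_t workspace_buffer_sram[TVM_MY_MODEL_SRAM_WORKSPACE_BUFFER_SIZE];

tvmgen_my_model_workspace_pools workspace_pools = { .sram = workspace_buffer_sram };
tvmgen_my_model_inputs inputs = { .input0 = input0_buffer };
tvmgen_my_model_outputs outputs = { .output0 = output0_buffer };
tvmgen_my_model_run(&inputs, &outputs, &workspace_pools);  // signature assumed
```

With U4, the I/O structs are instead obtained via the `map_inputs`/`map_outputs` functions shown below, whose members point into the workspace pool, and `run` takes only the workspace pools.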

### TVMC
```
tvmc compile my_model.tflite \
    --executor=aot \
    --target=c \
    --workspace-pools=sram \
    --pass-config tir.usmp.enable=1 \
    --pass-config tir.usmp.use_workspace_io=1
```
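
Here `tir.usmp.enable=1` turns on the Unified Static Memory Planner, while `tir.usmp.use_workspace_io=1` additionally instructs it to place the I/O tensors inside the declared workspace pool (`sram`), which is what enables the artifacts shown below.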
### Codegen'd Artifacts
```
//Codegen'd artifacts in metadata.c (lib0.c)

int32_t tvmgen_my_model_run(
    tvmgen_my_model_workspace_pools* workspace_pools
){
    return my_model_main(workspace_pools->sram);
}

// Returns a handle pointing to the space inside the
// workspace pool where the input should be stored

tvmgen_my_model_inputs tvmgen_my_model_map_inputs(
    tvmgen_my_model_workspace_pools* workspace_pools
) {
    tvmgen_my_model_inputs inputs = {
        .input0 = &workspace_pools->sram[<INPUT0_OFFSET>],
    };
    return inputs;
}

// Returns a handle pointing to the space inside the
// workspace pool where the output is stored

tvmgen_my_model_outputs tvmgen_my_model_map_outputs(
    tvmgen_my_model_workspace_pools* workspace_pools
) {
    tvmgen_my_model_outputs outputs = {
        .output0 = &workspace_pools->sram[<OUTPUT0_OFFSET>],
    };
    return outputs;
}
```
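
Note that `tvmgen_my_model_map_inputs` and `tvmgen_my_model_map_outputs` perform no allocation: each simply returns handles at the USMP-assigned offsets inside the user-provided workspace pool.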
```
// metadata.h

#include <stdint.h>

#define TVM_MY_MODEL_SRAM_WORKSPACE_BUFFER_SIZE xxxx

typedef struct {
    uint8_t* sram;
} tvmgen_my_model_workspace_pools;

typedef struct {
    uint8_t* input0;
} tvmgen_my_model_inputs;

typedef struct {
    uint8_t* output0;
} tvmgen_my_model_outputs;

tvmgen_my_model_inputs tvmgen_my_model_map_inputs(
    tvmgen_my_model_workspace_pools* workspace_pools
);

tvmgen_my_model_outputs tvmgen_my_model_map_outputs(
    tvmgen_my_model_workspace_pools* workspace_pools
);
```
### User Application
```
// The User Application
__attribute__((section( "SRAM" ), aligned( 16 ))) static uint8_t workspace_buffer_sram[TVM_MY_MODEL_SRAM_WORKSPACE_BUFFER_SIZE];

int main(...) {
    ...
    tvmgen_my_model_workspace_pools workspaces = {
        .sram = workspace_buffer_sram,
    };
    tvmgen_my_model_inputs inputs =
        tvmgen_my_model_map_inputs(&workspaces);
    tvmgen_my_model_outputs outputs =
        tvmgen_my_model_map_outputs(&workspaces);

    // Generate the input tensor by passing the handle.
    // E.g. this could be a driver writing directly to
    // the workspace buffer.
    GenerateInput(inputs.input0);

    tvmgen_my_model_run(&workspaces);

    // A consumer can obtain the output data through the
    // `outputs` struct, whose members point inside
    // the workspace.
    ReadInferenceOutput(outputs.output0);
}
```

**areusch** (Contributor) commented on `tvmgen_my_model_run(&workspaces);`:

Right now we mark input nodes as "precious" (figuratively, this isn't a literal thing), and I don't think we re-use the memory; in other words, this line should be idempotent. I think this RFC seeks to change that, which is a perfectly reasonable thing to do, and I like that there is a PassOption to support this explicitly. Should this in fact be a PassOption? Another way to do this is to annotate the Relay program. The benefit of that is that if we ever started compiling multiple programs in a single tvm.relay.build call, we wouldn't have a singleton global PassOption which could apply differently to different programs or parameters. Also, if a user didn't particularly care about one input, but did care about another, it might be more helpful to mark this at the I/O tensor level. What are your thoughts?

**manupak** (Author) replied on Mar 31, 2022:

Good points @areusch!

One thing I would like to make clear: the proposal here extends TVM's current status. Therefore, this change does not consider the following two scenarios:

A1. using tvm.relay.build(List[IRModule])
A2. selective re-use of the space of only some I/O tensors.

Thus, I would consider them out of scope for the extension proposed here.

However, I'd like to highlight that this work is unrelated to, and does not block, such extensions in the future.

For A1:

If and when we want to do this, we should discuss (hopefully in that RFC) why we would ever want to. As of now I feel this over-stretches that API. Even if we wanted to do it, there is a larger issue to be solved, because the PassConfig would need to be set separately for the different programs coupled in the build call; e.g. we might not want to vectorize one program while we do want to vectorize the other.

For A2:

This is a perfectly valid extension upon this work.

Right now, we have kept the change self-contained within the USMP pass -- hence the PassConfig. I would still consider this route valuable because it saves the user from manually annotating all the I/O tensors, which would cover most usage of this feature.

One thing that I find challenging is using these annotations before relay.build(...), following the agreement reached here: https://github.com/apache/tvm-rfcs/blob/main/rfcs/0029-migrating-to-irmodule-attributes.md -- i.e., the annotations are done inside relay.build(...).

Let's say we could overcome that hurdle; then all that is needed is to use the annotations in the CreateAllocateIO pass to select which I/O tensors should be placed in the workspace. Thus, one could simply extend the feature proposed here to support that derivative advanced feature.

I can add some text as future considerations, if you like. lmk.

**areusch** (Contributor) replied:

I think this all makes sense to me.

# Reference-level explanation

## Overview
@@ -500,6 +602,52 @@ This actual pass would traverse full TIR program and construct BufferInfo object
<compute>
...
```

## The optional lowering changes to support U4

After Step 1, the I/O tensors will be bound as allocate nodes carrying a special annotation, so that the offsets assigned to them within the workspace pools can be tracked. For example, note how the I/O tensors move from being parameters of `__tvm_main__` before the transformation to annotated allocate nodes in its body afterwards:

### Pre U4 IR transformation

```
__tvm_main__ = primfn(input1: handle, input2: handle, output1: handle, output2: handle) -> ()
attr = {"global_symbol": "__tvm_main__", "runner_function": True}
buffers = {output1_buffer_var: Buffer(output1_buffer_var_1: Pointer(global int16), int16, [452], []),
output2_buffer_var: Buffer(output2_buffer_var_1: Pointer(global int16), int16, [452], []),
input2_buffer_var: Buffer(input2_buffer_var_1: Pointer(global uint8), uint8, [150528], []),
input1_buffer_var: Buffer(input1_buffer_var_1: Pointer(global uint8), uint8, [150528], [])}
buffer_map = {input2: input2_buffer_var, input1: input1_buffer_var, output2: output2_buffer_var, output1: output1_buffer_var} {
@tir.call_extern("tvmgen_default_fused_cast_subtract", input1_buffer_var_1, @tir.lookup_param("p0", dtype=handle), output1_buffer_var_1, dtype=int32)
@tir.call_extern("tvmgen_default_fused_cast_subtract", input2_buffer_var_1, @tir.lookup_param("p1", dtype=handle), output2_buffer_var_1, dtype=int32)
}
```

### Post U4 IR transformation

```
@__tvm_main__ = primfn() -> ()
attr = {"global_symbol": "__tvm_main__", "runner_function": True}
buffers = {output2_buffer_var: Buffer(output2_buffer_var_1: Pointer(global int16), int16, [452], []),
output1_buffer_var: Buffer(output1_buffer_var_1: Pointer(global int16), int16, [452], []),
input2_buffer_var: Buffer(input2_buffer_var_1: Pointer(global uint8), uint8, [150528], []),
input1_buffer_var: Buffer(input1_buffer_var_1: Pointer(global uint8), uint8, [150528], [])}
buffer_map = {output1: handle: output1_buffer_var, input1: handle: input1_buffer_var, input2: handle: input2_buffer_var, output2: handle: output2_buffer_var} {
allocate(output2_buffer_var_1, int16, [452]), storage_scope = global, annotations = {"output_tensor": "output2"});
allocate(output1_buffer_var_1, int16, [452]), storage_scope = global, annotations = {"output_tensor": "output1"});
allocate(input2_buffer_var_1, uint8, [150528]), storage_scope = global, annotations = {"input_tensor": "input2"});
allocate(input1_buffer_var_1, uint8, [150528]), storage_scope = global, annotations = {"input_tensor": "input1"}) {
@tir.call_extern("tvmgen_default_fused_cast_subtract", input1_buffer_var_1, @tir.lookup_param("p0", dtype=handle), output1_buffer_var_1, dtype=int32)
@tir.call_extern("tvmgen_default_fused_cast_subtract", input2_buffer_var_1, @tir.lookup_param("p1", dtype=handle), output2_buffer_var_1, dtype=int32)
}
}
```

Throughout the USMP lowering, the allocate nodes carrying such special annotations are maintained in a `Map<String, PoolAllocation>`, where the key is the name of the I/O tensor and `PoolAllocation` captures the pool and the offset assigned to it by the USMP.
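
As a rough model of the recorded information (a sketch only; the actual implementation uses TVM's C++ `Map<String, PoolAllocation>` container, and the names below are illustrative):

```
#include <stddef.h>

// Sketch: the per-I/O-tensor information the planner records.
typedef struct {
    const char* pool_name;  // e.g. "sram"
    size_t byte_offset;     // offset assigned by the planner within that pool
} PoolAllocationInfo;

// e.g. { "input0"  -> { "sram", INPUT0_OFFSET  },
//        "output0" -> { "sram", OUTPUT0_OFFSET } }
```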

This metadata is then used to produce the `tvmgen_<model_name>_map_inputs` and `tvmgen_<model_name>_map_outputs` functions in the metadata sources (see the guide-level explanation of U4).


# Code Structure

* src/tir/usmp/analysis/ -- this is where analysis passes of USMP will live
@@ -515,4 +663,6 @@ NOTE : to support tir.constants generally, we'll be enhancing the bound relay.co

# Drawbacks

* The relay "main" function that describes the call order to operator PrimFuncs has to be described in TIR to be able to integrate the USMP into the respective executor codegen. However, we don't view this as a major problem, as the relay "main" function can easily be lowered to TIR.

* The U4 usecase will only be supported with the [Embedded C Runtime Interface](https://discuss.tvm.apache.org/t/rfc-utvm-embedded-c-runtime-interface/9951/14). This is mainly because the nature of the requirement is associated with embedded usecases. However, the USMP changes here should be complementary to supporting other runtime interfaces, such as the Module-based Model Runtime Interface's set_input and set_output, in the future.

**areusch** (Contributor) commented:

If we did implement this for e.g. the C++ runtime, or a scenario where DLTensors were passed (perhaps we have dynamic shapes), would we allocate DLTensor instances in whatever section of workspace we reserve for them? Conceptually we could just consider those DLTensors as companion buffers which need to be live at the same time as the data.


**manupak** (Author) replied on Mar 31, 2022:

Good point again!

This is generally applicable to the Embedded C Runtime Interface as well.

Another way to frame the suggestion is to ask: why can't we treat the buffer space used by I/O tensors as separate workspace buffers?

I have considered this too :). The short answer is that it is sub-optimal to use disjoint workspace buffers rather than a single workspace buffer, because it does not allow partial overlaps of tensors across such buffers. TL;DR:

The workspace buffer is viewed as a spatio-temporal buffer, and tensors are considered spatio-temporal allocations. Having disjoint workspace buffers creates a spatial barrier that could potentially lead to higher fragmentation. Therefore, out of the two options, we are prioritizing the better one here. Again, this could be another usecase if there is value for it.

Again, I can add a few lines for this as a future consideration.


**areusch** (Contributor) replied:

I agree it's probably sub-optimal to separate I/O buffers from intermediates, but what I mean here is: how would we extend the planner to also allocate memory for the DLTensor metadata inside the workspace buffer? The Executor currently has to allocate memory for those somewhere in the case where we need that metadata (e.g. --use-unpacked-api=false, or we have dynamic shapes).


**manupak** (Author) replied on Apr 19, 2022:

@areusch smart pointer (DLTensor) metadata is not something we have considered so far.

I admit that we need to account for a certain overhead for each buffer, based on the calling convention used. Currently the USMP's only focus is planning for the .data of DLTensors -- however, I feel this is orthogonal to the proposal here and applicable generally to all usecases.

It is an extension in its own right, with choices such as whether we do inline placement of such data or use a special buffer to hold it, etc.


**areusch** (Contributor) replied:

> I admit that we need to account for a certain overhead

I guess the main question is: do you think that DLTensor data should be "overhead" in the sense that USMP should consider a buffer to need sizeof(data) + sizeof(DLTensor) bytes contiguously, or do you prefer to keep all the data in one contiguous block and the DLTensors elsewhere? The second option here seems more in line with the unpacked calling convention.


**manupak** (Author) replied:

I also feel the latter choice allows the backend (passes) to decide where this data should be held, while not constraining it to be placed in the same memory pool (and thus ultimately in the same memory).