Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Relay option to link parameters into runtime Modules #6917

Merged
merged 63 commits into from
Nov 26, 2020

Conversation

areusch
Copy link
Contributor

@areusch areusch commented Nov 13, 2020

Implements RFC: https://discuss.tvm.apache.org/t/rfc-linked-parameters-for-cpu-targets/8452

This PR implements changes to support generating Modules that contain parameters pre-linked. For CPU contexts, these modules don't require any copying before launching inference via GraphRuntime. Both C++ and C runtimes are supported.

@areusch
Copy link
Contributor Author

areusch commented Nov 23, 2020

@kparzysz-quic addressed those comments.
@manupa-arm @tqchen @ZihengJiang please take another look when you have a minute and explicit approve if you're ok with this change.

os.fill(' ');
os.setf(std::ios::left, std::ios::adjustfield);
if (arr_type.bits() == 32) {
PrintArray<float>(tensor->dl_tensor.data, num_elements, one_element_size_bytes,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it is easier to use uint to print the case of float in most of our cases. It also helps us to get around the problem of printing exp and mantisa

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, the printing problem is just an alignment thing. it doesn't affect the functionality. I don't see why adding an implicit cast to generated code that the compiler will never even know about is easier? because remember, we are not going to add the cast into the generated function that is accessing these floats--it is just going to get a float* and then by generating uint32_t here, the float* will magically access uint32_t data. I agree with you that it doesn't seeeem like anything would break too badly given they are the same size, but it's kind of a long limb to go out on just to support bfloat16. we would also need to do byte ordering.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To put it in another way. We will need support for bfloat16 and fp16 now given that they are already part of the type system. Addtiionally, there might also be need to support customized data types.

The main reason to go for the uint repr is its bit accurate precision(we don't need to worry about printing out INF, nan, or loss of accuracy due to printing and parsing of floating pt -- although partly addressed by hex printing, still somewhat complicated) and simplicity in implementation.

Additionally, for big endian machines and small endian machines we are already using the bits as the only information in our runtime https://github.com/apache/tvm/blob/main/include/tvm/runtime/ndarray.h#L429, which means our runtime won't simply work beyond such cases. Right now we also did not see an example that goes beyond these two cases.

If we do think it is important to keep float and double printing as their own type. I think we could do that, but need to make sure the rest of the codepath uses uint so that it works for bfloat16, fp16 and other potential extension data types now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, as discussed i've added support for float16 and bfloat16. left the rest of the float printing as it was before.

@tqchen
Copy link
Member

tqchen commented Nov 24, 2020

I still think we can print all values by their bits and print via uint. Unless there is a very strange endian being used. It would also helps us to support types that are not supported by LLVM natively, like bfloat16

@areusch
Copy link
Contributor Author

areusch commented Nov 24, 2020

@tqchen it seems like the main benefit is supporting bfloat16...i'd like to argue that's out of scope for this PR. further, we don't really have a big-endian board handy to test against anyway. I don't think that change would take a long time to implement, but we don't have a diverse enough test fleet now to have good confidence in it.

Copy link
Contributor

@manupak manupak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a second pass. Thanks for breaking out some of it as a separate PR.
Some comments and some of it might end up being clarifications.

<< " ((uint64_t*)out_ret_value)[0] = (uint64_t) (uintptr_t) "
<< ::tvm::runtime::symbol::tvm_param_prefix << kv.first << ";\n"
<< " out_ret_tcode[0] = " << kTVMOpaqueHandle << ";\n"
<< " return 0;\n";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need a default here ? prolly to give a meaningful error ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be nice to return a specific error code here (this was my original implementation), but unfortunately it's not easy to catch on the other side. we need to do a bit more legwork to be able to catch specific function return values in c++/python. as it stands, returning nullptr from this function is specific enough, and has the benefit that an exception flow isn't triggered by default for non-parameters.

<< "} // extern \"C\"\n"
<< "#endif\n";
stream << " case " << kv.second->id << ":\n"
<< " ((uint64_t*)out_ret_value)[0] = (uint64_t) (uintptr_t) "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a copy is introduced (Correct me if Im wrong), which is something we would want to avoid if possible in memory constrained devices.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's just copying the pointer here. I believe that should be ok?

Copy link
Contributor

@manupak manupak Nov 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh ok. So kv,first is the array's name, right ? Any reason why is it casted to uint64_t ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct. ordinarily we would cast to void*, but since void* may be less than 64 bits, we cast to uint64_t which is the largest type stored in TVMValue.

Copy link
Contributor

@manupak manupak Nov 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think void* should be enough here. void* would be less than 64 bits if the machine compiled for is less than 64bits. right? otherwise this will make it store the pointer in 64bit space where the machine could have a address space defined less than 64 bits, unless Im missing something here. In this specific case, the array defined for the constant will have an address depending the target machine it would be compiled for, hence my confusion.

Do we need to perform pointer arithmetic on this ? -- Im asking this because uintptr_t cast is used. If so, using uintptr_t should be sufficient instead of void*.

I guess this argument will break if TVMValue already uses 64bit spaces for addresses irrespective of what we do here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TVMValue is always 64 bits wide, though, because it's a union including int64_t. so when addressing it as an array, need to cast appropriately. i'm not sure if the uintptr_t cast is excessive--quite possibly it is.

Copy link
Contributor

@manupak manupak Nov 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright then. I will take your word for it. :). See if you can drop uintptr_t if that is not required given that it is going to be written to 64bit wide location anyway.

}
}

void NDArrayDataToC(::tvm::runtime::NDArray arr, int indent_chars, std::ostream& os) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if a brief could be added to show how a simple NDArray would get converted and appear in the source.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

documented this function in codegen_params.h and added an example there.

}
}

void NDArrayDataToC(::tvm::runtime::NDArray arr, int indent_chars, std::ostream& os) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Clarification] Can we use a raw binary string (e.g. "\xeb\x2a) to represent the content here ? Do we need to show the element wise break down? I would reckon DLTensors would anyway take it as a void* right ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could. the main thing this drags in is whether we should be determining things like the byte order and alignment of the target machine, or leave that to the compiler (LLVM or in this case, external C). i've been so far arguing not to do any of this--I don't really see a tangible benefit and it adds considerable complexity at this level that's hard to test. also, it makes it quite easy to inspect the generated parameters, which will often differ from the parameters originally supplied to relay.build().

Copy link
Member

@tqchen tqchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NDArray deletion part LGTM, see more discussions on the printing repr and whether or not w should use uint

num_elements *= shape_elem;
}

std::unique_ptr<DLManagedTensor, DLManagedTensorDeleter> tensor(arr.ToDLPack());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need ToDLPack, and can directly get DLTensor* from the arr.operator->

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

os.fill(' ');
os.setf(std::ios::left, std::ios::adjustfield);
if (arr_type.bits() == 32) {
PrintArray<float>(tensor->dl_tensor.data, num_elements, one_element_size_bytes,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To put it in another way. We will need support for bfloat16 and fp16 now given that they are already part of the type system. Addtiionally, there might also be need to support customized data types.

The main reason to go for the uint repr is its bit accurate precision(we don't need to worry about printing out INF, nan, or loss of accuracy due to printing and parsing of floating pt -- although partly addressed by hex printing, still somewhat complicated) and simplicity in implementation.

Additionally, for big endian machines and small endian machines we are already using the bits as the only information in our runtime https://github.com/apache/tvm/blob/main/include/tvm/runtime/ndarray.h#L429, which means our runtime won't simply work beyond such cases. Right now we also did not see an example that goes beyond these two cases.

If we do think it is important to keep float and double printing as their own type. I think we could do that, but need to make sure the rest of the codepath uses uint so that it works for bfloat16, fp16 and other potential extension data types now.

void GraphRuntime::LinkedNDArrayDeleter(Object* container) {
// container is the NDArray::Container which needs to get deleted.
// The data member points to global const memory, so it does not need deleting.
delete reinterpret_cast<NDArray::Container*>(container);
Copy link
Member

@tqchen tqchen Nov 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be static cast instead, since NDArray::Container* is a subclass of Object*, this can cause bug when Container and Object class have a different beginning offset due to subclassing and layout

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, thanks. done.

Copy link
Contributor

@manupak manupak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM bar others' pending comments.

@tqchen
Copy link
Member

tqchen commented Nov 25, 2020

LGTM now, @FrozenGene please take another look

@tqchen tqchen merged commit 81d9f11 into apache:main Nov 26, 2020
@tqchen
Copy link
Member

tqchen commented Nov 26, 2020

trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 2, 2020
* refactor RPCSessionContext utils

* Make TVMLogf platform-independent.

 * Some platforms need to use an alternate printf() to support basic
   things like %zu. Since %zu is platform-specific, we prefer to
   use a printf() that supports it or allow the platform to fix it up
   as needed.
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 4, 2020
* refactor RPCSessionContext utils

* Make TVMLogf platform-independent.

 * Some platforms need to use an alternate printf() to support basic
   things like %zu. Since %zu is platform-specific, we prefer to
   use a printf() that supports it or allow the platform to fix it up
   as needed.
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Dec 4, 2020
* refactor RPCSessionContext utils

* Make TVMLogf platform-independent.

 * Some platforms need to use an alternate printf() to support basic
   things like %zu. Since %zu is platform-specific, we prefer to
   use a printf() that supports it or allow the platform to fix it up
   as needed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants