-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Relay option to link parameters into runtime Modules #6917
Conversation
* Some platforms need to use an alternate printf() to support basic things like %zu. Since %zu is platform-specific, we prefer to use a printf() that supports it or allow the platform to fix it up as needed.
@kparzysz-quic addressed those comments. |
src/target/source/codegen_params.cc
Outdated
os.fill(' '); | ||
os.setf(std::ios::left, std::ios::adjustfield); | ||
if (arr_type.bits() == 32) { | ||
PrintArray<float>(tensor->dl_tensor.data, num_elements, one_element_size_bytes, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think it is easier to use uint to print the case of float in most of our cases. It also helps us to get around the problem of printing exp and mantisa
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, the printing problem is just an alignment thing. it doesn't affect the functionality. I don't see why adding an implicit cast to generated code that the compiler will never even know about is easier? because remember, we are not going to add the cast into the generated function that is accessing these floats--it is just going to get a float* and then by generating uint32_t here, the float* will magically access uint32_t data. I agree with you that it doesn't seeeem like anything would break too badly given they are the same size, but it's kind of a long limb to go out on just to support bfloat16. we would also need to do byte ordering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To put it in another way. We will need support for bfloat16 and fp16 now given that they are already part of the type system. Addtiionally, there might also be need to support customized data types.
The main reason to go for the uint repr is its bit accurate precision(we don't need to worry about printing out INF, nan, or loss of accuracy due to printing and parsing of floating pt -- although partly addressed by hex printing, still somewhat complicated) and simplicity in implementation.
Additionally, for big endian machines and small endian machines we are already using the bits as the only information in our runtime https://github.com/apache/tvm/blob/main/include/tvm/runtime/ndarray.h#L429, which means our runtime won't simply work beyond such cases. Right now we also did not see an example that goes beyond these two cases.
If we do think it is important to keep float and double printing as their own type. I think we could do that, but need to make sure the rest of the codepath uses uint so that it works for bfloat16, fp16 and other potential extension data types now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, as discussed i've added support for float16 and bfloat16. left the rest of the float printing as it was before.
I still think we can print all values by their bits and print via uint. Unless there is a very strange endian being used. It would also helps us to support types that are not supported by LLVM natively, like bfloat16 |
@tqchen it seems like the main benefit is supporting bfloat16...i'd like to argue that's out of scope for this PR. further, we don't really have a big-endian board handy to test against anyway. I don't think that change would take a long time to implement, but we don't have a diverse enough test fleet now to have good confidence in it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a second pass. Thanks for breaking out some of it as a separate PR.
Some comments and some of it might end up being clarifications.
<< " ((uint64_t*)out_ret_value)[0] = (uint64_t) (uintptr_t) " | ||
<< ::tvm::runtime::symbol::tvm_param_prefix << kv.first << ";\n" | ||
<< " out_ret_tcode[0] = " << kTVMOpaqueHandle << ";\n" | ||
<< " return 0;\n"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we need a default here ? prolly to give a meaningful error ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be nice to return a specific error code here (this was my original implementation), but unfortunately it's not easy to catch on the other side. we need to do a bit more legwork to be able to catch specific function return values in c++/python. as it stands, returning nullptr from this function is specific enough, and has the benefit that an exception flow isn't triggered by default for non-parameters.
<< "} // extern \"C\"\n" | ||
<< "#endif\n"; | ||
stream << " case " << kv.second->id << ":\n" | ||
<< " ((uint64_t*)out_ret_value)[0] = (uint64_t) (uintptr_t) " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a copy is introduced (Correct me if Im wrong), which is something we would want to avoid if possible in memory constrained devices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's just copying the pointer here. I believe that should be ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh ok. So kv,first is the array's name, right ? Any reason why is it casted to uint64_t ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct. ordinarily we would cast to void*, but since void* may be less than 64 bits, we cast to uint64_t which is the largest type stored in TVMValue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think void* should be enough here. void* would be less than 64 bits if the machine compiled for is less than 64bits. right? otherwise this will make it store the pointer in 64bit space where the machine could have a address space defined less than 64 bits, unless Im missing something here. In this specific case, the array defined for the constant will have an address depending the target machine it would be compiled for, hence my confusion.
Do we need to perform pointer arithmetic on this ? -- Im asking this because uintptr_t cast is used. If so, using uintptr_t should be sufficient instead of void*.
I guess this argument will break if TVMValue already uses 64bit spaces for addresses irrespective of what we do here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TVMValue is always 64 bits wide, though, because it's a union including int64_t. so when addressing it as an array, need to cast appropriately. i'm not sure if the uintptr_t cast is excessive--quite possibly it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright then. I will take your word for it. :). See if you can drop uintptr_t if that is not required given that it is going to be written to 64bit wide location anyway.
} | ||
} | ||
|
||
void NDArrayDataToC(::tvm::runtime::NDArray arr, int indent_chars, std::ostream& os) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great if a brief could be added to show how a simple NDArray would get converted and appear in the source.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
documented this function in codegen_params.h and added an example there.
} | ||
} | ||
|
||
void NDArrayDataToC(::tvm::runtime::NDArray arr, int indent_chars, std::ostream& os) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Clarification] Can we use a raw binary string (e.g. "\xeb\x2a) to represent the content here ? Do we need to show the element wise break down? I would reckon DLTensors would anyway take it as a void* right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could. the main thing this drags in is whether we should be determining things like the byte order and alignment of the target machine, or leave that to the compiler (LLVM or in this case, external C). i've been so far arguing not to do any of this--I don't really see a tangible benefit and it adds considerable complexity at this level that's hard to test. also, it makes it quite easy to inspect the generated parameters, which will often differ from the parameters originally supplied to relay.build().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The NDArray deletion part LGTM, see more discussions on the printing repr and whether or not w should use uint
src/target/source/codegen_params.cc
Outdated
num_elements *= shape_elem; | ||
} | ||
|
||
std::unique_ptr<DLManagedTensor, DLManagedTensorDeleter> tensor(arr.ToDLPack()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need ToDLPack, and can directly get DLTensor* from the arr.operator->
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/target/source/codegen_params.cc
Outdated
os.fill(' '); | ||
os.setf(std::ios::left, std::ios::adjustfield); | ||
if (arr_type.bits() == 32) { | ||
PrintArray<float>(tensor->dl_tensor.data, num_elements, one_element_size_bytes, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To put it in another way. We will need support for bfloat16 and fp16 now given that they are already part of the type system. Addtiionally, there might also be need to support customized data types.
The main reason to go for the uint repr is its bit accurate precision(we don't need to worry about printing out INF, nan, or loss of accuracy due to printing and parsing of floating pt -- although partly addressed by hex printing, still somewhat complicated) and simplicity in implementation.
Additionally, for big endian machines and small endian machines we are already using the bits as the only information in our runtime https://github.com/apache/tvm/blob/main/include/tvm/runtime/ndarray.h#L429, which means our runtime won't simply work beyond such cases. Right now we also did not see an example that goes beyond these two cases.
If we do think it is important to keep float and double printing as their own type. I think we could do that, but need to make sure the rest of the codepath uses uint so that it works for bfloat16, fp16 and other potential extension data types now.
src/runtime/graph/graph_runtime.cc
Outdated
void GraphRuntime::LinkedNDArrayDeleter(Object* container) { | ||
// container is the NDArray::Container which needs to get deleted. | ||
// The data member points to global const memory, so it does not need deleting. | ||
delete reinterpret_cast<NDArray::Container*>(container); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be static cast instead, since NDArray::Container* is a subclass of Object*, this can cause bug when Container and Object class have a different beginning offset due to subclassing and layout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, thanks. done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly LGTM bar others' pending comments.
LGTM now, @FrozenGene please take another look |
* refactor RPCSessionContext utils * Make TVMLogf platform-independent. * Some platforms need to use an alternate printf() to support basic things like %zu. Since %zu is platform-specific, we prefer to use a printf() that supports it or allow the platform to fix it up as needed.
* refactor RPCSessionContext utils * Make TVMLogf platform-independent. * Some platforms need to use an alternate printf() to support basic things like %zu. Since %zu is platform-specific, we prefer to use a printf() that supports it or allow the platform to fix it up as needed.
* refactor RPCSessionContext utils * Make TVMLogf platform-independent. * Some platforms need to use an alternate printf() to support basic things like %zu. Since %zu is platform-specific, we prefer to use a printf() that supports it or allow the platform to fix it up as needed.
Implements RFC: https://discuss.tvm.apache.org/t/rfc-linked-parameters-for-cpu-targets/8452
This PR implements changes to support generating Modules that contain parameters pre-linked. For CPU contexts, these modules don't require any copying before launching inference via GraphRuntime. Both C++ and C runtimes are supported.