Dump nodes with potential overflow in half conversion #23363
You can commit the suggested changes from lintrunner.
@@ -156,6 +156,7 @@ def run_ort_pipeline(
     batch_count,
     start_memory,
     memory_monitor_type,
+    skip_warmup: bool = False,
What is the use case for skip_warmup=True? Is it for a one-time test with a particular input?
Having an option to skip the warm-up runs is useful. You do not need to run a model multiple times to collect dumped data or NVTX traces.
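A minimal sketch of how such a flag can gate warm-up runs. The names run_benchmark, run_once and warmup_runs are hypothetical and are not taken from benchmark.py; only the skip_warmup parameter name comes from the diff above.

```python
from typing import Callable, List


def run_benchmark(
    run_once: Callable[[], float],
    batch_count: int,
    warmup_runs: int = 1,
    skip_warmup: bool = False,
) -> List[float]:
    """Call run_once batch_count times and return the per-run results.

    Warm-up runs are executed first and their results are discarded,
    unless skip_warmup is True (useful when every run dumps tensors
    or is being traced, so extra runs only add cost).
    """
    if not skip_warmup:
        for _ in range(warmup_runs):
            run_once()  # result intentionally discarded

    return [run_once() for _ in range(batch_count)]
```

With skip_warmup=True the callable is invoked exactly batch_count times, so a dump or trace contains only the measured runs.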
@@ -402,6 +462,12 @@ const NodeDumpOptions& NodeDumpOptionsFromEnvironmentVariables() {
   opts.snippet_threshold = ParseEnvironmentVariableWithDefault<int>(env_vars::kSnippetThreshold, kDefaultSnippetThreshold);
   opts.snippet_edge_items = ParseEnvironmentVariableWithDefault<int>(env_vars::kSnippetEdgeItems, kDefaultSnippetEdgeItems);

+  constexpr int kMaxHalfThreshold = 65504;
It seems we eventually need to compare with half, so why not give kMaxHalfThreshold a half type?
It is not used for comparison with half. One purpose is to validate the integer threshold value parsed from the environment variable.
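The validation described here can be sketched in Python as follows. This is an illustration, not the C++ code from the PR: the function name is hypothetical, and raising ValueError for an out-of-range or non-positive value is an assumption (the PR may handle this case differently). The variable name, the 65504 upper bound and the 50000 default are taken from the PR description.

```python
import os
from typing import Mapping, Optional

MAX_HALF = 65504  # largest finite float16 value
DEFAULT_THRESHOLD = 50000  # default stated in the PR description
ENV_VAR = "ORT_DEBUG_NODE_IO_HALF_OVERFLOW_THRESHOLD"


def read_half_overflow_threshold(environ: Optional[Mapping[str, str]] = None) -> int:
    """Parse the overflow threshold as an integer and check its range."""
    environ = os.environ if environ is None else environ
    raw = environ.get(ENV_VAR)
    if raw is None:
        return DEFAULT_THRESHOLD

    value = int(raw)  # raises ValueError on non-integer text
    # Assumption: reject values outside (0, 65504] instead of clamping.
    if not 0 < value <= MAX_HALF:
        raise ValueError(f"{ENV_VAR} must be in (0, {MAX_HALF}], got {value}")
    return value
```

Keeping the bound as a plain integer matches the reply above: the constant is used to check the parsed integer, not to compare against half values.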
   TensorStatisticsData tensor_statistics;
   DumpTensor(dump_options, *tensor, tensor_metadata, tensor_statistics, session_state);

+  if (check_half_overflow && tensor_statistics.is_float) {
To detect half overflow, the whole dump analysis should be conducted on an fp32 model, right? Are there any constraints on the use of this function to ensure that users do not pass an fp16/int4 model as input by accident?
By design, only float tensors are analyzed; other data types are safely ignored. This is consistent with the behavior of the half conversion script float16.py.
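A rough sketch of that behavior, assuming tensors are represented as a dtype string plus a flat sequence of values. The function name, the dtype strings and the strict "greater than" comparison are assumptions for illustration; the PR's C++ code works on collected tensor statistics rather than on Python lists.

```python
from typing import Sequence

DEFAULT_THRESHOLD = 50000.0  # default stated in the PR description


def has_potential_half_overflow(
    dtype: str,
    values: Sequence[float],
    threshold: float = DEFAULT_THRESHOLD,
) -> bool:
    """Return True if a float32 tensor has a value whose magnitude exceeds threshold.

    Tensors of any other data type are ignored and never flagged.
    """
    if dtype != "float32":
        return False
    return any(abs(v) > threshold for v in values)
```

Feeding an fp16 or integer tensor to this check is harmless: it simply reports no overflow, so no node would be added to the block list because of it.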
@@ -491,11 +561,20 @@ void DumpNodeInputs(
   const bool is_shape_set = (dump_options.dump_flags & NodeDumpOptions::DumpFlags::Shape) != 0;
   PrintIf(is_shape_set, MakeString(" Shape: ", shape, "\n"));

-  if ((dump_options.dump_flags & NodeDumpOptions::DumpFlags::InputData) != 0) {
+  if ((dump_options.dump_flags & NodeDumpOptions::DumpFlags::InputData) != 0 || check_half_overflow) {
It seems this block list is at the node level. Why not at the node-type level? We only benchmark a few inputs, so a SkipLayerNorm node could still overflow on data that is not in the benchmark. Why not block the whole node type?
Once you have a node-level block list, you can decide whether it is better to extend it to a node-type block list.
A node-level block list has its pros and cons. Pro: it can achieve better performance when only a subset of nodes (for example, 50% of the SkipLayerNorm nodes) overflows. Con: the model may still overflow on new data (the risk is smaller if you test more samples and use a lower threshold).
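One way to make that decision mechanical is sketched below. The helper split_block_lists and its promote_ratio cutoff are hypothetical and not part of this PR; the idea is to promote an op type to a type-level list (called op_block_list here) only when a large enough fraction of its nodes overflowed, and to keep the remaining nodes in node_block_list.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def split_block_lists(
    node_op_types: Dict[str, str],
    overflow_nodes: Iterable[str],
    promote_ratio: float = 1.0,
) -> Tuple[List[str], List[str]]:
    """Return (node_block_list, op_block_list).

    node_op_types maps every node name in the model to its op type.
    An op type is promoted to op_block_list when the fraction of its
    nodes that overflowed is at least promote_ratio.
    """
    # Ignore names that are not present in the model.
    overflow = {name for name in overflow_nodes if name in node_op_types}

    total = Counter(node_op_types.values())
    flagged = Counter(node_op_types[name] for name in overflow)

    op_block_list = sorted(
        op for op, count in flagged.items() if count / total[op] >= promote_ratio
    )
    promoted = set(op_block_list)
    node_block_list = sorted(
        name for name in overflow if node_op_types[name] not in promoted
    )
    return node_block_list, op_block_list
```

A lower promote_ratio trades some performance for safety on unseen data, which is the trade-off described in the reply above.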
Description
Add a tool to generate the node_block_list used by the float16 conversion tool.
We already have a feature to dump statistics (such as min and max) for each node input/output. However, when the model is large, it is time-consuming to derive from that data the list of nodes that need to be kept in float32.
This tool speeds up the process by outputting a list of nodes that have potential overflow in float-to-half conversion.
Usage: build onnxruntime from source with --cmake_extra_defines onnxruntime_DEBUG_NODE_INPUTS_OUTPUTS=1, then set some environment variables before running the float32 optimized ONNX model.
The threshold ORT_DEBUG_NODE_IO_HALF_OVERFLOW_THRESHOLD must be <= 65504. The default value is 50000 if the environment variable is not set. It is better to leave some margin if the number of test samples is not large enough.
As a demo, we add an option --skip_warmup to benchmark.py for Flux, so that no time is spent dumping data during warm-up runs.
Example snippet of stdout (each inference session prints such a summary when the session ends):
Then you can use the Python script to convert the corresponding model to float16.
Motivation and Context
This tool generates the node_block_list used in the float16 conversion of Stable Diffusion 3.x and Flux models in #22986.
In a Stable Diffusion or Flux pipeline, there are multiple models, and each model may have multiple session runs. Without a proper tool, it is time-consuming to get the node_block_list for each model.