
[microNPU] Fix cascade scheduling stability #13428

Merged

Conversation

@Aleksei-grovety (Contributor) commented Nov 18, 2022

For Plans/Proposals, sorting by the number of cycles has been added for the case where the memory used matches.

cc @leandron @ekalda, @NicolaLancellotti
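
A minimal sketch of the tie-breaking this change introduces (hypothetical Python, not the actual C++ cascader code): memory usage stays the primary sort key, and the cycle count is only consulted when memory usage is equal.

```python
from typing import List, NamedTuple

class Candidate(NamedTuple):
    """Stand-in for a cascader Plan/Proposal together with its cost metrics."""
    name: str
    memory_usage: int  # bytes
    cycles: int        # estimated cycle count

def sort_candidates(candidates: List[Candidate]) -> List[Candidate]:
    # Primary key: memory usage; secondary key: cycles, so candidates with
    # equal memory usage are ordered deterministically rather than arbitrarily.
    return sorted(candidates, key=lambda c: (c.memory_usage, c.cycles))

candidates = [
    Candidate("proposal_a", memory_usage=4096, cycles=1200),
    Candidate("proposal_b", memory_usage=4096, cycles=1100),  # same memory, fewer cycles
    Candidate("proposal_c", memory_usage=2048, cycles=2000),
]
print([c.name for c in sort_candidates(candidates)])
# ['proposal_c', 'proposal_b', 'proposal_a']
```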

@tvm-bot (Collaborator) commented Nov 18, 2022

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

github-actions bot requested review from @ekalda and @leandron on November 18, 2022 10:15
@ekalda (Contributor) left a comment

Thanks @alexey-yazev, looks good! :)

What I gather is that the instability in the cascader comes from nondeterministic sorting when two Plans/Proposals have the same memory usage. It makes sense to me then to look at the cycle count as a differentiating metric. However, in the case where we have identical performance and memory use, I can't think of a reason why one of the Plans/Proposals should be advantageous over the other, so I wonder if this could be simplified by just removing one of the Plans or Proposals?

@Aleksei-grovety (Contributor, Author) commented:

> Thanks @alexey-yazev, looks good! :)
>
> What I gather is that the instability in the cascader comes from nondeterministic sorting when two Plans/Proposals have the same memory usage. It makes sense to me then to look at the cycle count as a differentiating metric. However, in the case where we have identical performance and memory use, I can't think of a reason why one of the Plans/Proposals should be advantageous over the other, so I wonder if this could be simplified by just removing one of the Plans or Proposals?

Thanks @ekalda!
I agree that elements with the same metrics have no advantage over each other. It seems that the real problem is in the calculation of the metrics: from launch to launch the resulting proposal is obtained with the same metrics, yet a different amount of memory ends up being allocated. I'll try to figure it out.

@Aleksei-grovety (Contributor, Author) commented:

I suppose checking for the equality of allocated_size and workspace_size in test_networks.py is incorrect: when the cascader is used with striping enabled, a proposal is selected with the condition proposal.memory_usage < workspace_size, and allocated_size and proposal.memory_usage are calculated differently (unified static memory planning is used to calculate allocated_size, while proposal.memory_usage is calculated as the sum of all tensors, taking striping into account for intermediate tensors).
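
A minimal sketch of why the two numbers differ (hypothetical Python; names such as tensor_sizes and stripe_factors are illustrative, not the cascader's API): the cascader's estimate sums tensor sizes with intermediates shrunk by striping, whereas allocated_size comes from the USMP planning the final graph.

```python
def proposal_memory_usage(tensor_sizes, stripe_factors):
    """Sum of all tensor sizes, with intermediate tensors shrunk by their striping factor."""
    total = 0
    for name, size in tensor_sizes.items():
        # With striping, only a stripe-sized slice of an intermediate tensor
        # needs to be live at once, so its contribution is divided accordingly.
        total += size // stripe_factors.get(name, 1)
    return total

tensor_sizes = {"ifm": 16384, "intermediate": 65536, "ofm": 16384}
stripe_factors = {"intermediate": 4}  # intermediate tensor processed in 4 stripes

memory_usage = proposal_memory_usage(tensor_sizes, stripe_factors)
print(memory_usage)  # 49152 -- the estimate compared against workspace_size
```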

@ekalda (Contributor) commented Nov 24, 2022

> > Thanks @alexey-yazev, looks good! :)
> > What I gather is that the instability in the cascader comes from nondeterministic sorting when two Plans/Proposals have the same memory usage. It makes sense to me then to look at the cycle count as a differentiating metric. However, in the case where we have identical performance and memory use, I can't think of a reason why one of the Plans/Proposals should be advantageous over the other, so I wonder if this could be simplified by just removing one of the Plans or Proposals?
>
> Thanks @ekalda! I agree that elements with the same metrics have no advantage over each other. It seems that the real problem is in the calculation of the metrics: from launch to launch the resulting proposal is obtained with the same metrics, yet a different amount of memory ends up being allocated. I'll try to figure it out.

I suppose there can be two kinds of instability there:
(1) Choosing a different Proposal from launch to launch. Even if the Proposals have the same memory and cycle counts according to the cascader, the more accurate memory planner can give differing results for Proposals with different topologies.
(2) We choose an identical Proposal every time, but the memory planner allocates a different amount of memory for the same proposal. That sounds like a memory planner instability.

(A bit of a stab in the dark there)

@ekalda (Contributor) commented Nov 24, 2022

> I suppose checking for the equality of allocated_size and workspace_size in test_networks.py is incorrect: when the cascader is used with striping enabled, a proposal is selected with the condition proposal.memory_usage < workspace_size, and allocated_size and proposal.memory_usage are calculated differently (unified static memory planning is used to calculate allocated_size, while proposal.memory_usage is calculated as the sum of all tensors, taking striping into account for intermediate tensors).

Yes, I think you are right; thinking about it, we can't really check for the equality of allocated_size and workspace_size. I suppose that when we test for allocated_size < workspace_size we are checking that the Proposal we chose (based on workspace_size) still fits into workspace_size once we have done memory planning on the resulting graph.
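
For illustration, a hedged sketch of the kind of check being described (not the actual test_networks.py code; the numbers are made up): only the upper bound is asserted, not equality.

```python
def check_proposal_fits(allocated_size: int, workspace_size: int) -> None:
    # allocated_size comes from USMP on the final graph; workspace_size is the
    # budget the Proposal was selected against. They are computed differently,
    # so only "fits within the budget" can be asserted.
    assert allocated_size <= workspace_size, (
        f"USMP allocated {allocated_size} bytes, exceeding the "
        f"{workspace_size}-byte workspace the Proposal was selected for"
    )

check_proposal_fits(allocated_size=287104, workspace_size=300000)
```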

@Aleksei-grovety (Contributor, Author) commented:

Without running the StorageRewrite pass (the changes were merged in PR #13365), the amount of allocated memory is the same from launch to launch, despite the fact that different proposals are applied.

> I suppose there can be two kinds of instability there:
> (1) Choosing a different Proposal from launch to launch. Even if the Proposals have the same memory and cycle counts according to the cascader, the more accurate memory planner can give differing results for Proposals with different topologies.
> (2) We choose an identical Proposal every time, but the memory planner allocates a different amount of memory for the same proposal. That sounds like a memory planner instability.

It is the first kind, and it happens when the StorageRewrite pass is run.

@Aleksei-grovety (Contributor, Author) commented Nov 28, 2022

For this pull request, will it be enough to add an additional parameter for sorting Plans/Proposals, or do I need to investigate the problem with different memory allocations when running the StorageRewrite pass?

@ekalda (Contributor) commented Dec 1, 2022

> For this pull request, will it be enough to add an additional parameter for sorting Plans/Proposals, or do I need to investigate the problem with different memory allocations when running the StorageRewrite pass?

Sorry for the delay on this - I don't think we should spend much time investigating the instability that results from using StorageRewrite since Ethos-U is intended to be run with the USMP, so debugging the internals of StorageRewrite seems a bit out of scope here.

@Aleksei-grovety force-pushed the ethosu-cascade-scheduling-stability-bugfix branch from a0f521b to 10e8390 on December 2, 2022 06:25
@Aleksei-grovety (Contributor, Author) commented Dec 2, 2022

> Sorry for the delay on this - I don't think we should spend much time investigating the instability that results from using StorageRewrite since Ethos-U is intended to be run with the USMP, so debugging the internals of StorageRewrite seems a bit out of scope here.

Thanks, in the code changes I have kept only the additional sorting conditions.

The reason different amounts of memory were allocated from launch to launch was that, when determining the optimal proposals, the collection can contain elements with the same cost metrics; the first of these is taken as optimal and the rest are discarded. The problem was solved by adding an additional sorting condition on the shapes from the StripeConfigs for the case when the metrics match.
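
A hedged sketch of the final tie-breaking described above (hypothetical Python; the real change lives in the cascader's C++ plan/proposal generation): when both memory usage and cycle count match, the shapes from the StripeConfigs are compared so that the same element is chosen as optimal on every run.

```python
from typing import List, NamedTuple, Tuple

class Candidate(NamedTuple):
    """Stand-in for a Plan/Proposal; stripe_shapes mimics the shapes of its StripeConfigs."""
    memory_usage: int
    cycles: int
    stripe_shapes: Tuple[Tuple[int, ...], ...]

def pick_optimal(candidates: List[Candidate]) -> Candidate:
    # With only (memory_usage, cycles) as the key, equal-cost candidates keep an
    # arbitrary order and "the first one wins" can differ between runs. Adding the
    # stripe shapes as a final key makes the choice deterministic.
    return min(candidates, key=lambda c: (c.memory_usage, c.cycles, c.stripe_shapes))

a = Candidate(4096, 1000, ((1, 4, 16, 16), (1, 4, 4, 8)))
b = Candidate(4096, 1000, ((1, 2, 16, 32), (1, 4, 4, 8)))  # same costs, different stripes
assert pick_optimal([a, b]) == pick_optimal([b, a])  # input order no longer matters
```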
@Aleksei-grovety force-pushed the ethosu-cascade-scheduling-stability-bugfix branch from 10e8390 to 30a4503 on December 2, 2022 18:07
@ekalda (Contributor) left a comment

LGTM!

@ekalda merged commit 012551f into apache:main on Dec 5, 2022
@ekalda (Contributor) commented Dec 5, 2022

Thanks @alexey-yazev!
