Move Windows CI build to a 64-bit toolchain to fix 'out of heap space'. #15882
Conversation
Normally I would not combine different fixes into one PR, but when fighting simultaneous CI flakiness, I feel it's justified. With my latest commit, I've sped up test_random.test_shuffle to under 2 minutes to keep the CI from timing out at the 3-hour point. Previously, I'd seen test_shuffle take from 10 to 50 minutes. Not sure why the disparity; perhaps multiple CI runners are running on the same machine? I did this purely in the Python test-result-checking part, and did not alter the loop counts or logic of the test. @asitstands was the last significant contributor to this test.
Thanks a lot for your contribution! Great deep dive on finding one possible area of improvement, and a quick fix!
How many times did you run the test to verify it finishes in under 2 minutes (compared to the 10-50 minutes)?
Also, why is there such a huge variance (10-50 minutes)? Any thoughts? Wondering if it has less to do with the code and more with the infra.
start = subarr[0].asscalar()
seq = mx.nd.arange(start, start + stride, ctx=arr.context)
assert (subarr == seq).prod() == 1
assert (column0.sort() == seq).prod() == 1
Curious to know: what difference does moving the sort make?
There are no random shapes in test_shuffle, so its runtime, even given the sort, should be fairly consistent. My only explanation for the runtime variation is that maybe there are multiple CPU runners on the same machine. Looking for confirmation from @marcoabreu.
Moving the sort was just a style preference, not performance driven. The biggest runtime savings came from introducing the if stride > 1: clause to avoid needless work for the last big 1D test input. The rest was just some further perf polishing.
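As a minimal NumPy sketch of the kind of guard described (names here are hypothetical; the real test operates on mx.nd arrays), skipping the per-column permutation check when stride == 1 avoids needless sorting on the large flat input:

```python
import numpy as np

def check_shuffled(arr, stride):
    """Sketch only: verify a shuffle preserved content by checking that
    each column, once sorted, equals the expected arange sequence.
    For stride == 1 (the big flat 1-D input) the per-column sort would
    be redundant work, so it is skipped -- mirroring the 'if stride > 1:'
    guard described above. Names are illustrative, not the actual test's."""
    if stride > 1:
        expected = np.arange(stride)
        for j in range(arr.shape[1]):
            # sorting recovers the original column if shuffling only permuted it
            if not (np.sort(arr[:, j]) == expected).all():
                return False
    return True

# usage: shuffle each column of a 5x3 array independently
rng = np.random.default_rng(42)
shuffled = np.stack([rng.permutation(5) for _ in range(3)], axis=1)
```

The check itself stays in pure result-verification code, so the loop counts and logic of the test under measurement are untouched.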
I keep stumbling on flaky tests with this PR, so my last commit fixes laop_6. A bit of bad luck, I would say.
We run up to 4 jobs per windows-cpu machine, and 1 job per windows-gpu machine.
@apeforest @larroy @anirudh2290 Would be great to get this reviewed quickly and merged if it helps speed up the CI!
Hey, I have reconsidered this PR and I don't feel comfortable moving forward with it.
If we take this change, we make it impossible for users to compile with 32-bit, since we are no longer safeguarding against it and we're acknowledging that 32-bit compilation is not possible. The issue we're working around here is that our operator definitions are all defined in header files which are then copied into one HUGE object file.
The reason we're running into this limit is an issue with our architecture. I don't think it's such a big deal to change the compilation from preprocessing everything into a single object file to having separate objects. This solves the problem and will really improve our compile times, since this object is on the critical path.
I'd prefer if somebody changes the way we compile operators or starts a dev@ vote to officially deprecate 32-bit. Otherwise, I'm not comfortable moving forward.
@marcoabreu That is not 100% true though - for elementwise ops that is the case and I fully agree we should split them, but for most operators there is 1 file per operator. The issue in those cases comes from the fact that we need to support many combinations of input/output types (and things like accumulation types in some cases), which ends up in a long list of templates to instantiate.
If we reduce/split elementwise ops, is the problem solved then?
No, it is not. As I said, the number of templates that we need to generate in some other cases is a problem - that was the problem with softmax for example, which prompted e.g. this: https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/broadcast_reduce-inl.cuh#L620, which is bad, since it actually makes the Windows build different in terms of functionality from other builds.
Thanks for the change. LGTM. Could you describe in the PR why the tolerance changes to the other test files need to go together with this PR?
@marcoabreu let's unblock CI and discuss what you propose later; unblocking CI and development takes priority.
Yes please. I've been retriggering my 5 PRs for the last 2 days (at least 2-3 times per PR).
Hi, there might be a misunderstanding. The issue here is not the cause for the timeouts.
This reverts commit 9eaf3a5.
@larroy I looked again at the CI failure of laop_6: http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/windows-gpu/branches/PR-15882/runs/10/nodes/116/steps/186/log/?start=0. It appears to be unrelated to tolerances, so I reverted my change to that test. My change to test_shuffle was unrelated to tolerances; it just sped up the checking code without altering any functionality.
@DickJC123 I looked at the laop tests; one root cause is the multiplication by a "random projection" in checking the numeric gradient done in the test. This initial multiplication introduces floating-point error. I had a look at this and tried to address it, but I then had failures in other tests and ran out of the time allocated for this. https://github.com/apache/incubator-mxnet/blob/39bf4e062bac4fc23d7c8591005b99592e42ab25/python/mxnet/test_utils.py#L921
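A hedged sketch of the random-projection idea referenced above (all names and details here are illustrative, not the actual test_utils code): rather than finite-differencing every output element, the output is first projected onto a random vector, and that extra multiplication is one place floating-point error can enter the check.

```python
import numpy as np

def numeric_grad_projected(f, x, eps=1e-5, rng=None):
    """Central-difference gradient of g(x) = sum(f(x) * v) for a random
    projection vector v. Illustrative sketch only; the real gradient
    check lives in mxnet's test_utils. The projection multiply is a
    source of floating-point error, as discussed above."""
    rng = rng if rng is not None else np.random.default_rng(0)
    v = rng.standard_normal(np.shape(f(x)))  # random projection vector

    def g(y):
        return float(np.sum(f(y) * v))

    grad = np.zeros_like(x, dtype=float)
    it = np.nditer(x, flags=["multi_index"])
    for _ in it:
        idx = it.multi_index
        xp, xm = x.copy(), x.copy()
        xp[idx] += eps
        xm[idx] -= eps
        grad[idx] = (g(xp) - g(xm)) / (2 * eps)  # central difference
    return grad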
Thanks for the clarification. I already approved.
Marco, I understand this is a bug in the VS compiler version we use. I don't think this deprecates 32-bit support, as the target arch doesn't change. We are compiling for 32-bit on ARMv7, right?
We had the same issue with clang on Unix if I recall correctly. So I'd say that this is not a bug but rather a symptom.
Can we get rid of the compiler upgrade for now and observe if everything is fine now? For the other improvements I'd be happy to get them in. You can also just put them in a separate PR.
Why is this PR stuck?
This PR was created at a time when there were multiple CI issues (OOM on the Windows toolchain, CI timeouts due to test_shuffle, and laop_6 flakiness). This PR tried to solve all 3 problems in order to get to a passing/mergeable state. Since all of the issues have been solved in a fashion, I'm closing this PR. The approach this PR took to move the Windows CI to the 64-bit toolchain may be helpful as a reference in the future. Instead of adopting this PR though, we broke up some of the larger compiles and stayed on the 32-bit Windows toolchain. The way I look at it, by staying on the 32-bit toolchain, we are somewhat artificially forcing developers to keep their compilation units small. This is not a bad thing, since it helps a multi-threaded compile finish more quickly in the presence of errors. @ptrendx points out that with some valid templating approaches, reducing compile size may not always be possible. If we run up against a Windows compile that we don't want to work around by breaking up the file, ping me to dust off and resubmit this PR.
Happy to move forward with the upgrade to 64-bit.
@marcoabreu Sounds like if I resubmitted the core of this PR, you'd support it. Anything specific behind your shifted stance w.r.t. the Windows 64-bit toolchain? I know we went ahead and broke up some major compilation units, which you were pushing for.
It seems I was under the impression that we were dropping support for some Visual Studio version, but that seemed like a fluke. It's fine now, sorry for the confusion. Happy to move forward with the upgrade to 64-bit.
If we want to monitor binary sizes, a good way to do it would be to put a safeguard on CI that checks that separately.
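One way such a safeguard could look (a sketch; all artifact names, paths, and size budgets here are hypothetical, not actual CI configuration): a small script run as a CI step that compares produced artifacts against size budgets and fails the build if any is exceeded.

```python
import os

# Hypothetical size budgets in megabytes; real names and limits would be
# chosen for the actual build artifacts.
SIZE_LIMITS_MB = {"libmxnet.so": 600}

def oversized_binaries(build_dir, limits=SIZE_LIMITS_MB):
    """Return a list of 'name: actual > budget' strings for artifacts
    exceeding their size budget. A CI step would fail the build when
    this list is non-empty. Missing artifacts are skipped here; a real
    check might treat them as errors instead."""
    failures = []
    for name, limit_mb in limits.items():
        path = os.path.join(build_dir, name)
        if os.path.isfile(path):
            size_mb = os.path.getsize(path) / (1024 * 1024)
            if size_mb > limit_mb:
                failures.append(f"{name}: {size_mb:.1f} MB > {limit_mb} MB budget")
    return failures
```

Keeping this check separate from the compile step means a size regression shows up as its own CI failure rather than as an opaque toolchain OOM.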
@marcoabreu Are you referring to this PR: #16712? That PR for the OpenCV update is failing due to an "authorization issue" (hence I am stuck).
Description
'Out of heap space' errors have been plaguing our Windows CI builds for some time (#13958). The errors appear to be a result of the limited virtual memory of the 32-bit version of cl.exe being used. After some trial and error (due to not having direct access to the build machines), this PR upgrades the Windows compilation toolchain to 64-bit programs. The PR leaves the compiler version at VS2015. A VS2017 compiler appears to be installed on the CI machines but cannot be used yet because compatible OpenCV libraries are not present.
I recommend a quick review due to the inconvenience of having these sporadic build failures.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments