Move Windows CI build to a 64-bit toolchain to fix 'out of heap space'. #15882
Conversation
Normally I would not combine different fixes into one PR, but when fighting simultaneous CI flakiness, I feel it's justified. With my latest commit, I've sped up test_random.test_shuffle to under 2 minutes to keep the CI from timing out at the 3-hour point. Previously, I'd seen test_shuffle take from 10 to 50 minutes. Not sure why the disparity; perhaps multiple CI runners are running on the same machine? I did this purely in the Python test-result-checking part, and did not alter the loop counts or logic of the test. @asitstands was the last significant contributor to this test.
Thanks a lot for your contribution! Great deep dive on finding one possible area of improvement, and a quick fix!
How many times did you run the test to verify it finishes in under 2 minutes (compared to the 10-50 minutes)?
Also, why is there such a huge variance (10-50 minutes)? Any thoughts? Wondering if it has less to do with the code and more with the infra.
start = subarr[0].asscalar()
seq = mx.nd.arange(start, start + stride, ctx=arr.context)
assert (subarr == seq).prod() == 1
assert (column0.sort() == seq).prod() == 1
Curious to know: what difference does moving the sort make?
There are no random shapes in test_shuffle, so its runtime, even given the sort, should be fairly consistent. My only explanation for the runtime variation is that maybe there are multiple CPU runners on the same machine. Looking for confirmation from @marcoabreu.
Moving the sort was just a style preference, not performance driven. The biggest runtime savings came from introducing the if stride > 1: clause to avoid needless work for the last big 1D test input. The rest was just some further perf polishing.
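As a minimal NumPy sketch of the kind of guard described (names here are hypothetical; the real test operates on mx.nd arrays), skipping the per-column permutation check when stride == 1 avoids needless sorting on the large flat input:

```python
import numpy as np

def check_shuffled(arr, stride):
    """Sketch only: verify a shuffle preserved content by checking that
    each column, once sorted, equals the expected arange sequence.
    For stride == 1 (the big flat 1-D input) the per-column sort would
    be redundant work, so it is skipped -- mirroring the 'if stride > 1:'
    guard described above. Names are illustrative, not the actual test's."""
    if stride > 1:
        expected = np.arange(stride)
        for j in range(arr.shape[1]):
            # sorting recovers the original column if shuffling only permuted it
            if not (np.sort(arr[:, j]) == expected).all():
                return False
    return True

# usage: shuffle each column of a 5x3 array independently
rng = np.random.default_rng(42)
shuffled = np.stack([rng.permutation(5) for _ in range(3)], axis=1)
```

The check itself stays in pure result-verification code, so the loop counts and logic of the test under measurement are untouched.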
I keep stumbling on flaky tests with this PR, so my last commit fixes laop_6. A bit of bad luck, I would say.
We run up to 4 jobs per windows-cpu machine, and 1 job per windows-gpu machine.
@apeforest @larroy @anirudh2290 Would be great to get this reviewed quickly and merged if it helps speed up the CI!
Hey, I have reconsidered this PR and I don't feel comfortable moving forward with it.
If we take this change, we make it impossible for users to compile with 32-bit, since we are no longer safeguarding against it and we're acknowledging that 32-bit compilation is not possible. The issue we're working around here is that our operator definitions are all defined in header files which are then copied into one HUGE object file.
The reason we're running into this limit is an issue with our architecture. I don't think it's such a big deal to change the compilation from preprocessing everything into a single object file to having separate objects. This solves the problem and will really improve our compile times, since this object is on the critical path.
I'd prefer if somebody changes the way we compile operators or starts a dev@ vote to officially deprecate 32-bit. Otherwise, I'm not comfortable moving forward.
@marcoabreu That is not 100% true though - for elementwise ops that is the case and I fully agree we should split them, but for most operators there is 1 file per operator. The issue in those cases comes from the fact that we need to support many combinations of input/output types (and things like accumulation types in some cases), which ends up in a long list of templates to instantiate.
If we reduce/split elementwise ops, is the problem solved then?
No, it is not. As I said, the number of templates that we need to generate in some other cases is a problem - that was the problem with softmax for example, which prompted e.g. this: https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/broadcast_reduce-inl.cuh#L620, which is bad, since it actually makes the Windows build different in terms of functionality from other builds.
Thanks for the change. LGTM. Could you describe in the PR why the tolerance changes to the other test files need to go together with this PR?
@marcoabreu let's unblock CI and discuss what you propose later; unblocking CI and development takes priority.
Yes please. I've been retriggering my 5 PRs for the last 2 days (at least 2-3 times per PR).
Hi, there might be a misunderstanding. The issue here is not the cause for the timeouts.
This reverts commit 9eaf3a5.
@larroy I looked again at the CI failure of laop_6: http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/windows-gpu/branches/PR-15882/runs/10/nodes/116/steps/186/log/?start=0. It appears to be unrelated to tolerances, so I reverted my change to that test. My change to test_shuffle was unrelated to tolerances; it just sped up the checking code without altering any functionality.
@DickJC123 I looked at the laop tests; one root cause is the multiplication by a "random projection" in checking the numeric gradient done in the test. This initial multiplication introduces floating-point error. I had a look at this and tried to address it, but I then had failures in other tests and ran out of the time allocated for this. https://github.com/apache/incubator-mxnet/blob/39bf4e062bac4fc23d7c8591005b99592e42ab25/python/mxnet/test_utils.py#L921
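A hedged sketch of the random-projection idea referenced above (all names and details here are illustrative, not the actual test_utils code): rather than finite-differencing every output element, the output is first projected onto a random vector, and that extra multiplication is one place floating-point error can enter the check.

```python
import numpy as np

def numeric_grad_projected(f, x, eps=1e-5, rng=None):
    """Central-difference gradient of g(x) = sum(f(x) * v) for a random
    projection vector v. Illustrative sketch only; the real gradient
    check lives in mxnet's test_utils. The projection multiply is a
    source of floating-point error, as discussed above."""
    rng = rng if rng is not None else np.random.default_rng(0)
    v = rng.standard_normal(np.shape(f(x)))  # random projection vector

    def g(y):
        return float(np.sum(f(y) * v))

    grad = np.zeros_like(x, dtype=float)
    it = np.nditer(x, flags=["multi_index"])
    for _ in it:
        idx = it.multi_index
        xp, xm = x.copy(), x.copy()
        xp[idx] += eps
        xm[idx] -= eps
        grad[idx] = (g(xp) - g(xm)) / (2 * eps)  # central difference
    return grad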
Thanks for the clarification. I already approved.
Marco, I understand this is a bug in the VS compiler version we use. I don't think this deprecates 32-bit support, as the target arch doesn't change. We are compiling for 32-bit on ARMv7, right?
We had the same issue with clang on Unix if I recall correctly. So I'd say that this is not a bug but rather a symptom.
Can we get rid of the compiler upgrade for now and observe if everything is fine now? For the other improvements I'd be happy to get them in. You can also just put them in a separate PR.
Why is this PR stuck?
This PR was created at a time when there were multiple CI issues (OOM on the Windows toolchain, CI timeouts due to test_shuffle, and laop_6 flakiness). This PR tried to solve all 3 problems in order to get to a passing/mergeable state. Since all of the issues have been solved in a fashion, I'm closing this PR. The approach this PR took to move the Windows CI to the 64-bit toolchain may be helpful as a reference in the future. Instead of adopting this PR though, we broke up some of the larger compiles and stayed on the 32-bit Windows toolchain. The way I look at it, by staying on the 32-bit toolchain, we are somewhat artificially forcing developers to keep their compilation units small. This is not a bad thing, since it helps a multi-threaded compile finish more quickly in the presence of errors. @ptrendx points out that with some valid templating approaches, reducing compile size may not always be possible. If we run up against a Windows compile that we don't want to work around by breaking up the file, ping me to dust off and resubmit this PR.
Happy to move forward with the upgrade to 64-bit.
@marcoabreu Sounds like if I resubmitted the core of this PR, you'd support it. Anything specific behind your shifted stance w.r.t. the Windows 64-bit toolchain? I know we went ahead and broke up some major compilation units, which you were pushing for.
It seems I was under the impression that we were dropping support for some Visual Studio version, but that seemed like a fluke. It's fine now, sorry for the confusion. Happy to move forward with the upgrade to 64-bit.
If we want to monitor binary sizes, a good way to do it would be to put a safeguard on CI that checks that separately.
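One way such a safeguard could look (a sketch; all artifact names, paths, and size budgets here are hypothetical, not actual CI configuration): a small script run as a CI step that compares produced artifacts against size budgets and fails the build if any is exceeded.

```python
import os

# Hypothetical size budgets in megabytes; real names and limits would be
# chosen for the actual build artifacts.
SIZE_LIMITS_MB = {"libmxnet.so": 600}

def oversized_binaries(build_dir, limits=SIZE_LIMITS_MB):
    """Return a list of 'name: actual > budget' strings for artifacts
    exceeding their size budget. A CI step would fail the build when
    this list is non-empty. Missing artifacts are skipped here; a real
    check might treat them as errors instead."""
    failures = []
    for name, limit_mb in limits.items():
        path = os.path.join(build_dir, name)
        if os.path.isfile(path):
            size_mb = os.path.getsize(path) / (1024 * 1024)
            if size_mb > limit_mb:
                failures.append(f"{name}: {size_mb:.1f} MB > {limit_mb} MB budget")
    return failures
```

Keeping this check separate from the compile step means a size regression shows up as its own CI failure rather than as an opaque toolchain OOM.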
@marcoabreu Are you referring to this PR: #16712? That PR for the OpenCV update is failing due to an "authorization issue" (hence I am stuck).
Description
'Out of heap space' errors have been plaguing our Windows CI builds for some time (#13958). The errors appear to be a result of the limited virtual memory of the 32-bit version of cl.exe being used. After some trial and error (due to not having direct access to the build machines), this PR upgrades the Windows compilation toolchain to 64-bit programs. The PR leaves the compiler version at VS2015. A VS2017 compiler appears to be installed on the CI machines but cannot be used yet because compatible OpenCV libraries are not present.
I recommend a quick review due to the inconvenience of having these sporadic build failures.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments