Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Move Windows CI build to a 64-bit toolchain to fix 'out of heap space'. #15882

Closed
wants to merge 18 commits into from

Conversation

DickJC123
Copy link
Contributor

@DickJC123 DickJC123 commented Aug 13, 2019

Description

'Out of heap space' errors having been plaguing our Windows CI builds for some time (#13958). The errors appear to be a result of the limited virtual memory of the 32-bit version of cl.exe being used. After some trial and error (due to not having direct access to the build machines) this PR upgrades the Windows compilation toolchain to 64-bit programs. The PR leaves the compiler version at VS2015. A VS2017 compiler appears to be installed onto the CI machines but cannot be used yet because compatible OpenCV libraries are not present.

I recommend a quick review due to the inconvenience of having these sporadic build failures.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • [X ] Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@DickJC123 DickJC123 changed the title Windows CI build experiments. Fixes 'out of heap space' errors? Move Windows CI build to a 64-bit toolchain to fix 'out of heap space'. Aug 14, 2019
@DickJC123
Copy link
Contributor Author

Normally I would not combine different fixes into one PR, but when fighting simultaneous CI flakiness, I feel it's justified. With my latest commit, I've sped up test_random.test_shuffle to under 2 minutes to keep the CI from timing out at the 3 hour point. Previously, I'd seen test_shuffle take from 10 - 50 minutes. Not sure why the disparity- perhaps multiple CI runners are running on the same machine? I did this purely in the Python test-result-checking part, and did not alter the loop-counts or logic of the test. @asitstands was the last significant contributor to this test.

Copy link
Contributor

@ChaiBapchya ChaiBapchya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for your contribution! Great deep dive (on finding out one possible area of improvement and this quick fix!
How many times did you run the test to verify it ends in 2mins (compared to the 10-50mins)
Also why is there such a huge variance 10-50mins? Any thoughts? Wondering if it has to do less with the code and more with the infra

start = subarr[0].asscalar()
seq = mx.nd.arange(start, start + stride, ctx=arr.context)
assert (subarr == seq).prod() == 1
assert (column0.sort() == seq).prod() == 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious to know what difference it makes by moving sort?

Copy link
Contributor Author

@DickJC123 DickJC123 Aug 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no random shapes in test_shuffle so its runtime, even give the sort, should be fairly consistent. My only explanation for the runtime variation is that maybe there's multiple cpu runners on the same machine. Looking for confirmation from @marcoabreu.

Moving the sort was just a style preference, not performance driven. The biggest runtime savings came from introducing the if stride > 1: clause to avoid needless work for the last big 1D test input. The rest was just some further perf polishing.

I keep stumbling on flakey tests with this PR, so my last commit is fixing laop_6. A bit of bad luck I would say.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we run up to 4 jobs per windows-cpu machine. 1 job per windows-gpu machine.

@ChaiBapchya
Copy link
Contributor

@apeforest @larroy @anirudh2290 Would be great to get this reviewed quickly and merged if it helps speed up the CI!

Copy link
Contributor

@marcoabreu marcoabreu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, I have reconsidered this PR and I don't feel comfortable moving forward with it.

If we take this change, this means that we make it impossible for users to compile with 32bit since we are not safeguarding against it anymore and we're acknowledging that 32bit compilation is not possible. The issue we're working around here is that our operator definitions are all defined in headerfiles which are then copied into one HUGE object file.

The reason we're running into this limit here is due to issue with our architecture. I don't think it's such a big deal to change the compilation from preprocessing everything into a single object file to having separate objects. This solves this problem and it will really improve our compile times since this object is on the critical path.

I'd prefer if somebody changes the way how we compile operators or starts a dev@ vote to officially deprecate 32bit. Otherwise, I'm not comfortable moving forward.

@ptrendx
Copy link
Member

ptrendx commented Aug 15, 2019

@marcoabreu That is not 100% true though - for elementwise ops that is the case and I fully agree we should split them, but for most operators there is 1 file per operator. The issue in those cases comes from the fact that we need to support many combinations input/output types (and things like accumulation types in some cases) which ends up in a long list of templates to instantiate.

@marcoabreu
Copy link
Contributor

If we reduce/split elementwise ops, is the problem solved then?

@ptrendx
Copy link
Member

ptrendx commented Aug 15, 2019

No, it is not. As I said, the number of templates that we need to generate in some other cases is a problem - that was the problem with softmax for example, which prompted e.g. this: https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/broadcast_reduce-inl.cuh#L620 which is bad, since it actually makes Windows build different in terms of functionality than other builds.

Copy link
Contributor

@larroy larroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change. LGTM. Could you describe in the PR the changes to the other test files regarding tolerance are needed to go together with the PR.

@larroy
Copy link
Contributor

larroy commented Aug 15, 2019

@marcoabreu let's unblock CI and discuss what you propose later. Blocking a critical change to unblock CI and development takes priority.

@ChaiBapchya
Copy link
Contributor

Yes please. I've been retriggering my 5 PRs for last 2 days (atleast 2-3 times per PR)
Would be great if we address this time-out issue!

@marcoabreu
Copy link
Contributor

Hi, there might be a misunderstanding. The issue here is not the cause for the timeouts.

@DickJC123
Copy link
Contributor Author

@larroy I looked again at the CI failure of laop_6: http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/windows-gpu/branches/PR-15882/runs/10/nodes/116/steps/186/log/?start=0. It appears to be unrelated to tolerances, so I reverted my change to that test.

My change to test_shuffle was unrelated to tolerances- just sped up the checking code without altering any functionality.

@larroy
Copy link
Contributor

larroy commented Aug 16, 2019

@DickJC123 I looked at laop tests, one root cause is the multiplication by a "random projection" in checking the numeric gradient done in the test, this initial multiplication creates an introduction of fp error. I had a look at this and tried to address, but I then had failures in other tests and run ouf of time allocated for this. https://github.com/apache/incubator-mxnet/blob/39bf4e062bac4fc23d7c8591005b99592e42ab25/python/mxnet/test_utils.py#L921

#15770

@larroy
Copy link
Contributor

larroy commented Aug 16, 2019

Thanks for the clarification. I already approved.

@larroy
Copy link
Contributor

larroy commented Aug 17, 2019

Marco. I understand this is a bug in the vs compiler version we use. I don't think this deprecates 32 bit support as the target arch doesn't change. We are compiling for 32 bit in arm7 right?

@marcoabreu
Copy link
Contributor

marcoabreu commented Aug 17, 2019

We had the same issue with clang unix if I recall correctly. So I'd say that this is not a bug but rather a symptom.

@roywei roywei added the CI label Aug 19, 2019
@marcoabreu
Copy link
Contributor

Can we get rid of the compiler upgrade for now and observe if everything is fine now? For the other improvements I'd be happy to get them in. You can also just put them in a separate PR.

@larroy
Copy link
Contributor

larroy commented Nov 20, 2019

Why is this PR stuck?

@DickJC123
Copy link
Contributor Author

This PR was created at a time when there were multiple CI issues (OOM on Windows toolchain, CI timeouts due to test_shuffle, and laop_6 flakeyness). This PR tried to solve all 3 problems in order to get to a passing/mergeable state. Since all of the issues have been solved in a fashion, I'm closing this PR.

The approach this PR took to move the Windows CI to the 64-bit toolchain may be helpful as a reference in the future. Instead of adopting this PR though, we broke up some of the larger compiles and stayed on the 32-bit windows toolchain. The way I look at it, by staying on the 32-bit toolchain, we are somewhat artificially forcing developers to keep their compilation units small. This is not a bad thing, since it helps a multi-threaded compile to finish more quickly in the presence of errors. @ptrendx points out that with some valid templating approaches, reducing compile size may not be always possible.

If we run up against a Windows compile that we don't want to work around by breaking up the file, ping me to dust off and resubmit this PR.

@DickJC123 DickJC123 closed this Nov 21, 2019
@marcoabreu
Copy link
Contributor

Happy to move forward with the upgrade to 64bit

@DickJC123
Copy link
Contributor Author

@marcoabreu Sounds like if I resubmitted the core of this PR, you'd support it. Anything specific behind your shifted stance w.r.t. windows 64-bit toolchain? I know we went ahead and broke up some major compilation units, which you were pushing for.

@marcoabreu
Copy link
Contributor

It seems like I was under the impression that we are dropping support of some visual studio version, but that seemed like a fluke. It's fine now, sorry for the confusion. Happy to move forward with the upgrade to 64bit

@larroy
Copy link
Contributor

larroy commented Nov 27, 2019

If we want to monitor binary sizes, a good way to do it would be to put a safeguard on CI that checks that separately.

@ChaiBapchya
Copy link
Contributor

ChaiBapchya commented Nov 27, 2019

It seems like I was under the impression that we are dropping support of some visual studio version, but that seemed like a fluke. It's fine now, sorry for the confusion. Happy to move forward with the upgrade to 64bit

@marcoabreu Are you referring to this PR - #16712
In order to upgrade VC2015 -> 2017, need opencv to be updated

This PR for opencv update is failing due to "authorization issue" (Hence I am stuck)
[Restricted access] https://github.com/anirudh2290/mxnet-ci/pull/2

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants