Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use int instead of char in CLI config parser #3976

Merged
merged 1 commit into from
Dec 7, 2018

Conversation

hcho3
Copy link
Collaborator

@hcho3 hcho3 commented Dec 7, 2018

I was playing around with AWS's latest offering of ARM instances, and running the binary classification CLI example resulted into an infinite loop.

Diagnosis. std::istream.get() and getchar() have return type int, not char, because both may return EOF, which is commonly -1. Storing EOF in a char results in undefined behavior. The char type is usually signed on x86 systems but unsigned on ARM systems (see http://blog.cdleary.com/2012/11/arm-chars-are-unsigned-by-default/), so the undefined behavior will only be apparent when compiled on ARM.

EDIT. The DMLC config parser suffers from the same issue. Fortunately, DMLC JSON parser and parameter packs are unaffected.

@codecov-io
Copy link

codecov-io commented Dec 7, 2018

Codecov Report

Merging #3976 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #3976   +/-   ##
=========================================
  Coverage     56.36%   56.36%           
  Complexity      205      205           
=========================================
  Files           186      186           
  Lines         14775    14775           
  Branches        498      498           
=========================================
  Hits           8328     8328           
  Misses         6208     6208           
  Partials        239      239

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4f26053...d19d7b0. Read the comment docs.

@RAMitchell
Copy link
Member

Change looks fine in my non-expert opinion. We should add a test that demonstrates this failure when run on ARM, this will help expose any issues with other systems.

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. That's actually very interesting to know.

@hcho3
Copy link
Collaborator Author

hcho3 commented Dec 7, 2018

@RAMitchell @trivialfis Do you think it would be worthwhile to add an ARM worker to Jenkins CI? We can have it run CLI examples and Python tests (but no GPU tests).

@trivialfis
Copy link
Member

@hcho3 I assume there will be some other problems unique to ARM devices. If we were to support ARM, at least one of us needs to have an actual usable ARM device at hand for testing/debugging, and I don't have such a device. Debugging on test farm is no fun.

Besides, is there such a demand for running XGBoost on ARM?

@hcho3
Copy link
Collaborator Author

hcho3 commented Dec 7, 2018

@trivialfis I was referring to ARM instances on EC2: https://aws.amazon.com/ec2/instance-types/a1/
At least some compatibility / portability problems can be caught.

@trivialfis
Copy link
Member

@hcho3

At least some compatibility / portability problems can be caught.

I understand your concern, but after catching the problems(the PR test fail on ARM), I will have to get an ARM device to debug it. A quick search shows me that there are tools like QEMU maybe I can use to emulate ARM environment, but I'm not sure yet. If you think it's worth the effort then I will try playing with it.
@RAMitchell WDYT?

@hcho3
Copy link
Collaborator Author

hcho3 commented Dec 7, 2018

I suppose asking contributors to launch A1 instances may cause vendor lock-in. Okay, I'll shelve it for the time being.

@hcho3 hcho3 merged commit 9af6b68 into dmlc:master Dec 7, 2018
@hcho3 hcho3 deleted the fix_config_parser branch December 7, 2018 09:00
@hcho3
Copy link
Collaborator Author

hcho3 commented Dec 7, 2018

@trivialfis As for your question about demand for non-x86 machines, I think it is significant. See #3244 and #3075 for example. But it is rather difficult on the part of XGBoost community to make the same level of guarantees for ARM as we do for x86, as not everyone has access to ARM devices.

For now, we'll have to rely on reports from external contributors, who will individually test XGBoost on their edge ARM devices.

@trivialfis
Copy link
Member

@hcho3 Thanks. I will keep that in mind. Let's sort out other refactors/fixes before getting to ARM.

@hcho3
Copy link
Collaborator Author

hcho3 commented Dec 7, 2018

Agreed. We have other priorities for the time being.

@RAMitchell
Copy link
Member

I think having an ARM machine is worthwhile even with the debugging difficulty but as you say there are other priorities. We could use more open source contribution on xgboost in the CI department but the problem is access to the servers.

@hcho3
Copy link
Collaborator Author

hcho3 commented Dec 7, 2018

We could use more open source contribution on xgboost in the CI department but the problem is access to the servers

@RAMitchell Is web access to Jenkins going to sufficient to this kind of contribution, or would (some) contributors need SSH access? If the latter is the case, we need to consider alternative to having AWS sponsor the whole CI server, since they won't allow outsiders to have SSH access. Maybe form a consortium of multiple companies and have them pay a set amount each month to maintain the existing CI server?

@hcho3
Copy link
Collaborator Author

hcho3 commented Dec 7, 2018

BTW, the current Jenkins CI server is running out of my own pocket. It's manageable for now, but I'd prefer a more sustainable arrangement sooner or later :)

@hcho3
Copy link
Collaborator Author

hcho3 commented Dec 7, 2018

We could use more open source contribution on xgboost in the CI department but the problem is access to the servers

@RAMitchell I may have misunderstood you here. Did you mean access as in access control, or as in availability of compute resources?

@lock lock bot locked as resolved and limited conversation to collaborators Mar 7, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants