Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PUBDEV-5294 - XGBoost model in Flow doesn't seem to converge #2055

Merged
merged 3 commits into from
Feb 16, 2018

Conversation

Pscheidl
Copy link
Contributor

Issue: https://0xdata.atlassian.net/browse/PUBDEV-5294

This is a known issue with XGBoost, when the library takes system Locale and uses it to determine decimal separator.

There are many countries with different decimal separator from US/UK decimal point, usually a decimal comma (arabic style). For example Japan shares the same settings with US/UK (one more reason for this to work for the US guys and Mateusz). https://en.wikipedia.org/wiki/Decimal_separator

This issue has been reported many times (e.g. here dmlc/xgboost#2512), however for some uses, this is considered a feature and is unlikely to be fixed soon.

Therefore, fixing this issue on our side by passing localized parameters is the preferred option.

@@ -168,40 +172,40 @@ public XGBoostModel(Key<XGBoostModel> selfKey, XGBoostParameters parms, XGBoostO
}
if (p._eta != 0.3) {
Log.info("Using user-provided parameter eta instead of learn_rate.");
params.put("eta", p._eta);
params.put("eta", localizedNumberFormatter.format(p._eta));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of handling every single case I would recommend allowing to put the number directly in the map and then post-process it to a correct XGBoost input. This will prevent future mistakes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 to this, someone might just add a new param as is without realizing this in the future. A post process of all float/double values right before we return the params map might be safer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@mdymczyk
Copy link
Contributor

Good catch! Looks good if you could only make that one improvement @michalkurka suggested.

@Pscheidl Pscheidl force-pushed the pavel_pubdev-5294 branch 2 times, most recently from 2f7aed4 to 9661331 Compare February 15, 2018 20:55
@Pscheidl
Copy link
Contributor Author

Thank you guys, all set. Feel free to do review. I'll watch tests.

import java.util.HashMap;
import java.util.Map;

import static hex.tree.xgboost.XGBoost.makeDataInfo;

public class XGBoostModel extends Model<XGBoostModel, XGBoostModel.XGBoostParameters, XGBoostOutput> {

private static final NumberFormat localizedNumberFormatter = DecimalFormat.getNumberInstance();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see documentation to DecimalFormat - there is no guarantee of being thread safe

 * Decimal formats are generally not synchronized.
 * It is recommended to create separate format instances for each thread.
 * If multiple threads access a format concurrently, it must be synchronized
 * externally.

@michalkurka
Copy link
Contributor

@h2o-ops please test

@michalkurka
Copy link
Contributor

@Pscheidl seems ready to merge - the test failure is irrelevant

@Pscheidl
Copy link
Contributor Author

@stexyz Please test your original scenario and let us know. Thank you.

@Pscheidl
Copy link
Contributor Author

Pscheidl commented Feb 16, 2018

I did manual testing myself on the airlines data as well. Czech locale (cs_CZ), which is comma separated.
snimek obrazovky porizeny 2018-02-15 21-59-55

Works.

Copy link
Contributor

@stexyz stexyz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XGBoost converges on the Airlines dataset now when invoked from the Flow.

@Pscheidl Pscheidl merged commit 6819806 into master Feb 16, 2018
@arnocandel
Copy link
Member

Nice!

@Pscheidl Pscheidl deleted the pavel_pubdev-5294 branch February 16, 2018 12:10
michalkurka pushed a commit that referenced this pull request Feb 19, 2018
* PUBDEV-5294 - XGBoost model in Flow doesn't seem to converge

* Fix issue with DecimalFormat (not thread safe!)

* Fix formatting
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants