Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python] remove unnecessary conversions between data structures #8546

Merged
merged 9 commits into from
Dec 14, 2022

Conversation

jameslamb
Copy link
Contributor

@jameslamb jameslamb commented Dec 5, 2022

I ran flake8 over the code in the Python package, with flake8-comprehensions installed.

pip install flake8 flake8-comprehensions
flake8 \
    --ignore=B,F,E,W \
    --exclude='python-package/build' \
    .

This found a few opportunities for small optimizations. For example, the results of set([1, 2, 3]) and {1, 2, 3} are equivalent, but the set([]) form has the extra cost of construction an intermediate list (which is immediately thrown away).

This PR proposes resolving those warnings, for some tiny efficiency improvements.

./python-package/setup.py:282:51: C400 Unnecessary generator - rewrite as a list comprehension.
./python-package/xgboost/core.py:1163:48: C405 Unnecessary tuple literal - rewrite as a set literal.
./python-package/xgboost/core.py:1511:18: C402 Unnecessary generator - rewrite as a dict comprehension.
./python-package/xgboost/tracker.py:144:23: C405 Unnecessary list literal - rewrite as a set literal.
./python-package/xgboost/tracker.py:258:23: C405 Unnecessary list literal - rewrite as a set literal.
./python-package/xgboost/training.py:518:18: C402 Unnecessary generator - rewrite as a dict comprehension.

Thanks for your time and consideration.

Comment on lines -517 to -518
else:
params = dict((k, v) for k, v in params.items())
Copy link
Contributor Author

@jameslamb jameslamb Dec 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flake8-comprehensions warned about the use of dict() here, but when I looked more closely, it seems like this line isn't necessary at all.

It's been in xgboost for 6+ years (since #1142), so maybe something has changed since then.

But if the code enters this point, it's already guaranteed that params is a dictionary.

if isinstance(params, list):
_metrics = [x[1] for x in params if x[0] == 'eval_metric']
params = dict(params)
if 'eval_metric' in params:
params['eval_metric'] = _metrics
else:
params = dict((k, v) for k, v in params.items())

since it can only be a dictionary or list

params: BoosterParam,

BoosterParam = Union[List, Dict] # better be sequence

And for something that is already a dictionary, I can't think of a situation where re-iterating over all the contents like this would produce a different dictionary.

params = {"a": 1, "b": [2, 3], "c": None}

new_params = dict((k, v) for k, v in params.items())

params == new_params
# True

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for catching that!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trivialfis I found that this change was actually to blame for the failure of tests/python/test_early_stopping.py, not the one from this comment thread: #8546 (comment).

That dict((k, v) ... code was serving a second purpose...it ensured that the rest of xgboost.cv() worked on a copy of params, so calls like params.pop() wouldn't have side-effects on the passed-in dictionary.

I fixed that in 849ea92.


There is one remaining test Python test failure (python/test_with_sklearn.py::test_pandas_input) on macOS (build link) and one on Windows (build link).

assert isinstance(
            clf_isotonic.calibrated_classifiers_[0].base_estimator, xgb.XGBClassifier
        )
E       AttributeError: '_CalibratedClassifier' object has no attribute 'base_estimator'

I think that might not be related to this PR's changes, given that I also see it on this other PR which only touches the R package: #8565 (build link).

I wonder if that's related to the newest release of scikit-learn supporting pandas as an input and output (scikit-learn/scikit-learn#5523).

I noticed, for example, that the failing Mac build here is using scikit-learn=1.2.0, which was just released today (PyPI link).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully #8580 will fix this.

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the cleanup! Could you please rebase onto or merge the latest branch?

@jameslamb
Copy link
Contributor Author

Could you please rebase onto or merge the latest branch

Sure! Just merged in master.

I chose to merge instead of rebase since this repo uses squash merging.

@@ -1160,7 +1160,7 @@ def feature_names(self, feature_names: Optional[FeatureNames]) -> None:
raise ValueError(msg)
# prohibit to use symbols may affect to parse. e.g. []<
if not all(isinstance(f, str) and
not any(x in f for x in set(('[', ']', '<')))
not any(x in f for x in {'[', ']', '<'})
Copy link
Contributor Author

@jameslamb jameslamb Dec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
not any(x in f for x in {'[', ']', '<'})
not any(x in f for x in ['[', ']', '<'])

pylint rightly caught that since this is iterating over values, not doing a set comparison, an iterable type should be used here instead of a set.

/home/runner/work/xgboost/xgboost/python-package/xgboost/core.py:1163: convention (C0208, use-sequence-for-iteration, DMatrix.feature_names) Use a sequence type when iterating over values

build link: https://github.com/dmlc/xgboost/actions/runs/3635213908/jobs/6134104007

Copy link
Member

@trivialfis trivialfis Dec 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, the change seems to be causing failure in tests. can you reproduce the failure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I'm not sure why. I can't see how iterating over a list of strings should be different (in a way that matters for this code) than iterating over a set of strings.

I just merged latest master into this branch, in case the test failures aren't related to this PR's changes. If they still fail, I'll try to reproduce this locally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok yes, I was able to reproduce the failing tests locally.

cd python-package
pip install .
cd ..
pytest tests/python/test_early_stopping.py

I'll try to figure out which of these changes is to blame, sorry about that!

@trivialfis
Copy link
Member

let me fix the failure with updated sklearn tomorrow.

@trivialfis
Copy link
Member

apologies for the push. I need to bring in the CI fixes and hope you don't mind.

@jameslamb
Copy link
Contributor Author

no problem, do what you need! And let me know if I can help in any way.

@trivialfis trivialfis merged commit 06ea6c7 into dmlc:master Dec 14, 2022
@jameslamb jameslamb deleted the python/comprehensions branch December 14, 2022 12:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants