
Fix and improve map reduce #232

Merged
AnesBenmerzoug merged 43 commits into develop from fix-and-improve-map-reduce on Jan 1, 2023
Conversation

AnesBenmerzoug
Collaborator

@AnesBenmerzoug AnesBenmerzoug commented Dec 19, 2022

Description

This PR closes #193 and closes #165

Changes

  • Adds a SequentialParallelBackend class that runs all jobs in the current thread.
  • Defines a BaseParallelBackend abstract class.
  • Defines an AbstractNoPublicConstructor metaclass that disallows direct instantiation of parallel backends.
  • Passes inputs to MapReduceJob at initialization.
  • Removes chunkify_inputs parameter from MapReduceJob.
  • Removes n_runs parameter from MapReduceJob.
  • Calls the parallel backend's put() method for each generated chunk in _chunkify().
  • Skips chunkification in MapReduceJob if n_runs >= n_jobs.
  • Uses singledispatch and singledispatchmethod to handle different input types in chunkify and _get_value.
  • Adds a __repr__ method to ValuationResult.
  • Calls parallel_backend.put() on the input to MapReduceJob's __call__ method if it is a sequence of Utility objects.
  • Renames ParallelConfig's num_workers attribute to n_local_workers.
  • Small fixes to MapReduceJob's docstring.
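The AbstractNoPublicConstructor idea can be sketched with a metaclass that overrides __call__. The names and the create() factory method below are hypothetical; this is a minimal illustration of the pattern, not pydvl's implementation:

```python
from typing import Any


class NoPublicConstructor(type):
    """Metaclass that forbids direct instantiation of its classes.

    Hypothetical sketch of the pattern behind AbstractNoPublicConstructor.
    """

    def __call__(cls, *args: Any, **kwargs: Any) -> Any:
        # Backend("...") goes through here and is rejected
        raise TypeError(
            f"{cls.__name__} cannot be instantiated directly; "
            "use the factory method instead."
        )

    def create(cls, *args: Any, **kwargs: Any) -> Any:
        # Bypass the override by calling type.__call__ directly
        return super().__call__(*args, **kwargs)


class Backend(metaclass=NoPublicConstructor):
    def __init__(self, name: str):
        self.name = name
```

With this in place, `Backend("ray")` raises TypeError, while `Backend.create("ray")` builds the instance normally, so construction is funneled through one controlled entry point.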

Checklist

  • Wrote Unit tests (if necessary)
  • Updated Documentation (if necessary)
  • Updated Changelog
  • If notebooks were added/changed, added boilerplate cells are tagged with "nbsphinx":"hidden"

@AnesBenmerzoug AnesBenmerzoug self-assigned this Dec 19, 2022
@AnesBenmerzoug AnesBenmerzoug marked this pull request as ready for review December 19, 2022 08:02
@AnesBenmerzoug
Collaborator Author

@mdbenito I tried using partial to avoid map_kwargs and reduce_kwargs, but it didn't work with ray's put function. I checked their code: it does different things depending on the type of the passed input.
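For context, the rejected alternative was presumably along these lines: bind the extra arguments with functools.partial instead of threading them through map_kwargs. The function name and arguments below are made up for illustration; the sketch shows the pattern, not ray's behavior:

```python
from functools import partial
from typing import List


def map_func(chunk: List[int], coefficient: int) -> List[int]:
    # A stand-in map function that takes one extra keyword argument
    return [coefficient * x for x in chunk]


# Bind the extra argument up front; the resulting callable only needs the chunk
bound = partial(map_func, coefficient=3)
```

`bound([1, 2])` returns `[3, 6]`; the per-kwarg put() calls become unnecessary, which is exactly what apparently did not survive contact with ray's serialization of partial objects.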

@mdbenito mdbenito added this to the Ready for public release milestone Dec 19, 2022
@AnesBenmerzoug AnesBenmerzoug marked this pull request as ready for review December 28, 2022 09:49
@mdbenito mdbenito left a comment
This is looking nice :) I still have my previous question about putting everything to the backend, though. And a few comments here and there

@AnesBenmerzoug
Collaborator Author

@mdbenito I have made the changes we talked about:

  • removed n_runs from MapReduceJob
  • call put() on each chunk inside _chunkify()

I also added two tests for MapReduceJob and fixed a bug when passing numpy arrays to _chunkify().
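A chunking helper that dispatches on input type with singledispatch, as described in the changes list, might look like the following. The signatures are hypothetical, not pydvl's _chunkify; the point is that numpy arrays get their own registered implementation instead of falling through generic sequence slicing:

```python
from functools import singledispatch
from typing import Any, List, Sequence

import numpy as np


@singledispatch
def chunkify(data: Sequence[Any], n_chunks: int) -> List[Sequence[Any]]:
    # Generic fallback: slice the sequence into roughly equal chunks,
    # distributing the remainder over the first chunks
    n = len(data)
    sizes = [n // n_chunks + (1 if i < n % n_chunks else 0) for i in range(n_chunks)]
    chunks, start = [], 0
    for size in sizes:
        chunks.append(data[start : start + size])
        start += size
    return chunks


@chunkify.register
def _(data: np.ndarray, n_chunks: int) -> List[np.ndarray]:
    # numpy arrays are dispatched here and split natively
    return list(np.array_split(data, n_chunks))
```

singledispatch picks the implementation from the runtime type of the first argument, so callers never branch on the input type themselves.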

@mdbenito mdbenito left a comment
I think this looks rather nice now :) I suggested a couple of cosmetic changes, ditching redundant elses, but other than that I'm 100% fine with merging. Go ahead!

Comment on lines +165 to +170
    if map_kwargs is None:
        self.map_kwargs = dict()
    else:
        self.map_kwargs = {
            k: self.parallel_backend.put(v) for k, v in map_kwargs.items()
        }
Suggested change:

    self.map_kwargs = {
        k: self.parallel_backend.put(v) for k, v in (map_kwargs or {}).items()
    }

Collaborator Author
I think this makes it harder to read since this is not a common pattern.
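Both spellings are behaviorally equivalent: `map_kwargs or {}` falls back to an empty dict when map_kwargs is None (or empty, which yields the same result anyway). A minimal standalone comparison, with parallel_backend.put replaced by an identity stand-in so the snippet runs on its own:

```python
from typing import Any, Dict, Optional


def put(v: Any) -> Any:
    # Stand-in for parallel_backend.put(); the real method returns a handle
    return v


def explicit(map_kwargs: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    # The version kept in the PR: an explicit None branch
    if map_kwargs is None:
        return dict()
    return {k: put(v) for k, v in map_kwargs.items()}


def idiom(map_kwargs: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    # The suggested one-liner using the `or {}` fallback
    return {k: put(v) for k, v in (map_kwargs or {}).items()}
```

The objection in the thread is purely about readability, not correctness: both functions return the same dict for every input.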

Comment on lines +172 to +177
    if reduce_kwargs is None:
        self.reduce_kwargs = dict()
    else:
        self.reduce_kwargs = {
            k: self.parallel_backend.put(v) for k, v in reduce_kwargs.items()
        }
Suggested change:

    self.reduce_kwargs = {
        k: self.parallel_backend.put(v) for k, v in (reduce_kwargs or {}).items()
    }


import numpy as np
from numpy.typing import NDArray
Collaborator
Why don't we always import directly like this instead of using the if TYPE_CHECKING guard and then quoting all the types? (that's a bit of a PITA, tbh)

Collaborator Author
I think it was a leftover from before we pinned the minimum numpy version to 1.20.

The numpy.typing module was added to numpy in version 1.20.

So yes, I think we should just import it directly.
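The two import styles under discussion, side by side. The guard-and-quote variant keeps numpy.typing out of the runtime import path (the pre-1.20 workaround); the direct import assumes numpy >= 1.20 and needs no quoting. Function names here are illustrative:

```python
# Style 1: guard the import and quote the annotation (pre-1.20 workaround).
# NDArray only exists for the type checker; at runtime the annotation is a string.
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from numpy.typing import NDArray


def mean_quoted(x: "NDArray") -> float:
    return float(x.mean())


# Style 2: direct import, available since numpy 1.20, no quoting needed
import numpy as np
from numpy.typing import NDArray


def mean_direct(x: NDArray) -> float:
    return float(x.mean())
```

Both styles type-check the same; with the minimum numpy version pinned at 1.20, style 2 removes the guard boilerplate and the quoted annotations.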

Co-authored-by: Miguel de Benito Delgado <[email protected]>
@AnesBenmerzoug AnesBenmerzoug merged commit c94aafe into develop Jan 1, 2023
@AnesBenmerzoug AnesBenmerzoug deleted the fix-and-improve-map-reduce branch January 5, 2023 13:51
Successfully merging this pull request may close these issues.

  • Unexpected behaviour of MapReduceJob with multiple runs
  • Ideas for a change of interface to MapReduceJob