
Fix and improve map reduce #232

Merged
AnesBenmerzoug merged 43 commits into develop from fix-and-improve-map-reduce on Jan 1, 2023
Conversation

AnesBenmerzoug
Collaborator

@AnesBenmerzoug AnesBenmerzoug commented Dec 19, 2022

Description

This PR closes #193 and closes #165

Changes

  • Adds a SequentialParallelBackend class that runs all jobs in the current thread.
  • Defines a BaseParallelBackend abstract class.
  • Defines an AbstractNoPublicConstructor metaclass that disallows direct instantiation of parallel backends.
  • Passes inputs to MapReduceJob at initialization.
  • Removes chunkify_inputs parameter from MapReduceJob.
  • Removes n_runs parameter from MapReduceJob.
  • Calls the parallel backend's put() method for each generated chunk in _chunkify().
  • Skips chunkification in MapReduceJob if n_runs >= n_jobs.
  • Uses singledispatch and singledispatchmethod to handle different input types in chunkify and _get_value.
  • Adds a __repr__ method to ValuationResult.
  • Calls parallel_backend.put() on the input to MapReduceJob's __call__ method if it is a sequence of Utility objects.
  • Renames ParallelConfig's num_workers attribute to n_local_workers.
  • Small fixes to MapReduceJob's docstring.
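The AbstractNoPublicConstructor idea can be sketched with a metaclass that overrides __call__. The names and the create() factory method below are hypothetical; this is a minimal illustration of the pattern, not pydvl's implementation:

```python
from typing import Any


class NoPublicConstructor(type):
    """Metaclass that forbids direct instantiation of its classes.

    Hypothetical sketch of the pattern behind AbstractNoPublicConstructor.
    """

    def __call__(cls, *args: Any, **kwargs: Any) -> Any:
        # Backend("...") goes through here and is rejected
        raise TypeError(
            f"{cls.__name__} cannot be instantiated directly; "
            "use the factory method instead."
        )

    def create(cls, *args: Any, **kwargs: Any) -> Any:
        # Bypass the override by calling type.__call__ directly
        return super().__call__(*args, **kwargs)


class Backend(metaclass=NoPublicConstructor):
    def __init__(self, name: str):
        self.name = name
```

With this in place, `Backend("ray")` raises TypeError, while `Backend.create("ray")` builds the instance normally, so construction is funneled through one controlled entry point.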

Checklist

  • Wrote Unit tests (if necessary)
  • Updated Documentation (if necessary)
  • Updated Changelog
  • If notebooks were added/changed, added boilerplate cells are tagged with "nbsphinx":"hidden"

@AnesBenmerzoug AnesBenmerzoug self-assigned this Dec 19, 2022
@AnesBenmerzoug AnesBenmerzoug marked this pull request as ready for review December 19, 2022 08:02
@AnesBenmerzoug
Collaborator Author

@mdbenito I tried using partial to avoid map_kwargs and reduce_kwargs, but it didn't work with ray's put function. I checked their code: it does different things depending on the type of the passed input.
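For context, the rejected alternative was presumably along these lines: bind the extra arguments with functools.partial instead of threading them through map_kwargs. The function name and arguments below are made up for illustration; the sketch shows the pattern, not ray's behavior:

```python
from functools import partial
from typing import List


def map_func(chunk: List[int], coefficient: int) -> List[int]:
    # A stand-in map function that takes one extra keyword argument
    return [coefficient * x for x in chunk]


# Bind the extra argument up front; the resulting callable only needs the chunk
bound = partial(map_func, coefficient=3)
```

`bound([1, 2])` returns `[3, 6]`; the per-kwarg put() calls become unnecessary, which is exactly what apparently did not survive contact with ray's serialization of partial objects.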

@mdbenito mdbenito added this to the Ready for public release milestone Dec 19, 2022
@AnesBenmerzoug AnesBenmerzoug marked this pull request as ready for review December 28, 2022 09:49
@mdbenito mdbenito left a comment
This is looking nice :) I still have my previous question about putting everything to the backend, though. And a few comments here and there

@AnesBenmerzoug
Collaborator Author

@mdbenito I have made the changes we talked about:

  • removed n_runs from MapReduceJob
  • call put() on each chunk inside _chunkify()

I also added two tests for MapReduceJob and fixed a bug when passing numpy arrays to _chunkify().
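A chunking helper that dispatches on input type with singledispatch, as described in the changes list, might look like the following. The signatures are hypothetical, not pydvl's _chunkify; the point is that numpy arrays get their own registered implementation instead of falling through generic sequence slicing:

```python
from functools import singledispatch
from typing import Any, List, Sequence

import numpy as np


@singledispatch
def chunkify(data: Sequence[Any], n_chunks: int) -> List[Sequence[Any]]:
    # Generic fallback: slice the sequence into roughly equal chunks,
    # distributing the remainder over the first chunks
    n = len(data)
    sizes = [n // n_chunks + (1 if i < n % n_chunks else 0) for i in range(n_chunks)]
    chunks, start = [], 0
    for size in sizes:
        chunks.append(data[start : start + size])
        start += size
    return chunks


@chunkify.register
def _(data: np.ndarray, n_chunks: int) -> List[np.ndarray]:
    # numpy arrays are dispatched here and split natively
    return list(np.array_split(data, n_chunks))
```

singledispatch picks the implementation from the runtime type of the first argument, so callers never branch on the input type themselves.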

@mdbenito mdbenito left a comment
I think this looks rather nice now :) I suggested a couple of cosmetic changes, ditching redundant elses, but other than that I'm 100% fine with merging. Go ahead!

Comment on lines +165 to +170
    if map_kwargs is None:
        self.map_kwargs = dict()
    else:
        self.map_kwargs = {
            k: self.parallel_backend.put(v) for k, v in map_kwargs.items()
        }
Suggested change:

    self.map_kwargs = {
        k: self.parallel_backend.put(v) for k, v in (map_kwargs or {}).items()
    }

Collaborator Author
I think this makes it harder to read since this is not a common pattern.
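Both spellings are behaviorally equivalent: `map_kwargs or {}` falls back to an empty dict when map_kwargs is None (or empty, which yields the same result anyway). A minimal standalone comparison, with parallel_backend.put replaced by an identity stand-in so the snippet runs on its own:

```python
from typing import Any, Dict, Optional


def put(v: Any) -> Any:
    # Stand-in for parallel_backend.put(); the real method returns a handle
    return v


def explicit(map_kwargs: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    # The version kept in the PR: an explicit None branch
    if map_kwargs is None:
        return dict()
    return {k: put(v) for k, v in map_kwargs.items()}


def idiom(map_kwargs: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    # The suggested one-liner using the `or {}` fallback
    return {k: put(v) for k, v in (map_kwargs or {}).items()}
```

The objection in the thread is purely about readability, not correctness: both functions return the same dict for every input.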

Comment on lines +172 to +177
    if reduce_kwargs is None:
        self.reduce_kwargs = dict()
    else:
        self.reduce_kwargs = {
            k: self.parallel_backend.put(v) for k, v in reduce_kwargs.items()
        }
Suggested change:

    self.reduce_kwargs = {
        k: self.parallel_backend.put(v) for k, v in (reduce_kwargs or {}).items()
    }


import numpy as np
from numpy.typing import NDArray
Collaborator
Why don't we always import directly like this instead of using the if TYPE_CHECKING guard and then quoting all the types? (that's a bit of a PITA, tbh)

Collaborator Author
I think it was a leftover from before we pinned the minimum numpy version to 1.20.

The numpy.typing module was added to numpy in version 1.20.

So yes, I think we should just import it directly.
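The two import styles under discussion, side by side. The guard-and-quote variant keeps numpy.typing out of the runtime import path (the pre-1.20 workaround); the direct import assumes numpy >= 1.20 and needs no quoting. Function names here are illustrative:

```python
# Style 1: guard the import and quote the annotation (pre-1.20 workaround).
# NDArray only exists for the type checker; at runtime the annotation is a string.
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from numpy.typing import NDArray


def mean_quoted(x: "NDArray") -> float:
    return float(x.mean())


# Style 2: direct import, available since numpy 1.20, no quoting needed
import numpy as np
from numpy.typing import NDArray


def mean_direct(x: NDArray) -> float:
    return float(x.mean())
```

Both styles type-check the same; with the minimum numpy version pinned at 1.20, style 2 removes the guard boilerplate and the quoted annotations.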

Co-authored-by: Miguel de Benito Delgado <[email protected]>
@AnesBenmerzoug AnesBenmerzoug merged commit c94aafe into develop Jan 1, 2023
@AnesBenmerzoug AnesBenmerzoug deleted the fix-and-improve-map-reduce branch January 5, 2023 13:51
Successfully merging this pull request may close these issues.

  • Unexpected behaviour of MapReduceJob with multiple runs
  • Ideas for a change of interface to MapReduceJob