New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Adding InMemoryCacheHolderMapDataPipe #328

Closed

NivekT wants to merge 12 commits into gh/NivekT/69/base from gh/NivekT/69/head

Contributor

NivekT commented Mar 24, 2022 •

edited

Loading

Stack from ghstack:

I think it will be useful to have a MapDataPipe for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the iter version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

Differential Revision: D35979905


          Adding InMemoryCacheHolderMapDataPipe

2059b08

[ghstack-poisoned]

This was referenced Mar 24, 2022

Adding UnZipperMapDataPipe #325

Closed

Small refactor to ensure 'to_map_datapipe' is generated as a method in IterDataPipe's pyi interface #326

Closed

Adding MapToIterConverter #327

Closed

Generate pyi file for MapDataPipes defined in this library #329

Closed

facebook-github-bot added the CLA Signed label

NivekT marked this pull request as draft

March 24, 2022 23:10

NivekT requested review from VitalyFedyunin and ejguan

March 24, 2022 23:10

NivekT commented

View reviewed changes

torchdata/datapipes/map/util/cacheholder.py Outdated Show resolved Hide resolved

torchdata/datapipes/map/util/cacheholder.py Outdated

+                      if cache_fn is None:
+                          cache_fn = list_cache
+                      self.cache = cache_fn(len(source_dp))
+                      if not (hasattr(self.cache, "__getitem__") and hasattr(self.cache, "__setitem__")):

Contributor Author

NivekT Mar 24, 2022

The only requirement is for the initialized cache to have these methods. Does this seem reasonable enough?

Contributor Author

NivekT commented Mar 28, 2022

@ejguan @VitalyFedyunin Since there were some questions about whether this is needed (or if users should use the IterDataPipe version), can you have a look at this before I add the docstring/tests?

ejguan reviewed

View reviewed changes

torchdata/datapipes/map/util/cacheholder.py Outdated Show resolved Hide resolved

NivekT added 2 commits

March 29, 2022 11:46


          Update on "Adding InMemoryCacheHolderMapDataPipe"

f34f80f

I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

TODO: add docstring, tests, etc

[ghstack-poisoned]


          Update on "Adding InMemoryCacheHolderMapDataPipe"

488cb3f

I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

TODO: add docstring, tests, etc

[ghstack-poisoned]

NivekT marked this pull request as ready for review

March 29, 2022 22:51

NivekT added 5 commits

March 29, 2022 18:54


          Update on "Adding InMemoryCacheHolderMapDataPipe"

fdef0fa

I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

[ghstack-poisoned]


          Update on "Adding InMemoryCacheHolderMapDataPipe"

7aa5d56

I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

[ghstack-poisoned]


          Update on "Adding InMemoryCacheHolderMapDataPipe"

e3ed8b3

I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

[ghstack-poisoned]


          Update on "Adding InMemoryCacheHolderMapDataPipe"

b157786

I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

[ghstack-poisoned]


          Update on "Adding InMemoryCacheHolderMapDataPipe"

dde0195

I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

[ghstack-poisoned]

ejguan reviewed

View reviewed changes

torchdata/datapipes/map/util/cacheholder.py Outdated

+                      [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
+                  """
+                  def __init__(self, source_dp: MapDataPipe[T_co], cache: Optional[Callable] = None) -> None:

Contributor

ejguan Apr 27, 2022

Why do we want to expose cache to users? I believe a dict should be sufficient for the cache.

Contributor Author

NivekT Apr 27, 2022

The intention is here is allow users to specify the data structure if they wish (such as an LRU cache, numpy.array, or multiprocessing.Array.)

Contributor

ejguan Apr 27, 2022

If it's LRU cache, the element will be popped out the cache when it's full. Then, there is a bug that index_cached will still keep the index that has already removed from the cache.

I believe you want to use numpy.array or mp.Array for COW issue during multiprocessing. However, I don't think this is the problem in this case because self.cache is empty when we pass this DataPipe instance into multiprocessing. It's different from putting list of indices into Dataset causing the COW issue. This is legitimate to have increasing memory usage as each process will hold a separate cache object.

Contributor Author

NivekT Apr 27, 2022

If the COW issue isn't relevant here, then I will just use a dict. Thanks for the comment!

torchdata/datapipes/map/util/cacheholder.py Outdated Show resolved Hide resolved

torchdata/datapipes/map/util/cacheholder.py Outdated Show resolved Hide resolved


          Update on "Adding InMemoryCacheHolderMapDataPipe"

33fafe3

I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

[ghstack-poisoned]

NivekT requested a review from ejguan

April 27, 2022 19:10

ejguan approved these changes

View reviewed changes

Contributor

ejguan left a comment

LGTM thank you.

torchdata/datapipes/map/util/cacheholder.py

Comment on lines +42 to +45

+                  def __getitem__(self, index) -> T_co:
+                      if index not in self.cache:
+                          self.cache[index] = self.source_dp[index]  # type: ignore[index]
+                      return self.cache[index]  # type: ignore[index]

Contributor

ejguan Apr 27, 2022

I have a crazy idea that might be over-engineering here. Should we remove self.source_dp after all values are cached? We can save memory for users.
😃

Contributor Author

NivekT Apr 27, 2022 •

edited

Loading

Haha interesting idea. I will hold off on that for now since I am unsure how that will interact with other features such as graph traversal and etc.
For example, if the cache is filled and we delete self.source_dp, then we run traverse, then the graph will be incomplete? I am going to add a note here to state that it may be possible though.

NivekT added 3 commits

April 27, 2022 15:42


          Update on "Adding InMemoryCacheHolderMapDataPipe"

ab0e769

I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

[ghstack-poisoned]


          Update on "Adding InMemoryCacheHolderMapDataPipe"

80e7499

I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

[ghstack-poisoned]


          Update on "Adding InMemoryCacheHolderMapDataPipe"

24562a7

I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

[ghstack-poisoned]

Contributor Author

NivekT commented Apr 27, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot closed this in

ce4cf80

facebook-github-bot deleted the gh/NivekT/69/head branch

May 6, 2022 14:17

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels