Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding InMemoryCacheHolderMapDataPipe #328

Closed
wants to merge 12 commits into from

Conversation

NivekT
Copy link
Contributor

@NivekT NivekT commented Mar 24, 2022

Stack from ghstack:

I think it will be useful to have a MapDataPipe for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the iter version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

Differential Revision: D35979905

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 24, 2022
@NivekT NivekT marked this pull request as draft March 24, 2022 23:10
@NivekT NivekT requested review from VitalyFedyunin and ejguan March 24, 2022 23:10
if cache_fn is None:
cache_fn = list_cache
self.cache = cache_fn(len(source_dp))
if not (hasattr(self.cache, "__getitem__") and hasattr(self.cache, "__setitem__")):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only requirement is for the initialized cache to have these methods. Does this seem reasonable enough?

@NivekT
Copy link
Contributor Author

NivekT commented Mar 28, 2022

@ejguan @VitalyFedyunin Since there were some questions about whether this is needed (or if users should use the IterDataPipe version), can you have a look at this before I add the docstring/tests?

NivekT added 2 commits March 29, 2022 11:46
I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

TODO: add docstring, tests, etc

[ghstack-poisoned]
I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

TODO: add docstring, tests, etc

[ghstack-poisoned]
@NivekT NivekT marked this pull request as ready for review March 29, 2022 22:51
NivekT added 5 commits March 29, 2022 18:54
I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

[ghstack-poisoned]
I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

[ghstack-poisoned]
I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

[ghstack-poisoned]
I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

[ghstack-poisoned]
I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

[ghstack-poisoned]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
"""

def __init__(self, source_dp: MapDataPipe[T_co], cache: Optional[Callable] = None) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want to expose cache to users? I believe a dict should be sufficient for the cache.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention is here is allow users to specify the data structure if they wish (such as an LRU cache, numpy.array, or multiprocessing.Array.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's LRU cache, the element will be popped out the cache when it's full. Then, there is a bug that index_cached will still keep the index that has already removed from the cache.

I believe you want to use numpy.array or mp.Array for COW issue during multiprocessing. However, I don't think this is the problem in this case because self.cache is empty when we pass this DataPipe instance into multiprocessing. It's different from putting list of indices into Dataset causing the COW issue. This is legitimate to have increasing memory usage as each process will hold a separate cache object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the COW issue isn't relevant here, then I will just use a dict. Thanks for the comment!

I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

[ghstack-poisoned]
@NivekT NivekT requested a review from ejguan April 27, 2022 19:10
Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thank you.

Comment on lines +42 to +45
def __getitem__(self, index) -> T_co:
if index not in self.cache:
self.cache[index] = self.source_dp[index] # type: ignore[index]
return self.cache[index] # type: ignore[index]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a crazy idea that might be over-engineering here. Should we remove self.source_dp after all values are cached? We can save memory for users.
😃

Copy link
Contributor Author

@NivekT NivekT Apr 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha interesting idea. I will hold off on that for now since I am unsure how that will interact with other features such as graph traversal and etc.
For example, if the cache is filled and we delete self.source_dp, then we run traverse, then the graph will be incomplete? I am going to add a note here to state that it may be possible though.

NivekT added 3 commits April 27, 2022 15:42
I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

[ghstack-poisoned]
I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

[ghstack-poisoned]
I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version.

This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).

[ghstack-poisoned]
@NivekT
Copy link
Contributor Author

NivekT commented Apr 27, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot facebook-github-bot deleted the gh/NivekT/69/head branch May 6, 2022 14:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants