-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding InMemoryCacheHolderMapDataPipe #328
Conversation
[ghstack-poisoned]
if cache_fn is None: | ||
cache_fn = list_cache | ||
self.cache = cache_fn(len(source_dp)) | ||
if not (hasattr(self.cache, "__getitem__") and hasattr(self.cache, "__setitem__")): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only requirement is for the initialized cache
to have these methods. Does this seem reasonable enough?
@ejguan @VitalyFedyunin Since there were some questions about whether this is needed (or if users should use the |
I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version. This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list). TODO: add docstring, tests, etc [ghstack-poisoned]
I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version. This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list). TODO: add docstring, tests, etc [ghstack-poisoned]
I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version. This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list). [ghstack-poisoned]
I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version. This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list). [ghstack-poisoned]
I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version. This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list). [ghstack-poisoned]
I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version. This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list). [ghstack-poisoned]
I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version. This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list). [ghstack-poisoned]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] | ||
""" | ||
|
||
def __init__(self, source_dp: MapDataPipe[T_co], cache: Optional[Callable] = None) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we want to expose cache
to users? I believe a dict
should be sufficient for the cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention is here is allow users to specify the data structure if they wish (such as an LRU cache, numpy.array
, or multiprocessing.Array
.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's LRU cache, the element will be popped out the cache when it's full. Then, there is a bug that index_cached
will still keep the index that has already removed from the cache.
I believe you want to use numpy.array
or mp.Array
for COW issue during multiprocessing. However, I don't think this is the problem in this case because self.cache
is empty when we pass this DataPipe instance into multiprocessing. It's different from putting list of indices into Dataset
causing the COW issue. This is legitimate to have increasing memory usage as each process will hold a separate cache
object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the COW issue isn't relevant here, then I will just use a dict
. Thanks for the comment!
I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version. This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list). [ghstack-poisoned]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thank you.
def __getitem__(self, index) -> T_co: | ||
if index not in self.cache: | ||
self.cache[index] = self.source_dp[index] # type: ignore[index] | ||
return self.cache[index] # type: ignore[index] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a crazy idea that might be over-engineering here. Should we remove self.source_dp
after all values are cached? We can save memory for users.
😃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha interesting idea. I will hold off on that for now since I am unsure how that will interact with other features such as graph traversal and etc.
For example, if the cache is filled and we delete self.source_dp
, then we run traverse
, then the graph will be incomplete? I am going to add a note here to state that it may be possible though.
I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version. This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list). [ghstack-poisoned]
I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version. This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list). [ghstack-poisoned]
I think it will be useful to have a `MapDataPipe` for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use the `iter` version. This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list). [ghstack-poisoned]
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Stack from ghstack:
I think it will be useful to have a
MapDataPipe
for caching data, especially right before shuffling and etc. The use case should be common enough that we should support it rather than asking users to use theiter
version.This also have the added functionality for users to customize the wrapper around their data (e.g. user numpy array instead of list).
Differential Revision: D35979905