Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support keep_key in Grouper? #256

Closed
pmeier opened this issue Feb 25, 2022 · 15 comments
Closed

Support keep_key in Grouper? #256

pmeier opened this issue Feb 25, 2022 · 15 comments
Labels
good first issue Good for newcomers

Comments

@pmeier
Copy link
Contributor

pmeier commented Feb 25, 2022

IterKeyZipper has an option to keep the key that was zipped on:

keep_key: bool = False,

Is this something we want to support going forward? If yes, it would be nice to have this also on Grouper and possibly other similar datapipes. That would come in handy in situations if the key is used multiple times for example if we have a IterKeyZipper after an Grouper.

Additional Context for New Contributors

See comment below

@ejguan
Copy link
Contributor

ejguan commented Feb 25, 2022

Sounds reasonable to me

@NivekT NivekT added the good first issue Good for newcomers label Mar 7, 2022
@NivekT
Copy link
Contributor

NivekT commented Mar 7, 2022

For new contributors who would like to work on this issue, we recommend:

  1. Look at how keep_key is used by IterKeyZipper
  2. Add the same feature to Grouper
  3. Update the unit test for Grouper to ensure the new behavior works and is backward compatible with existing usages

@NicolasHug
Copy link
Member

I took a quick look at this issue.
I wonder if it would make sense to instead avoid the overlap between keep_key and merge_fn? For example we could remove the keep_key parameter, and instead let the merge_fn function accept key, source_dp_item, ref_dp_item ? Then passing keep_key=True would be equivalent to passing

merge_fn = lambda key, source_dp_item, ref_dp_item: (key, source_dp_item, ref_dp_item)

@NivekT
Copy link
Contributor

NivekT commented Mar 28, 2022

I took a quick look at this issue. I wonder if it would make sense to instead avoid the overlap between keep_key and merge_fn? For example we could remove the keep_key parameter, and instead let the merge_fn function accept key, source_dp_item, ref_dp_item ? Then passing keep_key=True would be equivalent to passing

merge_fn = lambda key, source_dp_item, ref_dp_item: (key, source_dp_item, ref_dp_item)

I think that is a reasonable proposal and the API should be cleaner as well, but I feel unsure about doing that for BC and consistency reasons. Ideally, the IterKeyZipper, MapKeyZipper, Grouper, and similar DataPipes should have consistent APIs. If we do something different for Grouper, then IterKeyZipper should also have to change.

cc: @pmeier @ejguan Any thoughts?

@pmeier
Copy link
Contributor Author

pmeier commented Mar 28, 2022

Yes, we should be consistent. But there is no harm to phase out keep_key on the datapipes that have that param in favor of merge_fn.

@NivekT
Copy link
Contributor

NivekT commented Mar 28, 2022

Actually Grouper currently doesn't take in a merge_fn, so we can modify Grouper without any BC-breaking.
@NicolasHug was your comment focused on IterKeyZipper? If not, what is your proposal for Grouper?

In another note, I just added another issue since we had a request to allow IterKeyZipper to accept any number of input DataPipes. If we do decide to move forward with that, we will have a BC-breaking change anyway. We can make two modifications to the IterKeyZipper API at the same time:

  1. accept any number of DataPipes
  2. remove keep_key and let merge_fn accept key as an argument

@NicolasHug
Copy link
Member

NicolasHug commented Mar 28, 2022

@NicolasHug was your comment focused on IterKeyZipper?

Yes :)

what is your proposal for Grouper?

That was actually going to be my next question: if we were to add a merge_fn to Grouper, what would it look like? I guess it has to have the following signature (and default behaviour)?

def merge_fn(key, *args):
    return tuple(args)

@ejguan
Copy link
Contributor

ejguan commented Mar 28, 2022

Emmm. I agree to add keep_key to Grouper. And, if keep_key can be dropped from IterKeyZipper by using merge_fn, we can drop it. But, I personally don't think adding merge_fn to Grouperis beneficial.

  1. I personally don't realize the use case to have a merge_fn for a Grouper. merge_fn is used for IterKeyZipper to flatten a nested data structure
  2. The behavior of these two kinds of DataPipe is different. The size of each data yielded from IterKeyZipper is fixed based on the number of input DataPipes (tuple is an appropriate data structure as the return type). The size of data yielded Grouper can vary across iterations (list is a better data structure as the return type). As the return types are different, we can not provide the same default merge_fn for both DataPipes.

@NivekT
Copy link
Contributor

NivekT commented Mar 28, 2022

@NicolasHug

If we choose to add merge_fn for Grouper, it will replace wrapper_class.

https://github.com/pytorch/pytorch/blob/ea44645c9a682a4e212e64b94a86383c3388ed6b/torch/utils/data/datapipes/iter/grouping.py#L262-L264

As @ejguan mentioned, wrapper_class is applied onto a List (variable result_to_yield) with length not known in advance. I do agree that Grouper is doing something different from the Zippers and such that divergence in their APIs make sense.

My proposal would be to add keep_key to Grouper, which will return a tuple (key, wrapper_class(result_to_yield)) when True, and wrapper_class(result_to_yield) when False.

@NicolasHug
Copy link
Member

Should we just add keep_key to Grouper and remove merge_fn from the Zipper then? Both APIs would be consistent.

Perhaps I'm missing something but it looks like merge_fn isn't really necessary since it can be achieved just as easily with a subsequent call to Mapper?

@ejguan
Copy link
Contributor

ejguan commented Mar 29, 2022

Perhaps I'm missing something but it looks like merge_fn isn't really necessary since it can be achieved just as easily with a subsequent call to Mapper?

Yeah. When IterKeyZipper starts to support multiple datapipes, the merge_fn is not really useful. IIRC, merge_fn is used as a syntax sugar for a sequence of IterKeyZipper.

@SvenDS9
Copy link
Contributor

SvenDS9 commented Jan 10, 2023

Is this issue still relevant? The Grouper class seems to be gone (or moved away?). Removing the merge_fn from IterKeyZipper is still possible. But does it still make sense? I'm looking for a good first issue :)

@ejguan

@ejguan
Copy link
Contributor

ejguan commented Jan 10, 2023

@SvenDS9 Thanks for taking interest in contributing TorchData. Grouper is currently landed in PyTorch core: https://github.com/pytorch/pytorch/blob/ce50a8de7535b5e359f7ed7ead4285414a966d3f/torch/utils/data/datapipes/iter/grouping.py#L202-L203
So, adding a keep_key would be a good first issue

@SvenDS9
Copy link
Contributor

SvenDS9 commented Jan 11, 2023

Should I still work on removing the merge_fn from IterKeyZipper?

Since Grouper has been moved away to PyTorch core, how can I still contribute here?

@ejguan
Copy link
Contributor

ejguan commented Jan 11, 2023

Should I still work on removing the merge_fn from IterKeyZipper?

I think it needs to be done with this issue #334

Since Grouper has been moved away to PyTorch core, how can I still contribute here?

You can open a PR against PyTorch Core. And, we will review it for sure.

@atalman atalman closed this as completed Jan 27, 2023
SvenDS9 added a commit to SvenDS9/PytorchData that referenced this issue Feb 23, 2023
facebook-github-bot pushed a commit that referenced this issue Feb 27, 2023
Summary:
As mentioned in #256 it would be useful to have this in all dp that use a `key_fn`.

### Changes

- Add keep_key option to `MapKeyZipper`
- Test for the new option
- Improve example in documentation by using sphinx doctest

Pull Request resolved: #1042

Reviewed By: ejguan

Differential Revision: D43550124

Pulled By: NivekT

fbshipit-source-id: 453418560dd586ebb9ce42dd27eb1bc86eb734d0
NivekT pushed a commit that referenced this issue Feb 28, 2023
Summary:
As mentioned in #256 it would be useful to have this in all dp that use a `key_fn`.

### Changes

- Add keep_key option to `MapKeyZipper`
- Test for the new option
- Improve example in documentation by using sphinx doctest

Pull Request resolved: #1042

Reviewed By: ejguan

Differential Revision: D43550124

Pulled By: NivekT

fbshipit-source-id: 453418560dd586ebb9ce42dd27eb1bc86eb734d0
ejguan pushed a commit that referenced this issue Feb 28, 2023
Summary:
As mentioned in #256 it would be useful to have this in all dp that use a `key_fn`.

### Changes

- Add keep_key option to `MapKeyZipper`
- Test for the new option
- Improve example in documentation by using sphinx doctest

Pull Request resolved: #1042

Reviewed By: ejguan

Differential Revision: D43550124

Pulled By: NivekT

fbshipit-source-id: 453418560dd586ebb9ce42dd27eb1bc86eb734d0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants