
Implement map_batches to align with TorchArrow API #359

Closed
wants to merge 5 commits

Conversation

ejguan (Contributor) commented Apr 12, 2022

Per title

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 12, 2022
@facebook-github-bot (Contributor) commented

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ejguan ejguan requested review from NivekT and VitalyFedyunin April 12, 2022 18:34
@NivekT (Contributor) left a comment
Overall, looks good! Thanks for adding this!


def __iter__(self) -> Iterator[T_co]:
batch: List = []
for d in self.datapipe:
Contributor

nit: can this be done by .batch(batch_size).map(...)?

Contributor Author

hhhh, totally makes sense to me. 😃

Contributor Author

After testing, .batch(batch_size).map() is not equivalent to .map_batches() when input_col is specified, because map treats its input as a single data structure rather than as a batch. As a result, input_col is applied at the batch level rather than at the element level.
Say our input is [(0,1), (2,3), (3,4), (5,6), (7,8), (9,10)] with a batch size of 3. With input_col as 1, the inputs sent into fn differ between the two implementations:

  • .batch().map(): Inputs are (2, 3) and (7, 8)
  • .map_batches(): Inputs are (1, 3, 4) and (6, 8, 10)
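The difference described above can be reproduced with a minimal plain-Python sketch (it does not use torchdata itself; batch_then_map and map_batches are hypothetical stand-ins for the two DataPipe compositions being compared):

```python
# Hypothetical stand-ins illustrating .batch(batch_size).map(fn, input_col)
# versus .map_batches(fn, batch_size, input_col) with the identity fn.

def batch(iterable, batch_size):
    """Group an iterable into lists of at most batch_size items."""
    out, cur = [], []
    for item in iterable:
        cur.append(item)
        if len(cur) == batch_size:
            out.append(cur)
            cur = []
    if cur:
        out.append(cur)
    return out

def batch_then_map(data, batch_size, input_col):
    # map sees each batch as a single data structure, so input_col
    # indexes into the batch itself: fn receives batch[input_col],
    # i.e. one whole row per batch.
    return [b[input_col] for b in batch(data, batch_size)]

def map_batches(data, batch_size, input_col):
    # map_batches applies input_col to every element of the batch,
    # so fn receives the column values of all rows in the batch.
    return [[row[input_col] for row in b] for b in batch(data, batch_size)]

data = [(0, 1), (2, 3), (3, 4), (5, 6), (7, 8), (9, 10)]
print(batch_then_map(data, 3, 1))  # [(2, 3), (7, 8)]
print(map_batches(data, 3, 1))     # [[1, 3, 4], [6, 8, 10]]
```

With input_col=1, the first composition hands fn the second row of each batch, while the second hands fn the second column of every row in the batch, matching the two bullet points above.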

Contributor

I see. This is because we have data[idx] rather than batch[idx] within _apply_fn.
