Implement map_batches to align with TorchArrow API #359
Conversation
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Overall, looks good! Thanks for adding this!
def __iter__(self) -> Iterator[T_co]:
    batch: List = []
    for d in self.datapipe:
nit: can this be done by `.batch(batch_size).map(...)`?
Haha, totally makes sense to me. 😃
After testing, `.batch(batch_size).map()` is not equivalent to `.map_batches()` when `input_col` is specified, because `map` treats its input as a single data structure rather than a batch. As a result, `input_col` is applied at the batch level rather than at the data level.

Let's say our input is `[(0,1), (2,3), (3,4), (5,6), (7,8), (9,10)]` with a batch size of 3. With `input_col` as 1, the inputs sent into `fn` differ between the two implementations:
- `.batch().map()`: inputs are `(2, 3)` and `(7, 8)`
- `.map_batches()`: inputs are `(1, 3, 4)` and `(6, 8, 10)`
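The difference above can be reproduced in a plain-Python sketch, without any torchdata dependency (the list comprehensions below only simulate the two indexing behaviors described in the comment; they are not the actual DataPipe implementations):

```python
# Input data and parameters from the example above.
data = [(0, 1), (2, 3), (3, 4), (5, 6), (7, 8), (9, 10)]
batch_size, input_col = 3, 1

# Split the input into batches of 3.
batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

# .batch().map() behavior: input_col indexes into the batch itself,
# so fn receives batch[1] -- a single element per batch.
batch_map_inputs = [batch[input_col] for batch in batches]

# .map_batches() behavior: input_col indexes into each element,
# so fn receives the column slice of the whole batch.
map_batches_inputs = [tuple(row[input_col] for row in batch) for batch in batches]

print(batch_map_inputs)    # [(2, 3), (7, 8)]
print(map_batches_inputs)  # [(1, 3, 4), (6, 8, 10)]
```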
I see. This is because we have `data[idx]` rather than `batch[idx]` within `_apply_fn`.
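A minimal sketch of what that per-element indexing looks like (the function body here is hypothetical and only illustrates the `data[idx]` behavior being discussed, not the actual `_apply_fn` in the PR):

```python
def _apply_fn(fn, batch, input_col):
    # Index each element (data) of the batch, not the batch itself,
    # then hand the resulting column slice to fn.
    return fn([data[input_col] for data in batch])

# Example: summing column 1 of the first batch from the earlier example.
print(_apply_fn(sum, [(0, 1), (2, 3), (3, 4)], 1))  # 8, i.e. sum of (1, 3, 4)
```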
Per title