Add tfrecord_loader implementation #308

jkulhanek · 2022-03-19T07:15:39Z

Fixes #306

Changes

Added TFRecordLoaderIterDataPipe, which parses opened tfrecord streams and yields the individual records. It supports both tf.train.Example and tf.train.SequenceExample, and users can optionally specify the shape and dtype of each feature in the dict.
Added tests for TFRecordLoaderIterDataPipe.
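
For readers unfamiliar with the container format, here is a minimal sketch of the record framing that a loader like this has to parse. It is not the code from this PR: the function names (`read_records`, `write_record`, `masked_crc32c`) are hypothetical, and it only handles the outer framing (little-endian uint64 length, masked CRC32C of the length, payload, masked CRC32C of the payload). Decoding each payload as a tf.train.Example protobuf message is out of scope here.

```python
import io
import struct
from typing import BinaryIO, Iterator, List


def _make_crc32c_table() -> List[int]:
    # Lookup table for CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data: bytes) -> int:
    # TFRecord stores a rotated and offset CRC rather than the raw value.
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def write_record(stream: BinaryIO, payload: bytes) -> None:
    # Hypothetical helper, used here only to build test data.
    header = struct.pack("<Q", len(payload))
    stream.write(header)
    stream.write(struct.pack("<I", masked_crc32c(header)))
    stream.write(payload)
    stream.write(struct.pack("<I", masked_crc32c(payload)))


def _read_exact(stream: BinaryIO, size: int) -> bytes:
    data = stream.read(size)
    if len(data) != size:
        raise ValueError("truncated tfrecord stream")
    return data


def read_records(stream: BinaryIO, verify: bool = True) -> Iterator[bytes]:
    """Yield the raw payload of each record in an opened binary stream."""
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of stream
        if len(header) != 8:
            raise ValueError("truncated tfrecord stream")
        (length,) = struct.unpack("<Q", header)
        (header_crc,) = struct.unpack("<I", _read_exact(stream, 4))
        payload = _read_exact(stream, length)
        (payload_crc,) = struct.unpack("<I", _read_exact(stream, 4))
        if verify and header_crc != masked_crc32c(header):
            raise ValueError("corrupted record length")
        if verify and payload_crc != masked_crc32c(payload):
            raise ValueError("corrupted record payload")
        yield payload


# Round trip through an in-memory stream.
buffer = io.BytesIO()
for item in (b"first", b"second"):
    write_record(buffer, item)
buffer.seek(0)
records = list(read_records(buffer))
```

Yielding raw bytes keeps the framing logic independent of protobuf, so the message decoding can be layered on top separately.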

VitalyFedyunin · 2022-03-23T18:18:40Z

Can you please drop the TensorFlow dependency and keep only protobuf? It is OK if you need to commit binaries that are required for testing directly into the repo (please keep them under 100 KB).

jkulhanek · 2022-03-25T21:27:38Z

I removed the TensorFlow dependency.

jkulhanek · 2022-04-06T14:17:44Z

Any update?

torchdata/datapipes/iter/util/tfrecordloader.py

VitalyFedyunin · 2022-04-07T22:20:49Z

torchdata/datapipes/iter/util/_tfrecord_example_pb2.py

@@ -0,0 +1,698 @@
+# Generated by the protocol buffer compiler.  DO NOT EDIT!


This file should be generated, not committed to the git repo. Instead, please add the original protobuf file and a generation script (inside the tests folder).

jkulhanek · 2022-04-11

I think we would need to include several protobuf files (example.proto depends on other files). Should the file be generated in setup.py?

NivekT · 2022-04-12

If there are only a few protobuf files, we can add them to the library without having them generated.

Do you know how many there are?

jkulhanek · 2022-04-13

I think there are two: "example.proto" and "feature.proto". See https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/example

jkulhanek · 2022-04-13

But are you sure the "tests" folder is a good fit for ".proto" files?

jkulhanek · 2022-04-15

I just realized we need the protobuf compiler executable "protoc" for the generation. It would be bad to assume it as a dependency if you support installation from source code in the README.

NivekT · 2022-04-15

What if we raise an error with instructions (and a script) for generating the file within the library when the file is missing? I think that avoids adding the generation to setup.py and the multiprocessing issue.

cc: @VitalyFedyunin

jkulhanek · 2022-04-15

Can't we just leave the generated file in the package? It seems to me like the easiest option, and we would not require users to install the protobuf compiler before using the loader.

VitalyFedyunin · 2022-04-19

I agree to leave it as is, as long as we place the source .proto files beside it.

NivekT · 2022-04-19

Agreed, we should add the generated files and the source .proto files as well. Somewhere like torchdata/datapipes/iter/util/protobuf_template.
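
For context, the alternative floated earlier in this thread (raising an error with generation instructions when the generated module is missing) could look roughly like the sketch below. This is not what the PR ended up doing, since the thread settled on committing the generated file next to its .proto sources. The helper name `import_generated_proto` and the module name in the usage line are hypothetical; the `protoc --python_out` flag is the standard way to generate Python modules.

```python
import importlib
from types import ModuleType

# Instructions shown to the user when the generated module is absent.
_PROTOC_HINT = (
    "Generate it with the protobuf compiler, for example:\n"
    "    protoc --python_out=. example.proto feature.proto"
)


def import_generated_proto(module_name: str) -> ModuleType:
    """Import a protoc-generated module, or fail with generation instructions."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as error:
        missing = error.name or module_name
        # Re-raise untouched if some other dependency is what is missing
        # (for instance the protobuf runtime imported by the generated file).
        if module_name != missing and not module_name.startswith(missing + "."):
            raise
        raise ModuleNotFoundError(
            f"Generated protobuf module {module_name!r} was not found. "
            + _PROTOC_HINT
        ) from error
```

A caller would then write something like `example_pb2 = import_generated_proto("example_pb2")` at the point of first use. The drawback raised in the thread still applies: users would need the `protoc` executable installed before they could use the loader.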

NivekT

Overall, looks good! I left a few comments above. In addition:

- Can you rebase onto the latest origin/main? You might need to update the header from "Facebook" to "Meta Platforms".
- Can you add the DataPipe to test_serialization?
- Please run flake8 and mypy on each of the .py files.

Thank you so much for working with us on this PR!

jkulhanek · 2022-04-20T20:12:52Z

Thank you for your help with the PR!
I have updated the files according to your comments. After merging the main branch, one of the tests is still failing on class "torchdata.datapipes.iter.util.unzipper.UnZipperIterDataPipe". Otherwise, it looks good. Both mypy and flake8 report no errors in the added code.

NivekT · 2022-04-21T14:56:17Z

> Thank you for your help with the PR! I have updated the files according to your comments. After merging the main branch, one of the tests is still failing on class "torchdata.datapipes.iter.util.unzipper.UnZipperIterDataPipe". Otherwise, it looks good. Both mypy and flake8 report no errors in the added code.

Do you have the error message for that failure? I am not seeing anything on my end.

The CI failures seem pre-existing as well.

facebook-github-bot · 2022-04-21T14:56:48Z

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

NivekT

LGTM!

jkulhanek · 2022-04-24T14:27:07Z

I merged the main branch; all tests are passing locally now (except for torchtext and torchaudio, which I did not run).

facebook-github-bot · 2022-04-25T14:02:55Z

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

ejguan · 2022-04-26T15:30:42Z

This PR breaks mypy lint CI. @NivekT

VitalyFedyunin · 2022-04-26T15:51:44Z

Let's forward-fix it, if it is possible to do so quickly.

jkulhanek · 2022-04-26T18:02:21Z

The linked PR should fix the problem.

Summary: This PR fixes mypy errors with TFRecord implementation.

Fixes #308

### Changes

- Example and ExampleSpec renamed to TFExample and TFExampleSpec and exported to `torchdata.datapipes.iter`

Pull Request resolved: #374

Reviewed By: msaroufim

Differential Revision: D35943347

Pulled By: NivekT

fbshipit-source-id: 65b728225b21f7b36262d88a2a03e6121689488d

facebook-github-bot added the CLA Signed label Mar 19, 2022

jkulhanek force-pushed the main branch from 6c773bd to cec0174 Compare March 19, 2022 07:21

NivekT self-requested a review March 21, 2022 04:01

Add tfrecord_loader implementation

492ee94

jkulhanek force-pushed the main branch from 52fdb37 to 492ee94 Compare March 21, 2022 07:43

NivekT requested a review from VitalyFedyunin March 21, 2022 19:52

jkulhanek added 2 commits March 25, 2022 16:25

Remove TF dependency from test

87e0dd7

Remove TensorFlow dependency

9660210

VitalyFedyunin reviewed Apr 7, 2022

torchdata/datapipes/iter/util/tfrecordloader.py (outdated, resolved)

VitalyFedyunin reviewed Apr 7, 2022

jkulhanek added 2 commits April 13, 2022 15:19

Add static example.proto and feature.proto assumption comment

5d29d7f

Merge branch 'tfrecordloader' of github.com:jkulhanek/torchdata into …

14c4802

…tfrecordloader

NivekT reviewed Apr 14, 2022

NivekT added and removed the ciflow/period label Apr 14, 2022

NivekT mentioned this pull request Apr 14, 2022

Prevent any release workflow running on fork #361

Closed

jkulhanek added 6 commits April 20, 2022 21:35

Update TFRecordLoader impl

79d2a84

Merge

903e909

Merge

de2bc1a

Update copyright to Meta

fd147d6

Add test serialization for tfrecord

e662acf

Add mypy support

3493fb0

NivekT approved these changes Apr 21, 2022

Merge branch 'main' of github.com:pytorch/data

3cf8d20

facebook-github-bot closed this in d9bbbec Apr 25, 2022

jkulhanek mentioned this pull request Apr 26, 2022

Fixes mypy problems with tfrecord_loader implementation #374

Closed
