Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tfrecord_loader implementation #308

Closed
wants to merge 12 commits into from
Closed

Conversation

jkulhanek
Copy link
Contributor

Fixes #306

Changes

  • Added TFRecordLoaderIterDataPipe that parses opened tfrecord streams and yields the individual records. It supports both tf.train.Example and tf.train.SequenceExample and users can additionally specify shape/dtype for each feature in the dict.
  • Added tests for TFRecordLoaderIterDataPipe.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 19, 2022
@NivekT NivekT self-requested a review March 21, 2022 04:01
@VitalyFedyunin
Copy link
Contributor

Can you please drop tensorflow dependency and leave only protobuf. It is ok if you would need to commit binaries that are required for testing directly into the repo (please keep them within 100kb).

@jkulhanek
Copy link
Contributor Author

I removed the TensorFlow dependency.

@jkulhanek
Copy link
Contributor Author

Any update?

@@ -0,0 +1,698 @@
# Generated by the protocol buffer compiler. DO NOT EDIT!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should be generated, not part of git repo. Instead please put the original protobuf file + generation script (inside of tests folder).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we would need to include several protobuf files (example.proto depends on other files). Should the file be generated in setup.py?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are only a few protobuf files, we can add them to the library without having them generated.

Do you know how many there are?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are two: "example.proto" and "feature.proto" https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/example

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But are you sure the "tests" folder is a good fit for ".proto" files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized we need protobuf compiler "protoc" executable for the generation. It would be bad to assume it as a dependency if you support installation from source code in the README.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about we raise an error message with instructions (and a script) to generate the file within the library if the file is missing? I think that avoids add the generation to setup.py and the multiprocessing issue.

cc: @VitalyFedyunin

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just leave the generated file in the package? It seems to me like the easiest option and we would not require users to install protobuf compiler before using the loader.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree to leave it as is as soon as we place source .proto files besides it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, we should add the generated files and the source .proto files as well. Somewhere like torchdata/datapipes/iter/util/protobuf_template.

Copy link
Contributor

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks good! I left a few comments above. And:

  1. Can you rebase to the latest origin/main? You might need to update to header from "Facebook" to "Meta Platforms"
  2. Add the DataPipe to test_serialization?
  3. Run flake8 and mypy on each of the .py file

Thank you so much for working with us on this PR!

@NivekT NivekT added ciflow/period Run period tests and removed ciflow/period Run period tests labels Apr 14, 2022
@jkulhanek
Copy link
Contributor Author

Thank you for your help with the PR!
I have updated the files according to your comments. After merging the main branch, one of the tests is still failing on class "torchdata.datapipes.iter.util.unzipper.UnZipperIterDataPipe". Otherwise, it looks good. Both mypy and flake8 report no errors in the added code.

@NivekT
Copy link
Contributor

NivekT commented Apr 21, 2022

Thank you for your help with the PR! I have updated the files according to your comments. After merging the main branch, one of the tests is still failing on class "torchdata.datapipes.iter.util.unzipper.UnZipperIterDataPipe". Otherwise, it looks good. Both mypy and flake8 report no errors in the added code.

Do you have the error message for that failure? I am not seeing anything on my end.

The CI failures seem pre-existing as well.

@facebook-github-bot
Copy link
Contributor

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jkulhanek
Copy link
Contributor Author

I merged the main branch, all tests are passing locally now (except for torchtext and torchaudio - these I dit not run).

@facebook-github-bot
Copy link
Contributor

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ejguan
Copy link
Contributor

ejguan commented Apr 26, 2022

This PR breaks mypy lint CI. @NivekT

@VitalyFedyunin
Copy link
Contributor

Let's forward fix it, if it possible to do quickly.

@jkulhanek
Copy link
Contributor Author

The linked PR should fix the problem.

facebook-github-bot pushed a commit that referenced this pull request Apr 27, 2022
Summary:
This PR fixes mypy errors with TFRecord implementation.

Fixes  #308

### Changes
- Example and ExampleSpec renamed to TFExample and TFExampleSpec and exported to `torchdata.datapipes.iter`

Pull Request resolved: #374

Reviewed By: msaroufim

Differential Revision: D35943347

Pulled By: NivekT

fbshipit-source-id: 65b728225b21f7b36262d88a2a03e6121689488d
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add tfrecord support
5 participants