-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add tfrecord_loader implementation #308
Conversation
Can you please drop |
I removed the TensorFlow dependency. |
Any update? |
@@ -0,0 +1,698 @@ | |||
# Generated by the protocol buffer compiler. DO NOT EDIT! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file should be generated, not part of git repo. Instead please put the original protobuf file + generation script (inside of tests folder).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we would need to include several protobuf files (example.proto depends on other files). Should the file be generated in setup.py?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are only a few protobuf files, we can add them to the library without having them generated.
Do you know how many there are?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are two: "example.proto" and "feature.proto" https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But are you sure the "tests" folder is a good fit for ".proto" files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized we need protobuf compiler "protoc" executable for the generation. It would be bad to assume it as a dependency if you support installation from source code in the README.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about we raise an error message with instructions (and a script) to generate the file within the library if the file is missing? I think that avoids add the generation to setup.py
and the multiprocessing issue.
cc: @VitalyFedyunin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just leave the generated file in the package? It seems to me like the easiest option and we would not require users to install protobuf compiler before using the loader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree to leave it as is as soon as we place source .proto files besides it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, we should add the generated files and the source .proto files as well. Somewhere like torchdata/datapipes/iter/util/protobuf_template
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, looks good! I left a few comments above. And:
- Can you rebase to the latest origin/main? You might need to update to header from "Facebook" to "Meta Platforms"
- Add the DataPipe to
test_serialization
? - Run
flake8
andmypy
on each of the.py
file
Thank you so much for working with us on this PR!
Thank you for your help with the PR! |
Do you have the error message for that failure? I am not seeing anything on my end. The CI failures seem pre-existing as well. |
@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
I merged the main branch, all tests are passing locally now (except for torchtext and torchaudio - these I dit not run). |
@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
This PR breaks mypy lint CI. @NivekT |
Let's forward fix it, if it possible to do quickly. |
The linked PR should fix the problem. |
Summary: This PR fixes mypy errors with TFRecord implementation. Fixes #308 ### Changes - Example and ExampleSpec renamed to TFExample and TFExampleSpec and exported to `torchdata.datapipes.iter` Pull Request resolved: #374 Reviewed By: msaroufim Differential Revision: D35943347 Pulled By: NivekT fbshipit-source-id: 65b728225b21f7b36262d88a2a03e6121689488d
Fixes #306
Changes
TFRecordLoaderIterDataPipe
that parses opened tfrecord streams and yields the individual records. It supports bothtf.train.Example
andtf.train.SequenceExample
and users can additionally specify shape/dtype for each feature in the dict.TFRecordLoaderIterDataPipe
.