RFC: TFX Standardized Inputs #162
Conversation
Do you have any plan to support zero-copy tensors on CUDA devices with Apache Arrow?
This really should have been brought up before.
already supported) to each component they want to use.
* Individual components rely on unenforceable assumptions on how to interpret the input data consistently.
* The complexity of adding new logical data representations (for example,
This doesn't make sense to me. Couldn't all components use TF itself to parse the data from whatever format the user has it in to a bunch of tensors?
It seems "parse the data" here implies a record based I/O abstraction?
Unfortunately in TFX, parsing has to be considered together with I/O because there might not be a record serialization of the data. And the TF Ops that do I/O are tf.Dataset, so unless we bridge the gap between beam's PSource and tf.Dataset, we can't easily use TF OPs to do parsing.
The co-existence of beam's PSource and tf.Dataset in TFX and their incompatibility is one (but not the only one) of the factors led to this design -- if I can't use TF for I/O + parsing, why using a TF data structure?
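To make the incompatibility concrete, here is a minimal sketch (the file name is a placeholder, not from the thread) of the two I/O stacks reading the same TFRecord data:

```python
import apache_beam as beam
import tensorflow as tf

# Beam-based TFX components read through a Beam source and get a
# PCollection of raw serialized records...
pipeline = beam.Pipeline()
raw_records = pipeline | beam.io.ReadFromTFRecord("examples.tfrecord")

# ...while the TF trainer reads the very same file through tf.data;
# the two I/O stacks share no code and no abstractions.
dataset = tf.data.TFRecordDataset("examples.tfrecord")
```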
And the TF ops that do I/O are tf.data.Dataset ops, so unless we bridge the gap between Beam's PSource and tf.data.Dataset, we can't easily use TF ops to do parsing.
How difficult would that be? There are a lot of custom tf.data.Dataset adapters in tensorflow/io, so there's some precedent here. Not all of them require a record-based serialization.
I'm not familiar with Beam's extension points, but other similar systems provide a way for users to supply a "mapper" that controls iteration across an individual partition (e.g. Spark's RDD.mapPartitions()).
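As an illustration (not part of the original comment), the lowest-tech bridge is tf.data.Dataset.from_generator over a user-supplied reader; the dedicated adapters in tensorflow/io exist largely to avoid the per-element Python overhead this incurs:

```python
import tensorflow as tf

# read_partition is a hypothetical stand-in for a user-supplied reader over
# one partition of a non-record-based source.
def read_partition():
  for value in (1, 2, 3):
    yield value

# from_generator exposes it as a tf.data source, at the cost of every
# element crossing the Python boundary.
dataset = tf.data.Dataset.from_generator(read_partition, output_types=tf.int64)
```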
For the TFX components that are passing their input directly into TF, there's probably some value in optimizing the path to tensors, instead of indirecting via RecordBatch.
This proposal is more about using Apache Arrow as the in-memory representation in TFX components that do data analysis as well as feeding TF. In most cases the data has to be present in main RAM for the data analysis tasks, and some copying has to happen if the data is then fed to a TF graph executed with GPU kernels. The only place zero-copy might make sense is the TF Trainer (which is only one of a handful of TFX components), and there is a dedicated TFXIO API, TFDataset(), for it. I could imagine that in that Dataset one could prepare RecordBatches backed by CudaBuffers, but the subsequent conversion to Tensors would have to be copy-free and compute-free in order to benefit. I'm not familiar with the performance characteristics of GPUs (i.e. how expensive copying a typically large batch is), so I'm not sure it's worth it.
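A minimal sketch of the dual-view idea described above; only TFDataset() is named in this thread, so the class name and the Beam-side method below are hypothetical placeholders rather than the RFC's actual interface:

```python
import abc

import apache_beam as beam
import tensorflow as tf


class StandardizedSource(abc.ABC):
  """Hypothetical sketch: one source, two views over the same data."""

  @abc.abstractmethod
  def beam_record_batches(self) -> beam.PTransform:
    """Returns a PTransform yielding pyarrow.RecordBatch, for Beam-based analysis."""

  @abc.abstractmethod
  def TFDataset(self) -> tf.data.Dataset:
    """Returns a tf.data.Dataset that feeds the TF trainer directly."""
```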
(Adding people who we thought (in the internal discussion) might be interested):
/cc @TheNeuralBit might be a good reviewer. Pinged above, but I want to ping individually.
@brills what's the status of resolving comments on this proposal?
@theadactyl Or, is the protocol that I have to resolve all the comments before the review?
All major questions should be resolved with the stakeholders involved
beforehand. We can resolve details in the review (often, naming and such),
but there shouldn't be open architecture or directional questions.
I revised the following sections of the RFC to include what we thought was missing / unclear:
And the internal design doc (at go/standardized-tfx-inputs) contains more numbers / evidence to support the claims in those sections. I hope they help clarify things and better answer your question "why not StructuredTensor?" Please take another look.
@brills -- it's really hard to tell what changed based on the most recent commit; is there a diff that doesn't rewrite the whole file?
@karmel: I reverted the squash to recover the history. A diff between the revised version and the original version can be viewed at https://github.com/tensorflow/community/pull/162/files/c47a0fe60838abf78e5f0be14abe645b129ec036..273c62a1943c6986afd47858d0b4abb7c488fc21
…TFX components does not always need to run TF.
I found out today that the Colab links in the doc and in the comments are only accessible within Google. I've made copies that are accessible to everyone and updated the doc. Sorry about that.
Notes from the design review meeting held Nov 7, 2:00PM PST:
Legend: KK, Gus

Apache Arrow vs tf.StructuredTensor
* Still, why not use tf.StructuredTensor as the in-memory representation?

Are custom TFX components forced to use TFXIO?
* KK: Custom components can work with well-known abstractions in Arrow and benefit from the supporting infrastructure.

ABI compatibility / maintenance of the pyarrow wheel
* KA: what's the worst-case scenario?

How can users get access to a record (row) based Dataset?

* KA: have we reached out to modellers for feedback?
@brills Do you think we are good to merge?
@theadactyl I think so. Thanks!
Review period open through 2019-10-17
Objective
Propose a standard in-memory data representation that:
* is expressive enough to encode flat (tf.Example), sequence (tf.SequenceExample) or structured data (e.g. Protocol Buffers or Apache Avro);
* all TFX components can understand and support their own use cases with.

Propose an I/O abstraction layer that produces that in-memory representation from supported physical storage formats, while hiding TFX's choice of such storage formats from TFX users.

Propose a standard way to translate the in-memory representation to TF feedables (i.e. Tensors and certain CompositeTensors).
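As a minimal illustration of the last point (my own example, not part of the RFC), a variable-length Arrow list column can be bridged to a TF feedable such as a tf.RaggedTensor:

```python
import pyarrow as pa
import tensorflow as tf

# A batch of 3 "rows"; each cell is a variable-length int64 list, roughly
# what one flat tf.Example feature looks like after decoding.
batch = pa.RecordBatch.from_arrays(
    [pa.array([[1, 2], [], [3]], type=pa.list_(pa.int64()))],
    names=["feature_a"])

col = batch.column(0)  # a pa.ListArray
# Bridge to a TF feedable: build a RaggedTensor from the list offsets and
# the flattened values.
ragged = tf.RaggedTensor.from_row_splits(
    values=col.flatten().to_numpy(),
    row_splits=col.offsets.to_numpy())
print(ragged)  # <tf.RaggedTensor [[1, 2], [], [3]]>
```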