
RFC: TFX Standardized Inputs #162

Merged: 8 commits into tensorflow:master, Nov 8, 2019
Conversation

@brills (Contributor) commented Oct 3, 2019

Review period open through 2019-10-17

Status: Accepted
Author(s): Zhuo Peng ([email protected]), Kester Tong ([email protected])
Sponsor: Konstantinos Katsiapis ([email protected])
Updated: 2019-10-03

Objective

  • To define a common in-memory data representation that:
  • To define an I/O abstraction layer that produces the above in-memory
    representation from supported physical storage formats, while hiding TFX’s
    choice of such storage formats from TFX users (see the interface sketch
    after this list).
  • To define a bridge from the above in-memory representation to TF feedables
    (i.e. Tensors and certain CompositeTensors).
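
As an illustration of the last two objectives, here is a minimal sketch of the shape the proposed TFXIO abstraction could take. The method names follow the RFC's discussion (BeamSource, TensorAdapter, TFDataset), but the exact signatures below are assumptions, not the definitive interface:

```python
import abc

import apache_beam as beam
import pyarrow as pa
import tensorflow as tf


class TFXIO(abc.ABC):
  """Abstracts I/O and parsing for one (physical format, storage) combination."""

  @abc.abstractmethod
  def BeamSource(self, batch_size: int) -> beam.PTransform:
    """Returns a PTransform that yields a PCollection of pa.RecordBatch."""

  @abc.abstractmethod
  def ArrowSchema(self) -> pa.Schema:
    """Schema of the RecordBatches produced by BeamSource()."""

  @abc.abstractmethod
  def TensorAdapter(self) -> "TensorAdapter":
    """Converts a pa.RecordBatch to TF feedables (Tensors/CompositeTensors)."""

  @abc.abstractmethod
  def TFDataset(self) -> tf.data.Dataset:
    """A tf.data.Dataset of batched TF feedables, for the Trainer."""
```

A component would then depend only on this interface, not on the physical storage format behind it.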

@VoVAllen (Contributor) commented Oct 7, 2019

Do you have any plans to support zero-copy tensors on CUDA devices with Apache Arrow?

@alextp (Contributor) left a comment

This really should have been brought up before.

already supported) to each component they want to use.
* Individual components rely on unenforceable assumptions on how to interpret
the input data consistently.
* The complexity of adding new logical data representations (for example,
Contributor: This doesn't make sense to me. Couldn't all components use TF itself to parse the data, from whatever format the user has it in, into a bunch of tensors?

@brills (Author): It seems "parse the data" here implies a record-based I/O abstraction?

Unfortunately, in TFX parsing has to be considered together with I/O, because there might not be a record serialization of the data. And the TF ops that do I/O are exposed through tf.data.Dataset, so unless we bridge the gap between Beam's PSource and tf.data.Dataset, we can't easily use TF ops for parsing.

The co-existence of Beam's PSource and tf.data.Dataset in TFX, and their incompatibility, is one (but not the only) factor that led to this design -- if I can't use TF for I/O + parsing, why use a TF data structure?
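
To make the split concrete, here is a minimal sketch (the file pattern is hypothetical) of how the same files end up being read through two unrelated I/O stacks today:

```python
import apache_beam as beam
import tensorflow as tf

# Beam-based TFX components read the data as a PCollection via a PSource:
with beam.Pipeline() as p:
  records = p | "Read" >> beam.io.ReadFromTFRecord("gs://my-bucket/data-*")

# The TF Trainer reads the very same files through the unrelated tf.data stack:
dataset = tf.data.TFRecordDataset(tf.io.gfile.glob("gs://my-bucket/data-*"))
```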

A reviewer replied:

> And the TF ops that do I/O are exposed through tf.data.Dataset, so unless we bridge the gap between Beam's PSource and tf.data.Dataset, we can't easily use TF ops for parsing.

How difficult would that be? There are a lot of custom tf.data.Dataset adapters in tensorflow/io, so there's some precedent here. Not all of them require a record-based serialization.

I'm not familiar with Beam's extension points, but other similar systems provide a way for users to provide a "mapper" that controls iteration across an individual partition (e.g. Spark's RDD.mapPartitions()).
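
For illustration, a minimal PySpark sketch of that extension point (the input file and decode logic are made up): the user-supplied function receives an iterator over one partition and controls iteration itself, so no record-level serialization is imposed by the framework:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.textFile("data.txt")  # hypothetical input

def decode_partition(lines):
  # The user controls iteration over a single partition.
  for line in lines:
    yield line.strip().split(",")

decoded = rdd.mapPartitions(decode_partition)
```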

For the TFX components that are passing their input directly into TF, there's probably some value in optimizing the path to tensors, instead of indirecting via RecordBatch.

/cc @katsiapis @yongtang @jsimsa

@brills (Author) commented Oct 10, 2019

@VoVAllen:

> Do you have any plans to support zero-copy tensors on CUDA devices with Apache Arrow?

This proposal is more about using Apache Arrow as the in-memory representation in TFX components that do data analysis, as well as for feeding TF. So in most cases the data has to be present in main RAM for the data analysis tasks, and some copying has to happen if the data is fed to a TF graph executed with GPU kernels.

The only place zero-copy might make sense is in the TF Trainer (which is only one of a handful of TFX components), and there is a dedicated TFXIO API, TFDataset(), for it. I could imagine that in that Dataset one could prepare RecordBatches backed by CudaBuffers, but the subsequent conversion to Tensors would have to be copy-free and compute-free in order to profit. I'm not familiar with the performance characteristics of GPUs (i.e., how expensive copying a typically large batch is), so I'm not sure whether it's worth it.
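
For reference, a hedged sketch of the host-side path the proposal does optimize (not from the RFC; the array contents are made up): a primitive, null-free Arrow column can be viewed as NumPy without a copy, while feeding a GPU-executed graph still implies a host-to-device copy at some point:

```python
import pyarrow as pa
import tensorflow as tf

batch = pa.RecordBatch.from_arrays(
    [pa.array([1.0, 2.0, 3.0], type=pa.float32())], ["feature"])

# Primitive, null-free Arrow columns can be viewed as NumPy without a copy...
np_view = batch.column(0).to_numpy(zero_copy_only=True)

# ...but materializing a Tensor for a GPU-executed graph still implies a
# host-to-device copy somewhere downstream.
tensor = tf.convert_to_tensor(np_view)
```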

@brills (Author) commented Oct 10, 2019

Adding people who, in our internal discussion, we thought might be interested:
@jsimsa, @zoyahav, @rohan100jain, @mdreves, @zhitaoli, @Rose-RongLiu, @emkornfield, @TheNeuralBit, @aaltay, @robertwb, @chenkai

@brills brills closed this Oct 10, 2019
@brills brills reopened this Oct 10, 2019
@aaltay commented Oct 11, 2019

/cc @TheNeuralBit might be a good reviewer. Pinged above, but I want to ping individually.

@theadactyl (Contributor): @brills, what's the status of resolving the comments on this proposal?

@brills (Author) commented Oct 28, 2019

@theadactyl
I left them unresolved because they are important questions. I'm preparing a revision to the RFC to include more background to help address those questions, but I'd still like to keep them open until after the review.

Or, is the protocol that I have to resolve all the comments before the review?

@martinwicke (Member) commented Oct 28, 2019 via email

@brills (Author) commented Oct 29, 2019

@alextp, @karmel :

I revised the following sections of the RFC to include what we thought was missing / unclear:

  • Motivation
  • Requirements
  • Alternatives Considered

And the internal design doc (at go/standardized-tfx-inputs) contains more numbers / evidence to support the claims in those sections.

I hope they help clarify things and better answer your question "why not StructuredTensor?" Please take another look.

@karmel commented Oct 30, 2019

@brills -- it's really hard to tell what changed based on the most recent commit; is there a diff that doesn't rewrite the whole file?

@brills (Author) commented Oct 30, 2019

@karmel: I reverted the squash to recover the history. A diff between the revised version and the original version can be viewed at https://github.com/tensorflow/community/pull/162/files/c47a0fe60838abf78e5f0be14abe645b129ec036..273c62a1943c6986afd47858d0b4abb7c488fc21

…TFX components does not always need to run TF.
@brills (Author) left a comment

I found today that the Colab links in the doc and in the comments were only accessible within Google. I've made copies that are accessible to everyone and updated the doc. Sorry about that.

@theadactyl (Contributor) commented Nov 7, 2019

Notes from the design review meeting held Nov 7, 2:00PM PST:

Legend

KK: Gus
KA: Karmel
Z: Zhuo
PD: Pavel Dournov

Apache Arrow vs tf.StructuredTensor

Still, why not use tf.StructuredTensor as the in-memory representation?

Are custom TFX components forced to use TFXIO?

KK: Custom components can work with well-known abstractions in Arrow and benefit from the supporting infrastructure.

ABI compatibility

Maintenance of pyarrow wheel

KA: what’s the worst-case scenario?
Z: Arrow will likely hand the release over to someone else. The current process is handled through a rotation. The burden is unclear but seemed low in the latest release. Most issues in the past have been with TF, which was not manylinux-compliant; this has since been solved (TF is now manylinux-compliant).

How can users get access to a record (row) based Dataset?

KA: have we reached out to modellers for feedback?
Z: similar issues exist internally and the corresponding efforts mirror the proposal.
KK: there is break-glass functionality, but it is highly discouraged.
KA: any idea of break-glass usage internally?
Z: one known case is where the user bypasses tf.Dataset.
KK: one way to look at it is that TFX APIs align with TF, which already recommends tf.Dataset. We should start with more constrained APIs to enable transparent optimizations and gradually open them up. When there is a need for flexibility, open up the API further (with the associated “cans of worms”). Perhaps follow-up RFCs can indicate what extensions are needed.
PD: At what level is the API opinionated and locked?
KK: There is no requirement to use tfx.io -- the component can go directly to the raw files but miss out on optimizations.
PD: Can CSV data be used with tfx.io?
Z: It depends on where the data lives. Internal components are not likely to work with CSV as an internal storage format. CSV data can be imported with a conversion to an internal storage format.
PD: after data is ingested, can the user inject a custom component?
Z: Yes, the component can just use the tfx.io API. The user can also freeze the format of the data through an option, and then the component can handle that specific format.
Z: TFX will support a small set of formats. CSV will probably not be an internal representation due to inefficiency.

@theadactyl (Contributor): @brills Do you think we are good to merge?

@brills (Author) commented Nov 7, 2019

@theadactyl I think so. Thanks!

@theadactyl theadactyl added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels Nov 8, 2019
@theadactyl theadactyl merged commit d249774 into tensorflow:master Nov 8, 2019