
RFC: TFX Standardized Inputs #162

Merged: 8 commits into tensorflow:master, Nov 8, 2019
Conversation

@brills (Contributor) commented Oct 3, 2019

Review period open through 2019-10-17

Status: Accepted
Author(s): Zhuo Peng ([email protected]), Kester Tong ([email protected])
Sponsor: Konstantinos Katsiapis ([email protected])
Updated: 2019-10-03

Objective

  • To define a common in-memory data representation that:
  • To define an I/O abstraction layer that produces the above in-memory
    representation from supported physical storage formats, while hiding TFX’s
    choice of such storage formats from TFX users (see the interface sketch
    after this list).
  • To define a bridge from the above in-memory representation to TF feedables
    (i.e. Tensors and certain CompositeTensors).
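
As an illustration of the last two objectives, here is a minimal sketch of the shape the proposed TFXIO abstraction could take. The method names follow the RFC's discussion (BeamSource, TensorAdapter, TFDataset), but the exact signatures below are assumptions, not the definitive interface:

```python
import abc

import apache_beam as beam
import pyarrow as pa
import tensorflow as tf


class TFXIO(abc.ABC):
  """Abstracts I/O and parsing for one (physical format, storage) combination."""

  @abc.abstractmethod
  def BeamSource(self, batch_size: int) -> beam.PTransform:
    """Returns a PTransform that yields a PCollection of pa.RecordBatch."""

  @abc.abstractmethod
  def ArrowSchema(self) -> pa.Schema:
    """Schema of the RecordBatches produced by BeamSource()."""

  @abc.abstractmethod
  def TensorAdapter(self) -> "TensorAdapter":
    """Converts a pa.RecordBatch to TF feedables (Tensors/CompositeTensors)."""

  @abc.abstractmethod
  def TFDataset(self) -> tf.data.Dataset:
    """A tf.data.Dataset of batched TF feedables, for the Trainer."""
```

A component would then depend only on this interface, not on the physical storage format behind it.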

@VoVAllen (Contributor) commented Oct 7, 2019

Do you have any plans to support zero-copy tensors on CUDA devices with Apache Arrow?

@alextp (Contributor) left a comment

This really should have been brought up before.

already supported) to each component they want to use.
* Individual components rely on unenforceable assumptions on how to interpret
the input data consistently.
* The complexity of adding new logical data representations (for example,
Contributor: This doesn't make sense to me. Couldn't all components use TF itself to parse the data, from whatever format the user has it in, into a bunch of tensors?

@brills (Author): It seems "parse the data" here implies a record-based I/O abstraction?

Unfortunately, in TFX parsing has to be considered together with I/O, because there might not be a record serialization of the data. And the TF ops that do I/O are exposed through tf.data.Dataset, so unless we bridge the gap between Beam's PSource and tf.data.Dataset, we can't easily use TF ops for parsing.

The co-existence of Beam's PSource and tf.data.Dataset in TFX, and their incompatibility, is one (but not the only) factor that led to this design -- if I can't use TF for I/O + parsing, why use a TF data structure?
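
To make the split concrete, here is a minimal sketch (the file pattern is hypothetical) of how the same files end up being read through two unrelated I/O stacks today:

```python
import apache_beam as beam
import tensorflow as tf

# Beam-based TFX components read the data as a PCollection via a PSource:
with beam.Pipeline() as p:
  records = p | "Read" >> beam.io.ReadFromTFRecord("gs://my-bucket/data-*")

# The TF Trainer reads the very same files through the unrelated tf.data stack:
dataset = tf.data.TFRecordDataset(tf.io.gfile.glob("gs://my-bucket/data-*"))
```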

A reviewer replied:

> And the TF ops that do I/O are exposed through tf.data.Dataset, so unless we bridge the gap between Beam's PSource and tf.data.Dataset, we can't easily use TF ops for parsing.

How difficult would that be? There are a lot of custom tf.data.Dataset adapters in tensorflow/io, so there's some precedent here. Not all of them require a record-based serialization.

I'm not familiar with Beam's extension points, but other similar systems provide a way for users to provide a "mapper" that controls iteration across an individual partition (e.g. Spark's RDD.mapPartitions()).
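
For illustration, a minimal PySpark sketch of that extension point (the input file and decode logic are made up): the user-supplied function receives an iterator over one partition and controls iteration itself, so no record-level serialization is imposed by the framework:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.textFile("data.txt")  # hypothetical input

def decode_partition(lines):
  # The user controls iteration over a single partition.
  for line in lines:
    yield line.strip().split(",")

decoded = rdd.mapPartitions(decode_partition)
```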

For the TFX components that are passing their input directly into TF, there's probably some value in optimizing the path to tensors, instead of indirecting via RecordBatch.

/cc @katsiapis @yongtang @jsimsa

@brills (Author) commented Oct 10, 2019

@VoVAllen:

> Do you have any plans to support zero-copy tensors on CUDA devices with Apache Arrow?

This proposal is more about using Apache Arrow as the in-memory representation in TFX components that do data analysis, as well as for feeding TF. So in most cases the data has to be present in main RAM for the data analysis tasks, and some copying has to happen if the data is fed to a TF graph executed with GPU kernels.

The only place zero-copy might make sense is in the TF Trainer (which is only one of a handful of TFX components), and there is a dedicated TFXIO API, TFDataset(), for it. I could imagine that in that Dataset one could prepare RecordBatches backed by CudaBuffers, but the subsequent conversion to Tensors would have to be copy-free and compute-free in order to profit. I'm not familiar with the performance characteristics of GPUs (i.e., how expensive copying a typically large batch is), so I'm not sure whether it's worth it.
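
For reference, a hedged sketch of the host-side path the proposal does optimize (not from the RFC; the array contents are made up): a primitive, null-free Arrow column can be viewed as NumPy without a copy, while feeding a GPU-executed graph still implies a host-to-device copy at some point:

```python
import pyarrow as pa
import tensorflow as tf

batch = pa.RecordBatch.from_arrays(
    [pa.array([1.0, 2.0, 3.0], type=pa.float32())], ["feature"])

# Primitive, null-free Arrow columns can be viewed as NumPy without a copy...
np_view = batch.column(0).to_numpy(zero_copy_only=True)

# ...but materializing a Tensor for a GPU-executed graph still implies a
# host-to-device copy somewhere downstream.
tensor = tf.convert_to_tensor(np_view)
```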

@brills (Author) commented Oct 10, 2019

Adding people who, in our internal discussion, we thought might be interested:
@jsimsa, @zoyahav, @rohan100jain, @mdreves, @zhitaoli, @Rose-RongLiu, @emkornfield, @TheNeuralBit, @aaltay, @robertwb, @chenkai

@brills brills closed this Oct 10, 2019
@brills brills reopened this Oct 10, 2019
@aaltay commented Oct 11, 2019

/cc @TheNeuralBit might be a good reviewer. Pinged above, but I want to ping individually.

@theadactyl (Contributor): @brills, what's the status of resolving the comments on this proposal?

@brills (Author) commented Oct 28, 2019

@theadactyl
I left them unresolved because they are important questions. I'm preparing a revision to the RFC to include more background to help address those questions, but I'd still like to keep them open until after the review.

Or, is the protocol that I have to resolve all the comments before the review?

@martinwicke (Member) commented Oct 28, 2019 via email

@brills (Author) commented Oct 29, 2019

@alextp, @karmel :

I revised the following sections of the RFC to include what we thought was missing / unclear:

  • Motivation
  • Requirements
  • Alternatives Considered

And the internal design doc (at go/standardized-tfx-inputs) contains more numbers / evidence to support the claims in those sections.

I hope they help clarify things and better answer your question "why not StructuredTensor?" Please take another look.

@karmel commented Oct 30, 2019

@brills -- it's really hard to tell what changed based on the most recent commit; is there a diff that doesn't rewrite the whole file?

@brills (Author) commented Oct 30, 2019

@karmel: I reverted the squash to recover the history. A diff between the revised version and the original version can be viewed at https://github.com/tensorflow/community/pull/162/files/c47a0fe60838abf78e5f0be14abe645b129ec036..273c62a1943c6986afd47858d0b4abb7c488fc21

…TFX components does not always need to run TF.
@brills (Author) left a comment

I found today that the Colab links in the doc and in the comments were only accessible within Google. I've made copies that are accessible to everyone and updated the doc. Sorry about that.

@theadactyl (Contributor) commented Nov 7, 2019

Notes from the design review meeting held Nov 7, 2:00PM PST:

Legend

KK: Gus
KA: Karmel
Z: Zhuo
PD: Pavel Dournov

Apache Arrow vs tf.StructuredTensor

Still, why not use tf.StructuredTensor as the in-memory representation?

Are custom TFX components forced to use TFXIO?

KK: Custom components can work with well-known abstractions in Arrow and benefit from the supporting infrastructure.

ABI compatibility

Maintenance of pyarrow wheel

KA: what’s the worst-case scenario?
Z: Arrow will likely hand the release over to someone else. The current process is handled through a rotation. The burden is unclear but seemed low in the latest release. Most issues in the past have been with TF, which was not manylinux-compliant; this has since been solved (TF is now manylinux-compliant).

How can users get access to a record (row) based Dataset?

KA: have we reached out to modellers for feedback?
Z: similar issues exist internally and the corresponding efforts mirror the proposal.
KK: there is break-glass functionality, but it is highly discouraged.
KA: any idea of break-glass usage internally?
Z: one known case is where the user bypasses tf.Dataset.
KK: one way to look at it is that TFX APIs align with TF, which already recommends tf.Dataset. We should start with more constrained APIs to enable transparent optimizations and gradually open them up. When there is a need for flexibility, open up the API further (with the associated “cans of worms”). Perhaps follow-up RFCs can indicate what extensions are needed.
PD: At what level is the API opinionated and locked?
KK: There is no requirement to use tfx.io -- the component can go directly to the raw files but miss out on optimizations.
PD: Can CSV data be used with tfx.io?
Z: It depends on where the data lives. Internal components are not likely to work with CSV as an internal storage format. CSV data can be imported with a conversion to an internal storage format.
PD: after data is ingested, can the user inject a custom component?
Z: Yes, the component can just use the tfx.io API. The user can also freeze the format of the data through an option, and then the component can handle that specific format.
Z: TFX will support a small set of formats. CSV will probably not be an internal representation due to inefficiency.

@theadactyl (Contributor): @brills Do you think we are good to merge?

@brills (Author) commented Nov 7, 2019

@theadactyl I think so. Thanks!

@theadactyl theadactyl added RFC: Accepted RFC Design Document: Accepted by Review and removed RFC: Proposed RFC Design Document labels Nov 8, 2019
@theadactyl theadactyl merged commit d249774 into tensorflow:master Nov 8, 2019