Parametrize data input according to scheduled run date #2275

casassg · 2020-08-04T19:17:38Z

As an engineer, I would like to window the input data to my pipeline according to my run date. Airflow allows you to parametrize execution according to the DAG execution date (see: https://airflow.apache.org/docs/stable/concepts.html#execution-date); we would need something similar in TFX.

One way to implement this would be to have a RuntimeParameter that maps to execution_date.

I'm wondering what options we have to solve this at the moment, and whether there is any way we can contribute to make it easier.
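To make the request concrete, here is a minimal sketch of the date windowing described above, in plain Python rather than any TFX API. The function name `input_window` and the convention that the window ends on the day before the run date are assumptions for illustration only.

```python
from datetime import date, timedelta
from typing import Tuple


def input_window(run_date: date, days: int) -> Tuple[date, date]:
    """Return the inclusive (start, end) dates of the `days` full days before run_date.

    Hypothetical helper: assumes the run date itself is not yet complete,
    so the window ends on the previous day.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    end = run_date - timedelta(days=1)        # last full day before the run
    start = end - timedelta(days=days - 1)    # inclusive window of `days` days
    return start, end


start, end = input_window(date(2020, 8, 4), days=5)
print(start.isoformat(), end.isoformat())  # 2020-07-30 2020-08-03
```

A scheduler would pass its own execution date as `run_date`, so each scheduled run would read a different slice of the data.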

1025KB · 2020-08-06T02:29:19Z

Just FYI, we are working on supporting a date spec in the ExampleGen input config.

This is for automatically picking up the latest date; we will support specifying a certain span later.

We will add a user guide here.

casassg · 2020-08-06T03:00:58Z

That's good to know. Looks like this can be used.

Is there any plan to support defining spans for query-based ExampleGen components? I'm asking because we are evaluating BigQuery as our data source, but we may still need support for data partitioning.

Another question: can this input span be configured to cover a window of time (e.g. use the last 5 days of data), as well as different windows for different splits (e.g. use 5 days of data for training and the last day for the eval dataset)?
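As an illustration of the two questions above, the following sketch splits the days before a run date into training and eval windows and renders one query per window. Everything here is hypothetical: the helper names, the table `my_project.my_dataset.events`, the column `event_date`, and the assumption that the training and eval windows do not overlap. It is not an existing TFX or BigQuery client API, only string construction.

```python
from datetime import date, timedelta


def split_windows(run_date, train_days, eval_days):
    """Eval covers the `eval_days` days just before run_date;
    training covers the `train_days` days before the eval window (no overlap)."""
    eval_end = run_date - timedelta(days=1)
    eval_start = eval_end - timedelta(days=eval_days - 1)
    train_end = eval_start - timedelta(days=1)
    train_start = train_end - timedelta(days=train_days - 1)
    return {"train": (train_start, train_end), "eval": (eval_start, eval_end)}


def render_query(table, date_column, start, end):
    """Build a date-bounded query string (inclusive on both ends)."""
    return (
        f"SELECT * FROM `{table}` "
        f"WHERE {date_column} BETWEEN DATE '{start.isoformat()}' "
        f"AND DATE '{end.isoformat()}'"
    )


windows = split_windows(date(2020, 8, 6), train_days=5, eval_days=1)
for split, (start, end) in windows.items():
    # Table and column names are placeholders.
    print(split, render_query("my_project.my_dataset.events", "event_date", start, end))
```

Each query could then be given to a query-based ExampleGen as the input for the corresponding split.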

1025KB · 2020-08-06T18:08:14Z

We don't have a plan for query-based span support yet; a custom ExampleGen might be needed.

For windowing, we have a resolver that fetches previously processed ExampleGen outputs to feed to Trainer.

casassg · 2020-08-06T18:23:48Z

> We don't have a plan for query-based span support yet; a custom ExampleGen might be needed.

Sounds like it could be generalized enough. Should we try to open-source it if we end up going that route?

> For windowing, we have a resolver that fetches previously processed ExampleGen outputs to feed to Trainer.

That seems a bit complicated to use. We would need to run ExampleGen N times before we can actually train at all.

1025KB · 2020-08-06T19:31:03Z

> > We don't have a plan for query-based span support yet; a custom ExampleGen might be needed.
>
> Sounds like it could be generalized enough. Should we try to open-source it if we end up going that route?

We do have an internal solution for query-based ExampleGen; the OSS query-based ExampleGen is still a work in progress.
May I ask about the use case? Does the table contain a date column, and do you want ExampleGen to process it daily?

> > For windowing, we have a resolver that fetches previously processed ExampleGen outputs to feed to Trainer.
>
> That seems a bit complicated to use. We would need to run ExampleGen N times before we can actually train at all.

This is for a rolling window, e.g., train on ExampleGen's outputs [1, 2, 3], then [2, 3, 4], then [3, 4, 5].
If what you need is [1, 2, 3], [4, 5, 6], [7, 8, 9], you can treat [1, 2, 3] as a single input unit for ExampleGen; then you only need to run ExampleGen once and train on its output (which contains data for [x, x+1, x+2]).

We are also working on advanced async execution, which allows components to run at different paces (so it can be 3 ExampleGen runs + 1 Trainer run).
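The difference between the two windowing schemes above can be sketched in plain Python over a list of span numbers. The function names are hypothetical and this is not how TFX resolves spans internally; it only shows which spans each training run would see.

```python
def rolling_windows(spans, size):
    """Overlapping windows that advance one span at a time."""
    return [spans[i:i + size] for i in range(len(spans) - size + 1)]


def tumbling_windows(spans, size):
    """Non-overlapping windows that advance `size` spans at a time."""
    return [spans[i:i + size] for i in range(0, len(spans) - size + 1, size)]


print(rolling_windows([1, 2, 3, 4, 5], 3))
# [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(tumbling_windows(list(range(1, 10)), 3))
# [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

In the rolling case every run adds one new span and reuses the previous ones, which is why earlier ExampleGen outputs have to be looked up; in the tumbling case each window can be ingested as a single unit.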

casassg · 2020-08-06T21:49:29Z

> May I ask about the use case? Does the table contain a date column, and do you want ExampleGen to process it daily?

Precisely. Our tables have a date column, and we are currently setting the time span we want manually. The issue is that, to productionize the pipeline, we need a way to partition the data and roll windows of it as input to the model.

> This is for a rolling window, e.g., train on ExampleGen's outputs [1, 2, 3], then [2, 3, 4], then [3, 4, 5].

Yes, this is most of our cases, I believe.

> If what you need is [1, 2, 3], [4, 5, 6], [7, 8, 9], you can treat [1, 2, 3] as a single input unit for ExampleGen; then you only need to run ExampleGen once and train on its output (which contains data for [x, x+1, x+2]).

Is the idea to support selecting the X latest spans in the current implementation? From the code, it seems it only collects the latest date span...

1025KB · 2020-08-10T18:02:19Z

> > May I ask about the use case? Does the table contain a date column, and do you want ExampleGen to process it daily?
>
> Precisely. Our tables have a date column, and we are currently setting the time span we want manually. The issue is that, to productionize the pipeline, we need a way to partition the data and roll windows of it as input to the model.
>
> > This is for a rolling window, e.g., train on ExampleGen's outputs [1, 2, 3], then [2, 3, 4], then [3, 4, 5].
>
> Yes, this is most of our cases, I believe.

Then each pipeline run (other than the first several runs) will produce a new span and get a rolling window of spans (via the resolver) for training, so ExampleGen and Trainer run the same number of times.

> > If what you need is [1, 2, 3], [4, 5, 6], [7, 8, 9], you can treat [1, 2, 3] as a single input unit for ExampleGen; then you only need to run ExampleGen once and train on its output (which contains data for [x, x+1, x+2]).
>
> Is the idea to support selecting the X latest spans in the current implementation? From the code, it seems it only collects the latest date span...

Check the resolver in this PR; it can resolve multiple inputs.

casassg · 2020-08-12T01:03:23Z

I see. Will this resolver then fail for the first N-1 runs if the pipeline is running in sync mode?

1025KB · 2020-08-12T01:45:58Z

It depends on the resolver's implementation. The latestArtifactResolver won't fail: it will try to get N artifacts and, if fewer are available, return whatever it has resolved.

You can implement your own resolver if the current resolvers don't fit your needs.
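The "return whatever is available" behavior described above can be sketched as follows. This is a plain-Python illustration with hypothetical names (`resolve_latest`, `has_full_window`), not the actual TFX resolver code; spans are represented as integers for simplicity.

```python
def resolve_latest(spans, n):
    """Return up to the n most recent spans in ascending order.

    Never fails when fewer than n spans exist; it returns what is there.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return sorted(spans)[-n:]


def has_full_window(resolved, n):
    """A custom resolver or downstream check could use this to skip training
    until a full window of n spans is available."""
    return len(resolved) == n


first_run = resolve_latest([1], 3)
later_run = resolve_latest([1, 2, 3, 4], 3)
print(first_run, has_full_window(first_run, 3))   # [1] False
print(later_run, has_full_window(later_run, 3))   # [2, 3, 4] True
```

With this behavior, the first N-1 runs in sync mode would train on a shorter window rather than fail; a stricter custom resolver could instead refuse to proceed until `has_full_window` is true.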

casassg · 2020-08-12T01:47:55Z

Hmm, I see. Once conditional execution is available, this may be easier to achieve.

1025KB · 2020-08-12T01:54:39Z

Conditional support is on our roadmap; stay tuned.

casassg · 2020-08-12T01:57:00Z

If we can find a way to support conditions based on the resolver config, that would be great 🎉

singhniraj08 · 2023-05-24T05:39:31Z

@casassg,

Are you still looking for a resolution? We are planning on prioritising the issues based on the community interests. Please let us know if this issue still persists with the latest TFX 1.13 release so that we can work on fixing it. Thank you for your contributions.

github-actions · 2023-06-01T02:14:53Z

This issue has been marked stale because it has no recent activity since 7 days. It will be closed if no further activity occurs. Thank you.

github-actions · 2023-06-09T02:06:44Z

This issue was closed due to lack of activity after being marked stale for past 7 days.

rmothukuru self-assigned this Aug 5, 2020

rmothukuru added stat:awaiting tensorflower type:feature labels Aug 5, 2020

rmothukuru assigned hj929 and unassigned rmothukuru Aug 5, 2020

zhitaoli assigned 1025KB Nov 20, 2020

singhniraj08 self-assigned this May 24, 2023

singhniraj08 added stat:awaiting response and removed stat:awaiting tensorflower labels May 24, 2023

github-actions bot added the stale label Jun 1, 2023

github-actions bot closed this as completed Jun 9, 2023
