Parametrize data input according to scheduled run date #2275

Closed
casassg opened this issue Aug 4, 2020 · 15 comments

Comments

@casassg
Member

casassg commented Aug 4, 2020

As an engineer I would like to window the input data into my pipeline according to my run date. Similar to how Airflow allows you to parametrize execution according to the DAG execution date (see: https://airflow.apache.org/docs/stable/concepts.html#execution-date) we would need to do something similar in TFX.

One way to implement this would be to have a RuntimeParameter that maps to execution_date.

I'm wondering what options we have to solve this at the moment, and whether there is any way we can contribute to make it easier.
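
For illustration, here is a minimal sketch of the idea that does not depend on TFX: derive a date window from the scheduled run date and render it into a query. The helper name `windowed_query`, the table `my_dataset.events`, and the `event_date` column are all hypothetical, and the choice to end the window on the day before the run date is an assumption. In a real pipeline the run date would come from the orchestrator (for example Airflow's execution date).

```python
from datetime import date, timedelta


def windowed_query(run_date: date, window_days: int,
                   table: str = "my_dataset.events",      # hypothetical table
                   date_column: str = "event_date") -> str:  # hypothetical column
    """Build a query selecting the `window_days` days that end the day before `run_date`."""
    if window_days < 1:
        raise ValueError("window_days must be >= 1")
    end = run_date - timedelta(days=1)              # last full day before the run
    start = end - timedelta(days=window_days - 1)   # inclusive start of the window
    return (
        f"SELECT * FROM `{table}` "
        f"WHERE {date_column} BETWEEN '{start.isoformat()}' AND '{end.isoformat()}'"
    )


if __name__ == "__main__":
    print(windowed_query(date(2020, 8, 4), window_days=5))
```

Only the run date changes between scheduled runs, so the same pipeline definition can be reused for every run.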

@1025KB
Collaborator

1025KB commented Aug 6, 2020

Just FYI, we are working on supporting a date spec for the ExampleGen input config.

This is for automatically picking up the latest date; we will support specifying a certain span later.

We will add a user guide here.
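
To illustrate the "pick up the latest date" behaviour described in this comment, here is a rough stand-alone sketch. It is not TFX code: the `{YYYY}`, `{MM}`, `{DD}` placeholder syntax, the `latest_date` helper, and the example paths are assumptions made for illustration only.

```python
import re
from datetime import date
from typing import Iterable, Optional


def latest_date(paths: Iterable[str],
                pattern: str = "{YYYY}-{MM}-{DD}") -> Optional[date]:
    """Return the most recent date found in `paths` according to `pattern`."""
    # Turn the placeholder pattern into a regular expression with named groups.
    regex = re.escape(pattern)
    for spec, group in (("{YYYY}", r"(?P<y>\d{4})"),
                        ("{MM}", r"(?P<m>\d{2})"),
                        ("{DD}", r"(?P<d>\d{2})")):
        regex = regex.replace(re.escape(spec), group)
    compiled = re.compile(regex)

    dates = []
    for path in paths:
        match = compiled.search(path)
        if match:  # paths that do not match the pattern are ignored
            dates.append(date(int(match["y"]), int(match["m"]), int(match["d"])))
    return max(dates) if dates else None


if __name__ == "__main__":
    example_paths = ["data/2020-08-04/part-0.csv", "data/2020-08-06/part-0.csv"]
    print(latest_date(example_paths))
```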

@casassg
Member Author

casassg commented Aug 6, 2020

That's good to know. Looks like this can be used.

Is there any plan to support defining spans for Query based ExampleGen components? Asking as we are evaluating using BigQuery as our data source, but we may still need support for data partitioning.

Another question is whether this input span can be configured to cover a window of time (e.g. use the last 5 days of data), as well as different windows for different splits (e.g. use 5 days of data for training and the last day for the eval dataset).
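
As a sketch of the train/eval windowing asked about here, the helper below splits the days before a run date into a training window and an evaluation window. The name `train_eval_windows` is hypothetical, and the assumption that the two windows do not overlap (eval takes the most recent days, training the days just before them) is one possible reading of the request.

```python
from datetime import date, timedelta
from typing import List, Tuple


def train_eval_windows(run_date: date, train_days: int = 5,
                       eval_days: int = 1) -> Tuple[List[date], List[date]]:
    """Return (train_dates, eval_dates) for the days strictly before `run_date`."""
    total = train_days + eval_days
    # Oldest day first, most recent day (run_date - 1) last.
    days = [run_date - timedelta(days=offset) for offset in range(total, 0, -1)]
    return days[:train_days], days[train_days:]


if __name__ == "__main__":
    train, evaluation = train_eval_windows(date(2020, 8, 12))
    print("train:", [d.isoformat() for d in train])
    print("eval: ", [d.isoformat() for d in evaluation])
```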

@1025KB
Collaborator

1025KB commented Aug 6, 2020

We don't have a plan for query-based span support yet; a custom ExampleGen might be needed.

For windowing, we have a resolver that gets previously processed ExampleGen outputs to feed to the Trainer.

@casassg
Member Author

casassg commented Aug 6, 2020

> We don't have a plan for query-based span support yet; a custom ExampleGen might be needed.

Sounds like it could be generalized enough. Should we try to OSS it if we end up going that route?

> For windowing, we have a resolver that gets previously processed ExampleGen outputs to feed to the Trainer.

That seems a bit complicated to use. We would need to run ExampleGen N times before we can actually train at all.

@1025KB
Collaborator

1025KB commented Aug 6, 2020

> > We don't have a plan for query-based span support yet; a custom ExampleGen might be needed.
>
> Sounds like it could be generalized enough. Should we try to OSS it if we end up going that route?

We do have solutions for an internal query-based ExampleGen; the OSS query-based ExampleGen is still a work in progress.
May I ask about the use case? Does the table contain a date column, and do you want ExampleGen to process it daily?

> > For windowing, we have a resolver that gets previously processed ExampleGen outputs to feed to the Trainer.
>
> That seems a bit complicated to use. We would need to run ExampleGen N times before we can actually train at all.

This is for a rolling window, e.g., train on ExampleGen's outputs [1, 2, 3], then [2, 3, 4], then [3, 4, 5].
If what you need is [1, 2, 3], [4, 5, 6], [7, 8, 9], you can treat [1, 2, 3] as a single input unit for ExampleGen, so you only need to run ExampleGen once and train on its output (which contains the data for [x, x+1, x+2]).

We are also working on advanced async execution, which allows components to run at different paces (so it can be 3 ExampleGen runs + 1 Trainer run).
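
The two windowing schemes contrasted in this comment can be sketched in plain Python. The function names `rolling_windows` and `tumbling_windows` are illustrative only and are not part of TFX; spans are represented as plain integers.

```python
from typing import List


def rolling_windows(spans: List[int], size: int) -> List[List[int]]:
    """Overlapping windows that advance one span at a time."""
    return [spans[i:i + size] for i in range(len(spans) - size + 1)]


def tumbling_windows(spans: List[int], size: int) -> List[List[int]]:
    """Non-overlapping windows; each could be treated as a single input unit."""
    return [spans[i:i + size] for i in range(0, len(spans) - size + 1, size)]


if __name__ == "__main__":
    print(rolling_windows([1, 2, 3, 4, 5], 3))
    # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
    print(tumbling_windows([1, 2, 3, 4, 5, 6, 7, 8, 9], 3))
    # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

With rolling windows each span is reused across several training runs, which is why previously processed outputs have to be looked up again; with tumbling windows every span is consumed exactly once.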

@casassg
Member Author

casassg commented Aug 6, 2020

> May I ask about the use case? Does the table contain a date column, and do you want ExampleGen to process it daily?

Precisely. Our tables have a date column, and we are currently setting the time span we want manually. The issue is that, to productionize the pipeline, we need a way to partition the data and roll windows over it as input to the model.

> This is for a rolling window, e.g., train on ExampleGen's outputs [1, 2, 3], then [2, 3, 4], then [3, 4, 5].

Yes, this covers most of our cases, I believe.

> If what you need is [1, 2, 3], [4, 5, 6], [7, 8, 9], you can treat [1, 2, 3] as a single input unit for ExampleGen, so you only need to run ExampleGen once and train on its output (which contains the data for [x, x+1, x+2]).

Is the idea to support selecting the X latest spans in the current implementation? From the code, it seems it only collects the latest date span...

@1025KB
Collaborator

1025KB commented Aug 10, 2020

> > May I ask about the use case? Does the table contain a date column, and do you want ExampleGen to process it daily?
>
> Precisely. Our tables have a date column, and we are currently setting the time span we want manually. The issue is that, to productionize the pipeline, we need a way to partition the data and roll windows over it as input to the model.

> > This is for a rolling window, e.g., train on ExampleGen's outputs [1, 2, 3], then [2, 3, 4], then [3, 4, 5].
>
> Yes, this covers most of our cases, I believe.

Then each pipeline run (other than the first several runs) will produce a new span and get a rolling window of spans (via the resolver) for training, so ExampleGen and Trainer execute the same number of times.

> > If what you need is [1, 2, 3], [4, 5, 6], [7, 8, 9], you can treat [1, 2, 3] as a single input unit for ExampleGen, so you only need to run ExampleGen once and train on its output (which contains the data for [x, x+1, x+2]).
>
> Is the idea to support selecting the X latest spans in the current implementation? From the code, it seems it only collects the latest date span...

Check the resolver in this PR; it can resolve multiple inputs.

@casassg
Member Author

casassg commented Aug 12, 2020

I see. Will this resolver then fail for the first N-1 runs if the pipeline is running in sync mode?

@1025KB
Collaborator

1025KB commented Aug 12, 2020

It depends on the resolver's implementation. The latestArtifactResolver won't fail: it will try to get N artifacts and, if fewer are available, return whatever it has resolved.

You can implement your own resolver if the current resolvers don't fit your needs.
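
The "return whatever is available" behaviour described in this comment can be sketched as follows. This is not the TFX resolver itself: `resolve_latest` is a hypothetical stand-in that works on plain span numbers and only mimics the semantics of asking for N artifacts and accepting fewer.

```python
from typing import List


def resolve_latest(available_spans: List[int], n: int) -> List[int]:
    """Return up to `n` of the most recent spans, or fewer if not enough exist yet."""
    if n <= 0:
        return []
    return sorted(available_spans)[-n:]


if __name__ == "__main__":
    # Simulate successive pipeline runs, each producing one new span.
    produced: List[int] = []
    for span in range(1, 5):
        produced.append(span)
        print(f"run {span}: train on {resolve_latest(produced, n=3)}")
    # run 1: train on [1]
    # run 2: train on [1, 2]
    # run 3: train on [1, 2, 3]
    # run 4: train on [2, 3, 4]
```

Under these semantics the first N-1 runs do not fail; they simply train on a shorter window than later runs.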

@casassg
Member Author

casassg commented Aug 12, 2020

Mmm, I see. Once conditional execution is available, this may be easier to achieve.

@1025KB
Collaborator

1025KB commented Aug 12, 2020

Conditional support is on our roadmap, stay tuned.

@casassg
Member Author

casassg commented Aug 12, 2020

If we could find a way to support conditions based on the resolver config, that would be great 🎉

singhniraj08 self-assigned this May 24, 2023
@singhniraj08
Contributor

@casassg,

Are you still looking for a resolution? We are planning to prioritise issues based on community interest. Please let us know if this issue still persists with the latest TFX 1.13 release so that we can work on fixing it. Thank you for your contributions.

@github-actions
Contributor

github-actions bot commented Jun 1, 2023

This issue has been marked stale because it has had no activity for 7 days. It will be closed if no further activity occurs. Thank you.

github-actions bot added the stale label Jun 1, 2023
@github-actions
Contributor

github-actions bot commented Jun 9, 2023

This issue was closed due to lack of activity after being marked stale for the past 7 days.

github-actions bot closed this as completed Jun 9, 2023