Parametrize data input according to scheduled run date #2275
Comments
Just FYI, we are working on date spec support for the ExampleGen input config. This is for automatically picking up the latest date; we will support specifying a particular span later and will add a user guide here.
That's good to know. It looks like this can be used. Is there any plan to support defining spans for query-based ExampleGen components? I ask because we are evaluating BigQuery as our data source, but we may still need support for data partitioning. Another question is whether this input span can be configured with windows of time (e.g., use the last 5 days of data), as well as with different windows per split (use 5 days of data for training and the last day for the eval dataset).
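To make the windowing question concrete, here is a minimal sketch of the date arithmetic involved. It is plain Python, not a TFX API: the function name `train_eval_windows` and its parameters are hypothetical, and it assumes the windows do not overlap and end on the last complete day before the run date.

```python
from datetime import date, timedelta


def train_eval_windows(run_date, train_days=5, eval_days=1):
    """Return ((train_start, train_end), (eval_start, eval_end)).

    Hypothetical helper, not part of TFX. All bounds are inclusive dates.
    Assumption: eval covers the most recent complete days before run_date,
    and training covers the days immediately before the eval window.
    """
    eval_end = run_date - timedelta(days=1)
    eval_start = eval_end - timedelta(days=eval_days - 1)
    train_end = eval_start - timedelta(days=1)
    train_start = train_end - timedelta(days=train_days - 1)
    return (train_start, train_end), (eval_start, eval_end)


if __name__ == "__main__":
    train, evaluation = train_eval_windows(date(2020, 8, 10))
    print("train:", train[0].isoformat(), "to", train[1].isoformat())
    print("eval: ", evaluation[0].isoformat(), "to", evaluation[1].isoformat())
```

The resulting bounds could then be substituted into a query or a file pattern, whichever the data source needs.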
We don't have a plan for query-based span support yet; a custom ExampleGen might be needed. For windowing, we have a resolver that gets the outputs of previously processed ExampleGen runs to feed to the Trainer.
Sounds like it could be generalized enough. Should we try to OSS it if we end up going that route?
That seems a bit complicated to use. We would need to run the ExampleGen N times before we can actually train at all.
We do have solutions for an internal query-based ExampleGen; the OSS query-based ExampleGen is still WIP.
This is for a rolling window, e.g., train on ExampleGen outputs [1, 2, 3], then [2, 3, 4], then [3, 4, 5]. We are also working on advanced async execution, which allows components to run at different paces (thus it can be 3 ExampleGen runs + 1 Trainer run).
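The rolling-window pattern described above can be sketched in a few lines of plain Python. This only illustrates which spans each training run would see; `rolling_windows` is a hypothetical helper, not a TFX resolver.

```python
def rolling_windows(spans, size):
    """Return every run of `size` consecutive spans, oldest window first.

    Hypothetical helper for illustration only. If fewer than `size` spans
    exist, no complete window is available and the result is empty.
    """
    return [spans[i:i + size] for i in range(len(spans) - size + 1)]


if __name__ == "__main__":
    # Five ExampleGen outputs, window of three.
    print(rolling_windows([1, 2, 3, 4, 5], 3))  # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```

Note that with a window of three, the first complete window only exists once three spans have been produced.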
Precisely. Our tables have a date column, and we currently set the time span we want manually. The issue is that, to productionize the pipeline, we want a way to partition the data and roll windows over it as input to the model.
Yes, this is most of our cases I believe.
Is the idea to support selecting the X latest spans in the current implementation? From the code, it seems it only collects the latest date span...
Check the resolver in this PR; it can resolve multiple inputs.
I see. Will this resolver then fail for the first N-1 runs if the pipeline is running in sync mode?
It depends on the resolver's implementation. The latestArtifactResolver won't fail: it will try to get N artifacts and, if it can't, return whatever it has resolved. You can implement your own resolver if the current resolvers don't fit your needs.
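As an illustration of the "return whatever is resolved" behavior described above, here is a minimal sketch in plain Python. It does not use the TFX resolver API; `resolve_latest` and the `(span, uri)` tuples are assumptions made for the example.

```python
def resolve_latest(artifacts, n):
    """Return up to `n` artifacts with the highest span, newest first.

    Hypothetical stand-in for a "latest N" resolver. `artifacts` is a list
    of (span, uri) tuples. When fewer than `n` exist, it returns all of
    them instead of failing.
    """
    newest_first = sorted(artifacts, key=lambda artifact: artifact[0], reverse=True)
    return newest_first[:n]


if __name__ == "__main__":
    # Only two ExampleGen outputs exist so far, but three are requested.
    available = [(1, "examples/span-1"), (2, "examples/span-2")]
    print(resolve_latest(available, 3))
```

Under this behavior the first N-1 runs would train on a shorter window rather than fail, which may or may not be acceptable for a given model.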
Mmm, I see. Once there is conditional execution, this may be easier to achieve.
Condition support is on our roadmap; stay tuned.
If we can find a way to support conditions based on resolver config, that would be great 🎉
Are you still looking for a resolution? We are planning on prioritising the issues based on the community interests. Please let us know if this issue still persists with the latest TFX 1.13 release so that we can work on fixing it. Thank you for your contributions.
This issue has been marked stale because it has no recent activity since 7 days. It will be closed if no further activity occurs. Thank you.
This issue was closed due to lack of activity after being marked stale for past 7 days.
As an engineer, I would like to window the input data into my pipeline according to my run date. Airflow allows you to parametrize execution according to the DAG execution date (see: https://airflow.apache.org/docs/stable/concepts.html#execution-date), and we would need something similar in TFX.
One way to implement this would be to have a RuntimeParameter that maps to execution_date.
I'm wondering what options we have to solve this at the moment, and whether there is any way we can contribute to make it easier.
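To illustrate what a parameter mapped to the execution date might look like in practice, here is a rough sketch that renders a date-filtered query from a run date. Everything in it is an assumption for illustration: the table name, the `event_date` column, and the `render_query` helper are hypothetical, and it does not use TFX's `RuntimeParameter` or any Airflow API.

```python
from datetime import date

# Hypothetical table and column names, for illustration only.
QUERY_TEMPLATE = (
    "SELECT * FROM `my_project.my_dataset.events` "
    "WHERE event_date = DATE '{execution_date}'"
)


def render_query(execution_date):
    """Fill the query template with the run's execution date (YYYY-MM-DD)."""
    return QUERY_TEMPLATE.format(execution_date=execution_date.isoformat())


if __name__ == "__main__":
    print(render_query(date(2020, 8, 10)))
```

In a scheduled setup, the orchestrator would supply the date at run time instead of it being hard-coded in the pipeline definition.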