ExampleGen should support custom time series splits #352
Comments
@Gowtham-kp I've been looking at that for the last hour, and I even went down to the base_example_gen_executor level to try to understand what was happening. It's still non-intuitive to me. In my use case, every example has a field called 'timestamp' that will be used as the basis for partitioning. For example, I have a function that says everything in 2017 should be train and everything in 2018 should be eval. I have not yet grasped how to assign the hash_buckets in such a way that this works out. I've gotten as far as understanding that there must be a mechanism that maps a given example to a hash bucket, but I have not figured out how. Am I correct in what I'm thinking, or completely off the mark? Some guidance would be much appreciated.
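For readers following along, here is a rough sketch of how hash-bucket splitting generally works: the serialized example is hashed, the hash is reduced modulo the total number of buckets, and the resulting bucket index is mapped to a split through cumulative bucket boundaries. This is an illustration under assumptions, not a copy of the TFX executor; `hash_partition` is a hypothetical name. It also shows why hash_buckets alone cannot express a time-based split: the split depends only on the hash of the record, never on a field such as 'timestamp'.

```python
import bisect
import hashlib
from itertools import accumulate
from typing import Sequence


def hash_partition(record: bytes, hash_buckets: Sequence[int]) -> int:
    """Return the index of the split that a serialized record falls into.

    hash_buckets holds the number of buckets per split, e.g. [2, 1] for a
    roughly 2:1 train/eval ratio. (Hypothetical helper, for illustration.)
    """
    # Turn per-split bucket counts into cumulative boundaries, e.g. [2, 3].
    boundaries = list(accumulate(hash_buckets))
    # Hash the record and reduce it to a bucket in [0, total_buckets).
    bucket = int(hashlib.sha256(record).hexdigest(), 16) % boundaries[-1]
    # The first boundary strictly greater than the bucket selects the split.
    return bisect.bisect(boundaries, bucket)


if __name__ == "__main__":
    records = [f"row-{i}".encode() for i in range(9)]
    print([hash_partition(r, [2, 1]) for r in records])
```

The same record always lands in the same split, and the ratio of examples per split approaches the ratio of bucket counts as the dataset grows.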
Hi Hazma. Have you taken a look yet at the second example in https://www.tensorflow.org/tfx/guide/examplegen#custom_inputoutput_split? It shows how to take pre-split train/eval data and configure the ExampleGen component to receive each split. That assumes your data has been split prior to calling ExampleGen. Another option would be to create a custom ExampleGen, as shown at https://www.tensorflow.org/tfx/guide/examplegen#custom_examplegen. In this case, you'd build your own ExampleGen and write your executor to perform the shuffle yourself. Hope that helps!
Thanks for the reply @krazyhaas! Pre-splitting is definitely an option, but I would rather not write Beam code just to prepare the data for the main TFX pipeline. Still, it certainly is an option! As for writing a custom ExampleGen, would that mean overriding both GetInputSourceToExamplePTransform and GenerateExamplesByBeam in BaseExampleGenExecutor? Or would I need to do something with FileBasedExampleGen? To be honest, I am generally confused about the difference between the 'subclasses' like FileBasedExampleGen and BaseExampleGenExecutor. Can you clarify?
I'm not sure what your input dataset looks like, so I'll assume CSV solely for the purpose of easier explanation. This workaround is a bit cumbersome right now but should suffice until we support custom time series splits. First, create a new component that is a clone of the CsvExampleGen component. This is the "more cumbersome than it needs to be" part. Next, create a new partitioning function similar to the current one, but partition based on year as opposed to a hash. The partition number is used to create the train/eval datasets here. At this point, your pipeline should be able to use the new component to partition the data based on time. Let me know if this unblocks you. We've yet to release custom partitioning, but we do plan to make the partitioning function much more configurable.
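A minimal sketch of what such a year-based partitioning function could look like, under several assumptions: each record is already parsed into a dict, its 'timestamp' field holds Unix epoch seconds, and partition 0 is train while partition 1 is eval. The name `partition_by_year` and the `eval_start_year` parameter are hypothetical and not part of TFX.

```python
from datetime import datetime, timezone

# Assumed ordering of the output splits: index 0 is train, index 1 is eval.
SPLIT_NAMES = ["train", "eval"]


def partition_by_year(record: dict, num_partitions: int,
                      eval_start_year: int = 2018) -> int:
    """Return 0 (train) for records before eval_start_year, else 1 (eval).

    Assumes record["timestamp"] is Unix epoch seconds (hypothetical schema).
    """
    if num_partitions != len(SPLIT_NAMES):
        raise ValueError("this sketch only supports a train/eval split")
    # Interpret the timestamp in UTC so the result does not depend on the
    # local time zone of the worker.
    year = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc).year
    return 1 if year >= eval_start_year else 0


if __name__ == "__main__":
    ts_2017 = datetime(2017, 6, 1, tzinfo=timezone.utc).timestamp()
    ts_2018 = datetime(2018, 6, 1, tzinfo=timezone.utc).timestamp()
    for ts in (ts_2017, ts_2018):
        print(SPLIT_NAMES[partition_by_year({"timestamp": ts}, 2)])
```

The `(element, num_partitions, *args)` signature matches what Apache Beam's `beam.Partition` transform expects from a partition function, so a function shaped like this could stand in for the hash-based one in a cloned executor. Because each record is routed on its own timestamp, the input does not need to be sorted by time.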
@krazyhaas Can we also do many splits at the moment? For example, if I have many time streams in my data (let's say it's IoT data, and I have many different assets with different time streams all flowing in from the CSV), can I extend the notion of splits and send many splits downstream (not just train and eval)? Would the rest of the components work with that, or would I have to rewrite those a bit as well?
Revisiting this issue, I suspect a better way to model this problem is to use spans and put examples from the same timestamp partition into the same span. Please see https://github.com/tensorflow/tfx/blob/46bb4f975c36ea1defde4b3c33553e088b3dc5b8/docs/guide/examplegen.md#span for what we have released so far for spans; @ruoyu90 can assist you with our next steps. I don't feel we have enough information to really answer this, because it depends on the data and the underlying problem. Are you still blocked on this one, @htahir1?
@zhitaoli I solved it using the approach suggested by @krazyhaas, with a custom partition function. Spans are also a good way to solve it, but I have not gotten into that yet. If it helps the TFX team's internal processes, I'll go ahead and close the issue. Thanks for your support!
Is there a way to specify a split for the ExampleGen component that splits across a 'timestamp' field? This cannot work currently as the raw data might not be sorted according to time.