Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Time-series joins for ksql #6350

Open
ChristophSchranz opened this issue Oct 2, 2020 · 0 comments
Open

Time-series joins for ksql #6350

ChristophSchranz opened this issue Oct 2, 2020 · 0 comments

Comments

@ChristophSchranz
Copy link

A variety of real-time data streaming problems rely on performing efficient Time-Series Joins (TSJ) of two streams. Merging two data streams or filtering one data stream based on a value of another requires the join of their time series in advance.
The time-serieses have to be joined in a way, that each record within one time-series is matched with its previous and subsequent complement from the other time-series (see the Figure). This property should pertain independent of the record's ingestion time (the records should be ingested in order within a stream but not across them, as it is guaranteed by kafka).

localstreambuffer_joins

This feature request is a special and common example of a non-equi stream-stream join, but is not related to #4424 or KIP-213. I have not found a ticket for a non-equi join in github or wiki, please inform me if I'm wrong.
It can be seen as an non-equi jon variant of KIP-260 where the key is the ingestion time (must be in order to work). In KIP-260 a method is searched for a faster join, using my algorithm one can match each record from one stream with the first successor and the latest predeccor of the other stream.

Describe the solution you'd like
There should be the possibility to join two asynchronious time-series deterministically. That means, the resulting join stream must be the unambiguous, regardless of the input streams' latency! Additionally, the processing latency must be minimal and a high-throughput should be possible. The exact

Describe alternatives you've considered
I've had two problems that were both based on time-series joins. The core stream-join framework is implemented and described in StatefulStreamProcessor. There are multiple tests for it and the implementation is with exactly-once processing (thanks to Kafka)! This algorithm joins them with linear complexity for equally late streams and increases linearly with the latency shift. A join rate of around 15000 joins per seconds were achieved using this algorithm, a join in Apache Flink only 5000 with much more latency due to windowing.

I wrote a work-in-progress paper on this join type, which was presented some weeks ago at the IEEE ETFA conference. Please find the preprint of it in the attachment (note that the description of the implementation is not perfectly accurate anymore, as there were some improvements this summer).

DeterministicTimeSeriesJoinsForDataStreams_final.pdf

Have you had issues that require this type of non-equi stream-stream joins? I would be interested if you are thinking about similar algorithms.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants