
Support boundless streams of inputs/outputs #68

Open
YPares opened this issue Oct 16, 2019 · 2 comments
Labels
enhancement New feature or request

Comments

@YPares (Owner) commented Oct 16, 2019

For now, VirtualFiles can either be unique or repeated (and indexed, in which case we read them as a stream), but we cannot really manipulate unbounded streams of data, where the concept of an index has no meaning because we have no control over the order in which data arrives.

To summarize my thinking: external data can exist in three repetition modes:

  • Statically-indexed: the number of occurrences and their paths are known in advance. In porcupine you would handle that with a VirtualFile that appears several times in your VirtualTree, each time under a different virtual path (e.g. with ptaskInSubtree), or with layers if the data read from these files is a Semigroup.
  • Dynamically-indexed: the number of occurrences and their paths are only known when the program runs (e.g. because we compute these paths from a list of indices obtained either from CLI options or from another file). In porcupine you would handle that either with layers (if the data is a Semigroup), which puts no constraint on these files' paths, or with repeated virtual files (loadDataStream, parMapTask or FoldA), which puts no constraint on your data but does put one on the files' paths (they now have to be identical up to some index).
  • Unindexed: the number of occurrences or their indices cannot be known at all, possibly because they don't even have an index, e.g. if we read an unbounded stream of data (and therefore might need to generate an unbounded stream of outputs as a result). We have to keep reading until the data runs out, and we cannot know in advance when that will be. Currently you cannot handle that case in porcupine (see the sketch after this list).
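As an illustration of that third mode, here is a minimal sketch using the streaming package, deliberately independent of porcupine's API: the program consumes lines from stdin until the producer stops sending, never knowing the total count in advance, and emits outputs as it goes. The "processed: " transformation is just a placeholder for a real per-item task.

```haskell
module UnboundedSketch where

import qualified Streaming.Prelude as S

main :: IO ()
main =
  -- stdinLn yields lines until EOF; we never know in advance how many
  -- there will be, so we simply consume until the stream ends.
  S.stdoutLn                      -- emit each result as soon as it is ready
    . S.map ("processed: " ++)    -- placeholder per-item processing step
    $ S.stdinLn
```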

For the communication specifics, we could rely on existing standards, like (no surprise)... Apache Arrow! See https://arrow.apache.org/docs/format/Flight.html (based on gRPC).
But ideally we'd like to support various backends (starting an HTTP server to receive the stream, thrift/avro streams, etc.), so that would probably mean adding a StreamAccessor next to LocationAccessor.
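To make that a bit more concrete, here is a purely hypothetical sketch of what such a StreamAccessor class could look like; none of these names exist in porcupine today, they are assumptions for illustration only. It mirrors the role LocationAccessor plays for file-like locations, but hands out unbounded streams of raw chunks instead of whole payloads, so that each backend (Arrow Flight, an HTTP server, thrift/avro streams, ...) could provide its own instance.

```haskell
{-# LANGUAGE TypeFamilies #-}
module StreamAccessorSketch where

import qualified Data.ByteString as BS
import           Streaming (Of, Stream)

-- Hypothetical class, analogous to LocationAccessor but for unbounded
-- streams rather than discrete files.
class Monad m => StreamAccessor m where
  -- Backend-specific way to name a stream (a Flight descriptor, a URL, ...).
  type StreamEndpoint m
  -- Pull raw chunks until the producer closes the stream.
  readChunks  :: StreamEndpoint m -> Stream (Of BS.ByteString) m ()
  -- Push a possibly unbounded stream of chunks to the endpoint.
  writeChunks :: StreamEndpoint m -> Stream (Of BS.ByteString) m r -> m r
```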

@YPares added the enhancement label Oct 16, 2019
@YPares (Owner, Author) commented Oct 30, 2019

We should probably also have a look at what Hailstorm and Streamly propose in that regard.

@mgajda commented Oct 30, 2019

Many people use boundless streams not just because their data is boundless, but because they think incremental algorithms are faster. It would also be nice to allow incremental updating/adding of data.
