You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
For now, VirtualFiles can either be unique or repeated (and indexed, in which case each we read it as a stream), but we cannot really manipulate unbounded streams of data, where the concept of index has no meaning, because you have no control over the order in which data arrives.
To summarize my thinking, I think external data can exist in three repetition modes:
Statically-indexed: the number of occurences and their paths are known in advance. In porcupine you would handle that with a VirtualFile which would appear several times in your VirtualTree with a different virtual path (e.g. with ptaskInSubtree), or with layers if the data read from these files is a Semigroup.
Dynamically-indexed: the number of occurences and their paths is known at the execution of the program (eg. because we compute these paths from a list of indices obtained either from CLI options or from another file). In porcupine you would handle that either with layers (if data is Semigroup), which doesn't put constraints on these files' paths, or with repeated virtual files (loadDataStream, parMapTask or FoldA), which doesn't put any constraint on your data but puts one on the files' path (which now have to be the same up to some index).
Unindexed: the number of occurences or their indices cannot be known at all, possibly because they don't even have any index, e.g if we read an unbounded stream of data (and therefore might need to generate an unbounded stream of outputs as a result). We need to read the data until we have no more data, and we can't know when that will be in advance. Currently you cannot handle that case in porcupine.
For the communication specifics, we could resort on existing standards, like (no surprises) ... Apache Arrow! See https://arrow.apache.org/docs/format/Flight.html (based on gRPC).
But ideally we'd like to support various backends (start an HTTP server to receive the stream, thrift/avro streams, etc). So possibly that'd mean adding a StreamAccessor next to LocationAccessor.
The text was updated successfully, but these errors were encountered:
Many people use boundless streams not just because data is boundless, but because they think incremental algo is faster. And it would be nice to allow incremental update/adding of data.
For now, VirtualFiles can either be unique or repeated (and indexed, in which case each we read it as a stream), but we cannot really manipulate unbounded streams of data, where the concept of index has no meaning, because you have no control over the order in which data arrives.
To summarize my thinking, I think external data can exist in three repetition modes:
ptaskInSubtree
), or with layers if the data read from these files is a Semigroup.For the communication specifics, we could resort on existing standards, like (no surprises) ... Apache Arrow! See https://arrow.apache.org/docs/format/Flight.html (based on gRPC).
But ideally we'd like to support various backends (start an HTTP server to receive the stream, thrift/avro streams, etc). So possibly that'd mean adding a
StreamAccessor
next toLocationAccessor
.The text was updated successfully, but these errors were encountered: