
Support boundless streams of inputs/outputs #68

Open
YPares opened this issue Oct 16, 2019 · 2 comments
Labels
enhancement New feature or request

Comments

@YPares (Owner) commented Oct 16, 2019

For now, VirtualFiles can either be unique or repeated (and indexed, in which case we read them as a stream), but we cannot really manipulate unbounded streams of data, where the concept of an index has no meaning because we have no control over the order in which data arrives.

To summarize my thinking: external data can exist in three repetition modes:

  • Statically-indexed: the number of occurrences and their paths are known in advance. In porcupine you would handle that with a VirtualFile that appears several times in your VirtualTree, each time under a different virtual path (e.g. with ptaskInSubtree), or with layers if the data read from these files is a Semigroup.
  • Dynamically-indexed: the number of occurrences and their paths are only known when the program runs (e.g. because we compute these paths from a list of indices obtained either from CLI options or from another file). In porcupine you would handle that either with layers (if the data is a Semigroup), which puts no constraint on these files' paths, or with repeated virtual files (loadDataStream, parMapTask or FoldA), which puts no constraint on your data but does put one on the files' paths (they now have to be identical up to some index).
  • Unindexed: the number of occurrences or their indices cannot be known at all, possibly because they don't even have an index, e.g. if we read an unbounded stream of data (and therefore might need to generate an unbounded stream of outputs as a result). We have to keep reading until the data runs out, and we cannot know in advance when that will be. Currently you cannot handle that case in porcupine (see the sketch after this list).
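As an illustration of that third mode, here is a minimal sketch using the streaming package, deliberately independent of porcupine's API: the program consumes lines from stdin until the producer stops sending, never knowing the total count in advance, and emits outputs as it goes. The "processed: " transformation is just a placeholder for a real per-item task.

```haskell
module UnboundedSketch where

import qualified Streaming.Prelude as S

main :: IO ()
main =
  -- stdinLn yields lines until EOF; we never know in advance how many
  -- there will be, so we simply consume until the stream ends.
  S.stdoutLn                      -- emit each result as soon as it is ready
    . S.map ("processed: " ++)    -- placeholder per-item processing step
    $ S.stdinLn
```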

For the communication specifics, we could rely on existing standards, like (no surprise)... Apache Arrow! See https://arrow.apache.org/docs/format/Flight.html (based on gRPC).
But ideally we'd like to support various backends (starting an HTTP server to receive the stream, thrift/avro streams, etc.), so that would probably mean adding a StreamAccessor next to LocationAccessor.
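To make that a bit more concrete, here is a purely hypothetical sketch of what such a StreamAccessor class could look like; none of these names exist in porcupine today, they are assumptions for illustration only. It mirrors the role LocationAccessor plays for file-like locations, but hands out unbounded streams of raw chunks instead of whole payloads, so that each backend (Arrow Flight, an HTTP server, thrift/avro streams, ...) could provide its own instance.

```haskell
{-# LANGUAGE TypeFamilies #-}
module StreamAccessorSketch where

import qualified Data.ByteString as BS
import           Streaming (Of, Stream)

-- Hypothetical class, analogous to LocationAccessor but for unbounded
-- streams rather than discrete files.
class Monad m => StreamAccessor m where
  -- Backend-specific way to name a stream (a Flight descriptor, a URL, ...).
  type StreamEndpoint m
  -- Pull raw chunks until the producer closes the stream.
  readChunks  :: StreamEndpoint m -> Stream (Of BS.ByteString) m ()
  -- Push a possibly unbounded stream of chunks to the endpoint.
  writeChunks :: StreamEndpoint m -> Stream (Of BS.ByteString) m r -> m r
```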

@YPares added the enhancement label Oct 16, 2019
@YPares (Owner, Author) commented Oct 30, 2019

We should probably also have a look at what Hailstorm and Streamly propose in that regard.

@mgajda commented Oct 30, 2019

Many people use boundless streams not just because their data is boundless, but because they think incremental algorithms are faster. It would also be nice to allow incremental updating/adding of data.
