For now, the repetition of `VirtualFile`s is done at the task level. This is a problem for two reasons:
It is not possible to take e.g. a `PureDeserial A` and generalize it to a `PureDeserial (Stream (Of (Key, A)) m ())` (to read several `A`s from the same source, which is necessary, or at least very convenient, if we want to use `frames` or `cassava`).
It is not possible to abstract out the way the `A`s are laid out in the end in files. For instance, if each `A` is in a different JSON file, the code of the pipeline won't be exactly the same as if all the `A`s are one-line JSON documents in the same file. `porcupine` was initially envisioned to support that use case, and ideally only the pipeline config file should have to change between both cases.
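To make point 1 concrete, here is a type-level sketch of the generalization in question (`Key` is a hypothetical repetition-key type; none of this is porcupine's actual API):

```haskell
import Streaming (Stream, Of)

-- What we have today: deserializing one VirtualFile yields one value.
--   readA  :: PureDeserial A
--
-- What we would like: deserializing the same VirtualFile yields a
-- stream of keyed values, consumable incrementally by frames/cassava:
--   readAs :: PureDeserial (Stream (Of (Key, A)) m ())
--
-- The blocker: the stream's base monad m leaks into the serial's type,
-- while serials are meant to be oblivious of the monad they run in.
```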
Point 1. is difficult to address because currently the monad type `m` is hidden inside the serialization functions, and using a serial of `Stream` means exposing it. The problem is that serialization functions are supposed to be oblivious of the `LocationMonad` they're running in (they shouldn't care whether they read local or remote files).
Point 2. requires some thinking as to how we should handle the repetition keys. In the case of "each bit of data is in a different file", it's simple. The repetition key suffixes the file name, et voilà, we don't care in which order the keys are read. It's not the same if every bit of data is just a line from a file, because now (unless we want to hold the whole file in memory) we have to make some assumptions about the order in which the keys appear in the file, or whether the keys are present at all. Plus, the way keys are laid out is serialization-dependent (for JSON one-liners, for instance, each line can contain the key as a field).
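To illustrate, the same records under the two layouts might look like this (hypothetical file names and fields):

```
# Layout A: one file per repetition key (key suffixes the file name);
# read order doesn't matter:
data-0.json:   {"x": 1}
data-1.json:   {"x": 2}

# Layout B: one-line JSON documents in a single file; each line must
# carry its key as a field, and key order/presence becomes an assumption:
data.json:     {"key": 0, "x": 1}
               {"key": 1, "x": 2}
```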
For now, the "simplest" way I can think of is to intermingle serials and tasks even more. Currently, serialization functions are plain functions. What if they were `PTask`s? Then any serialization function could internally use any tool already available at the task level, for instance reusing sub-`VirtualFile`s. A `SerialsFor` would then work at the level of a subtree of the resource tree, and could access anything it wants in there. The problem is that we would end up with a somewhat "non-fixed" resource tree, because the subtrees would depend on which serialization functions are actually chosen in the end, and that would make pipeline parameterization more complicated for the end user.
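As a very rough sketch of that direction, a serialization function recast as a task could have a shape like the following (the `PTask` arguments here are speculative, not porcupine's actual types):

```haskell
import qualified Data.ByteString.Lazy as LBS

-- Today, a deserialization function is a plain function:
--   deserialize :: LBS.ByteString -> Either String A
--
-- As a PTask, it could itself read sub-VirtualFiles of the subtree
-- it is mounted on (e.g. a schema file next to the data):
--   deserializeTask :: PTask m LBS.ByteString A
--
-- The price: the resource tree below a SerialsFor now depends on
-- which serialization function the config ends up selecting.
```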
After thinking some more about this issue, it actually turns out that it is perfectly possible to create a `PureDeserial (Stream (Of (Key, A)) IO ())`. Indeed, we don't need to carry the `m` around: since our `m` is always an instance of `MonadUnliftIO`, we can always strip it out and replace it with `IO`.
So it should be possible to wrap and transform serials.
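The stripping alluded to above can be done with `withRunInIO` from `MonadUnliftIO`; a minimal sketch (not porcupine's actual code):

```haskell
import Control.Monad.IO.Unlift (MonadUnliftIO, withRunInIO)

-- Capture the current monadic context and return a plain IO action,
-- so a stream's base monad can be rebased onto IO instead of m:
stripToIO :: MonadUnliftIO m => m a -> m (IO a)
stripToIO act = withRunInIO $ \runInIO -> pure (runInIO act)
```

`unliftio` ships essentially this as `toIO`; the point is that since every `LocationMonad` is a `MonadUnliftIO`, a `Stream (Of (Key, A)) m ()` can always be turned into one over `IO` before being exposed by a serial.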