
Improve composability of serials #8

Open
YPares opened this issue Oct 27, 2018 · 2 comments
Labels
enhancement New feature or request

Comments


YPares commented Oct 27, 2018

For now, the repetition of VirtualFiles is done at the task level. This is a problem for two reasons:

  1. It is not possible to take e.g. a PureDeserial A and generalize it to a PureDeserial (Stream (Of (Key, A)) m ()) (to read several As from the same source, which is necessary, or at least very convenient, if we want to use frames or cassava).
  2. It is not possible to abstract over the way the As are ultimately laid out in files. For instance, if each A is in a different JSON file, the pipeline code won't be exactly the same as if all the As are one-line JSON documents in the same file. porcupine was initially envisioned to support that use case, and ideally only the pipeline config file should have to change between the two cases.
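To make point 2 concrete, here is a minimal sketch of the two layouts, using plain GHC with base only. Key, perFileLayout and oneFileLayout are hypothetical stand-ins invented for illustration, not porcupine API:

```haskell
-- Hypothetical sketch: the same keyed collection rendered under two layouts.
type Key = String

-- Layout 1: each A goes to its own file, keyed by a filename suffix.
perFileLayout :: Show a => FilePath -> [(Key, a)] -> [(FilePath, String)]
perFileLayout base kvs =
  [ (base ++ "-" ++ k ++ ".json", show v) | (k, v) <- kvs ]

-- Layout 2: all As become one-line documents in a single file;
-- the key must now be embedded in each line instead of the filename.
oneFileLayout :: Show a => FilePath -> [(Key, a)] -> [(FilePath, String)]
oneFileLayout path kvs =
  [ (path, unlines [ k ++ "\t" ++ show v | (k, v) <- kvs ]) ]
```

Ideally, switching a pipeline between these two layouts should only require editing the config file, not the pipeline code.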

Point 1. is difficult to address because the monad type m is currently hidden inside the serialization functions, and using a serial of Stream means exposing it. The problem is that serialization functions are supposed to be oblivious to the LocationMonad they run in (they shouldn't care whether they read local or remote files).
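A toy sketch of why the streaming result type leaks m (the types here are invented for illustration, and a plain Monad constraint stands in for LocationMonad):

```haskell
{-# LANGUAGE RankNTypes #-}

-- A deserializer whose result type doesn't mention m can keep m
-- universally quantified, i.e. hidden from the caller:
newtype Deserial a = Deserial (forall m. Monad m => String -> m a)

-- But a streaming result is itself a computation in m, so m leaks into
-- the result type and can no longer stay hidden:
newtype StreamingDeserial m a = StreamingDeserial (String -> m [a])

runDeserial :: Monad m => Deserial a -> String -> m a
runDeserial (Deserial f) = f
```

In the first form the caller picks m at each use site; in the second, m is baked into the serial's type, which is exactly what the serialization layer was trying to avoid.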

Point 2. requires some thinking about how we should handle the repetition keys. In the case where each bit of data is in a different file, it's simple: the repetition key suffixes the file name, et voilà, we don't care in which order the keys are read. It's not the same if every bit of data is just a line in a file, because now (unless we want to hold the whole file in memory) we have to make some assumptions about the order in which the keys appear in the file, or whether the keys are present at all. Plus, the way keys are laid out is serialization-dependent (for JSON one-liners, for instance, each line can contain the key as a field).
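The trade-off for the "all data in one file" case can be sketched as follows (hypothetical helpers, base and containers only; each input line is assumed to hold "key<TAB>value"):

```haskell
import qualified Data.Map.Strict as M

parseLine :: String -> (String, String)
parseLine l = let (k, v) = break (== '\t') l in (k, drop 1 v)

-- Option A: hold everything in memory, so key order doesn't matter:
readUnordered :: [String] -> M.Map String String
readUnordered = M.fromList . map parseLine

-- Option B: process line by line without accumulating, but then we must
-- assume (and here, check) that keys arrive in sorted order:
readAssumingSorted :: [String] -> Maybe [(String, String)]
readAssumingSorted lns =
  let kvs = map parseLine lns
      ks  = map fst kvs
  in if and (zipWith (<=) ks (drop 1 ks)) then Just kvs else Nothing
```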

For now, the "simplest" way I can think of is to intermingle serials and tasks even more. Serialization functions are currently plain functions: what if they were PTasks? Any serialization function could then internally use any tool already available at the task level, for instance reusing sub-VirtualFiles. A SerialsFor would then just work at the level of a subtree of the resource tree, and could access anything it wants in there. The problem is that we would end up with a somewhat "non-fixed" resource tree, because the subtrees would depend on which serialization functions are actually chosen in the end, and that would make pipeline parameterization more complicated for the end user.
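A rough type-level sketch of that "serials as tasks" idea (all names here are invented for illustration, not actual porcupine API — a real PTask is effect-tracked, not a bare Kleisli arrow):

```haskell
import Control.Monad ((>=>))

-- If a serialization function were itself a task, it would compose with
-- other tasks and could use task-level tooling such as sub-VirtualFiles.
newtype PTask' a b = PTask' { runPTask' :: a -> IO b }

type SerialTask a   = PTask' a String   -- a serializer, as a task
type DeserialTask a = PTask' String a   -- a deserializer, as a task

-- Tasks compose Kleisli-style, so serials would inherit composition:
composeT :: PTask' a b -> PTask' b c -> PTask' a c
composeT (PTask' f) (PTask' g) = PTask' (f >=> g)
```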

YPares added the enhancement label on Oct 27, 2018

YPares commented Oct 8, 2019

After thinking about this issue some more, it turns out that it is perfectly possible to create a PureDeserial (Stream (Of (Key, A)) IO ()). Indeed, we don't need to carry the m around: since our m is always an instance of MonadUnliftIO, we can always strip it out and replace it with IO.
So it should be possible to wrap and transform serials.
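A sketch of that trick, with a hand-rolled stand-in for MonadUnliftIO (the real class lives in the unliftio-core package, where this helper exists as toIO; porcupine's m is assumed to be an instance):

```haskell
{-# LANGUAGE RankNTypes #-}

class Monad m => MonadUnliftIO' m where
  withRunInIO' :: ((forall a. m a -> IO a) -> IO b) -> m b

instance MonadUnliftIO' IO where
  withRunInIO' inner = inner id

-- Capture the current monadic context once and get back a plain IO
-- action; this is what lets a Stream (Of (Key, A)) m () be rebased
-- onto IO instead of exposing m in the serial's type:
toIO' :: MonadUnliftIO' m => m a -> m (IO a)
toIO' act = withRunInIO' (\run -> return (run act))
```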


YPares commented Oct 8, 2019

#40 should be handled first, though. Then we should assess whether this issue is still relevant.
