For now, the repetition of `VirtualFile`s is done at the task level. This is a problem for two reasons:
It is not possible to take e.g. a `PureDeserial A` and generalize it to a `PureDeserial (Stream (Of (Key, A)) m ())` (to read several `A`s from the same source, which is necessary, or at least very convenient, if we want to use `frames` or `cassava`).
It is not possible to abstract out the way the `A`s are laid out in the end in files. For instance, if each `A` is in a different JSON file, the code of the pipeline won't be exactly the same as if all the `A`s are one-line JSON documents in the same file. `porcupine` was initially envisioned to support that use case, and ideally only the pipeline config file should have to change between both cases.
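To make point 1 concrete, here is a type-level sketch of the generalization in question (`Key` is a hypothetical repetition-key type; none of this is porcupine's actual API):

```haskell
import Streaming (Stream, Of)

-- What we have today: deserializing one VirtualFile yields one value.
--   readA  :: PureDeserial A
--
-- What we would like: deserializing the same VirtualFile yields a
-- stream of keyed values, consumable incrementally by frames/cassava:
--   readAs :: PureDeserial (Stream (Of (Key, A)) m ())
--
-- The blocker: the stream's base monad m leaks into the serial's type,
-- while serials are meant to be oblivious of the monad they run in.
```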
Point 1. is difficult to address because currently the monad type `m` is hidden inside the serialization functions, and using a serial of `Stream` means exposing it. The problem is that serialization functions are supposed to be oblivious of the `LocationMonad` they're running in (they shouldn't care whether they read local or remote files).
Point 2. requires some thinking as to how we should handle the repetition keys. In the case of "each bit of data is in a different file", it's simple. The repetition key suffixes the file name, et voilà, we don't care in which order the keys are read. It's not the same if every bit of data is just a line from a file, because now (unless we want to hold the whole file in memory) we have to make some assumptions about the order in which the keys appear in the file, or whether the keys are present at all. Plus, the way keys are laid out is serialization-dependent (for JSON one-liners, for instance, each line can contain the key as a field).
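To illustrate, the same records under the two layouts might look like this (hypothetical file names and fields):

```
# Layout A: one file per repetition key (key suffixes the file name);
# read order doesn't matter:
data-0.json:   {"x": 1}
data-1.json:   {"x": 2}

# Layout B: one-line JSON documents in a single file; each line must
# carry its key as a field, and key order/presence becomes an assumption:
data.json:     {"key": 0, "x": 1}
               {"key": 1, "x": 2}
```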
For now, the "simplest" way I can think of is to intermingle serials and tasks even more. Currently, serialization functions are plain functions. What if they were `PTask`s? Then any serialization function could internally use any tool already available at the task level, for instance reusing sub-`VirtualFile`s. A `SerialsFor` would then work at the level of a subtree of the resource tree, and could access anything it wants in there. The problem is that we would end up with a somewhat "non-fixed" resource tree, because the subtrees would depend on which serialization functions are actually chosen in the end, and that would make pipeline parameterization more complicated for the end user.
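As a very rough sketch of that direction, a serialization function recast as a task could have a shape like the following (the `PTask` arguments here are speculative, not porcupine's actual types):

```haskell
import qualified Data.ByteString.Lazy as LBS

-- Today, a deserialization function is a plain function:
--   deserialize :: LBS.ByteString -> Either String A
--
-- As a PTask, it could itself read sub-VirtualFiles of the subtree
-- it is mounted on (e.g. a schema file next to the data):
--   deserializeTask :: PTask m LBS.ByteString A
--
-- The price: the resource tree below a SerialsFor now depends on
-- which serialization function the config ends up selecting.
```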
After thinking some more about this issue, it actually turns out that it is perfectly possible to create a `PureDeserial (Stream (Of (Key, A)) IO ())`. Indeed, we don't need to carry the `m` around: since our `m` is always an instance of `MonadUnliftIO`, we can always strip it out and replace it with `IO`.
So it should be possible to wrap and transform serials.
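The stripping alluded to above can be done with `withRunInIO` from `MonadUnliftIO`; a minimal sketch (not porcupine's actual code):

```haskell
import Control.Monad.IO.Unlift (MonadUnliftIO, withRunInIO)

-- Capture the current monadic context and return a plain IO action,
-- so a stream's base monad can be rebased onto IO instead of m:
stripToIO :: MonadUnliftIO m => m a -> m (IO a)
stripToIO act = withRunInIO $ \runInIO -> pure (runInIO act)
```

`unliftio` ships essentially this as `toIO`; the point is that since every `LocationMonad` is a `MonadUnliftIO`, a `Stream (Of (Key, A)) m ()` can always be turned into one over `IO` before being exposed by a serial.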