-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor/transform: introduce transform and improve error handling #35
Conversation
Thanks for the PR! I also think a Also thanks for fixing #33 and the |
It seems crate |
@bminixhofer a first round of review would be appreciated 🦺 it could be simplified in a few areas, but that's smth for another day. |
Sure! But likely tomorrow 🙂
IMO it's fine to just use something else in the tests here. I just tried this and have the same issue (hangs indefinitely). But I don't think it is needed for this PR. It's good to have two different compression algorithms in the tests but I see no reason to use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, review done 🙂 Mostly just comments and small changes. The overall approach is good!
The most significant thing I would've done differently is not using trait
s for the transform
arguments, that adds quite a bit of boilerplate. I think we can also simplify the signature a bit.
build/src/lib.rs
Outdated
pub type Result<T> = std::result::Result<T, Error>; | ||
|
||
/// Definition of the data transformation for the network retrieved, binencoded rules and tokenizer datasets. | ||
pub trait TransformDataFn<I: Read>: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the rationale for having a trait and the generic I
and Cursor<&'r mut Vec<u8>>
here instead of e. g. type TransformDataFn = Box<dyn Fn(Vec<u8>) -> Result<Vec<u8>>>;
? That would simplify things quite a bit and Fn(Vec<u8>) -> Result<Vec<u8>>
is still expressive enough. I would argue readability is more important than performance here because the closure is likely called very infrequently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I meant in my previous comment in the PR (not in the review pane). That's complexity to be removed.
I don't think it makes sense to use type aliases. Passing the type would then require the user to Box or have the trait bounds there (and hence twice).
While in this case one could argue, that a simplified fn
signature might be good, I find that a bit of an anti-pattern. impl Read
and impl Write
can easily be "chained"/combined without intermediate buffers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just attempted it, but having an dyn Fn(dyn Read, dyn Write) -> result::Result<(), OtherError> {}
opens a can of worms, so I am not sure that's in scope of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I agree on the impl Read
/ impl Write
now, it is also consistent with the signature of postprocess
.
Passing the type would then require the user to Box or have the trait bounds there (and hence twice).
I don't see an issue with having the bounds twice (once in the type alias, once in the transform
function). Something like this would work:
pub type TransformDataFn =
Box<dyn Fn(Cursor<Vec<u8>>, Cursor<&mut Vec<u8>>) -> result::Result<(), OtherError>>;
and
pub fn transform<F, C>(mut self, proc_fn: C, path_fn: F) -> Self
where
C: Fn(Cursor<Vec<u8>>, Cursor<&mut Vec<u8>>) -> result::Result<(), OtherError> + 'static,
F: TransformPathFn + 'static,
{
self.transform_data_fn = Some(Box::new(proc_fn));
self.transform_path_fn = Some(Box::new(path_fn));
self
}
or am I missing something? The repetition is not perfect but changing one or the other immediately gives a compile time error so that's not an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the last commit, one needs late bound life times and a bunch of boxes to be able to use dyn Read
and dyn Write
in the args which is needed for the get_build_dir
case.
Note that this now includes a bunch of lifetime trickery, which I am happy to explain. |
Oh, and please squash merge this 🗜️ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can simplify the signature of TransformDataFn
to use concrete structs. That would also remove the "lifetime trickery". 🙂
Besides that only small changes. Please also run rustfmt
on the files.
Co-authored-by: Benjamin Minixhofer <[email protected]>
Co-authored-by: Benjamin Minixhofer <[email protected]>
Co-authored-by: Benjamin Minixhofer <[email protected]>
Co-authored-by: Benjamin Minixhofer <[email protected]>
Co-authored-by: Benjamin Minixhofer <[email protected]>
Co-authored-by: Benjamin Minixhofer <[email protected]>
Co-authored-by: Benjamin Minixhofer <[email protected]>
CI failed because Backblaze bandwith got exceeded, I fixed that now. But there's a compilation error now. |
Should be fixed now. Sidenote: Why do I need python3.9 to link the rust crate? |
I'm not sure; maybe because of the Python bindings? They're in the same workspace. Looks good to me now! Thanks a lot! |
Note that this PR still lacks test adjustments required and
transform
is not covered yet.Changes:
fs-err
for better errors without much chorefn transform
for transformations before artifacts hit thecache_dir
(this is not quite correct now, tbd)type Result<T>
to avoid boilerplateBufReader
andBufWriter
instead of intermediateVec
where possible