API to include the correct binaries at compile time #12
Hey, nice library! I am currently checking what would be needed to obsolete the current LanguageTool backend in https://github.com/drahnr/cargo-spellcheck.

There are a few things which would need to be addressed; the most important is to avoid the need for https://github.com/bminixhofer/nlprule/blob/master/scripts/build.sh. A `compile` feature could gate a `build.rs` file which would prep the data, which in turn could be included via `include_bytes!`. That way, one can locally source the data at compile time and include the source files within the binary, with optional external overrides.

Another thing that would be nice is documentation on how to obtain the referenced dumps.

Looking forward 💯

Comments
Hi, thanks! I don't fully understand the issue here. These dumps are needed as a developer of NLPRule to produce the binaries distributed via GitHub Releases. Distributing a single binary each for the rules / tokenizer is much more convenient than distributing a bunch of files which are all prone to change. If you want to include the data for NLPRule at compile time, you can already do so:

```rust
use std::io::Cursor;
use nlprule::{Rules, Tokenizer};

fn main() {
    let tokenizer =
        Tokenizer::new_from(Cursor::new(include_bytes!("../data/en_tokenizer.bin"))).unwrap();
    let rules = Rules::new_from(Cursor::new(include_bytes!("../data/en_rules.bin"))).unwrap();

    assert_eq!(
        rules.correct("She was not been here since Monday.", &tokenizer),
        String::from("She was not here since Monday.")
    );
}
```

At the moment this is a bit awkward because the files are quite large, but there is potential to decrease the size. Is there something I'm missing? |
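Since the size is the main pain point here, one workaround is to compress the included binary and decompress it while reading. A minimal sketch, assuming the flate2 crate and an `en_tokenizer.bin.gz` that you gzipped yourself (nlprule does not ship gzipped files):

```rust
use flate2::read::GzDecoder;
use nlprule::Tokenizer;
use std::io::Cursor;

fn main() {
    // Bytes of a gzipped tokenizer binary, produced by e.g. `gzip en_tokenizer.bin`.
    let compressed: &[u8] = include_bytes!("../data/en_tokenizer.bin.gz");
    // GzDecoder wraps any reader and decompresses on the fly.
    let tokenizer = Tokenizer::new_from(GzDecoder::new(Cursor::new(compressed))).unwrap();
    let _ = tokenizer;
}
```

This trades a bit of startup time for binary size: decompression happens once at load.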
> … is the result of …

That is correct, but I would expect to be able to create them reproducibly as part of the CI pipeline myself :) |
The sources are files from LanguageTool: the XML files where rules are defined, an exported dictionary (see here) and some files used for optimization, such as a cache of precomputed regexes and a list of common words in the language. Currently I am hosting this internal data on Backblaze. This is how the binaries are built and tested in CI: nlprule/.github/workflows/ci.yml, lines 33 to 76 at commit 19c5a59
My goal was that the user can just download the released binaries and never worry about the "ugly" build details. But I see your point now. I'll look at how to best streamline the process. I'm not sure yet how a `compile` feature gating a `build.rs` would best work here, but documentation on how to obtain the dumps should definitely be added. Thanks! |
That said, I still don't agree with one point:

The binaries are encoded via `bincode`, which is a non-standard format. IMO, distributing data in a format that only Rust/serde can read is not ideal. |
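For reference, "encoded via bincode" means a compact, serde-specific byte layout with no self-describing schema. A minimal round-trip sketch with illustrative types (not nlprule's actual data model), assuming the serde and bincode crates:

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct Entry {
    id: u32,
    tokens: Vec<String>,
}

fn main() {
    let entry = Entry { id: 7, tokens: vec!["she".into(), "was".into()] };

    // bincode writes fields in declaration order with no field names or tags...
    let bytes = bincode::serialize(&entry).unwrap();
    // ...so only a deserializer that knows the exact type layout can read it back.
    let decoded: Entry = bincode::deserialize(&bytes).unwrap();
    assert_eq!(entry, decoded);
}
```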
I probably should have phrased it differently: avoid external file dependencies outside of `cargo`. |
Sorry, I don't quite understand this part:
Could you elaborate a bit? But yes, better documentation is definitely the way to go. Thinking about this some more, it also occurred to me that building directly from the build files might be feasible; I definitely have to look into that. |
Cargo can include arbitrary files for distribution; the specification is here: https://doc.rust-lang.org/cargo/reference/manifest.html#the-exclude-and-include-fields Either way, more documentation would be much appreciated :) |
This is now mostly addressed. There are build docs in build/ and a documented script to generate all the build files in build/make_build_dir.py. Let me know if I missed anything.

Also, on the potential to decrease the size: I addressed most of it by encoding the tagger as two FSTs. The tokenizer is now roughly 2x smaller for English at 7.8MB gzipped and roughly 6x smaller for German at 1.8MB gzipped, so it's starting to become more feasible to inline it.

And on the topic of the binary format: @drahnr, do you think using CBOR or MessagePack instead of bincode would be an improvement? |
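For context on why FSTs help: sorted keys share prefixes and suffixes inside the automaton, so a large word list compresses far better than a serialized hash map. A toy example with the fst crate (keys and values are illustrative, not nlprule's actual schema):

```rust
use fst::Map;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Keys must be inserted in lexicographic order; values are u64.
    let map = Map::from_iter(vec![
        ("walk", 1u64),
        ("walked", 2),
        ("walking", 3),
    ])?;
    // Lookups run directly on the compressed automaton.
    assert_eq!(map.get("walked"), Some(2));
    Ok(())
}
```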
Depends on the goal. If you anticipate the inlined binaries to be compressed (e.g. with zstd), I don't think either will yield much of a perf win. If you anticipate using the plain encoded binary, it will make a (small? large?) difference if the codec is zero copy, e.g. https://github.com/3Hren/msgpack-rust. I think the only good way forward is to explicitly define the goal, then implement both variants and measure. |
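To make the zero-copy point concrete: a codec is zero copy when decoded fields borrow directly from the input buffer instead of allocating. A minimal sketch with serde and bincode (illustrative types, not nlprule's data model):

```rust
use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct EntryView<'a> {
    id: u32,
    // Borrowed straight from the input bytes, no String allocation.
    #[serde(borrow)]
    word: &'a str,
}

fn main() {
    // A (u32, &str) tuple has the same bincode layout as the struct above.
    let bytes = bincode::serialize(&(7u32, "walked")).unwrap();
    let view: EntryView = bincode::deserialize(&bytes).unwrap();
    assert_eq!(view.word, "walked");
}
```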
Yes, I briefly tried both msgpack and cbor; size-wise there is no significant improvement. And as you said, if it is compressed there isn't much of a difference. I was thinking more along the lines of the "non-standard format" you mentioned - or did you not mean the format of the binary, but having a binary that needs to be distributed at all? The distributed file will have to be binary, since anything else would of course blow up the size. But the content of the binary would be more predictable if it were in a standard format, and it would (in theory) be possible to read it from another language, although I could not come up with a use case for that. Also, reading once more over your original comment, I think what you mean would be possible. There could be a `compile` feature along the lines of what you described in the original issue. |
One solution that would make it more ergonomic without raising the two issues from above would be:
This would be easy to implement and would match the Python API. |
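For illustration, one hypothetical shape for this: a constructor keyed by language code that resolves binaries from a per-user cache. Everything below (names, cache layout) is assumed, not the actual nlprule API:

```rust
use std::path::PathBuf;

// Hypothetical: where a `Tokenizer::load("en")`-style constructor
// might cache downloaded binaries.
fn cached_binary_path(lang_code: &str, kind: &str) -> PathBuf {
    let base = std::env::var_os("XDG_CACHE_HOME")
        .map(PathBuf::from)
        .unwrap_or_else(|| {
            let home = std::env::var_os("HOME").expect("HOME is set");
            PathBuf::from(home).join(".cache")
        });
    base.join("nlprule").join(format!("{}_{}.bin", lang_code, kind))
}

fn main() {
    // e.g. ~/.cache/nlprule/en_rules.bin
    println!("{}", cached_binary_path("en", "rules").display());
}
```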
I put a quick one together here: https://github.com/drahnr/cargo-spellcheck/pull/148/files#diff-d0d98998092552a1d3259338c2c71e118a5b8343dd4703c0c7f552ada7f9cb42 for the PR drahnr/cargo-spellcheck#148. This is far from perfect, but I wanted to avoid pulling in an error library. I understand the need to separate the binary from the rest, yet it should be as easy as possible to retrieve the required binaries. The most ergonomic would probably be to provide a sample `build.rs`. On the other hand, if you plan on hosting it (and by that I mean reliably hosting it) you could provide the method anyways (beware, GH has rate limits too, and for me the linked one occasionally produces garbage downloads). So I have no strong opinion about this; I would probably vote for the feature flag plus a provided download helper Rust fn. |
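A rough sketch of such a download helper, fetching one release asset into a target path (assumes the reqwest crate with its blocking feature; the release tag and asset names are assumptions):

```rust
use std::{fs, io, path::Path};

// Downloads one released binary from GitHub Releases into `out`.
fn download_binary(tag: &str, asset: &str, out: &Path) -> Result<(), Box<dyn std::error::Error>> {
    let url = format!(
        "https://github.com/bminixhofer/nlprule/releases/download/{}/{}",
        tag, asset
    );
    let mut response = reqwest::blocking::get(url.as_str())?.error_for_status()?;
    let mut file = fs::File::create(out)?;
    // Stream the body to disk instead of buffering it in memory.
    io::copy(&mut response, &mut file)?;
    Ok(())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    download_binary("0.3.0", "en_tokenizer.bin", Path::new("en_tokenizer.bin"))
}
```

In a real `build.rs` this would also want retries and a checksum, given the garbage downloads mentioned above.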
Thanks! That's helpful. Also cool that you're moving forward with the integration in cargo-spellcheck! Back to the topic:
GH claims there is no rate limit (https://docs.github.com/en/github/administering-a-repository/about-releases):
But others have also reported issues with that (#9). I'll have to see what to do about it; moving to some external cloud hosting is an option, but I really like the convenience of GH releases and have not been able to reproduce a problem so far. |
Thanks for the hint. Noted 👍
My bad, then that was either obsolete or I mixed it up with some other GH API being rate limited 🧐
I had plenty of |
Currently the plan is:
So far there are two known issues with this approach:
|
There is now a PR here: #24 |
After merging the PR I did basically a 180 degree turn on how this is implemented. The merged implementation exposed macros along these lines:

```rust
#[macro_export]
macro_rules! tokenizer {
    ($lang_code:literal) => {{
        let bytes = include_bytes!(concat!(env!("OUT_DIR"), "/", $lang_code, "_tokenizer.bin"));
        // irrelevant
    }};
}
```

This uses the `OUT_DIR` environment variable to locate binaries prepared at build time. The setup I settled on instead uses a separate `nlprule-build` crate from the consumer's build script:
`build.rs`:

```rust
fn main() {
    println!("cargo:rerun-if-changed=build.rs");
    nlprule_build::BinaryBuilder::new(
        Some(&["en"]),
        std::env::var("OUT_DIR").expect("OUT_DIR is set when build.rs is running"),
    )
    // Fall back to downloading & building from the build dir if no distributed
    // binaries are found (i.e. the user is running a dev version).
    .fallback_to_build_dir(true)
    .build();
}
```
`main.rs`:

```rust
use nlprule::{Rules, Tokenizer};
use std::io::Cursor;

fn main() {
    let tokenizer = Tokenizer::from_reader(Cursor::new(include_bytes!(concat!(
        env!("OUT_DIR"),
        "/en_tokenizer.bin"
    ))))
    .unwrap();
    let rules = Rules::from_reader(Cursor::new(include_bytes!(concat!(
        env!("OUT_DIR"),
        "/en_rules.bin"
    ))))
    .unwrap();

    assert_eq!(
        rules.correct("She was not been here since Monday.", &tokenizer),
        String::from("She was not here since Monday.")
    );
}
```
`Cargo.toml`:

```toml
[package]
name = "nlpruletest"
version = "0.1.0"
authors = ["Benjamin Minixhofer <[email protected]>"]
edition = "2018"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
nlprule = { git = "https://github.com/bminixhofer/nlprule" }

[build-dependencies]
nlprule-build = { git = "https://github.com/bminixhofer/nlprule" }
```

What do you think about this API, @drahnr? The general notion of an extra `nlprule-build` crate for the build-time logic seems cleaner to me than keeping it all in `nlprule`. |
I like it generally speaking! Did not look into the details yet. The split makes sense. |
Looking at it with a pair of fresh eyes, it looks like the right approach. I would attempt to validate the artifacts after downloading, in the `build.rs`. So:

```rust
use fs_err as fs;

fn main() {
    println!("cargo:rerun-if-changed=build.rs");
    let generator = nlprule_build::BinaryBuilder::new(
        Some(&["en"]),
        std::env::var("OUT_DIR").expect("OUT_DIR is set when build.rs is running"),
    )
    // Fall back to downloading & building from the build dir if no distributed
    // binaries are found (i.e. the user is running a dev version).
    .fallback_to_build_dir(true)
    .build();

    // <<<<< or make this an explicit method(s) in the builder, to cover the
    // use case where those files were modified manually between builds
    for (path, what) in generator.binary_artifact_paths() {
        println!("cargo:rerun-if-changed={}", path.display());
        match what {
            What::Rules => {
                Rules::from_reader(fs::File::open(path).unwrap())
                    .expect("Rules checks pass | qed");
            }
            What::Tokenizer => {
                Tokenizer::from_reader(fs::File::open(path).unwrap())
                    .expect("Tokenizer checks must pass | qed");
            }
            _ => {}
        }
    }
}
```

On a different note: in an example I would recommend introducing a `static` for the included bytes. Instead of

```rust
Tokenizer::from_reader(Cursor::new(include_bytes!(concat!(
    env!("OUT_DIR"),
    "/en_tokenizer.bin"
))))
```

something like

```rust
static TOKENIZER_BYTES: &[u8] = include_bytes!(concat!(
    env!("OUT_DIR"),
    "/en_tokenizer.bin"
));

let _tokenizer = Tokenizer::from_reader(TOKENIZER_BYTES)
    .expect("Compile time rules are checked in build.rs | qed");
```
|
Thanks for taking a look!

Good point. I'll add a method for that. An error can still happen at runtime though. A mismatch like

```toml
[dependencies]
nlprule = "0.3.0"

[build-dependencies]
nlprule-build = "0.4.0"
```

would be possible and, unless I'm missing something, could only throw an error at runtime. So #20 is still very relevant.

I agree, that's better. |
I had that in mind as well, but it seems a rather odd corner case, assuming you make tandem releases of `nlprule` and `nlprule-build`. |
Alright, I added a method to validate the binaries. It is now released: https://github.com/bminixhofer/nlprule/releases/tag/0.4.0 |
After a couple of misfires the release is finished now, and there's a recommended setup for the build API here: https://crates.io/crates/nlprule-build. I'm closing this issue now! |