API to include the correct binaries at compile time #12

Closed
drahnr opened this issue Jan 22, 2021 · 23 comments · Fixed by #24
Assignee: bminixhofer
Labels: documentation (Improvements or additions to documentation), enhancement (New feature or request), P2 (Medium priority)

Comments

@drahnr
Contributor

drahnr commented Jan 22, 2021

Hey, nice library! I am currently checking what would be needed to obsolete the current LanguageTool backend in https://github.com/drahnr/cargo-spellcheck .

There are a few things which would need to be addressed; the most important is to
avoid the need for https://github.com/bminixhofer/nlprule/blob/master/scripts/build.sh .
A compile feature could gate a build.rs file which would prep the data, which in turn could be included via include_bytes!.
That way, one can locally source the data at compile time and include the source files within the binary, with optional external overrides.
Another thing that would be nice is documentation on how to obtain the referenced dumps.

Looking forward 💯

@bminixhofer
Owner

Hi, thanks! cargo-spellcheck looks great too, would be exciting to switch to NLPRule!

I don't fully understand the issue here. These dumps are needed by a developer of NLPRule to produce the binaries distributed via GitHub Releases. Distributing a single binary for rules / tokenizer is much more convenient than distributing a bunch of files which are all prone to change.

If you want to include the data for NLPRule at compile time you can already do so:

main.rs

use std::io::Cursor;

use nlprule::{Rules, Tokenizer};

fn main() {
    let tokenizer =
        Tokenizer::new_from(Cursor::new(include_bytes!("../data/en_tokenizer.bin"))).unwrap();
    let rules = Rules::new_from(Cursor::new(include_bytes!("../data/en_rules.bin"))).unwrap();

    assert_eq!(
        rules.correct("She was not been here since Monday.", &tokenizer),
        String::from("She was not here since Monday.")
    );
}

At the moment this is a bit awkward because the files are quite large - but there is potential to decrease the size.

Is there something I'm missing?
Also, I agree on the issue of documentation; the non-public docs definitely need to be improved, especially to make it easier to contribute to NLPRule. I'm working on that at the moment.

@drahnr
Contributor Author

drahnr commented Jan 22, 2021

../data/en_tokenizer.bin

is the result of build.sh, which - imho - should be generated automatically by the build process itself on file change of the source (whatever that is; that's where some additional docs would come in nicely), rather than manually.
This is indeed not a big deal and the lib user can do that as well.

Distributing a single binary for rules / tokenizer is much more convenient than distributing a bunch of files which are all prone to change.

That is correct, but I would expect to be able to create them reproducibly as part of the CI pipeline myself :)

@bminixhofer
Owner

The sources are files from LanguageTool: the XML files where rules are defined, an exported dictionary (see here) and some files used for optimization, such as a cache of precomputed regexes and a list of common words in the language.

Currently I am hosting this internal data on Backblaze; this is how the binaries are built and tested in CI:

- run: |
    mkdir data
    mkdir storage
    cd data
    wget https://f000.backblazeb2.com/file/nlprule/en.zip
    unzip en.zip
    wget https://f000.backblazeb2.com/file/nlprule/de.zip
    unzip de.zip
- uses: actions-rs/cargo@v1
  with:
    command: run
    # build english, see BUILD.md
    args: --all-features --release --bin compile -- --tag-paths data/en/tags/output.dump data/en/tags/added.txt --tag-remove-paths data/en/tags/removed.txt --disambiguation-path data/en/disambiguation.canonic.xml --tokenizer-config-path configs/en/tokenizer.json --grammar-path data/en/grammar.canonic.xml --rules-config-path configs/en/rules.json --common-words-path data/en/common.txt --chunker-path data/en/chunker.json --out-tokenizer-path storage/en_tokenizer.bin --out-rules-path storage/en_rules.bin --regex-cache-path data/en/regex_cache.bin
- uses: actions-rs/cargo@v1
  with:
    command: run
    # build german, see BUILD.md
    args: --all-features --release --bin compile -- --tag-paths data/de/tags/output.dump data/de/tags/added.txt --tag-remove-paths data/de/tags/removed.txt --disambiguation-path data/de/disambiguation.canonic.xml --tokenizer-config-path configs/de/tokenizer.json --grammar-path data/de/grammar.canonic.xml --rules-config-path configs/de/rules.json --common-words-path data/de/common.txt --out-tokenizer-path storage/de_tokenizer.bin --out-rules-path storage/de_rules.bin --regex-cache-path data/de/regex_cache.bin
- uses: actions-rs/cargo@v1
  with:
    command: test
    args: --verbose --all-features --release -p nlprule
- uses: actions-rs/cargo@v1
  with:
    command: run
    args: --all-features --release --bin test_disambiguation -- --tokenizer storage/en_tokenizer.bin
- uses: actions-rs/cargo@v1
  with:
    command: run
    args: --all-features --release --bin test -- --tokenizer storage/en_tokenizer.bin --rules storage/en_rules.bin
- uses: actions-rs/cargo@v1
  with:
    command: run
    args: --all-features --release --bin test_disambiguation -- --tokenizer storage/de_tokenizer.bin
- uses: actions-rs/cargo@v1
  with:
    command: run
    args: --all-features --release --bin test -- --tokenizer storage/de_tokenizer.bin --rules storage/de_rules.bin
- name: Upload binaries as artifact
  uses: actions/upload-artifact@v2
  with:
    name: binaries
    path: storage/*

the lib user can do that as well

My goal was that the user can just download the released binaries and never worry about the "ugly" build details. But I see your point now. I'll look at how to best streamline the process. I'm not sure how a build.rs is handled for libraries; I'll look into that as well. But regardless:

  1. better docs on where the build data comes from
  2. better docs on how to build the binaries from the build data

should definitely be added. Thanks!

@bminixhofer bminixhofer self-assigned this Jan 22, 2021
@bminixhofer bminixhofer added the documentation Improvements or additions to documentation label Jan 22, 2021
@bminixhofer
Owner

That said, I still don't agree with

Avoid external file dependencies in non-standard formats.

The binaries are encoded via bincode. As said, that's much more convenient than distributing all the build data, and it comes with smaller file sizes. There needs to be an intermediate step; otherwise all of the build files, which are potentially much larger than necessary, would have to be distributed.
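For context, "encoded via bincode" just means the data structures are serialized with serde + bincode into a compact, non-self-describing byte blob. A minimal round-trip sketch (the RuleSet struct here is illustrative, not nlprule's actual type):

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct RuleSet {
    version: String,
    rules: Vec<String>,
}

fn main() -> Result<(), bincode::Error> {
    let original = RuleSet {
        version: "0.4.0".into(),
        rules: vec!["WAS_BEEN".into()],
    };
    // bincode output is compact but tied to the exact struct layout,
    // which is why changing the types is a breaking change for the .bin files
    let bytes = bincode::serialize(&original)?;
    let decoded: RuleSet = bincode::deserialize(&bytes)?;
    assert_eq!(original, decoded);
    Ok(())
}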

IMO, distributing .bin files is fine but as you said it should be more transparent where they come from.

@drahnr
Contributor Author

drahnr commented Jan 22, 2021

I probably should have phrased it differently: avoid external file dependencies outside of Cargo.toml (which can pull in and distribute arbitrary binary artifacts) and document the used format.

@bminixhofer
Owner

Sorry, I don't quite understand this part:

Avoid external file dependencies outside of Cargo.toml (which can pull in and distribute arbitrary binary artifacts)

Could you elaborate a bit?

But yes, better documentation is definitely the way to go.

Thinking about this a bit, it also occurred to me that the .bin format is very prone to change compared to the build files, which amounts to a breaking change at the moment. For the Python bindings this is not an issue since .load just downloads the correct binary automatically, but for Rust it is of course more of a problem.

Maybe building directly from the build files is feasible; I definitely have to look into build.rs functionality.

@bminixhofer bminixhofer added the enhancement New feature or request label Jan 22, 2021
@drahnr
Contributor Author

drahnr commented Jan 22, 2021

Cargo can include arbitrary files for distribution; the specification is here: https://doc.rust-lang.org/cargo/reference/manifest.html#the-exclude-and-include-fields
This could help to include the files, but it might be better suited for the using project rather than the library itself.
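For illustration, a crate could ship prebuilt binaries with its published package via that field; a sketch (package name and paths are hypothetical):

[package]
name = "my-crate"
version = "0.1.0"
edition = "2018"
# ship the prebuilt binaries alongside the sources in the published package;
# the data/ paths are hypothetical
include = [
    "src/**/*",
    "Cargo.toml",
    "data/en_tokenizer.bin",
    "data/en_rules.bin",
]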

Either way, more documentation would be much appreciated :)

@bminixhofer
Owner

This is now mostly addressed. There are build docs in build/ and a documented script to generate all the build files in build/make_build_dir.py.

Let me know if I missed anything.

Also

there is potential to decrease the size

I addressed most of this potential by encoding the tagger as 2 FSTs; the tokenizer is now roughly 2x smaller for English at 7.8MB gzipped and ~6x smaller for German at 1.8MB gzipped. So it's starting to become more feasible to inline it.
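As an aside, a minimal sketch of why an FST encoding compresses a tag dictionary well, using the fst crate (0.4-style API; the word data is made up): keys sharing prefixes and suffixes are stored only once, and lookups work directly on the raw bytes.

use fst::{Map, MapBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // keys must be inserted in lexicographic order; the u64 values could be
    // e.g. indices into a table of POS tags (made-up data)
    let mut builder = MapBuilder::memory();
    builder.insert("walk", 0)?;
    builder.insert("walked", 1)?;
    builder.insert("walking", 2)?;
    builder.insert("walks", 3)?;
    let bytes = builder.into_inner()?;

    // the map is a zero-copy view over the serialized bytes
    let map = Map::new(bytes)?;
    assert_eq!(map.get("walked"), Some(1));
    Ok(())
}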

And on the topic of

non-standard

@drahnr do you think using cbor or MessagePack instead of bincode would be an improvement?

@drahnr
Contributor Author

drahnr commented Feb 2, 2021

Depends on the goal. If you anticipate the inlined binaries to be compressed (e.g. zstd), I don't think either will yield much of a perf win. If you anticipate using the plain encoded binary, it will make a (small? large?) difference if the codec is zero copy, e.g. https://github.com/3Hren/msgpack-rust

I think the only good way is to explicitly define the goal, then implement both variants and measure.
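A toy sketch of that "implement both and measure" idea, comparing serialized sizes of the same made-up data with bincode and MessagePack (via rmp-serde):

use serde::Serialize;

#[derive(Serialize)]
struct Sample {
    words: Vec<String>,
    ids: Vec<u32>,
}

fn main() {
    let sample = Sample {
        words: (0..1000).map(|i| format!("word{}", i)).collect(),
        ids: (0..1000).collect(),
    };
    let bincode_len = bincode::serialize(&sample).unwrap().len();
    let msgpack_len = rmp_serde::to_vec(&sample).unwrap().len();
    // compression (e.g. zstd) on top would shrink both and likely
    // erase most of the remaining difference
    println!("bincode: {} bytes, msgpack: {} bytes", bincode_len, msgpack_len);
}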

@bminixhofer
Owner

bminixhofer commented Feb 2, 2021

Yes, I briefly tried both msgpack and cbor; size-wise there is no significant improvement. And as you said, if it is compressed there isn't much of a difference.

I was thinking more along the lines of the "non-standard format" you mentioned - or did you not mean the format of the binary, but having a binary that needs to be distributed at all?

The distributed file will have to be binary, since anything else would of course blow up the size. But the content of the binary would be more predictable if it were in a standard format, and it would (in theory) be possible to read it from another language, although I could not come up with a use case for that.

Also, reading once more over your original comment, I think what you mean would be possible. There could be a build.rs which downloads the build directory and builds the binary. The problems I see with that are:

  • there would have to be a feature flag for each language and boilerplate code for each language - I am not entirely against this but I'm not sure I like having such a tight integration of the languages. I'll think a bit more about this.
  • the build directories for all past releases would have to be hosted somewhere. This is good for reproducibility anyway, but assuming NLPRule in the mid-term supports 10 languages, this would mean an extra ~20MB * 10 = 200MB to host per release, while at the moment it is only ~20MB per language (not per language per release).

@bminixhofer
Owner

One solution that would make it more ergonomic without raising the two issues from above would be:

  • A build.rs which downloads all the binaries for the current version from GH releases.
  • Functions Rules::from_lang_code() and Tokenizer::from_lang_code() taking a &'static str which include the binary via include_bytes!.

This would be easy to implement, match the Python .load API (which is also nice!) and keep the build directory farther away from the user. What do you think?
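For concreteness, a hypothetical sketch of what such a from_lang_code could have looked like inside nlprule (never shipped; a macro variant of this idea is discussed further down, where the OUT_DIR resolution problem that sank it is explained):

use std::io::Cursor;

// Hypothetical sketch only: this API was proposed here but not shipped.
// Each supported language would be embedded at compile time, presumably
// gated by per-language feature flags to keep binary size down.
impl Rules {
    pub fn from_lang_code(lang_code: &'static str) -> Rules {
        let bytes: &'static [u8] = match lang_code {
            // build.rs would have downloaded the binaries beforehand
            "en" => include_bytes!(concat!(env!("OUT_DIR"), "/en_rules.bin")),
            "de" => include_bytes!(concat!(env!("OUT_DIR"), "/de_rules.bin")),
            _ => panic!("unsupported language code: {}", lang_code),
        };
        Rules::new_from(Cursor::new(bytes)).unwrap()
    }
}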

@drahnr
Contributor Author

drahnr commented Feb 2, 2021

I put a quick one together here https://github.com/drahnr/cargo-spellcheck/pull/148/files#diff-d0d98998092552a1d3259338c2c71e118a5b8343dd4703c0c7f552ada7f9cb42 for the PR drahnr/cargo-spellcheck#148

This is far from perfect, but I wanted to avoid pulling in an error library.

I understand the need to separate the binary from the rest, yet it should be as easy as possible to retrieve the required binaries.

The most ergonomic would probably be to provide a sample build.rs as part of a Rust example project. That way it is easy to copy and paste, yet you don't incur the cost for it.

On the other hand, if you plan on hosting it (and by that I mean reliably hosting it) you could provide the method anyway (beware, GH has rate limits too, and for me the linked one occasionally produces garbage downloads).

So I have no strong opinion about this; I would probably vote for the feature flag and a provided download helper Rust fn.

@bminixhofer
Owner

I put a quick one together here https://github.com/drahnr/cargo-spellcheck/pull/148/files#diff-d0d98998092552a1d3259338c2c71e118a5b8343dd4703c0c7f552ada7f9cb42 for the PR drahnr/cargo-spellcheck#148

Thanks! That's helpful. Also cool that you're moving forward with integration in cargo-spellcheck. One thing to be aware of is that the Rust API currently always expects one sentence, i.e. sentence splitting has to be done separately. I still have to add a trait for that and add functionality equivalent to the Python SplitOn (#15).

Back to the topic:

On the other hand, if you plan on hosting it (and by that I mean reliably hosting it) you could provide the method anyway (beware, GH has rate limits too, and for me the linked one occasionally produces garbage downloads).

GH claims there is no rate limit (https://docs.github.com/en/github/administering-a-repository/about-releases):

Each file included in a release must be under 2 GB. There is no limit on the total size of a release, nor bandwidth usage.

But others have also reported issues with that (#9). I'll have to see what to do about it; moving to some external cloud hosting is an option, but I really like the convenience of GH releases and was not able to reproduce a problem so far.

@bminixhofer bminixhofer added the P2 Medium priority label Feb 3, 2021
@drahnr
Contributor Author

drahnr commented Feb 3, 2021

I put a quick one together here https://github.com/drahnr/cargo-spellcheck/pull/148/files#diff-d0d98998092552a1d3259338c2c71e118a5b8343dd4703c0c7f552ada7f9cb42 for the PR drahnr/cargo-spellcheck#148

Thanks! That's helpful. Also cool that you're moving forward with integration in cargo-spellcheck. One thing to be aware of is that the Rust API currently always expects one sentence, i.e. sentence splitting has to be done separately. I still have to add a trait for that and add functionality equivalent to the Python SplitOn (#15).

Thanks for the hint. Noted 👍

On the other hand, if you plan on hosting it (and by that I mean reliably hosting it) you could provide the method anyway (beware, GH has rate limits too, and for me the linked one occasionally produces garbage downloads).

GH claims there is no rate limit (https://docs.github.com/en/github/administering-a-repository/about-releases):

My bad, that was then either obsolete or I mixed that up with some other gh API being rate limited 🧐

Each file included in a release must be under 2 GB. There is no limit on the total size of a release, nor bandwidth usage.

But others have also reported issues with that (#9). I'll have to see what to do about it; moving to some external cloud hosting is an option, but I really like the convenience of GH releases and was not able to reproduce a problem so far.

I had plenty of "2 is not of type ..." errors. I'll add it if I stumble upon it again.

@bminixhofer bminixhofer changed the title Avoid external file dependencies in non-standard formats. API to include the correct binaries at compile time Feb 4, 2021
@bminixhofer
Owner

Currently the plan is:

  • A build.rs which downloads all the binaries for the current version from GH releases to some cache directory.
  • Functions Rules::from_lang_code() and Tokenizer::from_lang_code() taking a &'static str which include the binary via include_bytes!.

So far there are two known issues with this approach:

  • The build.rs has to download the binaries for all languages because it does not know which ones the user actually needs. I'm okay with this as the size is not too much at the moment (~40MB without compression) and will presumably not grow very large; just ~10MB per language. It could also be a feature in default-features so the user can opt out.
  • It only works with released versions since only released binaries are distributed via GH releases. This is not too big of a problem if there is a good error message (Better error if binary does not match version #20). It could be solved by having nightly GH releases.

@bminixhofer
Owner

bminixhofer commented Feb 8, 2021

There is now a PR here: #24

@bminixhofer
Owner

bminixhofer commented Feb 11, 2021

After merging the PR I did basically a 180-degree turn on how this is implemented because:

#[macro_export]
macro_rules! tokenizer {
    ($lang_code:literal) => {{
        let bytes = include_bytes!(concat!(env!("OUT_DIR"), "/", $lang_code, "_tokenizer.bin"));

        // irrelevant
    }};
}

This uses the OUT_DIR variable as defined by the user, not by nlprule, which is incorrect. It is impossible (or very hacky) to get that to work otherwise. I undid most of the changes from the branch now and made a binary builder instead. Usage is like this:

build.rs

fn main() {
    println!("cargo:rerun-if-changed=build.rs");

    nlprule_build::BinaryBuilder::new(
        Some(&["en"]),
        std::env::var("OUT_DIR").expect("OUT_DIR is set when build.rs is running"),
    )
    .fallback_to_build_dir(true) // this specifies the builder should fall back to downloading & building from the build dir if no distributed binaries are found (i. e. user is running a dev version)
    .build();
}

src/main.rs

use nlprule::{Rules, Tokenizer};
use std::io::Cursor;

fn main() {
    let tokenizer = Tokenizer::from_reader(Cursor::new(include_bytes!(concat!(
        env!("OUT_DIR"),
        "/en_tokenizer.bin"
    ))))
    .unwrap();
    let rules = Rules::from_reader(Cursor::new(include_bytes!(concat!(
        env!("OUT_DIR"),
        "/en_rules.bin"
    ))))
    .unwrap();

    assert_eq!(
        rules.correct("She was not been here since Monday.", &tokenizer),
        String::from("She was not here since Monday.")
    );
}

Cargo.toml

[package]
name = "nlpruletest"
version = "0.1.0"
authors = ["Benjamin Minixhofer <[email protected]>"]
edition = "2018"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
nlprule = { git = "https://github.com/bminixhofer/nlprule" }

[build-dependencies]
nlprule-build = { git = "https://github.com/bminixhofer/nlprule" }

What do you think about this API, @drahnr? The general notion of an nlprule-build crate feels very good to me; there's less magic behind the scenes and less complexity in NLPRule itself. I'm not quite sure about the API (esp. the user manually having to specify the paths) though. But I'm not sure a method just for that would make sense.

@drahnr
Contributor Author

drahnr commented Feb 11, 2021

I like it generally speaking! Did not look into the details yet. The split makes sense.

@drahnr
Contributor Author

drahnr commented Feb 12, 2021

Looking at it with a pair of fresh eyes, it looks like the right approach. I would attempt to validate the artifacts after downloading in the build.rs; otherwise the download might be fine, yet the application could encounter encoding errors at runtime due to a mismatch of nlprule-build and nlprule versions. The best approach imho would be to showcase that in an example.

So build.rs would become:

use fs_err as fs;
use nlprule::{Rules, Tokenizer};

fn main() {
    println!("cargo:rerun-if-changed=build.rs");

    let generator = nlprule_build::BinaryBuilder::new(
        Some(&["en"]),
        std::env::var("OUT_DIR").expect("OUT_DIR is set when build.rs is running"),
    )
    // fall back to downloading & building from the build dir if no
    // distributed binaries are found (i.e. user is running a dev version)
    .fallback_to_build_dir(true)
    .build();

    // binary_artifact_paths() and What are proposed additions, not existing API
    // <<<<< or make this an explicit method(s) in the builder, to cover the use
    // case where those files were modified manually between builds
    for (path, what) in generator.binary_artifact_paths() {
        println!("cargo:rerun-if-changed={}", path.display());
        match what {
            What::Rules => {
                Rules::from_reader(fs::File::open(&path).unwrap())
                    .expect("Rules checks pass | qed");
            }
            What::Tokenizer => {
                Tokenizer::from_reader(fs::File::open(&path).unwrap())
                    .expect("Tokenizer checks must pass | qed");
            }
            _ => {}
        }
    }
}

On a different note:

In an example I would recommend introducing a static in main.rs, so instead of

Tokenizer::from_reader(Cursor::new(include_bytes!(concat!(
        env!("OUT_DIR"),
        "/en_tokenizer.bin"
    ))))

something like

static TOKENIZER_BYTES: &[u8] = include_bytes!(concat!(
    env!("OUT_DIR"),
    "/en_tokenizer.bin"
));

// &[u8] implements Read, so the static slice can be passed by value
let _tokenizer = Tokenizer::from_reader(TOKENIZER_BYTES)
    .expect("Compile time rules are checked in build.rs | qed");

@bminixhofer
Owner

bminixhofer commented Feb 12, 2021

Thanks for taking a look!

I would attempt to validate the artifacts after downloading in the build.rs; otherwise the download might be fine, yet the application could encounter encoding errors at runtime due to a mismatch of nlprule-build and nlprule versions

Good point. I'll add a method for that. An error can still happen at runtime though. nlprule-build can know nothing about the nlprule version used by the application so something like this:

[dependencies]
nlprule = "0.3.0"

[build-dependencies]
nlprule-build = "0.4.0"

would be possible and, unless I'm missing something, could only throw an error at runtime. So #20 is still very relevant.

In an example I would recommend introducing a static in main.rs

I agree, that's better.

@drahnr
Contributor Author

drahnr commented Feb 12, 2021

I had that in mind as well, but it seems a rather odd corner case, assuming you make tandem releases of nlprule{,-build}.

@bminixhofer
Owner

Alright, I added a .validate method and the possibility to check output files with .outputs. I didn't add println!("cargo:rerun-if-changed={}", path.display()); because it caused the build script to rerun every time (even if I didn't change anything) and I didn't want to look into it too much.

It is now released: https://github.com/bminixhofer/nlprule/releases/tag/0.4.0
The release CI is still running (and it might fail; there have been significant changes in the release script) but in any case it will be available today :)

@bminixhofer
Owner

After a couple of misfires, the release is finished now, and there's a recommended setup for the build API here: https://crates.io/crates/nlprule-build.
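For reference, a build.rs along the lines of that recommended setup might look like this (sketched from the examples earlier in the thread; the exact 0.4.0 signatures and error types may differ, see the nlprule-build docs):

fn main() -> Result<(), Box<dyn std::error::Error>> {
    println!("cargo:rerun-if-changed=build.rs");

    nlprule_build::BinaryBuilder::new(
        Some(&["en"]),
        std::env::var("OUT_DIR").expect("OUT_DIR is set when build.rs is running"),
    )
    .fallback_to_build_dir(true)
    .build()
    // validate that the binaries can be deserialized by the nlprule version
    // in use, so mismatches surface at build time rather than at runtime
    .validate()?;

    Ok(())
}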

I'm closing this issue now!
