Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add macros for locale/langid #75

Closed
zbraniecki opened this issue May 5, 2020 · 8 comments · Fixed by #220
Closed

Add macros for locale/langid #75

zbraniecki opened this issue May 5, 2020 · 8 comments · Fixed by #220
Assignees
Labels
A-performance Area: Performance (CPU, Memory) C-locale Component: Locale identifiers, BCP47 T-core Type: Required functionality
Milestone

Comments

@zbraniecki
Copy link
Member

One of the features of unic-locale is that it allows for "free" encoding of language identifiers, locales and subtags thanks to proc macros.

In the current model it's quite quirky, and I didn't want to include it in the initial landing, but for that code to work, we only need to add two methods per subtag, and at least one of them doesn't have to be public - they just need to make it possible to create a subtag from a pre-computed u64/u32 (which is what the proc macro will do).

The good news is that Rust is about to stabilize proc macros (targeting Rust 1.45) which will allow us to get this feature without multi-crate hacks required before.

@zbraniecki zbraniecki added T-enhancement Type: Nice-to-have but not required C-locale Component: Locale identifiers, BCP47 A-performance Area: Performance (CPU, Memory) labels May 5, 2020
@zbraniecki
Copy link
Member Author

Pending #43

@echeran
Copy link
Contributor

echeran commented May 6, 2020

Can you remind me of the details of the benefits? (Was it compile-time static constants with dead code elimination for Rust users?) Also, what are the tradeoffs?

Proc macros sound similar to Scala/Haskell macros, although I haven't written them. Are there Rust style guides or rules of thumb about them? I think wisdom from Lisp macros should transfer over. Basically, the gist is don't write them unless you absolutely have to. (A concise and more nuanced guide is Ch 8 of On Lisp.) Along the spectrum of data<->fns<->macros, you tend trade off simplicity for power. Ex: macros can cause "macro-contagion" b/c they're not composable b/c they're not 1st-class data. Most common successful uses of macros seem to be either building up low level language constructs (for / while / println) or creating a streamlined sugary top layer for users of a library.

For ICU4X, could the result of the proc macros for locale cause the corresponding WASM binary to be larger, for example?

@Manishearth
Copy link
Member

For ICU4X, could the result of the proc macros for locale cause the corresponding WASM binary to be larger, for example?

the proc macro here would compile down to a single constant, so, no.

Proc macros sound similar to Scala/Haskell macros

They're basically transforms written in rust code that transform rust token trees.

I don't think the macro contagion problem will happen here, this macro is essentially being written like a const fn, except that the logic involved isn't something that can currently be written as a const fn

@zbraniecki
Copy link
Member Author

For ICU4X, could the result of the proc macros for locale cause the corresponding WASM binary to be larger, for example?

I don't think so.

Can you remind me of the details of the benefits? (Was it compile-time static constants with dead code elimination for Rust users?)

Sure. They basically are an elegant and complete replacement for the current ICU's "Static Public Member Functions".

Basically, if you need to encode in your code let x = "en-US"; but you don't want to pay at runtime for parsing of that string, since you put it there in place, and you guarantee that it's a valid one, you could try to manually parse the internals, and write:

let x = Locale {
  language: Language::from_str("en").unwrap(),
  script: None,
  region: Region::from_str("US").unwrap(),
  variants: vec![]
};

but then you still pay for parsing of the Language and Region while you know that it's the correct one. So you can dive deeper and expose new_unchecked constructor that takes the computed internal value and so on...

That's what proc_macro is doing for you. It encapsulates the code that takes a string "en-US" parses it, and produces a constructor that has all the values precomputed, then injects it into the code. It looks like this:

let x = locale!("en-US");
let lang = language!("en");
let region = region!("us"); // it'll canonicalize to US at build time

let broken = locale!("213d1"); // it'll fail at compile time

Also, what are the tradeoffs?

Today there are two - you need to expose some constructor that takes a precomputed internal value, and you need to play around to make the proc macros work on stable Rust.

The latter is getting resolved now and in 12 weeks will be in stable, which means that all we need is a pair of methods that can be used by the macro to "get the raw data" and "create from raw data" without any computations.

Here's the code I use today: https://github.com/zbraniecki/unic-locale/blob/master/unic-langid-macros-impl/src/lib.rs#L65-L108
and here are examples of uses: https://github.com/zbraniecki/unic-locale/blob/master/unic-langid/examples/simple-langid.rs#L6

I have a patch that adds the necessary get_raw and from_raw_unchecked to each subtag, and both of them are "semi-private" as in don't need to be documented, treated as stable etc. as we only need them to proc macros and don't consider part of the API.

In result you can skip them when building bindings to other languages and they don't need any additional binary size in case you don't use them.
We can also keep them behind a feature flag if we want to.

@echeran
Copy link
Contributor

echeran commented May 6, 2020

Ah, ok, thanks for the explanation. That makes sense -- since macro expansion happens pre-compile, you can use that to insert code to efficiently parse known valid locale string literals (without cluttering the source), and that makes it into the compiled code so that runtime is reduced.

I guess that means that there is a small tradeoff between extra code size and performance, but maybe it is negligible. I wonder how often in i18n algorithms do we need to refer specifically to particular locales.

@zbraniecki
Copy link
Member Author

Since I released unic-locale several people asked for some static const for en-US or several others that they need to "compare against". Not sure how to measure if it's a loud outlier or common need, but I think proc macro saves us from that dilemma.

It also allows you to do very performant comparisons like this:

if langid.language == language!("en") {
}

I guess that means that there is a small tradeoff between extra code size and performance, but maybe it is negligible

Where do you see the extra code size coming?

@echeran
Copy link
Contributor

echeran commented May 7, 2020

Nevermind, doesn't look like extra code, I didn't properly understand that the quoted values were being saved up for the single return expression in the above link.

@sffc sffc added T-core Type: Required functionality and removed T-enhancement Type: Nice-to-have but not required labels May 7, 2020
@sffc sffc added this to the 2020 Q3 milestone Jun 17, 2020
@zbraniecki
Copy link
Member Author

I have the PR in works, but we're waiting for rust-lang/rust#49146 which will be released on August 27 with rust 1.46

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-performance Area: Performance (CPU, Memory) C-locale Component: Locale identifiers, BCP47 T-core Type: Required functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants