Datagen API #1819

Merged: 9 commits, Apr 28, 2022
8 changes: 0 additions & 8 deletions Cargo.lock

Some generated files are not rendered by default.

55 changes: 25 additions & 30 deletions docs/tutorials/writing_a_new_data_struct.md
@@ -39,13 +39,11 @@ In general, data structs should be annotated with `#[icu_provider::data_struct]`

As explained in *data_pipeline.md*, the data struct should support zero-copy deserialization. The `#[icu_provider::data_struct]` annotation will enforce this for you. **See more information in [style_guide.md](https://github.com/unicode-org/icu4x/blob/main/docs/process/style_guide.md#zero-copy-in-dataprovider-structs--required),** as well as the example below in this tutorial.

If adding a new crate, you may need to add a new data category to the [`ResourceCategory` enum](https://unicode-org.github.io/icu4x-docs/doc/icu_provider/prelude/enum.ResourceCategory.html) in `icu_provider`. This may change in the future.

### Data Download

The first step to introduce data into the ICU4X pipeline is to download it from an external source. This corresponds to step 1 above.

When clients use ICU4X, this is generally a manual step, although we may provide tooling to assist with it. For the purpose of ICU4X test data, the tool [`icu4x-testdata-download`](https://unicode-org.github.io/icu4x-docs/doc/icu_datagen/index.html) should automatically download data from the external source and save it in the ICU4X tree. `icu4x-testdata-download` should not do anything other than downloading the raw source data.
When clients use ICU4X, this is generally a manual step, although we may provide tooling to assist with it. For the purpose of ICU4X test data, the tool [`icu4x-testdata-download-source`](https://unicode-org.github.io/icu4x-docs/doc/icu_datagen/index.html) should automatically download data from the external source and save it in the ICU4X tree. `icu4x-testdata-download-source` should not do anything other than downloading the raw source data.

### Source Data Providers

@@ -55,13 +53,9 @@ Although they may share common code, source data providers are implemented speci

Examples of source data providers include:

- [`CldrJsonDataProvider`](https://unicode-org.github.io/icu4x-docs/doc/icu_datagen/cldr/transform/struct.CldrJsonDataProvider.html#)
- [`NumbersProvider`](https://unicode-org.github.io/icu4x-docs/doc/icu_datagen/cldr/transform/struct.NumbersProvider.html)
- [`PluralsProvider`](https://unicode-org.github.io/icu4x-docs/doc/icu_datagen/cldr/transform/struct.PluralsProvider.html)
- [`DateSymbolsProvider`](https://unicode-org.github.io/icu4x-docs/doc/icu_datagen/cldr/transform/struct.DateSymbolsProvider.html)
- [… more examples](https://unicode-org.github.io/icu4x-docs/doc/icu_datagen/cldr/transform/index.html)
- `BinaryPropertyUnicodeSetDataProvider`
- [`HelloWorldProvider`](https://unicode-org.github.io/icu4x-docs/doc/icu_provider/hello_world/struct.HelloWorldProvider.html)
- [`NumbersProvider`](https://unicode-org.github.io/icu4x-docs/doc/icu_datagen/transform/cldr/struct.NumbersProvider.html)
- [`BinaryPropertyUnicodeSetDataProvider`](https://unicode-org.github.io/icu4x-docs/doc/icu_datagen/transform/uprops/struct.BinaryPropertyUnicodeSetDataProvider.html)
- [… more examples](https://unicode-org.github.io/icu4x-docs/doc/icu_datagen/transform/index.html)

Source data providers must implement the following traits:

@@ -73,7 +67,7 @@ Source data providers are often complex to write. Rules of thumb:

- Optimize for readability and maintainability. The source data providers are not used in production, so performance is not a driving concern; however, we want the transformer to be fast enough to make a good developer experience.
- If the data source is similar to an existing data source (e.g., importing new data from CLDR JSON), try to share code with existing data providers for that source.
- If the data source is novel, feel free to add a new crate under `/provider`.
- If the data source is novel, feel free to add a new module under `icu_datagen::transform`.

### Data Exporters and Runtime Data Providers

@@ -95,18 +89,30 @@ Examples of runtime data providers include:

### Data Generation Tool (`icu4x-datagen`)

The [data generation tool, i.e., `icu4x-datagen`](https://unicode-org.github.io/icu4x-docs/doc/icu_datagen/index.html), ties together the source data providers with a data exporters.
The [data generation tool, i.e., `icu4x-datagen`](https://unicode-org.github.io/icu4x-docs/doc/icu_datagen/index.html), ties together the source data providers with a data exporter.

When adding new data structs, it is necessary to make `icu4x-datagen` aware of your source data provider. To do this, edit
[*provider/datagen/src/registry.rs*](https://github.com/unicode-org/icu4x/blob/main/provider/datagen/src/registry.rs) and add your data provider to the macro:

When adding new data structs, it may be necessary to make `icu4x-datagen` aware of your source data provider. This is *not* necessary for CLDR JSON providers, so long as they are properly hooked up into `CldrJsonDataProvider`.
```rust
macro_rules! create_datagen_provider {
    // ...
    FooProvider,
}
```
as well as to the list of keys:

1. Add a dependency from `icu_datagen` to the crate containing your source data provider.
2. Edit the code in `icu_datagen` to support your new source provider. You may choose to add a new command-line flag if relevant.
```rust
pub fn get_all_keys() -> Vec<ResourceKey> {
    // ...
    v.push(FooV1Marker::KEY);
    v
}
```
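
The key-registration pattern above can be sketched with the standard library alone. This is a minimal illustration, not the actual `registry.rs`: `ResourceKey` and `ResourceMarker` here are simplified, hypothetical stand-ins for the `icu_provider` types, and the key strings are placeholders.

```rust
// Hypothetical stand-in for `icu_provider::ResourceKey` (assumption: the real
// type carries more than a path string).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ResourceKey(pub &'static str);

// Hypothetical stand-in for a marker trait exposing an associated `KEY`.
pub trait ResourceMarker {
    const KEY: ResourceKey;
}

// An existing marker that is already registered (placeholder key string).
pub struct ExistingV1Marker;
impl ResourceMarker for ExistingV1Marker {
    const KEY: ResourceKey = ResourceKey("existing/data@1");
}

// The marker for the newly added data struct (placeholder key string).
pub struct FooV1Marker;
impl ResourceMarker for FooV1Marker {
    const KEY: ResourceKey = ResourceKey("foo/data@1");
}

// Collects every known key; a new data struct is registered by pushing its key.
pub fn get_all_keys() -> Vec<ResourceKey> {
    let mut v = Vec::new();
    v.push(ExistingV1Marker::KEY);
    v.push(FooV1Marker::KEY);
    v
}

fn main() {
    let keys = get_all_keys();
    // The new key is only discoverable if it was pushed above.
    assert!(keys.contains(&FooV1Marker::KEY));
    println!("{} keys registered", keys.len());
}
```

Tying each key to a marker type through an associated constant means the list and the data struct cannot silently drift apart: renaming the marker breaks the registration at compile time.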

When finished, run from the top level:

```bash
$ cargo make testdata-build-json
$ cargo make testdata-build-blob
$ cargo make testdata
```

If everything is hooked together properly, JSON files for your new data struct should appear under *provider/testdata/data/json*, and the file *provider/testdata/data/testdata.postcard* should have changed.
@@ -145,7 +151,7 @@ The above example is an abridged definition for `DecimalSymbolsV1`. Note how the

### CLDR JSON Deserialize

[*provider/cldr/src/transform/numbers/cldr_serde.rs*](https://github.com/unicode-org/icu4x/blob/main/provider/cldr/src/transform/numbers/cldr_serde.rs)
[*provider/datagen/src/transform/cldr/serde/numbers.rs*](https://github.com/unicode-org/icu4x/blob/main/provider/datagen/src/transform/cldr/serde/numbers.rs)

```rust
pub mod numbers_json {
@@ -188,7 +194,7 @@ The above example is an abridged definition of the Serde structure corresponding

### Transformer

[*provider/cldr/src/transform/numbers/mod.rs*](https://github.com/unicode-org/icu4x/blob/main/provider/cldr/src/transform/numbers/mod.rs)
[*provider/datagen/src/transform/cldr/numbers/mod.rs*](https://github.com/unicode-org/icu4x/blob/main/provider/datagen/src/transform/cldr/numbers/mod.rs)

```rust
struct FooProvider {
@@ -230,14 +236,3 @@ icu_provider::impl_dyn_provider!(FooProvider, [
```

The above example is an abridged snippet of code illustrating the most important boilerplate for implementing an ICU4X data transform.
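
Stripped of the provider boilerplate, the core of a transform is a conversion from the parsed source shape into the runtime data struct. The following is a rough sketch of that step only, assuming hypothetical `FooSource` and `FooV1` types and a free function `transform`; it does not use the `icu_provider` traits or Serde, both of which the real transformer relies on.

```rust
use std::borrow::Cow;

// Hypothetical parsed source data; in the real pipeline this shape would be
// produced by Serde from the source files.
struct FooSource {
    decimal: String,
    group: String,
}

// Hypothetical runtime data struct. `Cow` lets it either borrow from a
// serialized buffer or own its strings, as it does when built by a transform.
#[derive(Debug, PartialEq)]
struct FooV1<'data> {
    decimal_separator: Cow<'data, str>,
    grouping_separator: Cow<'data, str>,
}

// Converts source data into the runtime struct, rejecting incomplete input.
fn transform(source: &FooSource) -> Result<FooV1<'static>, String> {
    if source.decimal.is_empty() {
        return Err("missing decimal separator".to_string());
    }
    Ok(FooV1 {
        decimal_separator: Cow::Owned(source.decimal.clone()),
        grouping_separator: Cow::Owned(source.group.clone()),
    })
}

fn main() {
    let source = FooSource {
        decimal: ".".to_string(),
        group: ",".to_string(),
    };
    match transform(&source) {
        Ok(data) => println!("{:?}", data),
        Err(e) => eprintln!("transform failed: {}", e),
    }
}
```

Keeping the conversion in a small, fallible function separate from the provider plumbing makes it straightforward to unit-test against malformed source data.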

### `CldrJsonDataProvider`

New CLDR JSON transformers need to be discoverable from `CldrJsonDataProvider`. To do this, edit [*provider/cldr/src/transform/mod.rs*](https://github.com/unicode-org/icu4x/blob/main/provider/cldr/src/transform/mod.rs) and add your data provider to the macro at the bottom of the file:

```rust
cldr_json_data_provider!(
    // ...
    foo: FooProvider,
);
```
55 changes: 32 additions & 23 deletions provider/datagen/Cargo.toml
@@ -22,6 +22,7 @@ include = [
"LICENSE",
"README.md",
]
default-run = "icu4x-datagen"
Review comment (Member): TIL!


[package.metadata.cargo-all-features]
# Omit most optional dependency features from permutation testing
@@ -31,62 +32,70 @@ skip_optional_dependencies = true
all-features = true

[dependencies]
cached-path = { version = "0.5", optional = true }
clap = "2.33"
dhat = "0.3.0"
displaydoc = { version = "0.2.3", default-features = false }
elsa = "1.7"
eyre = "0.6"

# ICU components
icu_calendar = { version = "0.5", path = "../../components/calendar", features = ["datagen"] }
icu_codepointtrie = { version = "0.3.3", path = "../../utils/codepointtrie", features = ["serde_serialize"] }
icu_datetime = { version = "0.5", path = "../../components/datetime", features = ["datagen"] }
icu_decimal = { version = "0.5", path = "../../components/decimal", features = ["datagen"] }
icu_list = { version = "0.5", path = "../../components/list", features = ["datagen"]}
icu_locale_canonicalizer = { version = "0.5", path = "../../components/locale_canonicalizer", features = ["datagen"] }
icu_locid = { version = "0.5", path = "../../components/locid", features = ["std"]}
icu_plurals = { version = "0.5", path = "../../components/plurals", features = ["datagen"] }
icu_properties = { version = "0.5", path = "../../components/properties", features = ["std", "datagen"]}
icu_properties = { version = "0.5", path = "../../components/properties", features = ["datagen"]}
# (experimental)
icu_casemapping = { version = "0.1", path = "../../experimental/casemapping", features = ["datagen"], optional = true }
icu_segmenter = { version = "0.1", path = "../../experimental/segmenter", features = ["datagen"], optional = true }

# ICU provider infrastructure
icu_provider = { version = "0.5", path = "../core", features = ["std", "log_error_context"]}
icu_provider_adapters = { path = "../adapters", features = ["datagen"] }
icu_provider_blob = { version = "0.5", path = "../blob", features = ["export"] }
icu_provider_fs = { version = "0.5", path = "../fs", features = ["export"] }
icu_testdata = { version = "0.5", path = "../testdata", features = ["metadata"] }

# Other
displaydoc = { version = "0.2.3", default-features = false }
elsa = "1.7"
icu_codepointtrie = { version = "0.3.3", path = "../../utils/codepointtrie", features = ["serde_serialize"] }
icu_locid = { version = "0.5", path = "../../components/locid", features = ["std"]}
icu_uniset = { version = "0.4.1", path = "../../utils/uniset", features = ["serde"] }
itertools = "0.10"
json = "0.12"
litemap = { version = "0.3.0", path = "../../utils/litemap" }
log = "0.4"
pathdiff = "0.2.1"
rayon = "1.5"
serde = { version = "1.0", default-features = false, features = ["derive", "alloc"] }
serde_json = { version = "1.0", default-features = false, features = ["alloc"] }
serde-aux = "2.1.1"
serde-tuple-vec-map = "1.0"
sha2 = "0.10.2"
simple_logger = "1.12"
tinystr = { path = "../../utils/tinystr", version = "0.5.0", features = ["alloc", "serde", "zerovec"], default-features = false }
toml = "0.5"
walkdir = "2.3.2"
writeable = { path = "../../utils/writeable" }
zerovec = { version = "0.6", path = "../../utils/zerovec", features = ["serde_serialize", "yoke"] }

# Experimental crates
icu_casemapping = { version = "0.1", path = "../../experimental/casemapping", features = ["datagen"], optional = true }
icu_segmenter = { version = "0.1", path = "../../experimental/segmenter", features = ["datagen"], optional = true }
# Dependencies for "bin" feature
clap = { version = "2.33", optional = true }
eyre = { version = "0.6", optional = true }
simple_logger = { version = "1.12", optional = true }
cached-path = { version = "0.5", optional = true }
sha2 = { version = "0.10.2", optional = true }
pathdiff = { version = "0.2.1", optional = true }
walkdir = { version = "2.3.2", optional = true }

[dev-dependencies]
dhat = "0.3.0"
writeable = { path = "../../utils/writeable" }

[features]
experimental = ["icu_casemapping", "icu_segmenter"]
# Automatically download CLDR and uprops data
download = ["cached-path"]
bin = ["clap", "cached-path", "eyre", "pathdiff","sha2", "simple_logger", "walkdir"]

[[bin]]
name = "icu4x-datagen"
path = "src/bin/datagen.rs"
required-features = ["bin"]

[[bin]]
[[test]]
name = "icu4x-verify-zero-copy"
path = "src/bin/verify-zero-copy.rs"
path = "tests/verify-zero-copy.rs"

[[bin]]
name = "icu4x-fingerprint-data"
path = "src/bin/fingerprint-data.rs"
required-features = ["bin"]
64 changes: 28 additions & 36 deletions provider/datagen/README.md
@@ -1,50 +1,42 @@
# icu_datagen [![crates.io](https://img.shields.io/crates/v/icu_datagen)](https://crates.io/crates/icu_datagen)

`icu_datagen` contains command-line tools to generate and process ICU4X data.
`icu_datagen` is a library to generate data files that can be used in ICU4X data providers.

The tools include:

* `icu4x-datagen`: Read source data (CLDR JSON, uprops files) and dump ICU4X-format data.
* `icu4x-key-extract`: Extract `ResourceKey` objects present in a compiled executable.

More details on each tool can be found by running `--help`.
Data files can be generated either programmatically (e.g., in `build.rs`) or through a
command-line utility.

## Examples

Generate ICU4X Postcard blob (single file) for all keys and all locales:
### `build.rs`

```rust
use icu_datagen::*;
use icu_locid::langid;
use std::fs::File;
use std::path::PathBuf;

fn main() {
    icu_datagen::datagen(
        Some(&[langid!("de"), langid!("en-AU")]),
        &icu_datagen::keys(&["list/and@1"]),
        &SourceData::default().with_uprops(PathBuf::from("/path/to/uprops/root")),
        Out::Blob(Box::new(File::create("data.postcard").unwrap())),
        false,
    ).unwrap();
}
```

### Command line
The command line interface is available with the `bin` feature.
```bash
# Run from the icu4x project folder
$ cargo run --bin icu4x-datagen -- \
--cldr-tag 41.0.0 \
cargo run --features bin -- \
--uprops-root /path/to/uprops/root \
--all-keys \
--all-locales \
--locales de,en-AU \
--format blob \
--out /tmp/icu4x_data/icu4x_data.postcard
```

Extract the keys used by an executable into a key file:

```bash
# Run from the icu4x project folder
$ cargo build --example work_log --release --features serde
$ cargo make icu4x-key-extract \
target/release/examples/work_log \
/tmp/icu4x_data/work_log+keys.txt
$ cat /tmp/icu4x_data/work_log+keys.txt
```

Generate ICU4X JSON file tree from the key file for Spanish and German:

```bash
# Run from the icu4x project folder
$ cargo run --bin icu4x-datagen -- \
--cldr-tag 41.0.0 \
--key-file /tmp/icu4x_data/work_log+keys.txt \
--locales es \
--locales de \
--out /tmp/icu4x_data/work_log_json
--out data.postcard
```
More details can be found by running `--help`.

## More Information
