Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding all keys to testdata #3042

Merged
merged 3 commits into from
Feb 1, 2023
Merged

Conversation

robertbastian
Copy link
Member

@robertbastian robertbastian commented Jan 30, 2023

icu_testdata currently doesn't include a handful of properties keys. It also doesn't include some collations for locales that it includes other data in.

@robertbastian robertbastian requested review from Manishearth, sffc and a team as code owners January 30, 2023 18:25
@Manishearth Manishearth removed request for a team and Manishearth January 30, 2023 18:39
@Manishearth
Copy link
Member

Deferring to Shane on this, he has more context

.github/workflows/build-test.yml Outdated Show resolved Hide resolved
tools/testdata-scripts/src/bin/make-testdata.rs Outdated Show resolved Hide resolved
provider/blob/benches/provider_blob.rs Outdated Show resolved Hide resolved
@jira-pull-request-webhook

This comment was marked as spam.

@robertbastian robertbastian requested a review from sffc January 30, 2023 19:51
@@ -34,9 +14,9 @@ fn main() {
.unwrap();

let source_data = SourceData::default()
.with_cldr(paths::cldr_json_root(), CldrLocaleSubset::Full)
.with_cldr_for_tag(&icu_testdata::versions::cldr_tag(), CldrLocaleSubset::Full)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: Please, please make this read from the repo. A dev should be able to run cargo make testdata without an internet connection. This has always worked and there's no reason to break it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The repo might not reflect the correct data for this tag. This is more of an issue for the follow-up PR, where we generate testdata at SourceData::LATEST_TESTED_CLDR_TAG and then write that tag into the metadata. Local data could be wrong because someone forgot a glob, or we just haven't run the download script with the latest tag. I'm also more confident calling the constant LATEST_TESTED_CLDR_TAG if we actually test with it.

"Download" has been required for many things locally and on CI for quite a while now. Because it caches it's a one-time thing and we haven't had issues with it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The solution we've discussed is that there should be a CI action that verifies that the repo data is not stale.

"Download" has been required for many things locally and on CI for quite a while now.

CI yes, but we've kept local development as no-network as possible.

It's also much easier to reason about the data pipeline if I can see exactly what data is being consumed by datagen in order to produce the processed ICU4X output data. Depending on some artifact downloaded from the network is simply not a good idea for the purposes of ICU4X development.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue relying on the network is much less of a problem locally than in CI. Then work-on-plane use case is rare, and even then you're gonna have the data in the cache unless you started ICU4X development just before boarding.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you disagree with my point about calling the latest-tested-tag tested?

@robertbastian robertbastian requested a review from sffc January 31, 2023 16:48
@robertbastian robertbastian changed the title Adding all keys to testdata Make testdata datagen work with latest git tags Jan 31, 2023
@sffc
Copy link
Member

sffc commented Jan 31, 2023

Also remember that I was reluctant to go along with the user-facing version of icu4x-datagen downloading its data automatically; I agreed to that after weighing the following advantages: (1) icu4x-datagen has a lot of flags and we want to make it more accessible, since it's one of the first things ICU4X users need to use, and (2) as we add more data sources, we need to keep older invocations of icu4x-datagen working properly, which means we need to give them default sources.

Neither of those points for icu4x-datagen applies here in testdata datagen.

@robertbastian
Copy link
Member Author

robertbastian commented Jan 31, 2023

What's your preferred solution to the final testdata script then?

.with_cldr_for_tag(SourceData::LATEST_TESTED_CLDR_TAG, CldrLocaleSubset::Full)
.unwrap()
.with_icuexport_for_tag(SourceData::LATEST_TESTED_ICUEXPORT_TAG)

let cldr_tag = SourceData::LATEST_TESTED_CLDR_TAG;
let icu_tag = SourceData::LATEST_TESTED_ICUEXPORT_TAG;

Note that we generate testdata for a tag and then write that tag to metadata. Do we just pray that the local data is what would have been downloaded for the tag?

@sffc
Copy link
Member

sffc commented Jan 31, 2023

There should be a CI that verifies that the copy of the data in the repo corresponds to the tag.

@sffc
Copy link
Member

sffc commented Jan 31, 2023

See #2880

@robertbastian
Copy link
Member Author

This is where I fundamentally disagree, keeping state in the repo and having CI verify that the state is correct adds a lot of friction when the state is easy to regenerate. I already filed #3006 about this.

@robertbastian
Copy link
Member Author

Also the data will not correspond to the tag, because we only include select files. That itself could hide bugs.

@sffc
Copy link
Member

sffc commented Jan 31, 2023

This is where I fundamentally disagree, keeping state in the repo and having CI verify that the state is correct adds a lot of friction when the state is easy to regenerate. I already filed #3006 about this.

I see the friction argument for things that change "unexpectedly", like the GN files. But, these are files that don't change often, and when they do, it's normally intentional. If they get out of sync, that's a mistake, not an oversight.

Also the data will not correspond to the tag, because we only include select files. That itself could hide bugs.

Can you elaborate on the types of bugs this creates?

@robertbastian
Copy link
Member Author

If they get out of sync, that's a mistake, not an oversight.

I don't understand the difference and how that changes things.

Can you elaborate on the types of bugs this creates?

A new file gets added to icuexport but not to the glob. This file breaks some assumption in datagen (which often lists directories) and breaks it.

@robertbastian robertbastian changed the title Make testdata datagen work with latest git tags Adding all keys to testdata Feb 1, 2023
@robertbastian robertbastian requested a review from sffc February 1, 2023 11:32
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't make testdata smaller, but if adding all keys is the desired outcome, I like this way of doing it

@robertbastian
Copy link
Member Author

Testdata should contain all keys so that it works with all constructors. For collation maybe it's including too much, but we have to eat our own dogfood here, datagen users cannot at the moment exclude things like und-u-co-emoji.

@robertbastian robertbastian merged commit 8559789 into unicode-org:main Feb 1, 2023
@robertbastian robertbastian deleted the fulltest branch February 9, 2023 14:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants