-
Notifications
You must be signed in to change notification settings - Fork 184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding all keys to testdata #3042
Conversation
Deferring to Shane on this, he has more context |
92ea07e
to
78843a4
Compare
This comment was marked as spam.
This comment was marked as spam.
@@ -34,9 +14,9 @@ fn main() { | |||
.unwrap(); | |||
|
|||
let source_data = SourceData::default() | |||
.with_cldr(paths::cldr_json_root(), CldrLocaleSubset::Full) | |||
.with_cldr_for_tag(&icu_testdata::versions::cldr_tag(), CldrLocaleSubset::Full) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue: Please, please make this read from the repo. A dev should be able to run cargo make testdata
without an internet connection. This has always worked and there's no reason to break it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The repo might not reflect the correct data for this tag. This is more of an issue for the follow-up PR, where we generate testdata at SourceData::LATEST_TESTED_CLDR_TAG
and then write that tag into the metadata. Local data could be wrong because someone forgot a glob, or we just haven't run the download script with the latest tag. I'm also more confident calling the constant LATEST_TESTED_CLDR_TAG
if we actually test with it.
"Download" has been required for many things locally and on CI for quite a while now. Because it caches it's a one-time thing and we haven't had issues with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The solution we've discussed is that there should be a CI action that verifies that the repo data is not stale.
"Download" has been required for many things locally and on CI for quite a while now.
CI yes, but we've kept local development as no-network as possible.
It's also much easier to reason about the data pipeline if I can see exactly what data is being consumed by datagen in order to produce the processed ICU4X output data. Depending on some artifact downloaded from the network is simply not a good idea for the purposes of ICU4X development.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd argue relying on the network is much less of a problem locally than in CI. Then work-on-plane use case is rare, and even then you're gonna have the data in the cache unless you started ICU4X development just before boarding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you disagree with my point about calling the latest-tested-tag tested?
Also remember that I was reluctant to go along with the user-facing version of icu4x-datagen downloading its data automatically; I agreed to that after weighing the following advantages: (1) icu4x-datagen has a lot of flags and we want to make it more accessible, since it's one of the first things ICU4X users need to use, and (2) as we add more data sources, we need to keep older invocations of icu4x-datagen working properly, which means we need to give them default sources. Neither of those points for icu4x-datagen applies here in testdata datagen. |
What's your preferred solution to the final testdata script then? icu4x/tools/testdata-scripts/src/bin/make-testdata.rs Lines 27 to 29 in c2a5392
icu4x/tools/testdata-scripts/src/bin/make-testdata.rs Lines 89 to 90 in c2a5392
Note that we generate testdata for a tag and then write that tag to metadata. Do we just pray that the local data is what would have been downloaded for the tag? |
There should be a CI that verifies that the copy of the data in the repo corresponds to the tag. |
See #2880 |
This is where I fundamentally disagree, keeping state in the repo and having CI verify that the state is correct adds a lot of friction when the state is easy to regenerate. I already filed #3006 about this. |
Also the data will not correspond to the tag, because we only include select files. That itself could hide bugs. |
I see the friction argument for things that change "unexpectedly", like the GN files. But, these are files that don't change often, and when they do, it's normally intentional. If they get out of sync, that's a mistake, not an oversight.
Can you elaborate on the types of bugs this creates? |
I don't understand the difference and how that changes things.
A new file gets added to icuexport but not to the glob. This file breaks some assumption in datagen (which often lists directories) and breaks it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't make testdata smaller, but if adding all keys is the desired outcome, I like this way of doing it
Testdata should contain all keys so that it works with all constructors. For collation maybe it's including too much, but we have to eat our own dogfood here, datagen users cannot at the moment exclude things like |
icu_testdata
currently doesn't include a handful of properties keys. It also doesn't include some collations for locales that it includes other data in.