Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Default impls that load data #5554

Open
sffc opened this issue Sep 18, 2024 · 18 comments · May be fixed by #5958
Open

Remove Default impls that load data #5554

sffc opened this issue Sep 18, 2024 · 18 comments · May be fixed by #5958
Assignees
Labels
C-meta Component: Relating to ICU4X as a whole S-small Size: One afternoon (small bug fix or enhancement)

Comments

@sffc
Copy link
Member

sffc commented Sep 18, 2024

Currently we have a number of impl Default on classes that have a new() function that load data. The data is usually singleton but doesn't need to be (see SentenceSegmenter for instance).

This is bad because:

  1. The contract of impl Default is to fill in default values for fields when there are reasonable default values. But these are opaques, not bag-of-field structs, and impl Default is just not in the right spirit.
  2. The impl Default is gated on the compiled_data feature, which is dangerous since it can be disabled.
  3. impl Default might do work, since it is loading data. It should ideally exist only when POD can be filled in.
  4. It's also very confusing and misleading to allow anyone to write code such as that in TimeZone + ResolvedTimeZone #5549, where an unnamed positional argument that isn't an options bag is elided with &Default::default().

We never had these impls until we added compiled_data constructors, and then we added them only because Clippy was telling us to. I don't think we ever really had a thorough discussion about it.

I hope this is a no-brainer and non-controversial.

@Manishearth @robertbastian @zbraniecki @hsivonen

@sffc sffc added the needs-approval One or more stakeholders need to approve proposal label Sep 18, 2024
@sffc sffc added this to the ICU4X 2.0 ⟨P1⟩ milestone Sep 18, 2024
@robertbastian
Copy link
Member

robertbastian commented Sep 18, 2024

  1. The contract of impl Default is to fill in default values for fields when there are reasonable default values. But these are opaques, not bag-of-field structs, and impl Default is just not in the right spirit.

That's not what that says. It says

A trait for giving a type a useful default value.

Sometimes, you want to fall back to some kind of default value, and don’t particularly care what it is. This comes up often with structs that define a set of options

"This comes up often with structs that define a set of options", but is in no way limited to that. Compiled data is reasonable default data, that's why we're publishing it.

  1. The impl Default is gated on the compiled_data feature, which is dangerous since it can be disabled.

So? Some APIs are gated, and that includes impls.

  1. impl Default might do work, since it is loading data. It should ideally exist only when POD can be filled in.

It only exists when new has no arguments. In that case we're in the singleton path and there is no work.

  1. It's also very confusing and misleading to allow anyone to write code such as that in TimeZone + ResolvedTimeZone #5549, where an unnamed positional argument that isn't an options bag is elided with &Default::default().

Call sites can be written using Foo::default() instead. This applies the same way to all types implementing Default: you can write &mut Default::default() when you need to pass a &mut Vec<T>, that's not a reason for Vec not to implement Default.

@robertbastian
Copy link
Member

The issue title misrepresents what's happening. There is never any "data loading" in a Default impl. Even in SentenceSegmenter which you use as an example, there is not:

impl Default for SentenceSegmenter {
    fn default() -> Self {
        Self::new()
    }
}

impl SentenceSegmenter {
    pub fn new() -> Self {
        Self {
            payload: DataPayload::from_static_ref(
                crate::provider::Baked::SINGLETON_SENTENCE_BREAK_DATA_V2_MARKER,
            ),
            payload_locale_override: None,
        }
    }
}

@Manishearth
Copy link
Member

I don't think I agree with any of these points.

  • The contract of impl Default is to fill in default values for fields when there are reasonable default values. But these are opaques, not bag-of-field structs, and impl Default is just not in the right spirit.

I don't agree with this. I've seen Default for all kinds of types, and the docs seem to agree with that idea, they just give "bag-of-stuff" structs as a prominent example.

  • The impl Default is gated on the compiled_data feature, which is dangerous since it can be disabled.

Seems fine. A lot of things are gated on compiled_data. I don't see Default as special. If people are using the Default impl and it gets disabled, in a different world they'd be calling ::new() and having it be disabled.

  • impl Default might do work, since it is loading data. It should ideally exist only when POD can be filled in.

There's no expectation of Default to be cheap.

  • It's also very confusing and misleading to allow anyone to write code such as that in Concept: TimeZone + FormattableTimeZone #5549, where an unnamed positional argument that isn't an options bag is elided with &Default::default().

This is a rather subjective style choice and i don't see a need to force it on our clients. They can always call Normalizer::default() if they wish to be explicit (as can we). I'm generally not a fan of blocking off ways of writing code for clients because of our own style preferences.

@sffc
Copy link
Member Author

sffc commented Sep 18, 2024

Ok so it's not noncontroversial

I'm reasonably convinced by the counter-arguments on points 1. On point 4, I'm mainly concerned when it is used for passing these things as unnamed positional arguments, but we can avoid that by not adding such APIs. I'm not convinced by the counter-arguments on points 2 and 3, and there are more I'd like to add:

  1. The impl Default is gated on the compiled_data feature, which is dangerous since it can be disabled.

I'll add to this: the fact that it is gated on compiled_data means that it is not a universal fixed default! Like, I don't want people's Default::default() to break when they enable --no-default-features. Default is sometimes used in bounds for other functions, too. Programmers can write code in a style that makes it difficult for them to break out of using the Default impl for a particular type.

  1. impl Default might do work, since it is loading data. It should ideally exist only when POD can be filled in.

I used SentenceSegmenter as an example because it has two fields, one singleton and one non-singleton. Currently the non-singleton has no data in und, but if it did, it would need to be assembled. I recall seeing some code landing in maybe Collator or Normalizer that loaded und data from a key with non-singleton data in a const constructor.

To respond to @Manishearth's "There's no expectation of Default to be cheap": my expectation is that it is cheap. For example, it is used in bounds on things like constructing an array and other operations that where the impl is called a large number of times. Is your argument an "in my experience" argument or "this is a documented convention" argument?

New points:

  1. These Default impls are not stable and are not in fact universal default values. They depend on data version, producing different behavior under cargo update. The fact that they only exist under compiled_data could be used as evidence that we believe that these values are not universal.
  2. Not all types have a pub fn new(). There's nothing "special" separating types that support pub fn new() and pub fn new(&DataLocale). It is inconsistent to have these impls on some types but not others. We could add impl Default that loads data for "und" on all other types to add consistency, but that drives home the point that these should not be considered universally default values.
  3. We never approved as a group of adding the Default impls; we only approved adding pub const fn new() functions. So I would really like to take this discussion from the angle of "we don't have these impls; are they motivated to add?" Other than the clippy lint, what is the motivation?

@sffc
Copy link
Member Author

sffc commented Sep 18, 2024

On the point 7: we have generally been conservative with adding impls to our types, and I think it has served us well. We have a very large API surface, and every impl adds cost. The only impl we agreed to universally add has been impl Debug. We add other impls such as PartialEq, Ord, Hash, Clone, ... only when independently motivated. That yardstick of "independently motivated" should apply here, too.

@Manishearth
Copy link
Member

I'll add to this: the fact that it is gated on compiled_data means that it is not a universal fixed default! Like, I don't want people's Default::default() to break when they enable --no-default-features. Default is sometimes used in bounds for other functions, too. Programmers can write code in a style that makes it difficult for them to break out of using the Default impl for a particular type.

Right, but if they're doing that they presumably need some default, without Default they'd be calling ::new().

"There's no expectation of Default to be cheap": my expectation is that it is cheap.

Okay, you should not hold on to that expectation because it is not one followed by the ecosystem.

My argument is one based on the ecosystem, and the fact that it is not documented as being cheap. I would consider this something that needs to be explicitly documented were it an expectation.

Worth noting that our Default impls still are pretty cheap, baked data loading is extremely fast. They're just not necessarily entirely devoid of runtime behavior.

5. These Default impls are not stable and are not in fact universal default values. They depend on data version, producing different behavior under cargo update. The fact that they only exist under compiled_data could be used as evidence that we believe that these values are not universal.

I don't consider this to be a requirement of the Default trait either.

  • Not all types have a pub fn new(). There's nothing "special" separating types that support pub fn new() and pub fn new(&DataLocale). It is inconsistent to have these impls on some types but not others. We could add impl Default that loads data for "und" on all other types to add consistency, but that drives home the point that these should not be considered universally default values.

I'm open to having some types have pub fn new_und() / pub fn new(&DataLocale) or something and not having Default on them. I think this argument works for those types but not for the general case of singletons.

7. We never approved as a group of adding the Default impls; we only approved adding pub const fn new() functions. So I would really like to take this discussion from the angle of "we don't have these impls; are they motivated to add?" Other than the clippy lint, what is the motivation?

I can provide the motivation for the clippy lint: Default impls enable a whole bunch of convenient functions like mem::take and stuff. It's annoying when you're working with a type that could have a Default but chooses not to.

I definitely do not think this is a case in which the clippy lint is wrong. My position is that deviating from clippy here is something that needs justification, not the new impl.

@sffc
Copy link
Member Author

sffc commented Sep 19, 2024

I can provide the motivation for the clippy lint: Default impls enable a whole bunch of convenient functions like mem::take and stuff. It's annoying when you're working with a type that could have a Default but chooses not to.

Example developer journey: they are using postcard data. They apply mem::take to a field involving a property value or some other type implementing Default. Everything compiles. Then two months later they realize that they should disable the compiled_data feature. But doing so causes mem::take to fail to compile. They don't fully understand why so they leave it alone. In the end, they are linking both compiled data and postcard data, all implicitly.

@Manishearth
Copy link
Member

Then two months later they realize that they should disable the compiled_data feature. But doing so causes mem::take to fail to compile.

This ... seems like exactly what we want to happen? They now are forced to consider where their data should come from. That's good. That should exactly be a factor of the decision to disable the feature.

I don't really think "they don't fully understand so they leave it alone" is that likely. It doesn't seem that confusing. We could add a "enabled by feature foo" tag on the impl if we really want to improve this experience.

@sffc
Copy link
Member Author

sffc commented Sep 19, 2024

This ... seems like exactly what we want to happen? They now are forced to consider where their data should come from. That's good. That should exactly be a factor of the decision to disable the feature.

My point is that they should be forced to consider where their data comes from up front by using an API explicitly documented as "with compiled data", which was key to us agreeing to have the compiled data functions have short, convenient names.

@sffc
Copy link
Member Author

sffc commented Sep 19, 2024

We could add a "enabled by feature foo" tag on the impl if we really want to improve this experience.

The problem with docs on trait impls is that they are not discoverable. Often, Default is used implicitly, as you noted with mem::take, so these docs won't get read often, and by the time they do, the programmer has already accidentally linked the compiled data implicitly.

@Manishearth
Copy link
Member

The problem with docs on trait impls is that they are not discoverable

I mean, if you get a "missing Default impl" when you play with features I don't really see why it would be hard to figure out what happened. The docs on the trait impl are just an added bonus.

I'm not convinced of the sketched developer story being particularly onerous.

the programmer has already accidentally linked the compiled data implicitly.

I think that's fine? For singletons if someone is using a type they almost certainly need the data and 99% of users will get the data from the compiled data; the main benefit of loading data from a file for singletons is if it's some large program that wishes to lazily load everything. I can see a potential additional benefit of wishing to load the latest data.

I want most of our singleton users to be able to just derive(Default) where necessary.

People who care about data to that granularity are going to be caeful about bounds anyway.

Just out of curiosity, this is the current list of gated Default impls:

[15:10:32] मanishearth@manishearth-glaptop2 ~/dev/icu4x ^_^ 
$ rg -U "compiled_data.*\nimpl Default for"
components/normalizer/src/uts46.rs
38:#[cfg(feature = "compiled_data")]
39:impl Default for Uts46MapperBorrowed<'static> {
141:#[cfg(feature = "compiled_data")]
142:impl Default for Uts46Mapper {

components/locale/src/directionality.rs
40:#[cfg(feature = "compiled_data")]
41:impl Default for LocaleDirectionality {

components/locale/src/expander.rs
212:#[cfg(feature = "compiled_data")]
213:impl Default for LocaleExpander {

components/normalizer/src/properties.rs
51:#[cfg(feature = "compiled_data")]
52:impl Default for CanonicalCompositionBorrowed<'static> {
122:#[cfg(feature = "compiled_data")]
123:impl Default for CanonicalComposition {
196:#[cfg(feature = "compiled_data")]
197:impl Default for CanonicalDecompositionBorrowed<'static> {
463:#[cfg(feature = "compiled_data")]
464:impl Default for CanonicalDecomposition {
553:#[cfg(feature = "compiled_data")]
554:impl Default for CanonicalCombiningClassMapBorrowed<'static> {
644:#[cfg(feature = "compiled_data")]
645:impl Default for CanonicalCombiningClassMap {

components/segmenter/src/grapheme.rs
134:#[cfg(feature = "compiled_data")]
135:impl Default for GraphemeClusterSegmenter {

components/segmenter/src/sentence.rs
114:#[cfg(feature = "compiled_data")]
115:impl Default for SentenceSegmenter {

components/locale/src/canonicalizer.rs
199:#[cfg(feature = "compiled_data")]
200:impl Default for LocaleCanonicalizer {

components/casemap/src/titlecase.rs
207:#[cfg(feature = "compiled_data")]
208:impl Default for TitlecaseMapper<CaseMapper> {

components/casemap/src/closer.rs
43:#[cfg(feature = "compiled_data")]
44:impl Default for CaseMapCloser<CaseMapper> {

components/casemap/src/casemapper.rs
41:#[cfg(feature = "compiled_data")]
42:impl Default for CaseMapper {

components/timezone/src/windows_tz.rs
41:#[cfg(feature = "compiled_data")]
42:impl Default for WindowsTimeZoneMapper {

components/timezone/src/zone_offset.rs
19:#[cfg(feature = "compiled_data")]
20:impl Default for ZoneOffsetCalculator {

components/timezone/src/ids.rs
96:#[cfg(feature = "compiled_data")]
97:impl Default for TimeZoneIdMapper {
468:#[cfg(feature = "compiled_data")]
469:impl Default for TimeZoneIdMapperWithFastCanonicalization<TimeZoneIdMapper> {

components/timezone/src/metazone.rs
21:#[cfg(feature = "compiled_data")]
22:impl Default for MetazoneCalculator {
[15:10:45] मanishearth@manishearth-glaptop2 ~/dev/icu4x ^_^ 
$ rg -U "compiled_data.*\nimpl Default for"
components/normalizer/src/properties.rs
51:#[cfg(feature = "compiled_data")]
52:impl Default for CanonicalCompositionBorrowed<'static> {
122:#[cfg(feature = "compiled_data")]
123:impl Default for CanonicalComposition {
196:#[cfg(feature = "compiled_data")]
197:impl Default for CanonicalDecompositionBorrowed<'static> {
463:#[cfg(feature = "compiled_data")]
464:impl Default for CanonicalDecomposition {
553:#[cfg(feature = "compiled_data")]
554:impl Default for CanonicalCombiningClassMapBorrowed<'static> {
644:#[cfg(feature = "compiled_data")]
645:impl Default for CanonicalCombiningClassMap {

components/normalizer/src/uts46.rs
38:#[cfg(feature = "compiled_data")]
39:impl Default for Uts46MapperBorrowed<'static> {
141:#[cfg(feature = "compiled_data")]
142:impl Default for Uts46Mapper {

components/segmenter/src/grapheme.rs
134:#[cfg(feature = "compiled_data")]
135:impl Default for GraphemeClusterSegmenter {

components/segmenter/src/sentence.rs
114:#[cfg(feature = "compiled_data")]
115:impl Default for SentenceSegmenter {

components/locale/src/directionality.rs
40:#[cfg(feature = "compiled_data")]
41:impl Default for LocaleDirectionality {

components/locale/src/expander.rs
212:#[cfg(feature = "compiled_data")]
213:impl Default for LocaleExpander {

components/locale/src/canonicalizer.rs
199:#[cfg(feature = "compiled_data")]
200:impl Default for LocaleCanonicalizer {

components/casemap/src/titlecase.rs
207:#[cfg(feature = "compiled_data")]
208:impl Default for TitlecaseMapper<CaseMapper> {

components/casemap/src/casemapper.rs
41:#[cfg(feature = "compiled_data")]
42:impl Default for CaseMapper {

components/casemap/src/closer.rs
43:#[cfg(feature = "compiled_data")]
44:impl Default for CaseMapCloser<CaseMapper> {

components/timezone/src/ids.rs
96:#[cfg(feature = "compiled_data")]
97:impl Default for TimeZoneIdMapper {
468:#[cfg(feature = "compiled_data")]
469:impl Default for TimeZoneIdMapperWithFastCanonicalization<TimeZoneIdMapper> {

components/timezone/src/windows_tz.rs
41:#[cfg(feature = "compiled_data")]
42:impl Default for WindowsTimeZoneMapper {

components/timezone/src/metazone.rs
21:#[cfg(feature = "compiled_data")]
22:impl Default for MetazoneCalculator {

components/timezone/src/zone_offset.rs
19:#[cfg(feature = "compiled_data")]
20:impl Default for ZoneOffsetCalculator {

@sffc sffc added the discuss-priority Discuss at the next ICU4X meeting label Sep 21, 2024
@sffc
Copy link
Member Author

sffc commented Sep 21, 2024

I think the fundamental disagreement is: we agree as a TC that offering developers a way to switch data sources is fundamental, but we don't agree on the degree to which opting into specific data sources can and should impact developer ergonomics.

I have taken the position that, given its importance to ICU4X's value proposition, we should generally favor being explicit about data sources, even if it takes a hit on developer ergonomics.

The decision to name constructors .new() instead of .new_with_compiled_data() was contrary to that position, but I agreed to it based on two notions: (1) as the first experience with ICU4X and the primary entrypoint into our APIs, this is a uniquely high-impact area for developer ergonomics, and (2) we document the constructors as saying "with compiled data" in the first line, and always link to the docs page explaining what that means.

If I had to guess, @Manishearth is likely to take the position that Default impls are, indeed, a high-impact area for developer ergonomics. I'm happy to debate that point, but, from my perspective, it is moot, because we don't have an alternative on the table for developers making explicit decisions about data sources. With .new(), we were able to move "with compiled data" from the function name to the docs, which was an acceptable though less-than-ideal alternative. We don't have that luxury with Default because its docs are not discoverable and because it is often used implicitly.

@sffc
Copy link
Member Author

sffc commented Sep 21, 2024

Based on the above, I could agree to this policy:

  1. By default, we don't add Default impls that rely on compiled data, because doing so causes compiled data to be used implicitly, going against the value proposition of the project
  2. A Default impl could be added if one of the following cases can be made:
    • The ICU4X class in question uses data that is uniquely static in nature, such that a typical user would choose to use compiled data for it even if they use dynamic data in other parts of their application. (Potential example: stable properties, maybe other properties-driven things like normalizer)
    • The ICU4X class in question is uniquely commonly used in contexts where a Default impl is required for ergonomics. (I can't think of examples of this right now)

@Manishearth
Copy link
Member

TLDR: I think the proposed policy works but I think we should talk about what we think of as singletons here a bit more


I think the fundamental disagreement is: we agree as a TC that offering developers a way to switch data sources is fundamental, but we don't agree on the degree to which opting into specific data sources can and should impact developer ergonomics.

Okay, thanks for sketching this out, and for the proposed policy.

I think I see the disagreement here, and it's subtler and in two parts.

ICU4X's initial design didn't include baked data, and as such it was extremely important we gave people a lot of control over switching data sources. Of course, control over data is a core value proposition, but the way in which that control works has changed over time.

Specifically, with the advent of baked data, we've gone out of our way to make baked data easy to use. We could have made baked data need a BakedProvider, but instead we gave everyone convenient constructors, to the extent of sometimes using special Borrowed types to make the baked experience even better. Now, this choice does not necessarily imply that we wish to make baked "as easy to use as possible", there can of course be limits, but it was a signal.

The impression I had, at least from the way prior discussions have gone and especially points made by @robertbastian around the topic of baked data, was that for singletons at least we no longer considered control over data to be so central to the design, mostly because "it makes sense". Rather, we consider it crucial that there be some way to have control over data, and implicitly recognized that people who want control over data for singletons were an ultimately niche (but important!) use case.

This was the position I was coming from when recently discussing #5440, where my preference was that the "main entry type" for singletons be the baked, borrowed one, with the owned version having the qualified name because it was more niche.

Anyway, hopefully that explains where I stand on "offering developers a way to switch data is fundamental". I think it's fundamental, as you said, but I think specifically for singletons, it is a niche use case. I was also under the impression that this was where everyone else roughly stood, at the very least @robertbastian but I think I've expressed opinions based on this in the past and not received pushback from others. It's potentially worth reaffirming precisely where we stand there. I'm not opposed to a position that data loading is non-niche for every type, that's just not my default (heh) position right now

Secondly, and here is where it gets a bit interesting, I think perhaps we've been dancing around the topic of what is truly a "singleton". Previously in the discussion you brought up SentenceSegmenter and I said:

I'm open to having some types have pub fn new_und() / pub fn new(&DataLocale) or something and not having Default on them. I think this argument works for those types but not for the general case of singletons.

which I was me considering SentenceSegmenter to not be a singleton, though not saying so explicitly perhaps. It's possible that a lot of our types with Default fall into a similar bucket! As I mention later below, I think timezone stuff may be similar.

In your new proposal, you say this:

The ICU4X class in question uses data that is uniquely static in nature, such that a typical user would choose to use compiled data for it even if they use dynamic data in other parts of their application. (Potential example: stable properties, maybe other properties-driven things like normalizer)

I think, to me, this is where I'm coming from when I'm talking about "singletons", both previously in this thread and in the first half of this post. True singletons, to me, are ones where there is only one possible "best" value, which means runtime data loading is going to be extremely rare. The main use cases for those are:

  • Cases where you are on an embedded device and need to save space, and the part of the program loading the singleton is optional. Very important usecase; but also not what I perceive to be the bulk of our clients. I could be wrong, it is certainly a type of client where we are the only option so it behooves us to be somewhat cognizant of their needs.
  • Cases where you really really care about pinning a Unicode version. I think this is largely tools like unicodetools, which are ... important, but "tool used by people who work on Unicode" is quite niche 😄
  • Staying up to date with the latest Unicode version, faster than ICU4X releases. I'm not actually clear if this is an actual problem.

I feel like properties, casemapper, and the auxiliary normalizer types fit in this . The main normalizer types have a split NFC/NFKC constructor, they're ineligible. The properties set types also are not eligible, because we type erase them and have a generic constructor. Locale directionality probably fits here too, and ... maybe segmenters? Probably not? I'm not sure about that one but curious as to what others think.

I think timezone stuff doesn't really fit this: I think "load timezone from system tz files" is going to be a very common use case.

There are more in the list I posted, but this should be a rough sketch of my position. I suspect a productive next step for us is to agree on the policy you stated and work on figuring out what it actually means for something to be a "true singleton" (yes, I recognize that this is a confusing distinction I have potentially just made up, I'd love to hear terminology ideas)

@sffc
Copy link
Member Author

sffc commented Sep 26, 2024

Some discussion:

  • @zbraniecki I'm not sure I have a strong opinion on this, but in my experience, when modifying data, people want to either add locales or modify specific types of data. It doesn't seem like modifying calendar behavior is a high priority use case.
  • @Manishearth It sounds like your lithmus test is "how likely would people want to patch data" which is a good pperspective. Time zones are a great example of where people reasonably want to switch sources. Properties and case mapping are examples where... there could be use cases such as wanting to pin a particular Unicode version, but that's a more niche case. I can see a use case for patching data in segmenter. Time zones release faster than we update data.

@sffc
Copy link
Member Author

sffc commented Oct 2, 2024

Here's another point. A Default impl could be one of two things: with-default-data or without-data. We have a small number of classes, such as LocaleFallbacker, that work in both modes.

The argument "it has a new function so clearly there is a default value" is really precarious to apply here because I still stand by my position that ::new() in a pure-principles world would have been named ::new_with_compiled_data() but due to other factors in play we decided on ::new(), but not because it was always considered to be the default value.

@Manishearth
Copy link
Member

it has a new function so clearly there is a default value

I think this is a misrepresentation of my position, I don't think you're arguing against something anyone else here agrees with. I explicitly said early in the discussion:

I'm open to having some types have pub fn new_und() / pub fn new(&DataLocale) or something and not having Default on them. I think this argument works for those types but not for the general case of singletons.

I think those are some cases where we should perhaps not have a fn new() in the first place (as you say), but even if we do it is one where I agree there is no single sensible Default.

@robertbastian
Copy link
Member

robertbastian commented Oct 23, 2024

  • @sffc Default should be no different from any other impl we add
  • @zbraniecki: Aligned

Two positions:

  • @robertbastian: If it has new() and not also new_<variant>() it should have Default
  • @sffc: Functions where data provenance is unimportant to most developers, for example:
    • The data only evolves forward such as Unicode property data
    • Precomputed/optimization stuff like calendar data
    • Data where clients who dynamically load data would still in this case use compiled data

More discussion:

  • @robertbastian: Data provenance is unimportant for most devs
    • (manish agrees)
  • @sffc We could put clients in 3 broad buckets: users who don't care at all about data provenance, clients who care about it in some cases, and clients who care about it in all cases. Clients in that middle category would include clients who want to decouple their CLDR updates from ICU4X updates because of stability.
  • @robertbastian: But who are these clients? Where do category 2 clients draw the line? This is very subjective
  • @sffc Clients who want stability in their external-facing behavior or test cases.
  • @zbraniecki: If something is marked as Default, then it should stay the same.
  • @Manishearth That's not what the Default definition states.
  • @robertbastian - i18n libraries are in this weird state where behaviour changes are expected. Our new methods change behaviour, so it's fine for our Default impls to also change behaviour, even if in other libraries it shouldn't
  • @sffc I am concerned with the line being "there are multiple new functions" because we might add more new functions.
  • @robertbastian - it should be "multiple new functions that load different compiled data". For these we know the axes ahead of time
  • @sffc For example, Normalizer has NFC and NFKC. Maybe there's a new normalization mode added which is better than the legacy ones.
  • @robertbastian we cannot plan for such fundamental changes in the Unicode spec
  • @sffc Concrete example of this. Previously, SentenceSegmenter had a single constructor. Now it has an additional constructor that takes options. This was not due to a change in the spec; it was due to a feature request.
  • @zbraniecki This seems like the Locale Ord discussion. Some things have Ord, some don't. It seems like we can have a discussion on a case-by-case basis.
  • @sffc I'm okay with a conclusion where we don't have a litmus test and just look on a case-by-case basis.
  • @robertbastian That just seems like a lot of extra overhead on us.
  • @Manishearth There are various cases where people want to patch data. That's a useful test.

Types Shane thinks should not be defualt:

  • All segmenters, at least Sentence, Word, and Line
  • Grapheme should probably take an options bag or specify it is Extended
  • LocaleExpander (+ LocaleCanonicalizer and LocaleDirectionality) since people add nonstandard locales (XK, for example)
  • "default is fine but suspect" because data can change for existing characters
    • CaseMapper
    • CCC
    • UTS 46 (maybe doesn't change at all?)

What if: for these types we should call the functions new_something or have options on the new().

(@sffc and @Manishearth like this)

  • @sffc I'm actually more happy with renaming new functions because it makes data provenance even more explicit where it matters.

Conclusion:

On a case by case bases, if a thing is in a situation where it is:

  • Sufficiently unstable across CLDR or Unicode releases (not just small bugfixes). Note: behavior that is derived from only the UCD is generally considered to be sufficiently stable.
  • Liable to be patched (tailoring, custom locales, etc)

we do not provide a Default. We also strongly consider not providing a new(/* no arguments */). We may instead provide a new_foo() or a new(options).

Concrete next steps for 2.0:

  • Remove Default impls for segmenters, locale thingies.
  • Remove new() from the word/line/sentence segmenter, instead rename the options bag one to new(options).
  • File issues to defer discussions on concrete function names.

Agreed: @Manishearth @sffc (@robertbastian) @zbraniecki

Additional action items:

  • @Manishearth File an issue for the future of Grapheme Segmenter and if it wants to take an options bag, or have new_extended(). This does not block 2.0. We may deprecate, or may fix before 2.0.
  • @Manishearth File an issue for the ctors for LocaleExpander (etc) to see if we want them to be called new_root() new_cldr() etc. This does not block 2.0. We may deprecate, or may fix before 2.0.
  • @sffc Add a line about ctor signatures to the release checklist.

Discussion:

  • @robertbastian I'm not happy with this line. It seems highly arbitrary.
  • (additional discussion about what qualifies as explicit vs implicit)
  • @zbraniecki Maybe we just go extreme and forbid Default everywhere?
  • @Manishearth Then we fall back into the Rust conventions trap. It's really annoying when a Default doesn't exist.
  • @zbraniecki If it's valuable enough to have a Default, then we should write an explicit impl.
  • @sffc I think there are 2 things there. One, Default impls that call new() and give docs (I don't think that changes the categorization of Default being implicit). Two, Default impls should use our own data, not derived from external sources. That would reduce the number of types having new() to basically zero.
  • @sffc It's already arbitrary whether or not things haver new() functions.
  • @Manishearth Yes; Normalizer is an example of that.
  • @robertbastian That could be considered a mistake. Normalizer should be redesigned to be like segmenter, with separate types that share internals
  • @Manishearth Next time this comes up, let's re-evaluate the tentative rules

@sffc sffc added C-meta Component: Relating to ICU4X as a whole and removed discuss-priority Discuss at the next ICU4X meeting needs-approval One or more stakeholders need to approve proposal labels Oct 31, 2024
@sffc sffc added this to icu4x 2.0 Oct 31, 2024
@sffc sffc moved this to Unclaimed for sprint in icu4x 2.0 Oct 31, 2024
@Manishearth Manishearth self-assigned this Jan 7, 2025
Manishearth added a commit to Manishearth/icu4x that referenced this issue Jan 7, 2025
@Manishearth Manishearth added the S-small Size: One afternoon (small bug fix or enhancement) label Jan 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-meta Component: Relating to ICU4X as a whole S-small Size: One afternoon (small bug fix or enhancement)
Projects
Status: Unclaimed for sprint
3 participants