Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The standard as is right now is unfriendly / unusual for tech stacks that are "native utf-16" #895

Closed
mihnita opened this issue Sep 19, 2024 · 12 comments
Assignees
Labels
LDML46.1 MF2.0 Draft Candidate Preview-Feedback Feedback gathered during the technical preview resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. syntax Issues related with MF Syntax

Comments

@mihnita
Copy link
Collaborator

mihnita commented Sep 19, 2024

The rule for content-char currently looks like this:

content-char = %x01-08        ; omit NULL (%x00), HTAB (%x09) and LF (%x0A)
             ...
             / %x3001-D7FF    ; omit surrogates
             / %xE000-10FFFF
             ...

That is unusual for languages that use UTF-16 natively, like JavaScript, Java, and even the "wide version" of the Windows C APIs (using wchar_t, that is 16 bits on Windows)

Such languages try to enforce utf-16 correctness (the same way C/C++ don't try to enforce utf-8 or any other kind of utf correctness).

Any validation is done "at the edge", when data is ingested, if at all.

Existing APIs that are similar to MessageFormat 2 work just fine with incorrect surrogate sequences.

@Test
public void testBadSurrogates() {
  dumpHex(String.format("\uda02 %d \udc02", 42));
  dumpHex(java.text.MessageFormat.format("\uda02 {0} \udc02", 42));
  dumpHex(com.ibm.icu.text.MessageFormat.format("\uda02 {0} \udc02", 42));
}

private void dumpHex(String str) {
  str.chars().forEach(c -> System.out.printf(" %04X", c));
  System.out.println();
}

The code above does not throw, and the result preserves the surrogates "as is".
The output looks like this:

 DA02 0020 0034 0032 0020 DC02
 DA02 0020 0034 0032 0020 DC02
 DA02 0020 0034 0032 0020 DC02

The current restriction also contradicts what was agreed in this thread:

RGN: Surrogate code points. Those are code points reserved for representing code points in UTF-16 that are beyond the first plane (BMP) of 2^16 code points.

MIH: I understand what you mean. But we also implement this is C and Java, and so on. So what should we do if we receive a message with invalid UTF-8 code points. Do we expect to replace them with the replacement character, or do we just pass them through?

RGN: I think what you're asking about, using JavaScript as a concrete example, is that a JS string is allowed to have unpaired surrogates. So the question is a question for the JS adapter / implementation, but that's not a question for the standard itself.

MIH: So we leave it to the implementation?

RGN: Yes.

MIH: Okay, that is fine with me.

https://github.com/unicode-org/message-format-wg/blob/main/meetings/2022/notes-2022-06-13.md

@catamorphism
Copy link
Collaborator

Just for reference, the PR that introduced this requirement is #290 (from August 2022).

@mihnita
Copy link
Collaborator Author

mihnita commented Sep 19, 2024

I agree that unpaired surrogates are invalid in UTFs.
But some programming languages don't care about that, and they make no guarantees that their strings are UTF-16 correct.

MessageFormat 2 is an advanced form of "take this string with markers inside, and replace the markers with something I give you at runtime".

It should not be in the business of enforcing UTF correctness, or any other kind of correctness.
We don't we don't try to prevent the use of non-characters (U+?FFFE and U+?FFFF),
or U+0001–U+0008, U+000B–U+000C, U+000E–U+001F because they are invalid in XML 1.0,

If devs want to be strict, they can enforce it through linters, or in the storage format.

@aphillips aphillips added the Preview-Feedback Feedback gathered during the technical preview label Sep 19, 2024
@aphillips
Copy link
Member

(as an individual contributor)

I agree that encoding/UTF considerations don't belong in our specification, because we are not a storage format. Disallowing standalone surrogates is mostly a Good Thing, since non-Unicode encodings can't do anything (other than replace them) and UTF-8 can't encode them (except by exceptional pleading). We shouldn't make the mistake of, in the course of fixing UTF-16 (really "UCS-2-like"), support that we require UTF-8 based or USV String based implementations to do hokey things.

The spec is only marginally unfriendly to UCS-2 implementations, though. The ABNF and spec say that surrogate code points are not permitted in text or literal. I suspect the best approach here would be to allow "UCS-2" implementations to not enforce unpaired surrogate restrictions (or, more importantly, require them to be checked for in text). Along the lines of:

Implementations are not required to check for unpaired surrogate code points in text or literals.

[!NOTE]
Some implementations, in languages such as Java or JavaScript, use strings composed of
16-bit code units.
See for example Infra.
Such implementations do not check for unpaired surrogate code points,
even though these do not validly encode any character.
Such implementations are conformant, even though the grammar does not permit
these code points.

@aphillips aphillips added syntax Issues related with MF Syntax LDML46.1 MF2.0 Draft Candidate labels Sep 19, 2024
@aphillips
Copy link
Member

(chair hat) I have tagged this for post-46.

@mihnita
Copy link
Collaborator Author

mihnita commented Sep 23, 2024

Disallowing standalone surrogates is mostly a Good Thing

100% agree.
This is something I would definitely enforce in a lint rule.
But not in this kind of spec.

The spec is only marginally unfriendly to UCS-2 implementations, though.
The ABNF and spec say that surrogate code points are not permitted in text or literal.

Surrogates are excluded from content-char and name-start
Meaning simple-start-char and text-char and quoted-char
So they are excluded from simple-message & pattern.

Which means I can't even do Hello \uD800 world!

And I can't have it in name and identifier.
So I am not really sure where can I have it.

@mihnita
Copy link
Collaborator Author

mihnita commented Sep 23, 2024

(chair hat) I have tagged this for post-46.

I have no problem with that.
Thank you!

@aphillips
Copy link
Member

Surrogates are excluded from content-char and name-start Meaning simple-start-char and text-char and quoted-char So they are excluded from simple-message & pattern.

Which means I can't even do Hello \uD800 world!

And I can't have it in name and identifier. So I am not really sure where can I have it.

Be careful: this sword is sharp on both edges.

I don't know what practical use a string like Hello \uD800 world! has. Any message with unpaired surrogates faces ruin if it meets a UTF-8 encoder (such as (de)serializing it to/from a resource file) or in any number of tools. It doesn't mean anything different from and displays just like Hello \uFFFD world!.

Allowing unpaired surrogates means requiring support for them in the productions in languages that use byte-oriented (e.g. UTF-8) strings. If we allow unpaired in name then one has the problem of referring to values such as $\uD800 or invoking functions like :\uD800.

I'm somewhat sympathetic to allowing unpaired surrogates in text or, rather, to not checking if any appear in text. Permitting (which means requiring support for) their use elsewhere seems like something I'd rather impose on UTF-16 implementations.

@mihnita
Copy link
Collaborator Author

mihnita commented Sep 23, 2024

I am not necessarily arguing to allow them in name.
Or that they have a good use case.

It is about "going against the grain" for some platforms.
Even byte-oriented languages (often C/C++) treat strings as "a bunch a bytes, which just happen (or not) to be utf-8".

But we can shave this yak after LDML-46 :-)

@aphillips aphillips added the Agenda+ Requested for upcoming teleconference label Sep 29, 2024
@aphillips aphillips added Action-Item Action item assigned by the WG and removed Agenda+ Requested for upcoming teleconference labels Oct 7, 2024
@aphillips
Copy link
Member

In the 2024-10-07 call, we agreed that @mihnita would make a PR adding unpaired to text-char with appropriate wording.

@mihnita
Copy link
Collaborator Author

mihnita commented Oct 11, 2024

In the 2024-10-07 call, we agreed that @mihnita would make a PR adding unpaired to text-char with appropriate wording.

We agreed that I will create the PR. Not on the exact implementation.

text-char or something else is an implementation detail.

I think that the last quote from EAO leaves that door open:

Let’s see what MIH comes up with and go from there

--

Scanning the notes it looks like the intent is really to allow surrogates in text, and not in code:

AAP: ... change at least the content-char in text to allow for unpaired surrogate values in there.

AAP: ... But disallowing them in names and other things is responsible.

EAO: allow for unpaired surrogates in content-char but only there

RCH: Mostly I wanted it nailed down. ... Nailing down names is acceptable to me, I don’t know why someone would want the names to be non-conforming,

mihnita added a commit to mihnita/message-format-wg that referenced this issue Oct 20, 2024
aphillips added a commit that referenced this issue Oct 22, 2024
* Allow surrogates in content, issue #895

* Grammar and typos, linkify terms, make into a note, and fix 2119 keywords

Thanks Addison!

Co-authored-by: Addison Phillips <[email protected]>

* Not using "localizable elements"

Co-authored-by: Addison Phillips <[email protected]>

* Keep syntax.md in sync with message.abnf

* Added note about surrogates to quoted literals

* Moved the note about surrogates from Security Considerations to The Message

* Update spec/syntax.md

* Update spec/syntax.md

* Italicize  in a couple of places

* Implemeted more (all?) feedback from review

---------

Co-authored-by: Addison Phillips <[email protected]>
aphillips added a commit that referenced this issue Oct 26, 2024
* Create notes-2024-08-19.md

* Accept attributes design & remove spec note (#845)

* Accept attributes design & remove spec note

* Disallow duplicate attribute names (closes #756)

* Add link to contextual options PR

* Add more prose to tag example text

Co-authored-by: Addison Phillips <[email protected]>

* Mention attribute validity condition in the **_valid_** definition

---------

Co-authored-by: Addison Phillips <[email protected]>

* Update selection-declaration design doc based on mtg / issue discussion (#867)

* Add tests for pattern selection (#863)

* Add tests for pattern selection

* Add missing errors

* Apply suggestions from code review

Co-authored-by: Addison Phillips <[email protected]>

---------

Co-authored-by: Addison Phillips <[email protected]>

* Add Duplicate Variant to table in test/README.md (#861)

* Add new selection-declaration alternative: Require annotation of selector variables in placeholders (#860)

* Add new selection-declaration alternative: Require annotation of selector variables in placeholders

* Improve examples

* Switch example order

* Update the stability policy (#834)

* Update the stability policy

Based on discussion in the 2024-07-22 call and in PR #829, update the stability policy.

* A deeper, more thorough rewrite

- Standardizes the phrasing completely.
- Moves all potential future changes (which are not, after all, stability policies) to an "important" block
- Removes duplication
- Separates functions, options, and option values into separate guarantees
- Clarifies the note about formatting changing over time

* Update spec/README.md

Co-authored-by: Tim Chevalier <[email protected]>

* Update spec/README.md

Co-authored-by: Eemeli Aro <[email protected]>

* remove well-formed

* Update spec/README.md

---------

Co-authored-by: Tim Chevalier <[email protected]>
Co-authored-by: Eemeli Aro <[email protected]>

* Refine error handling text (#816)

* Refine error handling text

* Apply suggestions from code review

Co-authored-by: Addison Phillips <[email protected]>

* Update fallback text

* Turn bullet point list into paragraphs

* Be more mighty

Co-authored-by: Addison Phillips <[email protected]>

---------

Co-authored-by: Addison Phillips <[email protected]>

* Create notes-2024-08-26.md

* Select "Match on variables instead of expressions" for selection-declarations (#824)

* Select "Match on variables instead of expressions" for selection-declarations

* Add hybrid option to selection-declaration.md (#870)

* Add hybrid option to selection-declaration.md

* Update selection-declaration.md

fixed glitch in original edit

* Update selection-declaration.md

* Apply suggestions from code review

Fixing typos

Co-authored-by: Addison Phillips <[email protected]>

* Update selection-declaration.md

* Update exploration/selection-declaration.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update exploration/selection-declaration.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update exploration/selection-declaration.md

Co-authored-by: Eemeli Aro <[email protected]>

---------

Co-authored-by: Addison Phillips <[email protected]>
Co-authored-by: Eemeli Aro <[email protected]>

* Update selection-declaration.md

---------

Co-authored-by: Mark Davis <[email protected]>
Co-authored-by: Addison Phillips <[email protected]>

* Fix "Allow immutable input declarative selectors" example (#874)

* Update README.md (#875)

* Update README.md

* Update README.md

* [DESIGN] Update bidi design document to show proposed design (#871)

* [DESIGN] Update bidi design document to show proposed design

The design I actually think we should adopt is the "hybrid approaches" one. This is a necessary first step on the highway to UAX31 compliance and I think is responsibly contained/managed. It is a hybrid approach, in that it permits testable strict implementations to be created (particularly for message serialization).

This PR consists of moving text around. I added one "pro" to one option also.

* Address comments

* Miscellaneous test fixes (#862)

* Add missing expected bad-selector errors

* Fix expected parts for unsupported-statement test

* Add a few new tests for leading-whitespace and duplicate-variant

* Add tests for escaped-char changes made in #743

* Fix tests for attributes with variable values

* Update contributing and joining info (#876)

* Update contributing and joining info

* Update README.md

* Update CONTRIBUTING.md

* Restore CLA copy

* Clarify error & fallback handling (#879)

* Clarify error & fallback handling

* Apply suggestions from code review

Co-authored-by: Addison Phillips <[email protected]>

* Select last rather than first attribute

* Drop mention of "starting with Pattern Selection"

* Attributes can't change the formatted output

* Use "nor" instead of "or" regarding attribute restrictions

---------

Co-authored-by: Addison Phillips <[email protected]>

* Clarify rule selection (#878)

* Clarify rule selection

Fixes #868 

This adds normative SHOULD language to using CLDR plural and ordinal data, which was intended originally.

- clarifies that keyword selection follows exact match
- clarifies the purpose of rule-based selection
- makes non-CLDR-based implementation permitted

* Update spec/registry.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update spec/registry.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update spec/registry.md

Co-authored-by: Eemeli Aro <[email protected]>

---------

Co-authored-by: Eemeli Aro <[email protected]>

* [DESIGN] Maintaining the Standard, Optional and Unicode Namespace Function Sets (#634)

* Design doc to capture registry maintenance

* Update maintaining-registry.md

* Update exploration/maintaining-registry.md

Co-authored-by: Tim Chevalier <[email protected]>

* Update exploration/maintaining-registry.md

Co-authored-by: Tim Chevalier <[email protected]>

* Add user stories, small updates to RGI

* Update exploration/maintaining-registry.md

* Adding additional detail

* Remove machine readable registry; update prose

* Update maintaining-registry.md

* Further development work

* Update to change format and naming

Per the 2024-08-19 call, we decided to switch towards a specification-per-function model, with statuses. This commit includes the initial set of changes to try and implement this.

* Address some comments.

---------

Co-authored-by: Tim Chevalier <[email protected]>

* Create notes-2024-09-09.md

* Fix a typo in an example (#880)

The upcoming work to implement resolved value might make this patch unnecessary or obsolete, but fixing the typo (missing `{`/`}` around the variable in the pattern) just in case

* Remove forward-compatibility promise and all reserved & private syntax (#883)

* Remove forwards compatibility from stability guarantee

* Drop reserved statements and expressions

* Drop private-use annotations

* Update tests

* Clarify that deprecation is not removal

* Match on variables instead of expressions (#877)

* Match on variables instead of expressions

* Apply suggestions from code review

Co-authored-by: Addison Phillips <[email protected]>

* Apply suggestions from code review

* Add missing test changes noticed during implementation

* Empty commit to re-trigger CLA check

---------

Co-authored-by: Addison Phillips <[email protected]>

* Create notes-2024-09-10.md

* Add bidi support and address UAX31/UTS55 requirements (#884)

* Add bidi support and address UAX31/UTS55 requirements

Adds the bidi strong marks ALM, RLM, and LRM plus the bidi isolate controls LRI, RLI, FSI, and PDI to the syntax.

Formally defines optional vs. non-optional whitespace.

Non-optional whitespace must include at least one whitespace character. Optional whitespace may contain only bidi marks (which are invisible)

* Update syntax.md including text from previous PR

* Repair the guidance on strongly directional marks

Include ALM and better specify how to use the marks.

* Fix formatting of the "important"

* Add bidi characters to description of whitespace.

* Permit bidi in a few more places

Add optional whitespace at the start of `variant`

Add optional whitespace around `quoted-pattern`

These changes result in allowing bidi around keys and quoted patterns as intended.

* Update syntax.md ABNF

* Update formatting.md

- Add a note about the difference between formatting and message syntax.
- Clarify the sentence about message directionality.

* Address comment about name/identifier

* Address comments related to bidi in `name`

* Fix variable's location

* Address comment about the list of LRI/PDI targets

* One character typo :-P

* Update spec/syntax.md

Co-authored-by: Eemeli Aro <[email protected]>

* Address comments about rule R3a-1

* Update spec/syntax.md

Co-authored-by: Eemeli Aro <[email protected]>

* Address comment about U+061C

* Change [o]wsp => `o` or `s`

* Match syntax spec to abnf

* Remove *

* Update syntax.md

* Update spec/syntax.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update spec/message.abnf

Co-authored-by: Eemeli Aro <[email protected]>

* Update spec/message.abnf

Co-authored-by: Eemeli Aro <[email protected]>

* Update syntax.md

* Update spec/message.abnf

Co-authored-by: Eemeli Aro <[email protected]>

* Update spec/syntax.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update spec/syntax.md

Co-authored-by: Eemeli Aro <[email protected]>

---------

Co-authored-by: Eemeli Aro <[email protected]>

* Specify `bad-option` for bad digit size option values (#882)

* Specify `bad-option` for bad digit size option values

Fixes #739

* adopt 'non-negative integer'

* Create notes-2024-09-16.md

* Address name and literal equality (#885)

* Address name and literal equality

This change defines equality as discussed in the 2024-09-09 teleconference in the following ways:

- It defines _name_ equality as being under NFC
- It defines _literal_ equality as explicitly **not** under NFC
- It moves _name_ before _identifier_ in that section of text to avoid a forward definition.

Note that this deviates from discussion in 2024-09-09's call in that we didn't discuss literals at length. It also doesn't discuss non-name/non-literal values, which I'll point out are limited to ASCII sequences such as keywords.

* Typo fix

* Add a note about not requiring implementations to actually normalize

* Implement changes dicussed in 2024-09-16 call.

- Make _key_ require NFC for uniqueness/comparison
- Add a note about NFC
- Make _literal_ **_not_** define equality
- Make text in _name_ identical to that in _key_ for consistency

* Update formatting.md to include keys in NFC

* Address comments

* Update spec/syntax.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update spec/syntax.md

Co-authored-by: Eemeli Aro <[email protected]>

---------

Co-authored-by: Eemeli Aro <[email protected]>

* Update list of normative changes during the LDML45 period (#890)

* Fix typos in data-model-errors tests (#892)

Fix #886

* Update note on exact numeric match for v46 (#891)

Addresses #887 

Non-normative changes to the notes specifically part of LDML46

* Fix attribute value to be literal (#894)

Fixes #893

* Create notes-2024-09-30.md

* Add Resolved Values and Function Handler sections to formatting (#728)

* Add Resolved Values section to formatting

* Apply suggestions from code review

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: Tim Chevalier <[email protected]>

* Linkify "resolved value"

* Add some examples & explicitly allow wrapping input values

* No throw, only emit

Co-authored-by: Tim Chevalier <[email protected]>

* Add section on Function Handlers, defining the term

* Apply suggestions from code review

* Rephrase initial resolved value definition

* Update spec/formatting.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update resolved value definition again

Co-authored-by: Addison Phillips <[email protected]>

---------

Co-authored-by: Tim Chevalier <[email protected]>
Co-authored-by: Addison Phillips <[email protected]>

* Define function composition for :number and :integer values (#823)

* Define function composition for :number and :integer values

* Apply suggestions from code review

Co-authored-by: Addison Phillips <[email protected]>

* Add operand option priority example

* Add apostrophes'

Co-authored-by: Tim Chevalier <[email protected]>

* Update spec/registry.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update spec/registry.md

Co-authored-by: Eemeli Aro <[email protected]>

---------

Co-authored-by: Addison Phillips <[email protected]>
Co-authored-by: Tim Chevalier <[email protected]>

* Create notes-2024-10-07.md

* Apply NFC normalization during :string key comparison (#905)

* Apply NFC normalization during :string key comparison

* Add link to UAX#15

Co-authored-by: Addison Phillips <[email protected]>

---------

Co-authored-by: Addison Phillips <[email protected]>

* Add tests for changes due to bidi/whitespace (#902)

* Add tests for changes due to bidi/whitespace

* Correct output

* Make erroneous test a syntax error

* Define function composition for date/time values (#814)

* Define function composition for date/time values

* Apply suggestions from code review

Co-authored-by: Stanisław Małolepszy <[email protected]>

* Drop the "only"

* Update spec/registry.md

* Update spec/registry.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update spec/registry.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update spec/registry.md

Co-authored-by: Eemeli Aro <[email protected]>

* Make :date and :time composition implementation-defined

---------

Co-authored-by: Stanisław Małolepszy <[email protected]>
Co-authored-by: Addison Phillips <[email protected]>

* DESIGN: Add alternative designs to the design doc on function composition (#806)

* DESIGN: Add a sequel to the design doc on function composition

This document sketches out some alternatives for the machinery
provided to enable function composition.

The goal is to provide an exhaustive list of alternatives.

* Remove 'part 2' document and move contents to the end of part 1

* Revise introduction to reflect the changed goal

* Edited for conciseness

* Further edits for conciseness

* Give a name to InputType and use it

* Refer to motivating examples

* Update function-composition-part-1.md status

Per 2024-10-14 telecon

* Create notes-2024-10-14.md

* Add test for :integer and :number composition (#907)

* Fix `:integer` option `useGrouping` values (#912)

I noticed that `:integer` does not include the "never" value for the option `useGrouping`. This is a bug.

* Drop syntax note on additional bidi changes (#910)

Drop syntax note on addition bidi changes

* Add tests for changes due to #885 (name/literal equality) (#904)

* Add tests for changes due to #885 (name/literal equality)

* Update test/tests/functions/string.json

Co-authored-by: Eemeli Aro <[email protected]>

* Update test/tests/syntax.json

Co-authored-by: Eemeli Aro <[email protected]>

* Update test/tests/functions/string.json

Co-authored-by: Eemeli Aro <[email protected]>

* Added tests for reordering and special case mapping

* Add another selection test

---------

Co-authored-by: Eemeli Aro <[email protected]>

* Add u: options namespace (#846)

* Move spec/registry.md -> spec/registry/default.md

* Add Unicode Registry definition

* Refer to BCP47, add note about only requiring normal tags

* Call it a namespace

* Apply suggestions from code review

Co-authored-by: Addison Phillips <[email protected]>

* Fix test file reference

Co-authored-by: Tim Chevalier <[email protected]>

* Apply suggestions from code review

* Update spec/u-namespace.md

Co-authored-by: Eemeli Aro <[email protected]>

* Apply suggestions from code review

Co-authored-by: Addison Phillips <[email protected]>

* Apply suggestions from code review

Co-authored-by: Addison Phillips <[email protected]>

* Add mention of functions to namespace description

---------

Co-authored-by: Addison Phillips <[email protected]>
Co-authored-by: Tim Chevalier <[email protected]>

* Define function composition for :string values (#798)

* Define function composition for :string values

* Update spec/registry.md as suggested by @stasm in #814

* Drop the "only"

* Update text following code review comments

---------

Co-authored-by: Addison Phillips <[email protected]>

* Drop data model request for feedback on "name" (#909)

* Allow surrogates in content, issue #895 (#906)

* Allow surrogates in content, issue #895

* Grammar and typos, linkify terms, make into a note, and fix 2119 keywords

Thanks Addison!

Co-authored-by: Addison Phillips <[email protected]>

* Not using "localizable elements"

Co-authored-by: Addison Phillips <[email protected]>

* Keep syntax.md in sync with message.abnf

* Added note about surrogates to quoted literals

* Moved the note about surrogates from Security Considerations to The Message

* Update spec/syntax.md

* Update spec/syntax.md

* Italicize  in a couple of places

* Implemeted more (all?) feedback from review

---------

Co-authored-by: Addison Phillips <[email protected]>

---------

Co-authored-by: Eemeli Aro <[email protected]>
Co-authored-by: Elango Cheran <[email protected]>
Co-authored-by: Tim Chevalier <[email protected]>
Co-authored-by: Mark Davis <[email protected]>
Co-authored-by: Danny Gleckler <[email protected]>
Co-authored-by: Steven R. Loomis <[email protected]>
Co-authored-by: Stanisław Małolepszy <[email protected]>
Co-authored-by: Eemeli Aro <[email protected]>
Co-authored-by: Mihai Nita <[email protected]>
@lucacasonato
Copy link
Contributor

Sorry to chime in after this was discussed and changed, but why does this logic hold for unpaired surrogates, but not for null chars? The spec disallows null chars, but allows unpaired surrogates - this makes things easier for some languages (like JS that has WTF-16 strings), but more difficult for others (like Rust), where strings are enforced to be UTF-8 (not WTF-8). Null chars are not needed for most languages, except for languages that use null terminated strings. I feel like both of these restrictions should go the same way:

  1. Either the spec enforces both unpaired surrogates disallowed, and null chars disallowed
  2. Or the spec enforces neither, but explicitly allows conforming implementations to only operate on byte sequences that forbid null chars or unpaired surrogates.

I don't care very much though :)

@duerst
Copy link

duerst commented Oct 29, 2024

Sorry to chime in after this was discussed and changed, but why does this logic hold for unpaired surrogates, but not for null chars? The spec disallows null chars, but allows unpaired surrogates - this makes things easier for some languages (like JS that has WTF-16 strings), but more difficult for others (like Rust), where strings are enforced to be UTF-8 (not WTF-8). Null chars are not needed for most languages, except for languages that use null terminated strings. I feel like both of these restrictions should go the same way:

  1. Either the spec enforces both unpaired surrogates disallowed, and null chars disallowed
  2. Or the spec enforces neither, but explicitly allows conforming implementations to only operate on byte sequences that forbid null chars or unpaired surrogates.

I don't like unpaired surrogates at all, I think there's a fundamental difference between them and null characters, in particular for null-terminated strings. The later appear at the end of a string only. They get inserted and eliminated automatically by libraries, e.g. when you extract a substring or concatenate two strings. Therefore, we don't need to allow them in MF2. MF2 data that fits the grammar will have a null character at the end of it if it's processed by a programming language that uses null characters as terminators. But this null char will be outside of what the grammar covers. The (intermediate or final) results of formatting,... will also have nulls at the end, but these are again part of the host language, not part of MF2, in the same way that general string length counts are not of interest in MF2 even though there are many languages that use length counts to manage string (essentially all those that don't use null bytes at the end).

@aphillips aphillips added resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. and removed Action-Item Action item assigned by the WG labels Nov 4, 2024
aphillips added a commit that referenced this issue Nov 4, 2024
* [DESIGN] Number selection design refinements

This is to build up and capture technical considerations for how to address the issues raised by @eemeli's PR #842.

* Update examples to match changes to syntax

Also responds to the long discussion with @eemeli about significant digits by removing from the example.

* Address 2024-09-16 call comments

This changes the status to "Re-Opened" and adds a link to the PR. Expect to merge this imminently, although discussion on number selection remains.

* Update exploration/number-selection.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update from main (#914)

* Create notes-2024-08-19.md

* Accept attributes design & remove spec note (#845)

* Accept attributes design & remove spec note

* Disallow duplicate attribute names (closes #756)

* Add link to contextual options PR

* Add more prose to tag example text

Co-authored-by: Addison Phillips <[email protected]>

* Mention attribute validity condition in the **_valid_** definition

---------

Co-authored-by: Addison Phillips <[email protected]>

* Update selection-declaration design doc based on mtg / issue discussion (#867)

* Add tests for pattern selection (#863)

* Add tests for pattern selection

* Add missing errors

* Apply suggestions from code review

Co-authored-by: Addison Phillips <[email protected]>

---------

Co-authored-by: Addison Phillips <[email protected]>

* Add Duplicate Variant to table in test/README.md (#861)

* Add new selection-declaration alternative: Require annotation of selector variables in placeholders (#860)

* Add new selection-declaration alternative: Require annotation of selector variables in placeholders

* Improve examples

* Switch example order

* Update the stability policy (#834)

* Update the stability policy

Based on discussion in the 2024-07-22 call and in PR #829, update the stability policy.

* A deeper, more thorough rewrite

- Standardizes the phrasing completely.
- Moves all potential future changes (which are not, after all, stability policies) to an "important" block
- Removes duplication
- Separates functions, options, and option values into separate guarantees
- Clarifies the note about formatting changing over time

* Update spec/README.md

Co-authored-by: Tim Chevalier <[email protected]>

* Update spec/README.md

Co-authored-by: Eemeli Aro <[email protected]>

* remove well-formed

* Update spec/README.md

---------

Co-authored-by: Tim Chevalier <[email protected]>
Co-authored-by: Eemeli Aro <[email protected]>

* Refine error handling text (#816)

* Refine error handling text

* Apply suggestions from code review

Co-authored-by: Addison Phillips <[email protected]>

* Update fallback text

* Turn bullet point list into paragraphs

* Be more mighty

Co-authored-by: Addison Phillips <[email protected]>

---------

Co-authored-by: Addison Phillips <[email protected]>

* Create notes-2024-08-26.md

* Select "Match on variables instead of expressions" for selection-declarations (#824)

* Select "Match on variables instead of expressions" for selection-declarations

* Add hybrid option to selection-declaration.md (#870)

* Add hybrid option to selection-declaration.md

* Update selection-declaration.md

fixed glitch in original edit

* Update selection-declaration.md

* Apply suggestions from code review

Fixing typos

Co-authored-by: Addison Phillips <[email protected]>

* Update selection-declaration.md

* Update exploration/selection-declaration.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update exploration/selection-declaration.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update exploration/selection-declaration.md

Co-authored-by: Eemeli Aro <[email protected]>

---------

Co-authored-by: Addison Phillips <[email protected]>
Co-authored-by: Eemeli Aro <[email protected]>

* Update selection-declaration.md

---------

Co-authored-by: Mark Davis <[email protected]>
Co-authored-by: Addison Phillips <[email protected]>

* Fix "Allow immutable input declarative selectors" example (#874)

* Update README.md (#875)

* Update README.md

* Update README.md

* [DESIGN] Update bidi design document to show proposed design (#871)

* [DESIGN] Update bidi design document to show proposed design

The design I actually think we should adopt is the "hybrid approaches" one. This is a necessary first step on the highway to UAX31 compliance and I think is responsibly contained/managed. It is a hybrid approach, in that it permits testable strict implementations to be created (particularly for message serialization).

This PR consists of moving text around. I added one "pro" to one option also.

* Address comments

* Miscellaneous test fixes (#862)

* Add missing expected bad-selector errors

* Fix expected parts for unsupported-statement test

* Add a few new tests for leading-whitespace and duplicate-variant

* Add tests for escaped-char changes made in #743

* Fix tests for attributes with variable values

* Update contributing and joining info (#876)

* Update contributing and joining info

* Update README.md

* Update CONTRIBUTING.md

* Restore CLA copy

* Clarify error & fallback handling (#879)

* Clarify error & fallback handling

* Apply suggestions from code review

Co-authored-by: Addison Phillips <[email protected]>

* Select last rather than first attribute

* Drop mention of "starting with Pattern Selection"

* Attributes can't change the formatted output

* Use "nor" instead of "or" regarding attribute restrictions

---------

Co-authored-by: Addison Phillips <[email protected]>

* Clarify rule selection (#878)

* Clarify rule selection

Fixes #868 

This adds normative SHOULD language to using CLDR plural and ordinal data, which was intended originally.

- clarifies that keyword selection follows exact match
- clarifies the purpose of rule-based selection
- makes non-CLDR-based implementation permitted

* Update spec/registry.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update spec/registry.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update spec/registry.md

Co-authored-by: Eemeli Aro <[email protected]>

---------

Co-authored-by: Eemeli Aro <[email protected]>

* [DESIGN] Maintaining the Standard, Optional and Unicode Namespace Function Sets (#634)

* Design doc to capture registry maintenance

* Update maintaining-registry.md

* Update exploration/maintaining-registry.md

Co-authored-by: Tim Chevalier <[email protected]>

* Update exploration/maintaining-registry.md

Co-authored-by: Tim Chevalier <[email protected]>

* Add user stories, small updates to RGI

* Update exploration/maintaining-registry.md

* Adding additional detail

* Remove machine readable registry; update prose

* Update maintaining-registry.md

* Further development work

* Update to change format and naming

Per the 2024-08-19 call, we decided to switch towards a specification-per-function model, with statuses. This commit includes the initial set of changes to try and implement this.

* Address some comments.

---------

Co-authored-by: Tim Chevalier <[email protected]>

* Create notes-2024-09-09.md

* Fix a typo in an example (#880)

The upcoming work to implement resolved value might make this patch unnecessary or obsolete, but fixing the typo (missing `{`/`}` around the variable in the pattern) just in case

* Remove forward-compatibility promise and all reserved & private syntax (#883)

* Remove forwards compatibility from stability guarantee

* Drop reserved statements and expressions

* Drop private-use annotations

* Update tests

* Clarify that deprecation is not removal

* Match on variables instead of expressions (#877)

* Match on variables instead of expressions

* Apply suggestions from code review

Co-authored-by: Addison Phillips <[email protected]>

* Apply suggestions from code review

* Add missing test changes noticed during implementation

* Empty commit to re-trigger CLA check

---------

Co-authored-by: Addison Phillips <[email protected]>

* Create notes-2024-09-10.md

* Add bidi support and address UAX31/UTS55 requirements (#884)

* Add bidi support and address UAX31/UTS55 requirements

Adds the bidi strong marks ALM, RLM, and LRM plus the bidi isolate controls LRI, RLI, FSI, and PDI to the syntax.

Formally defines optional vs. non-optional whitespace.

Non-optional whitespace must include at least one whitespace character. Optional whitespace may contain only bidi marks (which are invisible)

* Update syntax.md including text from previous PR

* Repair the guidance on strongly directional marks

Include ALM and better specify how to use the marks.

* Fix formatting of the "important"

* Add bidi characters to description of whitespace.

* Permit bidi in a few more places

Add optional whitespace at the start of `variant`

Add optional whitespace around `quoted-pattern`

These changes result in allowing bidi around keys and quoted patterns as intended.

* Update syntax.md ABNF

* Update formatting.md

- Add a note about the difference between formatting and message syntax.
- Clarify the sentence about message directionality.

* Address comment about name/identifier

* Address comments related to bidi in `name`

* Fix variable's location

* Address comment about the list of LRI/PDI targets

* One character typo :-P

* Update spec/syntax.md

Co-authored-by: Eemeli Aro <[email protected]>

* Address comments about rule R3a-1

* Update spec/syntax.md

Co-authored-by: Eemeli Aro <[email protected]>

* Address comment about U+061C

* Change [o]wsp => `o` or `s`

* Match syntax spec to abnf

* Remove *

* Update syntax.md

* Update spec/syntax.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update spec/message.abnf

Co-authored-by: Eemeli Aro <[email protected]>

* Update spec/message.abnf

Co-authored-by: Eemeli Aro <[email protected]>

* Update syntax.md

* Update spec/message.abnf

Co-authored-by: Eemeli Aro <[email protected]>

* Update spec/syntax.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update spec/syntax.md

Co-authored-by: Eemeli Aro <[email protected]>

---------

Co-authored-by: Eemeli Aro <[email protected]>

* Specify `bad-option` for bad digit size option values (#882)

* Specify `bad-option` for bad digit size option values

Fixes #739

* adopt 'non-negative integer'

* Create notes-2024-09-16.md

* Address name and literal equality (#885)

* Address name and literal equality

This change defines equality as discussed in the 2024-09-09 teleconference in the following ways:

- It defines _name_ equality as being under NFC
- It defines _literal_ equality as explicitly **not** under NFC
- It moves _name_ before _identifier_ in that section of text to avoid a forward definition.

Note that this deviates from discussion in 2024-09-09's call in that we didn't discuss literals at length. It also doesn't discuss non-name/non-literal values, which I'll point out are limited to ASCII sequences such as keywords.

* Typo fix

* Add a note about not requiring implementations to actually normalize

* Implement changes dicussed in 2024-09-16 call.

- Make _key_ require NFC for uniqueness/comparison
- Add a note about NFC
- Make _literal_ **_not_** define equality
- Make text in _name_ identical to that in _key_ for consistency

* Update formatting.md to include keys in NFC

* Address comments

* Update spec/syntax.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update spec/syntax.md

Co-authored-by: Eemeli Aro <[email protected]>

---------

Co-authored-by: Eemeli Aro <[email protected]>

* Update list of normative changes during the LDML45 period (#890)

* Fix typos in data-model-errors tests (#892)

Fix #886

* Update note on exact numeric match for v46 (#891)

Addresses #887 

Non-normative changes to the notes specifically part of LDML46

* Fix attribute value to be literal (#894)

Fixes #893

* Create notes-2024-09-30.md

* Add Resolved Values and Function Handler sections to formatting (#728)

* Add Resolved Values section to formatting

* Apply suggestions from code review

* Apply suggestions from code review

* Apply suggestions from code review

Co-authored-by: Tim Chevalier <[email protected]>

* Linkify "resolved value"

* Add some examples & explicitly allow wrapping input values

* No throw, only emit

Co-authored-by: Tim Chevalier <[email protected]>

* Add section on Function Handlers, defining the term

* Apply suggestions from code review

* Rephrase initial resolved value definition

* Update spec/formatting.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update resolved value definition again

Co-authored-by: Addison Phillips <[email protected]>

---------

Co-authored-by: Tim Chevalier <[email protected]>
Co-authored-by: Addison Phillips <[email protected]>

* Define function composition for :number and :integer values (#823)

* Define function composition for :number and :integer values

* Apply suggestions from code review

Co-authored-by: Addison Phillips <[email protected]>

* Add operand option priority example

* Add apostrophes'

Co-authored-by: Tim Chevalier <[email protected]>

* Update spec/registry.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update spec/registry.md

Co-authored-by: Eemeli Aro <[email protected]>

---------

Co-authored-by: Addison Phillips <[email protected]>
Co-authored-by: Tim Chevalier <[email protected]>

* Create notes-2024-10-07.md

* Apply NFC normalization during :string key comparison (#905)

* Apply NFC normalization during :string key comparison

* Add link to UAX#15

Co-authored-by: Addison Phillips <[email protected]>

---------

Co-authored-by: Addison Phillips <[email protected]>

* Add tests for changes due to bidi/whitespace (#902)

* Add tests for changes due to bidi/whitespace

* Correct output

* Make erroneous test a syntax error

* Define function composition for date/time values (#814)

* Define function composition for date/time values

* Apply suggestions from code review

Co-authored-by: Stanisław Małolepszy <[email protected]>

* Drop the "only"

* Update spec/registry.md

* Update spec/registry.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update spec/registry.md

Co-authored-by: Eemeli Aro <[email protected]>

* Update spec/registry.md

Co-authored-by: Eemeli Aro <[email protected]>

* Make :date and :time composition implementation-defined

---------

Co-authored-by: Stanisław Małolepszy <[email protected]>
Co-authored-by: Addison Phillips <[email protected]>

* DESIGN: Add alternative designs to the design doc on function composition (#806)

* DESIGN: Add a sequel to the design doc on function composition

This document sketches out some alternatives for the machinery
provided to enable function composition.

The goal is to provide an exhaustive list of alternatives.

* Remove 'part 2' document and move contents to the end of part 1

* Revise introduction to reflect the changed goal

* Edited for conciseness

* Further edits for conciseness

* Give a name to InputType and use it

* Refer to motivating examples

* Update function-composition-part-1.md status

Per 2024-10-14 telecon

* Create notes-2024-10-14.md

* Add test for :integer and :number composition (#907)

* Fix `:integer` option `useGrouping` values (#912)

I noticed that `:integer` does not include the "never" value for the option `useGrouping`. This is a bug.

* Drop syntax note on additional bidi changes (#910)

Drop syntax note on addition bidi changes

* Add tests for changes due to #885 (name/literal equality) (#904)

* Add tests for changes due to #885 (name/literal equality)

* Update test/tests/functions/string.json

Co-authored-by: Eemeli Aro <[email protected]>

* Update test/tests/syntax.json

Co-authored-by: Eemeli Aro <[email protected]>

* Update test/tests/functions/string.json

Co-authored-by: Eemeli Aro <[email protected]>

* Added tests for reordering and special case mapping

* Add another selection test

---------

Co-authored-by: Eemeli Aro <[email protected]>

* Add u: options namespace (#846)

* Move spec/registry.md -> spec/registry/default.md

* Add Unicode Registry definition

* Refer to BCP47, add note about only requiring normal tags

* Call it a namespace

* Apply suggestions from code review

Co-authored-by: Addison Phillips <[email protected]>

* Fix test file reference

Co-authored-by: Tim Chevalier <[email protected]>

* Apply suggestions from code review

* Update spec/u-namespace.md

Co-authored-by: Eemeli Aro <[email protected]>

* Apply suggestions from code review

Co-authored-by: Addison Phillips <[email protected]>

* Apply suggestions from code review

Co-authored-by: Addison Phillips <[email protected]>

* Add mention of functions to namespace description

---------

Co-authored-by: Addison Phillips <[email protected]>
Co-authored-by: Tim Chevalier <[email protected]>

* Define function composition for :string values (#798)

* Define function composition for :string values

* Update spec/registry.md as suggested by @stasm in #814

* Drop the "only"

* Update text following code review comments

---------

Co-authored-by: Addison Phillips <[email protected]>

* Drop data model request for feedback on "name" (#909)

* Allow surrogates in content, issue #895 (#906)

* Allow surrogates in content, issue #895

* Grammar and typos, linkify terms, make into a note, and fix 2119 keywords

Thanks Addison!

Co-authored-by: Addison Phillips <[email protected]>

* Not using "localizable elements"

Co-authored-by: Addison Phillips <[email protected]>

* Keep syntax.md in sync with message.abnf

* Added note about surrogates to quoted literals

* Moved the note about surrogates from Security Considerations to The Message

* Update spec/syntax.md

* Update spec/syntax.md

* Italicize  in a couple of places

* Implemeted more (all?) feedback from review

---------

Co-authored-by: Addison Phillips <[email protected]>

---------

Co-authored-by: Eemeli Aro <[email protected]>
Co-authored-by: Elango Cheran <[email protected]>
Co-authored-by: Tim Chevalier <[email protected]>
Co-authored-by: Mark Davis <[email protected]>
Co-authored-by: Danny Gleckler <[email protected]>
Co-authored-by: Steven R. Loomis <[email protected]>
Co-authored-by: Stanisław Małolepszy <[email protected]>
Co-authored-by: Eemeli Aro <[email protected]>
Co-authored-by: Mihai Nita <[email protected]>

* Add serialization proposal

* Revert "Add serialization proposal"

This reverts commit 17af553.

* Revert "Update from main (#914)"

This reverts commit da9377b.

* Add serialization proposal

---------

Co-authored-by: Eemeli Aro <[email protected]>
Co-authored-by: Elango Cheran <[email protected]>
Co-authored-by: Tim Chevalier <[email protected]>
Co-authored-by: Mark Davis <[email protected]>
Co-authored-by: Danny Gleckler <[email protected]>
Co-authored-by: Steven R. Loomis <[email protected]>
Co-authored-by: Stanisław Małolepszy <[email protected]>
Co-authored-by: Eemeli Aro <[email protected]>
Co-authored-by: Mihai Nita <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
LDML46.1 MF2.0 Draft Candidate Preview-Feedback Feedback gathered during the technical preview resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. syntax Issues related with MF Syntax
Projects
None yet
Development

No branches or pull requests

5 participants