Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added tests and workaround for this ICU bug #1106

Closed
wants to merge 2 commits into from

Conversation

mbeckerle
Copy link
Contributor

This doesn't fix the issue in Daffodil textNumberPattern yet, because this change potentially affects many schemas, changing their number formatting behavior when unparsing. Lots of expected test output files may become invalid now because initial 0 digits will no longer be output.

What this does is add tests that characterize the bug, and add a daffodil-lib utility FixICUDecimalFormat, which shows the way we can workaround the issue, and tests of that fix.

A switch to enable/disable an actual fix to Daffodil may be needed. (Which we could enable for Daffodil 4.0.0 perhaps.)

No plan to integrate this until after Daffodil 3.6.0 is released, as this will need to be studied against many schemas to know the impact and whether a switch to enable/disable is needed or not.

The issue is that ICU doesn't treat the leading 0 as optional in ICU patterns where it should be optional.

For example, with pattern "#.##", the value 0.12 should unparse to ".12", but it doesn't suppress the leading digit. It unparses as "0.12". If you want "0.12" you should use pattern "0.##" showing a required leading digit.

The ICU issue is this: https://unicode-org.atlassian.net/browse/ICU-22558

DAFFODIL-2859

@mbeckerle
Copy link
Contributor Author

Oh yeah, forgot to mention I wrote this all in Java, not scala.

* so that there should be zero leading digits if the integer part is 0.
* <p>
* See https://unicode-org.atlassian.net/browse/ICU-22558
* <p>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this might be intentional behavior, I found this in the ICU code:

https://github.com/unicode-org/icu/blob/main/icu4j/main/core/src/main/java/com/ibm/icu/impl/number/PatternStringParser.java#L622-L635

So if you want no integral zeros, you have to use a pattern like .#, but that will require a fractional zero.

Copy link
Contributor Author

@mbeckerle mbeckerle Oct 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, basically that code explicitly says "for backward compatibility" it always requires a 0 digit. That's in essence acknowledging that different behavior would also be useful, but then they're not offering any way to access that behavior.

This issue could be fixed in ICU by putting in a switch by which ICU API users have to explicitly enable it.

Then in Daffodil we could enable or disable this using a "dfdlx" property e.g., dfdlx:textNumberPatternOptions="allowLeadingDecimalPoint" or some such.

*/
static DecimalFormat fromPattern(String patternString, DecimalFormatSymbols symbols) {
DecimalFormat decimalFormat = new DecimalFormat(patternString, symbols);
Matcher matcher = pattern.matcher(patternString);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using this kindof regex, it looks like icu4j has a public class/functions (looks undocumented but stable) to get information from a pattern. Something like this:

val ppi = com.ibm.icu.impl.number.PatternStringParser.parseToPatternInfo(patternString)
val pi = ppi.positive

The pi variable is a ParsedSubpatternInfo that has a bunch of properties about the pattern, like integerTotal, integerNumerals, and integerLeadingHashSigns. This is probably all subject to change since I think it's an internal class, but looking at the history of the PatternStringParser, it looks like it changes pretty rarely.

In fact, I believe some of Daffodil inspects the pattern to detect things like if there is an E, comma, decimal, etc. and tries to deal with things like escaped and quoted characters like this pattern does. We should refactor all that to use this PatternStringParser if we need to ask any questions about the pattern.

// see https://unicode-org.atlassian.net/browse/ICU-22558
fail("Bug ICU-22558 has been fixed. This test set is no longer needed to illustrate the problem.");
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Historically we've been putting tests that show ICU behavior here:

daffodil-runtime1/src/test/scala/org/apache/daffodil/runtime1/processors/input/TestICU.scala

@Test
public void testDecimalFormatWithHash() {
// Create DecimalFormat instance with the pattern "#.##"
DecimalFormat decimalFormat = new DecimalFormat("#.##", symbols);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a patterm of .## gives the expected results

scala> new com.ibm.icu.text.DecimalFormat(".##").format(0.12)
res1: String = .12

Is that a reasonable workaround?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because 1.0 then fails with a truncation no-rounding possible error.

Making the user use two different format patterns is also no good as that would require the schema author to understand this bug, and literally conditionalize their schema for whether the value is between 0.0 and 1.0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works for me:

scala> new com.ibm.icu.text.DecimalFormat(".##").format(1.0)
res0: String = 1.0

I guess we're setting some other property that is different than what a default DecimalFormat gets to cause the no-rounding error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote the nested loop:

    @Test
    public void icuTesting() {
        List<String> stringList = Arrays.asList(".", "#.", ".#", "#.#");
        List<Double> numberList = Arrays.asList(0.0, 1.0, 0.1, 1.1);
        for (String str : stringList) {
            for (Double num : numberList) {
                DecimalFormat df = new DecimalFormat(str);
                String output = df.format(num);
                System.out.println(String.format("For value %f and pattern '%s' output is '%s' (length %d)", num, str, output, output.length()));
            }
        }
    }

Output is, plus my comments after the //.

For value 0.000000 and pattern '.' output is '0.' (length 2)  // so leaving off the # entirely means the fraction digit is optional, but not the decimal point. 
For value 1.000000 and pattern '.' output is '1.' (length 2)
For value 0.100000 and pattern '.' output is '0.' (length 2)
For value 1.100000 and pattern '.' output is '1.' (length 2)   // So "." behaves exactly like "#."

For value 0.000000 and pattern '#.' output is '0.' (length 2) // leaving off the # after "." =>  insisting on one in the output???
For value 1.000000 and pattern '#.' output is '1.' (length 2)  
For value 0.100000 and pattern '#.' output is '0.' (length 2)
For value 1.100000 and pattern '#.' output is '1.' (length 2)

For value 0.000000 and pattern '.#' output is '.0' (length 2) // notice not a leading integer digit here
For value 1.000000 and pattern '.#' output is '1.0' (length 3)
For value 0.100000 and pattern '.#' output is '.1' (length 2)
For value 1.100000 and pattern '.#' output is '1.1' (length 3)

For value 0.000000 and pattern '#.#' output is '0' (length 1) // suppresses trailing "."
For value 1.000000 and pattern '#.#' output is '1' (length 1) // suppresses trailing "."
For value 0.100000 and pattern '#.#' output is '0.1' (length 3)
For value 1.100000 and pattern '#.#' output is '1.1' (length 3)

I can come up with no rationale for this other than they have to preserve backward compatibility so this clearly ad-hoc behavior must stay.

}
return decimalFormat;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this belong in src/main/test instead? It feels like it's more about showing a workaround than becoming part of our library for use. If we were to do this or something similar in Daffodil for the TextNumberEv, I feel like the logic would just go directly in there. We already do inspection of the pattern and setting of properties to "fix" ICU behavior (see PrimitivesTextNumber.scala and EvTextNumber.scala), so it feels like the logic wants to be along side that instead of a special class. And if the PatternStringParser can be used, the complexity of the pattern goes away.

@mbeckerle
Copy link
Contributor Author

Closing https://issues.apache.org/jira/browse/DAFFODIL-2859 as Won't Fix. So this PR is now moot. Closing the PR.

@mbeckerle mbeckerle closed this Oct 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants