Added tests and workaround for this ICU bug #1106

mbeckerle · 2023-10-27T21:01:02Z

This doesn't fix the issue in Daffodil textNumberPattern yet, because this change potentially affects many schemas, changing their number formatting behavior when unparsing. Lots of expected test output files may become invalid now because initial 0 digits will no longer be output.

What this does is add tests that characterize the bug, and add a daffodil-lib utility FixICUDecimalFormat, which shows the way we can workaround the issue, and tests of that fix.

A switch to enable/disable an actual fix to Daffodil may be needed. (Which we could enable for Daffodil 4.0.0 perhaps.)

No plan to integrate this until after Daffodil 3.6.0 is released, as this will need to be studied against many schemas to know the impact and whether a switch to enable/disable is needed or not.

The issue is that ICU doesn't treat the leading 0 as optional in ICU patterns where it should be optional.

For example, with pattern "#.##", the value 0.12 should unparse to ".12", but it doesn't suppress the leading digit. It unparses as "0.12". If you want "0.12" you should use pattern "0.##" showing a required leading digit.

The ICU issue is this: https://unicode-org.atlassian.net/browse/ICU-22558

DAFFODIL-2859

https://unicode-org.atlassian.net/browse/ICU-22558 DAFFODIL-2859

mbeckerle · 2023-10-27T21:01:45Z

Oh yeah, forgot to mention I wrote this all in Java, not scala.

DAFFODIL-2859

stevedlawrence · 2023-10-30T11:13:17Z

daffodil-lib/src/main/scala/org/apache/daffodil/lib/util/FixedICUDecimalFormat.java

+     * so that there should be zero leading digits if the integer part is 0.
+     * <p>
+     *     See https://unicode-org.atlassian.net/browse/ICU-22558
+     * <p>


Seems like this might be intentional behavior, I found this in the ICU code:

https://github.com/unicode-org/icu/blob/main/icu4j/main/core/src/main/java/com/ibm/icu/impl/number/PatternStringParser.java#L622-L635

So if you want no integral zeros, you have to use a pattern like .#, but that will require a fractional zero.

Yeah, basically that code explicitly says "for backward compatibility" it always requires a 0 digit. That's in essence acknowledging that different behavior would also be useful, but then they're not offering any way to access that behavior.

This issue could be fixed in ICU by putting in a switch by which ICU API users have to explicitly enable it.

Then in Daffodil we could enable or disable this using a "dfdlx" property e.g., dfdlx:textNumberPatternOptions="allowLeadingDecimalPoint" or some such.

stevedlawrence · 2023-10-30T11:30:40Z

daffodil-lib/src/main/scala/org/apache/daffodil/lib/util/FixedICUDecimalFormat.java

+     */
+    static DecimalFormat fromPattern(String patternString, DecimalFormatSymbols symbols) {
+        DecimalFormat decimalFormat = new DecimalFormat(patternString, symbols);
+        Matcher matcher = pattern.matcher(patternString);


Instead of using this kindof regex, it looks like icu4j has a public class/functions (looks undocumented but stable) to get information from a pattern. Something like this:

val ppi = com.ibm.icu.impl.number.PatternStringParser.parseToPatternInfo(patternString) val pi = ppi.positive

The pi variable is a ParsedSubpatternInfo that has a bunch of properties about the pattern, like integerTotal, integerNumerals, and integerLeadingHashSigns. This is probably all subject to change since I think it's an internal class, but looking at the history of the PatternStringParser, it looks like it changes pretty rarely.

In fact, I believe some of Daffodil inspects the pattern to detect things like if there is an E, comma, decimal, etc. and tries to deal with things like escaped and quoted characters like this pattern does. We should refactor all that to use this PatternStringParser if we need to ask any questions about the pattern.

stevedlawrence · 2023-10-30T11:38:43Z

daffodil-lib/src/test/scala/org/apache/daffodil/lib/util/TestFixedICUDecimalFormat.java

+            // see https://unicode-org.atlassian.net/browse/ICU-22558
+            fail("Bug ICU-22558 has been fixed. This test set is no longer needed to illustrate the problem.");
+        }
+    }


Historically we've been putting tests that show ICU behavior here:

daffodil-runtime1/src/test/scala/org/apache/daffodil/runtime1/processors/input/TestICU.scala

stevedlawrence · 2023-10-30T11:42:28Z

daffodil-lib/src/test/scala/org/apache/daffodil/lib/util/TestFixedICUDecimalFormat.java

+    @Test
+    public void testDecimalFormatWithHash() {
+        // Create DecimalFormat instance with the pattern "#.##"
+        DecimalFormat decimalFormat = new DecimalFormat("#.##", symbols);


Seems like a patterm of .## gives the expected results

scala> new com.ibm.icu.text.DecimalFormat(".##").format(0.12) res1: String = .12

Is that a reasonable workaround?

No, because 1.0 then fails with a truncation no-rounding possible error.

Making the user use two different format patterns is also no good as that would require the schema author to understand this bug, and literally conditionalize their schema for whether the value is between 0.0 and 1.0.

This works for me:

scala> new com.ibm.icu.text.DecimalFormat(".##").format(1.0) res0: String = 1.0

I guess we're setting some other property that is different than what a default DecimalFormat gets to cause the no-rounding error?

I wrote the nested loop:

@Test public void icuTesting() { List<String> stringList = Arrays.asList(".", "#.", ".#", "#.#"); List<Double> numberList = Arrays.asList(0.0, 1.0, 0.1, 1.1); for (String str : stringList) { for (Double num : numberList) { DecimalFormat df = new DecimalFormat(str); String output = df.format(num); System.out.println(String.format("For value %f and pattern '%s' output is '%s' (length %d)", num, str, output, output.length())); } } }

Output is, plus my comments after the //.

For value 0.000000 and pattern '.' output is '0.' (length 2) // so leaving off the # entirely means the fraction digit is optional, but not the decimal point. For value 1.000000 and pattern '.' output is '1.' (length 2) For value 0.100000 and pattern '.' output is '0.' (length 2) For value 1.100000 and pattern '.' output is '1.' (length 2) // So "." behaves exactly like "#." For value 0.000000 and pattern '#.' output is '0.' (length 2) // leaving off the # after "." => insisting on one in the output??? For value 1.000000 and pattern '#.' output is '1.' (length 2) For value 0.100000 and pattern '#.' output is '0.' (length 2) For value 1.100000 and pattern '#.' output is '1.' (length 2) For value 0.000000 and pattern '.#' output is '.0' (length 2) // notice not a leading integer digit here For value 1.000000 and pattern '.#' output is '1.0' (length 3) For value 0.100000 and pattern '.#' output is '.1' (length 2) For value 1.100000 and pattern '.#' output is '1.1' (length 3) For value 0.000000 and pattern '#.#' output is '0' (length 1) // suppresses trailing "." For value 1.000000 and pattern '#.#' output is '1' (length 1) // suppresses trailing "." For value 0.100000 and pattern '#.#' output is '0.1' (length 3) For value 1.100000 and pattern '#.#' output is '1.1' (length 3)

I can come up with no rationale for this other than they have to preserve backward compatibility so this clearly ad-hoc behavior must stay.

stevedlawrence · 2023-10-30T11:51:37Z

daffodil-lib/src/main/scala/org/apache/daffodil/lib/util/FixedICUDecimalFormat.java

+        }
+        return decimalFormat;
+    }
+}


Does this belong in src/main/test instead? It feels like it's more about showing a workaround than becoming part of our library for use. If we were to do this or something similar in Daffodil for the TextNumberEv, I feel like the logic would just go directly in there. We already do inspection of the pattern and setting of properties to "fix" ICU behavior (see PrimitivesTextNumber.scala and EvTextNumber.scala), so it feels like the logic wants to be along side that instead of a special class. And if the PatternStringParser can be used, the complexity of the pattern goes away.

mbeckerle · 2023-10-30T19:02:39Z

Closing https://issues.apache.org/jira/browse/DAFFODIL-2859 as Won't Fix. So this PR is now moot. Closing the PR.

Commited tests for ICU bug

13e0fe0

https://unicode-org.atlassian.net/browse/ICU-22558 DAFFODIL-2859

Add APLv2 banners to satisfy rat check

f8d5fd8

DAFFODIL-2859

stevedlawrence reviewed Oct 30, 2023

View reviewed changes

mbeckerle closed this Oct 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added tests and workaround for this ICU bug #1106

Added tests and workaround for this ICU bug #1106

mbeckerle commented Oct 27, 2023

mbeckerle commented Oct 27, 2023

stevedlawrence Oct 30, 2023

mbeckerle Oct 30, 2023 •

edited

Loading

stevedlawrence Oct 30, 2023

stevedlawrence Oct 30, 2023

stevedlawrence Oct 30, 2023

mbeckerle Oct 30, 2023

stevedlawrence Oct 30, 2023

mbeckerle Oct 30, 2023

stevedlawrence Oct 30, 2023

mbeckerle commented Oct 30, 2023

Added tests and workaround for this ICU bug #1106

Added tests and workaround for this ICU bug #1106

Conversation

mbeckerle commented Oct 27, 2023

mbeckerle commented Oct 27, 2023

Choose a reason for hiding this comment

mbeckerle Oct 30, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mbeckerle commented Oct 30, 2023

mbeckerle Oct 30, 2023 •

edited

Loading