-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added tests and workaround for this ICU bug #1106
Conversation
Oh yeah, forgot to mention I wrote this all in Java, not scala. |
DAFFODIL-2859
* so that there should be zero leading digits if the integer part is 0. | ||
* <p> | ||
* See https://unicode-org.atlassian.net/browse/ICU-22558 | ||
* <p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this might be intentional behavior, I found this in the ICU code:
So if you want no integral zeros, you have to use a pattern like .#
, but that will require a fractional zero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, basically that code explicitly says "for backward compatibility" it always requires a 0 digit. That's in essence acknowledging that different behavior would also be useful, but then they're not offering any way to access that behavior.
This issue could be fixed in ICU by putting in a switch by which ICU API users have to explicitly enable it.
Then in Daffodil we could enable or disable this using a "dfdlx" property e.g., dfdlx:textNumberPatternOptions="allowLeadingDecimalPoint"
or some such.
*/ | ||
static DecimalFormat fromPattern(String patternString, DecimalFormatSymbols symbols) { | ||
DecimalFormat decimalFormat = new DecimalFormat(patternString, symbols); | ||
Matcher matcher = pattern.matcher(patternString); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using this kindof regex, it looks like icu4j has a public class/functions (looks undocumented but stable) to get information from a pattern. Something like this:
val ppi = com.ibm.icu.impl.number.PatternStringParser.parseToPatternInfo(patternString)
val pi = ppi.positive
The pi
variable is a ParsedSubpatternInfo
that has a bunch of properties about the pattern, like integerTotal
, integerNumerals
, and integerLeadingHashSigns
. This is probably all subject to change since I think it's an internal class, but looking at the history of the PatternStringParser, it looks like it changes pretty rarely.
In fact, I believe some of Daffodil inspects the pattern to detect things like if there is an E
, comma, decimal, etc. and tries to deal with things like escaped and quoted characters like this pattern does. We should refactor all that to use this PatternStringParser if we need to ask any questions about the pattern.
// see https://unicode-org.atlassian.net/browse/ICU-22558 | ||
fail("Bug ICU-22558 has been fixed. This test set is no longer needed to illustrate the problem."); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Historically we've been putting tests that show ICU behavior here:
daffodil-runtime1/src/test/scala/org/apache/daffodil/runtime1/processors/input/TestICU.scala
@Test | ||
public void testDecimalFormatWithHash() { | ||
// Create DecimalFormat instance with the pattern "#.##" | ||
DecimalFormat decimalFormat = new DecimalFormat("#.##", symbols); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a patterm of .##
gives the expected results
scala> new com.ibm.icu.text.DecimalFormat(".##").format(0.12)
res1: String = .12
Is that a reasonable workaround?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because 1.0 then fails with a truncation no-rounding possible error.
Making the user use two different format patterns is also no good as that would require the schema author to understand this bug, and literally conditionalize their schema for whether the value is between 0.0 and 1.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works for me:
scala> new com.ibm.icu.text.DecimalFormat(".##").format(1.0)
res0: String = 1.0
I guess we're setting some other property that is different than what a default DecimalFormat gets to cause the no-rounding error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote the nested loop:
@Test
public void icuTesting() {
List<String> stringList = Arrays.asList(".", "#.", ".#", "#.#");
List<Double> numberList = Arrays.asList(0.0, 1.0, 0.1, 1.1);
for (String str : stringList) {
for (Double num : numberList) {
DecimalFormat df = new DecimalFormat(str);
String output = df.format(num);
System.out.println(String.format("For value %f and pattern '%s' output is '%s' (length %d)", num, str, output, output.length()));
}
}
}
Output is, plus my comments after the //.
For value 0.000000 and pattern '.' output is '0.' (length 2) // so leaving off the # entirely means the fraction digit is optional, but not the decimal point.
For value 1.000000 and pattern '.' output is '1.' (length 2)
For value 0.100000 and pattern '.' output is '0.' (length 2)
For value 1.100000 and pattern '.' output is '1.' (length 2) // So "." behaves exactly like "#."
For value 0.000000 and pattern '#.' output is '0.' (length 2) // leaving off the # after "." => insisting on one in the output???
For value 1.000000 and pattern '#.' output is '1.' (length 2)
For value 0.100000 and pattern '#.' output is '0.' (length 2)
For value 1.100000 and pattern '#.' output is '1.' (length 2)
For value 0.000000 and pattern '.#' output is '.0' (length 2) // notice not a leading integer digit here
For value 1.000000 and pattern '.#' output is '1.0' (length 3)
For value 0.100000 and pattern '.#' output is '.1' (length 2)
For value 1.100000 and pattern '.#' output is '1.1' (length 3)
For value 0.000000 and pattern '#.#' output is '0' (length 1) // suppresses trailing "."
For value 1.000000 and pattern '#.#' output is '1' (length 1) // suppresses trailing "."
For value 0.100000 and pattern '#.#' output is '0.1' (length 3)
For value 1.100000 and pattern '#.#' output is '1.1' (length 3)
I can come up with no rationale for this other than they have to preserve backward compatibility so this clearly ad-hoc behavior must stay.
} | ||
return decimalFormat; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this belong in src/main/test
instead? It feels like it's more about showing a workaround than becoming part of our library for use. If we were to do this or something similar in Daffodil for the TextNumberEv
, I feel like the logic would just go directly in there. We already do inspection of the pattern and setting of properties to "fix" ICU behavior (see PrimitivesTextNumber.scala
and EvTextNumber.scala
), so it feels like the logic wants to be along side that instead of a special class. And if the PatternStringParser can be used, the complexity of the pattern goes away.
Closing https://issues.apache.org/jira/browse/DAFFODIL-2859 as Won't Fix. So this PR is now moot. Closing the PR. |
This doesn't fix the issue in Daffodil textNumberPattern yet, because this change potentially affects many schemas, changing their number formatting behavior when unparsing. Lots of expected test output files may become invalid now because initial 0 digits will no longer be output.
What this does is add tests that characterize the bug, and add a daffodil-lib utility FixICUDecimalFormat, which shows the way we can workaround the issue, and tests of that fix.
A switch to enable/disable an actual fix to Daffodil may be needed. (Which we could enable for Daffodil 4.0.0 perhaps.)
No plan to integrate this until after Daffodil 3.6.0 is released, as this will need to be studied against many schemas to know the impact and whether a switch to enable/disable is needed or not.
The issue is that ICU doesn't treat the leading 0 as optional in ICU patterns where it should be optional.
For example, with pattern "#.##", the value 0.12 should unparse to ".12", but it doesn't suppress the leading digit. It unparses as "0.12". If you want "0.12" you should use pattern "0.##" showing a required leading digit.
The ICU issue is this: https://unicode-org.atlassian.net/browse/ICU-22558
DAFFODIL-2859