Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added tests and workaround for this ICU bug #1106

Closed
wants to merge 2 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,104 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.daffodil.lib.util;

import com.ibm.icu.text.DecimalFormat;
import com.ibm.icu.text.DecimalFormatSymbols;

import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FixedICUDecimalFormat {

/**
* This regex enables us to correct a problem in ICUs interpretation of the pattern in the case
* where the pattern has no digits before the decimal points, such as "#.##"
* so that there should be zero leading digits if the integer part is 0.
* <p>
* See https://unicode-org.atlassian.net/browse/ICU-22558
* <p>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this might be intentional behavior, I found this in the ICU code:

https://github.com/unicode-org/icu/blob/main/icu4j/main/core/src/main/java/com/ibm/icu/impl/number/PatternStringParser.java#L622-L635

So if you want no integral zeros, you have to use a pattern like .#, but that will require a fractional zero.

Copy link
Contributor Author

@mbeckerle mbeckerle Oct 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, basically that code explicitly says "for backward compatibility" it always requires a 0 digit. That's in essence acknowledging that different behavior would also be useful, but then they're not offering any way to access that behavior.

This issue could be fixed in ICU by putting in a switch by which ICU API users have to explicitly enable it.

Then in Daffodil we could enable or disable this using a "dfdlx" property e.g., dfdlx:textNumberPatternOptions="allowLeadingDecimalPoint" or some such.

* It captures unescaped digits or '#' characters
* preceding a non-escaped decimal point in an ICU4J pattern string.
* <p>
* The whole regex (without java escapes) is {@code (?<!\\)([0-9#]*)(?<!\\)(\.) }
* <ul>
* <li>{@code (?<!\\)}: Negative lookbehind to ensure the character is not preceded by a backslash.</li>
* <li>{@code ([0-9#]*)}: Capture zero or more consecutive digits or '#' characters in group 1.</li>
* <li>{@code (?<!\\)(\.)}: The negative lookbehind ensures that only a non-escaped decimal point
* will be captured in group 2.</li>
* </ul>
*
* With this pattern:
* <ul>
* <li>Group 1 will capture any preceding unescaped digits or '#' characters.</li>
* <li>Group 2 will capture the non-escaped decimal point.</li>
* </ul>
*
* The match will fail only if there is no non-escaped decimal point.
*
* Package private for unit testing.
*/
static Pattern pattern = Pattern.compile("(?<!\\\\)([0-9#]*)(?<!\\\\)(\\.)");

/**
* Construct a DecimalFormat, but correct for the ICU mis-interpretation of "#." where it
* mistakenly sets the minimum number of integer digits to 1, not 0.
*
* This is not fooled by things like "#5#.0" which is silly, but behaves like "0.0".
* <p>
* See https://unicode-org.atlassian.net/browse/ICU-22558
* </p>
*
* @param patternString - an ICU pattern for a DecimalFormat, to be created and fixed (if needed)
* @return a DecimalFormat properly initialized by the pattern.
*/
public static DecimalFormat fromPattern(String patternString) {
DecimalFormatSymbols symbols = DecimalFormatSymbols.getInstance(); // default locale
return fromPattern(patternString, symbols);
}
/**
* Package private helper version that takes symbols so that you can define a locale.
* This allows tests to not have to adjust for locale decimal points "." vs. ",",
* by specifying a locale. But it isn't needed for regular usage.
*
* @param patternString - ICU pattern to fix (if needed)
* @param symbols - used to specify locale (because decimal points can be "." or ","
* @return a DecimalFormat properly initialized by the pattern.
*/
static DecimalFormat fromPattern(String patternString, DecimalFormatSymbols symbols) {
DecimalFormat decimalFormat = new DecimalFormat(patternString, symbols);
Matcher matcher = pattern.matcher(patternString);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using this kindof regex, it looks like icu4j has a public class/functions (looks undocumented but stable) to get information from a pattern. Something like this:

val ppi = com.ibm.icu.impl.number.PatternStringParser.parseToPatternInfo(patternString)
val pi = ppi.positive

The pi variable is a ParsedSubpatternInfo that has a bunch of properties about the pattern, like integerTotal, integerNumerals, and integerLeadingHashSigns. This is probably all subject to change since I think it's an internal class, but looking at the history of the PatternStringParser, it looks like it changes pretty rarely.

In fact, I believe some of Daffodil inspects the pattern to detect things like if there is an E, comma, decimal, etc. and tries to deal with things like escaped and quoted characters like this pattern does. We should refactor all that to use this PatternStringParser if we need to ask any questions about the pattern.

if (matcher.find()) {
// There IS a decimal point.
String digits = matcher.group(1); // digits or # unescaped, before the decimal point
assert(Objects.equals(".", matcher.group(2))); // 2nd group matches the decimal point itself
long count = digits.chars().filter(Character::isDigit).count();
if (count == 0) {
// The decimal point is not preceded by any required digits.
// This is the fix ICU is getting wrong.
//
// Note even when ICU fixes this, this code won't break. It's just
// unnecessary.
//
//
decimalFormat.setMinimumIntegerDigits(0);
}
}
return decimalFormat;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this belong in src/main/test instead? It feels like it's more about showing a workaround than becoming part of our library for use. If we were to do this or something similar in Daffodil for the TextNumberEv, I feel like the logic would just go directly in there. We already do inspection of the pattern and setting of properties to "fix" ICU behavior (see PrimitivesTextNumber.scala and EvTextNumber.scala), so it feels like the logic wants to be along side that instead of a special class. And if the PatternStringParser can be used, the complexity of the pattern goes away.

Original file line number Diff line number Diff line change
@@ -0,0 +1,159 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.daffodil.lib.util;

import com.ibm.icu.text.DecimalFormat;
import com.ibm.icu.text.DecimalFormatSymbols;
import org.junit.Test;
import static junit.framework.TestCase.assertEquals;
import static org.junit.Assert.*;

import java.util.Locale;
import java.util.Objects;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
* Tests about ICU-22558
*
* https://unicode-org.atlassian.net/browse/ICU-22558
*/
public class TestFixedICUDecimalFormat {

// Create DecimalFormatSymbols for a Locale where the decimal separator is a dot
DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.US);

/**
* Illustrates the bug in ICU.
*/
@Test
public void testDecimalFormatWithHash() {
// Create DecimalFormat instance with the pattern "#.##"
DecimalFormat decimalFormat = new DecimalFormat("#.##", symbols);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a patterm of .## gives the expected results

scala> new com.ibm.icu.text.DecimalFormat(".##").format(0.12)
res1: String = .12

Is that a reasonable workaround?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, because 1.0 then fails with a truncation no-rounding possible error.

Making the user use two different format patterns is also no good as that would require the schema author to understand this bug, and literally conditionalize their schema for whether the value is between 0.0 and 1.0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works for me:

scala> new com.ibm.icu.text.DecimalFormat(".##").format(1.0)
res0: String = 1.0

I guess we're setting some other property that is different than what a default DecimalFormat gets to cause the no-rounding error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote the nested loop:

    @Test
    public void icuTesting() {
        List<String> stringList = Arrays.asList(".", "#.", ".#", "#.#");
        List<Double> numberList = Arrays.asList(0.0, 1.0, 0.1, 1.1);
        for (String str : stringList) {
            for (Double num : numberList) {
                DecimalFormat df = new DecimalFormat(str);
                String output = df.format(num);
                System.out.println(String.format("For value %f and pattern '%s' output is '%s' (length %d)", num, str, output, output.length()));
            }
        }
    }

Output is, plus my comments after the //.

For value 0.000000 and pattern '.' output is '0.' (length 2)  // so leaving off the # entirely means the fraction digit is optional, but not the decimal point. 
For value 1.000000 and pattern '.' output is '1.' (length 2)
For value 0.100000 and pattern '.' output is '0.' (length 2)
For value 1.100000 and pattern '.' output is '1.' (length 2)   // So "." behaves exactly like "#."

For value 0.000000 and pattern '#.' output is '0.' (length 2) // leaving off the # after "." =>  insisting on one in the output???
For value 1.000000 and pattern '#.' output is '1.' (length 2)  
For value 0.100000 and pattern '#.' output is '0.' (length 2)
For value 1.100000 and pattern '#.' output is '1.' (length 2)

For value 0.000000 and pattern '.#' output is '.0' (length 2) // notice not a leading integer digit here
For value 1.000000 and pattern '.#' output is '1.0' (length 3)
For value 0.100000 and pattern '.#' output is '.1' (length 2)
For value 1.100000 and pattern '.#' output is '1.1' (length 3)

For value 0.000000 and pattern '#.#' output is '0' (length 1) // suppresses trailing "."
For value 1.000000 and pattern '#.#' output is '1' (length 1) // suppresses trailing "."
For value 0.100000 and pattern '#.#' output is '0.1' (length 3)
For value 1.100000 and pattern '#.#' output is '1.1' (length 3)

I can come up with no rationale for this other than they have to preserve backward compatibility so this clearly ad-hoc behavior must stay.


// Convert the number 0.12 to String
String formattedNumber = decimalFormat.format(0.12);

// This assert tests for the current behavior. (which is buggy)
assertEquals("0.12", formattedNumber); // Wrong answer: See ICU-22558
// Assert that the formatted string is ".12"
if (formattedNumber == ".12") {
// see https://unicode-org.atlassian.net/browse/ICU-22558
fail("Bug ICU-22558 has been fixed. This test set is no longer needed to illustrate the problem.");
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Historically we've been putting tests that show ICU behavior here:

daffodil-runtime1/src/test/scala/org/apache/daffodil/runtime1/processors/input/TestICU.scala



Pattern pattern = FixedICUDecimalFormat.pattern;

/**
* Tests to be sure the regex that is used in the fix is robust.
*/
@Test
public void testExtractDecimalPointAndPrecedingUnescapedDigits0() {
Matcher matcher = pattern.matcher("\\.");
assertFalse(matcher.find());
}

@Test
public void testExtractDecimalPointAndPrecedingUnescapedDigits1() {
Matcher matcher = pattern.matcher("#.##");
assertTrue(matcher.find());
assertEquals("#", matcher.group(1));
assertEquals(".", matcher.group(2));
}

@Test
public void testExtractDecimalPointAndPrecedingUnescapedDigits2() {
Matcher matcher = pattern.matcher("\\#.##");
assertTrue(matcher.find());
assertEquals("", matcher.group(1));
assertEquals(".", matcher.group(2));
}

@Test
public void testExtractDecimalPointAndPrecedingUnescapedChar3() {
Matcher matcher = pattern.matcher("'text'#.##");
assertTrue(matcher.find());
assertEquals("#", matcher.group(1));
assertEquals(".", matcher.group(2)); }

@Test
public void testExtractDecimalPointAndPrecedingUnescapedChar4() {
Matcher matcher = pattern.matcher(".##");
assertTrue(matcher.find());
assertEquals("", matcher.group(1));
assertEquals(".", matcher.group(2));
}

@Test
public void testExtractDecimalPointAndPrecedingUnescapedChar5() {
Matcher matcher = pattern.matcher(".");
assertTrue(matcher.find());
assertEquals("", matcher.group(1));
assertEquals(".", matcher.group(2));
}

@Test
public void testExtractDecimalPointAndPrecedingUnescapedChar6() {
Matcher matcher = pattern.matcher("5.#");
assertTrue(matcher.find());
assertEquals("5", matcher.group(1));
assertEquals(".", matcher.group(2));
}

@Test
public void testExtractDecimalPointAndPrecedingUnescapedChar7() {
Matcher matcher = pattern.matcher("5#.#");
assertTrue(matcher.find());
assertEquals("5#", matcher.group(1));
assertEquals(".", matcher.group(2));
}

@Test
public void testExtractDecimalPointAndPrecedingUnescapedChar8() {
Matcher matcher = pattern.matcher("$\\.$#7\\##9#5#.#");
assertTrue(matcher.find());
assertEquals("#9#5#", matcher.group(1));
assertEquals(".", matcher.group(2));
}


/**
* Show that the fix works.
*/
@Test
public void testFixedDecimalFormatWithHash() {
DecimalFormat decimalFormat = FixedICUDecimalFormat.fromPattern("#.##", symbols);
String formattedNumber = decimalFormat.format(0.12);
assertEquals(".12", formattedNumber);
}

/**
* Show that the fix doesn't break patterns with leading digits.
*/
@Test
public void testFixedDecimalFormatWithZero() {
DecimalFormat decimalFormat = FixedICUDecimalFormat.fromPattern("0.##", symbols);
String formattedNumber = decimalFormat.format(0.12);
assertEquals("0.12", formattedNumber);
}
}




Loading