Fix xs:float text number parse/unparse edge cases #1133

stevedlawrence · 2024-01-03T17:44:35Z

When parsing text floats, ICU sometimes returns a BigInteger, such as the case when parsing the maximum single precision floating-point value (i.e. 3.40282347E+38).

The current logic for checking if these parsed BigIntegers are in range of a float is to convert it to a double and then compare that against the maximum value of a float (promoted to a double). However, converting a BigInteger to a double can lead to changes in precision, making this comparison inaccurate. Looking at bit representations makes this more clear:

Float.MaxValue.doubleValue converted to bits is:

0 10001111110 1111111111111111111111100000000000000000000000000000

But BigInteger(Float.MaxValue).doubleValue converted to bits is:

0 10001111110 1111111111111111111111100101010011011010111111111000

The rules for converting a float to a double are different than converting a BigInteger to a double, which leads to changes in precision and a double value that is slightly bigger than the actual float max value, leading to out of range errors.

Because converting a BigInteger to a double is unreliable, this adds a new case for BigInteger, which converts it to a BigDecimal and compares that against the max value, which avoids precision issues. Note that this also changes the min/maxBD values to be created from a string rather than a double, since converting a double to a BigDecimal also has related precision issues.

Additionally, when unparsing text floats, ICU does not have a way to unparse single precision floats. Instead, it converts the float to a double and then unparses that. However, this leads to text numbers with extra precision that implies more accuracy than exists for floats, and leads to errors with some round trip tests. To fix this, this converts the float to a BigDecimal using the String constructor (which maintains prcision), and then has ICU format the BigDecimal. This is more expensive, but leads to more correct text floats.

DAFFODIL-2867

mbeckerle

+1 minor comments only

Great that this gets rid of a bunch of twoPass tests that were highly unmotivated.

mbeckerle · 2024-01-03T18:26:51Z

daffodil-runtime1/src/main/scala/org/apache/daffodil/runtime1/dpath/NodeInfo.scala

@@ -698,6 +706,8 @@ object NodeInfo extends Enum {
      protected override def fromNumberNoCheck(n: Number): DataValueDouble = n.doubleValue
      override val min = -JDouble.MAX_VALUE
      override val max = JDouble.MAX_VALUE
+      override val minStr = "-" + JDouble.MAX_VALUE.toString


Is there a reason these aren't just min.toString and max.toString? The fact that they are not suggests to me something funny is going on that requires us not to call toString on the min because it's negated or something.

Yeah, the min/max vals are Doubles, and Double.toString results in more decimal places in the string. When we create the BigDecimal for range comparison, we can't have those extra bits of precision or it changes the valid range for a Float. For example,

scala> new java.math.BigDecimal(Float.MaxValue.doubleValue.toString) val res0: java.math.BigDecimal = 3.4028234663852886E+38

That value is much smaller than the actual Float max of 3.4028235E38--we must call toString on the Float, e.g.

scala> new java.math.BigDecimal(Float.MaxValue.toString) val res0: java.math.BigDecimal = 3.4028235E+38

mbeckerle · 2024-01-03T18:27:31Z

daffodil-runtime1/src/main/scala/org/apache/daffodil/runtime1/dpath/NodeInfo.scala

@@ -681,6 +687,8 @@ object NodeInfo extends Enum {
      protected override def fromNumberNoCheck(n: Number): DataValueFloat = n.floatValue
      override val min = -JFloat.MAX_VALUE.doubleValue
      override val max = JFloat.MAX_VALUE.doubleValue
+      override val minStr = "-" + JFloat.MAX_VALUE.toString


See comment below about minStr

mbeckerle · 2024-01-03T18:47:31Z

...rc/main/scala/org/apache/daffodil/unparsers/runtime1/ConvertTextStandardNumberUnparser.scala

+            // can also lead to failures to exactly round. To solve this, we convert the float
+            // to a String, convert that String to a BigDecimal, and then format that. This is
+            // more expensive, but ensures we unparse the exact same precision as represented by
+            // the float.


I don't get it. Does constructing BigDecimal(string) work if the string uses exponential notation as in "3.4E38" ?

If so why didn't they provide a constructor that takes a float?

I guess it has to work, or your tests would be failing.

We could bypass all of this if the float is an integer because none of this is required if it's an integer. The issue with double conversion only matters if there are real non-zero fraction digits. That said, I think the only way to tell if a float is an integer is to round it to one and see if it is equal.

Nah. Not worth it. Plus I am not sure the interactions with the number pattern "@" character stuff, etc.

Does constructing BigDecimal(string) work if the string uses exponential notation as in "3.4E38" ?

Yep, here's the grammar it supports: https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html#BigDecimal-java.lang.String-

If so why didn't they provide a constructor that takes a float?

ICU "natively" supports BigInteger, BigDecimal, long, and double. Any Number that isn't one of those is converted to a double (via Number.doubleValue) and formats it using the double logic:

https://github.com/unicode-org/icu/blob/main/icu4j/main/core/src/main/java/com/ibm/icu/text/NumberFormat.java#L269-L289

This is fine for int/short/byte since they can all be exactly represented by a double. And it means float is supported, by when a float is converted to a double, more decimal places are sometimes needed when converted to a string. This Java and not ICU, but it shows the same issue more clearly:

scala> Float.MaxValue.toString val res0: String = 3.4028235E38 scala> Float.MaxValue.doubleValue.toString val res1: String = 3.4028234663852886E38

So ICU's logic of converting to a double is maybe questionable--you can't just convert a float to a double and expect to get the same string. Note that if you then parse that double string as a float you should get the original float back, so the thing in the infoset should come out the same, but the data doesn't round trip.

I couldn't find a way to convert a float to a double in a way that would result in the same string value (without doing what I did), though it does seem like a better algorithm should exist.

We could bypass all of this if the float is an integer...Nah. Not worth it...

Also, if a user is expecting floats, it probably is less likely integers to appear in the data/infoset, so it might be an optimization that isn't hit that often.

Probably the best thing to do is suggest that if you're dealing with text numbers containing decimals, then just always use xs:double--its much less likely to have precision issues and it avoids this hack and should be more performant.

mbeckerle · 2024-01-03T19:04:06Z

daffodil-test/src/test/resources/org/apache/daffodil/section05/simple_types/SimpleTypes.tdml

                       model="SimpleTypes-Embedded.dfdl.xsd" description="Section 5 Simple type-float - DFDL-5-008R">

-    <tdml:document><![CDATA[3.4028235E38]]></tdml:document>
+    <tdml:document><![CDATA[340,282,350,000,000,000,000,000,000,000,000,000,000]]></tdml:document>


I had to stare at this a while and go look at GeneralFormatBase to decide this is ok.

Insert a comment that this is 39 digits long because the textNumberPattern is "#,###.###" which doesn't allow for use of an exponent, so it has to output all the integer digits.

Will do. This was needed to roundtrip--it parses correctly since it's lax parsing and supports exponents, but since the textNumberPattern is what you say, it unparses without exponents.

tuxji

+1

Looks fine.

When parsing text floats, ICU sometimes returns a BigInteger, such as the case when parsing the maximum single precision floating-point value (i.e. 3.40282347E+38). The current logic for checking if these parsed BigIntegers are in range of a float is to convert it to a double and then compare that against the maximum value of a float (promoted to a double). However, converting a BigInteger to a double can lead to changes in precision, making this comparison inaccurate. Looking at bit representations makes this more clear: Float.MaxValue.doubleValue converted to bits is: 0 10001111110 1111111111111111111111100000000000000000000000000000 But BigInteger(Float.MaxValue).doubleValue converted to bits is: 0 10001111110 1111111111111111111111100101010011011010111111111000 The rules for converting a float to a double are different than converting a BigInteger to a double, which leads to changes in precision and a double value that is slightly bigger than the actual float max value, leading to out of range errors. Because converting a BigInteger to a double is unreliable, this adds a new case for BigInteger, which converts it to a BigDecimal and compares that against the max value, which avoids precision issues. Note that this also changes the min/maxBD values to be created from a string rather than a double, since converting a double to a BigDecimal also has related precision issues. Additionally, when unparsing text floats, ICU does not have a way to unparse single precision floats. Instead, it converts the float to a double and then unparses that. However, this leads to text numbers with extra precision that implies more accuracy than exists for floats, and leads to errors with some round trip tests. To fix this, this converts the float to a BigDecimal using the String constructor (which maintains precision), and then has ICU format the BigDecimal. This is more expensive, but leads to more correct text floats. In practice, xs:float should probably not be used for text numbers, favoring xs:double instead, which avoids this issue. DAFFODIL-2867

mbeckerle approved these changes Jan 3, 2024

View reviewed changes

tuxji approved these changes Jan 3, 2024

View reviewed changes

stevedlawrence force-pushed the daffodil-2867-float-comparison branch from 3782e79 to 6267374 Compare January 4, 2024 13:59

stevedlawrence merged commit 20d6ea3 into apache:main Jan 4, 2024
10 checks passed

stevedlawrence deleted the daffodil-2867-float-comparison branch January 4, 2024 14:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix xs:float text number parse/unparse edge cases #1133

Fix xs:float text number parse/unparse edge cases #1133

stevedlawrence commented Jan 3, 2024 •

edited

Loading

mbeckerle left a comment

mbeckerle Jan 3, 2024

stevedlawrence Jan 3, 2024

mbeckerle Jan 3, 2024

mbeckerle Jan 3, 2024

stevedlawrence Jan 3, 2024

mbeckerle Jan 3, 2024

stevedlawrence Jan 3, 2024

tuxji left a comment

Fix xs:float text number parse/unparse edge cases #1133

Fix xs:float text number parse/unparse edge cases #1133

Conversation

stevedlawrence commented Jan 3, 2024 • edited Loading

mbeckerle left a comment

Choose a reason for hiding this comment

mbeckerle Jan 3, 2024

Choose a reason for hiding this comment

stevedlawrence Jan 3, 2024

Choose a reason for hiding this comment

mbeckerle Jan 3, 2024

Choose a reason for hiding this comment

mbeckerle Jan 3, 2024

Choose a reason for hiding this comment

stevedlawrence Jan 3, 2024

Choose a reason for hiding this comment

mbeckerle Jan 3, 2024

Choose a reason for hiding this comment

stevedlawrence Jan 3, 2024

Choose a reason for hiding this comment

tuxji left a comment

Choose a reason for hiding this comment

stevedlawrence commented Jan 3, 2024 •

edited

Loading