Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix xs:float text number parse/unparse edge cases #1133

Merged

Conversation

stevedlawrence
Copy link
Member

@stevedlawrence stevedlawrence commented Jan 3, 2024

When parsing text floats, ICU sometimes returns a BigInteger, such as the case when parsing the maximum single precision floating-point value (i.e. 3.40282347E+38).

The current logic for checking if these parsed BigIntegers are in range of a float is to convert it to a double and then compare that against the maximum value of a float (promoted to a double). However, converting a BigInteger to a double can lead to changes in precision, making this comparison inaccurate. Looking at bit representations makes this more clear:

Float.MaxValue.doubleValue converted to bits is:

0 10001111110 1111111111111111111111100000000000000000000000000000

But BigInteger(Float.MaxValue).doubleValue converted to bits is:

0 10001111110 1111111111111111111111100101010011011010111111111000

The rules for converting a float to a double are different than converting a BigInteger to a double, which leads to changes in precision and a double value that is slightly bigger than the actual float max value, leading to out of range errors.

Because converting a BigInteger to a double is unreliable, this adds a new case for BigInteger, which converts it to a BigDecimal and compares that against the max value, which avoids precision issues. Note that this also changes the min/maxBD values to be created from a string rather than a double, since converting a double to a BigDecimal also has related precision issues.

Additionally, when unparsing text floats, ICU does not have a way to unparse single precision floats. Instead, it converts the float to a double and then unparses that. However, this leads to text numbers with extra precision that implies more accuracy than exists for floats, and leads to errors with some round trip tests. To fix this, this converts the float to a BigDecimal using the String constructor (which maintains prcision), and then has ICU format the BigDecimal. This is more expensive, but leads to more correct text floats.

DAFFODIL-2867

Copy link
Contributor

@mbeckerle mbeckerle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 minor comments only

Great that this gets rid of a bunch of twoPass tests that were highly unmotivated.

@@ -698,6 +706,8 @@ object NodeInfo extends Enum {
protected override def fromNumberNoCheck(n: Number): DataValueDouble = n.doubleValue
override val min = -JDouble.MAX_VALUE
override val max = JDouble.MAX_VALUE
override val minStr = "-" + JDouble.MAX_VALUE.toString
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason these aren't just min.toString and max.toString? The fact that they are not suggests to me something funny is going on that requires us not to call toString on the min because it's negated or something.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the min/max vals are Doubles, and Double.toString results in more decimal places in the string. When we create the BigDecimal for range comparison, we can't have those extra bits of precision or it changes the valid range for a Float. For example,

scala> new java.math.BigDecimal(Float.MaxValue.doubleValue.toString)
val res0: java.math.BigDecimal = 3.4028234663852886E+38

That value is much smaller than the actual Float max of 3.4028235E38--we must call toString on the Float, e.g.

scala> new java.math.BigDecimal(Float.MaxValue.toString)
val res0: java.math.BigDecimal = 3.4028235E+38

@@ -681,6 +687,8 @@ object NodeInfo extends Enum {
protected override def fromNumberNoCheck(n: Number): DataValueFloat = n.floatValue
override val min = -JFloat.MAX_VALUE.doubleValue
override val max = JFloat.MAX_VALUE.doubleValue
override val minStr = "-" + JFloat.MAX_VALUE.toString
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment below about minStr

// can also lead to failures to exactly round. To solve this, we convert the float
// to a String, convert that String to a BigDecimal, and then format that. This is
// more expensive, but ensures we unparse the exact same precision as represented by
// the float.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get it. Does constructing BigDecimal(string) work if the string uses exponential notation as in "3.4E38" ?

If so why didn't they provide a constructor that takes a float?

I guess it has to work, or your tests would be failing.

We could bypass all of this if the float is an integer because none of this is required if it's an integer. The issue with double conversion only matters if there are real non-zero fraction digits. That said, I think the only way to tell if a float is an integer is to round it to one and see if it is equal.

Nah. Not worth it. Plus I am not sure the interactions with the number pattern "@" character stuff, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does constructing BigDecimal(string) work if the string uses exponential notation as in "3.4E38" ?

Yep, here's the grammar it supports: https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html#BigDecimal-java.lang.String-

If so why didn't they provide a constructor that takes a float?

ICU "natively" supports BigInteger, BigDecimal, long, and double. Any Number that isn't one of those is converted to a double (via Number.doubleValue) and formats it using the double logic:

https://github.com/unicode-org/icu/blob/main/icu4j/main/core/src/main/java/com/ibm/icu/text/NumberFormat.java#L269-L289

This is fine for int/short/byte since they can all be exactly represented by a double. And it means float is supported, by when a float is converted to a double, more decimal places are sometimes needed when converted to a string. This Java and not ICU, but it shows the same issue more clearly:

scala> Float.MaxValue.toString
val res0: String = 3.4028235E38

scala> Float.MaxValue.doubleValue.toString
val res1: String = 3.4028234663852886E38

So ICU's logic of converting to a double is maybe questionable--you can't just convert a float to a double and expect to get the same string. Note that if you then parse that double string as a float you should get the original float back, so the thing in the infoset should come out the same, but the data doesn't round trip.

I couldn't find a way to convert a float to a double in a way that would result in the same string value (without doing what I did), though it does seem like a better algorithm should exist.

We could bypass all of this if the float is an integer...Nah. Not worth it...

Also, if a user is expecting floats, it probably is less likely integers to appear in the data/infoset, so it might be an optimization that isn't hit that often.

Probably the best thing to do is suggest that if you're dealing with text numbers containing decimals, then just always use xs:double--its much less likely to have precision issues and it avoids this hack and should be more performant.

model="SimpleTypes-Embedded.dfdl.xsd" description="Section 5 Simple type-float - DFDL-5-008R">

<tdml:document><![CDATA[3.4028235E38]]></tdml:document>
<tdml:document><![CDATA[340,282,350,000,000,000,000,000,000,000,000,000,000]]></tdml:document>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to stare at this a while and go look at GeneralFormatBase to decide this is ok.

Insert a comment that this is 39 digits long because the textNumberPattern is "#,###.###" which doesn't allow for use of an exponent, so it has to output all the integer digits.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do. This was needed to roundtrip--it parses correctly since it's lax parsing and supports exponents, but since the textNumberPattern is what you say, it unparses without exponents.

Copy link
Contributor

@tuxji tuxji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Looks fine.

When parsing text floats, ICU sometimes returns a BigInteger, such as
the case when parsing the maximum single precision floating-point value
(i.e. 3.40282347E+38).

The current logic for checking if these parsed BigIntegers are in range
of a float is to convert it to a double and then compare that against
the maximum value of a float (promoted to a double). However, converting
a BigInteger to a double can lead to changes in precision, making this
comparison inaccurate. Looking at bit representations makes this more
clear:

Float.MaxValue.doubleValue converted to bits is:

    0 10001111110 1111111111111111111111100000000000000000000000000000

But BigInteger(Float.MaxValue).doubleValue converted to bits is:

    0 10001111110 1111111111111111111111100101010011011010111111111000

The rules for converting a float to a double are different than
converting a BigInteger to a double, which leads to changes in precision
and a double value that is slightly bigger than the actual float max
value, leading to out of range errors.

Because converting a BigInteger to a double is unreliable, this adds a
new case for BigInteger, which converts it to a BigDecimal and compares
that against the max value, which avoids precision issues. Note that
this also changes the min/maxBD values to be created from a string
rather than a double, since converting a double to a BigDecimal also has
related precision issues.

Additionally, when unparsing text floats, ICU does not have a way to
unparse single precision floats. Instead, it converts the float to a
double and then unparses that. However, this leads to text numbers with
extra precision that implies more accuracy than exists for floats, and
leads to errors with some round trip tests. To fix this, this converts
the float to a BigDecimal using the String constructor (which maintains
precision), and then has ICU format the BigDecimal. This is more
expensive, but leads to more correct text floats. In practice, xs:float
should probably not be used for text numbers, favoring xs:double
instead, which avoids this issue.

DAFFODIL-2867
@stevedlawrence stevedlawrence force-pushed the daffodil-2867-float-comparison branch from 3782e79 to 6267374 Compare January 4, 2024 13:59
@stevedlawrence stevedlawrence merged commit 20d6ea3 into apache:main Jan 4, 2024
10 checks passed
@stevedlawrence stevedlawrence deleted the daffodil-2867-float-comparison branch January 4, 2024 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants