Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Bug with valueLength being overwritten after Trim #1338

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

olabusayoT
Copy link
Contributor

@olabusayoT olabusayoT commented Oct 15, 2024

  • currently after trimming the value of the element, we set the valueLength, and then overwrite it after returning from the parse that does the trimming. This results in the wrong value for value length. This fixes it by only setting it if it's a non-choice comlplexType and simpleTypes are handled elsewhere
  • we also incorrectly use valuelength for prefixed length calculations when we ought to be using content length per the spec
  • we also do not ensure valuelength isn't getting overwritten so we add asserts to setAbsStartPos0bInBits and setAbsEndPos0bInBits to verify that
  • fix bug where padding is being added around prefixed length element (DAFFODIL-2943) by changing CaptureLengthRegion to wrap around contentlengthStart and padding
  • fix bug where we were missing return after PE for Out of Range Binary Integers (DAFFODIL-2942)
  • fix bug where we were using the main element's qname instead of the prefixed element qname in the Unparse Error message
  • refactor Prefixed parsers to use state's bitLimit to get the prefix length (PrefixedLengthParserMixin2) since the specifiedLengthPrefixedParser will take care of parsing the prefix length
  • refactored Prefixed unparsers to not try to unparse prefix length since that is taken care of by SpecifiedLengthPrefixedUnparser
  • refactored prefixed parsers and unparsers to remove unused prefixed length parser related members
  • add tests

DAFFODIL-2658

Copy link
Contributor

@jadams-tresys jadams-tresys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

(isSimpleType && (impliedRepresentation == Representation.Text || lengthKind == LengthKind.Delimited)) ||
val capturedByValueParsers =
(isSimpleType && (
primType == PrimType.String || lengthKind == LengthKind.Delimited)) ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this change is correct?

For example, say we have this element:

<xs:element name="foo" type="xs:int" dfdl:representation="text" dfdl:trimKind="padChar" dfdl:lengthKind="explicit" dfdl:length="10" ... />

So a fixed length text integer with padding. In this case I think what Daffodil does is it create a String parser to parse the fixed length string and remove padding, and then creates another parser to convert that string to an actual integer.

So in that case, even though the primType is not String, I think the String parser will still be used to capture the value length after padding is removed. So I think impliedRepresentation == Text is still needed?

@olabusayoT olabusayoT force-pushed the daf-2658-paddingNotRemoved branch 3 times, most recently from 1a0b7cc to 78050c8 Compare November 1, 2024 04:36
Copy link
Member

@stevedlawrence stevedlawrence left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable and the right approach for correctly implement prefixed length. Using things like valueLength was clearly wrong. Just a few questions.

@@ -186,21 +185,11 @@ case class HexBinaryEndOfBitLimit(e: ElementBase) extends Terminal(e, true) {
case class HexBinaryLengthPrefixed(e: ElementBase) extends Terminal(e, true) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is now exactly the same as HexBinaryEndOfBitLimit, suggest we just delete it an use that for prefixed hex binary.

new SpecifiedLengthExplicit(this, body, bitsMultiplier)
if (isSimpleType && primType == PrimType.HexBinary) {
// hexBinary has some checks that need to be done that SpecifiedLengthExplicit
// gets in the way of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead of this comment, we can say HexBinary has it's own HexBinarySpecifiedLength parser that handles calculating the length, so we do not need the SpecifiedLengthExplicit parser?

In fact, do we need to exclude a number of other primitive types that do their own explicit length handling? Looking at the current code base, I think maybe only simple types that are strings and complex types use the SpecifiedLengthExplicit parser? I think all other primitives implement their own specified length handling?

So maybe this wants to be

if (isComplexType || primType == PrimType.String) {
  SpecifiedLengthExplicit(...)
} else {
  // non-string simple types have their own custom parsers/unparsers for handling explicit lengths
  body
}

In fact, I wonder if we eventually want to refactor all of this to completely get rid of all the custom explicit/implicit length parsers? We just have various SpecifiedLength parser that sets a bit limit (based on a pattern, a prefix length, evaluaating a length expression etc) and then we just have a single parser that just reads all bit up until that current bit limit. Separation of concerns kind of thing. It would get rid of this condiation and all these BinaryIntegerKnownLength/RuntimeLength/PrefixLength/etc parsers. There's just a single BinaryNumberParser, and it just gets the length from the bitLimit.

Maybe that generality would take performance hit? I'm also not exactly sure how that would work with unparsing--the SpecifiedLengthUnparser would need to somehow pass the calculated length to the child unparser, I guess it could still use bitLimit since that is a thing in UState?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbeckerle , any thoughts on refactoring the code to have various specified length parser as described above, and any idea on how that would work with unparsing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not refactor this now. This code has gotten kind of undisciplined and been that way for a long time now.

Our performance last we checked was in the ballpark of 2x what a programmer would write by hand for simple binary data.

If we start reorganizing this sort of thing we need to rerun those tests. We're too close to a release for that.

Fixing these bugs does seem worth it however.

body
} else {
new SpecifiedLengthExplicit(this, body, bitsMultiplier)
}
case LengthKind.Explicit => {
Assert.invariant(!knownEncodingIsFixedWidth)
Assert.invariant(lengthUnits eq LengthUnits.Characters)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Below this we have cases for implicit lengths. Do we need to do anything special for non-string simple types? I think those primitives have custom parsers that handle the implict length logic and don't need a SpecifiedLengthImplicit gramar? My concern is we could be adding that grammar and it would do something like set a bit limit, but the child paser that actually parsrers a the thing would just use it's own calculate and wouldn't need the bit limit, so we are just wasting effort.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest create a cleanup issue ticket for this. It's a performance (and maintainability) improvement but not specifically about this PR's primary goal is it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is a cleanup of existing issues. I think this change could possibly lead to adding duplicate parsers and might slow things down. So it's not that we already have duplicates, but this change could lead to new duplicates.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I believe this bug just fixes issues with prefixed lengths, which I think has a number of other outstanding issues. So even if we fix this one, I'm not sure prefixedLength is still going to be safe to use. Feels like it's less criticial to get into this release, and due to possible regressions is worth pushing to the next release.

e.lengthUnits,
e.prefixedLengthAdjustmentInUnits
)
override lazy val parser = new BCDIntegerPrefixedLengthParser(e.elementRuntimeData)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we want to rename these something like BCDIntegerBitLimitLengthParser and BCDIntegerMinimumLengthUnparser, making it clear that they aren't really doing anything specifically with prefixed length, and better describes the behavior of how they actuall parse/unparse things?

And when we implement things like lengthKind endOfParent, we could probably just use the same BitLimitParser, for example.

* This mixin doesn't require parsing the prefix length element and just uses
* the state's bitLimit and position to get the bitLength instead
*/
trait PrefixedLengthParserMixin2 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a different name for this. Numbers are not descriptive enough. Maybe we renames these something like PrefixLengthFromParserMixin and PrefixLengthFromBitLimitMixin? Something to make it more clear how they differ without having to read the code.

case Prefixed => true
case _ => false
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can just be lengthKind eq LengthKind.Prefixed. Don't really need a match/case if we are just going to return true/false.

import LengthKind._
lengthKind match {
case Delimited =>
true // don't test for hasDelimiters because it might not be our delimiter, but a surrounding group's separator, or it's terminator, etc.
case Pattern => true
case Prefixed => true
case Prefixed => isPrefixed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest just returning true here, isPrefixed doesn't do anything except check lengthKind == Prefixed, which is exactly what this match case does.

Copy link
Member

@stevedlawrence stevedlawrence left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still not clear to me that we are correctly adding/not adding SpecifiedLength* parsers. Do we have any documentation or any easy way to tell that we are doing things correctly, and only adding those parsers when they are actually needed?

@@ -289,7 +289,8 @@ object DaffodilCCodeGenerator
case g: ElementParseAndUnspecifiedLength =>
elementParseAndUnspecifiedLengthGenerateCode(g, cgState)
case g: ElementUnused => noop(g)
case g: HexBinaryLengthPrefixed => hexBinaryLengthPrefixedGenerateCode(g.e, cgState)
case g: HexBinaryEndOfBitLimit if g.e.isPrefixed =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, maybe my suggestion was wrong and the primitive still wants to be something like HexBInaryLengthPrefixed so that the grammar is obvious and things like code generators/etc can use the obvious grammar names? The parsers generated don't necessarily have to match the names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm...do you mean the suggestion to remove HexBinaryLengthPrefixed or the suggestion to rename all PrefixedLength parsers/unparser to BitLimitLength/MinimumLength?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that wasn't clear. I'm suggesting that we should keep the old grammar name HexBinaryLengthPrefixed so this change isn't needed. This way the grammar is a more clear for things like the code generator that examine it. But it's still fine to call the parsers BitLimitLength/MinimumLength etc from that grammar.

if isSimpleType && impliedRepresentation == Representation.Binary =>
if isSimpleType &&
impliedRepresentation == Representation.Binary &&
primType != PrimType.HexBinary =>
new SpecifiedLengthImplicit(this, body, implicitBinaryLengthInBits)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition feels like it doesn't exclude enough things. For example, this matches all binary types except hex binary. But, for example, integers with length kind implicit use something like BinaryIntegerKnownLength primitive/parser, which doesn't rely on SpecifiedLengthImplicit parser. It doesn't necessarily hurt, but it's just unnecessary work and might slow things down, since the BinaryIntegerKnownLengthParser is going to do that work.

I'm wonder if it's just string types that really make use of specified length stuff?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So It look like nillable binary types need SpecifiedLengthImplicit., they are the only tests failing when I comment out that chunk of code

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that means we only need the SpecifiedLengthImplicit when we are trying to parse the nillable part of a number. For example, the way things currently work is we have something like a SimpleNilOrValueParser and that has two children parsers, one is the nil parser (which sounds like it requires SpecifiedLengthImplicit) and the other is the binary paser (which probably does not need SpecifiedLenghtImplicit).

I wonder if captureLengthRegion needs a new paramater (e.g. forNilContent: Boolean) which is passed into specifiedLength. This way specifiedLength can do different things depending on if it's trying to represent nil content or non-nil content.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest parking this cleanup as a separate ticket. Unnecessary parsers are not optimized out in many cases I bet, and this is just one such case where specified length stuff is unnecessary but not removed.

- currently after trimming the value of the element, we set the valueLength, and then overwrite it after returning from the parse that does the trimming. This results in the wrong value for value length. This fixes it by checking if the valueLength has already been set, and only setting it in SpecifiedLengthParserBase.parse if it hasn't
- add tests

DAFFODIL-2658
- update check to captureValueLength in base parser if it's a complex element
- update shouldCaptureParseValueLength to use PrimType.String instead of representation Text, since strings valueLengths (and delimited valueLengths) are always handled by value parsers

DAFFODIL-2658
- add asserts to setAbsStartPos0bInBits and setAbsEndPos0bInBits to ensure they're not being overwritten
- undo change to capturedByValueParsers to put back im[liedRepresentation == Text
- do not capture the value lengths of choices in specified length parser base
- fix bug where padding is being added around prefixed length element (DAFFODIL-2943) by changing CaptureLengthRegion to wrap around contentlengthStart and padding
- fix bug where we use the valuelength to calculate the prefix length, according to the spec it should be the content length
- fix bug where we were missing return after PE for Out of Range Binary Integers (DAFFODIL-2942)
- fix bug where we were using the main element's qname instead of the prefixed element qname in the Unparse Error message
- refactor Prefixed parsers to use state's bitLimit to get the prefix length (PrefixedLengthParserMixin2) since the specifiedLengthPrefixedParser will take care of parsing the prefix length
- refactored Prefixed unparsers to not try to unparse prefix length since that is taken care of by SpecifiedLengthPrefixedUnparser
- refactored prefixed parsers and unparsers to remove unused prefixed length parser related members

DAFFODIL-2658
… Trim

- rename custom prefixedlength parsers to *BitLengthParsers to more accurately reflect what they're doing
- rename custom prefixedlength unparsers to MinimumLengthUnparsers to more accurately reflect what they're doing
- rename PrefixedLengthParserMixin2 to BitLengthFromBitLimitMixin to more accurately reflect what it is
- got rid of HexBinaryLengthPrefixed since it is the same as HexBinaryEndOfBitLimit

DAFFODIL-2658
@olabusayoT olabusayoT force-pushed the daf-2658-paddingNotRemoved branch from ccdc433 to 8d30149 Compare January 9, 2025 20:56
Copy link
Contributor

@mbeckerle mbeckerle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Code looks good. Questions on how to organize this code better long term remain, but that requires more than a PR review.

new SpecifiedLengthExplicit(this, body, bitsMultiplier)
if (isSimpleType && primType == PrimType.HexBinary) {
// hexBinary has some checks that need to be done that SpecifiedLengthExplicit
// gets in the way of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not refactor this now. This code has gotten kind of undisciplined and been that way for a long time now.

Our performance last we checked was in the ballpark of 2x what a programmer would write by hand for simple binary data.

If we start reorganizing this sort of thing we need to rerun those tests. We're too close to a release for that.

Fixing these bugs does seem worth it however.

if isSimpleType && impliedRepresentation == Representation.Binary =>
if isSimpleType &&
impliedRepresentation == Representation.Binary &&
primType != PrimType.HexBinary =>
new SpecifiedLengthImplicit(this, body, implicitBinaryLengthInBits)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest parking this cleanup as a separate ticket. Unnecessary parsers are not optimized out in many cases I bet, and this is just one such case where specified length stuff is unnecessary but not removed.

@olabusayoT
Copy link
Contributor Author

Created DAFFODIL-2967 to ensure we're properly adding/not adding the schemas

@mbeckerle , the piece of code you highlighted as being better off removed, removing that bit caused test failures iirc!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants