-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Bug with valueLength being overwritten after Trim #1338
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
daffodil-test/src/test/resources/org/apache/daffodil/section12/lengthKind/PrefixedTests.tdml
Outdated
Show resolved
Hide resolved
.../src/main/scala/org/apache/daffodil/runtime1/processors/parsers/SpecifiedLengthParsers.scala
Show resolved
Hide resolved
.../src/main/scala/org/apache/daffodil/runtime1/processors/parsers/SpecifiedLengthParsers.scala
Show resolved
Hide resolved
daffodil-core/src/main/scala/org/apache/daffodil/core/runtime1/ElementBaseRuntime1Mixin.scala
Show resolved
Hide resolved
(isSimpleType && (impliedRepresentation == Representation.Text || lengthKind == LengthKind.Delimited)) || | ||
val capturedByValueParsers = | ||
(isSimpleType && ( | ||
primType == PrimType.String || lengthKind == LengthKind.Delimited)) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if this change is correct?
For example, say we have this element:
<xs:element name="foo" type="xs:int" dfdl:representation="text" dfdl:trimKind="padChar" dfdl:lengthKind="explicit" dfdl:length="10" ... />
So a fixed length text integer with padding. In this case I think what Daffodil does is it create a String parser to parse the fixed length string and remove padding, and then creates another parser to convert that string to an actual integer.
So in that case, even though the primType is not String, I think the String parser will still be used to capture the value length after padding is removed. So I think impliedRepresentation == Text
is still needed?
1a0b7cc
to
78050c8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable and the right approach for correctly implement prefixed length. Using things like valueLength was clearly wrong. Just a few questions.
@@ -186,21 +185,11 @@ case class HexBinaryEndOfBitLimit(e: ElementBase) extends Terminal(e, true) { | |||
case class HexBinaryLengthPrefixed(e: ElementBase) extends Terminal(e, true) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class is now exactly the same as HexBinaryEndOfBitLimit, suggest we just delete it an use that for prefixed hex binary.
new SpecifiedLengthExplicit(this, body, bitsMultiplier) | ||
if (isSimpleType && primType == PrimType.HexBinary) { | ||
// hexBinary has some checks that need to be done that SpecifiedLengthExplicit | ||
// gets in the way of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think instead of this comment, we can say HexBinary has it's own HexBinarySpecifiedLength parser that handles calculating the length, so we do not need the SpecifiedLengthExplicit parser?
In fact, do we need to exclude a number of other primitive types that do their own explicit length handling? Looking at the current code base, I think maybe only simple types that are strings and complex types use the SpecifiedLengthExplicit parser? I think all other primitives implement their own specified length handling?
So maybe this wants to be
if (isComplexType || primType == PrimType.String) {
SpecifiedLengthExplicit(...)
} else {
// non-string simple types have their own custom parsers/unparsers for handling explicit lengths
body
}
In fact, I wonder if we eventually want to refactor all of this to completely get rid of all the custom explicit/implicit length parsers? We just have various SpecifiedLength parser that sets a bit limit (based on a pattern, a prefix length, evaluaating a length expression etc) and then we just have a single parser that just reads all bit up until that current bit limit. Separation of concerns kind of thing. It would get rid of this condiation and all these BinaryIntegerKnownLength/RuntimeLength/PrefixLength/etc parsers. There's just a single BinaryNumberParser, and it just gets the length from the bitLimit.
Maybe that generality would take performance hit? I'm also not exactly sure how that would work with unparsing--the SpecifiedLengthUnparser would need to somehow pass the calculated length to the child unparser, I guess it could still use bitLimit since that is a thing in UState?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbeckerle , any thoughts on refactoring the code to have various specified length parser as described above, and any idea on how that would work with unparsing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not refactor this now. This code has gotten kind of undisciplined and been that way for a long time now.
Our performance last we checked was in the ballpark of 2x what a programmer would write by hand for simple binary data.
If we start reorganizing this sort of thing we need to rerun those tests. We're too close to a release for that.
Fixing these bugs does seem worth it however.
body | ||
} else { | ||
new SpecifiedLengthExplicit(this, body, bitsMultiplier) | ||
} | ||
case LengthKind.Explicit => { | ||
Assert.invariant(!knownEncodingIsFixedWidth) | ||
Assert.invariant(lengthUnits eq LengthUnits.Characters) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Below this we have cases for implicit lengths. Do we need to do anything special for non-string simple types? I think those primitives have custom parsers that handle the implict length logic and don't need a SpecifiedLengthImplicit gramar? My concern is we could be adding that grammar and it would do something like set a bit limit, but the child paser that actually parsrers a the thing would just use it's own calculate and wouldn't need the bit limit, so we are just wasting effort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest create a cleanup issue ticket for this. It's a performance (and maintainability) improvement but not specifically about this PR's primary goal is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is a cleanup of existing issues. I think this change could possibly lead to adding duplicate parsers and might slow things down. So it's not that we already have duplicates, but this change could lead to new duplicates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I believe this bug just fixes issues with prefixed lengths, which I think has a number of other outstanding issues. So even if we fix this one, I'm not sure prefixedLength is still going to be safe to use. Feels like it's less criticial to get into this release, and due to possible regressions is worth pushing to the next release.
e.lengthUnits, | ||
e.prefixedLengthAdjustmentInUnits | ||
) | ||
override lazy val parser = new BCDIntegerPrefixedLengthParser(e.elementRuntimeData) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we want to rename these something like BCDIntegerBitLimitLengthParser
and BCDIntegerMinimumLengthUnparser
, making it clear that they aren't really doing anything specifically with prefixed length, and better describes the behavior of how they actuall parse/unparse things?
And when we implement things like lengthKind endOfParent, we could probably just use the same BitLimitParser, for example.
* This mixin doesn't require parsing the prefix length element and just uses | ||
* the state's bitLimit and position to get the bitLength instead | ||
*/ | ||
trait PrefixedLengthParserMixin2 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need a different name for this. Numbers are not descriptive enough. Maybe we renames these something like PrefixLengthFromParserMixin
and PrefixLengthFromBitLimitMixin
? Something to make it more clear how they differ without having to read the code.
case Prefixed => true | ||
case _ => false | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can just be lengthKind eq LengthKind.Prefixed
. Don't really need a match/case if we are just going to return true/false.
import LengthKind._ | ||
lengthKind match { | ||
case Delimited => | ||
true // don't test for hasDelimiters because it might not be our delimiter, but a surrounding group's separator, or it's terminator, etc. | ||
case Pattern => true | ||
case Prefixed => true | ||
case Prefixed => isPrefixed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest just returning true here, isPrefixed doesn't do anything except check lengthKind == Prefixed, which is exactly what this match case does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's still not clear to me that we are correctly adding/not adding SpecifiedLength* parsers. Do we have any documentation or any easy way to tell that we are doing things correctly, and only adding those parsers when they are actually needed?
@@ -289,7 +289,8 @@ object DaffodilCCodeGenerator | |||
case g: ElementParseAndUnspecifiedLength => | |||
elementParseAndUnspecifiedLengthGenerateCode(g, cgState) | |||
case g: ElementUnused => noop(g) | |||
case g: HexBinaryLengthPrefixed => hexBinaryLengthPrefixedGenerateCode(g.e, cgState) | |||
case g: HexBinaryEndOfBitLimit if g.e.isPrefixed => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, maybe my suggestion was wrong and the primitive still wants to be something like HexBInaryLengthPrefixed so that the grammar is obvious and things like code generators/etc can use the obvious grammar names? The parsers generated don't necessarily have to match the names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm...do you mean the suggestion to remove HexBinaryLengthPrefixed or the suggestion to rename all PrefixedLength parsers/unparser to BitLimitLength/MinimumLength?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry that wasn't clear. I'm suggesting that we should keep the old grammar name HexBinaryLengthPrefixed
so this change isn't needed. This way the grammar is a more clear for things like the code generator that examine it. But it's still fine to call the parsers BitLimitLength/MinimumLength
etc from that grammar.
daffodil-core/src/main/scala/org/apache/daffodil/core/grammar/ElementBaseGrammarMixin.scala
Show resolved
Hide resolved
if isSimpleType && impliedRepresentation == Representation.Binary => | ||
if isSimpleType && | ||
impliedRepresentation == Representation.Binary && | ||
primType != PrimType.HexBinary => | ||
new SpecifiedLengthImplicit(this, body, implicitBinaryLengthInBits) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This condition feels like it doesn't exclude enough things. For example, this matches all binary types except hex binary. But, for example, integers with length kind implicit use something like BinaryIntegerKnownLength
primitive/parser, which doesn't rely on SpecifiedLengthImplicit parser. It doesn't necessarily hurt, but it's just unnecessary work and might slow things down, since the BinaryIntegerKnownLengthParser is going to do that work.
I'm wonder if it's just string types that really make use of specified length stuff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So It look like nillable binary types need SpecifiedLengthImplicit., they are the only tests failing when I comment out that chunk of code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that means we only need the SpecifiedLengthImplicit when we are trying to parse the nillable part of a number. For example, the way things currently work is we have something like a SimpleNilOrValueParser
and that has two children parsers, one is the nil parser (which sounds like it requires SpecifiedLengthImplicit) and the other is the binary paser (which probably does not need SpecifiedLenghtImplicit).
I wonder if captureLengthRegion needs a new paramater (e.g. forNilContent: Boolean
) which is passed into specifiedLength
. This way specifiedLength
can do different things depending on if it's trying to represent nil content or non-nil content.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest parking this cleanup as a separate ticket. Unnecessary parsers are not optimized out in many cases I bet, and this is just one such case where specified length stuff is unnecessary but not removed.
- currently after trimming the value of the element, we set the valueLength, and then overwrite it after returning from the parse that does the trimming. This results in the wrong value for value length. This fixes it by checking if the valueLength has already been set, and only setting it in SpecifiedLengthParserBase.parse if it hasn't - add tests DAFFODIL-2658
- update check to captureValueLength in base parser if it's a complex element - update shouldCaptureParseValueLength to use PrimType.String instead of representation Text, since strings valueLengths (and delimited valueLengths) are always handled by value parsers DAFFODIL-2658
- add asserts to setAbsStartPos0bInBits and setAbsEndPos0bInBits to ensure they're not being overwritten - undo change to capturedByValueParsers to put back im[liedRepresentation == Text - do not capture the value lengths of choices in specified length parser base - fix bug where padding is being added around prefixed length element (DAFFODIL-2943) by changing CaptureLengthRegion to wrap around contentlengthStart and padding - fix bug where we use the valuelength to calculate the prefix length, according to the spec it should be the content length - fix bug where we were missing return after PE for Out of Range Binary Integers (DAFFODIL-2942) - fix bug where we were using the main element's qname instead of the prefixed element qname in the Unparse Error message - refactor Prefixed parsers to use state's bitLimit to get the prefix length (PrefixedLengthParserMixin2) since the specifiedLengthPrefixedParser will take care of parsing the prefix length - refactored Prefixed unparsers to not try to unparse prefix length since that is taken care of by SpecifiedLengthPrefixedUnparser - refactored prefixed parsers and unparsers to remove unused prefixed length parser related members DAFFODIL-2658
… Trim - rename custom prefixedlength parsers to *BitLengthParsers to more accurately reflect what they're doing - rename custom prefixedlength unparsers to MinimumLengthUnparsers to more accurately reflect what they're doing - rename PrefixedLengthParserMixin2 to BitLengthFromBitLimitMixin to more accurately reflect what it is - got rid of HexBinaryLengthPrefixed since it is the same as HexBinaryEndOfBitLimit DAFFODIL-2658
ccdc433
to
8d30149
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 Code looks good. Questions on how to organize this code better long term remain, but that requires more than a PR review.
new SpecifiedLengthExplicit(this, body, bitsMultiplier) | ||
if (isSimpleType && primType == PrimType.HexBinary) { | ||
// hexBinary has some checks that need to be done that SpecifiedLengthExplicit | ||
// gets in the way of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not refactor this now. This code has gotten kind of undisciplined and been that way for a long time now.
Our performance last we checked was in the ballpark of 2x what a programmer would write by hand for simple binary data.
If we start reorganizing this sort of thing we need to rerun those tests. We're too close to a release for that.
Fixing these bugs does seem worth it however.
if isSimpleType && impliedRepresentation == Representation.Binary => | ||
if isSimpleType && | ||
impliedRepresentation == Representation.Binary && | ||
primType != PrimType.HexBinary => | ||
new SpecifiedLengthImplicit(this, body, implicitBinaryLengthInBits) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest parking this cleanup as a separate ticket. Unnecessary parsers are not optimized out in many cases I bet, and this is just one such case where specified length stuff is unnecessary but not removed.
Created DAFFODIL-2967 to ensure we're properly adding/not adding the schemas @mbeckerle , the piece of code you highlighted as being better off removed, removing that bit caused test failures iirc! |
DAFFODIL-2658