Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for string values #58

Merged
merged 11 commits into from
Jun 11, 2020
5 changes: 0 additions & 5 deletions ion/integration_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -194,7 +194,6 @@ var malformedIonsSkipList = []string{
"localSymbolTableWithMultipleSymbolsAndImportsFields.ion",
"localSymbolTableWithMultipleSymbolsFields.10n",
"localSymbolTableWithMultipleSymbolsFields.ion",
"longStringRawControlCharacter.ion",
"minLongWithLenTooLarge.10n",
"minLongWithLenTooSmall.10n",
"negativeIntZero.10n",
Expand All @@ -203,8 +202,6 @@ var malformedIonsSkipList = []string{
"nopPadWithAnnotations.10n",
"nullDotCommentInt.ion",
"sexpOperatorAnnotation.ion",
"stringLenTooLarge.10n",
"stringRawControlCharacter.ion",
"stringWithLatinEncoding.10n",
"structOrderedEmpty.10n",
"surrogate_1.ion",
Expand All @@ -225,7 +222,6 @@ var malformedIonsSkipList = []string{

var equivsSkipList = []string{
"annotatedIvms.ion",
"clobs.ion",
"localSymbolTableAppend.ion",
"localSymbolTableNullSlots.ion",
"localSymbolTableWithAnnotations.ion",
Expand All @@ -234,7 +230,6 @@ var equivsSkipList = []string{
"nonIVMNoOps.ion",
"sexps.ion",
"stringUtf8.ion",
"strings.ion",
"structsFieldsDiffOrder.ion",
"structsFieldsRepeatedNames.ion",
"systemSymbols.ion",
Expand Down
42 changes: 35 additions & 7 deletions ion/tokenizer.go
Original file line number Diff line number Diff line change
Expand Up @@ -542,11 +542,12 @@ func (t *tokenizer) readString() (string, error) {
if err != nil {
return "", err
}

switch c {
case -1, '\n':
// -1 denotes EOF, and new lines are not allowed in short string
if c == -1 || c == '\n' || isProhibitedControlChar(c) {
return "", t.invalidChar(c)
}

switch c {
case '"':
return ret.String(), nil

Expand Down Expand Up @@ -582,20 +583,25 @@ func (t *tokenizer) readLongString() (string, error) {
if err != nil {
return "", err
}

switch c {
case -1:
// -1 denotes EOF
if c == -1 || isProhibitedControlChar(c) {
return "", t.invalidChar(c)
}

switch c {
case '\'':
startPosition := t.pos
ok, err := t.skipEndOfLongString(t.skipCommentsHandler)
if err != nil {
return "", err
}
if ok {
return ret.String(), nil
}

if startPosition == t.pos {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like skipEndOfLongString reports whether it consumed the string ending, so I think you can replace this position check with an else {...} attached to the if ok {...}.

wasConsumed, err := t.skipEndOfLongString(t.skipCommentsHandler)
if err != nil {
	return "", err
}
if wasConsumed {
	return ret.String(), nil
} else {
        // The ' was not part of a long string ending.
        ret.writeByte(byte(c))
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll also return false if it skipped over a ''' /* pause in the string */ ''' sequence, in which case you don't want to keep the '. :( This seems correct to me in the short term, skipEndOfLongString should probably should be refactored to return a tri-state in the longer term.

Copy link
Contributor Author

@R-maan R-maan Jun 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I almost refactored skipEndOfLongString but then I decided to KISS.

#61 to change skipEndOfLongString

// No character has been consumed. It is single '.
ret.WriteByte(byte(c))
}
case '\\':
c, err = t.peek()
if err != nil {
Expand Down Expand Up @@ -1263,3 +1269,25 @@ func (t *tokenizer) unread(c int) {
t.pos--
t.buffer = append(t.buffer, c)
}

func isProhibitedControlChar(c int) bool {
// Values between 0 to 31 are non-displayable ASCII characters; except for new line and white space characters.
if c < 0x00 || c > 0x1F {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took a bit of investigation to understand why a negative int was considered a valid character. Taking a look at read(), I found that it's returning -1 instead of an EOF err like Buffer.ReadByte does (docs). I'd be in favor of refactoring this behavior to align with the standard library in the future, but in the meantime I think that isInvalidCharacter should treat EOF (-1) as an invalid character, simplifying this check to:

// Values lower than this are non-displayable ASCII characters
if c > 0x1F {
  return false;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a couple things easier treating EOF as a character instead of an error, and it's an internal-only API so it felt like a good trade-off. It's very possible that's just my old C habits speaking though. :) Feel free to refactor it if you think it improves readability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for having this treat -1 as an invalid char. In both places we're calling it we're checking for -1 immediately after, which you could then remove.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified and created #60 to refactor read()

return false
}
if isStringWhitespace(c) || isNewLineChar(c) {
return false
}
return true
}

func isStringWhitespace(c int) bool {
return c == 0x09 || //horizontal tab
c == 0x0B || //vertical tab
c == 0x0C // form feed
}

func isNewLineChar(c int) bool {
return c == 0x0A || //new line
c == 0x0D //carriage return
}