Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for string values #58

Merged
merged 11 commits into from
Jun 11, 2020
Merged

Conversation

R-maan
Copy link
Contributor

@R-maan R-maan commented Jun 6, 2020

This pull request contains fixes for 2 cases:

  • Single quote in a long string: ''' ion's fun! '''
  • Strings with control characters

@codecov-commenter
Copy link

codecov-commenter commented Jun 6, 2020

Codecov Report

Merging #58 into master will increase coverage by 0.21%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #58      +/-   ##
==========================================
+ Coverage   75.96%   76.17%   +0.21%     
==========================================
  Files          22       22              
  Lines        4609     4625      +16     
==========================================
+ Hits         3501     3523      +22     
+ Misses        682      679       -3     
+ Partials      426      423       -3     
Impacted Files Coverage Δ
ion/tokenizer.go 79.68% <100.00%> (+0.44%) ⬆️
ion/binaryreader.go 80.95% <0.00%> (+0.68%) ⬆️
ion/bitstream.go 79.17% <0.00%> (+0.82%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 88b9ce5...39f92b5. Read the comment docs.

fernomac
fernomac previously approved these changes Jun 6, 2020
Copy link
Contributor

@fernomac fernomac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catches!

(It might keep the commit history a bit cleaner to rebase your branch if you need to pick up changes from master, rather than repeatedly merging? No real problem from my end if you prefer the merge approach.)

ion/tokenizer.go Outdated
@@ -542,6 +542,9 @@ func (t *tokenizer) readString() (string, error) {
if err != nil {
return "", err
}
if t.isProhibitedControlChar(c) {
return "", &SyntaxError{"Invalid character", t.pos}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t.invalidChar(c)?

ion/tokenizer.go Outdated
@@ -668,6 +678,27 @@ func (t *tokenizer) readEscapedChar(clob bool) (rune, error) {
return 0, &SyntaxError{fmt.Sprintf("bad escape sequence '\\%c'", c), t.pos - 2}
}

func (t *tokenizer) isProhibitedControlChar(c int) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t is unused here, so maybe worth dropping it and adding these functions to textutils.go?

@R-maan
Copy link
Contributor Author

R-maan commented Jun 7, 2020

Thanks @fernomac , I applied your suggestions and updated the PR.

I do not have a preference between merge versus rebase (even though as you mentioned, rebase keeps the history cleaner).
Which ever we decide, I am fine with it. Any thoughts on this @zslayton ?

@zslayton
Copy link
Contributor

zslayton commented Jun 9, 2020

I do not have a preference between merge versus rebase (even though as you mentioned, rebase keeps the history cleaner).
Which ever we decide, I am fine with it. Any thoughts on this @zslayton ?

My preference is to rebase while the PR is a draft and merge once the first review has happened. This means that reviewers will see a version of the code that was up-to-date with master when the (non-draft) PR was opened, but that the discussion around the code doesn't get blown away by rebasing after comments have been added.

In short, if you have merges in your commit history when you go to request your first review, rebase to clean it up. When the PR is approved, we'll merge from master and re-review if large changes were required. Finally, we'll squash+merge back into master.

@@ -1263,3 +1273,24 @@ func (t *tokenizer) unread(c int) {
t.pos--
t.buffer = append(t.buffer, c)
}

func isInvalidChar(c int) bool {
if c < 0x00 || c > 0x1F {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took a bit of investigation to understand why a negative int was considered a valid character. Taking a look at read(), I found that it's returning -1 instead of an EOF err like Buffer.ReadByte does (docs). I'd be in favor of refactoring this behavior to align with the standard library in the future, but in the meantime I think that isInvalidCharacter should treat EOF (-1) as an invalid character, simplifying this check to:

// Values lower than this are non-displayable ASCII characters
if c > 0x1F {
  return false;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a couple things easier treating EOF as a character instead of an error, and it's an internal-only API so it felt like a good trade-off. It's very possible that's just my old C habits speaking though. :) Feel free to refactor it if you think it improves readability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for having this treat -1 as an invalid char. In both places we're calling it we're checking for -1 immediately after, which you could then remove.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified and created #60 to refactor read()

ion/tokenizer.go Outdated
@@ -582,20 +585,27 @@ func (t *tokenizer) readLongString() (string, error) {
if err != nil {
return "", err
}
if isInvalidChar(c) {
return "", &SyntaxError{"Invalid character", t.pos}
}

switch c {
case -1:
return "", t.invalidChar(c)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're reporting an invalid character here when we encounter an EOF, which may be confusing. Looks like we do the same thing in readString above. In both cases, I suggest handling the EOF case immediately after read() out so we can provide a more informative error to the user.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t.invalidChar, as previously suggested, does the right thing here (and should probably have a better name :)).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

ok, err := t.skipEndOfLongString(t.skipCommentsHandler)
if err != nil {
return "", err
}
if ok {
return ret.String(), nil
}

if startPosition == t.pos {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like skipEndOfLongString reports whether it consumed the string ending, so I think you can replace this position check with an else {...} attached to the if ok {...}.

wasConsumed, err := t.skipEndOfLongString(t.skipCommentsHandler)
if err != nil {
	return "", err
}
if wasConsumed {
	return ret.String(), nil
} else {
        // The ' was not part of a long string ending.
        ret.writeByte(byte(c))
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll also return false if it skipped over a ''' /* pause in the string */ ''' sequence, in which case you don't want to keep the '. :( This seems correct to me in the short term, skipEndOfLongString should probably should be refactored to return a tri-state in the longer term.

Copy link
Contributor Author

@R-maan R-maan Jun 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I almost refactored skipEndOfLongString but then I decided to KISS.

#61 to change skipEndOfLongString

ion/tokenizer.go Outdated
@@ -1263,3 +1273,24 @@ func (t *tokenizer) unread(c int) {
t.pos--
t.buffer = append(t.buffer, c)
}

func isInvalidChar(c int) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name's a bit ambiguous at the top level. Maybe isInvalidStringChar? It'd fit nicely alongside the other isXxx functions in textutils.go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that's embarrassing! I read this comment and thought you're suggesting to change the name of the function, totally misread it.
I originally had named it isProhibitedControlChar. Changing it back

ion/tokenizer.go Outdated
return true
}

func isWhiteSpaceChar(c int) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isWhitespace is similarly-named but works a bit differently. Maybe worth combining them? Otherwise maybe rename this one to isStringWhitespace or something like that.

ion/tokenizer.go Outdated

switch c {
case -1, '\n':
if isProhibitedControlChar(c) || c == '\n' {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should explicitly handle the EOF/-1 case here and below. isProhibitedControlChar will catch it, but the method name makes me think it's only looking for ASCII control characters, which doesn't include -1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point, but that means going back to this comment.
We can either change the function name to something more generic like invalidStringCharacter or put the logic before this commit back.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a subtle difference here. We've gone from:

if isInvalidChar(c) {
    // Doesn't handle EOF, but could have based on the name
}
switch c {
case -1: // Handles EOF

to

if isProhibitedControlChar(c) {
     // Handles EOF, but shouldn't based on the name.
}

I think isProhibitedControlChar is a more precise/communicative name, so I'd like to keep it. But that means we need the explicit EOF check:

if c == -1 || c == '\n' || isProhibitedControlChar(c) {
    // Handles EOF
}

I think this is especially helpful since someone new to the codebase could reasonably expect EOF to be handled by the

if err != nil {
// ...
}

above, since err is how the standard library reports EOF.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. Updated the pull request. Thanks.

@zslayton zslayton merged commit 82b4807 into amazon-ion:master Jun 11, 2020
@R-maan R-maan deleted the string-triple-quote branch June 12, 2020 18:51
@R-maan R-maan mentioned this pull request Aug 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants