Fix for string values #58

R-maan · 2020-06-06T00:12:36Z

This pull request contains fixes for 2 cases:

- Single quote in a long string: ''' ion's fun! '''
- Strings with control characters

codecov-commenter · 2020-06-06T00:14:04Z

Codecov Report

Merging #58 into master will increase coverage by 0.21%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #58      +/-   ##
==========================================
+ Coverage   75.96%   76.17%   +0.21%     
==========================================
  Files          22       22              
  Lines        4609     4625      +16     
==========================================
+ Hits         3501     3523      +22     
+ Misses        682      679       -3     
+ Partials      426      423       -3

Impacted Files	Coverage Δ
ion/tokenizer.go	`79.68% <100.00%> (+0.44%)`	⬆️
ion/binaryreader.go	`80.95% <0.00%> (+0.68%)`	⬆️
ion/bitstream.go	`79.17% <0.00%> (+0.82%)`	⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 88b9ce5...39f92b5.

fernomac

Nice catches!

(It might keep the commit history a bit cleaner to rebase your branch if you need to pick up changes from master, rather than repeatedly merging? No real problem from my end if you prefer the merge approach.)

fernomac · 2020-06-06T01:58:24Z

ion/tokenizer.go

@@ -542,6 +542,9 @@ func (t *tokenizer) readString() (string, error) {
 		if err != nil {
 			return "", err
 		}
+		if t.isProhibitedControlChar(c) {
+			return "", &SyntaxError{"Invalid character", t.pos}


t.invalidChar(c)?

fernomac · 2020-06-06T02:03:16Z

ion/tokenizer.go

@@ -668,6 +678,27 @@ func (t *tokenizer) readEscapedChar(clob bool) (rune, error) {
 	return 0, &SyntaxError{fmt.Sprintf("bad escape sequence '\\%c'", c), t.pos - 2}
 }

+func (t *tokenizer) isProhibitedControlChar(c int) bool {


t is unused here, so maybe worth dropping it and adding these functions to textutils.go?

R-maan · 2020-06-07T19:59:20Z

Thanks @fernomac, I applied your suggestions and updated the PR.

I do not have a preference between merge and rebase (even though, as you mentioned, rebase keeps the history cleaner).
Whichever we decide, I am fine with it. Any thoughts on this, @zslayton?

zslayton · 2020-06-09T13:29:01Z

> I do not have a preference between merge versus rebase (even though as you mentioned, rebase keeps the history cleaner).
> Which ever we decide, I am fine with it. Any thoughts on this @zslayton ?

My preference is to rebase while the PR is a draft and merge once the first review has happened. This means that reviewers will see a version of the code that was up-to-date with master when the (non-draft) PR was opened, but that the discussion around the code doesn't get blown away by rebasing after comments have been added.

In short, if you have merges in your commit history when you go to request your first review, rebase to clean it up. When the PR is approved, we'll merge from master and re-review if large changes were required. Finally, we'll squash+merge back into master.

zslayton · 2020-06-09T13:52:53Z

ion/tokenizer.go

@@ -1263,3 +1273,24 @@ func (t *tokenizer) unread(c int) {
 	t.pos--
 	t.buffer = append(t.buffer, c)
 }
+
+func isInvalidChar(c int) bool {
+	if c < 0x00 || c > 0x1F {


It took a bit of investigation to understand why a negative int was considered a valid character. Taking a look at read(), I found that it returns -1 instead of an EOF err like Buffer.ReadByte does (docs). I'd be in favor of refactoring this behavior to align with the standard library in the future, but in the meantime I think that isInvalidChar should treat EOF (-1) as an invalid character, simplifying this check to:

    // Values lower than this are non-displayable ASCII characters
    if c > 0x1F {
        return false;
    }

I found a couple things easier treating EOF as a character instead of an error, and it's an internal-only API so it felt like a good trade-off. It's very possible that's just my old C habits speaking though. :) Feel free to refactor it if you think it improves readability.

👍 for having this treat -1 as an invalid char. In both places we're calling it we're checking for -1 immediately after, which you could then remove.

Modified and created #60 to refactor read()
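
For illustration, the simplified check discussed in this thread might look like the minimal sketch below. The helper name isStringWhitespace and the exact set of exempted whitespace characters are assumptions made for the example, not the merged code. The point is only that a negative value (the -1 that read() returns at EOF) falls through to the final return true.

```go
package main

import "fmt"

// isStringWhitespace is a hypothetical helper: it reports whether c is a
// whitespace control character that this sketch allows inside a string.
func isStringWhitespace(c int) bool {
	switch c {
	case '\t', '\n', '\v', '\f', '\r':
		return true
	}
	return false
}

// isInvalidChar reports whether c may not appear unescaped in a string.
// Assumption: EOF is represented as -1, as read() does in this thread.
func isInvalidChar(c int) bool {
	// Anything above 0x1F is not an ASCII control character.
	if c > 0x1F {
		return false
	}
	if isStringWhitespace(c) {
		return false
	}
	// What remains: other control characters and negative values (EOF).
	return true
}

func main() {
	for _, c := range []int{-1, 0x00, '\t', 'a'} {
		fmt.Printf("%d invalid=%v\n", c, isInvalidChar(c))
	}
}
```

With this shape, callers would no longer need a separate -1 check right after calling the predicate, which is the simplification suggested above.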

zslayton · 2020-06-09T13:58:15Z

ion/tokenizer.go

@@ -582,20 +585,27 @@ func (t *tokenizer) readLongString() (string, error) {
 		if err != nil {
 			return "", err
 		}
+		if isInvalidChar(c) {
+			return "", &SyntaxError{"Invalid character", t.pos}
+		}

 		switch c {
 		case -1:
 			return "", t.invalidChar(c)


We're reporting an invalid character here when we encounter an EOF, which may be confusing. It looks like we do the same thing in readString above. In both cases, I suggest handling the EOF case immediately after read() returns so that we can provide a more informative error to the user.

t.invalidChar, as previously suggested, does the right thing here (and should probably have a better name :)).
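
A rough sketch of the distinction being asked for here: report end of input differently from a genuinely invalid character. The syntaxError type and the invalidCharError helper are hypothetical stand-ins written for this example; they are not the package's SyntaxError or t.invalidChar, whose actual behavior is not shown in this thread.

```go
package main

import "fmt"

// syntaxError is a hypothetical stand-in for the package's error type.
type syntaxError struct {
	msg string
	pos int
}

func (e *syntaxError) Error() string {
	return fmt.Sprintf("%s (offset %d)", e.msg, e.pos)
}

// invalidCharError builds an error for c at pos. It special-cases the -1
// sentinel so that EOF is not reported as an "invalid character".
func invalidCharError(c, pos int) error {
	if c < 0 {
		return &syntaxError{"unexpected end of input", pos}
	}
	return &syntaxError{fmt.Sprintf("invalid character %q", rune(c)), pos}
}

func main() {
	fmt.Println(invalidCharError(-1, 12))
	fmt.Println(invalidCharError('a', 7))
}
```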

zslayton · 2020-06-09T14:11:16Z

ion/tokenizer.go

 			ok, err := t.skipEndOfLongString(t.skipCommentsHandler)
 			if err != nil {
 				return "", err
 			}
 			if ok {
 				return ret.String(), nil
 			}
-
+			if startPosition == t.pos {


It looks like skipEndOfLongString reports whether it consumed the string ending, so I think you can replace this position check with an else {...} attached to the if ok {...}.

    wasConsumed, err := t.skipEndOfLongString(t.skipCommentsHandler)
    if err != nil {
        return "", err
    }
    if wasConsumed {
        return ret.String(), nil
    } else {
        // The ' was not part of a long string ending.
        ret.writeByte(byte(c))
    }

It'll also return false if it skipped over a ''' /* pause in the string */ ''' sequence, in which case you don't want to keep the '. :( This seems correct to me in the short term, but skipEndOfLongString should probably be refactored to return a tri-state in the longer term.

Yes, I almost refactored skipEndOfLongString but then I decided to KISS.

#61 to change skipEndOfLongString
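
To make the tri-state idea concrete, here is a small sketch under simplifying assumptions. The type quoteOutcome, its three constants, and classifyQuote are all hypothetical names invented for this example. The function works on a plain string and only skips whitespace between segments, whereas the real skipEndOfLongString also handles comments through its handler argument.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteOutcome is a hypothetical tri-state result for what followed a '.
type quoteOutcome int

const (
	loneQuote    quoteOutcome = iota // a single ' that belongs to the string
	segmentBreak                     // ''' followed by another ''' segment
	stringEnd                        // ''' that ends the long string
)

// classifyQuote inspects rest, the input immediately after a ' character
// seen inside a long string, and reports which of the three cases applies.
// Simplification: only whitespace (not comments) may separate segments.
func classifyQuote(rest string) quoteOutcome {
	if !strings.HasPrefix(rest, "''") {
		return loneQuote
	}
	after := strings.TrimLeft(rest[2:], " \t\r\n")
	if strings.HasPrefix(after, "'''") {
		return segmentBreak
	}
	return stringEnd
}

func main() {
	fmt.Println(classifyQuote("s fun! '''") == loneQuote)
	fmt.Println(classifyQuote("'' '''more'''") == segmentBreak)
	fmt.Println(classifyQuote("''") == stringEnd)
}
```

A caller could then keep the ' only in the loneQuote case, without comparing positions before and after the call.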

fernomac · 2020-06-09T15:39:11Z

ion/tokenizer.go

@@ -1263,3 +1273,24 @@ func (t *tokenizer) unread(c int) {
 	t.pos--
 	t.buffer = append(t.buffer, c)
 }
+
+func isInvalidChar(c int) bool {


This name's a bit ambiguous at the top level. Maybe isInvalidStringChar? It'd fit nicely alongside the other isXxx functions in textutils.go.

Now that's embarrassing! I read this comment and thought you were suggesting a change to the function's name; I totally misread it.
I had originally named it isProhibitedControlChar. Changing it back.

fernomac · 2020-06-09T15:55:58Z

ion/tokenizer.go

+	return true
+}
+
+func isWhiteSpaceChar(c int) bool {


isWhitespace is similarly-named but works a bit differently. Maybe worth combining them? Otherwise maybe rename this one to isStringWhitespace or something like that.

zslayton · 2020-06-09T23:48:51Z

ion/tokenizer.go

-
-		switch c {
-		case -1, '\n':
+		if isProhibitedControlChar(c) || c == '\n' {


I think we should explicitly handle the EOF/-1 case here and below. isProhibitedControlChar will catch it, but the method name makes me think it's only looking for ASCII control characters, which doesn't include -1.

I see your point, but that means going back to this comment.
We can either change the function name to something more generic like invalidStringCharacter or put the logic before this commit back.

There's a subtle difference here. We've gone from:

    if isInvalidChar(c) {
        // Doesn't handle EOF, but could have based on the name
    }
    switch c {
    case -1:
        // Handles EOF

to

    if isProhibitedControlChar(c) {
        // Handles EOF, but shouldn't based on the name.
    }

I think isProhibitedControlChar is a more precise/communicative name, so I'd like to keep it. But that means we need the explicit EOF check:

    if c == -1 || c == '\n' || isProhibitedControlChar(c) {
        // Handles EOF
    }

I think this is especially helpful since someone new to the codebase could reasonably expect EOF to be handled by the if err != nil { ... } check above, since err is how the standard library reports EOF.

Cool. Updated the pull request. Thanks.
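
The outcome of this thread can be sketched as follows: the predicate stays narrowly about ASCII control characters, and the caller checks EOF explicitly. This is a minimal, self-contained illustration, not the merged tokenizer code. The function scanShortString, its error messages, and the set of exempted whitespace characters are assumptions, and escape sequences are ignored.

```go
package main

import (
	"errors"
	"fmt"
)

// isProhibitedControlChar reports whether c is an ASCII control character
// other than the whitespace this sketch exempts (an assumption). It makes
// no claim about EOF: a negative c is simply not a control character.
func isProhibitedControlChar(c int) bool {
	if c < 0x00 || c > 0x1F {
		return false
	}
	switch c {
	case '\t', '\n', '\v', '\f', '\r':
		return false
	}
	return true
}

// scanShortString is a hypothetical scanner for the body of a
// double-quoted string; s starts just after the opening quote.
func scanShortString(s string) (string, error) {
	runes := []rune(s)
	var out []rune
	for i := 0; ; i++ {
		c := -1 // mimic read(): -1 once the input is exhausted
		if i < len(runes) {
			c = int(runes[i])
		}
		// Explicit EOF check, as agreed in the thread above.
		if c == -1 || c == '\n' || isProhibitedControlChar(c) {
			if c == -1 {
				return "", errors.New("unexpected end of input in string")
			}
			return "", fmt.Errorf("invalid character %#x in string", c)
		}
		if c == '"' {
			return string(out), nil
		}
		out = append(out, rune(c))
	}
}

func main() {
	fmt.Println(scanShortString("abc\""))
	fmt.Println(scanShortString("abc"))
	fmt.Println(scanShortString("a\x01b\""))
}
```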

Arman A and others added 4 commits June 3, 2020 22:19

Fix triple quote strings containing single quote

d0d3412

Merge branch 'master' into string-triple-quote

2a54d28

string values with control characters

460b748

Merge branch 'master' into string-triple-quote

e53022a

fernomac previously approved these changes Jun 6, 2020

Arman A added 4 commits June 7, 2020 12:44

Fix triple quote strings containing single quote

984fe66

string values with control characters

7b85e7d

Rename helper function

e241322

Merge remote-tracking branch 'origin/string-triple-quote' into string…

64990f8

…-triple-quote # Conflicts: # ion/tokenizer.go

R-maan dismissed fernomac’s stale review via 64990f8 June 7, 2020 19:53

zslayton requested changes Jun 9, 2020

fernomac reviewed Jun 9, 2020

Minor refactors

43f7b28

R-maan requested a review from zslayton June 9, 2020 17:22

This was referenced Jun 9, 2020

Refactor tokenizer.read() #60

Open

Refactor skipper.skipEndOfLongString() #61

Closed

zslayton requested changes Jun 9, 2020

Handle -1 for string values

4a1635c

zslayton approved these changes Jun 11, 2020

Merge branch 'master' into string-triple-quote

39f92b5

zslayton merged commit 82b4807 into amazon-ion:master Jun 11, 2020

R-maan deleted the string-triple-quote branch June 12, 2020 18:51

R-maan mentioned this pull request Aug 12, 2020

Fix malformed clobs #130

Merged
