-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MM-61148 Rewrite table parsing based off of cmark-gfm #20
Conversation
The previous version of this was a custom implementation that I might have written based on Marked's implementation. That version continued to have issues, and I know regular expressions were likely insufficient for large parts of this, so I decided to start from scratch and attempt to port GitHub's cmark-gfm implementation. Overall, it seems like that's gone pretty well. I had to learn to read some generated C code and understand GFM's extension system, but I'm confident that I got the implementation right. This initial version: - Fixes MM-61148 - Adds support for nesting tables inside of block quotes matching GFM. The web app supports this but is less strict than GFM, and mobile doesn't currently support this. - Adds support for nesting tables inside of lists, also matching GFM. Neither app currently supports this. This almost fixes MM-61148, but the specific example in that ticket has the table directly following a paragraph without a blank line in between which is something we don't support yet.
I also had to update the output of the HTML renderer to match changes made in github/cmark-gfm@1470c30, and our existing table tests needed to be updated for that as well.
@@ -459,9 +467,6 @@ var blocks = { | |||
for (var row = block.firstChild; row; row = row.next) { | |||
var i = 0; | |||
for (var cell = row.firstChild; cell; cell = cell.next) { | |||
// copy column alignment to each cell |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is handled when we add the node for the row instead of doing it now
cell.isHeading = true; | ||
} | ||
} | ||
finalize: function() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, this is handled when we first add the row instead of doing it after the fact
return 0; | ||
} | ||
|
||
|
||
if (parser.indented) { | ||
if (container._tableVisited) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When adding multiple lines to a paragraph, we keep track that we've already checked if that paragraph is a table to save some time
lib/blocks.js
Outdated
[parser.lineNumber - 1, parser.offset + 1], | ||
[parser.lineNumber - 1, parser.offset + headerCharacters], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tracks which parts of the original text corresponds to the parsed Markdown source tree. We don't use it ourselves, but I wanted to be consistent with the existing code
return 2; | ||
} | ||
]; | ||
|
||
const parseDelimiterRow = function(row) { | ||
if (row.indexOf("|") === -1) { | ||
const parseTableRow = function(line, startAt) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from the changes to how we start a table block, this function for identifying and parsing a table row is the other big change here since, instead of using a regular expression, we parse it manually by advancing through the text one character at a time which is tracked below by offset
.
This function is intentionally very C-like and less JS-like than normal to make it easier to port.
Some things in this function are pretty tricky because both the opening and closing pipes of the table are optional meaning a row could look like |aaa|bbb|
, aaa|bbb
, or even aaa
. This also has to be able to handle empty cells like ||
which can either be a single empty cell or two empty cells which is very fun. This is why the old regex was ugly and why this logic isn't much better looking 😅
@@ -288,12 +288,15 @@ function table_row(node, entering) { | |||
this.cr(); | |||
} else { | |||
this.tag("/tr"); | |||
this.cr(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cmark-gfm handles whitespace slightly differently in its HTML output, so I had to change that for our tests as well so that we could use the tests from cmark-gfm (test/gfm_extensions.txt
)
@@ -1,7 +1,7 @@ | |||
{ | |||
"name": "@mattermost/commonmark", | |||
"description": "the Mattermost fork of a strongly specified, highly compatible variant of Markdown", | |||
"version": "0.30.1-2", | |||
"version": "0.30.1-3", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is supposed to be on master already, but I wasn't able to push the changes when I made them a couple months ago
</table> | ||
```````````````````````````````` | ||
|
||
### Tables inside of blockquotes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These aren't required for MM-61148, but the new parsing logic makes this behave the same as GitHub, so I added a bunch of tests to make sure it works like GitHub
8dbd880
to
bdd88ab
Compare
@devinbinnie @crspeller I wasn't sure who to assign reviews to since this code is pretty different than most of our JS code. Let me know if want me to reassign them |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did an synchronous review with @hmhealey and @devinbinnie
Looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @hmhealey :)
Summary
The current table parsing for the mobile app is either entirely custom or based on the implementation in our old version of marked.js, and it has repeatedly run into problems and bugs because of that. I definitely tried to do far too much in it using regular expressions. For example, my last PR #19 caused MM-61148.
Instead, I decided to relearn some C and try to copy https://github.com/github/cmark-gfm/blob/master/extensions/table.c to ensure our table parsing is as close to GitHub's version as possible minus the painful memory management.
How table parsing worked before/works now
The main changes for this are in
lib/blocks.js
(the files indist
are compiled versions of it). The most important parts of this file areblocks
which defines the types of block elements (paragraphs, lists, tables, etc) andblockStarts
which contains an array of functions that identify the starts of different types of block elements (paragraphs, lists, tables, etc).For each line of text, the parser calls the functions in
blockStarts
one at a time until one of them returns that the text starts a new block. Otherwise, the text is treated as regular text in a paragraph.The old
blockStart
function for tables would read each line of text, and if it found something that looked like a header row, it'd peek ahead to see if the next row is a matching delimiter row. If both of those are true, it'd open atable
node, move the parser into it, and then restart the processing for that line. On the next pass, theblockStart
for the table row would look at the header row again, add it into thetable
as a child, and then move onto the next line. It'd repeat that process until it found something that wasn't a table row at which point, it'd close the table and start a new block for whatever it found in the row.That worked okay, but that required a lot of changes to keep track of the next line of text, and I didn't understand how the parser handled things like indentation which caused some other issues (eg. MM-46943).
Instead of reading two rows at once, the way that cmark-gfm handles tables is that it reads the header row as plain text, and then, when adding lines to that paragraph, it looks for if the line that it's adding is a delimiter row. If it is, it looks back to see if the previous line is a header row. If both of those are found, then it converts the current
paragraph
into atable
and adds the header row as its child. After that, it continues on as it did before.That works a bit better since it doesn't require us to modify the parser as heavily, and since it starts as a paragraph, it makes it easier to handle tables that are indented or inside of things like list items and block quotes the same way that GitHub does.
Ticket Link
https://mattermost.atlassian.net/browse/MM-61148