Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MM-61148 Rewrite table parsing based off of cmark-gfm #20

Merged
merged 7 commits into from
Nov 12, 2024

Conversation

hmhealey
Copy link
Member

@hmhealey hmhealey commented Oct 28, 2024

Summary

The current table parsing for the mobile app is either entirely custom or based on the implementation in our old version of marked.js, and it has repeatedly run into problems and bugs because of that. I definitely tried to do far too much in it using regular expressions. For example, my last PR #19 caused MM-61148.

Instead, I decided to relearn some C and try to copy https://github.com/github/cmark-gfm/blob/master/extensions/table.c to ensure our table parsing is as close to GitHub's version as possible minus the painful memory management.

How table parsing worked before/works now

The main changes for this are in lib/blocks.js (the files in dist are compiled versions of it). The most important parts of this file are blocks which defines the types of block elements (paragraphs, lists, tables, etc) and blockStarts which contains an array of functions that identify the starts of different types of block elements (paragraphs, lists, tables, etc).

For each line of text, the parser calls the functions in blockStarts one at a time until one of them returns that the text starts a new block. Otherwise, the text is treated as regular text in a paragraph.

|column 1|column 2|column 3| <---- header row
|--------|--------|--------| <---- delimiter row
|apple   |banana  |carrot  |

The old blockStart function for tables would read each line of text, and if it found something that looked like a header row, it'd peek ahead to see if the next row is a matching delimiter row. If both of those are true, it'd open a table node, move the parser into it, and then restart the processing for that line. On the next pass, the blockStart for the table row would look at the header row again, add it into the table as a child, and then move onto the next line. It'd repeat that process until it found something that wasn't a table row at which point, it'd close the table and start a new block for whatever it found in the row.

That worked okay, but that required a lot of changes to keep track of the next line of text, and I didn't understand how the parser handled things like indentation which caused some other issues (eg. MM-46943).

Instead of reading two rows at once, the way that cmark-gfm handles tables is that it reads the header row as plain text, and then, when adding lines to that paragraph, it looks for if the line that it's adding is a delimiter row. If it is, it looks back to see if the previous line is a header row. If both of those are found, then it converts the current paragraph into a table and adds the header row as its child. After that, it continues on as it did before.

That works a bit better since it doesn't require us to modify the parser as heavily, and since it starts as a paragraph, it makes it easier to handle tables that are indented or inside of things like list items and block quotes the same way that GitHub does.

Ticket Link

https://mattermost.atlassian.net/browse/MM-61148

The previous version of this was a custom implementation that I might have written based on Marked's implementation. That version continued to have issues, and I know regular expressions were likely insufficient for large parts of this, so I decided to start from scratch and attempt to port GitHub's cmark-gfm implementation.

Overall, it seems like that's gone pretty well. I had to learn to read some generated C code and understand GFM's extension system, but I'm confident that I got the implementation right. This initial version:
- Fixes MM-61148
- Adds support for nesting tables inside of block quotes matching GFM. The web app supports this but is less strict than GFM, and mobile doesn't currently support this.
- Adds support for nesting tables inside of lists, also matching GFM. Neither app currently supports this. This almost fixes MM-61148, but the specific example in that ticket has the table directly following a paragraph without a blank line in between which is something we don't support yet.
I also had to update the output of the HTML renderer to match changes made in github/cmark-gfm@1470c30, and our existing table tests needed to be updated for that as well.
@hmhealey hmhealey added the 2: Dev Review Requires review by a core committer label Oct 28, 2024
@@ -459,9 +467,6 @@ var blocks = {
for (var row = block.firstChild; row; row = row.next) {
var i = 0;
for (var cell = row.firstChild; cell; cell = cell.next) {
// copy column alignment to each cell
Copy link
Member Author

@hmhealey hmhealey Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is handled when we add the node for the row instead of doing it now

cell.isHeading = true;
}
}
finalize: function() {
Copy link
Member Author

@hmhealey hmhealey Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, this is handled when we first add the row instead of doing it after the fact

return 0;
}


if (parser.indented) {
if (container._tableVisited) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When adding multiple lines to a paragraph, we keep track that we've already checked if that paragraph is a table to save some time

lib/blocks.js Outdated
Comment on lines 822 to 823
[parser.lineNumber - 1, parser.offset + 1],
[parser.lineNumber - 1, parser.offset + headerCharacters],
Copy link
Member Author

@hmhealey hmhealey Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tracks which parts of the original text corresponds to the parsed Markdown source tree. We don't use it ourselves, but I wanted to be consistent with the existing code

return 2;
}
];

const parseDelimiterRow = function(row) {
if (row.indexOf("|") === -1) {
const parseTableRow = function(line, startAt) {
Copy link
Member Author

@hmhealey hmhealey Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the changes to how we start a table block, this function for identifying and parsing a table row is the other big change here since, instead of using a regular expression, we parse it manually by advancing through the text one character at a time which is tracked below by offset.

This function is intentionally very C-like and less JS-like than normal to make it easier to port.

Some things in this function are pretty tricky because both the opening and closing pipes of the table are optional meaning a row could look like |aaa|bbb|, aaa|bbb, or even aaa. This also has to be able to handle empty cells like || which can either be a single empty cell or two empty cells which is very fun. This is why the old regex was ugly and why this logic isn't much better looking 😅

@@ -288,12 +288,15 @@ function table_row(node, entering) {
this.cr();
} else {
this.tag("/tr");
this.cr();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cmark-gfm handles whitespace slightly differently in its HTML output, so I had to change that for our tests as well so that we could use the tests from cmark-gfm (test/gfm_extensions.txt)

@@ -1,7 +1,7 @@
{
"name": "@mattermost/commonmark",
"description": "the Mattermost fork of a strongly specified, highly compatible variant of Markdown",
"version": "0.30.1-2",
"version": "0.30.1-3",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is supposed to be on master already, but I wasn't able to push the changes when I made them a couple months ago

</table>
````````````````````````````````

### Tables inside of blockquotes
Copy link
Member Author

@hmhealey hmhealey Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These aren't required for MM-61148, but the new parsing logic makes this behave the same as GitHub, so I added a bunch of tests to make sure it works like GitHub

@hmhealey hmhealey added Work In Progress Not yet ready for review and removed 2: Dev Review Requires review by a core committer labels Oct 28, 2024
@hmhealey hmhealey force-pushed the hh_oct25-rewrite-table-parsing branch from 8dbd880 to bdd88ab Compare October 28, 2024 23:02
@hmhealey hmhealey added 2: Dev Review Requires review by a core committer and removed Work In Progress Not yet ready for review labels Oct 29, 2024
@hmhealey
Copy link
Member Author

@devinbinnie @crspeller I wasn't sure who to assign reviews to since this code is pretty different than most of our JS code. Let me know if want me to reassign them

Copy link
Member

@crspeller crspeller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did an synchronous review with @hmhealey and @devinbinnie
Looks good!

Copy link
Member

@devinbinnie devinbinnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @hmhealey :)

@devinbinnie devinbinnie added 4: Reviews Complete All reviewers have approved the pull request and removed 2: Dev Review Requires review by a core committer labels Nov 11, 2024
@hmhealey hmhealey merged commit c50146e into master Nov 12, 2024
12 checks passed
@hmhealey hmhealey deleted the hh_oct25-rewrite-table-parsing branch November 12, 2024 19:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4: Reviews Complete All reviewers have approved the pull request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants