MM-61148 Rewrite table parsing based off of cmark-gfm #20

hmhealey · 2024-10-28T20:05:06Z

Summary

The current table parsing for the mobile app is either entirely custom or based on the implementation in our old version of marked.js, and it has repeatedly run into problems and bugs because of that. I definitely tried to do far too much in it using regular expressions. For example, my last PR #19 caused MM-61148.

Instead, I decided to relearn some C and try to copy https://github.com/github/cmark-gfm/blob/master/extensions/table.c to ensure our table parsing is as close to GitHub's version as possible minus the painful memory management.

How table parsing worked before/works now

The main changes for this are in lib/blocks.js (the files in dist are compiled versions of it). The most important parts of this file are blocks, which defines the types of block elements (paragraphs, lists, tables, etc.), and blockStarts, which contains an array of functions that identify the starts of those block elements.

For each line of text, the parser calls the functions in blockStarts one at a time until one of them returns that the text starts a new block. Otherwise, the text is treated as regular text in a paragraph.
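As a rough illustration of that dispatch loop, here is a minimal sketch. The two block start functions and the classifyLine helper are hypothetical stand-ins rather than the real entries in lib/blocks.js, and the assumption that 0 means "no match" while a non-zero value means "a block started" is taken from the `return 0` and `return 2` lines visible in the diffs below.

```javascript
// Hypothetical, heavily simplified stand-ins for entries in blockStarts.
// Assumption: 0 means "no match"; any non-zero value means a block started.
const blockStarts = [
    function blockQuote(line) {
        return line.startsWith(">") ? 1 : 0;
    },
    function heading(line) {
        return /^#{1,6}(\s|$)/.test(line) ? 2 : 0;
    },
];

// Try each block start in order. If none of them match, the line is
// treated as regular paragraph text.
function classifyLine(line) {
    for (const start of blockStarts) {
        if (start(line) !== 0) {
            return start.name;
        }
    }
    return "paragraph";
}
```

For example, classifyLine("> quoted") reports the block quote start, while a line of plain text falls through to "paragraph".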

|column 1|column 2|column 3| <---- header row
|--------|--------|--------| <---- delimiter row
|apple   |banana  |carrot  |

The old blockStart function for tables would read each line of text, and if it found something that looked like a header row, it'd peek ahead to see if the next row is a matching delimiter row. If both of those are true, it'd open a table node, move the parser into it, and then restart the processing for that line. On the next pass, the blockStart for the table row would look at the header row again, add it into the table as a child, and then move onto the next line. It'd repeat that process until it found something that wasn't a table row at which point, it'd close the table and start a new block for whatever it found in the row.

That worked okay, but that required a lot of changes to keep track of the next line of text, and I didn't understand how the parser handled things like indentation which caused some other issues (eg. MM-46943).

Instead of reading two rows at once, cmark-gfm first reads the header row as plain text in a paragraph. Then, each time it adds a line to that paragraph, it checks whether the new line is a delimiter row. If it is, it looks back to see whether the previous line is a header row. If both are found, it converts the current paragraph into a table and adds the header row as its child. After that, parsing continues as it did before.

That works a bit better since it doesn't require us to modify the parser as heavily, and since it starts as a paragraph, it makes it easier to handle tables that are indented or inside of things like list items and block quotes the same way that GitHub does.
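The paragraph-then-convert idea can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the code in this PR: splitCells, isDelimiterRow, and parseBlocks are hypothetical names, blocks are plain objects instead of parser nodes, escaped pipes and alignment are ignored, and a table simply runs until the next blank line.

```javascript
// Split a row into trimmed cells, dropping one optional leading and
// trailing pipe. Illustration only: escaped pipes are not handled.
function splitCells(line) {
    let text = line.trim();
    if (text.startsWith("|")) text = text.slice(1);
    if (text.endsWith("|")) text = text.slice(0, -1);
    return text.split("|").map((cell) => cell.trim());
}

// Simplification: require at least one pipe, and every cell must be
// hyphens with optional leading/trailing colons.
function isDelimiterRow(line) {
    if (!line.includes("|")) return false;
    return splitCells(line).every((cell) => /^:?-+:?$/.test(cell));
}

function parseBlocks(lines) {
    const blocks = [];
    let current = null;
    for (const line of lines) {
        if (line.trim() === "") {
            current = null; // a blank line closes the current block
            continue;
        }
        if (current && current.type === "table") {
            current.rows.push(splitCells(line));
            continue;
        }
        if (current && current.type === "paragraph" && isDelimiterRow(line)) {
            // Look back: the paragraph's last line is the candidate header row.
            const header = splitCells(current.lines[current.lines.length - 1]);
            if (header.length === splitCells(line).length) {
                current.lines.pop();
                const table = {type: "table", header, rows: []};
                if (current.lines.length === 0) {
                    blocks[blocks.length - 1] = table; // paragraph becomes the table
                } else {
                    blocks.push(table); // earlier lines stay a paragraph
                }
                current = table;
                continue;
            }
        }
        if (!current) {
            current = {type: "paragraph", lines: []};
            blocks.push(current);
        }
        current.lines.push(line);
    }
    return blocks;
}
```

Because the header row starts out as ordinary paragraph text, nothing needs to peek at the next line. The conversion only happens once the delimiter row actually arrives.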

Ticket Link

https://mattermost.atlassian.net/browse/MM-61148

The previous version of this was a custom implementation that I might have written based on Marked's implementation. That version continued to have issues, and I know regular expressions were likely insufficient for large parts of this, so I decided to start from scratch and attempt to port GitHub's cmark-gfm implementation. Overall, it seems like that's gone pretty well. I had to learn to read some generated C code and understand GFM's extension system, but I'm confident that I got the implementation right.

This initial version:
- Fixes MM-61148
- Adds support for nesting tables inside of block quotes matching GFM. The web app supports this but is less strict than GFM, and mobile doesn't currently support this.
- Adds support for nesting tables inside of lists, also matching GFM. Neither app currently supports this. This almost fixes MM-61148, but the specific example in that ticket has the table directly following a paragraph without a blank line in between which is something we don't support yet.

I also had to update the output of the HTML renderer to match changes made in github/cmark-gfm@1470c30, and our existing table tests needed to be updated for that as well.

hmhealey · 2024-10-28T20:06:55Z

lib/blocks.js

@@ -459,9 +467,6 @@ var blocks = {
            for (var row = block.firstChild; row; row = row.next) {
                var i = 0;
                for (var cell = row.firstChild; cell; cell = cell.next) {
-                    // copy column alignment to each cell


This is handled when we add the node for the row instead of doing it now

hmhealey · 2024-10-28T20:50:12Z

lib/blocks.js

-                    cell.isHeading = true;
-                }
-            }
+        finalize: function() {


Similarly, this is handled when we first add the row instead of doing it after the fact

hmhealey · 2024-10-28T20:51:25Z

lib/blocks.js

            return 0;
        }

-
-        if (parser.indented) {
+        if (container._tableVisited) {


When adding multiple lines to a paragraph, we record that we've already checked whether that paragraph is a table so that we don't repeat the check for every line.

hmhealey · 2024-10-28T20:53:12Z

lib/blocks.js

+            [parser.lineNumber - 1, parser.offset + 1],
+            [parser.lineNumber - 1, parser.offset + headerCharacters],


This tracks which parts of the original text correspond to the parsed Markdown source tree. We don't use it ourselves, but I wanted to be consistent with the existing code.

hmhealey · 2024-10-28T20:55:09Z

lib/blocks.js

        return 2;
    }
 ];

-const parseDelimiterRow = function(row) {
-    if (row.indexOf("|") === -1) {
+const parseTableRow = function(line, startAt) {


Apart from the changes to how we start a table block, this function for identifying and parsing a table row is the other big change here. Instead of using a regular expression, we parse the row manually by advancing through the text one character at a time, with the current position tracked below by offset.

This function is intentionally very C-like and less JS-like than normal to make it easier to port.

Some things in this function are pretty tricky because both the opening and closing pipes of the table are optional meaning a row could look like |aaa|bbb|, aaa|bbb, or even aaa. This also has to be able to handle empty cells like || which can either be a single empty cell or two empty cells which is very fun. This is why the old regex was ugly and why this logic isn't much better looking 😅
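To give a feel for that style of parsing, here is a minimal sketch of a character-by-character row scanner. It is not the parseTableRow from this PR: parseRowCells is a hypothetical name, it only returns cell text, and treating || as a single empty cell is a choice made for this sketch rather than a statement about what cmark-gfm or this PR does.

```javascript
// Hypothetical sketch of offset-based row scanning. Both the opening and
// closing pipes are optional, and a backslash escapes the next character.
function parseRowCells(line) {
    const cells = [];
    const end = line.length;
    let offset = 0;

    // Skip leading spaces and one optional opening pipe.
    while (offset < end && line[offset] === " ") {
        offset++;
    }
    if (offset < end && line[offset] === "|") {
        offset++;
    }

    let cellStart = offset;
    while (offset < end) {
        const ch = line[offset];
        if (ch === "\\" && offset + 1 < end) {
            offset += 2; // an escaped character never ends a cell
            continue;
        }
        if (ch === "|") {
            cells.push(line.slice(cellStart, offset).trim());
            cellStart = offset + 1;
        }
        offset++;
    }

    // Text after the last pipe is a final cell only when it is non-empty,
    // which is what makes the closing pipe optional.
    const rest = line.slice(cellStart).trim();
    if (rest !== "") {
        cells.push(rest);
    }
    return cells;
}
```

With this sketch, |aaa|bbb| and aaa|bbb both yield two cells, aaa yields one, and an escaped pipe stays inside its cell.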

hmhealey · 2024-10-28T21:10:51Z

lib/render/html.js

@@ -288,12 +288,15 @@ function table_row(node, entering) {
        this.cr();
    } else {
        this.tag("/tr");
+        this.cr();


cmark-gfm handles whitespace slightly differently in its HTML output, so I had to change that for our tests as well so that we could use the tests from cmark-gfm (test/gfm_extensions.txt)

hmhealey · 2024-10-28T21:11:09Z

package.json

@@ -1,7 +1,7 @@
 {
  "name": "@mattermost/commonmark",
  "description": "the Mattermost fork of a strongly specified, highly compatible variant of Markdown",
-  "version": "0.30.1-2",
+  "version": "0.30.1-3",


This is supposed to be on master already, but I wasn't able to push the changes when I made them a couple months ago

hmhealey · 2024-10-28T21:11:58Z

test/tables.txt

+</table>
+````````````````````````````````
+
+### Tables inside of blockquotes


These aren't required for MM-61148, but the new parsing logic makes this behave the same as GitHub, so I added a bunch of tests to make sure it works like GitHub

hmhealey · 2024-10-29T17:16:10Z

@devinbinnie @crspeller I wasn't sure who to assign reviews to since this code is pretty different than most of our JS code. Let me know if you want me to reassign them.

crspeller

Did a synchronous review with @hmhealey and @devinbinnie
Looks good!

devinbinnie

Nice work @hmhealey :)

hmhealey added 6 commits August 26, 2024 15:58

Update version and regenerate compiled files

91674fe

Consistently use double quotes

06ace6e

Correct sourcepos for table nodes and ensure table rows are finalized…

831ef64

… properly

Remove finalize method for table rows that's not needed any more

6d239b1

Add tests for tables from cmark-gfm

2148a1a

hmhealey added the 2: Dev Review Requires review by a core committer label Oct 28, 2024

hmhealey commented Oct 28, 2024

hmhealey added Work In Progress Not yet ready for review and removed 2: Dev Review Requires review by a core committer labels Oct 28, 2024

MM-46943 Parse tables directly following text and remove remaining TODOs

bdd88ab

hmhealey force-pushed the hh_oct25-rewrite-table-parsing branch from 8dbd880 to bdd88ab Compare October 28, 2024 23:02

hmhealey added 2: Dev Review Requires review by a core committer and removed Work In Progress Not yet ready for review labels Oct 29, 2024

hmhealey requested review from crspeller and devinbinnie October 29, 2024 17:13

hmhealey mentioned this pull request Oct 29, 2024

MM-61148 Rewrite table parsing and improve error handling around Markdown code mattermost/mattermost-mobile#8300

Merged

4 tasks

crspeller approved these changes Nov 8, 2024

devinbinnie approved these changes Nov 11, 2024

devinbinnie added 4: Reviews Complete All reviewers have approved the pull request and removed 2: Dev Review Requires review by a core committer labels Nov 11, 2024

devinbinnie assigned hmhealey Nov 11, 2024

hmhealey merged commit c50146e into master Nov 12, 2024
12 checks passed

hmhealey deleted the hh_oct25-rewrite-table-parsing branch November 12, 2024 19:18
