On the order of things with regards to pandoc's AST #541

baptiste · 2022-04-05T20:20:17Z

(Tom pointed me to this discussion group so I'm just discovering it – hopefully I didn't miss a thread in my search.)

Some years ago I remember discussing the knitr process with Yihui and suggesting that code chunks etc. could instead be captured in pandoc's AST after pandoc has parsed the input, then evaluated, with their output injected back into the AST. This access to pandoc's AST wasn't available when knitr first started (IIRC), so the main reason for the current workflow seemed historical.

Seeing that Quarto is a reimplementation centered around pandoc, I was really hoping this would be the new strategy, removing the knitting step before pandoc parsing, and moving it to become an intermediate transform of the AST.

A key reason I see this as beneficial is the opportunity to merge changes made to the output file. For example, if a collaborator edits the text in the final Word document but doesn't touch the results of computations, it should be possible to merge the changes by simply comparing the two trees, as all the necessary information is present. In contrast, the knitr step is currently a one-way street (the input document cannot be recovered from the output).
I can imagine other benefits along those lines.

jjallaire · 2022-04-06T01:13:01Z

Maintainer

One issue we need to deal with here is that Rmd files are not actually parseable by Pandoc: the traditional chunk option syntax is not valid pandoc code block markup, because pandoc requires strict #id, .class, key=value attributes (in that order). With Quarto we have remedied this to an extent: if you create a code block with e.g. ```{r} (no chunk options on the top line), then the document is actually parseable by Pandoc. However, we also need to support the old syntax so that users can easily migrate their Rmd files without changing their chunk options.
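
To make the two header styles concrete, here is a minimal sketch of rewriting a knitr-style chunk header into a bare header followed by Quarto's `#|` option comment lines. The helper names (`split_options`, `rewrite_header`) are hypothetical, the mapping of dotted option names to dashed ones follows Quarto's naming convention, and values that are R expressions, escaped quotes, and engines with a different comment prefix are not handled.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# A knitr-style header: fence, {engine, optional label, key=value options}
HEADER = re.compile(r"^`{3}\{(\w+)[ ,]*(.*)\}\s*$")


def split_options(text):
    """Split on top-level commas, ignoring commas inside quotes or brackets."""
    parts, current, depth, quote = [], [], 0, None
    for ch in text:
        if quote:
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        if ch == "," and depth == 0 and quote is None:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def rewrite_header(line):
    """Return the header as a list of lines: bare fence, then one option per line."""
    match = HEADER.match(line)
    if not match:
        return [line]  # not a chunk header: leave untouched
    engine, rest = match.group(1), match.group(2)
    out = [FENCE + "{" + engine + "}"]
    for opt in split_options(rest):
        if "=" in opt:
            key, value = (s.strip() for s in opt.split("=", 1))
            value = {"TRUE": "true", "FALSE": "false"}.get(value, value)
            out.append(f"#| {key.replace('.', '-')}: {value}")
        else:
            out.append(f"#| label: {opt}")  # a bare first token is the chunk label
    return out
```

A header with no options on the top line passes through unchanged, which is the case described above as already parseable.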

So while Quarto is technically a ground-up re-implementation, we've made compatibility a cornerstone, so we don't have carte blanche with new syntax/rules. We could create a world where one execution order exists if you refrain from options on the main chunk header and a different one if you don't, but that could be quite confusing. At the end of the day we thought that the principle of executing existing Rmds unmodified was more important than any features we might be able to introduce by breaking compatibility.

It seems to me that for the scenario you describe we could achieve this outcome with either execution order (i.e. we end up with an .md file with cells/output and content, and if the user edits only content the merge can be done). I am probably missing something though! What are the peculiarities introduced by execution order that make this type of merge more straightforward with execution after the AST is resolved?

baptiste · 2022-04-06T01:45:00Z

Author

Thanks for the explanation; I think there are ways to work around the issue of back-compatibility with Rmd. A couple of thoughts:

In principle, one could have optional processes taking place at each step in the chain:
Rmd -> md -> AST -> AST -> output format
Some of those can be a no-operation, e.g. with the workflow I'm proposing, nothing would happen between the source file (".qmd") and the md input. But if an older Rmd format is detected, the first step would be to convert it to a valid ".qmd" input (by stripping those chunk options and rewriting them in the new format), similar to what knitr has done with the older Sweave syntax (IIRC).
Instead of splitting the document into code vs non-code at the start, this would happen by inspecting the AST, which is likely more robust as it alleviates the need for fancy regex etc. (I still remember Yihui's first iterations with knitr, where the code chunk options were separated by ;...) – I believe there remain some tricky cases with inline comments within commented sections, for example.
The problem with the pre-pandoc processing step, i.e. knitting the document, is that the resulting file has no memory of what has been inserted by knitr, and what was originally prose. Whereas if code chunks were first detected within the AST, their relation to the document is fully described (by their position in the tree and the pandoc attributes attached) and the process can be reversed. This would enable merging changes made to the output (say, a collaborator unwilling to use these technologies, editing the output manuscript in Word). Their changes to the text could be traced back to the original branch in the AST, and if the modifications don't interfere with the automated steps (computation output) then they can be safely merged in the original file, which is pandoc-compatible and fully described by the pandoc AST. If, however, as is currently the case, the AST does not fully describe the input file, then such merging cannot be automated.
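
A minimal sketch of the AST-side approach described above, assuming the document has already been converted to pandoc's JSON representation (for example with `pandoc -t json`). It locates every `CodeBlock` node together with its position in the tree, then injects a marked-up output node after each executable block. The function names, the `"r"` class used to flag executable blocks, the `cell-output` class, and the `evaluate` callback are all hypothetical stand-ins; no real code is executed here.

```python
def find_code_blocks(node, path=()):
    """Yield (path, identifier, classes, code) for every CodeBlock in a pandoc JSON AST."""
    if isinstance(node, dict):
        if node.get("t") == "CodeBlock":
            (ident, classes, _keyvals), code = node["c"]
            yield path, ident, classes, code
            return
        for key, child in node.items():
            yield from find_code_blocks(child, path + (key,))
    elif isinstance(node, list):
        for i, child in enumerate(node):
            yield from find_code_blocks(child, path + (i,))


def inject_outputs(doc, evaluate, executable_class="r"):
    """Return a new document with an output Div after each top-level executable block."""
    new_blocks = []
    for block in doc["blocks"]:
        new_blocks.append(block)
        if block.get("t") == "CodeBlock" and executable_class in block["c"][0][1]:
            result = evaluate(block["c"][1])  # stand-in for running the chunk
            new_blocks.append({
                "t": "Div",
                "c": [["", ["cell-output"], []],
                      [{"t": "CodeBlock", "c": [["", [], []], result]}]],
            })
    return {**doc, "blocks": new_blocks}


# A tiny hand-written AST; the version number is illustrative only.
doc = {
    "pandoc-api-version": [1, 22],
    "meta": {},
    "blocks": [
        {"t": "Para", "c": [{"t": "Str", "c": "Intro"}]},
        {"t": "CodeBlock", "c": [["fig1", ["r"], [["echo", "false"]]], "1 + 1"]},
    ],
}
rendered = inject_outputs(doc, evaluate=lambda code: "2")
```

Because every injected node carries its own class and sits right after the block that produced it, a later pass can tell computed output from prose by tree position and attributes alone.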

Does that make sense?

jjallaire · 2022-04-06T01:54:39Z

Maintainer

One thing to consider is that the markdown we write from cell output is considerably more structured than what is written by knitr in Rmd (we have hooks that override nearly all of the default output behavior). Try executing with keep-md to see what I am talking about. Cells and their code/outputs are clearly marked up with divs/classes. So you can definitely determine programmatically what was generated by knitr and what was written by the user.

So I think with the current markup a content-only change could indeed be successfully merged back into the source qmd (it would, as you said, rely on the marked-up output being paired with the original code cell inputs by their order).
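
A rough sketch of that order-based pairing, assuming the kept markdown wraps each cell in a fenced div that opens with `::: {.cell` and closes with a bare `:::`, with nested divs for outputs. The exact markup is an assumption here, as are the function names. Prose segments are taken from the edited markdown, cells are kept from the source qmd, and the merge is refused if the number of cells differs. Lines inside code that happen to look like div fences are not handled.

```python
FENCE = "`" * 3  # three backticks


def split_source(qmd):
    """Split a qmd source into alternating ("prose", text) and ("cell", text) segments."""
    segments, buf, in_chunk = [], [], False
    for line in qmd.splitlines():
        if not in_chunk and line.startswith(FENCE + "{"):
            segments.append(("prose", "\n".join(buf)))
            buf, in_chunk = [line], True
        elif in_chunk:
            buf.append(line)
            if line.strip() == FENCE:
                segments.append(("cell", "\n".join(buf)))
                buf, in_chunk = [], False
        else:
            buf.append(line)
    segments.append(("prose", "\n".join(buf)))
    return segments


def split_rendered(md):
    """Split kept markdown into the same alternating segments, tracking div nesting."""
    segments, buf, depth = [], [], 0
    for line in md.splitlines():
        stripped = line.strip()
        if depth == 0 and stripped.startswith("::: {.cell"):
            segments.append(("prose", "\n".join(buf)))
            buf, depth = [line], 1
        elif depth > 0:
            buf.append(line)
            if stripped.startswith(":::") and len(stripped) > 3:
                depth += 1  # nested div opener, e.g. an output div
            elif stripped == ":::":
                depth -= 1
                if depth == 0:
                    segments.append(("cell", "\n".join(buf)))
                    buf = []
        else:
            buf.append(line)
    segments.append(("prose", "\n".join(buf)))
    return segments


def merge_prose(source_qmd, edited_md):
    """Keep cells from the source, take prose from the edited markdown, paired by order."""
    src, out = split_source(source_qmd), split_rendered(edited_md)
    if [kind for kind, _ in src] != [kind for kind, _ in out]:
        raise ValueError("cell structure changed; cannot pair by order")
    merged = [s if kind == "cell" else o for (kind, s), (_, o) in zip(src, out)]
    return "\n".join(merged)
```

Pairing by order is the fragile part: it only works while nobody adds, removes, or reorders cells in the edited copy, which is why the sketch fails loudly instead of guessing.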

baptiste · 2022-04-06T22:22:00Z

Author

OK, thanks, I get your perspective, especially given the existing knitr codebase.
I'll think about it some more, as I still think the alternative approach of reimplementing knitr from scratch as a pandoc filter, essentially, has value (for one thing, I believe it would simplify the code; it would also integrate more naturally with the pandoc ways).

As a side note, I believe there's a need for a "patching" framework, where one could apply a set of structured changes (a patch) to a document, be it Rmd/qmd or even generic pandoc. This is best done at the AST level, e.g. in JSON form or similar. The patch would be some change made to the output manuscript. Having tools to diff, merge, etc. such patches would be very useful, I believe.
This idea of "patching" a document – basically automating a post-processing step and merging it back into the document creation process – is also something that could be useful in the realm of graphics: annotations, minor tweaks, etc. done e.g. in Illustrator, could be stored in a standard format, and applied automatically if the basic graphic has been updated. (in this case the AST is basically the grid gTree, but that's another discussion altogether)
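
To illustrate the patching idea on a JSON tree, here is a small sketch that computes a patch as a list of replace operations and re-applies it to an updated version of the same document. The operation shape is loosely modeled on JSON Patch-style `replace` entries, but the list-based paths and the function names are assumptions made for this example. Only same-shape trees are diffed node by node; insertions and deletions fall back to replacing the enclosing container.

```python
import copy


def diff(old, new, path=()):
    """Yield replace operations that turn `old` into `new`."""
    if isinstance(old, dict) and isinstance(new, dict) and old.keys() == new.keys():
        for key in old:
            yield from diff(old[key], new[key], path + (key,))
    elif isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        for i, (a, b) in enumerate(zip(old, new)):
            yield from diff(a, b, path + (i,))
    elif old != new:
        yield {"op": "replace", "path": list(path), "value": new}


def apply_patch(tree, patch):
    """Return a patched deep copy of `tree`; the input is left untouched."""
    result = copy.deepcopy(tree)
    for op in patch:
        if op["op"] != "replace":
            raise ValueError(f"unsupported operation: {op['op']}")
        if not op["path"]:
            result = copy.deepcopy(op["value"])  # the whole tree is replaced
            continue
        *parent, last = op["path"]
        target = result
        for key in parent:
            target = target[key]
        target[last] = copy.deepcopy(op["value"])
    return result


# Hand-written miniature AST: one paragraph and one code block.
original = {"blocks": [
    {"t": "Para", "c": [{"t": "Str", "c": "Hello"}]},
    {"t": "CodeBlock", "c": [["", ["r"], []], "1 + 1"]},
]}

# A collaborator edits the prose only.
edited = copy.deepcopy(original)
edited["blocks"][0]["c"][0]["c"] = "Hi"
patch = list(diff(original, edited))

# Meanwhile the computation changes upstream; the stored patch still applies.
updated = copy.deepcopy(original)
updated["blocks"][1]["c"][1] = "2 + 2"
patched = apply_patch(updated, patch)
```

The stored patch touches only the prose node, so it can be replayed after the code changes, which mirrors the graphics case of re-applying manual tweaks to a regenerated figure.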

jjallaire · 2022-04-07T11:48:40Z

Maintainer

I agree about the patching framework. I have seen some JSON-oriented diff/patch libraries, and there is of course Google's diff-match-patch library for operating at the markdown source level: https://github.com/google/diff-match-patch
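
As a source-level illustration, the sketch below lists the changed regions between two markdown texts. It uses Python's standard-library `difflib` as a stand-in, not the diff-match-patch API, and the function name `changed_regions` is hypothetical.

```python
import difflib


def changed_regions(original_md, edited_md):
    """Return (tag, old_lines, new_lines) for each changed region, line by line."""
    a, b = original_md.splitlines(), edited_md.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


changes = changed_regions(
    "Intro.\n\nResult: 2\n",
    "Intro, revised.\n\nResult: 2\n",
)
```

A line-based diff like this knows nothing about which lines are computed output, so it would need to be combined with the cell markup to decide which changes are safe to merge.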
