Skip to content

Commit

Permalink
Merge pull request #25 from nextjournal/commonmark-java
Browse files Browse the repository at this point in the history
This switches the implementation for when using this library from the JVM from markdown-it to commonmark-java. We use generative testing to compare both implementations return the same markup.

This comes with an approximate speedup of 10x and fixes #23.
  • Loading branch information
mk authored Sep 24, 2024
2 parents ab08059 + 5ae6323 commit 1690dfe
Show file tree
Hide file tree
Showing 25 changed files with 1,841 additions and 275 deletions.
2 changes: 2 additions & 0 deletions .dir-locals.el
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
((clojure-mode
(cider-clojure-cli-aliases . ":test:repl")))
3 changes: 2 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,8 @@
# Changelog

## Unreleased
## 0.6 Unreleased

* We're swapping out GraalJS in favour of [commonmark-java](https://github.com/markdown-it/markdown-it) on the JVM side. The cljs implementation stays the same.
* Comply with commonmark's suggested rendering of images by default ([#18](https://github.com/nextjournal/markdown/issues/18)). This is a breaking change.

## 0.5.148
Expand Down
3 changes: 1 addition & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,13 +4,12 @@

A cross-platform clojure library for [Markdown](https://en.wikipedia.org/wiki/Markdown) parsing and transformation.


🚧 _ALPHA_ status, subject to frequent change. For a richer reading experience [read this readme as a clerk notebook](https://nextjournal.github.io/markdown/README).

## Features

* _Focus on data_: parsing yields an AST ([à la Pandoc](https://nextjournal.github.io/markdown/notebooks/pandoc)) of nested data representing a structured document.
* _Cross Platform_: clojurescript native, we target the JVM using [Graal's Polyglot Library](https://www.graalvm.org/22.1/reference-manual/js/JavaInteroperability/#polyglot-context).
* _Cross Platform_: using [commonmark-java](https://github.com/commonmark/commonmark-java) on the JVM and [markdown-it](https://github.com/markdown-it/markdown-it) for clojurescript
* _Configurable [Hiccup](https://github.com/weavejester/hiccup) conversion_.

## Try
Expand Down
4 changes: 2 additions & 2 deletions bb.edn
Original file line number Diff line number Diff line change
Expand Up @@ -36,12 +36,12 @@
build:notebooks
{:doc "builds a Clerk static with notebooks specified in deps.edn given a specified git SHA"
:depends [cljs:sci:release]
:task (clojure (str "-X:nextjournal/clerk :git/sha '\"" (first *command-line-args*) "\"' :browse? false"))}
:task (clojure (str "-X:dev:nextjournal/clerk :git/sha '\"" (or (first *command-line-args*) "SHASHASHA") "\"' :browse? false"))}

dev
{:doc "Boots and watches both shadow browser test and sci builds"
:depends [yarn-install]
:task (clojure "-M:test:nextjournal/clerk:sci watch sci browser-test")}
:task (clojure "-M:dev:test:nextjournal/clerk:sci watch sci browser-test")}

cljs:sci
{:doc "watches cljs build"
Expand Down
23 changes: 18 additions & 5 deletions deps.edn
Original file line number Diff line number Diff line change
@@ -1,12 +1,18 @@
{:paths ["src" "resources"]
{:paths ["src" "resources" "classes"]
:mvn/repos {"jitpack.io" {:url "https://jitpack.io"}}
:deps {applied-science/js-interop {:mvn/version "0.3.3"}
org.clojure/data.json {:mvn/version "2.4.0"}
org.graalvm.js/js {:mvn/version "21.3.2.1"}}
org.commonmark/commonmark {:mvn/version "0.23.0"}
org.commonmark/commonmark-ext-autolink {:mvn/version "0.23.0"}
org.commonmark/commonmark-ext-footnotes {:mvn/version "0.23.0"}
org.commonmark/commonmark-ext-task-list-items {:mvn/version "0.23.0"}
org.commonmark/commonmark-ext-gfm-tables {:mvn/version "0.23.0"}
org.commonmark/commonmark-ext-gfm-strikethrough {:mvn/version "0.23.0"}}

:aliases
{:nextjournal/clerk
{:extra-paths ["notebooks"]
:extra-deps {io.github.nextjournal/clerk {:git/sha "e8f275b5cf077ec9441e404c1885ff0b6ee0aef9"
{:extra-paths ["notebooks" "dev"]
:extra-deps {io.github.nextjournal/clerk {:git/sha "f7b47ebb639ea157a0f72d6a63e7f263179c72a9"
:exclusions [io.github.nextjournal/markdown]}}
:jvm-opts ["-Dclojure.main.report=stderr"
"-Dclerk.resource_manifest={\"/js/viewer.js\" \"js/viewer.js\"}"] ;;
Expand All @@ -24,8 +30,15 @@
:quiet
{:jvm-opts ["-Dpolyglot.engine.WarnInterpreterOnly=false"]}

:dev
{:extra-paths ["dev"]
:extra-deps {org.babashka/http-client {:mvn/version "0.3.11"}
org.clojure/test.check {:mvn/version "1.1.1"}
org.graalvm.js/js {:mvn/version "21.3.2.1"}}}

:test
{:extra-paths ["test"]
:jvm-opts ["-Dclojure.main.report=stderr"]
:extra-deps {nubank/matcher-combinators {:mvn/version "3.8.3"}}
:exec-fn test-runner/run}

Expand All @@ -40,7 +53,7 @@
:main-opts ["-m" "shadow.cljs.devtools.cli"]
:jvm-opts ["-Dclerk.resource_manifest={\"/js/viewer.js\" \"http://localhost:8021/viewer.js\"}"]
:extra-deps {io.github.nextjournal/clerk.render {:git/url "https://github.com/nextjournal/clerk"
:git/sha "e8f275b5cf077ec9441e404c1885ff0b6ee0aef9"
:git/sha "f7b47ebb639ea157a0f72d6a63e7f263179c72a9"
:deps/root "render"}}}

:build
Expand Down
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
(ns nextjournal.markdown
(ns nextjournal.markdown.graaljs
"Facility functions for handling markdown conversions"
(:require [clojure.java.io :as io]
[clojure.data.json :as json]
Expand Down Expand Up @@ -67,6 +67,13 @@
- [x] ~~thing~~
")

(parse "some text[^note] and some other[^other] but again[^note]
[^other]: some other
[^note]: some story
")


(->hiccup "# Hello Markdown
* What's _going_ on?
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@
(:require [clojure.string :as str]
[clojure.zip :as z]
[nextjournal.markdown.transform :as md.transform]
[nextjournal.markdown.parser.emoji :as emoji]
[nextjournal.markdown.utils.emoji :as emoji]
#?@(:cljs [[applied-science.js-interop :as j]
[cljs.reader :as reader]])))

Expand Down Expand Up @@ -101,7 +101,9 @@

(defn parse-fence-info [info-str]
(try
(when (string? info-str)
;; NOTE: this fix is backported
;; from the new implementation 👇
(when (and (string? info-str) (seq info-str))
(let [tokens (-> info-str
str/trim
(str/replace #"[\{\}\,]" "") ;; remove Pandoc/Rmarkdown brackets and commas
Expand Down Expand Up @@ -417,6 +419,8 @@ end"
(let [new-loc (-> loc (z/replace {:type :sidenote-container :content []})
(z/append-child node)
(z/append-child {:type :sidenote-column
;; TODO: broken in the old implementation
;; should be :content (mapv #(footnote->sidenote (get footnotes %)) (distinct refs))}))]
:content (mapv #(footnote->sidenote (get footnotes %)) refs)}))]
(recur (z/right new-loc) (z/up new-loc)))
(recur (z/right loc) parent))
Expand Down
59 changes: 59 additions & 0 deletions dev/old_vs_new.clj
Original file line number Diff line number Diff line change
@@ -0,0 +1,59 @@
(ns old-vs-new
(:require [clojure.string :as str]
[nextjournal.markdown.graaljs :as md-old]
[nextjournal.markdown :as md]
[babashka.http-client :as http]
[clojure.test.check :as tc]
[clojure.test.check.generators :as gen]
[clojure.test.check.properties :as prop]
[clojure.test.check.clojure-test :refer [defspec]]))

(comment
(= (md/parse "https://github.com")
(md-old/parse "https://github.com"))

(:body (http/get "https://jaspervdj.be/lorem-markdownum/markdown.txt?fenced-code-blocks=on"))

(let [sample (:body (http/get "https://jaspervdj.be/lorem-markdownum/markdown.txt?num-blocks=1000&fenced-code-blocks=on"))]
[(with-out-str (time (md-old/parse sample)))
(with-out-str (time (md/parse sample)))]))

(def feature-keys
(list "no-headers"
"no-code"
"no-quotes"
"no-lists"
"no-inline-markup"
"no-external-links"
"underline-headers"
"underscore-em"
"underscore-strong"
"no-wrapping"
"fenced-code-blocks"
"reference-links"))

(defn opts->query [opts]
(str/join "&" (keep (fn [[k v]] (when v (str k "=on"))) opts)))

(defn size+opts->random-md-str [size opts]
(:body (http/get (format "https://jaspervdj.be/lorem-markdownum/markdown.txt?%s&num-blocks=%s" (opts->query opts) size))))

(def md-generator
(gen/sized (fn [size]
#_(prn :size size)
(gen/fmap #(size+opts->random-md-str size %)
(gen/fmap (partial zipmap feature-keys) (gen/vector gen/boolean 12))))))

#_(gen/sample md-generator)

(def compare-old-vs-new-parse-implementations
(prop/for-all [s md-generator]
(= (md/parse s)
(md-old/parse s))))

#_(tc/quick-check 100 compare-old-vs-new-parse-implementations)

(defspec test-old-vs-new-implem 100
compare-old-vs-new-parse-implementations)

#_(test-old-vs-new-implem)
36 changes: 20 additions & 16 deletions notebooks/benchmarks.clj
Original file line number Diff line number Diff line change
@@ -1,29 +1,37 @@
;; # ⏱ Some Naïve Benchmarks
(ns ^:nextjournal.clerk/no-cache benchmarks
(ns benchmarks
{:nextjournal.clerk/no-cache true}
(:require [clojure.test :refer :all]
[nextjournal.clerk :as clerk]
[nextjournal.clerk.eval :as clerk.eval]
[nextjournal.markdown :as md]
parsing-extensibility
[nextjournal.markdown.parser :as md.parser]))
[nextjournal.markdown.graaljs :as old-md]
[nextjournal.markdown.utils :as u]
[parsing-extensibility]))

(def reference-text (slurp "notebooks/reference.md"))

(defmacro time-ms [& expr]
`(-> (clerk.eval/time-ms (dotimes [_# 100] ~@expr)) :time-ms (/ 100)))

(comment
(macroexpand '(time-ms do-this)))

;; Compare with different set of tokenizers
(defn parse
([text] (parse [] text))
([extra-tokenizers text]
(md.parser/parse (update md.parser/empty-doc :text-tokenizers concat extra-tokenizers)
(md/tokenize text))))
(md/parse* (assoc u/empty-doc :text-tokenizers extra-tokenizers)
text)))

;; Default set of tokenizers
(time-ms (parse reference-text))
(-> (parse reference-text) :content count)

;; Default set of tokenizers, warmup
[(time-ms (parse reference-text))
(time-ms (parse reference-text))
(time-ms (parse reference-text))]

;; GraalJS based implementation
[(time-ms (old-md/parse reference-text))
(time-ms (old-md/parse reference-text))
(time-ms (old-md/parse reference-text))]

;; With an extra brace-brace parser
(time-ms (parse [{:regex #"\{\{([^\{]+)\}\}"
Expand All @@ -39,14 +47,10 @@

;; With hashtags and internal links
(time-ms
(parse [md.parser/hashtag-tokenizer
md.parser/internal-link-tokenizer
(parse [u/hashtag-tokenizer
u/internal-link-tokenizer
{:regex #"\{\{([^\{]+)\}\}"
:handler (fn [m] {:type :var :text (m 1)})}
{:tokenizer-fn parsing-extensibility/losange-tokenizer-fn
:handler (fn [data] {:type :losange :data data})}]
reference-text))

^{::clerk/visibility {:code :hide :result :hide}}
(comment
(clerk/serve! {:port 8888}))
13 changes: 5 additions & 8 deletions notebooks/pandoc.clj
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@
[nextjournal.clerk :as clerk]
[nextjournal.clerk.viewer :as v]
[nextjournal.markdown :as md]
[nextjournal.markdown.parser :as md.parser]
[nextjournal.markdown.utils :as u]
[nextjournal.markdown.transform :as md.transform]))

;; From the [docs](https://pandoc.org/MANUAL.html#description):
Expand Down Expand Up @@ -161,20 +161,20 @@ this _is_ a
{:type :list-item
:content (keep pandoc->md li)}) (second (:c node)))})

:Math (fn [node] (let [[_meta latex] (:c node)] (md.parser/block-formula latex)))
:Math (fn [node] (let [[_meta latex] (:c node)] (u/block-formula latex)))
:Code (fn [node]
(let [[_meta code] (:c node)]
{:type :monospace :content [(md.parser/text-node code)]}))
{:type :monospace :content [(u/text-node code)]}))
:CodeBlock (fn [node]
(let [[[_id classes _meta] code] (:c node)]
{:type :code
:content [(md.parser/text-node code)]}))
:content [(u/text-node code)]}))
:SoftBreak (constantly {:type :softbreak})
:RawBlock (constantly nil)
:RawInline (fn [{:keys [c]}]
(cond
(and (vector? c) (= "latex" (first c)))
(md.parser/formula (second c))))})
(u/formula (second c))))})

^{::clerk/visibility {:result :hide}}
(defn pandoc->md [{:as node :keys [t pandoc-api-version blocks]}]
Expand Down Expand Up @@ -226,9 +226,6 @@ this _is_ a

^{::clerk/visibility {:result :hide :code :hide}}
(comment
(clerk/serve! {:port 9999})
(clerk/clear-cache!)
(-> *e ex-cause ex-data)
(json/read-str
(:out
(shell/sh "pandoc" "-f" "markdown" "-t" "json" :in markdown-text))
Expand Down
Loading

0 comments on commit 1690dfe

Please sign in to comment.