XML Workflow Image

Version: 1.4 (5.6.2024)
Developed by: Ronald Steffen

This repo contains a container image assembling different open source tools and a range of open source template files into a flexible framework for single source Jats/Bits XML publishing. The framework allows to define custom conversion toolchains that combine different conversion tools into an automized conversion workflow. Currently it provides a docx -> (Jats) XML -> html -> pdf toolchain by default.

NOTE: This repo is under development. This repo is intended as a proof-of-concept tool. In particular the templates require significant revision and improvement. I don't play around with Latex. So don't count on any developments of the pandoc-pdf Latex template. Documantation is incomplete.

The tools included are:

Pandoc 3.1.12.2
luarocks 3.9.2
Saxon HE 12 4J
mathjax-full 3.2.2
Pagedjs and pagedjs-cli 0.4.3 (with puppeteer 22.4.1)
WeasyPrint version 57.2
docxToJats
just 2.0.1 (for task execution)

Template files and other sources (e.g. css) are dereived from:

Pandoc default templates
Preview of NISO JATS Publishing 1.0 XML
NLM/NCBI Journal Publishing 3.0 Preview HTML
Journal Publishing 3.0 APA-like Citation
NCBI Book Tag Set Version 2.1

Installation & Usage

Installation

Download all files required to build and run the image/container (To use this repo in "production mode" you don't need to clone it to your host machine. This might only be necessary in case you want to develop new templates.): bash wget https://raw.githubusercontent.com/ronste/xmlworkflow/main/download.sh source download.sh
Build the conainter image from inside the download directory according to your platform (e.g. Docker, Podman, ...) with the image name xmlworkflow:latest, eg:
- docker build -t xmlworkflow:latest .
- podman build -t xmlworkflow:latest .

Further examples will be for podman but should also work with docker

Prepare working environment

Create a directory to hold all your working directories and files
From inside this directory start a conatiner with: . <path-to-your-download-directory>/xmlworkflow-run-prod.sh <your-container-name> to start a container
Prepare your working directory by either copying a docx file into the folder work or, alternatively, run podman exec <your-container-name> /bin/bash -c "cd /root/xmlworkflow/work && processDocx reset-example" to use the demo docx file

Perfrom a docx converion

To run a full docx to pdf conversion a specifically prepared MS Word docx document needs to be placed inside the working directory. The docx document is requiered to use specific formating templates as explained in the documentation. Please note that the default conversion chain (which handles metadata from the docx document) is optimized for docx -> Jats XML conversion via Pandoc (for customized conversion chains see below).

Start the docx conversion with the following command: - podman exec <your-container-name> /bin/bash -c "cd /root/xmlworkflow/work && processDocx"

This will create a range of intermediate and final files in the working directory which will be named according to the conversion step (tool) they are corresponding to. E.g. <docx-filename>_SaxonHE.html or <docx-filename>_weasyprint.pdf.

To get an overview of all options run:

podman exec <your-container-name> /bin/bash -c "cd /root/xmlworkflow/work && processDocx help

which will output:

Usage: [ processDocx | just ] [ theme=<theme> | debug=true | validate=true | develop=true  | pagedjs-polyfill=true ] [ docx-file | xml-file | html-file =<filename> ] [ <recipe> ]

To run a recipe without its dependencies add a "-" to the recipes name, e.g. "weasyprint-"
If no docx file is provided the first docx file in the working dirctory will be taken.
If no recipe is provided the full conversion chain is run and all available versions of PDFs will be created.

Options:
    theme = <theme>: Selects a set of templates (i.e. a sub folder of the themes folder) to be used for conversion.
    debug = true: Enables the debug options of the different tools.
    validate = true: Runs an XML validation with xmllint.
    develop = true: Adds specific css rules to the conversion of HTML and PDF files to provide features for debugging and development of Print CSS conversion steps.
    docx-file | xml-file | html-file = <filename>: Specifies the path and name of the docx, xml or html source file to be converted.
    pagedjs-polyfill = true: Adds the Pagedjs polyfill to the final HTML output for debugging purpose.

Available recipes:
    all                   # run all conversions (default)
    bitsxml               # Convert docx to Bits XML (experimental, metadata not yet supported).
    cleanup-work          # Clean up the working directory removing all files in work and in work/media
    copy-work destination # Copies the full content of the work folder into a new folder inside the configured COPY_PATH folder
    docxtojats            # Generate Jats XML using the docxtojats converter
    html                  # Convert XML to HTML using Saxon HE 12
    mathjax               # Convert Mathl to CHTML using mathjax
    pagedjs               # Generate PDF using Pagedjs
    pandoc                # Convert docx to Jats XML using Pandoc.
    pandoc-bits           # Convert docx to Bits XML using Pandoc (experimental, metadata not yet supported).
    pandoc-pdf-html       # Generate PDF from HTML using Pandoc
    pandoc-pdf-xml        # Generate PDF from XML using Pandoc
    pdf                   # Generate PDF using Pandoc, Pagedjs and Weasyprint
    reset-bits-example    # Clean up the working directory and reset Bits XML example file
    reset-jats-example    # Clean up the working directory and reset Jats XML example file
    runtests              # Run different test scripts
    weasyprint            # Generate PDF using Weasyprint
    xml                   # Convert docx to XML using Pandoc + Saxon HE 12.
    xml-validate          # Validate XML file against DTD provided by DOCTYPE tag. Usage: "processDocx xml-validate <filename>"

More information on individual conversion steps may be available by running:

docker exec <your-container-name> /bin/bash -c "cd /root/xmlworkflow/work && processDocx help <recipe-name>

The default (docx-based) conversion chain

Running the processDocx command without any parameters will start the full chain of conversion from docx -> xml -> html -> pdf, including all versions of PDFs (created with Pandoc, PagedJs and Weasyprint).

However, each of these conversion steps can be selected individually. To only run a conversion from docx to HTML use:

podman exec <your-container-name> /bin/bash -c "cd /root/xmlworkflow/work && processDocx html"

If, due to a previous run, the Jats XML buffer file requried to create the HTML output is already present in the working directory you can run the HTML conversion without its prior conversion steps by using:

podman exec <your-container-name> /bin/bash -c "cd /root/xmlworkflow/work && processDocx html-"

Note that it is also possible to specify specific conversion chains directly by passing multiple recipe names to theprocessDocx command:

podman exec <your-container-name> /bin/bash -c "cd /root/xmlworkflow/work && processDocx pagedjs weasyprint"

Using custom themes

The xmlworkflow container supports the definition of custom themes, i.e. different sets of templates, css and media files. To create a custom theme create a copy of the folder themes/default within the themes folder. You can then use the theme by providing the option theme=<folder name> and customize the templates inside this folder.

A theme may also contain its own justfile to define custom conversion steps. Not surprisingly, it should (but is not required to) import the main justfile. To run a conversion from this theme-specific justfile you directly specifiy the path to your custom justfile, e.g.:

podman exec <your-container-name> /bin/bash -c "cd /root/xmlworkflow/work && processDocx -f /root/xmlworkflow/themes/default/justfile help"

If you run the above command you find the additional recipes custom-help and custom-example were added.

You could e.g. define a step doing a Markdown to XML conversion with Pandoc to build your own Markdown -> Jats XML -> ... conversion chain for one particular theme.

How to run a custom conversion chain

To run a custom conversion chain, e.g. not using the default Pandoc docx to Jats XML conversion, you can specifiy your conversion steps directly on the command line. In the following example docx to Jats XML conversion is facilitated by the docxToJats converter and the resulting Jats XML file is subsequently passed through the default Jats XML -> HTML -> Weasyprint conversion steps. The minus sign at the end of the subsequent recipe names is important in this use case, since it indicates that only the recipe without its default dependecies (which would be Pandoc) should be run.

podman exec <your-container-name> /bin/bash -c "cd /root/xmlworkflow/work && processDocx docxtojats html- weasyprint-"

How to run conversion chains not based on docx documents

By specifying a XML or HTML input file on the command line you can inject your own source file into a custom conversion chain. The following command e.g. performs an xml -> html -> pdf conversion without requiring a docx file:

podman exec <your-container-name> /bin/bash -c "cd /root/xmlworkflow/work && processDocx xml-file='Dummy_Article_Template.docx_SaxonHE.xml' html- weasyprint-"

Please don't forget to put all dependent files into the media folder.

How to develop, debug and test Print CSS stylesheets

To help with developing and debugging Print CSS stylesheets the develop=true option is available. This will add special css to the conversion process that may help you to identify possible issues.

The applied CSS, e.g., will color backgrounds for the html and body tags and add borders to the Print CSS page margin boxes. You can see these in the Pdf files and with the Pagedjs polyfill option.

You can also view the HTML file in print mode in Firefox. To do so open the developer tools and go to the Inspector tab. In the upper right corner of the styles section enable the icon "Toggle print media simulation for the page". This will show the HTML file with all @media print styles enabled (but without pagination) and you can inspect any issues interactively.

For specifically debugging Pagedjs issues you can add the Pagedjs polyfill to you HTML file by using the option pagedjs-polyfill=true. However, please not that the Pagedjs polyfill and pagedjs-cli (as currently used for the conversion) might behave differently under some circumstances (see below).

An empty HTML template that loads the development CSS and the Pagedjs polyfill is provided with the file themes\default\templates\html_template_printcss.html. You can inspect this file with your browser to learn about the basic Print CSS page layout. Note however, that Pagedjs and Weasyprint handle these layout elements differently!

To run a set of different tests use:

`podman exec <your-container-name> /bin/bash -c "cd /root/xmlworkflow/work && processDocx runtests"

Next steps

Continuous imporovements of template files
Implementation of a web app to provide a REST-API to handle Dar packages
considering pagedjs polyfill vs pagedjs-cli
revise handling of resource files between themes

Known issues

As of Pandoc version 3.1.12.2 the Pandoc HTML reader complains about unclosed HTML tags (used to work with Pandoc 3.1.11). Therefore the HTML -> PDF conversion with Pandoc is currently broken.

Version history

1.1 New recipes added 1.2 Added feature to define theme-specific custom conversion steps 1.3 Added docxtojats converter, added mathjay, runtests and xml-validate recipes

Disclaimer

This repo is developed in the context of the services offered by Berlin Universities Publishing Journals (operated by the University Library FU Berlin) and licensed under GNU GENERAL PUBLIC LICENSE Version 3.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Readme.md

Readme.md

XML Workflow Image

Installation & Usage

Installation

Prepare working environment

Perfrom a docx converion

The default (docx-based) conversion chain

Using custom themes

How to run a custom conversion chain

How to run conversion chains not based on docx documents

How to develop, debug and test Print CSS stylesheets

Next steps

Known issues

Version history

Disclaimer

Files

Readme.md

Latest commit

History

Readme.md

File metadata and controls

XML Workflow Image

Installation & Usage

Installation

Prepare working environment

Perfrom a docx converion

The default (docx-based) conversion chain

Using custom themes

How to run a custom conversion chain

How to run conversion chains not based on docx documents

How to develop, debug and test Print CSS stylesheets

Next steps

Known issues

Version history

Disclaimer