tex4ht tutorial
tex4ht is a system which converts LaTeX to various output formats, including html, xhtml, odt, docbook or tei. html and odt are the most common and best-supported conversion targets.
tex4ht allows authors to use LaTeX input--widely employed for high-quality typography, especially mathematical typography--to produce output in other formats, especially html (for web pages) and xhtml (for ebooks and other applications).
tex4ht consists of three basic building blocks and various scripts which tie these blocks together.
- tex4ht.sty is a TeX package which inserts configured output codes (i.e., html tags) into TeX's .dvi output file. Many documents can be translated to html without users needing to supply tags explicitly, but there are macros to insert html directly into the output if the need arises.
- tex4ht is an executable (program) which extracts the information stored in the .dvi file, including text and output codes, and prepares auxiliary files for image conversion and other tasks. Note that although the whole system is named tex4ht, this command cannot be executed on a .tex file; it works only with a .dvi file.
- t4ht is a program which converts images, generates the css file, and runs various commands requested in the .tex file.
A number of helper shell scripts (commands) exist, so that users do not need to invoke these commands manually. The best known of these is htlatex, which by default converts LaTeX to html. Using different options, you can convert to any output format supported by the tex4ht system.
In fact, you can convert to almost any format using tex4ht, even to formats not based on xml, but doing so involves providing extensive configuration files.
The basic usage of the htlatex command (script) is as follows:
htlatex filename "options for tex4ht.sty" "options for tex4ht" "options for t4ht" "LaTeX options"
As you can see, htlatex has five parameters; only the first one, the filename, is mandatory. Also note that options must generally be enclosed in quotes so that they are passed literally to the underlying commands.
The calling command "driver" is mk4ht, which is similar to htlatex, but takes an additional first parameter indicating the system to be used. Values for this parameter include htlatex, which produces the same results as the htlatex script, oolatex for Open Document Format conversion, dblatex for docbook, or teilatex for TEI. The mk4ht command is quite general, allowing user-generated configuration files. For further information, see calling commands on the tex4ht website.
As an example, to compile to Open Document Format, you would type this at the command prompt:
mk4ht oolatex sample.tex
A more recent option is to use Michal Hoftich's make4ht build system for tex4ht. It allows the user to call various commands during compilation, such as bibtex, biber, or xindy; to postprocess output files with Lua scripts or commands such as tidy or xslt processors; and to specify the command to be used for image conversion.
In this tutorial, we will show the usage of both htlatex and make4ht.
Let's start with the conversion of a simple LaTeX file to html. Say we have the following multilingual LaTeX file:
\documentclass{article}
\usepackage[english,czech]{babel}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\begin{document}
Příliš žluťoučký kůň úpěl \textit{ďábelské} ódy.
\begin{otherlanguage}{english}
Some text in English
\end{otherlanguage}
\end{document}
Things to notice: the document uses two languages. Czech is the main document language and English is secondary. Note the use of the otherlanguage environment. It is provided by the babel package and switches the document language locally, so that correct hyphenation and other language-dependent settings are used. We could use the \selectlanguage command instead, but I would like to discourage the use of switching commands such as this one, or font switching commands like \bfseries, for one reason: it is impossible to configure them correctly for end element insertion, because a switch has no explicit end point.
For font switching commands, the situation is saved by the tex4ht command, which inserts formatting instructions for each font change. But generally, such commands don't play nicely with the nature of xml based formats, where every opened element must be closed on the same hierarchical level, so the opening and closing tags must have the same parent element.
Using the otherlanguage environment allows us to write a proper configuration and insert opening and closing tags at the correct places.
But beware of the following situation:
Hello world.
\begin{someenv}
Just start some environment.

But run it through several paragraphs
\end{someenv}
Say that we insert <div class="someenv"> and </div> tags around the someenv environment. By default, this may produce the following structure:
<p>Hello world.
<div class="someenv">Just start some environment.
</p>
<p>But run it through several paragraphs
</div></p>
As you can see, the generated html code is incorrect, as the opening and closing div tags have different parent elements. someenv can be configured to close the current paragraph, but that may not be what you want.
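As a rough sketch of what closing the current paragraph could look like, the configuration below uses \ConfigureEnv together with the \ifvmode\IgnorePar\fi\EndP idiom that tex4ht's own html configurations use for block-level environments. someenv is the hypothetical environment from the example above and the class name is an assumption; the code would go in a .cfg file of the kind described later in this tutorial.

```latex
% Sketch only: someenv is a hypothetical environment, not part of any package.
\Preamble{xhtml}
% \ConfigureEnv{env}{at \begin}{at \end}{list start}{list end}
% Close the open paragraph (if any) before emitting the block-level tags,
% so that <div> and </div> end up with the same parent element.
\ConfigureEnv{someenv}
  {\ifvmode\IgnorePar\fi\EndP\HCode{<div class="someenv">}}
  {\ifvmode\IgnorePar\fi\EndP\HCode{</div>}}
  {}{}
\begin{document}
\EndPreamble
```

With a configuration like this, "Hello world." would end up in its own paragraph before the div, which changes how the text flows and may or may not be acceptable for a given document.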
The best way to prevent the tag mismatch may be something like:
Hello world.
\begin{someenv}
Just start some environment.
\end{someenv}
\begin{someenv}
But run it through several paragraphs
\end{someenv}
But let's stop talking about the traps you may fall into and compile our example! To start, we will show the use of both htlatex and make4ht; we will focus on make4ht later.
With htlatex, we may use
htlatex sample1
and with make4ht
make4ht sample1
Let's look at the text part generated by htlatex:
<!--l. 6--><p class="noindent" >Příliš žluťoučký kůň úpěl <span
class="ecti-1000">ď</span><span
class="ecti-1000">ábelsk</span><span
class="ecti-1000">é </span>ódy. Some text in English
and by make4ht:
<!--l. 6--><p class="noindent" >Příliš žluťoučký kůň úpěl <span
class="ecti-1000">ď</span><span
class="ecti-1000">ábelsk</span><span
class="ecti-1000">é </span>ódy. Some text in English
</p>
The only difference is the missing </p> tag in the output of htlatex, because htlatex produces html 4.01 by default. make4ht, on the other hand, produces xhtml by default, so the closing tag must be present.
To get xhtml output from htlatex, use the tex4ht.sty option xhtml. This option must be the first option in the option list passed to tex4ht.sty. The value of the first option must be either html, xhtml or the name of a custom config file. We will cover these config files later, as they are a key component in the customization of tex4ht output.
So in order to get the same output as from make4ht, we must use the following command:
htlatex sample1 xhtml
Now we should get rid of the ugly entities which encode accented letters. This is somewhat cumbersome with htlatex:
htlatex sample1 "xhtml,charset=utf-8" " -cunihtf -utf8"
The charset=utf-8 option produces a meta element which declares the document to be in utf-8 encoding. The important parts are the two options for the tex4ht command, -c and -utf8.
ToDo: add a description of the process of conversion from htf fonts to utf8 using unicode.4hf. It is directed from the tex4ht.env file.
With make4ht, the situation is easier, as all we need to do is add the -u option:
make4ht -u sample1.tex
The resulting file:
<!--l. 6--><p class="noindent" >Příliš žluťoučký kůň úpěl <span
class="ecti-1000">ď</span><span
class="ecti-1000">ábelsk</span><span
class="ecti-1000">é </span>ódy. Some text in English
</p>
The entities are gone, but another problem persists. What we see is caused by a bug in the tex4ht command. It decorates text which is set in a non-default font with <span> elements. Unfortunately, this doesn't play well with accented letters, as we can see: a single word is split into several spans. Fortunately, this has an easy solution. We just need to dive into tex4ht configuration. Yay!
We already saw that we can use command line options to configure the output. For a full list of options for tex4ht.sty, see an article on CVR's blog. These options mainly influence the appearance of math, footnotes, tables, etc. Note that these options aren't a fixed set: anyone can add new options, and not all options are supported in every output format supported by tex4ht. Generally, these options work with html (and xhtml) output.
The other option is to use a custom config file (.cfg). This is a TeX file with a basic structure:
optional stuff like requiring LaTeX packages etc
...
\Preamble{xhtml,tex4ht.sty options}
...
tex4ht configurations
...
\begin{document}
...
more tex4ht configurations
...
\EndPreamble
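To make the skeleton above concrete, here is a minimal sketch of a complete config file. The file name mycfg.cfg and the css rule are assumptions chosen purely for illustration.

```latex
% mycfg.cfg -- hypothetical file name, minimal configuration sketch
\Preamble{xhtml}
% further \Configure and \Css commands would go here
\Css{body{max-width:40em; margin:auto;}}
\begin{document}
\EndPreamble
```

Assuming the file is saved next to sample1.tex, it could be selected with make4ht -c mycfg.cfg sample1.tex, or with htlatex by passing the config file name as the first tex4ht.sty option, as in htlatex sample1 "mycfg".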
The most important command for configuration is \Configure. This command takes a variable number of arguments; in the simplest form it has two arguments: \Configure{configname}{insert for a first hook}.
At this point we should talk about hooks. In order to insert html tags, LaTeX macros are redefined and special hooks are inserted into the new definitions. These hooks are declared with \NewConfigure{configname}{number of hooks} in a special file named after the redefined package, with the suffix .4ht. The hooks are then given content in the configuration files for particular output formats, or in the .cfg file.
To illustrate this, let's look at a simple example. Say we have a simple package, hello.sty:
\ProvidesPackage{hello}
\newcommand\hello{\textbf{hello world}}
\endinput
We can provide hooks in a file named hello.4ht. Say we just want to insert tags at the beginning and at the end of the \hello command:
% provide configure for \hello command. we can choose any name
% but most convenient is to name hooks after redefined command
% we declare two hooks, to be inserted before and after the command
\NewConfigure{hello}{2}
% now we need to redefine \hello. save it to tmp command
\let\tmp:hello\hello
% note that `:` can be part of command name in `.4ht` files.
% now insert the hooks. they are named as \a:hook, \b:hook, ..., \h:hook
% depending on how many hooks were declared
\renewcommand\hello{\a:hello\tmp:hello\b:hello}
Because we want to surround the contents produced by \hello with tags, we need to declare two hooks. This is the most usual case for normal commands which just produce some text. The old definition of the macro is saved in a temporary macro, and the command is then redefined to insert the hooks around the original contents stored in the temporary macro.
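The same save-and-redefine pattern can be sketched for a command that takes an argument. The \greet command and its package are hypothetical and serve only to show how the argument would be passed through to the saved definition.

```latex
% greet.4ht -- hypothetical, for an imagined package defining \greet{name}
% declare two hooks: \a:greet (before) and \b:greet (after)
\NewConfigure{greet}{2}
% save the original definition
\let\tmp:greet\greet
% redefine with one argument and hand it on to the saved original
\renewcommand\greet[1]{\a:greet\tmp:greet{#1}\b:greet}
```

Such hooks would then be filled with \Configure{greet}{...}{...} in a config file, exactly as for a command without arguments.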
Now we can change our sample to use the hello package:
\documentclass{article}
\usepackage[english,czech]{babel}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{hello}
\begin{document} Příliš žluťoučký kůň úpěl \textit{ďábelské} ódy.
\begin{otherlanguage}{english} Some text in English, \hello
\end{otherlanguage}
\end{document}
We haven't provided any configuration for hello yet, but you can see that the text hello world is in bold font anyway. This is the same case as \textit, which is converted as italic. Basic font styles are inserted by the tex4ht command during extraction of text from dvi to an output format. So it is the right time to finally show how to configure both \textit and hello to produce better tags than they provide by default.
The basic structure of a config file has been shown before, so now we will just add basic configurations for \textit and \hello:
\Preamble{xhtml}
\Configure{textit}{\HCode{<span class="textit">}}{\HCode{</span>}}
\Configure{hello}{\HCode{<span class="hello">}}{\HCode{</span>}}
\Css{.textit{font-style:italic;}}
\Css{.hello{font-weight:bold;}}
\begin{document}
\EndPreamble
For documentation of the default configurations, see tex4ht info; the most useful are the LaTeX and tex4ht sections. Documentation for basic font commands such as \textit or \textbf is provided in the LaTeX section. We can see that the configuration takes two parameters: the insertion before and the insertion after the content.
The same holds for the hello configuration we defined earlier: hooks are inserted before and after the content.
To insert html tags, we need to use the \HCode command; otherwise, special characters such as <, > or & are escaped. In our example we insert span elements with a class attribute to distinguish them. Because these classes don't have any visual appearance by default, we use \Css commands to add some styling. Yes, you need to know both html and css to configure tex4ht effectively!
If we look at the html output now, we can see that things don't look much better than initially:
<!--l. 6--><p class="noindent" >Příliš žluťoučký kůň úpěl <span class="textit"><span
class="ecti-1000">ď</span><span
class="ecti-1000">ábelsk</span><span
class="ecti-1000">é</span></span> ódy. Some text in English, <span class="hello"><span
class="ecbx-1000">hello world</span></span>
</p>
Our new tags were inserted, but the unnecessary elements inserted by the tex4ht processor are still present. Fortunately, we can suppress insertion of these elements with the \NoFonts command, and later enable it again with \EndNoFonts.
We can also use the tex4ht.sty option NoFonts, which suppresses font processing in the whole document, but you should use it with caution, as it may have some side effects.
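For comparison, a config file using the document-wide option might look like the following sketch. It assumes only the NoFonts option named above and adds no other configuration.

```latex
% Sketch: suppress font-based <span> decoration for the whole document
\Preamble{xhtml,NoFonts}
\begin{document}
\EndPreamble
```

One side effect to expect is that text set with font commands which have no explicit configuration may lose its visual distinction in the output.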
Let's take a look at how our configurations would look with the \NoFonts command:
\Preamble{xhtml}
\Configure{textit}{\HCode{<span class="textit">}\NoFonts}
{\EndNoFonts\HCode{</span>}}
\Configure{hello}{\HCode{<span class="hello">}\NoFonts}
{\EndNoFonts\HCode{</span>}}
\Css{.textit{font-style:italic;}}
\Css{.hello{font-weight:bold;}}
\begin{document}
\EndPreamble
The output now looks much better:
<!--l. 6--><p class="noindent" >Příliš žluťoučký kůň úpěl <span class="textit">ďábelské</span> ódy. Some text in English, <span class="hello">hello world</span>
</p>
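The same pattern can be sketched for other basic font commands. The example below applies it to \textbf, on the assumption that it takes two hooks just like \textit; the class name textbf is an arbitrary choice.

```latex
\Preamble{xhtml}
% Same pattern as for \textit: wrap the content in a classed span
% and switch off font decoration inside it.
\Configure{textbf}{\HCode{<span class="textbf">}\NoFonts}
                  {\EndNoFonts\HCode{</span>}}
\Css{.textbf{font-weight:bold;}}
\begin{document}
\EndPreamble
```

Keeping one class per command makes it easy to restyle the output later by editing only the css.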
It may seem that we can be happy at this point, but things aren't as easy as we may hope, because there is one thing we haven't talked about: what if we add some more paragraphs in English to our sample file?
\documentclass{article}
\usepackage[english,czech]{babel}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{hello}
\begin{document} Příliš žluťoučký kůň úpěl \textit{ďábelské} ódy.
\begin{otherlanguage}{english} Some text in English, \hello
\end{otherlanguage}
\begin{otherlanguage}{english}
\textit{What will do} \verb|\textit| at the beginning of paragraph?
And also, what about configuration for \verb|otherlanguage| environment?
\end{otherlanguage}
\end{document}
What if we want to insert elements with a lang attribute to specify the language of the text in the html? This might be useful from a semantic point of view. We can also enable hyphenation in the css, which works only when the correct languages are marked in the source.
This exercise will be a little bit more difficult