xsl/tei2c5.xsl: Very long runtime on large input #32

respiranto · 2021-07-04T03:23:48Z

Given deu-eng-phonetics.tei, running

$ xsltproc --novalid --xinclude --stringparam dictname deu-eng --path /path/to/deu-eng/ /path/to/xsl/tei2c5.xsl build/tei/deu-eng-phonetics.tei >build/dictd/deu-eng.c5

as per the Makefiles, takes about 24 hours.

The problem is most likely not limited to tei2c5.xsl.

The deu-eng dictionary is quite large and has many siblings below the
entry level, in case that matters.

See also #31.


humenda · 2021-07-04T10:51:46Z

This has been a long-standing issue. The problem seems to be somewhere in the way XPath expressions are built, resulting in path depths of > 10,000,000. However, XSL debugging is rather bad in the FOSS space. There is a related question and I will summarise the discussion around this here for completeness. Instead of fixing the old XSL style sheets, it might be worth reimplementing them in a different language. This would open the possibility to translate the TEI into an intermediate representation that can then be converted into multiple formats. PyGlossary has a similar goal, though it lacks any semantic meaning in its internal formats and is hence not a good fit for FreeDict. So in case somebody looks into something new for conversion, my opinion would be to drop the style sheets, hence I bring it up in this issue. If somebody is able to write XSL better than I do and fixes the style sheets, this paragraph can be ignored.

karlb · 2021-07-04T12:28:50Z

> There is a related question and I will summarise the discussion around this here for completeness. Instead of fixing the old XSL style sheets, it might be worth reimplementing them in a different language.

I also prefer working with languages other than XSL. Not because I think XSL is necessarily bad, but rather because I always have a hard time debugging non-trivial XSL problems.

> This would open the possibility to translate the TEI into an intermediate representation that can then be converted into multiple formats.

In my mind, our TEI files should be the intermediate representation and we should try to improve that instead of adding an additional representation.

> PyGlossary has a similar goal, though lacks any semantic meaning in its internal formats and is hence not a good fit for FreeDict.

I don't think we will be able to create a generic converter from TEI to detailed semantic representations. But most dictionary formats are mostly a mapping from headwords to formatted text. To handle those formats, I had good success with creating one format and then converting to other formats using PyGlossary. So the approach I have in mind is:

* One TEI -> formatted text dict converter (e.g. StarDict)
* Use PyGlossary to convert that to all other non-semantic formats
* For each desired semantic target format (are there any at this point?), write a custom converter from TEI

But ultimately, whoever does the work will get to decide, I assume. I didn't want to miss the opportunity to share my thoughts, though.
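The first step of this approach (TEI to a headword -> formatted text mapping) could be sketched roughly as follows. This is only an illustration, not an existing FreeDict tool: it assumes entries of the shape `entry/form/orth` plus `cit/quote` translations, the function name `iter_entries` and the `"; "` separator are made up, and it uses Python's standard `xml.etree.ElementTree.iterparse` so that entries are handled one at a time.

```python
import io
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"  # TEI namespace in ElementTree notation


def iter_entries(source):
    """Yield (headword, formatted_text) pairs, one TEI <entry> at a time.

    Hypothetical helper: the entry layout and the "; " separator are
    assumptions made for this sketch.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != TEI + "entry":
            continue
        headword = elem.findtext(f"{TEI}form/{TEI}orth", default="")
        translations = [
            quote.text or "" for quote in elem.iterfind(f".//{TEI}cit/{TEI}quote")
        ]
        yield headword, "; ".join(translations)
        elem.clear()  # drop the entry's children once it has been converted


SAMPLE = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<entry><form><orth>Haus</orth></form>
<sense><cit type="trans"><quote>house</quote></cit>
<cit type="trans"><quote>home</quote></cit></sense></entry>
<entry><form><orth>Baum</orth></form>
<sense><cit type="trans"><quote>tree</quote></cit></sense></entry>
</body></text></TEI>"""

pairs = list(iter_entries(io.BytesIO(SAMPLE)))
for headword, text in pairs:
    print(f"{headword}\t{text}")
```

The resulting pairs could then be written out in one formatted-text format and handed to PyGlossary for the remaining conversions; that part is left out here.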

humenda · 2021-07-04T13:51:20Z

> I also prefer working with languages other than XSL. Not because I think XSL is necessarily bad, but rather because I always have a hard time debugging non-trivial XSL problems.

Agreed.

>> This would open the possibility to translate the TEI into an intermediate representation that can then be converted into multiple formats.
>
> In my mind, our TEI files should be the intermediate representation and we should try to improve that instead of adding an additional representation.

IR is just a term used in compiler construction; it is not another format, but rather, I would say, a stricter version of TEI. TEI gets arbitrarily complex: gramGrp within cit or outside? Examples next to quotes or within cit? The IR basically defines abstractly that examples are attached to a particular translation, instead of allowing several different ways to encode it. But it's not utterly important, that's a question of the implementation. I have just been implementing something similar in another project.
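To make the idea concrete, here is a minimal sketch of such a normalisation, using the gramGrp placement as the example. Everything in it is hypothetical: the `Translation` class, the `normalise_sense` function and the simplified, namespace-free element layout are assumptions for illustration, not the actual FreeDict encoding rules.

```python
from dataclasses import dataclass
from typing import List, Optional
import xml.etree.ElementTree as ET


@dataclass
class Translation:
    """IR node: the part of speech is always attached to the translation."""
    text: str
    pos: Optional[str]


def normalise_sense(sense: ET.Element) -> List[Translation]:
    """Map both encodings (gramGrp inside or outside cit) to the same IR."""
    sense_pos = sense.findtext("gramGrp/pos")  # None if not given at sense level
    result = []
    for cit in sense.iterfind("cit"):
        # Prefer the cit's own gramGrp, fall back to the enclosing sense.
        pos = cit.findtext("gramGrp/pos") or sense_pos
        result.append(Translation(cit.findtext("quote", default=""), pos))
    return result


inside = ET.fromstring(
    "<sense><cit><quote>house</quote><gramGrp><pos>n</pos></gramGrp></cit></sense>"
)
outside = ET.fromstring(
    "<sense><gramGrp><pos>n</pos></gramGrp><cit><quote>house</quote></cit></sense>"
)

ir_inside = normalise_sense(inside)
ir_outside = normalise_sense(outside)
print(ir_inside == ir_outside)
```

Exporters would then only ever see `Translation` objects and would not need to know which of the two encodings the source dictionary used.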

>> PyGlossary has a similar goal, though lacks any semantic meaning in its internal formats and is hence not a good fit for FreeDict.
>
> I don't think we will be able to create a generic converter from TEI to detailed semantic representations. But most dictionary formats are mostly a mapping from headwords to formatted text. To handle those formats, I had

What do you mean by that? The goal is not to foster an ecosystem of output formats for all applications, but to have a parsed representation that can be transformed for dictionary formats. Semantic formats pose their own set of problems, but they're meant for dictionaries so should have a common ground. Of course, this conversion would still ignore any application other than dictionaries. But maybe you have specific dictionary formats in mind for which the semantic conversion would be hard.

> good success with creating one format and then converting to other formats using PyGlossary. So the approach I have in mind is:
>
> * One TEI -> formatted text dict converter (e.g. StarDict)
> * Use PyGlossary to convert that to all other non-semantic formats
> * For each desired semantic target format (are there any at this point?), write a custom converter from TEI

That's an intermediate solution, but not really nice. Do you know of the XDXF project? I would prefer this approach. The argument against this approach is simply that a project of this size should not have multiple converters because maintaining our custom dialect across multiple tools is a nightmare.

karlb · 2021-07-04T15:10:34Z

I'm all for having a dictionary representation that is more strict. My hope was to maintain dictionaries directly in that format rather than having our dicts in generic TEI and converting to strict TEI. Is there any reason why we could not do that?

I would like to have diversity in dictionary applications and dictionaries rather than having a diversity of formats, each with a low number of applications. I'm also scared of debugging long conversion chains.

I have seen XDXF, but I didn't investigate it enough to see how well it works in practice. What would be the advantages over TEI? A stricter format definition?
It looks like they also don't have a single semantic format as a conversion target and rely on PyGlossary for other formats.

This discussion is getting a bit off-topic for this issue. Maybe we should split part of it off into a separate issue if there is interest in more discussion.

respiranto · 2021-07-05T06:44:18Z

> This discussion is getting a bit off-topic for this issue. Maybe we
> should split part of it off into a separate issue if there is interest
> in more discussion.

The topic of the discussion seems to have become how to replace the XSL
stylesheets (or: how to write exporters). Which is what seems to be the
preferred solution to the original issue.

I'd say, we could just rename the issue.

Intertwined with the now-predominant topic is the question of how strict
our format should be and, possibly, what it should look like at all.

On XSLT:

There seems to be consensus among you that it is not very well suited
here.
I just learned (a little) about XSLT.
- It looks close to functional programming, but probably harder to
  use.

On IR:

Already having only a single parser (constructing a common AST) would
be worth a lot. If redundant representations can be joined in an IR,
even better.
Parser + AST + IR could serve as an authoritative definition of
FreeDict TEI.
- Currently there are:
  - the Wiki: not exhaustive
  - XML schemata: very large, hard to read (to me), not very
    strict.
- Validation would come for free.
- Parser, AST and IR can be mostly self-documenting (and easy to
  read).
Ideally our format would be so strict that IR = AST (or at least
isomorphic, unless the IR is tailored to a target format).
One could add a pretty printer (IR/AST -> TEI). This should be
relatively easy.
- This would help writing importers.
- If IR != AST, this would give us a way to translate any valid
  TEI to a (to be defined) strict version. If we wanted to make
  our format more strict, we could use this to transform existing,
  less strict dictionaries.
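A rough sketch of what parser, AST and pretty printer could look like, reduced to a toy subset. The `Entry` class, the function names and the namespace-free element layout are all assumptions made for illustration; a real definition of FreeDict TEI would cover far more.

```python
from dataclasses import dataclass, field
from typing import List
import xml.etree.ElementTree as ET


@dataclass
class Entry:
    """Toy AST node: one headword with its translations."""
    headword: str
    translations: List[str] = field(default_factory=list)


def parse_entry(elem: ET.Element) -> Entry:
    """Parser: TEI -> AST. Rejecting malformed input doubles as validation."""
    headword = elem.findtext("form/orth")
    if headword is None:
        raise ValueError("entry without form/orth")
    return Entry(headword, [q.text or "" for q in elem.iterfind("sense/cit/quote")])


def print_entry(entry: Entry) -> ET.Element:
    """Pretty printer: AST -> TEI, always emitting one fixed, strict shape."""
    elem = ET.Element("entry")
    ET.SubElement(ET.SubElement(elem, "form"), "orth").text = entry.headword
    sense = ET.SubElement(elem, "sense")
    for translation in entry.translations:
        cit = ET.SubElement(sense, "cit", {"type": "trans"})
        ET.SubElement(cit, "quote").text = translation
    return elem


entry = Entry("Haus", ["house", "home"])
xml_text = ET.tostring(print_entry(entry), encoding="unicode")
round_tripped = parse_entry(ET.fromstring(xml_text))
print(xml_text)
```

Because the printer emits only one shape, running parser and printer back to back over an existing dictionary would rewrite it into the strict variant, and the round trip `parse_entry(print_entry(e)) == e` gives a simple property to test against.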

On XDXF and Pyglossary:

XDXF: Are we discussing whether to
- replace TEI with XDXF,
- use XDXF as intermediate format for exporting, or
- adopt XDXF's "approach"?
  - What would that mean?
If good converters from X to many other formats exist, it makes sense
to me to write a TEI to X converter with the intention to use those
other converters.
- Particularly, if those converters are injective
  (i.e., reversible).
