xsl/tei2c5.xsl: Very long runtime on large input #32

Open
respiranto opened this issue Jul 4, 2021 · 5 comments

Comments

@respiranto
Contributor

Given deu-eng-phonetics.tei, running

$ xsltproc --novalid --xinclude --stringparam dictname deu-eng --path /path/to/deu-eng/ /path/to/xsl/tei2c5.xsl build/tei/deu-eng-phonetics.tei >build/dictd/deu-eng.c5

as per the Makefiles, takes about 24 hours.

The problem is most likely not limited to tei2c5.xsl.

The deu-eng dictionary is quite large and has many siblings below the
entry level, in case that matters.

See also #31.

@humenda
Member

humenda commented Jul 4, 2021 via email

@karlb
Member

karlb commented Jul 4, 2021 via email

@humenda
Member

humenda commented Jul 4, 2021 via email

@karlb
Member

karlb commented Jul 4, 2021

I'm all for having a dictionary representation that is more strict. My hope was to maintain dictionaries directly in that format rather than having our dicts in generic TEI and converting to strict TEI. Is there any reason why we could not do that?

I would like to have diversity in dictionary applications and dictionaries rather than having a diversity of formats, each with a low number of applications. I'm also scared of debugging long conversion chains.

I have seen XDXF, but I didn't investigate it enough to see how well it works in practice. What would be the advantages over TEI? A stricter format definition?
It looks like they also don't have a single semantic format as a conversion target and rely on PyGlossary for other formats.

This discussion is getting a bit off-topic for this issue. Maybe we should split part of it off into a separate issue if there is interest in more discussion.

@respiranto
Contributor Author

> This discussion is getting a bit off-topic for this issue. Maybe we
> should split part of it off into a separate issue if there is interest
> in more discussion.

The topic of the discussion seems to have become how to replace the XSL
stylesheets (or: how to write exporters), which seems to be the
preferred solution to the original issue.

I'd say we could just rename the issue.

Intertwined with the now-predominant topic is the question of how strict
our format should be and, possibly, what it should be at all.

On XSLT:

  • There seems to be consensus among you that it is not very well suited
    here.
  • I just learned (a little) about XSLT.
    • It looks close to functional programming, but probably harder to
      use.

On IR:

  • Having just a single parser (constructing a common AST) would already
    be worth a lot. If redundant representations can be joined in an IR,
    even better.
  • Parser + AST + IR could serve as an authoritative definition of
    FreeDict TEI.
    • Currently there are:
      • the Wiki: not exhaustive
      • XML schemata: very large, hard to read (for me), not very
        strict.
    • Validation would come for free.
    • Parser, AST and IR can be mostly self-documenting (and easy to
      read).
  • Ideally our format would be so strict that IR = AST (or at least
    isomorphic, unless the IR is tailored to a target format).
  • One could add a pretty printer (IR/AST -> TEI). This should be
    relatively easy.
    • This would help with writing importers.
    • If IR != AST, this would give us a way to translate any valid
      TEI to a (yet to be defined) strict version. If we wanted to make
      our format more strict, we could use this to transform existing,
      less strict dictionaries.
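
To make the parser + AST + pretty printer idea more concrete, here is a
minimal sketch in Python. It is only an illustration under assumptions: the
`Entry` and `Sense` types, the function names, and the tiny TEI subset
(`entry/form/orth` plus `sense/cit/quote`) are made up for this example and
are not an agreed FreeDict definition.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from dataclasses import dataclass

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"t": TEI_NS}


# Hypothetical strict AST: one headword, at least one sense,
# at least one translation per sense.
@dataclass(frozen=True)
class Sense:
    translations: tuple[str, ...]


@dataclass(frozen=True)
class Entry:
    headword: str
    senses: tuple[Sense, ...]


def parse_entry(elem: ET.Element) -> Entry:
    """Parse one <entry>; reject anything outside the assumed strict subset."""
    orths = elem.findall("t:form/t:orth", NS)
    if len(orths) != 1 or not (orths[0].text or "").strip():
        raise ValueError("entry needs exactly one non-empty form/orth")
    senses = []
    for sense in elem.findall("t:sense", NS):
        quotes = tuple(
            q.text.strip()
            for q in sense.findall("t:cit/t:quote", NS)
            if q.text and q.text.strip()
        )
        if not quotes:
            raise ValueError("sense without translation")
        senses.append(Sense(quotes))
    if not senses:
        raise ValueError("entry without sense")
    return Entry(orths[0].text.strip(), tuple(senses))


def print_entry(entry: Entry) -> ET.Element:
    """Pretty printer: build the strict TEI element for an AST entry."""
    elem = ET.Element(f"{{{TEI_NS}}}entry")
    form = ET.SubElement(elem, f"{{{TEI_NS}}}form")
    ET.SubElement(form, f"{{{TEI_NS}}}orth").text = entry.headword
    for sense in entry.senses:
        sense_el = ET.SubElement(elem, f"{{{TEI_NS}}}sense")
        for translation in sense.translations:
            cit = ET.SubElement(sense_el, f"{{{TEI_NS}}}cit", type="trans")
            ET.SubElement(cit, f"{{{TEI_NS}}}quote").text = translation
    return elem


SAMPLE = """<entry xmlns="http://www.tei-c.org/ns/1.0">
  <form><orth>Haus</orth></form>
  <sense><cit type="trans"><quote>house</quote></cit></sense>
</entry>"""

entry = parse_entry(ET.fromstring(SAMPLE))
# Parsing the printed form again should yield the same AST.
roundtrip = parse_entry(print_entry(entry))
```

In this sketch, validation is just the parser raising `ValueError`, and the
type definitions double as documentation of what the format allows.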

On XDXF and Pyglossary:

  • XDXF: Are we discussing whether to
    • replace TEI with XDXF,
    • use XDXF as intermediate format for exporting, or
    • adopt XDXF's "approach"?
      • What would that mean?
  • If good converters from X to many other formats exist, it makes sense
    to me to write a TEI to X converter with the intention to use those
    other converters.
    • Particularly if those converters are injective
      (i.e., lossless and therefore reversible).
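
As a rough illustration of what injectivity means for a converter, the
following Python sketch looks for two different inputs that are mapped to the
same output. Both exporters and the helper `find_collisions` are hypothetical
and do not correspond to any existing FreeDict, XDXF, or PyGlossary code.

```python
import json


def export_flat(headword, translations):
    """Hypothetical lossy exporter: joins translations with ', '."""
    return f"{headword}: " + ", ".join(translations)


def export_json(headword, translations):
    """Hypothetical lossless exporter: keeps the list structure."""
    return json.dumps([headword, translations])


def find_collisions(export, inputs):
    """Return pairs of distinct inputs that the exporter maps to the same output."""
    seen = {}
    collisions = []
    for item in inputs:
        out = export(*item)
        if out in seen and seen[out] != item:
            collisions.append((seen[out], item))
        seen.setdefault(out, item)
    return collisions


inputs = [
    ("Haus", ["house", "home"]),
    ("Haus", ["house, home"]),  # one translation that contains a comma
]

flat_collisions = find_collisions(export_flat, inputs)
json_collisions = find_collisions(export_json, inputs)
```

Here `export_flat` produces the same string for both inputs, so it cannot be
reversed, while `export_json` keeps them apart. Finding no collisions on a
sample does not prove injectivity, it only fails to disprove it.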
