xsl/tei2c5.xsl: Very long runtime on large input #32
Comments
This has been a long-standing issue. The problem seems to be somewhere in the
way XPath expressions are built, resulting in path depths of > 10,000,000.
However, XSL debugging is rather bad in the FOSS space.
There is a related question and I will summarise the discussion around this
here for completeness. Instead of fixing the old XSL style sheets, it might
be worth to reimplement the style sheets in a different language. This would
open the possibility to translate the TEI into an intermediate representation
that can then be converted into multiple formats. PyGlossary has a similar
goal, though lacks any semantic meaning in its internal formats and is hence
not a good fit for FreeDict. So in case somebody would look into something new
for conversion, my opinion would be to drop the style sheets, hence I bring it
up in this issue. If somebody is able to write XSL better than I do and fixes
the style sheets, this paragraph can be ignored.
|
> There is a related question and I will summarise the discussion around this
> here for completeness. Instead of fixing the old XSL style sheets, it might
> be worth to reimplement the style sheets in a different language.
I also prefer working with languages other than XSL. Not because I think
XSL is necessarily bad, but rather because I always have a hard time
debugging non-trivial XSL problems.
> This would
> open the possibility to translate the TEI into an intermediate
> representation
> that can then be converted into multiple formats.
In my mind, our TEI files should be the intermediate representation and we
should try to improve that instead of adding an additional representation.
> PyGlossary has a similar
> goal, though lacks any semantic meaning in its internal formats and is
> hence
> not a good fit for FreeDict.
I don't think we will be able to create a generic converter from TEI to
detailed semantic representations. But most dictionary formats are mostly a
mapping from headwords to formatted text. To handle those formats, I had
good success with creating one format and then converting to other formats
using PyGlossary. So the approach I have in mind is:
* One TEI -> formatted text dict converter (e.g. StarDict)
* Use PyGlossary to convert that to all other non-semantic formats
* For each desired semantic target format (are there any at this point?), write a custom converter from TEI
But ultimately, whoever does the work will get to decide, I assume. I
didn't want to miss the opportunity to share my thoughts, though.
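The first step of the list above (one TEI -> formatted text converter) could be sketched roughly as follows. This is only a minimal illustration, not an existing FreeDict tool: the function name `tei_to_text_dict`, the sample data, and the assumed entry layout (`form/orth` for the headword, `cit[@type='trans']/quote` for translations) are assumptions made for the example. Real FreeDict TEI files contain many more structures that would need handling.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# The TEI namespace; the prefix "tei" is only a local alias for the XPath queries.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample input, much simpler than a real dictionary file.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <entry>
      <form><orth>Haus</orth></form>
      <sense>
        <cit type="trans"><quote>house</quote></cit>
        <cit type="trans"><quote>home</quote></cit>
      </sense>
    </entry>
    <entry>
      <form><orth>Baum</orth></form>
      <sense><cit type="trans"><quote>tree</quote></cit></sense>
    </entry>
  </body></text>
</TEI>"""


def tei_to_text_dict(tei_xml: str) -> Dict[str, str]:
    """Map each headword to a numbered plain-text list of its translations."""
    root = ET.fromstring(tei_xml)
    result: Dict[str, str] = {}
    for entry in root.iterfind(".//tei:entry", TEI_NS):
        orth = entry.find("tei:form/tei:orth", TEI_NS)
        headword = (orth.text or "").strip() if orth is not None else ""
        if not headword:
            continue  # skip entries without a usable headword
        quotes = [
            (q.text or "").strip()
            for q in entry.iterfind(".//tei:cit[@type='trans']/tei:quote", TEI_NS)
        ]
        quotes = [q for q in quotes if q]
        body = "\n".join(f"{i}. {text}" for i, text in enumerate(quotes, start=1))
        # Homographs: append to the existing article instead of overwriting it.
        result[headword] = f"{result[headword]}\n{body}" if headword in result else body
    return result


if __name__ == "__main__":
    for headword, article in tei_to_text_dict(SAMPLE).items():
        print(headword)
        print(article)
```

The resulting headword-to-text mapping is the kind of data that would then be written out in one concrete format (e.g. StarDict) and handed to PyGlossary for the remaining formats; that writing step is not shown here.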
|
> I also prefer working with languages other than XSL. Not because I think
> XSL is necessarily bad, but rather because I always have a hard time
> debugging non-trivial XSL problems.
Agreed.
> This would
> open the possibility to translate the TEI into an intermediate
> representation
> that can then be converted into multiple formats.
> In my mind, our TEI files should be the intermediate representation and we
> should try to improve that instead of adding an additional representation.
IR is just a term used in compiler construction; it is not another format
but, I would say, a stricter version of our TEI. TEI
gets arbitrarily complex: gramGrp within cit or outside? Examples next to
quotes or within cit? The IR basically defines abstractly that examples are
attached to a particular translation, instead of allowing several different
ways to encode it. But it's not utterly important, that's a question of the
implementation. I have just been implementing something similar in another
project.
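To make the idea of a strict IR more concrete, here is a minimal Python sketch. It is not taken from any existing project; all class and field names (`Entry`, `Sense`, `Translation`, `Example`, `render_plain`) are hypothetical. The point it illustrates is the one made above: an example can only ever be attached to one particular translation, so there is exactly one way to encode it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Example:
    source: str  # example phrase in the source language
    target: str  # its rendering in the target language


@dataclass
class Translation:
    text: str
    # Examples live here and nowhere else in the model.
    examples: List[Example] = field(default_factory=list)


@dataclass
class Sense:
    translations: List[Translation] = field(default_factory=list)


@dataclass
class Entry:
    headword: str
    part_of_speech: str = ""  # grammar info has one fixed place, too
    senses: List[Sense] = field(default_factory=list)


def render_plain(entry: Entry) -> str:
    """Render an entry as plain text; one possible backend among several."""
    header = entry.headword
    if entry.part_of_speech:
        header += f" ({entry.part_of_speech})"
    lines = [header]
    for number, sense in enumerate(entry.senses, start=1):
        lines.append(f"{number}. " + ", ".join(t.text for t in sense.translations))
        for translation in sense.translations:
            for example in translation.examples:
                lines.append(f"   {example.source} = {example.target}")
    return "\n".join(lines)


if __name__ == "__main__":
    haus = Entry(
        headword="Haus",
        part_of_speech="n",
        senses=[Sense([Translation("house", [Example("ein Haus bauen", "to build a house")])])],
    )
    print(render_plain(haus))
```

A TEI reader would have to normalise the different encodings (gramGrp inside or outside cit, examples next to quotes or within cit) into this one shape; each output format then only needs a renderer over the strict model.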
> PyGlossary has a similar
> goal, though lacks any semantic meaning in its internal formats and is
> hence
> not a good fit for FreeDict.
> I don't think we will be able to create a generic converter from TEI to
> detailed semantic representations. But most dictionary formats are mostly a
> mapping from headwords to formatted text. To handle those formats, I had
What do you mean by that? The goal is not to foster an ecosystem of output
formats for all applications, but to have a parsed representation that can be
transformed for dictionary formats. Semantic formats pose their own set of
problems, but they're meant for dictionaries so should have a common ground.
Of course, this conversion would still ignore any application other than
dictionaries. But maybe you have specific dictionary formats in mind for which
the semantic conversion would be hard.
> good success with creating one format and then converting to other formats
> using PyGlossary. So the approach I have in mind is:
> * One TEI -> formatted text dict converter (e.g. StarDict)
> * Use PyGlossary to convert that to all other non-semantic formats
> * For each desired semantic target format (are there any at this point?),
>   write a custom converter from TEI
That's an intermediate solution, but not really nice. Do you know of the XDXF
project? I would prefer this approach.
The argument against this approach is simply that a project of this size
should not have multiple converters because maintaining our custom dialect
across multiple tools is a nightmare.
|
I'm all for having a dictionary representation that is more strict. My hope was to maintain dictionaries directly in that format rather than having our dicts in generic TEI and converting to strict TEI. Is there any reason why we could not do that?
I would like to have diversity in dictionary applications and dictionaries rather than having a diversity of formats, each with a low number of applications. I'm also scared of debugging long conversion chains.
I have seen XDXF, but I didn't investigate it enough to see how well it works in practice. What would be the advantages over TEI? A stricter format definition?
This discussion is getting a bit off-topic for this issue. Maybe we should split part of it off into a separate issue if there is interest in more discussion.
|
The topic of the discussion seems to have become how to replace the XSL
I'd say, we could just rename the issue.
Intertwined with the now-predominant topic is the question how strict
On XSLT:
On IR:
On XDXF and Pyglossary:
|
Given deu-eng-phonetics.tei, running
$ xsltproc --novalid --xinclude --stringparam dictname deu-eng --path /path/to/deu-eng/ /path/to/xsl/tei2c5.xsl build/tei/deu-eng-phonetics.tei >build/dictd/deu-eng.c5
as per the Makefiles, takes about 24 hours.
The problem is most likely not limited to tei2c5.xsl.
The deu-eng dictionary is quite large and has many siblings below the
entry level, in case that matters.
See also #31.
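If the style sheets were reimplemented in another language, as discussed in the comments, large inputs such as deu-eng could be processed entry by entry instead of as one in-memory document. The following Python sketch shows that streaming pattern with the standard library only. It does not diagnose why xsltproc is slow, and the function name `iter_headwords`, the sample data, and the assumed `entry/form/orth` layout are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Tag prefix in ElementTree's "{namespace}local-name" notation.
TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical sample input standing in for a large dictionary file.
SAMPLE = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <entry><form><orth>Haus</orth></form></entry>
    <entry><form><orth>Baum</orth></form></entry>
  </body></text>
</TEI>"""


def iter_headwords(source):
    """Yield the headword of every entry while parsing incrementally.

    `source` is a file name or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == TEI + "entry":
            orth = elem.find(f"{TEI}form/{TEI}orth")
            if orth is not None and orth.text:
                yield orth.text.strip()
            # Drop the entry's children and text once it has been handled.
            # The emptied element itself stays attached to its parent.
            elem.clear()


if __name__ == "__main__":
    for headword in iter_headwords(io.BytesIO(SAMPLE)):
        print(headword)
```

Because each entry is handled as soon as its closing tag is seen, the work per entry does not depend on how many siblings it has.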