XDXF format comments, suggestions and needed corrections #30

Open
k-sl opened this issue Jan 21, 2017 · 18 comments
Labels: format_specs (format specifications, new versions)


@k-sl

k-sl commented Jan 21, 2017

@soshial and whoever else is involved in developing the XDXF format standard:

I recently realized that there are a number of great free Chinese and Japanese dictionaries but unfortunately each is made available in its own specific format, which means it takes a specific tool to read it. This made me start looking for a good dictionary format (preferably XML) that could be used for any language. I found that format in XDXF, which I do consider is the closest we have to an ideal open and global dictionary format standard. As I set out to write a converter for the Chinese-English CC-CEDICT dictionary, I unfortunately also noticed many problems with the format, some of those serious enough to prevent a good dictionary conversion (from non-alphabetic languages), some just minor (or major) inconveniences.

What follows is a series of comments, corrections, criticism and proposals about the XDXF standard.

There are four main points where I think improvement is needed for XDXF to fully achieve its purpose: format (the visual format needs to be dropped completely), file structure (there should actually be two different formats, a flat XML file and a package), deeper semantic definition, and better support for non-European and non-alphabetic languages (especially multiple writing systems and transliterations).

It is important that this format be able to represent all information commonly found in a dictionary, be it paper or electronic, and from any language to any language.

(Any XML markup in the following suggestions that is not currently in the XDXF standard is just a suggestion. I am in no way implying that it should be the final version.)


I. Format

What makes XDXF stand out when compared to other formats is the ability to describe a dictionary in a semantic format. That is what XDXF brings to the table that previous dictionary formats cannot compete with. A stardict dictionary converted to visual XDXF may still be technically an improvement, but it'll be barely noticeable and so it doesn't make much sense to go through the trouble of converting it, when that is the most supported format anyway. (The same is true about other visual formats.)

I propose that the visual format be dropped from the XDXF standard; dictionaries in the visual format should be considered obsolete and no longer supported. I understand that, at least in part, the reason for a visual format is that it allows almost seamless conversion from most other dictionary formats. Its discontinuation would make many-to-many converters much harder to write (if possible at all), as all information needs to be parsed. As I argued above, I don't believe being an easy target of conversion is worth much if there isn't a significant improvement of some form, either to the DS (dictionary software) maintainers or to the users. I think it reasonable to suggest that DS keep support for the deprecated revision 33 as a way to keep supporting the visual format dictionaries that may be available in the wild.

It is not enough to state that the visual format is unsupported; it must not be part of the most recent revision at all.

A beneficial side effect is that this will make the XML definition much clearer, as it won't be defining what in effect are two different formats.

II. File Structure

The structure seems very confusing. On the one hand, it seems to be trying to describe an XML format for a dictionary, in the classical meaning for a dictionary: a list of words/phrases with their corresponding description and, possibly, additional metadata. On the other, it describes what could be called a DS "dictionary package" with things that aren't actually traditionally part of a dictionary, like toolbar icons, images, sound files, a folder structure, etc.

I agree that it is reasonable to try to accommodate both interpretations of what a dictionary is and what an electronic dictionary should be, but a better solution would be to develop a standard for two different formats: (i) a "flat" XML dictionary format containing only the textual data that traditionally constitutes a dictionary and the metadata to describe it, written to a file with an identifying file name and .xdxf as extension; and (ii) a "dictionary package", possibly modeled after EPUB or OpenDocument. That is, essentially a zipped archive with an (XML) index file indicating the contents of the package, which must include one or more .xdxf files (more than one allowed only when they're related, e.g. the Larousse English-French French-English dictionary would consist of two .xdxf files, the Oxford English Dictionary of only one). Icons, images, and other non-textual content should be part of the package, correctly arranged in folders and indicated in the index. An index of images, for example, would indicate their relative location (by default /images/), their file name and the words/phrases under which they should appear. Textual information that is not part of the dictionary per se but that is traditionally part of dictionaries can also be included in pre-defined XML-formatted files. More on that below.
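To make the package idea more concrete, here is a minimal Python sketch of a zipped archive with an XML index that lists its .xdxf files. The file name index.xml and the <package>, <dictionary> and <image> elements are assumptions made up for this illustration; none of them are part of the current XDXF standard.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical index layout: element and attribute names are illustrative only.
INDEX_XML = """<?xml version="1.0" encoding="UTF-8"?>
<package>
  <dictionary href="en-fr.xdxf"/>
  <dictionary href="fr-en.xdxf"/>
  <image href="images/cat.png" key="cat"/>
</package>
"""


def build_package(files):
    """Zip a mapping of {path: bytes} into an in-memory archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for path, data in files.items():
            zf.writestr(path, data)
    return buf.getvalue()


def list_dictionaries(package_bytes):
    """Read index.xml from the archive and return the .xdxf files it names."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        index = ET.fromstring(zf.read("index.xml"))
    return [d.get("href") for d in index.findall("dictionary")]


package = build_package({
    "index.xml": INDEX_XML.encode("utf-8"),
    "en-fr.xdxf": b"<xdxf/>",      # placeholder content
    "fr-en.xdxf": b"<xdxf/>",      # placeholder content
    "images/cat.png": b"",         # placeholder content
})
print(list_dictionaries(package))
```

A DS would only need to open the index to know which dictionaries and resources the package contains, without scanning folders or relying on fixed file names.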

In fact, the current dict.xdxf XML file in a folder with a more or less defined name and optional toolbar icons (for a simple dictionary) is an overly complicated, impractical structure that is not easy to implement. (Imagine any other common file type in a similar structure: MP3, PDF, DOC files all with the same name, each in a folder with toolbar icons... who would want to use them?) In fact, notice how all DS that support XDXF will already gladly accept a simple .xdxf file regardless of its name, as is much more intuitive. A clear name identifying the dictionary and its edition/version should be recommended for practical reasons, but is in effect unnecessary as the information is already in the metadata.

This format would also allow for including information which is commonly included with dictionaries (both paper and electronic). One important example is conjugation/declension tables; while these aren't part of key-phrase definitions and shouldn't be part of the XDXF file itself, they are commonly included as part of dictionaries and should be represented in the XDXF package format. Conjugations should be included in an independent XML conjugations standard file (to be developed) and referenced in the index file. The DS can then appropriately place a button/link on the entries for which there is a conjugation table, which will display the properly formatted information. See the XML conjugation format for the French conjugation software Verbiste as an example of such a file.

The XML conjugations/declensions can also be used by the DS to recognize, for example, conjugated forms of verbs and display the correct entry (even indicating what form it is).
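A rough Python sketch of that lookup follows. The <conjugations>, <verb> and <form> elements and their attributes are invented for this example; they are not Verbiste's actual format and not part of XDXF.

```python
import xml.etree.ElementTree as ET

# Hypothetical conjugation file: element and attribute names are assumptions.
CONJUGATIONS_XML = """
<conjugations>
  <verb lemma="aller">
    <form tense="present" person="1sg">vais</form>
    <form tense="present" person="3sg">va</form>
    <form tense="future" person="1sg">irai</form>
  </verb>
</conjugations>
"""


def build_form_index(xml_text):
    """Map each inflected form to the (lemma, tense, person) entries it can stand for."""
    index = {}
    for verb in ET.fromstring(xml_text).iter("verb"):
        for form in verb.iter("form"):
            entry = (verb.get("lemma"), form.get("tense"), form.get("person"))
            index.setdefault(form.text, []).append(entry)
    return index


index = build_form_index(CONJUGATIONS_XML)
# A search for "irai" can now be redirected to the entry for "aller".
print(index.get("irai"))
```

The list value allows one surface form to point to several lemmas or several tenses, which happens often enough in real conjugation tables.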

Some "sub-dictionaries" should also be in their own XML file. For example, some dictionaries include a "name's dictionary" as an annex. The user should be allowed to enable or disable these kinds of "sub-dictionaries" in the DS.

Icons can only be recommended, not required, as they are in no way part of the dictionary. In fact no DS requires icons and few would make any use of them. The icon reference in the current revision seems tailor-made for GoldenDict, and the specifications for a standard shouldn't be intended for any specific DS. Icons should be supported, though, in the dictionary package and for the DS that do make use of them. They should be in the appropriate folder and need to be better defined: what format(s) can be used? Which sizes can/must be present? Etc. The icon metadata should also be present in the main index file (possibly unnecessary if defaults are used).

A beneficial side effect of the zipped package format is the enormous size reduction. As XML formats require a constant repetition of opening and closing tags, files are inflated significantly, an inflation that is greatly reduced in a zipped archive. A significant example: the CC-CEDICT dictionary, with 114,959 entries takes 8.4 MB in its original minimalist format; when converted to XDXF it takes 31.4 MB, an almost 4-fold increase in size! A zipped CC-CEDICT file takes 3.3 MB, and the zipped XDXF-converted file only 4.3 MB, a minimal increase over the original file size. In fact, DS should be recommended to import zipped flat XDXF files directly, even when not part of an XDXF package.
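The effect is easy to reproduce on synthetic data. The Python sketch below builds made-up entries (not CC-CEDICT data) in a plain line format and in an XDXF-like markup, then compares raw and zlib-compressed sizes; the exact numbers will differ from the real dictionary.

```python
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by raw size (smaller means better compression)."""
    return len(zlib.compress(data, 9)) / len(data)


# Synthetic entries, for illustration only.
plain_lines = [f"word{i} definition {i}" for i in range(1000)]
xml_lines = [
    f"<ar><k>word{i}</k><def><deftext>definition {i}</deftext></def></ar>"
    for i in range(1000)
]

plain_bytes = "\n".join(plain_lines).encode("utf-8")
xml_bytes = "\n".join(xml_lines).encode("utf-8")

print("raw sizes:       ", len(plain_bytes), len(xml_bytes))
print("compressed sizes:", len(zlib.compress(plain_bytes, 9)),
      len(zlib.compress(xml_bytes, 9)))
print("ratios:          ", compression_ratio(plain_bytes),
      compression_ratio(xml_bytes))
```

Because the repeated opening and closing tags are exactly the kind of redundancy DEFLATE removes, most of the markup overhead disappears once the file is zipped.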

III. XML Structure

1. Root Element

See format argument above.

2. <meta_info>

All elements should be clearly described.

  • <title>/<full_title>: What exactly is the difference between <title> and <full_title>? How long can the <title> be, exactly? Several examples here would be very helpful.

    • More importantly, why does the <title> need to be "written in English"? It makes absolutely no sense to me why a Chinese, or a Russian, or an Arabic monolingual dictionary's title would have to be written in English, a language that hypothetically its users can't even understand. Even for bilingual dictionaries in which none of the languages is English the requirement doesn't make much sense.
  • <description>: To include the amount of information required, this field will certainly span multiple lines. How should a line break be indicated? With a <br /> tag, as in entry definitions? If so, that should be stated so that DS can support it.

  • "<last_edited_date>, <dict_edition>, <publishing_date>, <dict_src_url> are optional meta info.": None of these elements were defined. <publishing_date> refers to the publishing of what? The original dictionary? The file the XDXF dictionary was converted from? This very XDXF file? The same could be asked about <last_edited_date> and <dict_edition>.

  • Some necessary additions to the lexicon elements mentioned below will require additional metadata. See below for more details.

3. Lexicon

  • <k>: This element must support defining different scripts/writing systems for the same "key phrase". This is different from different spellings, in that the user should be allowed to choose in the DS settings which script they prefer, and the DS should display only the chosen script rather than show every key phrase repeated as many times as there are scripts. The DS may (possibly should) display the other script(s) with the definition text, as it does with transcription, etymology, etc. If the user searches for a word in a script other than the one they chose, the entry should be displayed with the word in the chosen script as headword. An obvious example is Chinese: it is common to have all entries in both main variants, simplified and traditional Chinese, which, with the current format, means all entries are doubled in the DS, and a user will have to sift through the simplified entries even if they only read traditional Chinese (and vice versa). The same is true for any other language that can be written in more than one script/writing system (either because different areas speaking the same language use different scripts or because it used to be written in a different script and the dictionary includes both variants).

    I suggest the different scripts be noted with a system attribute, as in <k system="simplified">词典</k> and <k system="traditional">詞典</k>.

    As far as I know there isn't an ISO list for writing systems, only for scripts, which doesn't work in this case as some languages use more than one script in one word (i.e. Japanese), and different variants may be counted as only one script. So it seems the writing systems used in the dictionary will have to be defined in the <meta_info> element, in a similar way as abbreviations are defined. More on writing systems (what exactly constitutes a script or a writing system is debatable).

  • <def>/<deftext>: The usage of the elements <def> and <deftext> is confusing. It seems (from the examples) that a general <def> is always needed as a container for the <def> elements that actually contain definitions. The first time I read the format description I thought <def> would be used for the more general meaning and <deftext> for the more detailed one. For example, this definition from the OED:

    marry, v

    1. To join in wedlock or matrimony (...)

    a. in pass. (with ref. either to the act and ceremony, or to the wedded state as a result).

    b. Said of the priest or other functionary who performs the rite. Also absol.

    2. a. To give in marriage, cause to be married. Said esp. of a parent or guardian.

    b. With off.

    Would be rendered:

    <def>To join in wedlock or matrimony (...)
    <deftext>in pass. (with ref. either to the act and ceremony, or to the wedded state as a result).</deftext>
    <deftext>Said of the priest or other functionary who performs the rite. Also absol.</deftext></def>
    <def><deftext>To give in marriage, cause to be married. Said esp. of a parent or guardian.</deftext>
    <deftext>With off</deftext></def>
    

    But, of course, this wouldn't work when there are three levels of definitions, as in (still from the OED):

    marry, v

    I. trans.

    1. To join in wedlock or matrimony (...)

    a. in pass. (with ref. either to the act and ceremony, or to the wedded state as a result).

    2. a. To give in marriage, cause to be married. Said esp. of a parent or guardian.

    II. 6. intr. a. To enter into the conjugal or matrimonial state(...)

    As I now understand it, the idea is to nest <def> elements. This would work for any number of levels of definitions but still requires the doubled elements <def><deftext> (except when there are examples). It seems the idea is to include elements like <gr> and <tr> inside the <def> element, but this kind of information generally belongs to the "key phrase", not to any specific definition, and as such should be directly inside the <ar> element. Otherwise the pronunciation would have to be repeated in each <def> element, even though it is not likely to change (except in the rare cases when it differs between definitions). It is possible I'm still not quite understanding the logic of these two elements, but that is also why they need to be more clearly described and examples need to be provided, especially complex examples with several levels of definitions. If a parent <def> tag is always needed to contain the <def> elements that in turn include the <deftext> elements which in turn include the actual definitions, then the first and subsequent <def> elements have very different purposes. Changing the first <def> to <definition>, <def_container> or another similar label could make this structure much clearer.

  • <tr>/<gr> I couldn't get the file to validate with a <tr> element directly inside <def>, so I have to follow the example provided and put each <tr> inside a <gr> element. This makes absolutely no sense, how can a transcription be considered grammar?

  • <gr>: The element seems to be meant for free text but, XDXF being a semantic format, it needs to allow you to define common grammatical properties semantically. Common examples are grammatical gender and number and parts of speech for European languages and, for Asian languages like Chinese and Japanese, measure words (or classifiers, as they are also known). You should be able to define measure words as such: <gr><mw>份</mw><mw>顿</mw></gr>, letting the DS handle how to display them and allowing for indexing. Similarly, for European languages you should be able to define gender and number, as such: <gr><gender>mas</gender><num>sing</num></gr>. The possible options should be predefined. There still needs to be a free text element such as <grtext> for properties not yet defined and other grammatical comments. Of course this could be an enormous undertaking, but note that only properties normally mentioned in dictionaries need to be defined; for example, verb tense doesn't have to be defined, as dictionaries usually don't indicate it and only verbs in the infinitive are listed (for the languages I know; if this is not true for some language it still needs to be included). Also, not all possible properties for all languages in the world need to be included at once (which wouldn't even be possible), only the most common at first, and more can be added in future revisions as needed.

    Attributes may be needed to indicate types of grammatical categories; these should be predefined and indicated by the 3-letter language codes from the ISO 639-3 standard, as they should be language-dependent. If there are non-language-specific categories, there is no need for the language code. For example, Japanese adjectives can be divided into -i adjectives, -ii adjectives, -na adjectives, -no adjectives, attributives, -taru adjectives, and nouns or verbs acting prenominally. (These are the categories as defined in JMDICT; what exactly constitutes an adjective is debatable, as are its categories.) A part of speech entry for the adjective 暑い, hot, could be as such: <part type="jpn-i">adj</part>. I'm not sure if these attributes should be defined in the <meta_info> element or generally in the XDXF format. In any case, it's not for the XDXF project to decide which categories are valid (or to make any other classification judgment of any kind); any dictionary must be able to set its categories freely, but the options may be predefined.

  • <tr>: "Marks transcription/pronunciation information" -- these can be very different things in dictionaries. The description also leaves out transliteration, which is essential to non-alphabetic languages, and which I assume is meant to be included in this element. Transliteration (generally, but not limited to, romanization) is particularly important because for entries to be easily searched their transliteration(s) need to be indexed (for ideographic languages like Chinese that's the only way to search for a word if you know the sound but not which characters are used to write it). My proposal is that instead of one, there should be three elements:

    • Transcription (possibly <tr>), which by default should be IPA but should allow for an attribute defining the transcription system, e.g. <tr system="SAMPA">"s{mp@</tr>. (The current "mode" attribute isn't very clear.) The valid systems need to be defined in the standard so that DS know exactly what they are and can do things like converting SAMPA (meant to be read by computers) to IPA (meant to be read by people). This will allow the DS to know what transcription system is being used and display it to the user. It is not likely that more than one system will be used at a time, but in case it is, it will also be supported.

    • Pronunciation (possibly <pr>): There are also different ways to indicate pronunciation without phonetic transcription. Pronunciation respelling is very common in English monolingual dictionaries; pronunciation respelling for just a syllable, or even a letter when more than one pronunciation is possible, can also be seen in dictionaries of different languages. I have seen it in Portuguese monolingual dictionaries, usually enclosed in slashes, and it may be used for other languages also. As it's highly unlikely both will be used at the same time, there is no need for attributes, e.g. <pr>paɪəˈnɪə(r)</pr> or <pr>/nɪə/</pr>. Possibly partial respelling should be indicated with an attribute and not by enclosing it in slashes. There may be other ways to indicate pronunciation in languages I don't know, in which case an attribute may be needed.

    • Transliteration (possibly <tl>): generally but not necessarily romanization. Needs an attribute to indicate the transliteration system used. As far as I know there is no ISO list of transliterations so the names may have to be defined in the <meta_info> element. The transliteration(s) in the most common system(s) should be indexed so that "key phrases" may be searched; which transliterations should be indexed must be indicated in the <meta_info> element. For example for the word 中国, China, the transliterations can be indicated as <tl system="pinyin">Zhōngguó</tl><tl system="Bopomofo">ㄓㄨㄥ ㄍㄨㄛˊ</tl><tl system="Gwoyeu Romatzyh">Jong'gwo</tl><tl system="Wade–Giles">Chung1-kuo2</tl>. But as only pinyin (and bopomofo in Taiwan) are commonly used to indicate pronunciation and to input characters, only one should be indexed (or two, if to be used in Taiwan). Common simplifications for typing the transliteration in ANSI characters should be handled by the DS, but should not be indicated in the XDXF file. For example "zhong1guo2" should be recognized by DS as "zhōngguó", and a search for "zhong" should show results for all possibilities (zhōng, zhóng, zhǒng, zhòng, zhong). When a transliteration is indexed and a user searches using the transliteration the results should show not only the key phrases but the transliteration also, either next to the key phrases or as a tooltip--as very different words can have the same transliteration.

    One more example of transliterations: for the Japanese word ローマ字 the transliterations can be defined as <tl system="revised-Hepburn">rōmaji</tl><tl system="kunrei-shiki">rômazi</tl><tl system="Nihon-shiki">rômazi</tl>. But only Hepburn is used nowadays and it is the only system that should be indicated to be indexed. The DS should recognize "roumaji" as both rōmaji and roumaji, but the simplified ANSI form should not be indicated in the XDXF file.

    There may be more than one transcription, pronunciation or transliteration, as there may be more than one pronunciation for a word. These elements may be repeated as many times as necessary. There should also be a comment attribute to indicate if a pronunciation is rare, archaic, regional, etc.

    These three elements may possibly be organized inside a parent element.

  • <deftext>/writing systems: For dictionaries that have several writing systems defined you should be able to indicate alternatives in the definition text when a word/phrase in the original language is used. Only the selected writing system should be displayed. Also there needs to be a way to indicate transliteration in one of the systems defined, to be indicated by the DS as appropriate (e.g. simply next to the words, or with some specific formatting, or over the word, or as a tooltip, etc.). An example of what this would look like for the Chinese word 乎: <deftext>classical particle similar to <tl sys="pinyin" text="yú"><altsys sys="traditional">於</altsys><altsys sys="simplified">于</altsys></tl> in</deftext>.

  • <rref>: Should only be used in dictionary packages. It doesn't need as much information, as this should be defined in an index file. Something like <k>key phrase<audio type="transcription" /></k> may be enough. It is preferable not to indicate any external file in the flat XML file. What media types, formats and sizes are allowed should be defined in the XDXF (package) standard. <rref> should be directly inside <ar> or the "container-<def>", unless it refers to a specific sense of the word; I don't understand how it could ever be inside <gr>.

  • <c>: I don't think it is needed for a semantic format.

  • <ex>: Missing one of the most common types of examples: quotations. I don't quite understand why examples should be indexed.

  • <co> : Not sure why comments should be indexed. Grammatical comments belong in the <gr> element; comments on etymology belong in the <etm> element. Types need to be defined in the standard so DS know how to handle them.

  • <sr> : Might be better to allow semantic relations to be defined in a separate XML file in dictionary packages. The element should still be available for flat XML files.

  • <etm>: There needs to be a way to define the genealogical relationship of words. This is a proposal using nesting and 3-letter language codes with the example "fetish", as per the OED: <etm><orig><k lang="fra">fétiche</k><orig><k lang="por">feitiço</k><deftext>charm, sorcery (from which the earliest Eng. forms are directly adopted)</deftext></orig></orig></etm>. Which could be displayed by the DS as Etymology: From Portuguese "feitiço": charm, sorcery (from which the earliest Eng. forms are directly adopted); via French "fétiche". There may be better ways of defining this relationship, but this example is enough to show what should be possible. There still needs to be a free text element for any other etymological comments.
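As a rough illustration of how a DS might turn the nested <orig> proposal into display text, here is a small Python sketch. It uses a well-formed version of the "fetish" example with a shortened gloss; the <orig> element, the lang attribute on <k> and the language-name table are all assumptions taken from the proposal above, not from the current standard.

```python
import xml.etree.ElementTree as ET

# Tiny lookup table for the example; a real DS would use a full ISO 639-3 list.
LANGUAGE_NAMES = {"fra": "French", "por": "Portuguese"}

# Proposed (hypothetical) markup: the innermost <orig> is the oldest source.
ETM_XML = (
    '<etm><orig><k lang="fra">fétiche</k>'
    '<orig><k lang="por">feitiço</k>'
    '<deftext>charm, sorcery</deftext></orig></orig></etm>'
)


def origin_chain(etm):
    """Walk nested <orig> elements, returning (lang, word, gloss) from newest to oldest."""
    chain = []
    node = etm.find("orig")
    while node is not None:
        k = node.find("k")
        chain.append((k.get("lang"), k.text, node.findtext("deftext")))
        node = node.find("orig")
    return chain


def render(etm_xml):
    """Format the chain oldest-first: 'From X ...; via Y ...'."""
    parts = []
    for i, (lang, word, gloss) in enumerate(reversed(origin_chain(ET.fromstring(etm_xml)))):
        name = LANGUAGE_NAMES.get(lang, lang)
        text = f'{"From" if i == 0 else "via"} {name} "{word}"'
        if gloss:
            text += f": {gloss}"
        parts.append(text)
    return "Etymology: " + "; ".join(parts)


print(render(ETM_XML))
```

The point of the sketch is only that nesting carries the order of derivation, so the DS can decide on wording and layout itself.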

IV. Other Comments on XDXF

About transliterations and writing systems: I don't think there is an ISO (or ISO-like) list of these systems; it would however be extremely useful to have an official list of allowed systems. This would make it clearer and easier for DS to handle them. A solution would be an official XDXF list for each of the two, with an official code for each system. It could be built by starting from the official ISO transliterations (that is, one per language) and the Unicode scripts (not the same as writing systems) and then adding to them as appropriate.
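Once the DS knows which transliteration system an indexed value uses, the search behaviour described earlier for pinyin (typing "zhong1guo2" or plain "zhong") becomes a matter of folding both the index and the query. The Python sketch below is one simple way to do that, under the assumption that ignoring tones entirely is acceptable; a real DS would probably also want tone-aware matching.

```python
import unicodedata


def fold(text: str) -> str:
    """Lowercase, strip combining tone marks and tone digits: 'Zhōngguó' -> 'zhongguo'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(
        ch for ch in decomposed
        if not unicodedata.combining(ch) and not ch.isdigit()
    )


def build_index(pairs):
    """Map folded transliterations to the key phrases that carry them."""
    index = {}
    for headword, transliteration in pairs:
        index.setdefault(fold(transliteration), []).append(headword)
    return index


def lookup(index, query):
    """Return all key phrases whose folded transliteration equals the folded query."""
    return index.get(fold(query), [])


# A few sample (key phrase, pinyin) pairs for illustration.
index = build_index([("中国", "Zhōngguó"), ("种", "zhǒng"), ("重", "zhòng")])
print(lookup(index, "zhong1guo2"))
print(lookup(index, "zhong"))
```

Folding happens in the DS only, so the XDXF file keeps the single, properly accented transliteration.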

Information Pages: To allow for the XDXF format to include all information that is traditionally part of a dictionary, I believe it's necessary to include a new element under the root element, something I would call "information pages", to allow for including things like introductions, prefaces, bibliographies, abbreviations, etc. All things that are normally part of a dictionary but aren't allowed in the XDXF standard yet. This element should allow for including the same style tags as the textual definitions for key phrases plus <h#> and <p>. The number of information pages should be very limited, this is not an ebook format.

XDXF Project

Some improvements need to happen with the XDXF project itself:

  • There should be a detailed change log for what changes with each revision and all previous revisions should be made available. This will greatly help DS developers when they want to update their DS implementation of XDXF and, as an open format, an archive of past revisions should be available.

  • The XDXF standard definition should be moved to its own repository instead of being hosted on a folder on the makedict repository. The current situation makes it look as if XDXF is an internal format to be used in a specific tool and not an XML dictionary standard in its own right. (The first time I found it when looking for a good XML format for dictionaries that's what I thought, and I'm likely not the only one.)

  • Still related to the point above: XDXF needs its own website where people can find information about the standard, the standard description, the DTD description and good example dictionaries without going into GitHub folders. GitHub is fine for developers wanting to implement XDXF but not for dictionary users trying to understand what this XDXF thing is. GitHub Pages would be a simple and effective solution.

  • The DTD file needs to include all available languages, otherwise dictionaries for languages not included will not validate.

  • It would be great if the DTD file were commented in detail. A good example of this would be the JMDICT DTD file.

Related issues: #28, #6, #5

@soshial
Owner

soshial commented Jan 21, 2017

I would love to improve the format. I have several new ideas myself. If you would like to discuss all those privately (to avoid clutter here), PM me in Telegram (WhatsApp is less preferred). But let's discuss each step at a time, okay?

@k-sl
Author

k-sl commented Jan 22, 2017

I agree we should discuss each step at a time, I just didn't want to spam the repository opening a separate issue for each item. I don't use Telegram or WhatsApp, so if you want to discuss these issues privately we'll have to do it old school-style, by email. We can also do it here talking over one item at a time, open to anyone who might want to join in.

@soshial
Owner

soshial commented Feb 6, 2017

Let's separate each issue category into its own opened issue — I will open all corresponding issues. Let's start with organizational issues.

I will be able to start working on this in around 2 weeks.
What is your email, btw?

@soshial soshial self-assigned this Feb 6, 2017
@k-sl
Author

k-sl commented Feb 6, 2017

Great, that sounds like a plan.
Sorry, I thought you could see my email address on the github emails. Feel free to contact me: aaa2b9ed at opayq.com .

Since we're at it, let me point out two issues I forgot in the "thesis" above:

  • Related to the <etm> element: transcription/transliteration must also be allowed here, when etymology is given in a script different than that of the translation language. It might be a good idea to just allow for defining transcriptions on all text elements. E.g. the etymological information for the word "democracy" on the OED (ignoring hierarchies, abbreviations, just to show the need for transcription, even in dictionaries for European languages):

    a. F. démocratie (siː), (Oresme 14th c.), a. med.L. dēmocratia (in 13th c. L. transl. of Aristotle, attrib. to William of Moerbeke), a. Gr. <tl sys="classical-greek" text="dēmokratía">δημοκρατία</tl> popular government, f. <tl sys="classical-greek" text="dêmos">δῆμος</tl> the commons, the people + -<tl sys="classical-greek" text="kratia">κρατια</tl> in comb. = <tl sys="classical-greek" text="krátos">κράτος</tl> rule, sway, authority. The latinized form is frequent in early writers, and democratie, -craty, in 16–17th c.

  • There needs to be a way to list different (historical) spellings, possibly a <spelling> element, or <hist-spelling> to make clear this is a historical spelling, not a valid modern spelling that can be included in <k>. These spellings are not to be indexed. There needs to be a way to indicate if this is the spelling for the whole word or just part of it (as with pronunciation), ideally an attribute. A way to indicate grouping may also be needed, as the OED does using Greek letters. I am not aware of any dictionary doing this but it might also be useful to be able to indicate dating and a free text element for any textual explanation related to the historical spellings. Two examples of the historical spelling information from the OED to show what needs to be possible in this format:

    world: Forms: α. 1 weorold, wuruld, worold, uoruld, wiarald, 1–3 weoruld, woruld, -eld, -old, 2 wurold, 3 we(o)reld, wæruld, Orm. we(o)relld. β. 1– world; 1–3 weorld, 4–6 worlde (2 worlð, 3 wurld, 5 whorlld(e); 2–3 werlð, 3 Orm. werrld, 3–5 werld(e; north. and Sc. 3– warld, 5–6 warlde, varld, (5 warlede). γ. 4–6 wordle, 5 wordel, wordil; north. and Sc. 5–7 wardle, 6 wardill, vardil, wardel, vardel; 3 werdle. δ. 3–6 word, 4–5 worde (6 woaude); 3–5 werd, 4–5 werde; 4 wird; north. 4, 6 ward. ε. 3 worl, 3–5 worle, 5 worlle, orlle, 6 worell; 8 worl', north. and Sc. 5 warle, 8 warl', 9 warl.

    democracy: Forms: 6–7 democracie, 6–7 (9) -cratie, 7 (9) -craty, 7– -cracy.

Sorry for adding more things here.

@soshial
Owner

soshial commented Mar 24, 2017

Hi, @k-sl, I have moved to another country, changed a job, so I had a lot of stuff to do. I think I will have time in about a month to start working on new XDXF. So let's keep in touch. Sorry for such a long time to wait, but we will do it. I am counting on your advice!

@k-sl
Author

k-sl commented May 16, 2017

I'm leaving this message here so you don't think I disappeared. I didn't before because I was trying to keep this from becoming an IM chat log. You don't have to apologise for anything, you have a job and so do I, whenever you have the time it would be great to work on this and, as long as I can, I will help.

@soshial
Owner

soshial commented Oct 14, 2017

I. Format

I agree that visual format brings confusion and creates the illusion that any format can be converted to XDXF. We should promote the main idea of convertibility from XDXF to many formats: store the dictionary in XDXF and then easily convert it to any other format that is needed.

II. File structure

I am surprised that there is nothing about compressing the dictionary file. It should not be obligatory, but should be encouraged. The disadvantage of compression is that the dictionary file has to be unpacked before the DS can use it, and that takes time. For this reason we might use dictzip, which allows random access to word articles. But we need to check that putting several files into dictzip will work.

We also should provide an easy way to download a dictionary with/without all media files. As it was said:

In short, the main content (dictionary itself) can/should be compressed with dictzip, the media resources (images, audio, video) can/should be compressed with regular zip (but one need to be careful about file names encoding in such a zip file).

If xdxf file is put into an archive, of course the archive file can be named more liberally than it is prescribed now.
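A minimal Python sketch of that split, under stated assumptions: dictzip output is a valid gzip stream, so plain gzip stands in for it here (without the random-access chunk table), and the function names and file paths are made up for the example.

```python
import gzip
import zipfile


def pack_dictionary(xdxf_bytes, media, dict_path, media_path):
    """Write the dictionary as gzip and the media files as a regular zip.

    `media` maps archive member names to their raw bytes.
    """
    # Plain gzip stands in for dictzip: same container, no random access.
    with gzip.open(dict_path, "wb") as fh:
        fh.write(xdxf_bytes)
    with zipfile.ZipFile(media_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in media.items():
            # Python's zipfile stores non-ASCII member names as UTF-8 and
            # sets the corresponding flag, which sidesteps the file name
            # encoding problem for readers that honour that flag.
            zf.writestr(name, data)


def read_dictionary(dict_path):
    """Read the whole dictionary back; also works on real dictzip files."""
    with gzip.open(dict_path, "rb") as fh:
        return fh.read()
```

Reading sequentially like this is what most DS would do when importing; true random access would need a reader that understands dictzip's chunk table.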

@soshial
Owner

soshial commented Oct 14, 2017

Internationalization

Speaking of transliteration/writing system/regionality, we can use the xml:lang attribute built into XML, as recommended here. They prescribe using the BCP 47 standard, which includes these examples:

  • mn-Cyrl-MN: Mongolian written in the Cyrillic script, as used in Mongolia
  • zh-Latn-wadegile: Chinese written in the Latin alphabet, according to the transliteration system developed by Thomas Wade and Herbert Giles
  • zh-Latn-pinyin: Chinese in pinyin
  • ja-Latn-hepburn: Japanese written in the Latin alphabet using the transliteration system of James Curtis Hepburn

Does this standard cover your case? It looks quite promising to me.
We might also tag each <def> or <k> with corresponding xml:lang, since we might also have multilingual dictionaries (e.g. English-Polish-Lithuanian-Latvian dictionary).

@k-sl
Author

k-sl commented Oct 16, 2017

I.

We fully agree.

II.

I'm a professional translator but a complete amateur when it comes to the technical side, so all I can give is my amateurish opinion. Here are two reasons why dictzip might not be the best choice:

  1. Unlike XML, .zip, etc., dictzip is a much more obscure format that most DS creators might not be familiar with or willing to look into. As far as I can see it was used mostly for dict, which is itself quite obsolete, and that might mean dictzip isn't so easy to support on some platforms. This could impact the adoption of XDXF, which at the moment is already quite low.
  2. At the moment most DS will import the XDXF file into their own internal format or database, which makes random access useless, as the file is only read once. Of course it would be a good thing to see XDXF-specific DS that read XDXF directly as a native format, but that is not really a problem right now. Maybe this is a problem for a later time?

What I would like to see is a epub-like format: simple, clear, transparent. Of course most html files in an epub are a couple hundred kb and not dozens of MBs, which is why I understand your point. But I just don't think this is a huge problem at the moment.

With the kind of format I'm suggesting we can also have a modular approach. All archives can have a manifest file indicating what the archive consists of and what dictionary it belongs to (archives which are not an actual dictionary can have a different extension, like .xdxe, for "xdxf extension").
Something like this.

<type>extension</type>
<dictionary>Oxford English Dictionary</dictionary>
<filename>oed.xdxf</filename>
<contents>img;sound</contents>
<index>
    <img>img.xml</img>
    <sound>sound.xml</sound>
</index>

The index files will have a list of the files, their type, an optional description, and the headword under which they should appear (preferably by ID).
The actual dictionary manifest file would have <type>dictionary</type>, and the contents can indicate only a dictionary if the archive has no media, or a dictionary plus any media that is included in the same archive. People making the archive can decide what to include or not with the dictionary itself.
That xml example is a mock-up, I'm not suggesting that should be the final format.
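For what it's worth, a DS could read a manifest like this with very little code. The Python sketch below wraps the mock-up in a `<manifest>` root element, which is an assumption added only so that the fragment parses as a single XML document; `read_manifest` is likewise a made-up name.

```python
import xml.etree.ElementTree as ET

# The mock-up from above, wrapped in a hypothetical <manifest> root.
MANIFEST = """<manifest>
  <type>extension</type>
  <dictionary>Oxford English Dictionary</dictionary>
  <filename>oed.xdxf</filename>
  <contents>img;sound</contents>
  <index>
    <img>img.xml</img>
    <sound>sound.xml</sound>
  </index>
</manifest>"""


def read_manifest(xml_text):
    """Return the manifest fields as a plain dictionary."""
    root = ET.fromstring(xml_text)
    index = root.find("index")
    contents = root.findtext("contents", default="")
    return {
        "type": root.findtext("type"),
        "dictionary": root.findtext("dictionary"),
        "filename": root.findtext("filename"),
        # "img;sound" becomes ["img", "sound"]; empty contents become [].
        "contents": [c for c in contents.split(";") if c],
        # Maps each media kind to the index file that lists its files.
        "index": {child.tag: child.text for child in index} if index is not None else {},
    }


if __name__ == "__main__":
    print(read_manifest(MANIFEST))
```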

Internationalization

That is a great find! It can be used for the transliterations, alternative spellings and pronunciations. The languages in lang_from="XXX" and lang_to="XXX" should also use the same format, for consistency. So en instead of eng and zh instead of chi.

I'm not seeing some of the Japanese scripts I think should exist. It has kana, but not katakana and hiragana separately. There is no Kunrei-shiki or Nihon-shiki (unless they are under very strange names); these aren't commonly used in dictionaries but they exist, Kunrei-shiki being the official government romanization. Modified Hepburn, which is the most common romanization in dictionaries, shows as "Hepburn romanization, Library of Congress method", which is a very US-centric naming.

However, even if not 100% of systems are available (which would be impossible), they seem to have a working proposal submission system, so more can be added. And, as a last resort, the standard supports private-use tags, which we could use as an exception, if needed.

The question is whether to use the xml:lang attribute or to use more readable attributes while still using BCP 47 standard tags. For example:

<tl system="pinyin">Zhōngguó</tl>
<tl system="Bopo">ㄓㄨㄥ ㄍㄨㄛˊ</tl>
<tl system="wadegile">Chung1-kuo2</tl>

Is much more clear and human-readable when the language has already been defined as Chinese. However:

<tl xml:lang="zh-Latn-pinyin">Zhōngguó</tl>
<tl xml:lang="zh-Bopo">ㄓㄨㄥ ㄍㄨㄛˊ</tl>
<tl xml:lang="zh-Latn-wadegile">Chung1-kuo2</tl>

Is more canonical and makes for an easier DTD but is harder to read by humans and is less clear as it defines more than the system. The same is valid for the other sections where xml:lang is useful. The article you linked to discusses this issue. I'm putting this question forward but I don't think it is a huge issue. I like it clear but I wouldn't oppose either way.
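As a rough illustration that the canonical form costs a DS very little, here is a Python sketch that reads xml:lang and recovers the short system name from the last subtag. The `<tl>` element is only my proposal from this thread, wrapping it in `<ar>` is just for the example, and taking the last subtag as the system name is a simplification.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

ARTICLE = """<ar>
  <tl xml:lang="zh-Latn-pinyin">Zhōngguó</tl>
  <tl xml:lang="zh-Bopo">ㄓㄨㄥ ㄍㄨㄛˊ</tl>
  <tl xml:lang="zh-Latn-wadegile">Chung1-kuo2</tl>
</ar>"""


def transliterations(xml_text):
    """Map the last subtag of each xml:lang value to the transliteration text."""
    root = ET.fromstring(xml_text)
    return {tl.get(XML_LANG).split("-")[-1]: tl.text for tl in root.iter("tl")}


if __name__ == "__main__":
    for system, text in transliterations(ARTICLE).items():
        print(system, text)
```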

@nikita-moor
Contributor

II. File Structure

Conjugations should be included in an independent XML conjugations standard file (to be developed) and referenced in the index file.

Why not create an independent dictionary with conjugation tables?

III. XML Structure
3. Lexicon

An obvious example is Chinese: it is common to have all entries in both main variants, simplified and traditional Chinese, which, with the current format, means all entries are doubled in the DS

<k system="simplified">词典</k>
<k system="traditional">詞典</k>

You could produce two separate variants of the dictionary, one for Simplified script and another for Traditional. It's straightforward and does not require special handling by the dictionary shell.

@k-sl
Author

k-sl commented Mar 9, 2019

Why not create an independent dictionary with conjugation tables?

I'm sorry I didn't understand what you mean by "independent dictionary with conjugation tables".

You could produce two separate variants of the dictionary, one for Simplified script and another for Traditional. It's straightforward and does not require special handling by the dictionary shell.

The dictionary needs to have both simplified and traditional Chinese headwords; you need to be able to look up a word in either of the two standards, regardless of which variant is used for the definitions. You also need to be able to see the characters used in the alternative standard when looking up a word. Your suggestion would mean all entries for which simplified and traditional characters are the same would be repeated and that, when looking up a word, the reader would have no way to know how the word is written in the other standard. Besides, most of what I'm describing already works fine in XDXF: I just add both <k> tags to each article in my Chinese dictionaries and I've been using them like this for years. The problem is there is no way to define which is which, something that should be defined semantically, so the DS can show which is which, hide one if the reader wants to do so, and show the preferred version first, in all Chinese dictionaries.

See, e.g. this example for Cross-strait Dictionary, the dictionary definition is in traditional Chinese, you need to be able to find it through both standards.
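A small Python sketch of what I mean, assuming the `system` attribute on `<k>` proposed earlier in this thread (it is not in the current standard) and two made-up helper functions: both headwords point at the same article, and the reader's preferred standard decides which one is shown first.

```python
import xml.etree.ElementTree as ET

ARTICLE = """<ar>
  <k system="simplified">词典</k>
  <k system="traditional">詞典</k>
  <def>dictionary</def>
</ar>"""


def index_article(ar, index):
    """Register every headword of an article, so either form finds it."""
    for k in ar.iter("k"):
        index.setdefault(k.text, ar)


def headwords(ar, preferred):
    """Return the headwords with the preferred writing system first."""
    # sorted() is stable, so non-preferred forms keep their document order.
    ordered = sorted(ar.iter("k"), key=lambda k: k.get("system") != preferred)
    return [k.text for k in ordered]


if __name__ == "__main__":
    article = ET.fromstring(ARTICLE)
    index = {}
    index_article(article, index)
    print(index["词典"] is index["詞典"])
    print(headwords(article, "traditional"))
```

With two separate dictionaries, the lookup part would still work, but the display part would not, because the DS would have no link between the two forms.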

@nikita-moor
Contributor

nikita-moor commented Mar 12, 2019

I think I am starting to understand your position: you want to add more semantic features to XDXF. However, XDXF is not a semantic storage format for lexical information but the final result. Compared to other existing formats, such as DSL (ABBYY Lingvo) or BGL (Babylon), XDXF separates content and styles, in the manner of HTML+CSS. It defines some level of semantics, but only with the aim of correct rendering.

Many of the features you are interested in could be implemented in the TEI format. It's more flexible but also more complicated. It would be wonderful if GoldenDict added support for the TEI format with automatic XSLT transformation and CSS styles assigned to every dictionary independently. That would be the most powerful way, since dictionary compilers could define any additional elements and control how to show them.

Anyway, it's only my opinion; it would be better to hear from @soshial.

@soshial
Owner

soshial commented May 22, 2019

Hey @k-sl. I have awoken from a long slumber =D and I have finished organizational stuff: removed all converter code, its files.

  1. For changelog, usually people use this section on Github: https://github.com/soshial/xdxf_makedict/releases. Let's stick to that, okay? Then, I will delete this CHANGELOG file. Agreed?
  2. Renaming the repository is maybe not the best move at the moment, because the schema declaration inside XDXF files links to the DTD file. Should we maybe start using revision numbers in the DTD schema URL?
  3. Listing software that supports XDXF is important, I think. Could you help me fill up the list HERE?

@k-sl
Author

k-sl commented May 30, 2019

Hi, @soshial, nice to see you active again. I myself don't currently have much time; much has happened in the meantime. However, I'd like to help as far as I am able to.

  1. The GitHub releases section is really meant for software, so you can have a summary of the changes when you share a new release. I don't think that is the most appropriate way to log changes in this project, which is not a piece of software. I would like any DS developer to be able to just open a plain text file and see every change they need to make to support the most recent revision. The same goes for downloading, sharing, etc., which is made harder if you tie the project to the GitHub releases page. Essentially I want it to be as clear and easy as possible; we want to make implementing/updating XDXF support as simple as possible. I don't think there is a problem with also using the releases page, either for a summary of the changes or for the full list, but I think there should be a plain text file with a clear, detailed and extensive list of changes in the format.
  2. I don't think you need to rename this repository; you can leave it as is so any DS/tool reliant on it will keep working as before. But I would suggest starting a new repository "XDXF" (or similar) to use as the official repository from now on. Again, to make it easier for DS developers to implement it, we want to make it very clear this is a dictionary format that is independent of any tool or software and that the text on the repository is the official and up-to-date standard definition. A folder on the "xdxf_makedict" repository can create confusion. You could leave a note on this repository saying the makedict tool was discontinued and the official repo for the XDXF standard is soshial/xdxf, for example.
  3. I really only know of the ones I see @nikita-moor already mentioned on a separate issue. I also believe GoldenDict for Android doesn't support XDXF; Alpus does. QTranslate is a translator for Windows but claims to support lookup of XDXF dictionaries. I haven't used it, can't confirm.

@soshial
Owner

soshial commented Jun 4, 2019

In answer to the def and deftext criticism, I created examples for you here: #37

@manfred4321

You might want to know about this: there is a full-fledged dictionary exchange format, used mainly by linguistic software from SIL, and probably by many hundreds of linguists to create dictionaries. It's called LIFT, see https://github.com/sillsdev/lift-standard. Unfortunately it has no bridge to the world of dictionary programs like GoldenDict (this is what I really like about XDXF). The description alone is a 38-page document! But maybe it offers some inspiration for the future of XDXF? And I do hope that one day there will be a LIFT-XDXF converter...

@soshial
Owner

soshial commented Jan 19, 2022

I was thinking of removing <opt> tag from the standard, because it's very inflexible.

Instead of current <k><opt>the</opt> United States</k> I was thinking of using sortby attribute like this: <k sortby="United States">the United States</k>. This attribute will ensure that the United States will be close to words unity and united. This will help avoid a situation when thousands of articles that start with the are sorted/accumulated together.

Does anyone agree or disagree?
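Here is a small Python sketch of how a DS might apply the proposed sortby attribute. The `<keys>` wrapper element and the function name are made up for the example, and case-insensitive comparison is an assumption of mine rather than something the specification prescribes.

```python
import xml.etree.ElementTree as ET

# <keys> is only a container for this example, not an XDXF element.
KEYS = """<keys>
  <k>unity</k>
  <k sortby="United States">the United States</k>
  <k>united</k>
  <k>theory</k>
</keys>"""


def sorted_headwords(xml_text):
    """Sort headwords by their sortby attribute, falling back to the text."""
    root = ET.fromstring(xml_text)
    ordered = sorted(
        root.iter("k"),
        # casefold() makes the comparison case-insensitive.
        key=lambda k: k.get("sortby", k.text).casefold(),
    )
    return [k.text for k in ordered]


if __name__ == "__main__":
    print(sorted_headwords(KEYS))
```

With this ordering, "the United States" lands between "united" and "unity" instead of next to "theory".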

@soshial
Owner

soshial commented Jan 19, 2022

By the way, I have updated the specification, taking into account some of your suggestions, @k-sl and @nikita-moor.

The main changes are:

  • full deprecation of visual format
  • support of BCP47 languages, variants and scripts
