From 41443d25fab2988aeabee13b5c10e392336b5899 Mon Sep 17 00:00:00 2001 From: soshial Date: Wed, 19 Jan 2022 09:32:09 +0400 Subject: [PATCH] update to rev34; support BCP47 language variants --- AUTHORS | 2 +- CHANGELOG | 3 - CHANGELOG.md | 19 ++ README.md | 7 +- .../images/clickable_categories.png | Bin format_standard/images/dtrn_tag_tooltip.png | Bin format_standard/xdxf_description.md | 214 ++++-------------- format_standard/xdxf_old_schema_rev33.dtd | 119 ++++++++++ format_standard/xdxf_strict.dtd | 64 ++---- sample-dicts/rev33.xml | 63 ++++++ sample-dicts/rev34.xml | 176 ++++++++++++++ 11 files changed, 438 insertions(+), 229 deletions(-) delete mode 100644 CHANGELOG create mode 100644 CHANGELOG.md mode change 100755 => 100644 format_standard/images/clickable_categories.png mode change 100755 => 100644 format_standard/images/dtrn_tag_tooltip.png create mode 100644 format_standard/xdxf_old_schema_rev33.dtd create mode 100644 sample-dicts/rev33.xml create mode 100644 sample-dicts/rev34.xml diff --git a/AUTHORS b/AUTHORS index 1ff4034..960ecb4 100644 --- a/AUTHORS +++ b/AUTHORS @@ -1,2 +1,2 @@ -XDXF format - Sergei Snegov, Leonid Soshinskiy [https://github.com/soshial] +XDXF format specification - Sergei Snegov, Leonid Soshinskiy [https://github.com/soshial] makedict (Deprecated) - Evgeniy Dushistov , kubtek [https://github.com/kubtek] diff --git a/CHANGELOG b/CHANGELOG deleted file mode 100644 index d39950a..0000000 --- a/CHANGELOG +++ /dev/null @@ -1,3 +0,0 @@ -0.3.1-beta1 -- dsl parser fixes and updates -- mueller parser improvements diff --git a/CHANGELOG.md b/CHANGELOG.md new file mode 100644 index 0000000..0a4e47a --- /dev/null +++ b/CHANGELOG.md @@ -0,0 +1,19 @@ +### Changelog (rev. 34) 2022-01-20 +* since rev. 
34 the format is only semantic and cannot store any presentational or visual data +* the language code limitation is removed: all languages that exist in BCP47 standard are supported (use http://schneegans.de/lv/?tags=hy-Latn-IT-arevela for validation) +* multilingual dictionaries are now supported: a dictionary may have multiple languages, that are translated from and into. It is also allowed to mark `` and `` tags with `xml:lang` +* description supports line breaks +* transcription info can be directly inside `def` tag + +### Changelog (rev. 33) +* `` introduced in order to fix multiple errors in DTD scheme +* `` tag: added `lctn` and `type` attributes, links are not stored inside the tag anymore +* `` tag: `idref` attribute introduced +* `` tag: added necessary hash sign # in attribute +* `` tag now is a list of `` tags +* `` may now have `` inside to mark etymological ancestors/cognates +* `` may now contain `` tag(s) inside +* `` tag for underlined text introduced +* `
` tag introduced for newlines inside articles +* `` now might have `` tag inside +* ``, ``, `` tags now may have user-set attribute values diff --git a/README.md b/README.md index cc65c6a..9235a88 100644 --- a/README.md +++ b/README.md @@ -11,11 +11,12 @@ Moreover, the format has many tags that are specific to dictionaries: etymologie ### Any drawbacks? 1. **XML parsing speed**. For opponents of using XML for storing dictionary and the problem of storing and parsing big XML-files in RAM, XDXF schema and structure of any dictionary allow to store all word articles on disk with help of hash-tables/. Some dictionary software applies this approach quite efficiently (for example, see [GoldenDict](http://goldendict.org/)). -2. **Editor software**. Although, there is no software that allows editing dictionaries at the moment, XDXF is a more or less human-readible XML, that is quite easy to edit manually in a text editor even without prior knowledge of the format specifications. +2. **Editing software**. Although, there is no software that allows editing dictionaries at the moment, XDXF is a more or less human-readable XML, that is quite easy to edit manually in a text editor even without prior knowledge of the format specifications. ### Which dictionary software supports XDXF? -* [Goldendict](https://github.com/goldendict/goldendict) (Win, Linux, MacOS, Android) -* (please send me other examples, that I dont know of) +* [Goldendict](https://github.com/goldendict/goldendict) (Win, Linux, MacOS) +* [Alpus](https://alpusapp.com/) (Win, Linux, MacOS, Android, iOS) +* (please send me other examples, that are not listed here) ## What was `makedict`? In the beginning of the project a converter was written to facilitate conversions to and from XDXF (`dictd/dsl/sdict/stardict/xdxf → dictd/stardict/xdxf`). 
diff --git a/format_standard/images/clickable_categories.png b/format_standard/images/clickable_categories.png old mode 100755 new mode 100644 diff --git a/format_standard/images/dtrn_tag_tooltip.png b/format_standard/images/dtrn_tag_tooltip.png old mode 100755 new mode 100644 diff --git a/format_standard/xdxf_description.md b/format_standard/xdxf_description.md index 5ed5398..080620b 100644 --- a/format_standard/xdxf_description.md +++ b/format_standard/xdxf_description.md @@ -1,4 +1,4 @@ - XDXF standard; Draft 033; 3 December 2015 + XDXF standard; Draft 034; 19 January 2022 ## Introduction XDXF stands for XML Dictionary Exchange Format, and specifies a **semantic** format for storing dictionaries. @@ -7,17 +7,24 @@ The format is **open and free** to use for everyone. Anyone interested in its fu The **main distinction of XDXF** that makes it stand out among all other dictionary formats is that it doesn't contain almost any representational information about how articles should look like. Instead, XDXF stores only structural and semantic information in word articles. The choice of how they have to be rendered is shifted to dictionary-browsing software ("DS"), its settings and user preferences. This might help users to be able to tweak layout, indentation, text colours, hiding examples or synonyms in order to not clutter the view etc. -Moreover, the format has many tags that are specific to dictionaries: etymologies, elaborate semantic relations, grammatical and stylistic sections and also marks, inter-article and intra-article links, categories/classes of words and many other. The format might be also useful not only for common, but also for scientific purposes. Not to mention the prolific amount of dictionary formats in use, XDXF might be a unified dictionary exchange format. 
+Moreover, the format has many tags that are specific to dictionaries: etymologies, elaborate semantic relations, grammatical and stylistic sections and also marks, inter-article and intra-article links, categories/classes of words and many others. The format might be also useful not only for common, but also for scientific purposes. Not to mention the prolific amount of dictionary formats in use, XDXF might be a unified dictionary exchange format. For more information on advantages of the format, consider reading the article "[Why XDXF is better?](https://github.com/soshial/xdxf_makedict/wiki/Why-is-XDXF-better%3F)". For opponents of using XML for storing dictionary and the problem of storing and parsing big XML-files in RAM, XDXF schema and structure of any dictionary allow to store all word articles on disk with help of hash-tables/. Some dictionary software applies this approach quite efficiently (for example, see [GoldenDict](http://goldendict.org/)). -Although, there is no software that allows editing dictionaries at the moment, XDXF is a more or less human-readible XML, that is quite easy to edit manually in a text editor even without prior knowledge of the format specifications. +Although, there is no software that allows editing dictionaries at the moment, XDXF is a more or less human-readable XML, that is quite easy to edit manually in a text editor even without prior knowledge of the format specifications. + +### Changelog (rev. 34) 2022-01-20 +* since rev. 34 the format is only semantic and cannot store any presentational or visual data +* the language code limitation is removed: all languages that exist in BCP47 standard are supported (use http://schneegans.de/lv/?tags=hy-Latn-IT-arevela for validation) +* multilingual dictionaries are now supported: a dictionary may have multiple languages, that are translated from and into. 
It is also allowed to mark `` and `` tags with `xml:lang` +* description supports line breaks +* transcription info can be directly inside `def` tag ### Changelog (rev. 33) -* `` introduced in order to fix multiple errors in DTD scheme +* `` introduced in order to fix multiple errors in DTD scheme * `` tag: added `lctn` and `type` attributes, links are not stored inside the tag anymore * `` tag: `idref` attribute introduced -* `` tag: added necessary hash sign # in atrribute +* `` tag: added necessary hash sign # in attribute * `` tag now is a list of `` tags * `` may now have `` inside to mark etymological ancestors/cognates * `` may now contain `` tag(s) inside @@ -27,8 +34,9 @@ Although, there is no software that allows editing dictionaries at the moment, X * ``, ``, `` tags now may have user-set attribute values #### Known limitations: -* ISO 639-3 contains around 7776 languages. The DTD scheme does not contain all of them. If there is a need for this, all codes may be added to the scheme. -* `lang_to` attribute may have multiple languages, since some dictionaries are multilingual. +* Many dictionary creators wished that XDXF supports some specific grammar forms of their language. Unfortunately, all possible grammar forms for all possible +languages cannot be formalised in a concise format such as XDXF. Also, supporting tables with grammar forms might over-sophisticate the format. Therefore, we +resorted to plain-text grammar information. ## Format description ### File structure @@ -39,30 +47,18 @@ the folder name should be something like "Webster1913". The dictionary file itse always "dict.xdxf". It is recommended for each dictionary to have a set of icons for toolbars and a large icon for the front page. The sizes should be of size like: 16x16, 32x32, 512x512. And the filenames would be icon16.png icon32.png and icon512.png respectively. -Note that all file names are case sensitive. +Note that all file names are case-sensitive. 
All XDXF dictionary text files (those with .xdxf extension) are in XML format with any Unicode encoding (usually UTF-8). Any non-Unicode encodings are strictly prohibited. ### Tags and structure: -`` is the **root element**. It must have 4 attributes: -* `lang_from` and `lang_to` values are 3-letter language codes from [ISO 639-3 standard](http://sil.org/iso639-3/) - and represents the language of key-phrases and definitions respectively. -* The `format` attribute specifies default formatting for the dictionary and might be either `visual` or - `logical`. The default format might be overwritten for specific articles as described below. - * In visual format, the articles are formatted visually and are intended to be shown by - dictionary software (referred to as "DS") "as is" without inserting or removing any whitespaces or newlines. - However, DS may mark the content of some logical tags (like `` or ``) with different colors. - **NB**! Remember, that visual format is NOT recommended! XDXF is developed especially for logically structured - dicts and the visual format was introduced only to be compatible with dicts converted from old plain-text formats. - * In logical format, the articles are not formatted visually and DS is responsible for formatting them before presenting them to a user. -* `revision` attribute specifies an exact format version of an XDXF file. This attribute is obligatory, but it was first - introduced in recent format revision, so it might be absent in some old xdxf files. +`` is the **root element**. The `revision` attribute specifies the exact version of the format standard used by an XDXF file and is obligatory. The structure of a file is divided into 2 parts: the ```` and the ``: ```xml - + All meta information about the dictionary: its title, author etc. @@ -77,9 +73,9 @@ The structure of a file is divided into 2 parts: the ```` and the `` is the container for all meta information about the dictionary. It contains: - 1. 
`` The short title of the dictionary written in English - 2. `<full_title>` Full name of the dictionary, like it would appear on the book cover. - *Tip*: It can contain non-English title or several titles at once. + 1. `<title>` The short title of the dictionary (so that it can fit small screens and look well in lists). Examples: "Oxford English-Arabic" or "Oxford Picture (En-Ru)" + 2. `<full_title>` Full name of the dictionary, like it would appear on the book cover. Example: "Oxford Advanced Learners Dictionary 8th Edition" or "Oxford Picture Dictionary English-Russian: Bilingual Dictionary for Russian-speaking teenage and adult students of English (Oxford Picture Dictionary 2E)" + *Tip*: It can contain titles in several languages or several titles at once. 3. `<publisher>` The official publisher of the dictionary; optional. 4. `<authors>` contains a list of `<author>` tags and represents all people (organizations) that took part in making dictionary: lexicographers, proofreaders, programmers etc. Optional. * `<author role="xxx">` One tag for each author. @@ -94,7 +90,7 @@ The structure of a file is divided into 2 parts: the ``<meta_info>`` and the `<l `<abbr_def>` might contain a `type` attribute, which states which type of label this abbreviation is: * `<abbr_def type="grm">` — stating grammatical features of word (noun, past participle etc.) - * `<abbr_def type="stl">` — stylistical properties of a word (vulgar, archaic, obsolete, poetic, disapproving etc.) + * `<abbr_def type="stl">` — stylistic properties of a word (vulgar, archaic, obsolete, poetic, disapproving etc.) * `<abbr_def type="knl">` — area/field of knowledge (computers, literature, culinary, typography etc.) * `<abbr_def type="aux">` — simple subsidiary words like ('e.g.', 'i.e.', 'cf.', 'also', 'rare' etc.). * `<abbr_def type="oth">` — others @@ -110,32 +106,31 @@ The structure of a file is divided into 2 parts: the ``<meta_info>`` and the `<l 7. 
`<file_ver>`, `<creation_date>` are obligatory info. `<last_edited_date>`, `<dict_edition>`, `<publishing_date>`, `<dict_src_url>` are optional meta info. All dates should be formatted as `DD-MM-YYYY`. When the date is not fully known, there should be zeros: 05-00-2011. + 8. `languages` tag lists which languages are used in this dictionary. `from` sub-tags list languages that key-phrases (`<k>`) represent. `to` sub-tags list languages that definitions (`<def>`) represent. 2. `<lexicon>` is the container for all `<ar>` (article) tags. The `<ar>` tag groups together all the stuff related to one key-phrase. - They can have an optional attribute `f`, eg. `<ar f="…">` which may have value either `v` (visual) or `l` (logic) and - can be used to override locally the default dictionary format, which was specified in `<xdxf>` tag. - **NB**! Consider, that using visual format is NOT recommended and its support by DS is unlikely! - The following tags are allowed inside `<ar></ar>` tag. 1. `<k>` A "key phrase", that is written inside `<k>` tag is a sequence of letters/ideograms which a correspondent word article is identified with. Each article may contain more than one key phrase, but always at least one. If there are more than one `<k>`, DS must display all key-phrases. that are assigned to the article, no matter which key-phrase a user was looking for. Developers of DS should make sure that convenient search through all keys is possible (with disregarding diacritics, wildcards and so on). *Notice*, that if it's not possible to merge two articles into one article because of key phrases, then there could be several different articles with identical key-phrases (but different article bodies). 
For example: ```xml - <ar> - <k>disc</k> - <k>disk</k> - <def cmt="a device for storing information"> - <deftext>......</deftext> - </def> - </ar> - <ar> - <k>disc</k> - <def cmt="round object"> - <deftext>......</deftext> - </def> - </ar> + <lexicon> + <ar> + <k>disc</k> + <k>disk</k> + <def cmt="a device for storing information"> + <deftext>......</deftext> + </def> + </ar> + <ar> + <k>disc</k> + <def cmt="round object"> + <deftext>......</deftext> + </def> + </ar> + </lexicon> ``` Also, `<sup>` and `<sub>` tags are allowed inside `<k>`. * `<opt>` Marks optional part of a key-phrase. Articles are searched by the `<k>` contents without `<opt>` contents, but are showed in the article with it. Tag `<opt>` might be used only inside `<k>` tag. @@ -147,8 +142,7 @@ The structure of a file is divided into 2 parts: the ``<meta_info>`` and the `<l \* They might have a unique lowercase alphanumerical `id` attribute [01-9a-z], that can be referred to from another article. \* They might have an integer/float attribute `freq`: some absolute or relative frequency value of the definition. - In articles with visual format `<def>` tags do not effect the representation, while in logical they do. - For articles that have logical format DS should distinguish visually one definition from another according to nesting level by means of indentation, font size or enumeration definitions with '1)','2)'... or 1.','2.'... or 'A.','B.'... etc. Consider checking out the examples or the DTD schema to understand the structure better. + The DS should distinguish visually one definition from another according to nesting level by means of indentation, font size or enumeration definitions with '1)','2)'... or 1.','2.'... or 'A.','B.'... etc. Consider checking out the examples or the DTD schema to understand the structure better. 1. `<gr>` Specifies grammar information about the word. Might contain different word forms, word usage, grammatical labels and other information of this sort. 2. 
`<tr>` Marks transcription/pronunciation information; IPA symbols are the default. Might also have "format" attribute with values "X-SAMPA" or "erkIPA". @@ -178,7 +172,7 @@ The structure of a file is divided into 2 parts: the ``<meta_info>`` and the `<l Attribute `type` might be: * `exm` - common examples with or without translations * `phr` - might contain any type of phrasemes (idioms, collocations, clichés etc.) - * `prv` - proberbs + * `prv` - proverbs * `oth` - other Attributes `source` and `author` specify where the example was taken from. @@ -188,9 +182,9 @@ The structure of a file is divided into 2 parts: the ``<meta_info>`` and the `<l * `<ex_tran>` is optional; may be multiple translations (amount: 0 or more). * Inside the previous two there might be useful `<mrkd>` tags. They are used to mark down main word(s) of an article both in original phrase and in translation ([example from Wiktionary](https://github.com/soshial/xdxf_makedict/raw/master/format_standard/images/mrkd_tag_in_examples.png)). * `<iref>` may contain a link to external resource. - 10. `<co>` Marks the text of an editorial comment that elucidates meaning or context (shown in a different colour by program depending on `type`). `type` attribute specifies what kind of comment it is: grammatical, stylistic, usage etc., anything that didn't fit for be placed into corrsponding sections. + 10. `<co>` Marks the text of an editorial comment that elucidates meaning or context (shown in a different colour by program depending on `type`). `type` attribute specifies what kind of comment it is: grammatical, stylistic, usage etc., anything that did not fit into the corresponding sections. *Tip* (indexing): comments are normally indexed. - 11. `<sr>` is a section dedicated to sematic relations to other words like synonyms, holonyms, hypernyms etc. + 11. `<sr>` is a section dedicated to semantic relations to other words like synonyms, holonyms, hypernyms etc. 
It uses `<kref>` with `type` attribute to relate to other words and definitions. Possible `type` values are: * `syn` and `ant` — synonyms and antonyms * `hpr` and `hpn` — hyperonyms and hyponims (these incorporate [troponyms](http://en.wikipedia.org/wiki/Troponymy)) @@ -217,126 +211,4 @@ The structure of a file is divided into 2 parts: the ``<meta_info>`` and the `<l ## Examples: - -### Visual format -**NOT RECOMMENDED**; this is how old XDXF-dictionaries look like. -```xml -<?xml version="1.0" encoding="UTF-8" ?> -<!DOCTYPE xdxf SYSTEM "https://raw.github.com/soshial/xdxf_makedict/master/format_standard/xdxf_strict.dtd"> -<xdxf lang_from="ENG" lang_to="ENG" format="visual"> - <meta_info> - <title>Webster's Dictionary - Webster's Unabridged Dictionary - Webster's Unabridged Dictionary published 1913 by the... - - n. noun - v. verb - Av.Ave.Avenue - - 001 - 07-04-2013 - - - - The United States of America - Соединенные Штаты Америки - - - record - - - n. - [re'kord] - Anything written down and preserved. - - - v. - [reko'rd] - To write down for future use. - - - - - home - - [ho:um] - n. - sounds_of_words.ogg - 1) One's own dwelling place; the house in which one lives. - 2) One's native land; the place or country in which one dwells. - 3) The abiding place of the affections. For without hearts there is no home. - 4) дом at home - дома, у себя; make yourself at home - будьте как дома - XDXF Home page - See also: home-made - - - - -``` -### Logical format -Example of the RECCOMENDED logical format: -```xml - - - - - Webster's Dictionary - Webster's Unabridged Dictionary - Webster's Unabridged Dictionary published 1913 by the Webster Institute - 001 - 07-04-2013 - 13-10-2017 - - n.noun - v.verb - Av.Ave.Avenue - - - - - home - - n. 'həum - XDXF Home page - One's own dwelling place; the house in which one lives. - One's native land; the place or country in which one dwells. - - The abiding place of the affections. - For without hearts there is no home. 
- - - - дом, at home - дома, у себя; - - - make yourself at home - будьте как дома - - Society - - home-made - - - - Society - - Plural form of word index. - - - - disc - disk - - n. - A flat, circular plate; as, a disk of metal or paper. - - - - CO2 - - Carbon dioxide (CO2) - a heavy odorless gas formed during respiration. - - - - -``` \ No newline at end of file +For examples see https://github.com/soshial/xdxf_makedict/master/sample-dicts/. diff --git a/format_standard/xdxf_old_schema_rev33.dtd b/format_standard/xdxf_old_schema_rev33.dtd new file mode 100644 index 0000000..ba6a337 --- /dev/null +++ b/format_standard/xdxf_old_schema_rev33.dtd @@ -0,0 +1,119 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/format_standard/xdxf_strict.dtd b/format_standard/xdxf_strict.dtd index ba6a337..c926052 100644 --- a/format_standard/xdxf_strict.dtd +++ b/format_standard/xdxf_strict.dtd @@ -1,11 +1,16 @@ - - + + + + + + + - + @@ -22,19 +27,21 @@ + - + - - + + + @@ -72,48 +79,3 @@ - - - - - diff --git a/sample-dicts/rev33.xml b/sample-dicts/rev33.xml new file mode 100644 index 0000000..190d022 --- /dev/null +++ b/sample-dicts/rev33.xml @@ -0,0 +1,63 @@ + + + + + Webster's Dictionary + Webster's Unabridged Dictionary + Webster's Unabridged Dictionary published 1913 by the Webster Institute + 001 + 07-04-2013 + 13-10-2017 + + n.noun + v.verb + Av.Ave.Avenue + + + + + home + + 'həumn. + XDXF Home page + One's own dwelling place; the house in which one lives. + One's native land; the place or country in which one dwells. + + The abiding place of the affections. + For without hearts there is no home. + + + + дом, at home - дома, у себя; + + + make yourself at home + будьте как дома + + Society + + home-made + + + + Society + + Plural form of word index. + + + + disc + disk + + n. + A flat, circular plate; as, a disk of metal or paper. 
+ + + + CO2 + + Carbon dioxide (CO2) - a heavy odorless gas formed during respiration. + + + + \ No newline at end of file diff --git a/sample-dicts/rev34.xml b/sample-dicts/rev34.xml new file mode 100644 index 0000000..780eab5 --- /dev/null +++ b/sample-dicts/rev34.xml @@ -0,0 +1,176 @@ + + + + + + + + + + + Webster's Dictionary + Webster's Unabridged Dictionary + + Webster's Unabridged Dictionary published 1913 by the Webster Institute. +
+ It has a very extensive lexicon. +
+ 001 + 07-04-2013 + 13-10-2017 + + + n. + noun + + + v. + verb + + + Av. + Ave. + Avenue + + +
+ + + bet + + pronunciation + anagrams + + + Noun + + A wager, an agreement between two parties that a stake (usually money) will be paid by the loser to the winner (the winner + being the one who correctly forecast the outcome of an event). + + + Dylan owes Fletcher $30 from an unsuccessful bet. + + + + A degree of certainty. + + It’s a safe bet that it will rain tomorrow. + + + + + Verb + To stake or pledge upon the outcome of an event; to wager. + + example 1 + + + + To be sure of something; to be able to count on something + + example 2 + + + + (poker) To place money into the pot in order to require others do the same, usually only used for the first person to place + money in the pot on each round. + + + example 3 + + + Etymology 1 + + + + Noun + The letter in Semitic languages + + Etymology 2 + + + + Abbreviation + (knitting) between + + Etymology 3 + + + + + + home + houm + + 'həum + + n. + + + XDXF + + Home + page + + + + One's own dwelling place; the house in which one lives. + + + One's native land; the place or country in which one dwells. + + + The abiding place of the affections. + + For without hearts there is no home. + + + + + дом, at home - дома, у себя; + + + make yourself at + home + + будьте как + дома + + + + Society + + + + home-made + + + + + Society + + Plural form of word index. + + + + + disc + disk + + + n. + + A flat, circular plate; as, a disk of metal or paper. + + + + CO + 2 + + + Carbon dioxide (CO2) - a heavy odorless gas formed during respiration. + + + + +
\ No newline at end of file