Avoid Capitalisation of beginning of humanisation #87

Closed
mzeinstra opened this issue Jan 15, 2023 · 5 comments · Fixed by #89

Comments

@mzeinstra
Collaborator

mzeinstra commented Jan 15, 2023

Interesting suggestion by @verdy-p that will increase the usability of this repo:

avoid forcing a leading capital in sources (e.g. "from $1 to $2" and not "From $1 to $2", "yesterday" and not "Yesterday"...) and that all translations should use uncapitalized terms (unless these terms are always capitalized like "Monday" or "March" in English), i.e. like entries in dictionaries. The capitalisation at start of a sentence or title can be inferred. CLDR does not force the capitalisation in any one of these translatable terms.

Originally posted by @verdy-p in #77 (comment)
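To illustrate how capitalisation at the start of a sentence or title can be inferred rather than stored in the message, here is a minimal Python sketch. The helper names `capitalize_first` and `render` are hypothetical and are not part of this library, which is written in PHP.

```python
def capitalize_first(text: str) -> str:
    """Uppercase only the first character and leave the rest untouched.

    str.capitalize() is deliberately avoided: it would lowercase the rest
    of the string and damage terms that are always capitalised ("Monday").
    """
    return text[:1].upper() + text[1:]


def render(message: str, *, sentence_start: bool) -> str:
    """Return the stored (uncapitalised) message, adapted to its position.

    `sentence_start` is an assumption of this sketch: the caller is the one
    that knows whether the text begins a sentence or a title.
    """
    return capitalize_first(message) if sentence_start else message


print(render("from 2019 to 2021", sentence_start=True))
print(render("yesterday", sentence_start=False))
```

Storing the lowercase form loses nothing, because the capitalised form can always be derived from it, while the reverse is not true for terms such as "Monday".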

@JeroenDeDauw
Member

Do you have an example of a message for which this is currently going wrong?

@verdy-p

verdy-p commented Jan 16, 2023

For example, in French, forced capitalization is wrong for almost every part of a formatted date (including, but not limited to, month names, weekday names, and season names),

except for parts of era names that cite proper names, such as names of people, names of ethnic or national groups, and toponyms ("Jésus-Christ", "Bouddha", names of Japanese emperors...), and their abbreviations, whose capitalization is immutable.

Forced capitalization is also wrong for ordinals (like "premier", "deuxième", ...), for all prepositions (like "de", "à", "vers", ...) and adverbs (like "environ", "approximativement", ...), and for all nouns and adjectives referring to languages or cultures, along with their derived terms (like "chrétien(ne)", "bouddhiste", "musulman(e)", "islamique").

@JeroenDeDauw
Member

Yeah ok. So what is an example of those rules being violated by the current implementation?

Any of the translations in here wrong? https://github.com/ProfessionalWiki/EDTF/blob/master/tests/Functional/FrenchHumanizationTest.php

@verdy-p

verdy-p commented Jan 16, 2023

-		yield 'Leading zeroes' => [ '0042', 'Année 42' ];
+		yield 'Leading zeroes' => [ '0042', 'année 42' ];

-		yield 'Interval with open end' => [ '2019/..', 'Depuis 2019 (fin indéterminée)' ];
+		yield 'Interval with open end' => [ '2019/..', 'depuis 2019 (fin indéterminée)' ];

-		yield 'Interval with open start' => [ '../2021', 'Jusqu’à 2021' ];
+		yield 'Interval with open start' => [ '../2021', 'jusqu’en 2021' ];

-		yield 'Interval with unknown end' => [ '2019/', 'Depuis 2019 jusqu’à une fin inconnue' ];
+		yield 'Interval with unknown end' => [ '2019/', 'depuis 2019 jusqu’à une fin inconnue' ];

-		yield 'Interval with unknown start' => [ '/2021', 'Depuis un début inconnu jusqu’à 2021' ];
+		yield 'Interval with unknown start' => [ '/2021', 'depuis un début inconnu jusqu’en 2021' ];

-		yield 'Year approximate' => [ '2019~', 'Autour du 2019' ];
+		yield 'Year approximate' => [ '2019~', 'autour de 2019' ]; //or:
+		yield 'Year approximate' => [ '2019~', 'en 2019 environ' ];

-		yield 'Year uncertain approximation' => [ '2019%', 'Autour du 2019 (incertain)' ];
+		yield 'Year uncertain approximation' => [ '2019%', 'autour de 2019 (incertain)' ]; // or:
+		yield 'Year uncertain approximation' => [ '2019%', 'en 2019 environ (incertain)' ]; // or:
+		yield 'Year uncertain approximation' => [ '2019%', 'vers 2019 (incertain)' ];

-		yield 'Month approximate' => [ '2019-04~', 'Autour du Avril 2019' ];
+		yield 'Month approximate' => [ '2019-04~', 'autour d’avril 2019' ]; //or:
+		yield 'Month approximate' => [ '2019-04~', 'en avril 2019 environ' ]; //or:
+		yield 'Month approximate' => [ '2019-04~', 'vers avril 2019' ];

-		yield 'Month uncertain' => [ '2019-04?', 'Avril 2019 (incertain)' ];
+		yield 'Month uncertain' => [ '2019-04?', 'avril 2019 (incertain)' ];

-		yield 'Day approximate' => [ '2019-04-01~', 'Autour du 1er avril 2019' ];
+		yield 'Day approximate' => [ '2019-04-01~', 'autour du 1er avril 2019' ]; //or:
+		yield 'Day approximate' => [ '2019-04-01~', 'vers le 1er avril 2019' ]; //or:
+		yield 'Day approximate' => [ '2019-04-01~', 'le 1er avril 2019 environ' ];

As you can see, there is no common format for datetime ranges (or open intervals): the prepositions and articles (including contractions of preposition + article, and articles elided with an apostrophe before a term starting with a vowel or a mute 'h') depend on the precision. Alternatives can also use appended adverbs like "environ". All the strings above should start with a lowercase letter.

How to handle the elision of articles with apostrophes depends not just on the precision but also on specific values (like "avril" here). French has a general rule for most vowels (a, e, i, o, u, y, possibly with accents), but there are complications for terms starting with 'y' or 'i' followed by another vowel, and for terms starting with 'h': you need to look the word up in a French dictionary to know whether the 'h' is aspirated or mute. Borrowed foreign terms may use an aspirated 'h', but in most cases the 'h' is mute in French, and no rule tells you which it is; it is a matter of usage. For translating dates, however, such a dictionary lookup would not be very complex, as there are not many terms.

These phenomena also apply in Italian, Spanish, and many other languages.
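The French elision rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `de`, the `MUTE_H` lookup set, and its single entry are hypothetical, and the cases of 'y' or 'i' followed by another vowel, as well as words like "onze", are deliberately not handled.

```python
# Vowels (with common accents) that trigger elision of "de" to "d’".
# 'y' is left out on purpose: it needs case-by-case handling.
VOWELS = set("aeiouàâäéèêëîïôöùûü")

# Hypothetical lookup of date-related terms whose initial 'h' is mute.
# Whether an 'h' is mute or aspirated is a matter of usage, so it is
# listed explicitly instead of being derived from a rule.
MUTE_H = {"hiver"}


def de(term: str) -> str:
    """Prefix `term` with the preposition "de", eliding it when required."""
    words = term.split()
    first_word = words[0].lower() if words else ""
    elide = first_word[:1] in VOWELS or first_word in MUTE_H
    return ("d’" + term) if elide else ("de " + term)


print("autour " + de("avril 2019"))  # elided before a vowel
print("autour " + de("2019"))        # no elision before a digit
```

Because the set of terms that can appear in a formatted date is small, an explicit lookup like `MUTE_H` stays manageable.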

@verdy-p

verdy-p commented Jan 16, 2023

We should note that EDTF deviates from CLDR only because "raw" values use additional delimiters that are still not specified in ISO 8601: ".." for ranges/intervals, "%", "~", and "?" for uncertainty, and "," for lists of values. It also had "()" for subenumerations on some elements, but you deprecated them by adding "left/right" semantics for uncertainty qualifiers. CLDR, however, has full support for translating ranges/intervals and for variable precision in individual dates.
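As a small illustration of how these trailing qualifiers can be separated from the "raw" value before any translation happens, here is a Python sketch. It assumes the meanings shown in the test cases earlier in this thread ("?" uncertain, "~" approximate, "%" both) and only handles a qualifier at the very end of the string; the function name `split_qualifier` is hypothetical.

```python
# Map each trailing qualifier to (uncertain, approximate).
QUALIFIERS = {
    "?": (True, False),
    "~": (False, True),
    "%": (True, True),
}


def split_qualifier(value: str) -> tuple[str, bool, bool]:
    """Split a trailing qualifier off an EDTF-like string.

    Returns (bare_value, uncertain, approximate). Qualifiers applied to
    individual components ("left/right" semantics) are out of scope here.
    """
    if value and value[-1] in QUALIFIERS:
        uncertain, approximate = QUALIFIERS[value[-1]]
        return value[:-1], uncertain, approximate
    return value, False, False


print(split_qualifier("2019-04~"))
```

The bare value and the two flags can then be passed to a locale-aware formatter independently of the EDTF syntax.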

EDTF also defines some "magic" values for "pseudo-months" representing seasons, quarters, quadrimesters, and half-years. CLDR uses another format for these (e.g. "H1" and "H2" for half-years, "Q1" to "Q4" for quarters; it also adds "W1" to "W53" for weeks of the year, inherited from ISO 8601, which EDTF still does not specify).

Such extensions could be added to CLDR (by filing a request with them), and possibly integrated into its "root" locale or into a separate special locale (like "POSIX", or "C" in legacy standard i18n libraries for C/C++), if there is a need to deviate (for example, CLDR uses en dashes rather than ".." in its root locale, and locales for actual languages may use en dashes with or without surrounding non-breaking spaces, depending on the number of date elements linked together in the range). CLDR has, however, standardized translated items without forcing a leading capital.

However, the translations made for EDTF are not directly related to EDTF's compact syntactic format, whose purpose is only to represent "raw" values in a locale-neutral way. This means that EDTF libraries should not depend on translation at all; that is done separately, in CLDR. EDTF is just a particular locale that is parsable into Datetime objects, or extended Datetime objects (with optional certainty/approximation qualifiers), possibly as part of collection objects (lists).

Wikidata itself defines its own "qualifiers" to represent certainty/approximations, and does not need to use lists: instead it represents each given date or interval as a separate item, so for Wikidata the ISO 8601 standard is sufficient and EDTF is not needed at all. Wikidata also supports dates relative to eras, and different calendar systems (not just the modern Gregorian calendar, which is insufficient, including for many historic dates or official modern uses). Datetime elements should also support the ISO 8601 specification for timezone indicators and for week numbers (CLDR contains much data about them and their translation).
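A minimal Python sketch of how the EDTF "pseudo-month" codes could be mapped to the quarter and half-year labels mentioned above. It assumes the EDTF sub-year codes 33 to 36 for quarters and 40 to 41 for half-years; the function name `sub_year_label` is hypothetical, and seasons and quadrimesters are left unmapped because the labels above do not cover them.

```python
from typing import Optional


def sub_year_label(code: int) -> Optional[str]:
    """Map an EDTF sub-year grouping code to a "Q"/"H" style label.

    Returns None for real months (1-12) and for codes this sketch does
    not cover (seasons 21-32, quadrimesters 37-39).
    """
    if 33 <= code <= 36:
        return f"Q{code - 32}"  # 33 -> Q1 ... 36 -> Q4
    if 40 <= code <= 41:
        return f"H{code - 39}"  # 40 -> H1, 41 -> H2
    return None


year, code = "2019-33".split("-")
print(year, sub_year_label(int(code)))
```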
