
/ability to understand others speaking/ and /ability to be understood by others while speaking/; /ability to read a language in a specific writing system/ and /ability to write a language in a specific writing system/ #40

Open
fititnt opened this issue Dec 31, 2021 · 1 comment

Comments


fititnt commented Dec 31, 2021

As per the Wiki documentation, this proposal is divided into: I - submitter name and affiliation; II - portal, service or software product represented or affected; III - clear and concise description of the problem or requirement; IV - proposed solution, if any.

I - submitter name and affiliation

Emerson Rocha, from @EticaAI.

Some context: as part of @HXL-CPLP, the submitter is working both on community translation initiatives for humanitarian use and (because of the lack of usable data standards) on specialized public domain tooling to manage and exchange multilingual terminology (so different lexicographers can compile results without centralization).

II - portal, service or software product represented or affected

Non-exhaustive list of affected points:

  • Decision making for "best interpreters available" when response time is important
    • Note to self: example of the UK having issues redirecting interpreters for Afghan refugees
    • Note to self: example of interpreters asked to aid a different macrolanguage in a medical context
  • Being able to exchange "volunteer translator" lists between NGOs (maybe even the police) for when someone is detained but the police do not have interpreters
  • Minimal standards for collecting language skills. Both exact language codes and a usable skill scale are relevant.
    • It is also relevant to have automated ways, in the worst-case (and actually very common) scenario, to find the closest match volunteers have, to do the bare minimum interpreting
  • Increase awareness by establishing minimal conventions for sign languages
  • Increase awareness of European minority languages which do not have a two-letter ISO 639-1 code, by recommending the use of three-letter ISO 639-3 codes only

III - clear and concise description of the problem or requirement

Comments on the nomenclature and symbols used in this proposal:

  • The / // [ [[ notation:
    • This syntax is inspired by a non-standard usage of International Phonetic Alphabet delimiters. A quick explanation:
      • /vague term/
      • //extra vague term//
      • [precise term]
      • [[extra precise term]]

  • NOTE: To be explained better next year

IV - proposed solution, if any

The submitter proposes no fewer than two fields, but ideally four. Whatever the result, it is important to differentiate spoken-language skill from written-language skill as a baseline; for a native speaker who has had formal education, the data fields could simply have the same values.

The suggestions here use four fields.

A commitment to help with improved tables providing labels for such content is also added at the end.
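
As a minimal sketch of how the four fields could sit together in one record (the field names and values below are hypothetical placeholders, not proposed terms; the 0 to 100 scale is explained in § IV 1):

# Hypothetical record with the four proposed fields; names and values
# are illustrative placeholders, not final CPV terms. Spoken/sign
# fields use bare ISO 639-3 codes (no script); written fields pair the
# language with an ISO 15924 script code, following BCP-47 semantics.
person_language_skills = {
    "ability_to_understand_speech": {"por": 100, "spa": 80},
    "ability_to_be_understood":     {"por": 100, "spa": 60},
    "ability_to_read":              {"por-Latn": 100, "spa-Latn": 75},
    "ability_to_write":             {"por-Latn": 95, "spa-Latn": 50},
}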

A. The requested addition to CPV

1. /ability to understand others speaking/@eng-Latn

§ IV 1

General idea of the definition: the ability of a person to understand another person using the spoken or sign language.

  • Property: /ability to understand others speaking/@eng-Latn or a shorter (yet acceptably precise) head term
  • Expected Range:
    • Way to exchange data:
      • A pair of an option from the Codelist AND a language-neutral ordinal number from //0// to //100//;
        • Open-world assumption; absence means "Unknown"
          • To explicitly convey absence of information when exchanging data, the code //0// cannot be used. Either remove the language reference or use //-1// as an alternative.
      • The user should be free to add as many options as needed.
  • Description:

    The person's self-asserted ability to understand others speaking the natural language, sign language, constructed language or controlled natural language which is specified. This assumes no slang, specific regionalisms or neologisms are used.

  • Usage:

    In case of poor data collection, post-collection corrections are encouraged before exchange when additional context (like country of origin, how many years a person lived in a region, proof of education, etc.) cannot be sent. This is best done closest to the information managers with direct access to the raw information. Such human-in-the-loop corrections for common mistakes (even if they become scripted after detecting a workflow error "on best effort") can still be considered self-asserted if done in good faith.

    However, fully automated inferences, such as known intelligibility algorithms based solely on self-asserted skills, can also be exchanged, as long as they go in explicitly different fields and there is no user self-assigned skill mixed into such inferences (in particular if it is "zero"). This approach both avoids error propagation and can be acceptable under urgency.

    Recommended coding systems are ISO 639-3 and Glottocode, both of which have decent documentation and a good baseline for human labels. Writing system codes, such as ISO 15924, are not needed (they do not make sense for spoken or sign languages). Language attributes such as simple (https://www.iana.org/assignments/lang-subtags-templates/simple.txt), as in [[eng-simple]]@zxx-Latn-x-bcp47 (using BCP-47 semantics), can be used to represent skills such as a person's ability to understand English for Learners (or "ICAO English"): this would mean understanding at least the English grammar, shorter sentences and a minimal vocabulary.

    With ISO 639-3 or Glottocode as the minimal reference, country codes (ISO 3166-1) as part of the language tag become redundant. Even if one really wants to be more precise and ISO 639-3 is not "good enough", Glottocode is an alternative with vast documentation.

    About the code used to express the person's ability: the recommended machine-level scale is a numerical value from 0 to 100. How this data coercion is made can be explained with additional metadata (for example, which actual labels or context information were shown to users). For persons who have proof of skill from language tests applied in more than one country, the scale used in such tests and its coercion to the 0 to 100 scale become relevant. For consumers of such data, already having such a numerical scale tends to be "good enough".

  • Codelist:
    • For language code:
    • For ability:
      • Machine exchange code (see the sketch after this list):
        • An ordinal number from 0 to 100 (U+0030..U+0039)
          • Semantics:
            • [0]@mul-Zmth-x-unicode ([[U+0030]]@mul-Zmth-x-unicode) means self-asserted, explicit absence of even the most basic knowledge.
            • [100]@mul-Zmth-x-unicode means self-asserted maximum ability for a given human-labelled skill at the time the data was collected.
            • If it is necessary to exchange an unknown skill, [0]@mul-Zmth-x-unicode should not be used, to avoid confusion with explicitly self-asserted absence. It is better to use a negative value, such as /-1/, or a platform-dependent value such as NULL@zxx-Zyyy-x-sql (see Null (SQL) for issues), as long as it is documented.
      • Human labels:
        • Human labels representing such machine exchange codes are not only language-dependent but also culturally dependent. The numeric range itself can be used as a last resort.
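
A minimal sketch of these exchange semantics in code (the function name normalize_skill is hypothetical; only the 0..100, 0, -1 and absence rules come from the proposal above):

# Sketch of the proposed machine exchange semantics: 0..100 is the
# skill scale, 0 means explicit self-asserted absence of skill, and
# -1 (or a missing/NULL value) means "unknown" under the open-world
# assumption.

def normalize_skill(value):
    """Return an int 0..100, or None when the skill is unknown."""
    if value is None or value == -1:  # unknown; never conflate with 0
        return None
    if not 0 <= value <= 100:
        raise ValueError("skill out of range: %r" % (value,))
    return int(value)

assert normalize_skill(100) == 100  # self-asserted maximum ability
assert normalize_skill(0) == 0      # explicit absence of knowledge
assert normalize_skill(-1) is None  # unknown skill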

2. /ability to be understood by others while speaking/@eng-Latn

  • Property: /ability to be understood by others while speaking/@eng-Latn or a shorter (yet acceptably precise) head term
  • Expected Range:
  • Description:

    The person's self-asserted ability to be understood by others while speaking the natural language, sign language, constructed language or controlled natural language which is specified. This assumes that the person considers themselves capable of avoiding slang, specific regionalisms and neologisms, and that, if asked, they have the patience to speak more slowly and to use synonyms or better explain specific words that are not understood.

  • Usage:

    The usage notes of § IV 1 (/ability to understand others speaking/@eng-Latn) also apply here.

    While the intelligibility between closely related languages can be asymmetric, a person's skill to speak and be understood can also be lower than their skill to understand. Some examples (see the sketch after this list):

    • The target language can have sounds which do not exist in the languages the person uses more often. The mere lack of speaking practice already creates this difference, but unfamiliar sounds make it harder for a non-native speaker.
    • Simplified versions which do have standard tests, such as Aviation English for [eng-simple], tend to be asymmetric. An educated native speaker of English, as spoken in England or the USA, despite by inference being able to understand others 100 out of 100, is likely to use a larger vocabulary, speak faster and use more complex sentences which fall outside the respective [eng-simple], even while avoiding regionalisms.

    Again, note that unnecessary usage of ISO 3166-1 (country codes) not only adds too many options for users, but can also make them self-assert a lower skill than they would have in practice.

  • Codelist:
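
A hypothetical illustration of the asymmetry described above, reusing the placeholder record format from the start of section IV (the keys and numbers are made up; [eng-simple] would be exchanged with the full BCP-47 semantics described in § IV 1):

# Hypothetical data: a well-educated native English speaker may
# understand the controlled [eng-simple] perfectly, yet self-assert a
# lower ability to stay inside its restricted grammar and vocabulary.
skills = {
    "ability_to_understand_speech": {"eng": 100, "eng-simple": 100},
    "ability_to_be_understood":     {"eng": 100, "eng-simple": 60},
}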

3. /ability to read a language in a specific writing system/@eng-Latn

§ IV 3

  • Property: /ability to read a language in a specific writing system/@eng-Latn or a shorter (yet acceptably precise) head term
  • Expected Range:
  • Description:

    The person's self-asserted ability to understand the natural language, sign language, constructed language or controlled natural language which is specified, in a specific writing system. This assumes no slang, specific regionalisms or neologisms are used.

  • Usage:

    The usage notes of § IV 1 (/ability to understand others speaking/@eng-Latn) are relevant here, except that now the explicit addition of a code for writing systems is also suggested: ISO 15924, maintained by Unicode. The full semantics become BCP-47 without omitting the script: this description is written in [eng-Latn] (English in the Latin writing system).

    Please note that this field explicitly excludes the indiscriminate reuse of the same codes used by sign languages: they are (typically) not written and have their own grammar, independent from non-sign languages. Some use cases:

    • It is safe to assume that a well-educated native speaker of a non-sign language could have all 4 proposed fields inferred with just one question (so the main reasons for 4 types of fields are both non-mother languages and languages which are not prestige dialects and have less educational material).
    • But a counterexample: while you are reading this in [eng-Latn], and English does have a single ISO 639-3 code, speakers of [bfi] British Sign Language and [ase] American Sign Language would need interpreters to talk with each other. Intelligibility (even when avoiding regionalisms) follows a non-intuitive path, mostly for historical reasons and because of the lack of visibility of sign language speakers.

    This does not mean sign languages cannot use this field, but they actually need more care in how data is collected, and this is very relevant for the sake of inclusion. Explicit use of a Latin script code, such as [ase-Latn] and [bfi-Latn] (even if fonts are used), is not attested outside academic areas. As a generic writing system, the closest ISO 15924 code could be either Zxxx (the code for unwritten languages) or Zsym (symbols), but discussion is scarce in this area. However, [ase-Sgnw] and [bfi-Sgnw], using SignWriting as a writing system, are known to exist as alternatives to video editing. SignWriting, despite starting as left-to-right like the alphabetic Latin script, evolved to be top-to-bottom, as that felt more natural to its users. Considering that each "full symbol" means anything from a word to a full short sentence, the best-attested writing system specialized for sign languages in Western cultures is actually close to Asian writing systems, like the ones used for Chinese, which are not alphabetic and are top-to-bottom; their horizontal, left-to-right writing was influenced by the adoption of computers built around the needs of Western (mostly English) usage.

  • Codelist:
    • For language code:
      • ISO 639-3 with ISO 15924, BCP-47 semantics (a sketch for checking these tags follows this list)
        • Examples:
        • rmf-Latn
          • Label: [Kalo Finnish Romani (Latin)]@eng-Latn
        • rmf-Cyrl
          • Label: [Kalo Finnish Romani (Cyrillic)]@eng-Latn
      • Glottocode with ISO 15924, BCP-47 semantics
        • Examples:
        • rmf-Latn-x-kalo1256
          • Label: [Kalo Finnish Romani (Latin)]@eng-Latn
        • rmf-Cyrl-x-kalo1256
          • Label: [Kalo Finnish Romani (Cyrillic)]@eng-Latn
        • Trivia: what if there is no exact ISO 639-3 code for an attested language covered by a Glottocode?
          • Use the closest parent ISO 639-3 code as the baseline. Only as a last resort could und, zxx, mul and mis be used. For example, mul-Latn means multiple languages in Latin script. zxx-Zmth can also be used for numerical codes used to exchange data.
    • For ability:
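
As referenced in the list above, a minimal sketch for checking that a tag follows the shape used in these examples. This is pattern matching only, under the assumption that no registry lookup is done; the names TAG and parse_tag are illustrative:

import re

# Pattern-only check for the tag shape used in the examples above:
# ISO 639-3 language, ISO 15924 script, and an optional Glottocode
# carried as a BCP-47 private-use subtag. No registry validation.
TAG = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"-(?P<script>[A-Z][a-z]{3})"
    r"(?:-x-(?P<glottocode>[a-z]{4}[0-9]{4}))?$"
)

def parse_tag(tag):
    match = TAG.match(tag)
    if match is None:
        raise ValueError("not in the expected form: %r" % (tag,))
    return match.groupdict()

print(parse_tag("rmf-Cyrl-x-kalo1256"))
# {'language': 'rmf', 'script': 'Cyrl', 'glottocode': 'kalo1256'}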

4. /ability to write a language in a specific writing system/@eng-Latn

B. Tasks the submitter is willing to help with

Even if not with these exact terms, having the 4 fields to collect such data AND a numerical scale is a huge improvement. More explanation of the whys can be given, but in the "worst case" the codes are BCP-47, which is a standard.

However, if this issue goes ahead, the submitter can promise to do their best to actually go after a stable way to provide pre-compiled tables with all the multilingual information needed. Without this, implementers would get stuck. And even if they do use BCP-47 properly, it is so hard to get the existing translations for the names of the languages that most submissions to Unicode CLDR to this day are only usable by Apple, Google, Microsoft and a few other tech giants.

B.1 Where to publish pre-compiled data

B.1.1 Does the European Union have a CKAN?

I'm not aware of one.

B.1.2 The Humanitarian Data Exchange

https://data.humdata.org is a great alternative for publishing such a dataset, periodically updated from the sources.

Several of its datasets are already automated; see https://github.com/OCHA-DAP as a reference.

With some discussion, as long as such compiled tables have a license which allows exchange and there is minimal interest for humanitarian usage, the end result not only can be sent there "one time", but can become part of a process of continuous updating.

This means codes and translations wouldn't get outdated, and everyone could have a more centralized reference.

This type of dataset is also a great candidate for https://vocabulary.unocha.org/.

B.2 Licensing issues

Even for humanitarian usage, it is more likely that merging Unicode CLDR, ISO 639-3 and Glottocode will be delayed because of licensing issues than because of technical viability. I am saying this upfront because the idea of expecting each implementer to merge several datasets AND keep them updated is unrealistic. The experience in the humanitarian sector points to the need for pre-compiled, ready-to-use datasets.

Whether the European Commission has a "CKAN-like" portal or The Humanitarian Data Exchange is used, I know upfront they will ask about licensing.

B.3 Proof of concept that translations for such terms do exist and most algorithms to calculate the most related languages are ready for use

The Unicode CLDR (https://cldr.unicode.org/, https://github.com/unicode-org) has translations for both languages and scripts, and also has Territory-Language Information: https://unicode-org.github.io/cldr-staging/charts/40/supplemental/territory_language_information.html.

One difference between this proposal and the CLDR Territory-Language Information is that CLDR has only 2 fields (Literacy% vs Written%) instead of 4.

Example output for eng-Latn:
ititnt@bravo:~/Documentos/temp/Core-Person-Vocabulary-reply$ /workspace/git/EticaAI/tico-19-hxltm/scripts/fn/linguacodex.py --de_codex eng-Latn  | jq
{
  "language": "English",
  "script": "Latin",
  "macro_linguae": false,
  "codex": {
    "BCP47": "en",
    "ISO639P3": "eng",
    "ISO639P2B": "eng",
    "ISO639P2T": "eng",
    "ISO639P1": "en",
    "Glotto": "stan1293",
    "ISO15924A": "Latn"
  },
  "communitas": {
    "litteratum": 1636849041,
    "scribendum": 1327465383
  },
  "nomen": {
    "intranomen": "English",
    "externomen": {
      "ar-Arab": "الإنجليزية",
      "hy-Armn": "անգլերեն",
      "ru-Cyrl": "английский",
      "hi-Deva": "अंग्रेज़ी",
      "gu-Gujr": "અંગ્રેજી",
      "el-Grek": "Αγγλικά",
      "ka-Geor": "ინგლისური",
      "pa-Guru": "ਅੰਗਰੇਜ਼ੀ",
      "zh-Hans": "英语",
      "zh-Hant": "英文",
      "he-Hebr": "אנגלית",
      "ko-Jamo": "영어",
      "jv-Java": "English",
      "ja-Kana": "英語",
      "km-Khmr": "អង់គ្លេស",
      "kn-Knda": "ಇಂಗ್ಲಿಷ್",
      "lo-Laoo": "ອັງກິດ",
      "la-Latn": "English",
      "my-Mymr": "အင်္ဂလိပ်",
      "su-Sund": "English",
      "ta-Taml": "ஆங்கிலம்",
      "te-Telu": "ఇంగ్లీష్",
      "th-Thai": "อังกฤษ",
      "bo-Tibt": "དབྱིན་ཇིའི་སྐད།",
      "ii-Yiii": "ꑱꇩꉙ"
    }
  },
  "praejudicium": [],
  "__meta": {
    "de_codex": "eng-Latn"
  }
}


Also, the algorithms mentioned for "closest language to reference ones" do already exist. They do not even need access to remote services.
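
A hypothetical sketch of such an offline "closest language" lookup, ranking candidates by shared genealogical classification depth (the classification paths below are illustrative stand-ins, not real Glottolog data):

# Hypothetical offline "closest language" lookup: rank candidate
# languages by how many classification nodes they share with the
# target, from the root down. Paths are illustrative, not Glottolog.
CLASSIFICATION = {
    "eng": ["indo1319", "germ1287", "west2793"],
    "deu": ["indo1319", "germ1287", "west2793"],
    "rus": ["indo1319", "slav1255"],
}

def shared_depth(a, b):
    """Length of the common classification prefix of two languages."""
    depth = 0
    for x, y in zip(CLASSIFICATION[a], CLASSIFICATION[b]):
        if x != y:
            break
        depth += 1
    return depth

def closest(target, candidates):
    return max(candidates, key=lambda c: shared_depth(target, c))

print(closest("eng", ["deu", "rus"]))  # deu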


Trivia: this proposal still counts as submitted during the year 2021, according to Portugal's time zone.


EmidioStani commented Jan 6, 2023

I think these properties could be useful for the European Learning Model v3, see: https://github.com/european-commission-empl/European-Learning-Model/
