Implement codes for concept with own internal implicitly taxonomy #39
Comments
@fititnt Thank you for your extensive use case description. Can you elaborate on what the difference is with persistent dereferenceable identifiers? E.g. http://data.europa.eu/m8g/Criterion? In addition, the publishing of the identifiers according to the Semantic Web / Linked Data principles is done with reuse in a data exchange context in mind. That means that when parties want to exchange data they are invited to reuse the data models with the corresponding identifiers. If they decide for reuse, then they should be able to map the data properties onto those identifiers. Because the identifiers are persistent, the mapping will be more reliable. Observe that the data models are not software artifacts to be reused as is in software development. The data models are not about a technical alignment between systems. System developers should do their homework and integrate the data models in the best way in their systems.
I'm using ISO 639-3 special codes for implementations that are not strictly a human language, but are related to the concept IDs. The namespace for this are The reason to use private language codes for this is that the same advanced features used to translate from one type to another could be abused. It may be easier to show in practice than to say, but for example, since the back and forth of language files could not store complex metadata, if for some reason someone is interested in extracting not a natural language or a code for another persistent identifier, but literally something like a regex, then the person could request to "translate" from one format to the new one. If there is nothing there, then it would return empty for that item of the list of values.
For something that already has a very strict match (which may be different from flattened data on CSV) then this would fit reverse search. But note that depending on how abstract is a concept, it would return a lot of results. I'm not saying that is a bad thing, but the main focus here is more about providing translations with fallback.
I personally love (and was one of the enthusiasts of) the semantic web! By no means am I against it. But the main focus is mostly (at least what I know is feasible to some reasonable extent):
I might talk more about this later. But the target audience tends to be less skilled than system developers (on IT) and more skilled in the area of that data. Already making it easier to translate (or giving a user without web access, but maybe with a desktop app, some way to point to a file and have something explain what each variable means in another language) is an improvement. I think this actually fits exactly the use cases I would be interested in later, but then we're talking about not just translating, but finding patterns, which already means some automation. On a quick look at CCCEV, I would need to see at least data forms that do have part of this, to have an idea if they can fit. @bertvannuffelen if in the next days/weeks you know forms with this, please share! For shorter-term needs, the topics about sensitivity/confidentiality are likely to be related to concepts of Person name (but mostly if the person is a victim, not if it is the police officer, a judge or a coroner). Another thing I know HXL does (but I have no idea how to encode it, or if it should have some sort of special syntax) is variables related to the time of an event (start time, end time, duration, etc.). These are good candidates.
One experience I already have from HXLStandard is that each variable (think of a row listing what the user wants to convert) should not only be analysed alone, but also as part of collections of variables. This is relevant because context matters. And part of this can already be encoded in a numeric pattern. That is why already on the first post, at least the idea of "cluster" with One inspiration about "composing codes" (after the masterpiece of UN m49, original reference) is the current ICD-11 (https://icd.who.int/ct11/Help?state=Release&lang=en), where a health professional can compose concepts via an interface. I'm not saying that such simple codes should go as complex as this, but at least clustering and maybe a reduced number of other relations would be possible. So, yes, this is exactly about mapping properties!
update: about tests on the parsing of numeric codes
For the sake of curiosity, I just started a playground at https://hxltm.etica.ai/codex-simplex-ontologiae.html / https://github.com/EticaAI/hxltm/blob/main/docs/codex-simplex-ontologiae.html. We never use tracking on @EticaAI sites, but maybe on this one we will use browser localStorage so it would keep user edits instead of resetting them. Maybe in a few days I may have a decent proof of concept, so the code part is mostly left to break rules and load the vocabularies! The reason I prefer to wait for a proof of concept is also that trying to add too many features before implementation does not work. ALGOL 68, for example, is a reference on international programming languages, but the quote from Koster on ALGOL 68 is... interesting. Anyway. Continuing.
In addition to vocabularies (which could be encoded to TBX/TMX/etc) it may be possible to get the header of an existing tabular format, such as https://github.com/UNMigration/HTCDS and https://www.interpol.int/content/download/5324/file/Ante%20Mortem%20%28yellow%29%20INTERPOL%20DVI%20Form%20-%20Missing%20Person.pdf, and add it line per line, divided by a string, such as
Case Owner,[12.10.23] First Name,[12.10.45] Second Name,Type of Exploitation,(...)
AM No:,[12.10.23] First name(s):,[12.10.23] Family name:,Date of birth:,Nature of disaster:,(...)
And then, with the same parser that "learned" the previous forms, if the user tries to insert the first row of something new (without any code), it could brute force the most equivalent previous form and then suggest the same codes as before. This obviously will not work very well for forms where there is a lot of repetition, and the input data needs to be already well labeled. As a reference, HXL Standard had something here: https://www.microsoft.com/en-us/research/publication/machine-learning-for-humanitarian-data-tag-prediction-using-the-hxl-standard/. So the idea of suggesting labels for human review would not be really strange.
I can't make a promise, but in theory it is possible to do reverse search (try to discover ideal codes based on previously labeled data). This might be very interesting if eventually projects like HTCDS (or others with some review of terms) could start to get translations. Also, for each concept, it could explain to the user (in their list of preferred languages) what that thing means (aka the definition of the concept based on the term it found). And I say that I can't promise not because it is hard to make a proof of concept of reverse search work, but because, depending on how much data was added (or mislabeled), it would suggest errors to the end user. Add to this that some things are prone to error because they already are not strict, as is the case of "First Name" instead of "given name"@eng-Latn-GB. But, like I mentioned before, one way (without requiring computing resources) to have a better understanding is to not do reverse search term by term, but to first see if the input matches several fields of a previously labeled example. This could actually work because of overfitting. I mean, if a shared resource that others could download already labeled a commonly used dataset, then in practice, the next time a human tries to check what it means, they could get a 100% hit.
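A minimal sketch of the "match against a previously labeled form, then suggest the same codes" idea could look like the following. It is only an illustration under assumptions: the function names (`split_header`, `suggest_codes`), the similarity cutoff of 0.8 and the use of Python's `difflib` are choices made here, not something the playground is known to use.

```python
import difflib
import re

# Codes look like [12.10.23]; anything else in the cell is treated as the label.
CODE_RE = re.compile(r"\[(\d+(?:\.\d+)*)\]")


def split_header(cell):
    """Return (normalized_label, code_or_None) for one header cell."""
    match = CODE_RE.search(cell)
    code = match.group(1) if match else None
    label = CODE_RE.sub("", cell)
    label = re.sub(r"[^\w\s]", "", label).strip().lower()
    return label, code


def suggest_codes(new_row, labeled_rows, cutoff=0.8):
    """Suggest codes for an unlabeled header row from the closest labeled form."""
    forms = [dict(split_header(c) for c in row.split(",")) for row in labeled_rows]
    labels = [split_header(c)[0] for c in new_row.split(",")]

    def closest(label, form):
        hits = difflib.get_close_matches(label, list(form), n=1, cutoff=cutoff)
        return hits[0] if hits else None

    # Pick the whole form that shares the most labels (context matters),
    # instead of searching each term in isolation.
    best = max(forms, key=lambda form: sum(1 for l in labels if closest(l, form)))

    suggestions = {}
    for label in labels:
        hit = closest(label, best)
        suggestions[label] = best[hit] if hit else None
    return suggestions


labeled = ["Case Owner,[12.10.23] First Name,[12.10.45] Second Name,Type of Exploitation"]
print(suggest_codes("case owner,first name,second name,age", labeled))
```

Every suggestion is meant for human review; a wrong label in the shared examples would be reproduced as a wrong suggestion.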
This issue concerns how Core Vocabularies are implemented. So far English is the main language; if there is an intention to support multiple languages, this issue could be taken into account.
As per the Wiki documentation, this proposal is divided into I - submitter name and affiliation; II - portal, service or software product represented or affected; III - clear and concise description of the problem or requirement; IV - proposed solution, if any.
I - submitter name and affiliation
Emerson Rocha, from @EticaAI.
Some context: as part of @HXL-CPLP, the submitter is working both on community translation initiatives for humanitarian use and (because of the lack of usable data standards) on specialized public domain tooling to manage and exchange multilingual terminology (so different lexicographers can compile results, without centralization).
II - portal, service or software product represented or affected
III - clear and concise description of the problem or requirement
The submitter suggests some explicit way to reconcile data from different files without the need to directly use language-dependent head terms. One common way would be to create short numeric identifiers and separators for such numbers.
Some of the numbers are related to the CPV itself, but some presented here use a syntax that works to explain relationships between the terms on each dataset implementation.
Use case 1: interoperability for conciliate translations of core vocabularies
The most basic use case would be to link translations which are not already part of the 24 European Languages. When preparing translations/terminology, some concept-like identifier is necessary:
<termEntry id="[conceptum-id]"> on a TermBase eXchange (TBX)
<tu tuid="[conceptum-id]"> on a Translation Memory eXchange (TMX)
<unit id="[conceptum-id]"> on XML Localization Interchange File Format (XLIFF)
By looking at the documentation of SEMICeu, it could be possible to make such an identifier with something such as Person:alternative name (SEMICeu::Person:alternative name) or Core-Person-Vocabulary:Person:alternative name (SEMICeu::Core-Person-Vocabulary:Person:alternative name).
Why this writing-system non-neutral approach is not as perfect for multilingual terminology
Using a concept identifier such as Person:alternative name (23 characters) or SEMICeu::Core-Person-Vocabulary:Person:alternative name (55 characters) is not a problem to use with technology. We used something like this on HXL-CPLP-Vocab_Auxilium-Humanitarium-API/schemam-un-htcds.tm.hxl since the reference work was already using this.
From the point of view of whoever manually creates them (without any tooling), this is easier to manage than defining numbers. But numbers:
However, Core-Person-Vocabulary (CPV) currently does not have any explicit strategy for concept identifiers, so not even Person:alternative name (instead of some planned 12.34) exists. For the translations themselves, not using neutral codes is less of an issue for lexicographers and more one of internal management: it requires more discipline from creators.
Use case 2: very compact meaning together with human labels
Another use case is to have some way to express the concept together with labels for end user documentation, such as UML diagrams.
eng-Latn:
[12] Person
[12.34] alternative name
[14] Legal Entity
[14.56] alternative name
Namespaced compact numbers could at first be put manually on diagrams. After all terminology language variants (which are more than average translations) have been stored, the numbers could eventually be reused to create diagrams or other documentation, replacing placeholders with their actual variant.
prs-Arab
//جامعه قابلیت همکاری معنایی//
[12] //شخص//
[12.34] //نام جایگزین//
[14] //نهاد قانونی//
[14.56] //نام جایگزین//
I'm not sure which file formats are used by SEMICeu to create the UMLs, but the JSON-LDs could be templated. Unlike XML, there is no JSON strategy to localize, so the only way would be pre-processing entire files based on single arrival with all terminology language variants.
One difference between this use case and the others is that the end result is to be interpreted visually by readers. The codes may still be relevant to simplify how to generate the final format based on template files. Also note that despite using Unicode Digits (the term used in Unicode for numeric ranges without explicit formatting), each template generation could use any valid numeric system, which means [١٢.٣٤] (Eastern Arabic numerals) is the same as [12.34] (Western Arabic numerals); see https://en.wikipedia.org/wiki/List_of_numeral_systems. Unlike other parts of writing systems, the Western Arabic numerals are quite popular, so the likelihood of someone knowing (or learning) [12.34] is plausible, whereas even a direct symbol-by-symbol conversion between "alphabets" is not viable.
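As a small illustration of why the numeral system used for display does not matter to a parser, the sketch below maps any Unicode decimal digit to its ASCII equivalent before the code is compared. The function name `normalize_digits` is hypothetical, and the sketch assumes the separator stays a plain `.` in every script.

```python
import unicodedata


def normalize_digits(code):
    """Replace every Unicode decimal digit with its ASCII equivalent.

    Non-digit characters (brackets, the "." separator) are kept as they are.
    """
    return "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in code
    )


# Eastern Arabic digits one, two, three, four around a "." separator.
eastern = "[\u0661\u0662.\u0663\u0664]"
print(normalize_digits(eastern) == "[12.34]")
```

With this kind of normalization, a template can render codes in the reader's preferred digits while the stored terminology keeps a single canonical form.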
Use case 3: machine parsable coding systems together with labels
Different from case 2, the use case here is generation of output files which do not support complex meanings but we still bruteforce them as short codes. The examples will use CSV headings and even more restrictive places, such as URLs slugs, as examples, but all other file formats are easier to work with.
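A rough sketch of a generic extractor for such short codes is given here, covering both a bracketed code at the start or end of a heading and a slug-style suffix. The function name `extract_code`, the accepted slug delimiters (`--` and `_`, taken from the slug examples of this use case) and the choice to map `-` back to `.` are assumptions for illustration, not a defined syntax.

```python
import re

# A bracketed code is accepted only as a prefix or a suffix, never mid-string.
BRACKET_RE = re.compile(r"^\s*\[(\d+(?:\.\d+)*)\]|\[(\d+(?:\.\d+)*)\]\s*$")
# In slugs, "." is not available, so levels are assumed to be joined by "-"
# and the code is assumed to follow a "--" or "_" delimiter.
SLUG_RE = re.compile(r"(?:--|_)(\d+(?:-\d+)*)$")


def extract_code(text):
    """Return the dotted code found in a heading or slug, or None."""
    match = BRACKET_RE.search(text)
    if match:
        return match.group(1) or match.group(2)
    match = SLUG_RE.search(text)
    if match:
        return match.group(1).replace("-", ".")
    return None


for sample in (
    "alternative name [12.34]",
    "[12.34] alternative name",
    "alternative-name--12-34",
    "lawyer_12-34",
    "a heading without any code",
):
    print(sample, "->", extract_code(sample))
```

The extractor deliberately ignores the human label: a misspelled or differently worded term still yields the same code, and a string without a code simply returns `None`.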
In the following CSV example, the decision was made to use a strategy similar to the visual UML one, using [] to enclose codes. Either prefixed or suffixed codes could be used (but not in the middle of a string). The example, which uses the exact English term (although in real use cases people will always customize it, so most labels already are not reliable), uses the suffix strategy (end of term):
CSV, eng-Latn:
CSV, prs-Arab:
Sometimes, even if CSV allows spaces and a full range of delimiters, people may want to create headers with a more limited range of characters in addition to the letters/symbols of each language. This is common for URL slugs (or HTML anchors).
URL-slug, eng-Latn:
alternative-name--12-34 (double delimiter)
altrrnative-nane_12-34 (different delimiter, uses reference term with misspelling)
lawyer_12-34 (different delimiter, totally different language term)
URL-slug, mul-Zyyy:
aliud-nomen--12-34 (double delimiter, any language)
备用名称--12-34 (double delimiter, any language)
மாற்று பெயர்--12-34 (double delimiter, any language)
More examples could be given, but the argument here is that as long as it is possible to have a generic function extract the code from a string (which can have no code, or have codes for unknown terms), it is worth allowing such flexibility. Also, most of the time these codes will be used (if not with a definition) with one or more terms. There is some room for human error (like writing 11-34 instead of 12-34), but it would be better to improve generic, more universal tools that allow users to "check what the dataset means" than to try to add human letters, since even short mnemonic codes can also have human errors.
Use case 4: know which type of thing (like type of person) an implementation uses (unsolved)
Very often real world datasets which save data in tabular format (and not rare, on other formats that allow complex schemas) actually are not talking about the same thing, even if labels may be near similar. This means the encoding could actually reveal this type of relation.
Let me use as example the Human Trafficking Case Data Standard (HTCDS), https://github.com/UNMigration/HTCDS, as toy example.
In this case, most of the fields are for a survivor (of human trafficking), but there are references to another person who manages the case. The HXL Standard (see https://hxlstandard.org/) tends to recommend adding hashtags only to the most important fields, so for simpler cases such as the HTCDS, this would mean simply not adding codes to Case Owner, which could already be ok for minimal interoperability. Using CPV and some other vocabularies, it would already be possible to know what is about a person. But if HTCDS itself were encoded and terminologically translated for very specialized fields such as the Type of Exploitation (which is very, very specific), then the way codes for something like Core-Person-Vocabulary are done could allow not just reuse by other standards, but also compact ways for them to create terminological translations. However, I still think the code numbers should be short, because this type of work of scaling up translations and publishing for general use with an open license is uncommon.
But going back to Core-Person-Vocabulary, the main problem is the following: it is still an open issue to know which "types of person" a namespace such as the example 12.* would be talking about, with special focus on a limited number of types of persons that often appear in data exchange related to human rights violations. This would generate translations beyond "Person 1 alternative name", such as "Lawyer 1 alternative name", "Judge 1 alternative name", (...). This type of generalization could also apply to Core-Business-Vocabulary to specify a hospital or police department. In practice, most "grammar" tends to be reusable from Core-Person-Vocabulary, but some strategy to be more specific could allow both better translations and very optimized conversion of data from different places. Automated processing tends to be quite important, since this could be used to detect outliers when the human rights defenders could be targeted if they try to explain to outsiders.
Use case 4: know which type of thing an implementation uses by groups
While the previous use case is more complex (not only in deciding the pattern of coding, but it may also need planning on calling up translations), there is something which could be done in a generic way: an additional separator, for example ~, which would allow specifying some sort of explicit grouping. Without such a separator, some type of default group could be assumed, such as ~0.
The separation of alternative names of person and business already is part of the Core Vocabulary semantics. But a single person can have more than one alternative name, or the dataset can be talking about two different persons. Another case is when there are several persons and several businesses, and a business is related to all persons (maybe this would be implicit with the default ~0, like the equivalent of a broadcast address, https://en.wikipedia.org/wiki/Broadcast_address, on IP addresses), but if numeric indices are explicitly used, this could make the relationship explicit.
So:
Attacker name [12.34~1]
Survivor name [12.34~2]
Company name [14.56]
This means that there are two different persons, but the business is related to all persons: Company name [14.56] is the same as Company name [14.56~0].
But
Attacker name [12.34~1]
Survivor name [12.34~2]
Company name [14.56~2]
In this case, the business is related to only person 2.
The numeric approach is quite compact (and may be generic enough to have its semantics used for other types of Vocabularies). But note that when something as complex as this is necessary, very likely either users would have some help to implement the codes or the codes would be handcrafted (like assigning codes for a data standard such as the HTCDS).
What about the language translations for complex cases? Well, depending on the syntax, the numeric encoding could be more complex (to the point of requiring a full grammar specialized for each case, like Grammatical Framework https://www.grammaticalframework.org/), but if the focus is more on making things interoperable by machines, instead of "translating" to a natural language, it could create new headings that actually are friendly to be processed in something else (like HXL hashtags https://hxlstandard.org/standard/1-1final/dictionary/). I will not over-complicate here, but the point is that it could be possible to convert CSV to XML.
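To make the grouping semantics above concrete, here is a minimal sketch of parsing a heading with an optional `~group` suffix and checking whether two headings are related. The names `Heading`, `parse_heading` and `related` are hypothetical, and treating group `0` as "related to every group" is one possible reading of the broadcast analogy.

```python
import re
from typing import NamedTuple


class Heading(NamedTuple):
    label: str
    code: str
    group: int


# Label, then a suffixed [code] with an optional ~group; group defaults to 0.
HEADING_RE = re.compile(
    r"^(?P<label>.*?)\s*\[(?P<code>\d+(?:\.\d+)*)(?:~(?P<group>\d+))?\]$"
)


def parse_heading(text):
    """Parse 'Label [12.34~1]' into a Heading, or return None if no code."""
    match = HEADING_RE.match(text.strip())
    if match is None:
        return None
    return Heading(match["label"], match["code"], int(match["group"] or 0))


def related(a, b):
    """Group 0 acts like a broadcast: it relates to every other group."""
    return a.group == 0 or b.group == 0 or a.group == b.group


attacker = parse_heading("Attacker name [12.34~1]")
survivor = parse_heading("Survivor name [12.34~2]")
company_all = parse_heading("Company name [14.56]")
company_two = parse_heading("Company name [14.56~2]")

print(related(company_all, attacker), related(company_two, attacker))
```

In this reading, `Company name [14.56]` relates to both persons, while `Company name [14.56~2]` relates only to the survivor.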
Use case 5: nested levels (including not used on final Core Vocabularies)
While the UML diagrams put all data from persons on the same level, with at least one numeric code, one potentially interesting feature would be to group all terms related to the name of the person under a namespace such as [12.10] //Any type of person name//. This approach would mean that instead of [12.34] alternative name, the code would be [12.10.34] alternative name.
One downside of this approach is that ideally a shadow term not in CPV would also need to be added to the translation tables. Without this, a code not known in the multilingual terminology, such as [12.10.99] nicknames in the workplace, would fall back directly to the concept [12] Person instead of [12.10] //Any type of person name//.
In general, the idea of having some type of nested levels is to allow some type of granularity. If someone is using a more specialized field, then the best baseline would be the closest concept. So if "nicknames in the workplace" is a type of [12.10.34] alternative name, the best place for a person to create a nested concept would be [12.10.34.99] nicknames in the workplace, since in the worst case the parser would assume it is a specialized type of [12.10.34] alternative name and the generic translation in eng-Latn could be [12.10.34.99] alternative name 99.
Use case 6: strategy for private namespace when using numeric codes
This is the last use case (and is relevant since codes could be short, but not too short). My suggestion is to avoid using numbers lower than 10 even if generic parsers could work if they find translations with such codes as exact matches on the file with all terminologies.
One downside of the advantage of Use case 5 (the one about allowing any subdivision at any level) is that it reduces the chances of catching user error. Actually, even if the user could see the translation with a hint that this was not a strict match, since such conversions would mostly be automated, it would be better to allow scripts that convert between formats to return an error and break the automation pipeline.
In other words: if someone tries to generate translations (or convert between formats) and code patterns are found, but there is not even the concept for that term (which is different from having the concept but no exact translation for that language; for this, users tend to like allowing some auxiliary language as fallback), then even if a script could know how to generate a translation from an upper level, it will raise an error unless the user asks to tolerate this.
Then, what if users want to publish datasets without the immediate intent of releasing the terminological translations, without breaking others' data pipelines? My suggestion to soften this strictness on top of use case 5 is to allow the use of private codes. My current suggestion would be the number 5 (between 0 and 9) to start a private prefix. So instead of [12.10.34.99] nickname from workplace, an [12.10.34.5.99] nickname from workplace would not raise errors. But to allow other potential special uses (not just private use), the entire range from 0 to 9 could have special semantics for tools trying to read all full concept codes and deciding which one is the ideal one when a code is not found as part of some concept code. For example, if a user edited some XML or spreadsheet with all concepts and manually added the exact match, the processor would not need any advanced checking.
So, in practice, the use of any level of codes lower than 10 could still work, but when something is not found and there is a number lower than 10, the fallback strategy can be specialized.
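The lookup rules from Use cases 5 and 6 can be sketched roughly as follows: an exact match wins; an unknown code raises an error unless the user asks to tolerate it or the code contains the private marker; in those cases the closest known ancestor is used as a generic label. The names `resolve`, `UnknownConceptError` and `PRIVATE_MARKER` are hypothetical, and dropping the marker from the generated label is an assumption made here.

```python
PRIVATE_MARKER = "5"  # suggested private-use level; other values below 10 stay reserved


class UnknownConceptError(LookupError):
    """Raised when a code has no concept and fallback was not allowed."""


def resolve(code, terms, tolerate=False):
    """Return the label for a dotted code, falling back to the closest ancestor."""
    if code in terms:
        return terms[code]

    parts = code.split(".")
    is_private = PRIVATE_MARKER in parts[1:]
    if not (is_private or tolerate):
        # Break the pipeline instead of silently producing a generic label.
        raise UnknownConceptError(code)

    # Walk up: 12.10.34.99 -> 12.10.34 -> 12.10 -> 12
    for end in range(len(parts) - 1, 0, -1):
        ancestor = ".".join(parts[:end])
        if ancestor in terms:
            rest = [p for p in parts[end:] if p != PRIVATE_MARKER]
            return " ".join([terms[ancestor], *rest])
    raise UnknownConceptError(code)


terms_eng = {
    "12": "Person",
    "12.10": "Any type of person name",
    "12.10.34": "alternative name",
}

print(resolve("12.10.34.5.99", terms_eng))
print(resolve("12.10.34.99", terms_eng, tolerate=True))
```

A non-private unknown code such as `12.10.34.99` raises `UnknownConceptError` by default, which is the behavior that stops an automated conversion instead of publishing a guessed translation.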
example of result of all use cases
eng-Latn:
[12] Person
[12.10] //Any type of person name//
[12.10.34] alternative name
[12.10.34.5.99] nickname from workplace
[14] Legal Entity
[14.56] alternative name
prs-Arab
//جامعه قابلیت همکاری معنایی//
[12] //شخص//
[12.10] //هر نوع نام شخصی//
[12.10.34] //نام جایگزین//
[12.10.34.5.99] //نام مستعار از محل کار//
[14] //نهاد قانونی//
[14.56] //نام جایگزین//
Please note that, with the exception of the suggestion of restricting the use of numbers lower than 10 and the suggestion of 5 for private codes, all numbers used here are examples. This means that if CPV was the first Core Vocabulary (to the knowledge of whoever distributes the numbers at the moment), it would have preference for the lowest number assigned by SEMICeu if no one is sure. Also, note that numbers don't need to be sequential. In this example [12.34] alternative name was moved to [12.10.34] alternative name and the submitter decided to keep the old local number, but it could be renamed to something different, with a lower number. The point is that it is the responsibility of SEMICeu to decide who is responsible for [12] Person, and then it is the responsibility of [12] Person to explain the grammar of every namespace under it.
IV - proposed solution, if any
That's it: it's about implementing concept codes with our own internal implicit taxonomy.
Note that for use cases such as emergency response, quite often there would be intermediate layers without a term language, and the initial round would be unlikely to be perfect. So the best approach would be to tolerate deprecated codes or any other feature that allows delivery of the first round as fast as possible (think hours, not weeks).
Tasks the submitter is willing to help
Proof of concepts to validate the implicit taxonomy on such codes (code)
The submitter, as part of previous work on @HXL-CPLP, and as a user of @HXLStandard from a majority non-English speaker community, already has experience on the good and bad parts of how to deal with multilingual content. The major part of the already public domain code used on https://hapi.etica.ai is being more generalized to a point where even a non-XML format such as average spreadsheets can be used as a source to extract what each concept means for any language available.
So, extracting and converting back and forth language versions already is not a problem for us (even if most open source and commercial tools can't cope well with such tasks when more than 2 languages are involved; that's why we had to resort to implement tools to solve our own translation management issues to make not just Portuguese versions viable).
But the task we're talking about here is some generic proof of concept of either such numeric codes or at least some common patterns users would do (like analyzing each header) and then return the most appropriate translation.
The idea is that such algorithms be simpler (but without limiting levels) and work as some sort of grammar, while the vocabulary is fully on multilingual files. The reason for them being simpler is that if they are too complicated (like Unicode CLDR is) it would be hard to port the functions for different programming languages.
I think proof of concept both on Python (for more heavy work) and JavaScript (potentially to be referenced later to explain for users online what something means) would be a good baseline.
Also, something simpler makes it easier to create user documentation. This means less documentation to translate.
Compile Core Vocabularies on some multilingual format which SEMICeu would likely prefer
We from @HXL-CPLP use HXL (more strictly http://hxltm.etica.ai) as the working format, mostly because it allows direct use from collaborative online spreadsheets and lets users exchange CSV or edit with Excel. But for projects like CPV, I think an XML format like TBX is the closest to a strict standard which could be edited by hand while we could still export to any other file.
The point here is that a more realistic proof of concept (and actually a step before allowing direct reuse with other vocabularies) would require that all vocabularies from SEMICeu already have a single source of truth.
If SEMICeu splits the vocabularies per project, then maybe I could try some way, either by providing shell scripts or by explaining how to use GitHub Actions, to have the final work compiled. Since things are already automated, the point here is to reduce human error as much as possible, something actually not uncommon. It may not be as relevant for SEMICeu, but it is for smaller initiatives.
Localized Schemas/Documentation/Scripts/etc beyond English
XML/RDF standards often allow content to be multilingual (even if this is not widely used in practice), but most schema and documentation formats don't.
This means that, for example, JSON-LDs, or formats viable for direct usage such as Table Schema (frictionless data) and OpenAPI, are strictly monolingual. You simply cannot add language variants because the standard doesn't permit it. So the use of English in computing is still imposed by tooling even when volunteers speak multiple languages.
However, we from HXL-CPLP cope with this by making multilingual files directly usable to extract the target language (with an optional auxiliary language in case of missing translations, something common when drafting new things) and convert templates into the final result. The template engine used is the popular Liquid (used by Shopify, GitHub Pages, etc.) with some extensions. This means that, to scale up to all new translations, after having a multilingual source of truth like TBX (or divided ones, but they are merged in the previous step), each template can be converted to target languages.
We can help you with that.
The additional complexity compared to just doing it in one language manually is higher. The breaking point may be around 3 languages. This also becomes an advantage when having several files (even in the same language) where the values are the same and a human can copy paste wrong or forget something.
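As a rough illustration of the "extract target language with optional auxiliary language, then fill templates" step, the sketch below uses Python's standard `string.Template` purely as a stand-in for Liquid. The function name `localize`, the placeholder names (`c12`, `c12_34`) and the sample terms are hypothetical and only serve the example.

```python
from string import Template


def localize(template_text, terms, lang, fallback=None):
    """Fill ${placeholders} with the term in `lang`, else in `fallback`.

    `terms` maps a placeholder name to a dict of language code -> term.
    Placeholders with no usable term are left untouched for later review.
    """
    values = {}
    for placeholder, variants in terms.items():
        term = variants.get(lang)
        if term is None and fallback is not None:
            term = variants.get(fallback)
        if term is not None:
            values[placeholder] = term
    return Template(template_text).safe_substitute(values)


# Hypothetical single source of truth, already merged from the terminology files.
terms = {
    "c12": {"eng-Latn": "Person", "por-Latn": "Pessoa"},
    "c12_34": {"eng-Latn": "alternative name"},
}

template = "[12] ${c12}\n[12.34] ${c12_34}\n"
print(localize(template, terms, "por-Latn", fallback="eng-Latn"))
```

The same template can then be rendered once per target language, which is where the approach starts to pay off compared with editing each language version by hand.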
Non-technical request from submitter
Explicitly dedicate the numeric codes to public domain (or equivalent) and ask others to do the same
First, some context. Most tools made by an average open source volunteer (like embedding ISO codes in software) do it freely. But in more academic, governmental and (this is my focus) humanitarian areas, such codes will not be republished. Licensing is taken so seriously that no exception like "people are dying because information managers (IM) are using old code tables" would allow lawyers to take the risk of publishing compiled tables on a place such as HDX (https://data.humdata.org/). And even a humanitarian IM who would have the skills to know what to do will have no time to download things. Everything is urgent all the time.
So, my non-technical request is that if such numeric codes are implemented, please consider putting them explicitly in the Public Domain. And then, if others link their vocabularies, ask them that the numeric coding taxonomy also be public domain equivalent. Please note I'm NOT even asking for the terms or contextual information (like the TBX from IATE, https://iate.europa.eu/legal-notice) but for the codes.
This approach both allows other users to reuse the numbers from other regions without needing approval by the lawyers of their organization and (in the case of more specialized vocabularies for human rights use in a European context, which can themselves become targets) ensures that even when the original source is removed from online access, at least the coding system republished by others cannot be blocked due to license issues. The explanation of what the codes mean is complicated. When strange things happen, it is already better for outsiders to create non-licensed equivalents.
Another point of the codings being open in such a way is to avoid breaking software (like interlinks between vocabularies) and also that providers (SEMICeu is at European level, but think local level) can republish a minimal interoperable set. Please note that one major use case is users (not just human rights defenders) who cannot count on help from IT personnel, and not because the work is hard: their use cases are so specialized that they cannot be predicted by outsiders. I'm not saying that more centralized approaches are not useful, but in the field, if we manage to make the main logic portable even as Excel macros or GSheet scripts, this is relevant. But without some way for them to download already compiled translations in a single file, they would be too busy.