2020-12-15: Japanese glosario page is not alphabetical order #254

Open
masamiy opened this issue Dec 15, 2020 · 9 comments
Assignees
Labels
documentation Improvements or additions to documentation enhancement New feature or request lang: ja issues and PR for Japanese entries

Comments

@masamiy
Contributor

masamiy commented Dec 15, 2020

https://carpentries.github.io/glosario/ja/ lists Japanese entries based on the first character of each entry. This means the entries are not ordered by the Japanese alphabet (nor by the English alphabet), but by raw character. The last entry, 'function', should be at the top of the current list, since it is read as 'kansuu', if terms are ordered by the Japanese alphabet.
As there are 46+ characters in the Japanese alphabet, I feel we need some indexing strategy.

@masamiy masamiy added question Further information is requested bug Something isn't working lang: ja issues and PR for Japanese entries labels Dec 15, 2020
@baileythegreen
Contributor

@masamiy

The order of the entries is determined by a sort function on line 16 of _includes/glossary.html, which operates on individual characters. It may be that for languages such as Japanese we need to find a different solution entirely. The sort function currently being used is a Liquid filter, and I very much doubt Liquid has a different one that will sort Japanese correctly. I am familiar with the website infrastructure and the code, but I don't know the Japanese alphabet, so this isn't something I can fix on my own.

I can see two options for solving it:

  1. We find or write a function in something like Ruby or Python to sort Japanese (and any other language that has this problem), based on an input list of the alphabet, if need be.
  2. We move to a different system for storing definitions other than a YAML file so that the sorting can take place at a slightly different step. An example would be an SQLite database which can export its contents, or part of its contents, as a YAML or other config-type file. This involves more changes to the infrastructure, though.
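As a rough illustration of option 1, here is a minimal Python sketch of a sort driven by an input list of the alphabet. The function name `make_sort_key` and the sample data are hypothetical and not taken from the Glosario code. The sample alphabet is the start of the iroha ordering, chosen only because it differs from code point order, which makes it easy to see that the result comes from the supplied list.

```python
def make_sort_key(alphabet):
    """Build a key function that ranks characters by their position in `alphabet`."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    unknown = len(alphabet)  # characters missing from the alphabet sort last

    def key(term):
        # Compare terms character by character using the supplied ranking;
        # the character itself breaks ties between unknown characters.
        return [(rank.get(ch, unknown), ch) for ch in term]

    return key


# Hypothetical, deliberately incomplete alphabet (illustration only).
SAMPLE_ALPHABET = "いろはにほへと"

terms = ["とい", "はと", "いろ", "ろは"]
ordered = sorted(terms, key=make_sort_key(SAMPLE_ALPHABET))
print(ordered)  # ['いろ', 'ろは', 'はと', 'とい']
```

A real implementation would need a complete alphabet per language and a decision on how to treat characters outside it; here they simply sort last.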

Perhaps @fmichonneau or @gvwilson will have another idea?

@fmichonneau
Contributor

It looks like option 1 is going to be the way to go.
From a quick search, I saw MeCab being mentioned regularly, but that's Japanese-specific and wouldn't work for Arabic, Hebrew, Amharic, etc.
From my limited understanding of this, I think the ICU library would order the characters correctly. In R, it's available through the stringi/stringr packages; in Python, through PyICU.
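A minimal sketch of what locale-aware sorting with PyICU could look like, assuming PyICU is installed. The helper name `sort_terms` and the sample terms are hypothetical. If PyICU is missing, the sketch falls back to plain code point order so that it still runs.

```python
try:
    import icu  # PyICU, a third-party binding to the ICU library
except ImportError:
    icu = None


def sort_terms(terms, locale="ja"):
    """Sort terms using ICU collation rules for `locale` when PyICU is available."""
    if icu is None:
        # Fallback: plain code point order (the behaviour this issue is about).
        return sorted(terms)
    collator = icu.Collator.createInstance(icu.Locale(locale))
    # getSortKey returns a byte string that compares in collation order.
    return sorted(terms, key=collator.getSortKey)


terms = ["さくら", "あお", "かわ"]
ordered = sort_terms(terms)
print(ordered)
```

The same function could be called with another locale code (for example `"ar"` or `"he"`), which is the main appeal of ICU over a Japanese-only tool.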

@baileythegreen
Contributor

I think option 1 is certainly easier to implement in the short term. I can take a stab at writing Python code to do this, though I may need someone to verify the output in those languages.

If I do this, unless someone has an objection, I'll probably try to remove the sort logic from _includes/glossary.html entirely and use one script to do all alphabetising, rather than have it happen in different places based on the language in question.

@masamiy
Contributor Author

masamiy commented Dec 15, 2020

Hi @baileythegreen @fmichonneau, thank you for your attention and suggestions. A new sort logic will definitely help for non-alphabetic languages. I am happy to check the Japanese output. Please let me know if there is anything I can help with.

@baileythegreen
Contributor

@masamiy It'll probably take me a couple of days to get to it because I have some deadlines coming up, but I'll tag you when I do, unless @fmichonneau beats me to it.

@masamiy
Contributor Author

masamiy commented Dec 16, 2020

Take your time :)

@TomKellyGenetics
Contributor

TomKellyGenetics commented Dec 16, 2020

@masamiy I think the issue is a mixture of Romaji, Katakana, and Kanji in the terms defined. It's sorting them correctly (as expected for this).
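To illustrate the point about mixed scripts, here is a small Python sketch with hypothetical sample terms. It assumes the site's sort compares strings by character code, as Python's `sorted` does. Under that assumption, Latin letters come first, then Hiragana, then Katakana, then Kanji, regardless of how a term is read.

```python
# Terms written in different scripts (hypothetical examples).
terms = ["関数", "アルゴリズム", "かんすう", "API"]

# sorted() compares strings code point by code point.
ordered = sorted(terms)

for term in ordered:
    # Show the code point of the first character, which decides the order here.
    print(term, hex(ord(term[0])))
```

In this sketch 関数 lands last because its first character has the highest code point, even though its reading, かんすう, starts with か.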

I see two solutions:

  1. Give the terms in Hiragana first and it will sort by them. This could make searching them difficult (do the packages support partial matches?).

  2. Write a custom script that sorts differently depending on the language (as proposed above). There should be existing solutions for sorting Japanese characters, but I think it's working as expected now.

Either way, furigana (kanji readings) would need to be supported in order to sort by them, and would need to be added for each entry (for option No. 2 this would need a new slot, I think).

You cannot parse furigana from Kanji automatically (although some databases already exist). I think it is easier to specify the intended reading for each entry.
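A minimal sketch of sorting by an explicitly specified reading, in Python. The `reading` field is hypothetical and is not part of the current Glosario entry format; the entries are made-up examples. Katakana readings are converted to Hiragana first so that both scripts sort together, and Hiragana code point order is used as an approximation of the usual kana order.

```python
def kata_to_hira(text):
    """Convert Katakana to Hiragana so both scripts sort together."""
    # Katakana U+30A1..U+30F6 sit exactly 0x60 above their Hiragana counterparts.
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


# Hypothetical entries: each term carries its intended reading.
entries = [
    {"term": "関数", "reading": "かんすう"},
    {"term": "アルゴリズム", "reading": "アルゴリズム"},
    {"term": "変数", "reading": "へんすう"},
]

ordered = sorted(entries, key=lambda e: kata_to_hira(e["reading"]))
terms_in_order = [e["term"] for e in ordered]
print(terms_in_order)  # ['アルゴリズム', '関数', '変数']
```

Plain code point order on the readings is only an approximation; a collation library could be applied to the readings instead if finer rules are needed.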

@TomKellyGenetics
Contributor

TomKellyGenetics commented Dec 16, 2020

Regarding the order of the entries, the languages on the homepage may need to be changed as well (this is done manually as I understand it).

Sorry, this may need its own issue. (See #259)

@froggleston
Contributor

@TomKellyGenetics @masamiy @baileythegreen This has taken a while to address, but please check the output on the new Glosario site to raise any sorting issues that still need addressing!

@froggleston froggleston self-assigned this Sep 18, 2024
@froggleston froggleston added documentation Improvements or additions to documentation enhancement New feature or request and removed bug Something isn't working question Further information is requested labels Sep 18, 2024
5 participants