General discussion #16
I am fine with this issue being used as a general discussion. 20% is a great improvement! Thanks for working on it. I would happily help, but right now I am not well equipped with free time :(
20% was relatively easy - presage once in a while asks for the sum of counts over all n-grams. By caching it, you avoid a full database scan. I'll have to look at the TODO and see the earlier plans.
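The caching idea above can be sketched roughly as follows. This is an illustrative stand-in, not Presage's actual classes or API: the point is just that the expensive full scan runs once and is invalidated on writes.

```python
# Hypothetical sketch: cache the total n-gram count instead of
# re-scanning the whole store every time it is requested.
class CountingStore:
    def __init__(self, counts):
        self.counts = dict(counts)   # ngram tuple -> count
        self._total = None           # cached sum, invalidated on update
        self.scans = 0               # how many full scans were performed

    def total(self):
        if self._total is None:      # full scan only when cache is cold
            self.scans += 1
            self._total = sum(self.counts.values())
        return self._total

    def add(self, ngram, n=1):
        self.counts[ngram] = self.counts.get(ngram, 0) + n
        self._total = None           # invalidate cache on write

store = CountingStore({("the",): 10, ("the", "cat"): 3})
assert store.total() == 13 and store.total() == 13
assert store.scans == 1              # second call served from cache
```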
I have a working prototype which uses MARISA together with a counts file as the n-gram database. The speed on my test case is about 3.8x faster than the original Presage version. The total benchmark run went from 90s to 25s, with 15s spent in UserSmoothedNgramPredictor. In addition to being faster, the prototype MARISA-based database is also smaller: 3.4MB vs 20MB for the same data in SQLite format. I will work on it a bit more now to change a few things in the code before merging into the master branch. Then the import tools will be needed, and after that we could test it on devices as well.
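The layout described above (trie for the n-gram strings, separate flat file for the counts) can be emulated with stdlib pieces. This is not Presage's code or the marisa-trie API: a sorted list plus bisect stands in for the trie's ordered key-to-id mapping, and a packed byte string stands in for the counts file.

```python
import bisect
import struct

# Stand-in for the MARISA-based layout: the trie maps each n-gram string
# to a dense integer id; a separate packed counts blob stores one count
# per id.  All names and the file layout here are illustrative.
keys = sorted(["the", "the cat", "the cat sat", "this"])
counts_blob = struct.pack("<4I", 10, 3, 1, 4)   # one uint32 count per key id

def count(ngram):
    i = bisect.bisect_left(keys, ngram)
    if i < len(keys) and keys[i] == ngram:
        return struct.unpack_from("<I", counts_blob, 4 * i)[0]
    return 0

def complete(prefix):
    """Prefix completion, the trie operation the predictor relies on."""
    i = bisect.bisect_left(keys, prefix)
    out = []
    while i < len(keys) and keys[i].startswith(prefix):
        out.append(keys[i])
        i += 1
    return out

assert count("the cat") == 3
assert complete("the c") == ["the cat", "the cat sat"]
```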
It sounds awesome!
And the first test results on device are very good! I generated an Estonian database using 1.7GB of text, yielding 3247094 n-grams. The size of that database is 24MB. When used with the marisa-branch SFOS predictor, it felt very snappy and predicted quite well. I will have to write docs on how to package dictionaries for SFOS and then merge with the SFOS predictor changes to make MARISA the default predictor. As for how to generate the MARISA database itself, this is documented in https://github.com/rinigus/presage/blob/master/README.md
Sounds great! Only one thing came to my mind: what about the memory usage?
Good question! maliit-server memory consumption with the Estonian predictor is: 318MB VIRT / 48MB RES / 24MB shared. The database is split into a MARISA trie (that may be loaded into RAM, 13MB) and a counts file (11MB) which is mmap'ed into RAM. From these numbers, I would say consumption is reasonable.
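The mmap'ed counts file is what keeps resident memory reasonable: the kernel pages the data in on demand and can drop clean pages under pressure, instead of the process holding a heap copy. A minimal stdlib sketch (the file layout is illustrative, not Presage's actual format):

```python
import mmap
import os
import struct
import tempfile

# Write a toy "counts file": a flat array of packed uint32 values,
# one per n-gram id.  This layout is an assumption for illustration.
path = os.path.join(tempfile.mkdtemp(), "counts.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<5I", 10, 3, 1, 4, 7))

# mmap it read-only and index directly into the mapping; no heap copy
# of the whole file is made, which is why RES stays low.
with open(path, "rb") as f:
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def count(ngram_id):
    return struct.unpack_from("<I", buf, 4 * ngram_id)[0]

assert count(0) == 10 and count(3) == 4
```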
PS: I have added stats collection for maliit-server to the collectd configuration. This will let us monitor maliit-server over a longer time, see if there are any issues with it, and assess its RAM consumption better.
Since I monitor maliit-server with collectd, I can report that during the last 2 days the max RSS memory usage was 60.6MB. That is with this Estonian database.
Sounds great! Do you have any hints for release plans? Before the release I would like to sort out at least this issue: https://github.com/rinigus/presage/issues/11 And I have received an email reporting that someone could not type past a certain point because the keyboard always crashed.
I think we should be getting close to the next release. However, it will be clearer after you have tested the Marisa version. I don't know how to attach the SDK's gdb to the presage library; I presume that you need to attach it to maliit-server. What you could do, though, is get that email text and test whether presage is crashing on desktop. As for blocking issues:
Anything else is a bonus. The release doesn't stop the development. I would like to address #9 and, after that, https://github.com/rinigus/presage/issues/4. But we all have our priorities.
I have adopted NLTK for generation of the n-gram database for English. It seems to work quite well and I can get contracted words (like it's) as predictions. It seems to me that we will have to adjust tokenization in Presage (or Maliit?) for each language separately. Such adjustments shouldn't be too hard, I just would have to find the right knobs in the code. But it will be clearer when feedback starts floating in regarding errors in predictions. As it is, on my part, I would be happy with a release of the current state. That would allow users to start testing and give feedback. Let me know how your testing goes and when you are ready. I'll probably also set up a language support package at OBS to let users get development versions easily.
And an update, I managed to get English and Estonian keyboards and language support into OBS as well. So, for testing, you could just install English layout and language support together with Maliit plugin from http://repo.merproject.org/obs/home:/rinigus:/keyboard/sailfish_latest_armv7hl/ |
@rinigus Great! I am also still looking at the language model and tokenization per language. |
@ljo: nice! In which context are you looking into LM and tokenization? To generate the n-gram database, or to use it in presage? For n-gram database generation, I am using NLTK now with a relatively simple script and slightly altered tokenization to keep contracted words (it's, didn't, ...), which worked for English (allowing its use for presage as well). Or are you looking at how to make the tokenizer in presage language-specific?
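The actual pipeline uses NLTK; the stdlib sketch below only illustrates the tokenization adjustment mentioned above: keep contracted words as single tokens instead of splitting on the apostrophe, then build n-grams from the token stream. The regex and helper names are illustrative, not the script's actual code.

```python
import re

# Keep contractions ("it's", "didn't") as one token; this is the kind of
# language-specific knob the discussion is about.
TOKEN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

def tokenize(text):
    return [t.lower() for t in TOKEN.findall(text)]

def ngrams(tokens, n):
    """Sliding-window n-grams over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

toks = tokenize("It's nice, isn't it?")
assert toks == ["it's", "nice", "isn't", "it"]
assert ngrams(toks, 2) == [("it's", "nice"), ("nice", "isn't"), ("isn't", "it")]
```

The same tokenizer must be used both when generating the database and when querying it at prediction time, otherwise the lookups miss.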
@rinigus yes, the tokenization needs to be the same both in n-gram generation and when using it in the library/keyboard, like I wrote in mid-December. So this must be a language-specific configuration, even if many languages can share it. I started with the tokenizer in presage, but any generator capable of csv/tsv should use a configured tokenizer, otherwise there will be a mismatch in some places. There are also other areas where the LM should be language-specific, both in the keyboard part and in the library. I am also working on improving the suggestions, since they are too strict wrt keeping already typed characters.
@ljo - awesome! Would you mind basing your changes on the https://github.com/rinigus/presage fork? This will allow you to use a much faster and more compact database for n-gram storage and keep our work easily combined via merging. Although, if you want to generate the database via text2ngram, you would have to use SQLite as an intermediate step before generating the MARISA database (that's documented in README.md). I guess, if the tokenization is right, then there is no problem with mixing it. The problems will arise if it's wrong :) . But it's great that you are looking into it and we are all looking forward to a better tokenizer. I will then start working on async mode of the plugin (#9), which should allow me later to implement a predictor based on a speller (https://github.com/rinigus/presage/issues/4) - I expect it to require a bit more processing and, as a result, be slower. After that, I'll look into the next outstanding issues around the plugin.
@rinigus I will base it on your fork. I have followed the discussion and progress, but have not had much time for commenting and interaction for a while. Sounds great! |
Hi, one aspect that hit me while working on the Hunspell-based predictor: we specify the keyboard by language only: HU, EN, ... Actually, we should use hu_HU, en_US and so on instead as the languageCode in the keyboard configuration file. I guess such syntax is supported, I would just have to test it.
Seeing that my free time is one of the bottlenecks in the process of bringing an open source prediction solution to SFOS, I would propose to create an organisation here on GitHub for coordinating development. Presage-SFOS-hacking or similar. In this case we could still do PR-based code reviews, but it won't be necessary to always wait for me.
It's quite late over here to discuss this, let me think about it. It seems like a good idea - not due to the limited amount of your time, which can strike each of us at any moment, but since it will allow us to set up a single place with all relevant repositories. We'll probably want to add a separate repo for the configuration dialog as well. It will also ensure that my presage commits are reviewed, surely a good thing! I'll have to sleep now.

Just wanted to let you folks know that I managed to add the Hunspell predictor to Presage and hook it to the keyboard plugin. It works on top of async, so we don't see any sluggishness induced by it (if there is any). So, if you misspell a word, just keep typing and you should get speller predictions when it's ready to figure them out and when there are no prefixed words in the n-gram databases. I am using full language codes for that; so, for English, the keyboard has to specify the full code.

But as for the original question, let's think about it. I don't know how to set it up, though.
If we are going to make such a reorganization, we may as well propose to the UBTouch developers that they join. There are probably more environments that use the Maliit keyboard, and they could be interested as well. So, this organization could have a more general name. If we do it that way, we could still keep the Sailfish code there as well, together with the code for other OSes.
I looked into the UBPorts keyboard and a little bit into Maliit. To be honest, I am a bit confused about what exactly we are developing :) It looks like the UBPorts keyboard has presage and hunspell support. These are also implemented as separate threads, as done in the async PR. From a look at the source tree, it's quite modular and has been incorporated into the maliit repositories at https://github.com/maliit/keyboard . The plugin that we work on here hooks into jolla-keyboard, which is proprietary, right? Should we consider switching to a FOSS solution? Although that may require too much work, no idea about it.
OK, I think I learned a bit regarding this infrastructure, and now I do wonder what the long term plan for keyboard development for SFOS is. I'll send an email around on the sailfish devel list to get some feedback. I presume that you both (@martonmiklos and @ljo) are there as well :)
From my side I could describe the goal in the following manner: I use SFOS every day with a "community supported language" where Xt9 is not available, and I would like to have proper text prediction. Given the number of SFOS users in Hungary, I doubt that Jolla will ever add support for my language in Xt9, so I looked for other ways. When I do things like this I usually try to make them scalable. In the current situation, by scalability I mean that adding support for an arbitrary language is documented and easily possible. Having fun and learning new technologies during the development is a bonus.
I agree with that. Also, ported devices don't have access to Xt9 - so, for me using a 1+X, this is a great and interesting project. But it's interesting to hear what the long term plan for SFOS is, maybe we can contribute to that as well.
I think we could try to go down the same path with this like we did with the community translations. |
I have just sent the email to the list. Let's see what we get from there. I agree, let's get it working and polish. A large fraction of the things we worked on (porting presage, understanding how it works, fixing, extending it, hooking it to the keyboard) would have had to be done anyway. As well as dealing with the bugs :) Now a question - should I push the hunspell branch on top of async, or wait for you to test async to the end first? I have to change one small thing in hunspell support, but then it will be ready for submission.
Looks like hunspell will have to wait. I have hit an issue when using Estonian: for a word starting with Õ, Presage couldn't make a lower case letter out of it, which is due to the use of a rather simple lowercasing algorithm. I think we need full UTF8 support in Presage and I will look into it. It seems to me that we should move Presage internally to ICU strings and require that input is always in UTF8. Any objections to that? Or should we always convert the strings from the current locale to UTF8? It will probably require working through the tokenizer as well. But I think that UTF8 would be handy there too.
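The Õ problem above can be shown in a few lines. A byte/ASCII-style tolower leaves non-ASCII letters untouched, while a Unicode-aware lowercasing (ICU strings in C++; Python's `str.lower` stands in here for illustration) handles them correctly:

```python
def ascii_lower(s):
    """Naive ASCII-only lowercasing, mimicking a byte-wise tolower."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)

word = "Õun"                        # Estonian word starting with U+00D5
assert ascii_lower(word) == "Õun"   # naive lowering leaves Õ unchanged
assert word.lower() == "õun"        # Unicode-aware lowering works
```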
I see no objections to making presage UTF-8 internally.
@ljo: great that you've been using ICU already, I am not familiar with it. I can easily make an update to the Marisa database that would work as described, but ICU expertise would be of great advantage when working on tokenization. I can now confirm that packaging a UTF8 hunspell dictionary indeed made Hunspell work as expected on SFOS. I'll polish it a bit and will ping when I'm ready. Then we can decide whether to submit it on top of Async or separately. Any update on the Async review status? Does it look (code) and feel fine, or should I make some additional changes? (typed, as many other things these days, using this plugin)
@rinigus It looks good and feels super. Typing a lot with it here too. :) Mostly it is the odd things in the suggestion handling and insertion that make me react. So I look forward to your ping to test the hunspell predictor for other languages too. I agree we can then decide and start addressing the bugs (in suggestion handling) etc.
@martonmiklos: "forgot" to reply regarding the organization. If you want to do it, I will surely support it. I guess the main advantage would be keeping all projects together. In addition to sailfishos-presage-predictor and presage, we would probably have some configuration tool (#5) and can then think about whether we want to add another keyboard core as well. Maybe it would be good to also keep the n-grams somewhere. Right now we are somewhat spread out. Although, with only 3 of us, it's not a major issue :) . (Would be good to keep the name relatively short though...)

I have now tuned the hunspell plugin as well. So, presage has a hunspell "predictor" that is very handy when you misspell a word. It's configured via its file basename (hunspell has 2 files, affix and dictionary; both should have the same basename in this config to simplify the switch). Now, since there is a distinction between language variants, I am using 'en_US' for specific English. This comes basically from https://github.com/rinigus/sailfishos-presage-predictor/blob/hunspell/src/presageworker.cpp#L104 and also requires that the keyboards specify their language in the same way. Such notation drops any lowercasing as was done earlier in the plugin and allows us to use any filesystem-allowed UTF8 sequence for keyboard language specification. From the user's POV, we have "en_US" on the spacebar. Similar for any other language. I think we'll need it in the future for correct normalization using ICU as well (Turkish i comes as a classic example used in all these normalization texts).

Hunspell, as mentioned above, requires UTF8 dictionaries. I have seen some at Github, but they were broken for Estonian. However, after converting the system-installed ones to UTF8 myself, I had no problems. See the description at the end of https://github.com/rinigus/presage/blob/master/README.md

Hunspell builds are at https://build.merproject.org/project/show/home:rinigus:keyboard:devel . Few questions:
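The shared-basename convention described above can be sketched briefly. The directory and function names here are hypothetical, not the plugin's actual install path or code; the point is that both hunspell files are derived from one string, so switching language is a single configuration change.

```python
import os

# Hypothetical sketch of the basename convention: the affix (.aff) and
# dictionary (.dic) files share one basename derived from the full
# language code (e.g. en_US).  Root directory is illustrative.
def hunspell_pair(lang, root="/usr/share/hunspell"):
    base = os.path.join(root, lang)
    return base + ".aff", base + ".dic"

aff, dic = hunspell_pair("en_US")
assert aff.endswith("en_US.aff") and dic.endswith("en_US.dic")
```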
I have just read through the async implementation: LGTM, and it got merged. As far as I see, you have more experience with this stuff, so I can only spot the low hanging fruit; otherwise I would have to delve into the docs.
@martonmiklos: no worries, we do our best and it's progressing nicely. Just think where we were half a year ago, before you got the ball rolling :). I'll prepare Hunspell for submission. Will start with the corresponding dictionaries so you can test. Will ping here as well when ready.
Hi! Hunspell PR submitted and the packages are baked by our friendly OBS. A slightly annoying change that I had to introduce is the usage of the full language spec (en_US) in all package names and configs. I hope that it's the last change we have to make to the package names before releasing the packages. It was required to be able to differentiate between en_US and other language variants. Seems unavoidable to me.

Now when you misspell a word, suggestions are added through the hunspell predictor. If the word is spelled correctly, hunspell adds it to the "predictions" as well. That way it should give you feedback regarding correct spelling too. Hunspell predictions use a fixed probability which declines linearly along the hunspell suggestions, to keep their order the same. Seems to work decently well to me. Maybe the probabilities have to be tuned; I used the value suggested for other dictionaries.

@ljo: I think the quality of predictions will depend on UTF8 support. Before that we are hit by not-so-good normalization before searching in the n-gram database (it could influence its creation too). Right now I am going to check the bugs that you reported and see if I can reproduce them.
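The linearly declining probability scheme described above can be sketched as follows. The base value and function names are illustrative, not Presage's actual constants; the point is only that hunspell's own suggestion order is preserved among the predictions.

```python
# Hypothetical sketch: each hunspell suggestion gets a probability that
# starts at a fixed base and declines linearly down the list, so the
# speller's ranking survives merging with other predictors.
def speller_predictions(suggestions, base=0.9):
    n = len(suggestions)
    step = base / n if n else 0.0
    return [(word, base - i * step) for i, word in enumerate(suggestions)]

preds = speller_predictions(["hello", "hallo", "hullo"])
assert preds[0] == ("hello", 0.9)
assert preds[0][1] > preds[1][1] > preds[2][1] > 0
```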
Before using the version with Hunspell, I would suggest removing all presage packages. Then remove .local/share/presage and .presage (if you have any). After that, check that you don't have any old files left behind. After cleanup, maybe reboot, and then install again. I have all working as expected on my device with the latest version. So far, no unexpected bugs (lowercasing of Õ will happen only with UTF8). In my eyes, it's ready for preparation for release. If you see bugs, please report.
Morning! From Hunspell testing, any bug reports? Any bugs to fix blocking the release of the next version? I guess before the release, @martonmiklos, it probably makes sense to remove all presage libraries and extra packages from OpenRepos (or just delete the RPMs from those and state that they are not needed due to linking). Do we go with the organization? If we do, it may be good to do that before the release and merge all repositories in one place. Then we can collect all issues over there. Otherwise, let's advertise this repository as the main collector of user issues. If it's a presage issue, we can move it as a part of resolution.
I have not gotten to the point of installing the hunspell stuff yet. Organization: if we agree on the name, I can create it any time.
@martonmiklos - fair enough :) . just looking forward to the feedback... As for the name, it should reflect what we are trying to achieve. I presume it's mainly focused on the Sailfish keyboard. As examples of names:
I presume we'll move the main development of this plugin and presage over to that organization. What else? Maybe, in the future, we will want to take a shot at syncing with the UBPorts keyboard to make a fully OSS keyboard component under SFOS. That would probably fit nicely there. I would suggest making a repository for n-grams and hunspell dictionaries to keep them in a single place. From such a scope, I would suggest the name sailfish-keyboard. Or shorter: sfos-kbd? Folks, maybe there is something better? Nothing else pops into my mind...
+1 for sailfish-keyboard. Waiting for @ljo comments. |
Edited out since not relevant :) |
Yeah notifications are tricky :D |
I have created the sailfish-keyboard organisation. If better names are proposed, we can rename it. I am trying to figure out how I can move the repos under the organization.
Great! We always have the option to replay all commits by merging with the current ones. But that will leave the issues behind. Maybe there is a better way.
Well, it turned out to be super easy: https://help.github.com/articles/transferring-a-repository-owned-by-your-personal-account/#transferring-to-an-organization
Awesome! I will transfer tonight as well. Thank you for getting it organized!!! |
That was simple indeed! Thanks for setting it up, presage is transferred as well.
@rinigus I managed to install the hunspell-enabled version Saturday lunch, just after the packages became available in OBS, but then got some kind of flu. I have been typing and experimenting since then though, and it works almost perfectly in creating suggestions. Especially after you get used to the changing suggestions, even in first and second position.
@ljo - great to hear that hunspell is working :) . @martonmiklos, any success with starting the tests? I will look into cleaning up the English n-gram database. It has rather prominent use of swear words that would be good to filter out. I expect that to be done tonight. PR acceptance and the release plans? I would vote for releasing as is, with hunspell, and getting the ball rolling with more languages/users.
Not yet, I have not had time to install it yet.
Let's aim for PR review this weekend, and release on the next weekend with marisa+async+hunspell. I will also try to find some better corpus for Hungarian by then, but that is just a bonus.
Thanks, then we have a timeline :) |
@ljo: what was the English corpus that you used? I used OpenSubtitles and OANC and, as a result of the subtitles, got many profanity hits in the n-grams. So, I wonder whether you cleaned your corpus somehow, or maybe used something different?
@rinigus Currently in Helsinki until Friday evening. I used a selection from our ASPAC English corpus combined with some small texts with more contractions. Should I redo it now that we can use a larger n-gram database?
@ljo, no problem, I think I found a reasonable filter and will test that. I'll try to generate the database using such a corpus. I'll ask for help if I get into trouble.
@ljo, I think I managed to clean up the English dictionary, as far as I tested. I'll package the scripts and push them to presage under utils, probably tonight. I have seen that you packaged Swedish keyboards in OpenRepos - great! I would suggest also adding the corresponding QML files and making the names of the files match, as done for English and Estonian in my OBS. While a plain copy of the ones provided by arrowboard (in your case), this will allow selecting only Presage keyboards in Settings and will behave well on switching languages.
Hah! It might make sense to continue the general/non-issue-related (e.g. release-related) discussions there :)
Excellent, should we close this issue then :) |
Discussion will be continued here: https://github.com/orgs/sailfish-keyboard/teams/development-team/discussions/1 |
I am opening this issue for general discussion and updates/notes. Please feel free to close it at any time or move it to some other channel, as is best. Right now, we have been using the 'English keyboard' issue for such discussion; maybe it's better to make something more dedicated.
As a heads up, after build environment adjustments, I have updated the presage library at https://github.com/rinigus/presage to
- make double conversion from string use the C locale. This is done by specifying the locale only for the conversion step
- add an empty database and configure presage to use it on SFOS install. This allows starting the predictor with the default settings and later changing the database in accordance with the used language. The solution used earlier was causing crashes on desktop.
I have worked a bit on improving the performance, so far about a 20% gain. I have plans on how to improve it further by using another database format; let's see if it will work.
I will now read the plugin code and will probably make some small adjustments. Will submit a PR when ready.