What are you using for morphological analysis? #1411
2 comments · 4 replies
-
Well, I think technically it is correct to separate 憎ん and で, since て is actually an auxiliary word and not part of the inflected verb. 10ten decodes many forms that are technically not part of the verb but common enough that handling them is highly useful while remaining easy to implement.
-
10ten does its own very limited parsing. We've tried using MeCab + UniDic in other projects and there are certainly a lot of cases where it's not particularly accurate. For 10ten, however, the problem it's trying to solve is much simpler than general morphological analysis: given a string, it simply needs to find matches in the current dictionary. It does that using the code in its deinflection module. So for 憎んで it will apply the 〜んで deinflection rules and look up the candidates they produce (憎む among them).
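In outline it's just rule-based deinflection: strip a known inflected ending, substitute the possible dictionary-form endings, and check each candidate against the dictionary. A minimal sketch of the idea, with a toy rule table and dictionary standing in for the real ones (this is not 10ten's actual code):

```ts
// Minimal sketch of rule-based deinflection (toy version).
// Each rule maps an inflected ending back to a possible
// dictionary-form ending.
type Rule = { from: string; to: string };

const rules: Rule[] = [
  // -て form of godan verbs ending in む/ぶ/ぬ: 〜んで → 〜む/〜ぶ/〜ぬ
  { from: "んで", to: "む" },
  { from: "んで", to: "ぶ" },
  { from: "んで", to: "ぬ" },
  // negative 〜ず: 〜まず → 〜む (one toy rule; the real table has many)
  { from: "まず", to: "む" },
];

// Toy dictionary; the real lookup consults actual dictionary entries.
const dictionary = new Set(["憎む", "飛ぶ", "死ぬ"]);

function deinflect(word: string): string[] {
  const matches: string[] = [];
  if (dictionary.has(word)) matches.push(word);
  for (const { from, to } of rules) {
    if (word.endsWith(from)) {
      const candidate = word.slice(0, -from.length) + to;
      // A fuller version would recurse here so rules can chain
      // (e.g. deinflecting 憎まなかった → 憎まない → 憎む).
      if (dictionary.has(candidate)) matches.push(candidate);
    }
  }
  return matches;
}

console.log(deinflect("憎んで")); // ["憎む"]
console.log(deinflect("憎まず")); // ["憎む"]
```

The real table is much larger and, as I recall, annotates each rule with word-class information so that rules only chain in grammatically valid ways.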
-
Hello,
I'm developing a project that relies on Japanese morphological analysis to function. I'm completely new to this topic, so I just looked around for open-source morphological analyzers. I settled on NMeCab, which is a .NET port of MeCab, since my project is in .NET. It uses the old IPADIC dictionary.
I am quickly running into the limitations of this tool. For example, I just discovered that the analysis of the following sentence:
罪を憎んで、人を憎まず
does not correctly identify the -te form of the verb in the first half of the sentence.
It correctly identifies the verb 憎む and tells me the inflection form is 連用タ接続 (continuative ta-connection), but the morpheme is just 憎ん, and it claims で is a separate following morpheme (specifically, the particle で).
The tool also supports UniDic (though not the latest version), but that one has its own problems.
I could try to write some post-processing rules for this (if the inflection is 連用タ接続 and the next morpheme is で or て, then it's a -て form, etc.), but this could get complicated and messy. The trouble is, this isn't the only not-so-edge case I've noticed during my short experience with the tool. Finding all these issues on an ad-hoc basis and trying to fix them one by one is something I'd very much like to avoid; I'd much rather find a better analyzer.
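To make that concrete, here is roughly the rule I have in mind, sketched in TypeScript rather than my actual C# (the `Morpheme` shape and its field names are my own simplification of the IPADIC features NMeCab exposes):

```ts
// Hypothetical simplified morpheme record, loosely modeled on the
// IPADIC feature fields (names here are my own).
type Morpheme = {
  surface: string;        // e.g. "憎ん"
  pos: string;            // e.g. "動詞" (verb) or "助詞" (particle)
  inflectionForm: string; // e.g. "連用タ接続"
  baseForm: string;       // e.g. "憎む"
};

// Merge a verb tagged 連用タ接続 with a following て/で into a single
// -て form token -- the ad-hoc repair rule described above.
function mergeTeForm(morphemes: Morpheme[]): Morpheme[] {
  const out: Morpheme[] = [];
  for (let i = 0; i < morphemes.length; i++) {
    const m = morphemes[i];
    const next = morphemes[i + 1];
    if (
      m.pos === "動詞" &&
      m.inflectionForm === "連用タ接続" &&
      next !== undefined &&
      (next.surface === "て" || next.surface === "で")
    ) {
      out.push({
        surface: m.surface + next.surface, // 憎ん + で → 憎んで
        pos: "動詞",
        inflectionForm: "テ形", // my own label, not an IPADIC tag
        baseForm: m.baseForm,  // 憎む
      });
      i++; // skip the absorbed て/で
    } else {
      out.push(m);
    }
  }
  return out;
}

// Simplified version of the tokens described above for the first
// half of the sentence:
const tokens: Morpheme[] = [
  { surface: "罪", pos: "名詞", inflectionForm: "*", baseForm: "罪" },
  { surface: "を", pos: "助詞", inflectionForm: "*", baseForm: "を" },
  { surface: "憎ん", pos: "動詞", inflectionForm: "連用タ接続", baseForm: "憎む" },
  { surface: "で", pos: "助詞", inflectionForm: "*", baseForm: "で" },
];
console.log(mergeTeForm(tokens).map((m) => m.surface)); // ["罪", "を", "憎んで"]
```

It works for this sentence, but every new inflection pattern would need another rule like this, which is exactly what I want to avoid.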
I'm asking here because I can see that the 10ten reader has no problems identifying the -te forms of verbs. In fact, I've been using the 10ten reader for a while now, and I've always been impressed by how well it decodes Japanese grammar, so your approach is clearly superior.
I would like to ask: are you using some open-source tool that I could use as well, or did you write your own custom code for this? And are there any learning resources you could recommend on this topic?