What are you using for morphological analysis? #1411
2 comments · 4 replies
-
Well, I think technically it is correct to separate 憎ん and で, since て is actually an auxiliary word and not part of the inflected verb. 10ten decodes many forms that are technically not part of the verb but common enough that handling them is highly useful while remaining easy to implement.
-
10ten does its own very limited parsing. We've tried using MeCab + UniDic in other projects and there are certainly a lot of cases where it's not particularly accurate. For 10ten, however, the problem it's trying to solve is much simpler than general morphological analysis: given a string, it simply needs to find matches in the current dictionary. It does that using the code in its deinflection module. So for 憎んで it will apply the 〜んで deinflection rules and look up the candidates they produce (憎む among them).
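In outline it's just rule-based deinflection: strip a known inflected ending, substitute the possible dictionary-form endings, and check each candidate against the dictionary. A minimal sketch of the idea, with a toy rule table and dictionary standing in for the real ones (this is not 10ten's actual code):

```ts
// Minimal sketch of rule-based deinflection (toy version).
// Each rule maps an inflected ending back to a possible
// dictionary-form ending.
type Rule = { from: string; to: string };

const rules: Rule[] = [
  // -て form of godan verbs ending in む/ぶ/ぬ: 〜んで → 〜む/〜ぶ/〜ぬ
  { from: "んで", to: "む" },
  { from: "んで", to: "ぶ" },
  { from: "んで", to: "ぬ" },
  // negative 〜ず: 〜まず → 〜む (one toy rule; the real table has many)
  { from: "まず", to: "む" },
];

// Toy dictionary; the real lookup consults actual dictionary entries.
const dictionary = new Set(["憎む", "飛ぶ", "死ぬ"]);

function deinflect(word: string): string[] {
  const matches: string[] = [];
  if (dictionary.has(word)) matches.push(word);
  for (const { from, to } of rules) {
    if (word.endsWith(from)) {
      const candidate = word.slice(0, -from.length) + to;
      // A fuller version would recurse here so rules can chain
      // (e.g. deinflecting 憎まなかった → 憎まない → 憎む).
      if (dictionary.has(candidate)) matches.push(candidate);
    }
  }
  return matches;
}

console.log(deinflect("憎んで")); // ["憎む"]
console.log(deinflect("憎まず")); // ["憎む"]
```

The real table is much larger and, as I recall, annotates each rule with word-class information so that rules only chain in grammatically valid ways.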
-
Hello,
I'm developing a project that relies on Japanese morphological analysis to function. I'm completely new to this topic, so I just looked around for open-source morphological analyzers. I settled on NMeCab, which is a .NET port of MeCab, since my project is in .NET. It uses the old IPADIC dictionary.
I am quickly running into the limitations of this tool. For example, I just discovered that the analysis of the following sentence:
罪を憎んで、人を憎まず
does not correctly identify the -te form of the verb in the first half of the sentence.
It correctly identifies the verb 憎む and tells me the inflection form is 連用タ接続 (continuative ta-connection), but the morpheme is just 憎ん, and it claims で is a separate following morpheme (specifically, the particle で).
The tool also supports UniDic (though not the latest version), but that one has its own problems.
I could try to write some post-processing rules for this (if the inflection is 連用タ接続 and the next morpheme is で or て, then it's a -て form, etc.), but this could get complicated and messy. The trouble is, this isn't the only not-so-edge case I've noticed during my short experience with the tool. Finding all these issues on an ad-hoc basis and trying to fix them one by one is something I'd very much like to avoid; I'd much rather find a better analyzer.
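To make that concrete, here is roughly the rule I have in mind, sketched in TypeScript rather than my actual C# (the `Morpheme` shape and its field names are my own simplification of the IPADIC features NMeCab exposes):

```ts
// Hypothetical simplified morpheme record, loosely modeled on the
// IPADIC feature fields (names here are my own).
type Morpheme = {
  surface: string;        // e.g. "憎ん"
  pos: string;            // e.g. "動詞" (verb) or "助詞" (particle)
  inflectionForm: string; // e.g. "連用タ接続"
  baseForm: string;       // e.g. "憎む"
};

// Merge a verb tagged 連用タ接続 with a following て/で into a single
// -て form token -- the ad-hoc repair rule described above.
function mergeTeForm(morphemes: Morpheme[]): Morpheme[] {
  const out: Morpheme[] = [];
  for (let i = 0; i < morphemes.length; i++) {
    const m = morphemes[i];
    const next = morphemes[i + 1];
    if (
      m.pos === "動詞" &&
      m.inflectionForm === "連用タ接続" &&
      next !== undefined &&
      (next.surface === "て" || next.surface === "で")
    ) {
      out.push({
        surface: m.surface + next.surface, // 憎ん + で → 憎んで
        pos: "動詞",
        inflectionForm: "テ形", // my own label, not an IPADIC tag
        baseForm: m.baseForm,  // 憎む
      });
      i++; // skip the absorbed て/で
    } else {
      out.push(m);
    }
  }
  return out;
}

// Simplified version of the tokens described above for the first
// half of the sentence:
const tokens: Morpheme[] = [
  { surface: "罪", pos: "名詞", inflectionForm: "*", baseForm: "罪" },
  { surface: "を", pos: "助詞", inflectionForm: "*", baseForm: "を" },
  { surface: "憎ん", pos: "動詞", inflectionForm: "連用タ接続", baseForm: "憎む" },
  { surface: "で", pos: "助詞", inflectionForm: "*", baseForm: "で" },
];
console.log(mergeTeForm(tokens).map((m) => m.surface)); // ["罪", "を", "憎んで"]
```

It works for this sentence, but every new inflection pattern would need another rule like this, which is exactly what I want to avoid.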
I'm asking here because I can see that the 10ten reader has no problems identifying the -te forms of verbs. In fact, I've been using the 10ten reader for a while now, and I've always been impressed by how well it decodes Japanese grammar, so your approach is clearly superior.
I would like to ask: are you using some open-source tool that I could use as well, or did you write your own custom code for this? And are there any learning resources you could recommend on this topic?