Normalized surface in user dictionary. #126

mrikitoku · 2018-05-28T07:26:30Z

In the current ipadic implementation, there seems to be no way to normalize a surface form in the user dictionary.
Is this right?

I think this functionality is very useful and is needed in many common situations.

So I plan to extend the user dictionary so that it can normalize a word's surface form, while keeping the current user dictionary resource format.

What do you think about this?

cmoen · 2018-05-28T10:40:25Z

Thanks! Could you give an example of what kind of normalisations you'd like to see?

I'm wondering if we might already support it in the full/expanded user dictionary format in 1.0-SNAPSHOT.

mrikitoku · 2018-05-30T01:37:39Z

As you know, the Token class has a getBaseForm method, which I regard as a kind of surface normalization.

With this method we can get a normalized surface, provided a base form is registered for each morpheme.
Like this:

public static void execute() {
    Tokenizer.Builder builder = new Tokenizer.Builder();
    builder.mode(TokenizerBase.Mode.SEARCH);
    String text = "プログラミングの入門書を書いている.";
    Tokenizer tagger = builder.build();

    // Print the surface, base form, and full feature string of each token.
    List<Token> tokens = tagger.tokenize(text);
    for (Token t : tokens) {
        System.out.println(String.format("t.getSurface(): %s", t.getSurface()));
        System.out.println(String.format("t.getBaseForm(): %s", t.getBaseForm()));
        System.out.println(" " + t.getAllFeatures());
    }
}
   ...
   >t.getSurface(): 書い
   >t.getBaseForm(): 書く

However, in the current ipadic user dictionary implementation, there seems to be no way to register a base form for a user dictionary word. Instead of a base form, we can only register the reading and the segmented surface.

mrikitoku · 2018-06-13T23:39:17Z

In the current UserDictionary implementation, SIMPLE_USERDICT_FIELDS is set to 4, as follows:

SIMPLE_USERDICT_FIELDS = 4;

The simple user dictionary fields are the following:

 String surface = values[0];
 String segmentationValue = values[1];
 String readingsValue = values[2];
 String partOfSpeech = values[3];

I think the base form is needed more often in typical use cases.
So I plan to add a fifth field for the base form,

or

to let segmentationValue carry the base form.
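
To make the first option concrete, here is a minimal sketch of how a line with an optional fifth field might be parsed. It does not use the Kuromoji code. The class UserDictLineSketch, the Entry holder, the sample entry, and the fallback to the surface when no base form is given are all assumptions for illustration. The naive comma split also ignores quoting.

```java
public class UserDictLineSketch {
    // Same value as the constant quoted above.
    static final int SIMPLE_USERDICT_FIELDS = 4;

    /** Hypothetical holder for one parsed line; not a Kuromoji class. */
    static final class Entry {
        final String surface;
        final String segmentation;
        final String readings;
        final String partOfSpeech;
        final String baseForm;

        Entry(String surface, String segmentation, String readings,
              String partOfSpeech, String baseForm) {
            this.surface = surface;
            this.segmentation = segmentation;
            this.readings = readings;
            this.partOfSpeech = partOfSpeech;
            this.baseForm = baseForm;
        }
    }

    /** Parses a 4-field line, or a 5-field line whose last field is the base form. */
    static Entry parse(String line) {
        // Limit -1 keeps trailing empty fields. Quoted commas are not handled.
        String[] values = line.split(",", -1);
        if (values.length < SIMPLE_USERDICT_FIELDS
                || values.length > SIMPLE_USERDICT_FIELDS + 1) {
            throw new IllegalArgumentException("Expected 4 or 5 fields: " + line);
        }
        // Assumption: without a fifth field, the base form defaults to the surface.
        boolean hasBaseForm = values.length > SIMPLE_USERDICT_FIELDS
                && !values[SIMPLE_USERDICT_FIELDS].isEmpty();
        String baseForm = hasBaseForm ? values[SIMPLE_USERDICT_FIELDS] : values[0];
        return new Entry(values[0], values[1], values[2], values[3], baseForm);
    }

    public static void main(String[] args) {
        Entry e = parse("NAIST,NAIST,ナイスト,カスタム名詞,奈良先端科学技術大学院大学");
        System.out.println(e.surface + " -> " + e.baseForm);
    }
}
```

With this shape, existing 4-field dictionaries would still parse unchanged, which is the point of keeping the current resource format.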

mrikitoku · 2018-06-14T01:29:02Z

Must the length of segmentationValue without spaces equal the length of the surface, because term splitting is done using the offset and length of each split word?
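
If that assumption holds, the constraint can be illustrated as below. This is only a sketch of the idea in the question, not the Kuromoji implementation. The class SegmentationCheckSketch, its methods, and the sample words are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SegmentationCheckSketch {
    /** True if the segmentation, with spaces removed, is as long as the surface. */
    static boolean lengthsMatch(String surface, String segmentation) {
        // String.length() counts UTF-16 code units, which is enough for this sketch.
        return segmentation.replace(" ", "").length() == surface.length();
    }

    /** Returns {offset, length} pairs for each space-separated segment. */
    static List<int[]> offsets(String segmentation) {
        List<int[]> result = new ArrayList<>();
        int offset = 0;
        for (String part : segmentation.trim().split(" +")) {
            result.add(new int[] {offset, part.length()});
            offset += part.length();
        }
        return result;
    }

    public static void main(String[] args) {
        // The segments cover the surface exactly, so the offsets stay inside it.
        System.out.println(lengthsMatch("関西国際空港", "関西 国際 空港"));
        for (int[] seg : offsets("関西 国際 空港")) {
            System.out.println("offset=" + seg[0] + " length=" + seg[1]);
        }
        // A base form placed in the segmentation field is longer than the surface.
        System.out.println(lengthsMatch("NAIST", "奈良先端科学技術大学院大学"));
    }
}
```

Under this reading, putting a base form into segmentationValue would yield offsets that point past the end of the surface, which is one argument for a separate fifth field.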

mrikitoku · 2018-06-14T01:59:12Z

@Test
    public void testBaseForm() throws IOException {
        String userDictionary = "NAIST,奈良先端科学技術大学院大学,naisuto,meisi";
        Tokenizer tokenizer = makeTokenizer(userDictionary);
        String input = "大学の略称はNAIST（ナイスト）、奈良先端大学";

        List<Token> tokens = tokenizer.tokenize(input);
        for (Token t : tokens) {
            System.out.println("----");
            System.out.println(" surface:" + t.getSurface() );
            System.out.println(" base form:" + t.getBaseForm() );
            System.out.println(" reading:" + t.getReading());
            System.out.println(" class:" + t.getPartOfSpeechLevel1());
            System.out.println(" " + t.getAllFeatures());
        }
    }

----
 surface:大学
 base form:大学
 reading:ダイガク
 class:名詞
 名詞,一般,*,*,*,*,大学,ダイガク,ダイガク
----
 surface:の
 base form:の
 reading:ノ
 class:助詞
 助詞,連体化,*,*,*,*,の,ノ,ノ
----
 surface:略称
 base form:略称
 reading:リャクショウ
 class:名詞
 名詞,サ変接続,*,*,*,*,略称,リャクショウ,リャクショー
----
 surface:は
 base form:は
 reading:ハ
 class:助詞
 助詞,係助詞,*,*,*,*,は,ハ,ワ
----
 surface:NAIST
 base form:奈良先端科学技術大学院大学
 reading:naisuto
 class:meisi
 meisi,*,*,*,*,*,奈良先端科学技術大学院大学,naisuto,*
----
 surface:（
 base form:（
 reading:（
 class:記号
 記号,括弧開,*,*,*,*,（,（,（
----
 surface:ナイスト
 base form:*
 reading:*
 class:名詞
 名詞,固有名詞,一般,*,*,*,*,*,*
----
 surface:）
 base form:）
 reading:）
 class:記号
 記号,括弧閉,*,*,*,*,）,）,）
----
 surface:、
 base form:、
 reading:、
 class:記号
 記号,読点,*,*,*,*,、,、,、
----
 surface:奈良先端大
 base form:奈良先端大
 reading:ナラセンタンダイ
 class:名詞
 名詞,固有名詞,組織,*,*,*,奈良先端大,ナラセンタンダイ,ナラセンタンダイ
----
 surface:学
 base form:学
 reading:ガク
 class:名詞
 名詞,接尾,一般,*,*,*,学,ガク,ガク
