Normalized surface in user dictionary. #126

Open
mrikitoku opened this issue May 28, 2018 · 5 comments

Comments

@mrikitoku

mrikitoku commented May 28, 2018

In the current ipadic implementation, there seems to be no way to normalize the surface form in the user dictionary.
Is this right?

I think this functionality is very useful and is needed in many common situations.

So I plan to extend the user dictionary so that it can normalize a word's surface form, while keeping the current user dictionary resource format.

What do you think about this?

@cmoen
Member

cmoen commented May 28, 2018

Thanks! Could you give an example of what kind of normalisations you'd like to see?

I'm wondering if we might already support it in the full/expanded user dictionary format in 1.0-SNAPSHOT.

@mrikitoku
Author

mrikitoku commented May 30, 2018

As you know, the Token class has a getBaseForm method, which I regard as a kind of surface normalization.

With this method, we can get a normalized surface form if a base form is registered for each morpheme, like this:

public static void execute() {
        Tokenizer.Builder builder = new Tokenizer.Builder();
        builder.mode(TokenizerBase.Mode.SEARCH);
        String text = "プログラミングの入門書を書いている.";
        Tokenizer tagger = builder.build();

        // Print the surface, base form, and all features of each token.
        List<Token> tokens = tagger.tokenize(text);
        for (Token t : tokens) {
            System.out.println(String.format("t.getSurface(): %s", t.getSurface()));
            System.out.println(String.format("t.getBaseForm(): %s", t.getBaseForm()));
            System.out.println(" " + t.getAllFeatures());
        }
    }
   ...
   >t.getSurface(): 書い
   >t.getBaseForm(): 書く

However, in the current ipadic user dictionary implementation, there seems to be no way to register a base form for each user dictionary word. Instead of a base form, we can only register the reading and the segmented surface.

@mrikitoku
Author

In the current UserDictionary implementation, SIMPLE_USERDICT_FIELDS is set to 4, as follows:

SIMPLE_USERDICT_FIELDS = 4;

The simple user dictionary fields are the following:

 String surface = values[0];
 String segmentationValue = values[1];
 String readingsValue = values[2];
 String partOfSpeech = values[3];

I think the base form is needed more often in typical use cases.
So I plan to add a fifth field for the base form,

or

let segmentationValue carry the base form.
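
To make the first option concrete, here is a minimal sketch of how a line with an optional fifth field could be parsed. It is only an illustration and not Kuromoji's actual code: the class SimpleUserDictEntry, the constant FIELDS_WITH_BASE_FORM, and the fallback to the surface when the fifth field is missing are all assumptions made for this example.

```java
/** Hypothetical parsed entry for illustration; this is not a Kuromoji class. */
final class SimpleUserDictEntry {
    static final int SIMPLE_USERDICT_FIELDS = 4; // value quoted in this thread
    static final int FIELDS_WITH_BASE_FORM = 5;  // assumption: proposed extension

    final String surface;
    final String[] segmentation;
    final String[] readings;
    final String partOfSpeech;
    final String baseForm;

    private SimpleUserDictEntry(String surface, String[] segmentation,
                                String[] readings, String partOfSpeech, String baseForm) {
        this.surface = surface;
        this.segmentation = segmentation;
        this.readings = readings;
        this.partOfSpeech = partOfSpeech;
        this.baseForm = baseForm;
    }

    /** Parses one line; plain comma splitting, so quoted CSV values are not handled. */
    static SimpleUserDictEntry parse(String line) {
        String[] values = line.split(",", -1);
        if (values.length != SIMPLE_USERDICT_FIELDS && values.length != FIELDS_WITH_BASE_FORM) {
            throw new IllegalArgumentException("Expected 4 or 5 fields: " + line);
        }
        String surface = values[0];
        String[] segmentation = values[1].trim().split("\\s+");
        String[] readings = values[2].trim().split("\\s+");
        String partOfSpeech = values[3];
        // Assumption: existing 4-field entries keep working and use the surface as base form.
        String baseForm = (values.length == FIELDS_WITH_BASE_FORM && !values[4].isEmpty())
                ? values[4]
                : surface;
        return new SimpleUserDictEntry(surface, segmentation, readings, partOfSpeech, baseForm);
    }

    public static void main(String[] args) {
        SimpleUserDictEntry withBase =
                parse("NAIST,NAIST,ナイスト,カスタム名詞,奈良先端科学技術大学院大学");
        System.out.println(withBase.surface + " -> " + withBase.baseForm);

        SimpleUserDictEntry legacy =
                parse("関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞");
        System.out.println(legacy.surface + " -> " + legacy.baseForm);
    }
}
```

Accepting both 4 and 5 fields would keep existing dictionary files valid, which is the point of keeping the current resource format.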

@mrikitoku
Author

Must the length of segmentationValue (without spaces) equal the length of the surface, because term splitting is done using the offset and length of each split word?
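
If splitting does work by offset and length, the constraint would look roughly like the sketch below. This is my assumption about the behavior, not code taken from Kuromoji; SegmentationCheck and segmentOffsets are hypothetical names.

```java
/** Hypothetical helper for illustration; this is not a Kuromoji class. */
final class SegmentationCheck {

    /**
     * Returns the start offset of each segment inside the surface.
     * Throws if the segment lengths do not add up to the surface length.
     */
    static int[] segmentOffsets(String surface, String segmentationValue) {
        String[] segments = segmentationValue.trim().split("\\s+");
        int[] offsets = new int[segments.length];
        int position = 0;
        for (int i = 0; i < segments.length; i++) {
            offsets[i] = position;              // segment i starts here in the surface
            position += segments[i].length();   // advance by the segment's length
        }
        if (position != surface.length()) {
            throw new IllegalArgumentException(
                    "Segmentation length " + position
                            + " does not match surface length " + surface.length());
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Lengths match: 2 + 2 + 2 characters for a 6-character surface.
        int[] offsets = segmentOffsets("関西国際空港", "関西 国際 空港");
        System.out.println(java.util.Arrays.toString(offsets));

        // A base form placed in the segmentation field is longer than the surface.
        try {
            segmentOffsets("NAIST", "奈良先端科学技術大学院大学");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under this assumption, putting a base form of a different length into segmentationValue would produce offsets that no longer line up with the surface, which is an argument for a separate fifth field.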

@mrikitoku
Author

@Test
    public void testBaseForm() throws IOException {
        String userDictionary = "NAIST,奈良先端科学技術大学院大学,naisuto,meisi";
        Tokenizer tokenizer = makeTokenizer(userDictionary);
        String input = "大学の略称はNAIST(ナイスト)、奈良先端大学";

        List<Token> tokens = tokenizer.tokenize(input);
        for (Token t : tokens) {
            System.out.println("----");
            System.out.println(" surface:" + t.getSurface() );
            System.out.println(" base form:" + t.getBaseForm() );
            System.out.println(" reading:" + t.getReading());
            System.out.println(" class:" + t.getPartOfSpeechLevel1());
            System.out.println(" " + t.getAllFeatures());
        }
    }
----
 surface:大学
 base form:大学
 reading:ダイガク
 class:名詞
 名詞,一般,*,*,*,*,大学,ダイガク,ダイガク
----
 surface:の
 base form:の
 reading:ノ
 class:助詞
 助詞,連体化,*,*,*,*,の,ノ,ノ
----
 surface:略称
 base form:略称
 reading:リャクショウ
 class:名詞
 名詞,サ変接続,*,*,*,*,略称,リャクショウ,リャクショー
----
 surface:は
 base form:は
 reading:ハ
 class:助詞
 助詞,係助詞,*,*,*,*,は,ハ,ワ
----
 surface:NAIST
 base form:奈良先端科学技術大学院大学
 reading:naisuto
 class:meisi
 meisi,*,*,*,*,*,奈良先端科学技術大学院大学,naisuto,*
----
 surface:(
 base form:(
 reading:(
 class:記号
 記号,括弧開,*,*,*,*,(,(,(
----
 surface:ナイスト
 base form:*
 reading:*
 class:名詞
 名詞,固有名詞,一般,*,*,*,*,*,*
----
 surface:)
 base form:)
 reading:)
 class:記号
 記号,括弧閉,*,*,*,*,),),)
----
 surface:、
 base form:、
 reading:、
 class:記号
 記号,読点,*,*,*,*,、,、,、
----
 surface:奈良先端大
 base form:奈良先端大
 reading:ナラセンタンダイ
 class:名詞
 名詞,固有名詞,組織,*,*,*,奈良先端大,ナラセンタンダイ,ナラセンタンダイ
----
 surface:学
 base form:学
 reading:ガク
 class:名詞
 名詞,接尾,一般,*,*,*,学,ガク,ガク
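
In the output above, ナイスト has the base form `*`, so code that treats the base form as the normalized surface would need a fallback. A minimal sketch, where BaseFormNormalizer and normalizedTerm are hypothetical names and the fallback to the surface is my assumption:

```java
/** Hypothetical helper for illustration; this is not a Kuromoji class. */
final class BaseFormNormalizer {

    /** Returns the base form, or the surface when no usable base form is available. */
    static String normalizedTerm(String surface, String baseForm) {
        if (baseForm == null || baseForm.isEmpty() || "*".equals(baseForm)) {
            return surface; // assumption: "*" means that no base form is registered
        }
        return baseForm;
    }

    public static void main(String[] args) {
        // Pairs of (surface, base form) taken from the output above.
        System.out.println(normalizedTerm("NAIST", "奈良先端科学技術大学院大学"));
        System.out.println(normalizedTerm("ナイスト", "*"));
    }
}
```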
