The following package consists of a rule-based verb inflector in Persian developed by Mohammad Sadegh Rasooli. The code was mainly used for preprocessing the Persian dependency treebank.
-
If you use this software in your research work, please cite to the following paper:
Mohammad Rasooli, Heshaam Faili, and Behrouz Minaei-Bidgoli. "Unsupervised Identification of Persian Compound Verbs", in Proceedings of the 10th Mexican international conference on Advances in Artificial Intelligence - Volume Part I (LNCS 7094), Pages 394-406, Puebla, Mexico, 2011.
In ACM library: http://dl.acm.org/citation.cfm?id=2178197.2178234&coll=DL&dl=GUIDE&CFID=97422849&CFTOKEN=39528935
In Springer library: http://www.springerlink.com/content/n3r0181wu2h6p337/
-
Please send the bugs and your questions to <rasooli.ms{AT}gmail.com>
The code is compatible with C# 3.5 or upper versions.
There are two options for getting a verb analyzed sentence:
-
Without part of speech tags (without disambiguation, considering all the words as potential verbs). In
SentenceAnalyzer.cs
:public static VerbBasedSentence MakeVerbBasedSentence(string sentence)
or
public static VerbBasedSentence MakeVerbBasedSentence(string[] sentence)
-
With part of speech and morphosyntactic tags (with a good accuracy): the pos tags are the same as Bijankhan corpus:
public static VerbBasedSentence MakeVerbBasedSentence(string[] sentence, string[] posSentence, string[] lemmas, MorphoSyntacticFeatures[] morphoSyntacticFeatureses)
In the program.cs
file there is a test output of a Persian sentence that can be used as a starting point.
var analyzer = new SentenceAnalyzer("../../../Data/VerbList.txt");
var sentence = "من دارم به شما میگویم که این صحبتها به راحتی گفته نخواهد شد و من با شما صحبت زیاد خواهم کرد.";
var result = SentenceAnalyzer.MakeVerbBasedSentence(sentence);
var output = new StringBuilder();
foreach (var dependencyBasedToken in result.SentenceTokens)
{
output.AppendLine(dependencyBasedToken.WordForm + "\t" + dependencyBasedToken.Lemma + "\t" +
dependencyBasedToken.CPOSTag
+ "\t" + (dependencyBasedToken.HeadNumber+1).ToString() + "\t" +
dependencyBasedToken.DependencyRelation);
}
File.WriteAllText("../../../testOutPut.txt",output.ToString());
Output in testOutPut.txt
:
من _ _ 0 _
دارم داشت#دار V 5 PROG
به _ _ 0 _
شما _ _ 0 _
میگویم گفت#گو V 0 _
که _ _ 0 _
این _ _ 0 _
صحبتها _ _ 0 _
به _ _ 0 _
راحتی _ _ 0 _
گفته نخواهد شد گفت#گو V 0 _
و _ _ 0 _
من _ _ 0 _
با _ _ 0 _
شما _ _ 0 _
صحبت _ _ 18 NVE
زیاد _ _ 0 _
خواهم کرد کرد#کن V 0 _
The file is tab-separated with the following fields:
-
verbType
:integer
1: simple, 2: prefix verb, 3: compound verb, 4: compound prefix verb , 5: prepositional compound prefix verb, 6: enclitic verb, 7: prepositional verb
-
transitivity
:integer
0: intransitive, 1: transitive, x 2: bitransitive
-
past tense root
:string
"-" if not present
-
present tense root
:string
"-" if not present
-
Non-verbal element
:string
"-" if not present
-
Prefix
:string
"-" if not present
-
Preposition
:string
"-" if not present
-
amrShodani
:string
"-" =true, *: false
-
vowelEnd
:string
End of present root vowel: U: ends with u, I: ends with ei, A: ends with a, ?: else
-
maziVowel
:string
Start vowel type of past tense root A: starts with "a" or "\ae", @: else
-
mozarehVowel
:string
Start vowel type of present tense root bU: starts with "bu", ba: start with "b\ae", bA: starts with "ba", A: starts with "a" or "\ae", !: else
I assumed the character set is being refined when you pass array argument to the methods. As shown in the follwoing code, I used Virastyar library for refining characters and tokenizing strings.
public static VerbBasedSentence MakeVerbBasedSentence(string sentence)
{
sentence = StringUtil.RefineAndFilterPersianWord(sentence); // using the refiner of Virastyar software
var tokenized = PersianWordTokenizer.Tokenize(sentence,true); // using the tokenizer of Virastyar software
return MakeVerbBasedSentence(tokenized.ToArray());
}
You can go to Virastyar official site in order to know more about its options http://virastyar.ir.
If you do not want to use it for your purposes you can clean the mentioned lines from the code
You can find a morphological-based POS tagger that can be used in your code. You can also use the tagger to help improve learner POS taggers such as HMM tagger.
I assumed that the writers use semi-space for verb inflections. In Bijankhan corpus, you can replace space with semi-space in words with verb tag.