This synonym corpus is compiled from 在线成语词典 (see my other repository, Chinese-fixed-phrases-idioms), 哈工大同义词词林扩展版, 汉语大辞典的近义词大全, and 在线近义词查询.
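For 哈工大同义词词林, I keep only groups that contain at least two words. Singleton "synonym" entries, such as “Aa01C05@ 众学生”, are discarded. When a word has partly or entirely different synonyms in different sources, I take the union of those synonyms. In addition, every source other than 哈工大同义词词林 states explicitly which words are synonyms of which word, whereas 哈工大同义词词林 simply lists a group of synonymous words. In practice, I treat the first word of each group as the target word and the remaining words as its synonyms (stored in a list). This yields 18,589 synonym entries in total, saved as a dictionary in synonyms.json.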
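synonyms_expanded_narrow.json and synonyms_expanded_broad.json are expansions of synonyms.json, each containing 52,157 synonym entries. The Narrow version treats every synonym of a target word as a target word in its own right, with the original target word becoming one of its synonyms. For example, if A has the synonyms B and C, then B and C each have A as a synonym. If B is also a synonym of D, then B has the synonyms A and D, and so on. The Broad version assumes wider connections among synonyms: since A has the synonyms B and C, B and C are taken to be synonyms of each other as well, so B has the synonyms A and C, and C has the synonyms A and B. If B and C are also linked to other words as synonyms, their synonym lists grow larger and wider still.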
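Clearly, the Narrow version defines synonymy more conservatively and is more reliable, but it fails to connect some potential synonym pairs. The Broad version builds as wide a synonym network as possible, but many of the resulting pairs do not hold. For example, in synonyms.json, “暗娼” has the synonyms “私娼” and “野鸡”, but the reverse entries do not exist. In synonyms_expanded_narrow.json, looking up “私娼” returns only “暗娼”, whereas in synonyms_expanded_broad.json “私娼” has the synonyms “暗娼” and “野鸡”, which is clearly the better result. Looking up “野鸡”, however, the Narrow version returns “非法” (illegal), “雉” (pheasant), and “暗娼” (unlicensed prostitute), which correspond to the different senses of “野鸡” in different contexts, while the Broad version returns “山鸡”, “越轨”, “非法”, “地下”, “私自”, “黑”, “非法定”, “翟”, “私”, “暗娼”, “不法”, “非官方”, “私娼”, “雉”, “伪”, “暗”, a mixed bag of valid and invalid synonyms.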
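Of course, because of polysemy, for some words that have a whole group of synonyms, the right synonym cannot be found simply by looking the word up in the dictionary. In such cases, one option is to build a language model with statistical or machine learning methods and use it for disambiguation.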
The contents of this corpus are based on several reputable sources: 在线成语词典 (see my other repository, Chinese-fixed-phrases-idioms), 哈工大同义词词林扩展版, 汉语大辞典的近义词大全, and 在线近义词查询.
For 哈工大同义词词林扩展版, I discarded entries that contain only a single word or phrase, since such entries have no actual synonyms. Also, because 哈工大同义词词林扩展版 is the only source that gives a list of synonyms rather than word-synonym pairs, the first word in each list is always taken as the target word, with the rest being its synonyms. This results in a corpus of 18,589 word-synonym entries, saved as a dictionary in synonyms.json.
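The merging step described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used to build synonyms.json: the function name `build_synonyms`, the assumption that each 哈工大同义词词林 line is a code followed by whitespace-separated words, and the sample data (the code `Zz99Z99=` and the words A to D) are all hypothetical.

```python
import json


def build_synonyms(cilin_lines, other_sources):
    """Merge Cilin-style groups and explicit word-to-synonyms dicts.

    cilin_lines: lines such as "Aa01C05@ 众学生" (a code, then the words;
        whitespace separation is an assumption of this sketch).
    other_sources: dicts mapping a target word to a list of its synonyms.
    """
    merged = {}

    def add(target, synonyms):
        for synonym in synonyms:
            if synonym == target:
                continue
            bucket = merged.setdefault(target, [])
            if synonym not in bucket:  # union across sources, no duplicates
                bucket.append(synonym)

    for line in cilin_lines:
        fields = line.split()
        words = fields[1:]  # fields[0] is the code, e.g. "Aa01C05@"
        if len(words) < 2:
            continue  # a single word has no synonyms, so skip it
        add(words[0], words[1:])  # first word is the target word

    for source in other_sources:
        for target, synonyms in source.items():
            add(target, synonyms)
    return merged


# Hypothetical sample input: one singleton line and one made-up group.
cilin_lines = ["Aa01C05@ 众学生", "Zz99Z99= A B C"]
other_sources = [{"A": ["C", "D"]}]
synonyms = build_synonyms(cilin_lines, other_sources)
print(json.dumps(synonyms, ensure_ascii=False))
```

Passing `ensure_ascii=False` to `json.dumps` (or `json.dump`) keeps Chinese characters readable in the output instead of escaping them.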
synonyms_expanded_narrow.json and synonyms_expanded_broad.json are expanded versions of synonyms.json, each containing 52,157 entries. In the narrowly expanded version, every synonym of a word in synonyms.json becomes a target word in its own right, and the word it was a synonym of becomes its synonym. That is, if A has synonyms B and C, then B and C both have A as their synonym. Additionally, if B or C is a synonym of another word, say D, then D is also a synonym of B or C. The broadly expanded version assumes a broader connection between words. In the same example, if B and C are synonyms of A, then B has the synonyms A and C, and C has the synonyms A and B, and so on. Likewise, if B or C is also a synonym of another word, its synonyms also include all the synonyms that word possesses.
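The two expansions can be sketched as below. This is one reading of the description above rather than the code that produced the published files: the function names `expand_narrow` and `expand_broad` are hypothetical, the narrow version is assumed to keep the original entries, and the broad version is assumed to take the union of the groups a word appears in without computing a full transitive closure.

```python
def expand_narrow(synonyms):
    """Keep the original entries and add the reverse direction only."""
    expanded = {word: list(syns) for word, syns in synonyms.items()}
    for target, syns in synonyms.items():
        for synonym in syns:
            bucket = expanded.setdefault(synonym, [])
            if target not in bucket:
                bucket.append(target)
    return expanded


def expand_broad(synonyms):
    """Treat each target word and its synonyms as one mutually linked group."""
    expanded = {}
    for target, syns in synonyms.items():
        group = [target, *syns]
        for word in group:
            bucket = expanded.setdefault(word, [])
            for other in group:
                if other != word and other not in bucket:
                    bucket.append(other)
    return expanded


# The A/B/C/D example from the text: A has B and C, and B is also a synonym of D.
base = {"A": ["B", "C"], "D": ["B"]}
narrow = expand_narrow(base)
broad = expand_broad(base)
print(narrow["B"])  # the words that list B as a synonym
print(broad["B"])   # every word sharing a group with B
```

Under these assumptions both expansions end up with the same set of target words, which is consistent with the two files having the same number of entries; they differ only in how long each synonym list is.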
Because of polysemy in natural language, a synonym corpus will not automatically give you the correct synonym for words that have different meanings, and thus different kinds of synonyms, in different contexts. One way around this is to build a statistical or machine learning language model that uses the linguistic context to disambiguate when looking up the synonyms of a given word.
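As a very rough illustration of using context, the sketch below ranks candidate synonyms by how often they co-occur with the context words in a tokenized corpus. It is a simple counting heuristic standing in for a real language model, not something this repository provides: the function names `cooccurrence_counts` and `rank_synonyms` are hypothetical, and the three toy sentences are invented for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count how often two distinct tokens appear in the same sentence."""
    counts = Counter()
    for tokens in sentences:
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts


def rank_synonyms(candidates, context, counts):
    """Order candidates by total co-occurrence with the context words."""
    def score(candidate):
        return sum(
            counts[tuple(sorted((candidate, word)))]
            for word in set(context)
            if word != candidate
        )
    # sorted() is stable, so ties keep their original order.
    return sorted(candidates, key=score, reverse=True)


# Invented toy corpus, already tokenized.
toy_sentences = [
    ["山", "雉", "飞"],
    ["山", "雉", "林"],
    ["非法", "出租车", "运营"],
]
counts = cooccurrence_counts(toy_sentences)
# Candidates are the Narrow-version synonyms of "野鸡" quoted earlier.
ranked = rank_synonyms(["非法", "雉", "暗娼"], ["山", "林"], counts)
print(ranked)
```

With a context about mountains and forests, the "pheasant" sense is ranked first in this toy setup. A real system would need a much larger corpus and a proper tokenizer.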