Frequency and mutual information have been the most important features in automatic Chinese new-word discovery in recent years, and they are also listed among the word-selection principles for compiling dictionaries of modern Chinese. This paper takes all two-character and three-character words containing the character "蛋" (dàn, "egg") in the Modern Chinese Dictionary (《现代汉语词典》, 6th edition) as its object of study, counts their frequencies in the Peking University CCL corpus and the Central China Normal University Cici corpus, and computes their mutual information. Comparing the words included in the dictionary with a number of words that are not included, however, shows that some included words have both lower frequency and lower mutual information than some excluded words. From the analysis of multiple sets of frequency and mutual-information values, it can be inferred that, in compiling the Modern Chinese Dictionary, a word's frequency and mutual information are in fact less decisive than the compilers' linguistic intuition.
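The abstract does not state which formula is used for mutual information. As a minimal sketch, assuming the common pointwise definition for a two-character word, PMI(x, y) = log2(P(xy) / (P(x) P(y))), the value can be computed from raw corpus counts as follows. The function name `pmi` and all counts below are hypothetical illustrations, not figures from the CCL or Cici corpora, and how the paper handles three-character words is not specified here.

```python
import math


def pmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Pointwise mutual information (base 2) of a two-character string xy.

    Assumed definition: log2(P(xy) / (P(x) * P(y))), with every probability
    estimated as count / total. `total` is the corpus size used as the common
    denominator (an assumption; the paper may normalise differently).
    """
    if total <= 0 or min(count_xy, count_x, count_y) <= 0:
        raise ValueError("all counts and the corpus size must be positive")
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts, for illustration only.
strongly_bound = pmi(count_xy=500, count_x=1000, count_y=2000, total=1_000_000)
near_independent = pmi(count_xy=2, count_x=1000, count_y=2000, total=1_000_000)
print(f"strongly bound pair:   {strongly_bound:.3f}")
print(f"near-independent pair: {near_independent:.3f}")
```

A high value means the two characters co-occur far more often than chance would predict, which is why the measure is used as a signal of wordhood. A value near zero means the pair behaves like two independent characters.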