Large-scale word-segmented Chinese corpora are an important resource for research in linguistics and computational linguistics. Consistency is one of the key criteria for judging the quality of a segmented corpus. This paper enumerates the main structural types that give rise to inconsistency in segmented corpora, discusses the distinction between “grammatical words” and “psychological words”, and argues that a segmented corpus is best segmented into “psychological words”. Because “psychological words” are inherently fuzzy, complete consistency in the strict sense is unattainable for a segmented corpus; the goal we pursue should instead be adjusted to consistency under controlled conditions.