Automatically building large-scale named entity recognition corpora from Chinese Wikipedia

Source: Journal of Zhejiang University (English Edition), Series C: Computers & Electronics
Named entity recognition (NER) is a core component in many natural language processing applications. Most NER systems rely on supervised machine learning methods, which depend on time-consuming and expensive annotations in different languages and domains. This paper presents a method for automatically building silver-standard NER corpora from Chinese Wikipedia. We refine novel and language-dependent features by exploiting the text and structure of Chinese Wikipedia. To reduce tagging errors caused by entity classification, we design four types of heuristic rules based on the characteristics of Chinese Wikipedia and train a supervised NE classifier; combining the two improves both precision and coverage. We then identify the types of implicit mentions by using the boundary information of outgoing links. By selecting sentences related to the domains of the test data, we can train better NER models. In the experiments, large-scale NER corpora containing 2.3 million sentences are built from Chinese Wikipedia. The results show the effectiveness of the automatically annotated corpora, and the trained NER models achieve the best performance when our silver-standard corpora are combined with gold-standard corpora.
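The core idea of turning outgoing links into annotations can be illustrated with a minimal sketch. This is not the paper's pipeline: the function name `tag_sentence`, the link record layout, the `article_types` mapping, the BIO tagging scheme, and the `ORG`/`LOC` labels are all assumptions made for illustration. It only shows how link boundaries plus a per-article entity type might be projected onto character-level tags, with links whose target has no known type left untagged.

```python
from typing import Dict, List, Tuple

# Hypothetical link record: (start offset, end offset, target article title).
Link = Tuple[int, int, str]


def tag_sentence(
    sentence: str,
    links: List[Link],
    article_types: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Project outgoing-link spans onto character-level BIO tags."""
    tags = ["O"] * len(sentence)
    for start, end, target in links:
        etype = article_types.get(target)
        if etype is None:
            # The target article has no known entity type: leave it untagged.
            continue
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return list(zip(sentence, tags))


# Illustrative data only; the type mapping stands in for the output of
# an entity classification step (assumed, not taken from the paper).
sentence = "浙江大学位于杭州"
links = [(0, 4, "浙江大学"), (6, 8, "杭州市")]
article_types = {"浙江大学": "ORG", "杭州市": "LOC"}
tagged = tag_sentence(sentence, links, article_types)
```

Character-level tags are used here because Chinese text has no whitespace word boundaries; a word-level variant would need a segmenter first.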