【Objective】 Chinese institution names have complex structures and contain many rare words, which makes them difficult to recognize. Recognizing them correctly matters greatly for downstream tasks in information science, such as information extraction, information retrieval, knowledge mining, and the evaluation of institutional research. 【Method】 Building on the recurrent neural network (RNN) approach from deep learning, and taking into account the characteristics of Chinese characters and words, this paper redefines the input and output of institution name tagging and proposes a character-level recurrent network tagging model. 【Result】 With a word-level recurrent neural network as the baseline, the proposed character-level model clearly improves precision, recall, and F-score in Chinese institution name recognition, with the F-score rising by 1.54%. The gain is larger when the names contain rare words, where the F-score rises by 11.05%. 【Limitations】 Decoding uses a greedy strategy directly, which is prone to getting stuck in local optima; modeling with a conditional random field might instead yield the globally optimal result. 【Conclusion】 The proposed method has a simple architecture and can exploit character-level features for modeling, achieving better results than using word features alone.
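The character-level tagging and greedy decoding described above can be illustrated with a minimal sketch. It is not the paper's implementation: the tag set `B-ORG`/`I-ORG`/`O`, the helper names, and the score rows (standing in for the per-character outputs of an RNN) are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical tag set; the abstract does not state the paper's label scheme.
TAGS = ["B-ORG", "I-ORG", "O"]


def char_level_tags(sentence: str, org_spans: List[Tuple[int, int]]) -> List[str]:
    """Label every character: B-ORG opens a name, I-ORG continues it, O is outside.

    org_spans holds (start, end) character offsets, end exclusive.
    """
    tags = ["O"] * len(sentence)
    for start, end in org_spans:
        tags[start] = "B-ORG"
        for i in range(start + 1, end):
            tags[i] = "I-ORG"
    return tags


def greedy_decode(scores: List[List[float]]) -> List[str]:
    """Pick the highest-scoring tag at each position, ignoring neighbouring tags."""
    return [TAGS[max(range(len(TAGS)), key=row.__getitem__)] for row in scores]


def extract_orgs(sentence: str, tags: List[str]) -> List[str]:
    """Collect character runs that start with B-ORG and continue with I-ORG."""
    orgs: List[str] = []
    current = ""
    for ch, tag in zip(sentence, tags):
        if tag == "B-ORG":
            if current:
                orgs.append(current)
            current = ch
        elif tag == "I-ORG" and current:
            current += ch
        else:
            # Either an O tag or a stray I-ORG with no opening B-ORG.
            if current:
                orgs.append(current)
            current = ""
    if current:
        orgs.append(current)
    return orgs


sentence = "他在北京大学工作"
gold = char_level_tags(sentence, [(2, 6)])
print(gold)
print(extract_orgs(sentence, gold))

# Made-up scores: each row is (B-ORG, I-ORG, O) for one character.
scores = [[0.1, 0.2, 0.7], [0.1, 0.6, 0.3], [0.2, 0.7, 0.1]]
print(greedy_decode(scores))
```

Because `greedy_decode` chooses each tag independently, the made-up scores produce an `I-ORG` that follows an `O` with no opening `B-ORG`, a sequence no valid labeling contains. A sequence-level decoder such as a conditional random field scores whole tag sequences and can rule out such transitions, which is the limitation the abstract points to.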