Data, information, and knowledge resources on the Internet are growing exponentially. Public resources are obtained through traditional search engines, while internal resources are obtained through site search. This separation of access channels leads to incomplete information collection and therefore poses new challenges for data fusion methods. Existing data fusion methods are inflexible, complex to integrate, and prone to high information loss. This paper proposes a new method for fusing internal and external data that integrates self-developed resource acquisition components with mature commercial service modules. By constructing an application model, an academic search engine for large institutions is built, forming an application platform that fuses internal and external data with strong scalability, strong real-time performance, and high extraction precision. This work has successfully collected 586,572 web pages, 34,737 videos, and 47,390 papers from 244 institutions affiliated with the Chinese Academy of Sciences and related organizations, and it provides academic resource retrieval services to the faculty and students of the Chinese Academy of Sciences.