Because general-purpose search engines currently judge the topic of Tibetan web pages poorly, this paper designs a topic-focused Tibetan web page collection algorithm based on an improved vector space model. Unlike traditional methods, the algorithm accounts for the influence that content in different page markup tags has on the topic: it classifies Tibetan guide words according to the tag in which they appear, and it uses experimentally determined thresholds for the "number of guide words" and the "topic relevance". The computed result is then used to judge how relevant a page is to the topic. The algorithm was implemented by modifying key modules of the Heritrix crawler and tested on China Tibet Online (中国西藏网, Tibetan edition). In total, 550 Tibetan web pages were collected, with a topic-relevance accuracy of 62%.
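The idea of weighting guide words by the tag they appear in and then applying the two thresholds can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the tag weights, the default threshold values, the function names, and the use of plain substring matching in place of real Tibetan word segmentation are all hypothetical, since the abstract does not give them.

```python
import math
from collections import Counter

# Hypothetical per-tag weights; the abstract does not list the actual values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5, "body": 1.0}


def weighted_term_vector(page_fields, guide_words):
    """Build a guide-word vector where each hit is scaled by its tag weight.

    page_fields maps a tag name to the text found inside that tag.
    Substring counting stands in for proper Tibetan segmentation (assumption).
    """
    vector = Counter()
    for tag, text in page_fields.items():
        weight = TAG_WEIGHTS.get(tag, 1.0)
        for word in guide_words:
            count = text.count(word)
            if count:
                vector[word] += weight * count
    return vector


def cosine_similarity(vec_a, vec_b):
    """Standard vector space model cosine similarity between sparse vectors."""
    dot = sum(value * vec_b.get(term, 0.0) for term, value in vec_a.items())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_on_topic(page_fields, topic_vector, min_guide_words=2, min_relevance=0.5):
    """Apply both thresholds: distinct guide words found, then relevance.

    The default threshold values are placeholders, not the paper's results.
    """
    page_vector = weighted_term_vector(page_fields, topic_vector.keys())
    if len(page_vector) < min_guide_words:
        return False
    return cosine_similarity(page_vector, topic_vector) >= min_relevance


if __name__ == "__main__":
    # Made-up topic vector and page fields, for illustration only.
    topic = {"བོད": 1.0, "རིག་གནས": 1.0}
    page = {"title": "བོད་ཀྱི་རིག་གནས", "body": "བོད་ བོད་"}
    print(is_on_topic(page, topic))
```

In a crawler such as Heritrix, a check of this kind would sit in the processing step that decides whether a fetched page is kept and whether its links are followed. Where exactly the authors placed it is not stated in the abstract.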