Exploiting the correlation between a web page's structural design and its textual content, this paper proposes a method for extracting text elements from multiple types of web pages by fusing structural and content features. The page title is extracted from the relationship between the title element in the page head and the content of the page body. For the body text, several structural and content features of the main-text region are extracted and used to classify the page's DOM nodes; rules for expanding and merging nodes then yield candidate body blocks, and a density value and an influence factor are introduced to identify the body block among the candidates. The publication time is extracted with regular expressions, using its position relative to the title and the body. Extraction experiments on Chinese news sites, blogs, forums, and post bars (tieba) show that the method performs well.
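As an illustration of the publication-time step, the following is a minimal Python sketch, not the paper's implementation. It assumes the title and body have already been located as plain strings in the page text, and it searches the gap between them before falling back to the whole page. The regular expression, the function names `extract_publish_time` and `_normalize`, and the normalized output format are all hypothetical choices made for this example; the abstract does not specify the paper's actual patterns.

```python
import re
from typing import Optional

# Hypothetical pattern (an assumption, not taken from the paper). It covers forms
# such as 2017-03-05 14:30, 2017/3/5, 2017.3.5 and 2017年3月5日 14:30:02.
_TIME_RE = re.compile(
    r"(\d{4})\s*[-/.年]\s*(\d{1,2})\s*[-/.月]\s*(\d{1,2})\s*日?"
    r"(?:\s*(\d{1,2}):(\d{2})(?::(\d{2}))?)?"
)


def _normalize(match: "re.Match[str]") -> str:
    """Format a match as YYYY-MM-DD, or YYYY-MM-DD HH:MM:SS when a time is present."""
    year, month, day, hour, minute, second = match.groups()
    date = f"{int(year):04d}-{int(month):02d}-{int(day):02d}"
    if hour is None:
        return date
    return f"{date} {int(hour):02d}:{minute}:{second or '00'}"


def extract_publish_time(page_text: str, title: str, body: str) -> Optional[str]:
    """Return the first date-like string between title and body, else anywhere on the page."""
    title_pos = page_text.find(title)
    start = title_pos + len(title) if title_pos >= 0 else 0
    body_pos = page_text.find(body, start)
    end = body_pos if body_pos >= 0 else len(page_text)
    # Preferred window: the gap between the title and the body.
    # Fallback window: the entire page text.
    for window in (page_text[start:end], page_text):
        match = _TIME_RE.search(window)
        if match:
            return _normalize(match)
    return None


if __name__ == "__main__":
    page = (
        "Site nav 2001-01-01\n"
        "Some headline\n"
        "来源: X 2017年3月5日 14:30\n"
        "Body text here mentions 2016-12-01."
    )
    print(extract_publish_time(page, "Some headline", "Body text here"))
```

Searching the title-to-body gap first reflects the positional relationship the abstract mentions: dates in navigation bars or inside the article text are ignored unless nothing is found near the title. The pattern is deliberately permissive and does not validate that the month and day are in range.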