Content extraction from Chinese web page based on title and content dependency tree

Source: 6th China Conference on Sensor Networks (CWSN 2012)
  Content extraction underlies many data mining technologies; it aims to extract the most valuable information from data-intensive, noisy web pages. Traditional statistics-based content extraction cannot handle documents with short content, tabular text, or long comment sections. By studying the positional relation between title and content, this paper proposes a new method for extracting web page content: it constructs a title and content dependency tree (TCDT), locates the content with the smallest dependency distance, and extracts page content accurately by combining the dependency relation between title and content with the statistical information of the page. Experiments on several websites show that the method not only compensates for the deficiencies of the statistical method but also achieves better precision in content extraction.
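
The abstract does not give the TCDT construction, so the following is only a minimal sketch of the general idea. It assumes that "dependency distance" can be approximated by the path length between the title node and a text node in the parsed HTML tree, and that the statistical information is reduced to a minimum text length plus a length-based tie-break. The names `Node`, `TreeBuilder`, `tree_distance`, `extract_content` and the `min_chars` parameter are hypothetical and do not come from the paper.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
SKIP_TAGS = {"title", "script", "style"}


class Node:
    """One element of the parsed page, holding its own direct text."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple element tree (an assumed stand-in for the TCDT)."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text = (self.current.text + " " + data.strip()).strip()


def ancestors(node):
    """Return the path from node up to the root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path


def tree_distance(a, b):
    """Number of edges on the path between a and b (assumed distance measure)."""
    depth_in_b = {id(n): i for i, n in enumerate(ancestors(b))}
    for i, n in enumerate(ancestors(a)):
        if id(n) in depth_in_b:
            return i + depth_in_b[id(n)]
    return float("inf")


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def extract_content(html, min_chars=10):
    """Return the text block closest to the first h1/h2 title in the tree."""
    builder = TreeBuilder()
    builder.feed(html)
    nodes = list(walk(builder.root))
    title = next((n for n in nodes if n.tag in ("h1", "h2") and n.text), None)
    if title is None:
        return ""
    candidates = [n for n in nodes
                  if n is not title and n.tag not in SKIP_TAGS
                  and len(n.text) >= min_chars]
    if not candidates:
        return ""
    # Smallest distance wins; among ties, prefer the longer text block.
    best = min(candidates, key=lambda n: (tree_distance(title, n), -len(n.text)))
    return best.text
```

On a page where a short article body sits next to its heading and a much longer comment sits in a separate subtree, this sketch selects the article body, whereas a purely length-based heuristic would select the comment. That is the kind of case the abstract says statistical methods handle poorly.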