Abstract:
Content extraction underpins many data mining technologies; it aims to extract the most valuable information from data-intensive web pages f
Affiliation:
School of Information and Communication Engineering, Beijing University of Posts and Telecommunicatio
Source:
The 6th China Conference on Sensor Networks (CWSN 2012)
Excerpt from the paper
Content extraction underpins many data mining technologies; it aims to extract the most valuable information from data-intensive web pages full of noise. Traditional statistics-based content extraction cannot handle documents with short content, tabular text, or long comment sections. By studying the positional relation between a page's title and its content, this paper proposes a new method for extracting web page content. The method constructs a title and content dependency tree (TCDT), locates the content with the smallest dependency distance, and extracts the page content accurately by combining the dependency relation between title and content with the statistical information of the page. Experiments on several websites show that the method not only compensates for the deficiencies of the statistical approach but also achieves better precision in content extraction.
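The idea of locating content by its distance from the title can be illustrated with a minimal sketch. The abstract does not define how the TCDT is built or how dependency distance is measured, so everything below is an assumption made for illustration: the DOM tree stands in for the dependency tree, the number of tree edges between the title heading and a text block stands in for the dependency distance, and a minimum character count stands in for the statistical information. The names `Node`, `TreeBuilder`, `tree_distance`, `extract_content` and `min_chars` are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no closing tag
SKIP = {"script", "style"}                            # text that is never content


class Node:
    """One element of a simplified DOM tree (stand-in for a TCDT node)."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children, self.text = tag, parent, [], ""
        if parent is not None:
            parent.children.append(self)

    def path_to_root(self):
        node, path = self, []
        while node is not None:
            path.append(node)
            node = node.parent
        return path

    def walk(self):
        yield self
        for child in self.children:
            yield from child.walk()


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close up to the nearest open element with this tag, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        piece = data.strip()
        if piece and self.current.tag not in SKIP:
            self.current.text = (self.current.text + " " + piece).strip()


def tree_distance(a, b):
    """Number of edges between two nodes via their lowest common ancestor."""
    steps_from_a = {id(n): i for i, n in enumerate(a.path_to_root())}
    for j, n in enumerate(b.path_to_root()):
        if id(n) in steps_from_a:
            return steps_from_a[id(n)] + j
    raise ValueError("nodes are not in the same tree")


def extract_content(html, min_chars=20):
    """Return the text block closest to the title heading (assumed heuristic)."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    nodes = list(builder.root.walk())
    page_title = next((n.text.lower() for n in nodes if n.tag == "title"), "")
    # Assumption: the visible title is a heading whose text occurs in <title>.
    title_node = next(
        (n for n in nodes
         if n.tag in ("h1", "h2") and n.text and n.text.lower() in page_title),
        None,
    )
    # Statistical filter (assumed): ignore very short blocks such as nav links.
    blocks = [n for n in nodes
              if n is not title_node and n.tag != "title"
              and len(n.text) >= min_chars]
    if not blocks:
        return ""
    if title_node is None:
        # No title found: fall back to the purely statistical choice.
        return max(blocks, key=lambda n: len(n.text)).text
    best = min(blocks, key=lambda n: (tree_distance(title_node, n), -len(n.text)))
    return best.text


SAMPLE = (
    "<html><head><title>Sensor news - Example Site</title></head><body>"
    "<div id='nav'><a>Home</a><a>About</a></div>"
    "<div id='main'><h1>Sensor news</h1>"
    "<p>A short article body of one sentence.</p></div>"
    "<div id='comments'><p>" + "Great post, thanks! " * 10 + "</p></div>"
    "</body></html>"
)

if __name__ == "__main__":
    print(extract_content(SAMPLE))
```

In the sample page the comment block is far longer than the article body, so a length-only statistic would pick the comments. The distance criterion instead selects the short paragraph that shares a parent with the title heading, which is the kind of case the abstract says statistical methods handle poorly.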