Redundancy Elimination in Multi-signature Based Parallel Entity Resolution

Source: Journal of Donghua University (English Edition) | Citations: 0
The multi-signature method can improve the accuracy of entity resolution. However, it introduces redundant computation in a parallel processing framework. In this paper, a multi-signature based parallel entity resolution method called multi-sig-er is proposed. The method was implemented in a MapReduce-based framework, which first tagged each input object with multiple signatures and used these signatures to generate key-value pairs, then shuffled the pairs to the reduce tasks responsible for similarity computation. To improve performance, two strategies were adopted. One prunes the candidate pairs produced by the blocking technique, and the other eliminates redundancy according to the transitive property. Both strategies reduce the number of similarity computations without affecting the resolution accuracy. Experimental results on real-world datasets show that the method is better suited to large datasets than to small ones, and to complex similarity computation than to simple similarity matching.
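The pipeline summarized in the abstract can be illustrated with a small single-process sketch. It is not the multi-sig-er implementation: the signature function (first three letters of each token), the Jaccard similarity, the threshold, and all names (`signatures`, `jaccard`, `UnionFind`, `resolve`) are assumptions chosen for illustration, and the map, shuffle, and reduce phases are simulated in one loop rather than run on a MapReduce cluster. It shows the two ideas only in spirit: a pair that shares several signatures is compared at most once, and a pair already linked through the transitive property is not compared at all.

```python
from collections import defaultdict
from itertools import combinations


def signatures(text):
    """Hypothetical signature function: first three letters of each token."""
    return {tok[:3].lower() for tok in text.split()}


def jaccard(a, b):
    """Token-level Jaccard similarity (a stand-in for any similarity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


class UnionFind:
    """Tracks which records are already known to refer to the same entity."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def resolve(records, threshold=0.5):
    """Return (clusters, number of similarity computations performed)."""
    # "Map" + "shuffle": emit (signature, record id) and group by signature.
    blocks = defaultdict(list)
    for rid, text in records.items():
        for sig in signatures(text):
            blocks[sig].append(rid)

    uf = UnionFind()
    seen = set()        # pairs already considered in an earlier block
    comparisons = 0

    # "Reduce": compare candidate pairs inside each block.
    for sig in sorted(blocks):
        for a, b in combinations(sorted(blocks[sig]), 2):
            if (a, b) in seen:
                continue            # same pair produced by another signature
            seen.add((a, b))
            if uf.find(a) == uf.find(b):
                continue            # already matched via transitivity
            comparisons += 1
            if jaccard(records[a], records[b]) >= threshold:
                uf.union(a, b)

    clusters = defaultdict(set)
    for rid in records:
        clusters[uf.find(rid)].add(rid)
    return sorted(sorted(c) for c in clusters.values()), comparisons
```

Skipping a pair that is already in the same cluster cannot change the final clusters, since the output is the transitive closure of the matched pairs either way. In a real MapReduce job the reduce tasks do not share state, so the `seen` set and the union-find structure used here would have to be replaced by a scheme that works across independent tasks; this sketch makes no claim about how the paper does that.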