Memory Efficient Two-Pass 3D FFT Algorithm for Intel Xeon Phi(TM) Coprocessor

Source: Journal of Computer Science and Technology
Equipped with 512-bit wide SIMD instructions and a large number of computing cores, the emerging x86-based Intel Many Integrated Core (MIC) Architecture provides not only high floating-point performance but also substantial off-chip memory bandwidth. The three-dimensional fast Fourier transform (3D FFT) is a widely studied algorithm; however, the conventional algorithm needs to traverse the data array three times. In each pass, it computes multiple 1D FFTs along one of the three dimensions, giving rise to many non-unit-stride memory accesses. In this paper, we propose a two-pass 3D FFT algorithm, which mainly aims to reduce the amount of explicit data transfer between the memory and the on-chip cache. The main idea is to split one dimension into two sub-dimensions, and then combine the transform along each sub-dimension with the transform along one of the remaining dimensions. The difference in the number of TLB misses resulting from decomposition along different dimensions is analyzed in detail. Multi-level parallelism is leveraged on the many-core system for a high degree of parallelism and better data reuse in the local cache. On top of this, a number of optimization techniques, such as memory padding, loop transformation, and vectorization, are employed in our implementation to further enhance performance. We evaluate the algorithm on the Intel Xeon Phi(TM) coprocessor 7110P and achieve a maximum performance of 136 Gflops with 240 threads in offload mode, which outperforms the vendor-specific Intel MKL library by a factor of up to 2.22x.
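The dimension-splitting idea can be illustrated with a small sketch. This is not the paper's implementation: it uses naive O(n^2) DFTs on a dict-backed array, and the choice of which dimension to split, the index mapping, and all names (`dft_axis`, `two_pass_fft3d`, `naive_dft3d`) are assumptions made for illustration. It only shows that splitting the middle dimension of size `nb1 * nb2` in the Cooley-Tukey manner lets the 3D transform be grouped into two passes, with one twiddle multiplication in between, and that the result matches a conventional three-pass transform.

```python
import cmath
from itertools import product


def dft_axis(data, shape, axis):
    """Naive O(n^2) DFT along one axis of a dict-backed N-D array."""
    n = shape[axis]
    w = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]
    out = {}
    for idx in product(*(range(s) for s in shape)):
        k = idx[axis]
        acc = 0j
        for j in range(n):
            src = idx[:axis] + (j,) + idx[axis + 1:]
            acc += data[src] * w[(j * k) % n]
        out[idx] = acc
    return out


def two_pass_fft3d(x, na, nb1, nb2, nc):
    """3D DFT of x[(a, b, c)] in two passes; the middle size is nb1 * nb2."""
    nb = nb1 * nb2
    shape = (na, nb1, nb2, nc)
    # View the middle index b as (b1, b2) with b = b1 * nb2 + b2.
    v = {(a, b1, b2, c): x[(a, b1 * nb2 + b2, c)]
         for a, b1, b2, c in product(*(range(s) for s in shape))}
    # Pass 1: transform along a together with the first sub-dimension b1.
    v = dft_axis(dft_axis(v, shape, 0), shape, 1)
    # Twiddle factors that couple the two sub-dimensions.
    v = {(ka, k1, b2, c): val * cmath.exp(-2j * cmath.pi * b2 * k1 / nb)
         for (ka, k1, b2, c), val in v.items()}
    # Pass 2: transform along the second sub-dimension b2 together with c.
    v = dft_axis(dft_axis(v, shape, 2), shape, 3)
    # The output frequency along the split dimension is kb = k1 + nb1 * k2.
    return {(ka, k1 + nb1 * k2, kc): val
            for (ka, k1, k2, kc), val in v.items()}


def naive_dft3d(x, na, nb, nc):
    """Conventional reference: one pass per dimension, three passes in total."""
    shape = (na, nb, nc)
    for axis in range(3):
        x = dft_axis(x, shape, axis)
    return x


if __name__ == "__main__":
    na, nb1, nb2, nc = 2, 2, 3, 2
    x = {(a, b, c): complex(a + 2 * b, c - a)
         for a in range(na) for b in range(nb1 * nb2) for c in range(nc)}
    got = two_pass_fft3d(x, na, nb1, nb2, nc)
    ref = naive_dft3d(x, na, nb1 * nb2, nc)
    print("max abs error:", max(abs(got[k] - ref[k]) for k in ref))
```

In a real implementation each pass would presumably be fused into a single traversal of the array so that every element is loaded from memory only twice; the sketch keeps the two axes of each pass as separate calls purely for readability.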