The current Hadoop implementation mainly targets homogeneous clusters and assumes that the data a task processes is largely local. In practice, however, clusters are mostly heterogeneous, which exposes how little the existing data allocation strategy accounts for data locality: the unnecessary data transfers it triggers waste substantial bandwidth and transmission time. By relating data placement to task execution in Hadoop, data can be allocated according to each node's ability to execute different tasks. Taking the inherent performance of nodes in a heterogeneous cluster into account, this paper proposes an inter-rack data placement strategy based on task characteristics and node computing capability. The strategy strengthens data locality so that each node accesses only local data as far as possible. Experiments show that the strategy effectively shortens job execution time and improves timeliness; it also increases data locality, reduces network data transfer, and avoids congestion; finally, it exhibits good stability.
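The core idea of capacity-proportional placement can be illustrated with a short sketch. This is not the paper's actual implementation; it is a minimal illustration assuming each node's per-task-type processing rate has already been measured (the Node class, the processingRate field, and the allocate method are all hypothetical names introduced here). Each node receives a share of the input blocks proportional to its rate, so all nodes can finish their local data at roughly the same time.

import java.util.*;

public class CapacityProportionalPlacement {

    // Hypothetical node descriptor: name plus a measured processing rate
    // (blocks per second) for the task type being scheduled.
    static final class Node {
        final String name;
        final double processingRate;
        Node(String name, double processingRate) {
            this.name = name;
            this.processingRate = processingRate;
        }
    }

    // Returns how many blocks each node should store locally, proportional
    // to its share of the cluster's total processing rate.
    static Map<String, Integer> allocate(List<Node> nodes, int totalBlocks) {
        double totalRate = nodes.stream().mapToDouble(n -> n.processingRate).sum();
        Map<String, Integer> plan = new LinkedHashMap<>();
        int assigned = 0;
        for (Node n : nodes) {
            int share = (int) Math.floor(totalBlocks * n.processingRate / totalRate);
            plan.put(n.name, share);
            assigned += share;
        }
        // Hand leftover blocks (lost to rounding down) to the fastest nodes first.
        List<Node> byRate = new ArrayList<>(nodes);
        byRate.sort((a, b) -> Double.compare(b.processingRate, a.processingRate));
        for (int i = 0; assigned < totalBlocks; i = (i + 1) % byRate.size(), assigned++) {
            String name = byRate.get(i).name;
            plan.put(name, plan.get(name) + 1);
        }
        return plan;
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
                new Node("fast-node", 4.0),   // e.g. a newer machine
                new Node("mid-node", 2.0),
                new Node("slow-node", 1.0));  // older hardware
        // 70 blocks split 4:2:1 -> {fast-node=40, mid-node=20, slow-node=10}
        System.out.println(allocate(cluster, 70));
    }
}

With this division, a node that processes blocks twice as fast holds twice as many blocks locally, so map tasks can run on local data and the cross-rack transfers the abstract criticizes are avoided; the paper's full strategy additionally weights the allocation by task characteristics, which this sketch omits.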