Partial Task Shuffle First Strategy for Spark

DOI: 10.23977/amce.2019.010

Author(s)

Tianlei Zhou, Yuyang Wang

Corresponding Author

Tianlei Zhou

ABSTRACT

Apache Spark is an in-memory distributed computing framework that is better suited to iterative jobs than MapReduce. However, the shuffle process must synchronize tasks across nodes, which can waste cluster computing resources and ultimately reduce cluster performance. This is an important factor limiting the performance of Spark. In this paper, we propose a Partial Task Shuffle First (PTSF) strategy that dynamically generates Shuffle Write tasks and performs Shuffle operations on the tasks that have already completed. The strategy increases the degree of parallelism between data computation and transmission and lowers the peak load of the Shuffle stage, so that the cluster stays more balanced during execution. Experiments show that the proposed strategy improves Shuffle execution efficiency.
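
The intuition behind shuffling partially completed tasks first can be illustrated with a toy timing model. The sketch below is not the PTSF implementation and does not use any Spark API. It assumes, purely for illustration, that each map task has a known compute time and a known transfer time, and that all transfers share one serial network link. The function names `barrier_shuffle_finish` and `partial_shuffle_finish` are hypothetical.

```python
def barrier_shuffle_finish(compute_times, transfer_times):
    """Finish time when no transfer starts until every map task is done.

    Toy assumption: transfers run one after another on a single link.
    """
    barrier = max(compute_times)          # wait for the slowest task
    return barrier + sum(transfer_times)  # then send everything serially


def partial_shuffle_finish(compute_times, transfer_times):
    """Finish time when each task's output is sent as soon as it is ready.

    Tasks are served in order of completion; the link is still serial,
    so a transfer waits if the link is busy with an earlier one.
    """
    link_free_at = 0.0
    for done, transfer in sorted(zip(compute_times, transfer_times)):
        # Start when both the task output and the link are available.
        link_free_at = max(link_free_at, done) + transfer
    return link_free_at


if __name__ == "__main__":
    compute = [2, 4, 6, 10]   # hypothetical task compute times
    transfer = [3, 3, 3, 3]   # hypothetical per-task transfer times
    print("barrier:", barrier_shuffle_finish(compute, transfer))
    print("partial:", partial_shuffle_finish(compute, transfer))
```

Under these assumptions the overlapped schedule finishes no later than the barrier schedule, because transfers for early tasks proceed while slower tasks are still computing. The actual gain in a real cluster depends on factors this model ignores, such as disk I/O, multiple links, and scheduling overhead.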

KEYWORDS

Big data, Spark, shuffle, task

All published work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright © 2016 - 2031 Clausius Scientific Press Inc. All Rights Reserved.