A new scheduler aware of task size and node capability for Spark Streaming
@u2009cf
In Radar, we design a new scheduler based on task size and node capability. The traditional locality-aware scheduler is not well suited to heterogeneous tasks and shared clusters. We propose to give the scheduler two extra signals. The first is task size, which can be read from HDFS. The second is node capability, which can be estimated by exploiting the recurring nature of Spark Streaming batches. Scheduling decisions follow three principles: (1) largest task first, so that a large task's impact on stage execution time is amortized across multiple waves; (2) large tasks to fast nodes and small tasks to slow nodes, for better load balancing; (3) each node picks the task at the position in the size order that corresponds to the node's rank in the capability order. With this scheduler, we avoid 86.57% of speculative tasks, reduce latency by 20.96%, and save about 10% of resources in Tencent production clusters.
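The three principles above can be sketched as a simple greedy assignment. This is a minimal illustrative sketch, not the package's actual implementation: the `schedule` function, its inputs (task sizes as bytes, node capability scores), and the round-robin interpretation of principle (3) are all assumptions for illustration.

```python
def schedule(tasks, nodes):
    """Illustrative greedy scheduler sketch (not the package's real code).

    tasks: dict mapping task_id -> input size (e.g. bytes, from HDFS)
    nodes: dict mapping node_id -> capability score (higher = faster)
    Returns: dict mapping node_id -> list of assigned task_ids.
    """
    # Principle (1): consider the largest tasks first, so their cost is
    # amortized across multiple waves of execution.
    pending = sorted(tasks, key=tasks.get, reverse=True)
    # Rank nodes fastest-first by their estimated capability.
    ranked_nodes = sorted(nodes, key=nodes.get, reverse=True)
    n = len(ranked_nodes)

    assignment = {node: [] for node in nodes}
    # Principles (2) and (3): walk the size-ordered task list and hand the
    # task at position i to the node at the corresponding capability rank
    # (i mod n), so fast nodes receive large tasks and slow nodes small ones.
    for i, task in enumerate(pending):
        assignment[ranked_nodes[i % n]].append(task)
    return assignment
```

For example, with two nodes where `fast` has twice the capability of `slow`, the two largest tasks in each wave go to `fast` and the smaller ones to `slow`, which balances per-wave load.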
This package doesn't have any releases published in the Spark Packages repo, or with maven coordinates supplied. You may have to build this package from source, or it may simply be a script. To use this Spark Package, please follow the instructions in the README.