In some models of parallel computation, jobs are split into smaller tasks that can be executed completely asynchronously. In other settings, the parallel tasks are subject to constraints that require them to synchronize their start times and possibly their departure times. This is true of many parallelized machine learning workloads, and the popular Apache Spark processing engine has recently added support for Barrier Execution Mode, which allows users to add such barriers to their jobs. These barriers necessarily introduce idle periods on some of the workers, which reduces stability and performance compared to equivalent workloads with no barriers. In this paper we analyze the stability and performance penalties resulting from barriers. We include an analysis of the stability of $(s,k,l)$ barrier systems, which allow a job to depart after $l$ out of its $k$ tasks complete. We also derive and evaluate performance bounds for hybrid barrier systems serving a mix of jobs, both with and without barriers and with varying degrees of parallelism. For the purely 1-barrier case we compare the bounds and simulation results against benchmark data from a standalone Spark deployment. We study the overhead observed in the real system and, based on its distribution, attribute it to the dual event- and polling-driven mechanism used to schedule barrier-mode jobs. We develop a model for this type of overhead and validate it against the real system through simulation.