Effective GPU Sharing Under Compiler Guidance

Modern computing platforms tend to deploy multiple GPUs (2, 4, or more) on a single node to boost system performance, with each GPU offering a large global memory capacity and many streaming multiprocessors (SMs). GPUs are an expensive resource, and boosting their utilization without degrading the performance of individual workloads is an important and challenging problem. Although services such as NVIDIA's Multi-Process Service (MPS) support simultaneous execution of multiple cooperative kernels on a single device, they do not solve this problem for uncooperative kernels, because MPS is oblivious to the resource needs of each kernel. We propose a fully automated, compiler-assisted scheduling framework. The compiler constructs GPU tasks by identifying kernel launches and their related GPU operations (e.g., memory allocations). For each GPU task, the compiler inserts a probe into the host-side code right before the task's launch point. At runtime, the probe conveys the task's resource requirements (e.g., memory and compute cores) to a scheduler, which places the task on an appropriate device based on those requirements and the devices' current load, in a memory-safe and resource-aware manner. To demonstrate the framework's advantages, we prototyped a throughput-oriented scheduler on top of it and evaluated it with the Rodinia benchmark suite and the Darknet neural network framework on NVIDIA GPUs. The results show that the proposed solution outperforms existing state-of-the-art solutions by leveraging its knowledge of applications' multiple resource requirements, which include memory as well as SMs. It improves throughput by up to 2.5x for Rodinia benchmarks and up to 2.7x for Darknet neural networks. In addition, it improves job turnaround time by up to 4.9x and limits individual kernel performance degradation to at most 2.5%.
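The placement step described in the abstract can be illustrated with a minimal sketch. This is not the authors' scheduler: the names `Device`, `TaskRequest`, and `place` are hypothetical, and the greedy policy (filter by free memory, then prefer the device with the most free SMs) is an assumption chosen only to show what "memory-safe, resource-aware" placement could look like.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Device:
    # Hypothetical bookkeeping record kept by the scheduler for one GPU.
    name: str
    free_mem_mb: int  # global memory not yet reserved by placed tasks
    free_sms: int     # streaming multiprocessors not yet reserved


@dataclass
class TaskRequest:
    # Hypothetical message a probe sends just before a task launches.
    mem_mb: int  # memory the task will allocate
    sms: int     # SMs the task's kernels can occupy


def place(task: TaskRequest, devices: List[Device]) -> Optional[Device]:
    """Pick a device for the task, or return None if the task must wait."""
    # Memory safety: only consider devices that can hold the task's memory.
    fits = [d for d in devices if d.free_mem_mb >= task.mem_mb]
    if not fits:
        return None
    # Resource awareness (assumed policy): prefer the device with the
    # most free SMs so co-located kernels contend as little as possible.
    best = max(fits, key=lambda d: d.free_sms)
    # Reserve the resources; SMs are clamped at zero if oversubscribed.
    best.free_mem_mb -= task.mem_mb
    best.free_sms = max(0, best.free_sms - task.sms)
    return best


devices = [
    Device("gpu0", free_mem_mb=4000, free_sms=10),
    Device("gpu1", free_mem_mb=16000, free_sms=40),
    Device("gpu2", free_mem_mb=16000, free_sms=80),
]
chosen = place(TaskRequest(mem_mb=8000, sms=30), devices)
print(chosen.name if chosen else "wait")
```

A task whose memory request exceeds every device's free memory gets `None`, which models the scheduler holding the task back rather than risking an out-of-memory failure. A fuller sketch would also return resources to the device when the task finishes.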
