Over the last decade, the relative latency of shared-memory access in multicore processors has increased, as wire resistance came to dominate latency and low-wire-density layouts pushed multiport memories farther from their ports. Various techniques have been deployed to improve average memory access latency, such as speculative prefetching and branch prediction, but they often lead to high variance in execution time, which is unacceptable in real-time systems. Smart DMAs can copy data directly into a level-1 (L1) SRAM, but with overhead. The VLIW architecture, the de facto signal processing engine, suffers badly when lockstep execution of scalar and vector instructions breaks down. We describe the Split Latency Adaptive Pipeline (SLAP) VLIW architecture, a cache performance improvement technology that requires zero change to object code while removing smart DMAs and their overhead. SLAP builds on the Decoupled Access and Execute concept by 1) breaking lockstep execution of functional units, 2) enabling variable vector length for variable data-level parallelism, and 3) adding a novel triangular load mechanism. We discuss the SLAP architecture and demonstrate its performance benefits on real traces from a wireless baseband system, where even the most compute-intensive functions suffer from an Amdahl's law limitation due to a mixture of scalar and vector processing.
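The Amdahl's law limitation mentioned above can be made concrete with a minimal sketch. It assumes the textbook form of the law; the function name `amdahl_speedup` and the fractions used are illustrative assumptions, not figures from the SLAP evaluation.

```python
def amdahl_speedup(vector_fraction: float, vector_speedup: float) -> float:
    """Overall speedup under Amdahl's law (textbook form).

    vector_fraction: share of the original run time that is vectorizable
                     (hypothetical value, not taken from the paper).
    vector_speedup:  speedup applied to that share, e.g. the vector width.
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be in [0, 1]")
    if vector_speedup <= 0.0:
        raise ValueError("vector_speedup must be positive")
    scalar_fraction = 1.0 - vector_fraction
    # The scalar share runs at the original speed; only the vector share shrinks.
    return 1.0 / (scalar_fraction + vector_fraction / vector_speedup)


if __name__ == "__main__":
    # Illustrative only: 90% vectorizable work on a 16-wide vector unit.
    print(amdahl_speedup(0.9, 16.0))
```

Under these assumed numbers, the remaining 10% of scalar work caps the overall speedup below 10x no matter how wide the vector unit is, which is the kind of ceiling a mixed scalar and vector workload runs into.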