Continual Transformers: Redundancy-Free Attention for Online Inference

Transformers in their common form are inherently limited to operating on whole token sequences rather than on one token at a time. Consequently, their use during online inference on time-series data entails considerable redundancy due to the overlap in successive token sequences. In this work, we propose novel formulations of the Scaled Dot-Product Attention, which enable Transformers to perform efficient online token-by-token inference on a continual input stream. Importantly, our modifications are purely to the order of computations, while the outputs and learned weights are identical to those of the original Transformer Encoder. We validate our Continual Transformer Encoder with experiments on the THUMOS14, TVSeries and GTZAN datasets with remarkable results: our Continual one- and two-block architectures reduce the floating point operations per prediction by up to 63x and 2.6x, respectively, while retaining predictive performance.
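The redundancy described in the abstract can be illustrated with a small sketch. It is not the authors' formulation: it only shows, for a single attention head without positional encodings, that the output for the newest token in a sliding window can be computed from cached key and value projections and still match a full recomputation over the window. All names here (`CachedSingleOutputAttention`, `full_window_last_output`, `demo`) are hypothetical, and the weights are random placeholders.

```python
import math
import random
from collections import deque


def matvec(x, W):
    """Multiply row vector x (length d_in) by matrix W (d_in x d_out)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]


def attend(q, keys, values):
    """Scaled dot-product attention for one query over a list of keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z
            for j in range(len(values[0]))]


def full_window_last_output(tokens, Wq, Wk, Wv):
    """Baseline: re-project every token in the window and attend for every
    query, then keep only the output for the newest token."""
    keys = [matvec(x, Wk) for x in tokens]
    values = [matvec(x, Wv) for x in tokens]
    outputs = [attend(matvec(x, Wq), keys, values) for x in tokens]
    return outputs[-1]


class CachedSingleOutputAttention:
    """Hypothetical sketch: project each token once, cache its key and value,
    and attend only for the newest query."""

    def __init__(self, Wq, Wk, Wv, window):
        self.Wq, self.Wk, self.Wv = Wq, Wk, Wv
        self.keys = deque(maxlen=window)    # oldest entry is dropped automatically
        self.values = deque(maxlen=window)

    def step(self, x):
        self.keys.append(matvec(x, self.Wk))
        self.values.append(matvec(x, self.Wv))
        return attend(matvec(x, self.Wq), list(self.keys), list(self.values))


def demo(seed=0, steps=12, window=5, d=4):
    """Return the largest absolute difference between cached and full results."""
    rng = random.Random(seed)
    rand_matrix = lambda: [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
    Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
    stream = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(steps)]
    layer = CachedSingleOutputAttention(Wq, Wk, Wv, window)
    max_err = 0.0
    for t, x in enumerate(stream):
        y = layer.step(x)
        ref = full_window_last_output(stream[max(0, t - window + 1): t + 1], Wq, Wk, Wv)
        max_err = max(max_err, max(abs(a - b) for a, b in zip(y, ref)))
    return max_err
```

In this sketch each step projects one token and evaluates one query, whereas the baseline re-projects the whole window and evaluates every query. That is the kind of reordering the abstract refers to, although the actual savings reported there depend on the architecture and are not reproduced by this toy example.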

