Exploring the sequence length bottleneck in the Transformer for Image Captioning

Most recent state-of-the-art architectures rely on combinations and variations of three approaches: convolutional, recurrent, and self-attentive methods. Our work aims to lay the groundwork for a new research direction in sequence modeling based on the idea of modifying the sequence length. To that end, we propose a new method called the "Expansion Mechanism", which transforms the input sequence, either dynamically or statically, into a new one with a different sequence length. Furthermore, we introduce a novel architecture that exploits this method and achieves competitive performance on the MS-COCO 2014 dataset, yielding 134.6 and 131.4 CIDEr-D on the Karpathy test split in the ensemble and single-model configurations, respectively, and 130 CIDEr-D on the official online evaluation server, despite being neither recurrent nor fully attentive. At the same time, we address efficiency in our design and introduce a convenient training strategy that, in contrast to the standard one, is suitable for most computational resources. Source code is available at https://github.com/jchenghu/ExpansionNet
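
The abstract does not say how the Expansion Mechanism is computed, so the following is only an illustrative sketch of the general idea of changing a sequence's length, not the authors' implementation. It assumes that each output position is a softmax-weighted average of the input vectors, that the "static" case uses query vectors independent of the input, and that the "dynamic" case shifts those queries by the input mean. All names (`change_length`, `static_queries`, `dynamic_queries`) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def change_length(inputs, queries):
    """Map len(inputs) vectors to len(queries) vectors (assumed scheme).

    Each query produces one output vector: a softmax-weighted average
    of the input vectors, so the output length equals len(queries).
    """
    dim = len(inputs[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in queries:
        weights = softmax([dot(q, x) / scale for x in inputs])
        outputs.append(
            [sum(w * x[j] for w, x in zip(weights, inputs)) for j in range(dim)]
        )
    return outputs


def static_queries(n_out, dim, seed=0):
    """Input-independent queries (random stand-ins for learned parameters)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_out)]


def dynamic_queries(inputs, n_out, seed=0):
    """Input-dependent queries: static queries shifted by the input mean
    (an assumption made for illustration only)."""
    dim = len(inputs[0])
    mean = [sum(x[j] for x in inputs) / len(inputs) for j in range(dim)]
    return [[b + m for b, m in zip(q, mean)] for q in static_queries(n_out, dim, seed)]


rng = random.Random(1)
seq = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]  # length 5, dim 4

expanded = change_length(seq, static_queries(8, 4))        # longer sequence
contracted = change_length(seq, dynamic_queries(seq, 3))   # shorter sequence
print(len(seq), len(expanded), len(contracted))  # prints: 5 8 3
```

In this sketch the output length is set entirely by the number of queries, so the same routine can lengthen or shorten a sequence while keeping the feature dimension unchanged.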
