Knowledge distillation (KD) is a popular method of transferring knowledge from a large "teacher" model to a small "student" model. Previous work has explored various layer-selection strategies (e.g., forward matching and in-order random matching) for intermediate-layer matching in KD, where a student layer is forced to resemble a certain teacher layer. In this work, we revisit such layer-selection strategies and observe an intriguing phenomenon that layer-selection strategy does not matter (much) in intermediate-layer matching -- even seemingly nonsensical matching strategies such as reverse matching still result in surprisingly good student performance. We provide an interpretation for this phenomenon by examining the angles between teacher layers viewed from the student's perspective. Our work sheds light on KD practice, as layer-selection strategies may not be the main focus of KD system design, and vanilla forward matching works well in most setups.
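To make the compared strategies concrete, below is a minimal PyTorch-style sketch of intermediate-layer matching with a pluggable layer-selection strategy. The function names (`select_layers`, `matching_loss`), the uniform spacing used for the forward map, and the linear projection `proj` are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

def select_layers(num_student, num_teacher, strategy="forward"):
    """Map each student layer index to a teacher layer index.

    'forward' pairs student layers with uniformly spaced teacher layers
    in order; 'reverse' flips that order; 'random' draws a sorted random
    subset (in-order random matching). These are illustrative readings of
    the strategies named in the abstract, not the paper's exact recipe.
    """
    forward = [round((i + 1) * num_teacher / num_student) - 1
               for i in range(num_student)]
    if strategy == "forward":
        return forward
    if strategy == "reverse":
        return forward[::-1]
    if strategy == "random":
        idx = torch.randperm(num_teacher)[:num_student]
        return sorted(idx.tolist())
    raise ValueError(f"unknown strategy: {strategy}")

def matching_loss(student_hiddens, teacher_hiddens, proj, mapping):
    """MSE between projected student hidden states and matched teacher states."""
    loss = 0.0
    for s_idx, t_idx in enumerate(mapping):
        loss = loss + nn.functional.mse_loss(
            proj(student_hiddens[s_idx]), teacher_hiddens[t_idx])
    return loss / len(mapping)
```

As a usage example, with a 4-layer student and a 12-layer teacher, `select_layers(4, 12, "forward")` yields `[2, 5, 8, 11]` and `"reverse"` yields `[11, 8, 5, 2]`; `proj` would typically be an `nn.Linear` mapping the student hidden size to the teacher hidden size, and this matching loss is added to the usual logit-distillation objective.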