Empirical Study on Optimizer Selection for Out-of-Distribution Generalization

Modern deep learning systems are fragile and do not generalize well under distribution shifts. While much promising work has addressed these concerns, no systematic study has examined the role of optimizers in out-of-distribution generalization. In this study, we examine the performance of popular first-order optimizers for different classes of distributional shift under empirical risk minimization and invariant risk minimization. We consider image and text classification, using DomainBed, WILDS, and Backgrounds Challenge as out-of-distribution datasets for an exhaustive study. We search over a wide range of hyperparameters and examine the classification accuracy (in-distribution and out-of-distribution) of over 20,000 models. We arrive at the following findings: (i) contrary to conventional wisdom, adaptive optimizers (e.g., Adam) perform worse than non-adaptive optimizers (e.g., SGD, momentum-based SGD), and (ii) the relationship between in-distribution and out-of-distribution performance exhibits three types of behavior depending on the dataset: linear returns, increasing returns, and diminishing returns. We believe these findings can help practitioners choose the right optimizer and know what behavior to expect.
