An Empirical Study of Flaky Tests in Python

From arXiv, 11 pages; to be published in the Proceedings of the IEEE International Conference on Software Testing, Verification and Validation (ICST 2021).

Tests that cause spurious failures without any code changes, i.e., flaky tests, hamper regression testing, increase maintenance costs, may shadow real bugs, and decrease trust in tests. While the prevalence and importance of flakiness are well established, prior research has focused on Java projects, raising the question of how the findings generalize. To provide a better understanding of the role of flakiness in software development beyond Java, we empirically study the prevalence, causes, and degree of flakiness within software written in Python, currently one of the most popular programming languages. For this, we sampled 22,352 open source projects from the popular PyPI package index and analyzed their 876,186 test cases for flakiness. Our investigation suggests that flakiness is as prevalent in Python as it is in Java. The reasons, however, are different: order dependency is a much more dominant problem in Python, causing 59% of the 7,571 flaky tests in our dataset. Another 28% were caused by test infrastructure problems, which represent a previously undocumented cause of flakiness. The remaining 13% can mostly be attributed to the use of network and randomness APIs by the projects, which is indicative of the type of software commonly written in Python. Our data also suggests that finding flaky tests requires more runs than are often done in the literature: on average, 170 reruns would be required to be 95% confident that a passing test case is not flaky.
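
To give a feel for the rerun figure, here is a minimal sketch of how such a number can arise. It is not the paper's procedure, which the abstract does not describe. It assumes a simplified model in which a flaky test fails independently on each run with a fixed probability `p`, so the chance of seeing only passes in `n` runs is `(1 - p)^n`, and the smallest `n` with `(1 - p)^n <= 1 - c` is `ceil(ln(1 - c) / ln(1 - p))`. The function name `reruns_needed` and the example failure rates are hypothetical, chosen only for illustration.

```python
import math


def reruns_needed(fail_prob: float, confidence: float = 0.95) -> int:
    """Smallest number of runs n such that a flaky test, failing
    independently with probability fail_prob per run (an assumption of
    this sketch), fails at least once within n runs with probability
    >= confidence."""
    if not 0.0 < fail_prob < 1.0:
        raise ValueError("fail_prob must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve (1 - fail_prob) ** n <= 1 - confidence for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fail_prob))


if __name__ == "__main__":
    # Illustrative per-run failure rates; none of these come from the paper.
    for p in (0.5, 0.1, 0.0175):
        print(f"p = {p}: {reruns_needed(p)} reruns for 95% confidence")
```

Under this model, a test that fails on roughly 1.75% of runs would need about 170 reruns, which shows why a handful of reruns can easily miss rarely failing tests. The abstract's figure is an average over the flaky tests in the dataset, so it should not be read as implying any single failure rate.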
