There's never enough time to do all the testing you want

K. Herzig    Software Development Engineer, Microsoft Corporation, Redmond, United States

Abstract

Software is present in nearly every aspect of our daily lives and also dominates large parts of the high-tech consumer market. Consumers love new features, and new features are what makes them buy software products, while properties like reliability, security, and privacy are assumed. To respond to the consumer market demand, many software producers are following a trend to shorten software release cycles. As a consequence, software developers have to produce more features in less time while maintaining, or even increasing, product quality. Here at Microsoft (as well as other large software organizations) we have learned that testing is not free. Testing can slow down development processes and cost money in terms of infrastructure and human involvement. Thus, the effort associated with testing must be carefully monitored and managed.

Keywords

Code velocity; Testing processes; System and integration tests; Test execution history; False test alarms; Historic test execution data

Software is present in nearly every aspect of our daily lives and also dominates large parts of the high-tech consumer market. Consumers love new features, and new features are what makes them buy software products, while properties like reliability, security, and privacy are assumed. To respond to the consumer market demand, many software producers are following a trend to shorten software release cycles. As a consequence, software developers have to produce more features in less time while maintaining, or even increasing, product quality. Here at Microsoft (as well as other large software organizations [1]) we have learned that testing is not free. Testing can slow down development processes and cost money in terms of infrastructure and human involvement. Thus, the effort associated with testing must be carefully monitored and managed.

The Impact of Short Release Cycles (There's Not Enough Time)

To enable faster and more “agile” software development, processes have to change. We need to cut down the time required to develop, verify, and ship new features or code changes in general. In other words, we need to increase code velocity by increasing the effectiveness, efficiency, and reliability of our development processes.

Focusing on testing processes, it is important to realize that verification time is a lower bound on how fast we can ship software. However, nowadays this lower bound frequently conflicts with the goal of faster release cycles. As a matter of fact, we simply cannot afford to execute all tests on all code changes anymore. Simply removing tests is easy; the challenge is to cut tests without negatively impacting product quality.

Testing Is More Than Functional Correctness (All the Testing You Want)

Often, testing is associated with checking for functional correctness and unit testing. While these tests are often fast, passing or failing in seconds, large, complex software systems require tests to verify system constraints such as backward compatibility, performance, security, usability, and so on. These system and integration tests are complex and typically time-consuming, even though they find bugs relatively rarely. Nevertheless, these tests must be seen as an insurance process verifying that the software product complies with all necessary system constraints at all times (or at least at the time of release). Optimizing unit tests can be very helpful, but usually it is the system and integration testing part of verification processes that consumes most of the precious development time.

Learn From Your Test Execution History

Knowing that we cannot afford to run all tests on all code changes anymore, we face a difficult task: find the best combination of tests to verify the current code change while spending as little test execution time as possible. To achieve this goal, we need to think of testing as a risk management tool that minimizes the risk of letting code defects escape to later stages of the development process, or even to customers.

The basic assumption behind most test optimization and test selection approaches is that for given scenarios, or context C, not all tests are equally well suited. Some tests are more effective than others. For example, running a test for Internet Explorer on the Windows kernel code base is unlikely to find new code defects.

However, determining the effectiveness and reliability of tests and when to execute which subset is not trivial. One of the most popular metrics to determine test quality is code coverage. However, coverage is of very limited use in this case. First, coverage does not imply verification (especially not for system and integration tests). Second, it does not allow us to assess the effectiveness and reliability of single test cases. Last but not least, collecting coverage significantly slows down test runtime, which would require us to remove even more tests.

Instead, we want to execute only those tests that, for a given code change and a given execution context C (e.g., branch, architecture, language, or device type), have high reliability and high effectiveness. Independent of the exact definitions of reliability and effectiveness, all tests that are not highly reliable and effective should be executed less frequently or not at all.

Test Effectiveness

Simplistically, a test is effective if it finds defects. This does not imply that tests that find no defects should be removed completely, but we should consider them of secondary importance (see “The Art of Testing Less” section). The beauty of this simplistic definition of test effectiveness is that we can use historic test execution data to measure test effectiveness. For a given execution context C, we determine how often the test failed due to a code defect. For example, a test T that has been executed 100 times on a given execution context C and that failed 20 times due to code defects has a historic code defect probability of 0.2. To compute such historic code defect probabilities, it usually suffices to query an existing test execution database and to link test failures to issue reports and code changes, a procedure that is also commonly used to assess the quality of source code. Please note that coverage information is partially included in this measurement. Not covering code means not being able to fail on a code change, implying a historic failure probability of 0.
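As an illustration, the sketch below (in Python) derives such historic defect probabilities from a test execution log. The record layout, the field names, and the linked_to_defect flag that marks failures traced back to a product code change are assumptions made for this example, not the actual schema of any Microsoft system.

from collections import defaultdict

def defect_probabilities(executions):
    """Estimate historic code defect probabilities per (test, context).

    executions is an iterable of dicts with the hypothetical fields
    test_id, context, failed, and linked_to_defect; the last flag marks
    failures that were traced back to a product code change."""
    runs = defaultdict(int)
    defect_failures = defaultdict(int)
    for e in executions:
        key = (e["test_id"], e["context"])
        runs[key] += 1
        if e["failed"] and e["linked_to_defect"]:
            defect_failures[key] += 1
    return {key: defect_failures[key] / runs[key] for key in runs}

# A test that failed 20 times out of 100 runs due to code defects: 0.2
history = [{"test_id": "T", "context": "C", "failed": i < 20, "linked_to_defect": i < 20}
           for i in range(100)]
print(defect_probabilities(history))  # {('T', 'C'): 0.2}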

Test Reliability/Not Every Test Failure Points to a Defect

Tests are usually designed to either pass or fail, and each failure should point to a code defect. In practice, however, many tests tend to report so-called false test alarms. These are test failures that are due not to code defects but to test issues or infrastructure issues. Common examples are wrong test assertions, non-deterministic (flaky) tests, and tests that depend on network resources and fail when the network is unavailable.

Tests that regularly report false test alarms must be considered a serious threat to the verification and development processes. As with any other test failure, false alarms trigger manual investigations that must be regarded as wasted engineering time. The result of the investigation will not increase product quality but, rather, slow down the code velocity of the code change currently under test.

Similar to test effectiveness, we can measure test reliability as a historic probability. Simplistically, we can count any test failure that did not lead to a code change (code defect) as a false test alarm. Thus, a test T that has been executed 100 times on a given execution context C and that failed 10 times without triggering a product code change has a historic test unreliability probability of 0.1.
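Following the same pattern, a false alarm probability can be estimated from the same kind of (hypothetical) execution log: every failure that was never linked to a product code change counts as a false alarm. Again, the field names are illustrative assumptions.

from collections import defaultdict

def false_alarm_probabilities(executions):
    """Estimate historic false test alarm probabilities per (test, context).

    A failure that did not lead to a product code change
    (linked_to_defect is False) is counted as a false test alarm."""
    runs = defaultdict(int)
    false_alarms = defaultdict(int)
    for e in executions:
        key = (e["test_id"], e["context"])
        runs[key] += 1
        if e["failed"] and not e["linked_to_defect"]:
            false_alarms[key] += 1
    return {key: false_alarms[key] / runs[key] for key in runs}

# A test that failed 10 times out of 100 runs without any product fix: 0.1
history = [{"test_id": "T", "context": "C", "failed": i < 10, "linked_to_defect": False}
           for i in range(100)]
print(false_alarm_probabilities(history))  # {('T', 'C'): 0.1}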

The Art of Testing Less

Combining both measurements for effectiveness and reliability (independent from their definition) allows a development team to assess the quality of individual tests and to act on it. Teams may decide to statically fix unreliable tests or to dynamically skip tests. Tests that show low effectiveness and/or low reliability should be executed only where necessary and as infrequently as possible. For more details on how to use these probabilities to design a system that dynamically determines which tests to execute and which to skip, we refer to Herzig et al. [2] and Elbaum et al. [1].
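One simple way of acting on the two probabilities is a threshold-based skip rule, sketched below. The thresholds and the function itself are purely illustrative and are not the selection scheme described by Herzig et al. [2].

def should_execute(defect_prob, false_alarm_prob,
                   min_effectiveness=0.01, max_false_alarm=0.05):
    """Decide whether to run a test for a given execution context.

    Skips tests that historically found almost no defects or that raised
    too many false alarms. The thresholds are illustrative, not tuned values."""
    return defect_prob >= min_effectiveness and false_alarm_prob <= max_false_alarm

print(should_execute(0.2, 0.01))   # True: effective and reliable
print(should_execute(0.0, 0.15))   # False: ineffective and unreliable

In practice, such a rule would be evaluated per execution context C and re-computed as new execution history arrives.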

Without Sacrificing Code Quality

However, the problem is that some tests might get disabled completely, either because they are too unreliable or because they have not found a code defect in recent periods. To minimize the risk of letting severe bugs slip into the final product, and to maintain the development team's confidence in the product under development, it is essential to prevent tests from being disabled completely. One possible solution is to regularly force test executions, e.g., once a week. Similarly, you can use a version control branch-based approach, e.g., executing all tests on the trunk or release branch, but not on feature and integration branches.
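A minimal sketch of such a safety net is shown below; the weekly forcing interval and the protected branch names are assumptions chosen for illustration.

from datetime import datetime, timedelta

FORCE_INTERVAL = timedelta(days=7)           # assumed weekly forcing interval
ALWAYS_RUN_BRANCHES = {"trunk", "release"}   # assumed branch policy

def must_run(branch, last_executed, now=None):
    """Force execution on protected branches, or whenever a test has been
    skipped for longer than FORCE_INTERVAL, so no test is disabled forever."""
    now = now or datetime.utcnow()
    if branch in ALWAYS_RUN_BRANCHES:
        return True
    return now - last_executed >= FORCE_INTERVAL

print(must_run("feature/x", datetime.utcnow() - timedelta(days=10)))  # True
print(must_run("trunk", datetime.utcnow()))                           # True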

Tests Evolve Over Time

Any complex test infrastructure evolves over time. New tests are added, while older tests might become less important or even deprecated. Maintaining tests and preventing test infrastructures from decay can grow into a significant effort. For products with some history, some of the older tests may not be “owned” by anybody anymore, or may show strongly distributed ownership across multiple product teams. Such ownership can impact test effectiveness and reliability, slowing down development speed [3]. Determining and monitoring new test cases being added, or changes to existing tests, can be very useful in assessing the health of the verification process. For example, adding many new features to a product without a corresponding increase in new tests being written might indicate a drop in quality assurance. The amount of newly introduced code and of newly written, or at least modified, test code should be well balanced.
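One rough indicator of that balance is the ratio of changed test code to changed product code per period. The sketch below assumes a change log with hypothetical per-change line counts; the field names are not taken from any real system.

def test_to_product_churn_ratio(changes):
    """Ratio of changed test lines to changed product lines.

    A ratio that drops sharply while feature work continues may indicate
    a gap in quality assurance. Field names are hypothetical."""
    product = sum(c["product_lines"] for c in changes)
    test = sum(c["test_lines"] for c in changes)
    return test / product if product else float("inf")

weekly_changes = [{"product_lines": 800, "test_lines": 150},
                  {"product_lines": 200, "test_lines": 100}]
print(test_to_product_churn_ratio(weekly_changes))  # 0.25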

In Summary

Software testing is expensive, in terms of both time and money. Emulating millions of configurations and devices requires complex test infrastructures and scenarios. This contradicts today's trend of releasing complex software systems in ever-shorter periods of time. As a result, software testing has become a bottleneck in development processes. The time spent in verification defines a lower bound on how fast companies can ship software. Resolving test bottlenecks requires us to rethink development and testing processes, to make room for newer and better tests, and to regain confidence in testing. We need to accept that testing is an important part of our daily development process, but that “There's never enough time to do all the testing we want.”

References

[1] Elbaum S., Rothermel G., Penix J. Techniques for improving regression testing in continuous integration development environments. In: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE); 2014.

[2] Herzig K., Greiler M., Czerwonka J., Murphy B. The art of testing less without sacrificing quality. In: Proceedings of the 37th International Conference on Software Engineering (ICSE); 2015.

[3] Herzig K., Nagappan N. Empirically detecting false test alarms using association rules. In: Companion Proceedings of the 37th International Conference on Software Engineering (ICSE); 2015.
