Effective Unit Test Generation for Java Null Pointer Exceptions

This paper introduces NpeTest, a unit test generation technique designed specifically to find Null Pointer Exceptions (NPEs), one of the most prevalent and critical errors in Java applications. Existing automatic unit test generation tools such as Randoop and EvoSuite focus on improving code coverage, but they are not sufficiently effective at catching NPEs. NpeTest combines static and dynamic analysis to guide the test case generator toward scenarios likely to trigger NPEs. In experiments on 108 NPE benchmarks collected from 96 real-world projects, NpeTest reproduced known NPEs at a rate of 78.9%, a 38.7% relative increase over EvoSuite's 56.9%. Furthermore, NpeTest detected 89 previously unknown NPEs in an industry project.

Check Out the Original Publication Here.

[Security, IEEE/ACM ASE 2024]
Effective Unit Test Generation for Java Null Pointer Exceptions

NPE, Java Developer’s Nightmare

Null Pointer Exceptions (NPEs) are among the most common and critical errors in Java applications. An NPE is a serious defect because dereferencing a null reference throws an exception that, if left unhandled, crashes the program and can disrupt the entire system. According to recent industry reports, NPEs account for the largest share of reported crashes in Java applications, which makes software testing mandatory for reducing the risk of NPEs during development.

The Complexity of Unit Testing and the Difficulty of Detecting NPEs

Unit testing is one of the most widely used software testing techniques for object-oriented programming languages such as Java. With well-designed test cases, unit testing validates that each unit of software performs as expected and identifies bugs. However, finding bug-triggering unit tests is a complex and time-consuming task that becomes harder as the size and complexity of software systems grow.

To reduce the burden on developers of designing unit tests, automatic test case generation techniques have been proposed, following two major approaches: random testing and search-based software testing. Both approaches generate test cases by automatically synthesizing method call sequences and other elements for the target unit.

However, we observed that automatic unit test generation tools such as Randoop and EvoSuite are not sufficiently effective at catching NPEs. These tools primarily strive for high code coverage, but higher code coverage does not necessarily lead to better NPE detection. This is because software bugs, especially NPEs, usually occur only under specific conditions.

Test Case Generation Failures Through NPE Example

We will analyze the problem using an NPE found in the Apache Qpid Proton-J project.

// MapType.java

// EncoderImpl.java

The root cause of this NPE is the null literal assigned to the variable amqpType in the deduceTypeFromClass method, which is returned without refinement. The NPE occurs in the calculateSize method at the call t.getEncoding(k), because getType internally calls deduceTypeFromClass and returns its result directly, so the null value reaches the dereference.

However, the conditions under which the variable amqpType is not refined during the execution are not trivial. For this to occur, the type of the first argument should be set properly, which is determined by the argument of the calculateSize method. Additionally, the input map of the calculateSize method must contain at least one element.

To generate test cases that trigger this NPE, unit test generation tools must mutate the generic type parameters of Map across many types and find one that falls through every branch condition in deduceTypeFromClass, so that the value of amqpType is never refined. However, tools such as EvoSuite and Randoop failed to generate such test cases because the space of test cases and statements to mutate is too large.
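The failure pattern can be illustrated with a heavily simplified sketch. This is not the Proton-J source: the class NpeSketch, the AmqpType interface, the FIXED encoder, and the branch conditions are assumptions made for illustration. Only the names amqpType, deduceTypeFromClass, getType, calculateSize, and getEncoding come from the description above.

```java
/** Simplified, hypothetical illustration of the NPE pattern; not the Proton-J code. */
final class NpeSketch {

    /** Hypothetical stand-in for an encoder type. */
    interface AmqpType {
        int getEncoding(Object value);
    }

    // Assumed fixed-size encoder, used only to keep the sketch small.
    private static final AmqpType FIXED = value -> 1;

    /** Returns null when no branch refines amqpType. */
    static AmqpType deduceTypeFromClass(Class<?> clazz) {
        AmqpType amqpType = null;                  // root cause: null literal
        if (clazz == String.class) {
            amqpType = FIXED;
        } else if (Number.class.isAssignableFrom(clazz)) {
            amqpType = FIXED;
        }
        return amqpType;                           // returned without refinement
    }

    /** Passes the deduced type through unchanged, including null. */
    static AmqpType getType(Object element) {
        return deduceTypeFromClass(element.getClass());
    }

    /** Dereferences the type of every key; an empty map never reaches the dereference. */
    static int calculateSize(java.util.Map<?, ?> map) {
        int size = 0;
        for (Object k : map.keySet()) {
            AmqpType t = getType(k);
            size += t.getEncoding(k);              // throws NPE when t is null
        }
        return size;
    }
}
```

In this sketch, an empty map or a map keyed by String never fails. Only a non-empty map whose key type matches none of the branches reaches the null dereference, which is why a generator has to vary the key type rather than just cover more lines.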

Samsung SDS's Solution: "NpeTest" for Generating Test Cases to Detect NPEs More Effectively

NpeTest employs both static and dynamic analysis to generate test cases for better NPE detection.

Search-Based Software Testing

NpeTest relies on EvoSuite, a search-based software testing (SBST) tool. The simplified test case generation process of EvoSuite is as follows:

Identifies coverage goals
Builds an initial parent population
Generates offspring population from the parent population
Computes fitness values for all test cases
Selects the next parent test cases
Updates coverage goals
Repeats until the time budget is exhausted
Returns the set of test cases as the final solution
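The steps above can be sketched as a toy search loop. This is a minimal sketch, not EvoSuite's implementation: the unit under test (branchOf), the representation of a test case as a list of integer inputs, the population size of 4, and the use of a generation count in place of a wall-clock time budget are all assumptions made to keep the example small.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Toy search-based loop; illustrative only. */
final class SbstSketch {

    /** Hypothetical unit under test: returns the ID of the branch an input takes. */
    static int branchOf(int x) {
        if (x < 0) return 0;
        if (x == 0) return 1;
        if (x % 2 == 0) return 2;
        return 3;
    }

    /** A "test case" is a list of inputs; its coverage is the set of branches hit. */
    static Set<Integer> coverage(List<Integer> test) {
        Set<Integer> hit = new HashSet<>();
        for (int x : test) hit.add(branchOf(x));
        return hit;
    }

    /** Fitness: number of still-uncovered goals this test case reaches. */
    static int fitness(List<Integer> test, Set<Integer> goals) {
        Set<Integer> hit = coverage(test);
        hit.retainAll(goals);
        return hit.size();
    }

    static List<List<Integer>> search(int generations, long seed) {
        Random rnd = new Random(seed);
        Set<Integer> goals = new HashSet<>(Arrays.asList(0, 1, 2, 3)); // coverage goals
        List<List<Integer>> population = new ArrayList<>();            // initial parents
        for (int i = 0; i < 4; i++) {
            population.add(Collections.singletonList(rnd.nextInt(7) - 3));
        }
        List<List<Integer>> solution = new ArrayList<>();
        // The generation count stands in for the time budget.
        for (int g = 0; g < generations && !goals.isEmpty(); g++) {
            List<List<Integer>> candidates = new ArrayList<>(population);
            for (List<Integer> parent : population) {                  // offspring by mutation
                List<Integer> child = new ArrayList<>(parent);
                child.add(rnd.nextInt(7) - 3);
                candidates.add(child);
            }
            // Rank all candidates by fitness and keep the best four as next parents.
            candidates.sort(Comparator.comparingInt((List<Integer> t) -> -fitness(t, goals)));
            population = new ArrayList<>(candidates.subList(0, 4));
            List<Integer> best = population.get(0);
            if (fitness(best, goals) > 0) {                            // update coverage goals
                solution.add(best);
                goals.removeAll(coverage(best));
            }
        }
        return solution;                                               // final set of test cases
    }
}
```

The fitness function here rewards only coverage, which mirrors the limitation discussed earlier: nothing in the loop prefers a test case that is closer to triggering an NPE.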

The workflow of NpeTest is as follows:

Performs static analysis on the given class
Computes NPE goals
Collects NPE functions
Builds an initial population
Generates NPE test cases
Updates the population based on the NPE detection coverage goals
Refines the methods
Computes test case scores
Repeats until the time budget is exhausted
Returns the set of test cases as the final solution

Static Analysis

The static analyzer has two tasks: (1) identify all NPE-prone regions and methods in the class under test (CUT), and (2) prioritize the statements to be mutated in a given test case.

Path Construction. We first construct a control flow graph (CFG) for each method and, based on the CFG, compute a set of target expressions for that method. Using these sets of target expressions, we classify whether each method is NPE-safe.

Nullable Path Identification. We analyze whether a target expression can be null on a given path. If the expression cannot be null on any path, we can conclude that dereferencing it never causes an NPE.
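The idea of checking nullability path by path can be sketched as follows. This is a minimal sketch under strong assumptions, not NpeTest's analyzer: each path is reduced to a list of abstract operations on a single tracked variable, the Op enum is invented for illustration, and the variable is conservatively assumed to be nullable at the start of the path.

```java
import java.util.List;

/** Toy per-path nullability check for one tracked variable; illustrative only. */
final class NullablePathSketch {

    /** Hypothetical abstraction of the statements on a CFG path. */
    enum Op { ASSIGN_NULL, ASSIGN_NON_NULL, NULL_GUARD, DEREF }

    /** True if the tracked variable may be null at some DEREF on this path. */
    static boolean mayDereferenceNull(List<Op> path) {
        boolean nullable = true; // assumption: unknown initial values may be null
        for (Op op : path) {
            switch (op) {
                case ASSIGN_NULL:
                    nullable = true;
                    break;
                case ASSIGN_NON_NULL:
                case NULL_GUARD:        // e.g. the path passed an "x != null" check
                    nullable = false;
                    break;
                case DEREF:
                    if (nullable) return true;
                    break;
            }
        }
        return false;
    }

    /** A method is treated as NPE-safe only if no path can dereference null. */
    static boolean isNpeSafe(List<List<Op>> paths) {
        for (List<Op> path : paths) {
            if (mayDereferenceNull(path)) return false;
        }
        return true;
    }
}
```

A method with one guarded path and one unguarded path is still not NPE-safe, because a single nullable path is enough to keep its target expression in play.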

NPE-likely Score Computation. We compute the NPE-likely score for the given method. This score is later used for test case selection during mutation.

Mutation Target Selection. When given a test case to mutate, NpeTest selects the statements and variables that can trigger NPEs instead of choosing the statements to mutate at random.

Dynamic Analysis

The goal of dynamic analysis is to guide the mutation generation process to actively explore NPE-prone areas by monitoring the execution results of test cases.

Method Under Test Refinement. NpeTest dynamically refines the set of methods under test using information from runtime exceptions. If an NPE occurs during test case execution, NpeTest records the method and the location where the NPE was triggered, and removes the corresponding target expression.
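One way to picture this bookkeeping is a map from each method to the target expressions that have not yet been triggered. The sketch below is an assumption about how such a structure could look, not NpeTest's actual data model: the class name, the use of integer locations, and the rule that a method is dropped once its target set becomes empty are all illustrative.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy bookkeeping for refining the methods under test; illustrative only. */
final class MutRefinementSketch {

    // Method name -> locations of target expressions that have not yet raised an NPE.
    private final Map<String, Set<Integer>> remainingTargets = new HashMap<>();

    void addTarget(String method, int location) {
        remainingTargets.computeIfAbsent(method, m -> new HashSet<>()).add(location);
    }

    /** Called when a test execution raised an NPE at the given location in the method. */
    void onNpe(String method, int location) {
        Set<Integer> targets = remainingTargets.get(method);
        if (targets == null) return;
        targets.remove(location);
        if (targets.isEmpty()) {
            remainingTargets.remove(method); // nothing left to find in this method
        }
    }

    /** Methods that still have at least one untriggered target expression. */
    Set<String> methodsUnderTest() {
        return new HashSet<>(remainingTargets.keySet());
    }
}
```

Shrinking the set this way focuses the remaining budget on methods where an NPE can still be found.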

Testcase-level NPE-likely Score Computation. NpeTest calculates and maintains a testcase-level NPE-likely score based on the sequence of method calls each test case executes. Every test case is annotated with its computed score, and NpeTest performs weighted sampling based on that score to select the test case to mutate from the population.
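Score-weighted sampling of this kind is commonly implemented as roulette-wheel selection. The following is a minimal sketch of that general technique, assuming non-negative scores stored in an array. It does not reproduce how NpeTest computes or stores its scores, and the class and method names are illustrative.

```java
import java.util.Random;

/** Roulette-wheel selection over non-negative scores; illustrative only. */
final class WeightedSamplingSketch {

    /** Returns an index chosen with probability proportional to its score. */
    static int sample(double[] scores, Random rnd) {
        double total = 0;
        for (double s : scores) total += s;
        if (total <= 0) {
            return rnd.nextInt(scores.length);   // all scores zero: fall back to uniform
        }
        double r = rnd.nextDouble() * total;     // point on the wheel in [0, total)
        double cumulative = 0;
        for (int i = 0; i < scores.length; i++) {
            cumulative += scores[i];
            if (r < cumulative) return i;
        }
        return scores.length - 1;                // guards against floating-point rounding
    }
}
```

Compared with always mutating the highest-scoring test case, weighted sampling still gives low-scoring test cases an occasional chance, which keeps some diversity in the population.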

Experimental Results and Applicability

Evaluation Setting

We implemented NpeTest on top of the latest version of EvoSuite, which was last updated on GitHub in February 2024. For performance comparison, we selected EvoSuite and Randoop as baselines. We conducted 25 evaluation experiments for each tool with a time budget of 5 minutes on the benchmark classes.

We collected real-world NPE benchmarks from the literature, resulting in a total of 96 buggy projects with 108 known NPEs.

Effectiveness of NpeTest

Measured by the average reproduction rate over 25 trials, NpeTest generated test cases that detected the known NPEs at rates 45.2% and 22.4% higher than those of Randoop and EvoSuite, respectively. Measured by the number of NPEs detected in any of the 25 trials, NpeTest found 73 NPEs, while Randoop and EvoSuite detected 25 and 59 NPEs, respectively.

Correlation between Code Coverage and NPE Detection

To examine the correlation between code coverage and NPE detection ability, we evaluated EvoSuite with different options. The fine-tuned options significantly improved code coverage: EvoSuite achieved 77.8% line coverage on average, compared to 64.5% with default options, a 20.8% improvement. However, even with the fine-tuned options, the NPE reproduction rate barely changed, rising only from 55.7% to 56.9%, a relative increase of just 2.2%.

Interestingly, NpeTest demonstrated the best performance on NPE detection, as shown in Table 2, but achieved less code coverage than EvoSuite with either the fine-tuned or the default options. The reason for the lower line coverage is that NpeTest has a smaller search space (i.e., fewer methods under test) than EvoSuite.
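The reproduction-rate improvements quoted in this article are relative increases rather than percentage-point differences. A minimal sketch of the arithmetic, using the reproduction rates quoted in this article (the class and method names are illustrative):

```java
/** Relative increase versus percentage-point difference; illustrative only. */
final class RelativeIncreaseSketch {

    /** Relative increase from before to after, in percent of the starting value. */
    static double relativeIncrease(double before, double after) {
        return (after - before) / before * 100.0;
    }

    /** Absolute difference between two percentages, in percentage points. */
    static double percentagePoints(double before, double after) {
        return after - before;
    }
}
```

For the EvoSuite reproduction rates, relativeIncrease(55.7, 56.9) rounds to 2.2, while the percentage-point difference is only 1.2. For NpeTest against fine-tuned EvoSuite, relativeIncrease(56.9, 78.9) rounds to 38.7, which corresponds to a difference of 22.0 percentage points.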

Industrial Case Study

To assess practical applicability, we conducted a case study on a proprietary cryptographic library used within an IT company. This library consists of 84 public classes and 13,669 lines of code, with 76% line coverage achieved through manually written unit tests.

Surprisingly, the tools revealed a total of 91 previously unknown NPEs, all of which were confirmed as true positives by the library development team. NpeTest found 89 NPEs, including 9 that EvoSuite missed and 37 that Randoop missed. EvoSuite and Randoop detected 82 and 52 NPEs, respectively, with EvoSuite finding only 2 NPEs that NpeTest did not detect.

Significance of the Research and Conclusion

Lessons Learned

The current unit test generators are not sufficient for NPE detection. EvoSuite failed to detect NPEs that Randoop could detect easily, and it detected only 59 of the 108 NPEs. In contrast, NpeTest detected 73 unique NPEs.

Achieving high code coverage is not necessary to improve NPE detection capability. While the fine-tuned option parameters increased EvoSuite's line coverage by 20.8%, they improved its NPE reproduction rate by only 2.2%. In contrast, NpeTest achieved 18.8% less line coverage than EvoSuite but detected 15 unique NPEs that EvoSuite failed to find and showed a 22.4% higher reproduction rate on average.

Adopting an integrated approach to detect NPEs in industrial software development is important. In the case study, despite the rigorous testing and development process already in place, the three subject tools together detected a significant number of previously unknown NPEs.

Limitation

NpeTest has inherent limitations in detecting bugs other than NPEs. Guided by its static and dynamic analysis, NpeTest intentionally skips methods that are free of NPEs, as well as methods whose NPEs have all been detected during testing, so it may miss other bugs that exist in the skipped methods.

Threats to Validity

To evaluate the best performance of EvoSuite, we used a set of fine-tuned options from SBST'22. However, these option values may not yield the best performance of EvoSuite on some of our benchmarks.

We excluded the programs that we failed to build in our experimental setting. The results might differ from what we observed if those programs had been built and included in our evaluation.

We ran each benchmark for 5 minutes per trial, with 25 trials. This time budget may not be sufficient for either EvoSuite or NpeTest to achieve its best performance.

Conclusion

In this paper, we shared our experience of enhancing automatic unit test generation to find Java null pointer exceptions (NPEs) more effectively. NPEs are among the most common and critical errors in Java applications. However, existing unit test generation tools such as Randoop and EvoSuite are not sufficiently effective at catching them.

Their primary strategy of achieving high code coverage does not necessarily result in triggering diverse NPEs in practice. In this paper, we detailed our observations on the limitations of current state-of-the-art unit testing tools in terms of NPE detection and introduced a new strategy to improve their effectiveness.

Our strategy utilizes both static and dynamic analysis to guide the test case generator to focus specifically on scenarios that are likely to trigger NPEs. We implemented this strategy on top of EvoSuite and evaluated our tool, NpeTest, on 108 NPE benchmarks collected from 96 real-world projects.

The results showed that our NPE-guidance strategy can increase EvoSuite's reproduction rate of the NPEs from 56.9% to 78.9%, a 38.7% improvement. Furthermore, NpeTest successfully detected 89 previously unknown NPEs from an industrial project.
