Coverage of Fuzz Testing in Software Vulnerability Detection Enhanced via Machine Learning Path-Input Correlation



Fuzz testing is widely used to detect security vulnerabilities and bugs in IT systems because of its high efficiency. The success of a fuzzing campaign depends heavily on the quality of the seed inputs used for test generation. It is, however, challenging to compose a corpus of seed inputs that enables high code and behavior coverage of the target program, especially when the target program requires complex input formats such as PDF files.


Associate Professor CHENG Liang of the Institute of Software, Chinese Academy of Sciences, developed a machine-learning-based framework that discovers and leverages the correlation between seed inputs and the execution of the target program to generate new seed inputs that achieve higher code coverage of the target program. The new seed inputs covered 24.30% more execution paths than the original seed corpus.
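To make the coverage figure concrete, the following minimal sketch shows one way such a percentage could be computed. It assumes that each execution path is recorded as a sequence of basic-block labels and that the gain is the relative increase in distinct paths between the two corpora. The function name and the sample traces are hypothetical, and the study may define the metric differently.

```python
def coverage_gain_percent(original_paths, new_paths):
    """Percentage by which the new corpus covers more distinct paths than the original."""
    baseline = len(set(original_paths))
    if baseline == 0:
        raise ValueError("the original corpus must cover at least one path")
    return 100.0 * (len(set(new_paths)) - baseline) / baseline


# Hypothetical traces: each path is a tuple of basic-block labels.
# The original corpus contains a duplicate, so it covers 4 distinct paths.
original = [("A", "B"), ("A", "C"), ("A", "B", "D"), ("A", "C", "D"), ("A", "B")]
# The generated corpus covers 5 distinct paths.
generated = [("A", "B"), ("A", "C"), ("A", "B", "D"), ("A", "C", "D"), ("A", "E")]

print(f"{coverage_gain_percent(original, generated):.2f}% more paths covered")
```

Counting distinct paths rather than raw executions keeps duplicate seed files from inflating the result.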


Starting with a collection of 40,000 PDF files crawled from the Internet, the framework first uses a generative model based on recurrent neural networks (RNNs) to learn the patterns in the execution paths of the original seed corpus and, in turn, to compose new execution paths. The new execution paths are then forwarded to a sequence-to-sequence translation model, which translates them into valid PDF files that trigger those paths. This translation model is trained in advance on the execution paths and their corresponding original seed files so that it can discover the correlations between them.
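The two-stage data flow described above can be illustrated with a toy sketch. It is not the framework's implementation: a bigram transition model stands in for the RNN-based generative model, a per-block lookup table stands in for the sequence-to-sequence translation model, and short token lists stand in for PDF files and their execution paths. All names and sample data are hypothetical.

```python
import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"

# Hypothetical corpus of (seed tokens, execution path) pairs, aligned one-to-one.
corpus = [
    (["header", "font", "page", "trailer"],
     ["parse_header", "load_font", "render_page", "read_trailer"]),
    (["header", "image", "page", "trailer"],
     ["parse_header", "decode_image", "render_page", "read_trailer"]),
    (["header", "font", "page", "page", "trailer"],
     ["parse_header", "load_font", "render_page", "render_page", "read_trailer"]),
]


def train_path_model(paths):
    """Stage 1 stand-in: count block-to-block transitions (bigram model, not an RNN)."""
    transitions = defaultdict(Counter)
    for path in paths:
        tokens = [START, *path, END]
        for prev, nxt in zip(tokens, tokens[1:]):
            transitions[prev][nxt] += 1
    return transitions


def sample_path(transitions, rng, max_len=20):
    """Compose a new execution path by sampling one transition at a time."""
    path, current = [], START
    while len(path) < max_len:
        choices = transitions[current]
        current = rng.choices(list(choices), weights=list(choices.values()))[0]
        if current == END:
            break
        path.append(current)
    return path


def train_translator(pairs):
    """Stage 2 stand-in: map each block to the input token seen with it most often."""
    votes = defaultdict(Counter)
    for seed_tokens, path in pairs:
        for token, block in zip(seed_tokens, path):
            votes[block][token] += 1
    return {block: counts.most_common(1)[0][0] for block, counts in votes.items()}


def translate(table, path):
    """Turn an execution path into the seed tokens expected to trigger it."""
    return [table[block] for block in path]


path_model = train_path_model([path for _, path in corpus])
translator = train_translator(corpus)
rng = random.Random(0)
for _ in range(3):
    new_path = sample_path(path_model, rng)
    print(new_path, "->", translate(translator, new_path))
```

Because the path model samples transitions rather than replaying whole traces, it can compose paths absent from the corpus, such as an image followed by two pages. The translator then proposes an input for each composed path, which is the role the translation model plays in the framework.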


The researchers conducted a set of experiments on several widely used PDF viewers, which demonstrated that the new seed inputs produced by the framework significantly increased the code coverage of the target programs and the likelihood of detecting program crashes. Additional experiments confirmed that the framework is applicable to other input formats, such as PNG and TTF files, with minimal customization.


Overall, this study demonstrated that neural networks can learn the correlations between input files and their execution paths. More importantly, these correlations can be used to generate new seed inputs with better coverage and to facilitate the fuzzing process when detecting software vulnerabilities.


The work, entitled “Optimizing seed inputs in fuzzing with machine learning”, has been published in the Proceedings of the International Conference on Software Engineering 2019 (ICSE 2019), the premier software engineering conference.


This work was financially supported by the National Natural Science Foundation of China and the National Key R&D Program of China.