Abstract

Over the years, security researchers have developed a broad spectrum of automatic code scanners that aim to find security vulnerabilities in applications. Security benchmarks are commonly used to evaluate novel scanners or program analysis techniques. Each benchmark consists of multiple positive test cases that reflect typical implementations of vulnerabilities, as well as negative test cases that reflect secure implementations without security flaws. Based on this ground truth, researchers can demonstrate the recall and precision of their novel contributions.
However, as we found, existing security benchmarks are often underspecified with respect to their underlying assumptions and threat models. This may lead to misleading evaluation results when testing code scanners, since scanners are expected to follow unclear and sometimes even contradictory assumptions.
To help improve the quality of benchmarks, we propose SecExploitLang, a specification language that allows the authors of benchmarks to specify security assumptions along with their test cases. We further present Exploiter, a tool that can automatically generate exploit code based on a test case and its SecExploitLang specification to demonstrate the correctness of the test case.
We created SecExploitLang specifications for two common security benchmarks and used Exploiter to evaluate the adequacy of their test case implementations. Our results reveal clear shortcomings in both benchmarks: a significant number of positive test cases turn out to be unexploitable, and even some negative test case implementations turn out to be exploitable. As we explain, the reasons for this include implementation defects as well as design flaws, which impact the meaningfulness of evaluations based on these benchmarks. Our work highlights the importance of thorough benchmark design and evaluation, and the concepts and tools we propose facilitate this task.