Automated Vulnerability Prediction in Software Systems and Lightweight Identification of Design Patterns in Source Code

Poozhithara, Jeffy Jahfar

Files

Poozhithara_washington_0250O_23223.pdf (2.33 MB)

Date

2021-08-26

Abstract

Software development companies invest heavily in fixing security vulnerabilities in their products after the code has been written. This calls for an automated mechanism to identify security vulnerabilities both during and after software development. One approach is to incorporate known solutions, such as security design patterns, at design time. This reduces the system-wide architectural changes required later and enables efficient documentation and maintenance of the software system. Further, identifying which design patterns already exist in source code can help maintenance engineers determine whether new requirements can be satisfied. Current techniques for design pattern identification require either manually labeling training datasets or manually specifying rules or queries for each pattern. In this research, we took a two-pronged approach: (1) pre-implementation: predict vulnerabilities before any source code is written, to increase awareness of possible risks while developing the system; and (2) post-implementation: check the source code to identify any missing security patterns, based on the identified vulnerabilities. For the first approach, we created the Keyword Extraction-based Vulnerability Identification System (KEVIS), which uses natural language processing techniques to extract keywords and n-grams from software documentation and predict security vulnerabilities in software systems. We analyzed the correlation of certain keywords and n-grams with the occurrence of various security vulnerabilities, as well as the correlation between different vulnerabilities. Additionally, we analyzed the performance of several classification algorithms (Logistic Regression, Support Vector Machines, K-Nearest Neighbors, Multilayer Perceptron, and Random Forest) on this prediction task. To enable the analysis, we also created a dataset by mapping over 200,000 vulnerability reports on the CVE website to the technical/functional documentation of 3602 products.
The preliminary analysis shows that the performance of KEVIS is comparable to or better than prediction using source code as well as other static analysis methods. For the second approach, we introduced PatternScout, a technique for automatically generating SPARQL queries by parsing UML diagrams of design patterns, ensuring that pattern characteristics are matched. We discuss the key concepts and design of PatternScout. Our results indicate that PatternScout can automatically generate queries for the three types of design patterns (i.e., creational, behavioral, and structural), with accuracy that is comparable to or better than that of existing techniques. Because the two approaches rely on different concepts, and for ease of explanation, the background, literature review, method, results, and discussion for each approach are presented separately in their own sections (Approach 1 - Automated Vulnerability Prediction in Software Systems, and Approach 2 - Lightweight Identification of Design Patterns in Source Code, respectively).
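
To make the keyword and n-gram step of the first approach more concrete, the following is a minimal illustrative sketch and not the KEVIS implementation. It assumes a tiny hand-written stopword list, word-level unigrams and bigrams, and a simple co-occurrence rate standing in for the correlation analysis; the function names, sample documents, and vulnerability labels are all hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not the one used in the thesis.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "for", "with", "is", "on"}


def extract_ngrams(text, max_n=2):
    """Return lowercase word n-grams (n = 1..max_n) after dropping stopwords."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


def ngram_label_rates(docs):
    """Map (n-gram, label) to the share of documents containing that n-gram
    which also carry that label."""
    doc_count = Counter()    # number of documents containing each n-gram
    label_count = Counter()  # number of those documents that carry each label
    for text, labels in docs:
        for gram in set(extract_ngrams(text)):
            doc_count[gram] += 1
            for label in labels:
                label_count[(gram, label)] += 1
    return {key: count / doc_count[key[0]] for key, count in label_count.items()}


# Hypothetical documentation snippets paired with hypothetical vulnerability labels.
docs = [
    ("Parses SQL query strings from the web form", {"sql-injection"}),
    ("Builds a SQL query for the report engine", {"sql-injection"}),
    ("Renders user comments in the web page", {"xss"}),
]
rates = ngram_label_rates(docs)
```

In this toy data, the bigram "sql query" occurs only in documents labeled with the hypothetical "sql-injection" class, while the unigram "web" is split between two classes. A real pipeline would feed such n-gram features into the classifiers named in the abstract instead of relying on raw co-occurrence rates.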