QUT - Automatic Generation of Software Vulnerability Datasets for Machine Learning

Study level

PhD
Master of Philosophy

Faculty/School

Faculty of Science

School of Computer Science

Topic status

We're looking for students to study this topic.

Supervisors

Dr Yi Lu

Position: Senior Lecturer in Cybersecurity
Division / Faculty: Faculty of Science

Overview

In recent years, machine learning has achieved considerable success in applications such as natural language processing, computer vision and speech recognition. Alongside better computing resources, this success is largely due to the availability of large training datasets for these applications. In software security research, however, the lack of large datasets remains an open problem, making it difficult for machine learning models to reason about security vulnerabilities in real-world software. The few existing software security datasets typically consist of handcrafted test programs that are very small and imprecisely labelled.

This project aims to investigate novel techniques for automatically generating large training datasets for software security research.

Research activities

The project will:

explore modern program code analysis techniques for labelling real-world software that exhibits security vulnerabilities
investigate state-of-the-art code synthesis techniques to create datasets from known vulnerable software
implement a prototype that combines code analysis and synthesis techniques to generate large sample datasets
experiment with the prototype and evaluate sample datasets with supervised machine learning.
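
As a rough illustration of how the activities above might fit together, the following toy sketch synthesises small C snippets from two hand-written templates and labels each one with a simple pattern check. Everything in it is an assumption made for illustration: the templates, the function names (`synthesise`, `label`, `generate_dataset`) and the regular-expression labeller are hypothetical stand-ins, not the project's actual techniques. A real prototype would replace the pattern check with proper program analysis and the templates with code synthesis over real-world software.

```python
import random
import re

# Hypothetical seed templates (C source as text): one bounded copy, one unbounded copy.
SAFE = ("void f{n}(const char *src) {{ char buf[{size}]; "
        "strncpy(buf, src, sizeof(buf) - 1); buf[sizeof(buf) - 1] = 0; }}")
VULN = "void f{n}(const char *src) {{ char buf[{size}]; strcpy(buf, src); }}"

# Toy stand-in for program analysis: flag calls to a few unbounded C functions.
UNSAFE_CALL = re.compile(r"\b(strcpy|gets|sprintf)\s*\(")


def synthesise(count, seed=0):
    """Generate `count` C snippets by filling in randomly chosen templates."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    samples = []
    for n in range(count):
        template = rng.choice([SAFE, VULN])
        samples.append(template.format(n=n, size=rng.choice([16, 32, 64])))
    return samples


def label(code):
    """Return 1 if the snippet contains an unbounded call, otherwise 0."""
    return 1 if UNSAFE_CALL.search(code) else 0


def generate_dataset(count, seed=0):
    """Pair each synthesised snippet with its label."""
    return [(code, label(code)) for code in synthesise(count, seed)]


if __name__ == "__main__":
    for code, is_vulnerable in generate_dataset(4, seed=1):
        print(is_vulnerable, code)
```

The resulting `(code, label)` pairs are in the shape a supervised learning experiment would expect, although a pattern-based labeller like this one is exactly the kind of imprecise labelling the project aims to improve on.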

Outcomes

a large, labelled dataset for machine learning in software security research
novel techniques for generating large datasets of vulnerable software
a prototype dataset generator.

Skills and experience

solid background in computer science
programming experience in languages like Python or Java
GPA > 5.5

Scholarships

You may be eligible to apply for a research scholarship.

Explore our research scholarships

Contact

Contact the supervisor for more information.
