A Hybrid Heuristic-Guided Deep Representation Architecture for Predicting Latent Zero-Day Vulnerabilities in Patched Functions

Wahed08/HYDRA

HYDRA

HYDRA is a hybrid vulnerability analysis framework designed to uncover latent zero-day vulnerabilities in patched functions. It combines rule-based static analysis with deep learning techniques, specifically GraphCodeBERT embeddings and a Variational Autoencoder (VAE), to identify vulnerabilities that persist after a fix because the patch was incomplete or a risk was overlooked. HYDRA operates in an unsupervised setting and was evaluated on three real-world projects: Chrome, Android, and ImageMagick. It flagged 13.7%, 20.6%, and 24% of their patched functions, respectively, as containing potential latent risks. HYDRA outperforms baselines that use only regex-based or hybrid symbolic models by surfacing deeply buried but risky patterns, which strengthens the case for hybrid vulnerability prediction in security audits. These results demonstrate HYDRA's capability to surface hidden, previously undetected risks, advancing software security validation and supporting proactive zero-day vulnerability discovery.

Architecture of HYDRA

t-SNE visualization of HYDRA embeddings

HYDRA Framework Overview

HYDRA combines:

  1. Heuristic Feature Extraction: Encodes a set of expert-defined vulnerability prediction rules derived from common insecure coding practices (e.g., missing null checks, unsafe memory allocation).
  2. GraphCodeBERT Encoder: GraphCodeBERT is a pretrained deep learning model designed specifically for source code. It transforms each input function into rich, context-aware embeddings that capture both syntax (e.g., control/data flow) and semantics (e.g., variable interactions).
  3. VAE-Based Latent Space Projection: A Variational Autoencoder (VAE) compresses the high-dimensional code embeddings into a lower-dimensional latent space, making hidden structure in the code representations easier to observe.
  4. K-Means Clustering: K-Means groups patched functions by semantic similarity, highlighting similarities between risky and "None"-labeled examples and enabling unsupervised discovery of latent vulnerability signals.
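
To make the first stage concrete, here is a minimal sketch of how regex-based heuristic features might be computed for a single C/C++ function. The rule names, the patterns, and the functions `heuristic_features` and `missing_null_checks` are hypothetical and chosen for illustration only; they are not the rule set shipped with HYDRA, and plain regexes will miss many real-world cases.

```python
import re

# Hypothetical rules for illustration; not HYDRA's actual rule set.
RULES = {
    "unsafe_string_call": re.compile(r"\b(?:strcpy|strcat|sprintf|gets)\s*\("),
    "raw_allocation": re.compile(r"\b(?:malloc|calloc|realloc)\s*\("),
    "memcpy_call": re.compile(r"\bmemcpy\s*\("),
}

# Captures the variable that receives an allocation, with an optional cast.
ALLOC_ASSIGN = re.compile(
    r"(\w+)\s*=\s*(?:\([^)]*\)\s*)?(?:malloc|calloc|realloc)\s*\("
)


def missing_null_checks(code):
    """Count allocated pointers that are never tested against NULL."""
    count = 0
    for name in ALLOC_ASSIGN.findall(code):
        var = re.escape(name)
        check = re.compile(
            r"!\s*" + var + r"\b"                                 # if (!p)
            r"|\b" + var + r"\s*[!=]=\s*(?:NULL|nullptr|0)\b"     # p == NULL
            r"|if\s*\(\s*" + var + r"\s*\)"                       # if (p)
        )
        if not check.search(code):
            count += 1
    return count


def heuristic_features(code):
    """Return a dict of rule name -> number of hits for one function body."""
    feats = {name: len(rx.findall(code)) for name, rx in RULES.items()}
    feats["missing_null_check"] = missing_null_checks(code)
    return feats


vulnerable = "char *buf = malloc(n);\nstrcpy(buf, src);\n"
patched = (
    "char *buf = malloc(n);\n"
    "if (buf == NULL) return -1;\n"
    "strncpy(buf, src, n);\n"
)
print(heuristic_features(vulnerable))
print(heuristic_features(patched))
```

In a pipeline like the one described above, a vector of such counts would presumably sit alongside the learned embedding of the same function, so that a patched function that still triggers a rule stands out during clustering. How HYDRA actually combines the two signals is not specified here.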

Environment Requirements

Dataset

The dataset used in the paper:

  1. Big-Vul [https://drive.google.com/file/d/1-0VhnHBp9IGh90s2wCNjeCMuy70HPl8X/view?usp=sharing]
    A large dataset of known vulnerabilities with vulnerable and fixed code function pairs.
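
As a rough sketch of how the patched functions could be pulled out of Big-Vul, the snippet below reads a CSV export with the standard library. It assumes the file is a CSV with columns named `project`, `vul`, `func_before`, and `func_after`; check these names against the file you download, since the exact layout of the linked archive is not described here. The function `load_patched_functions` is a hypothetical helper, not part of the repository.

```python
import csv
import io

# Function bodies can exceed the default CSV field size limit.
csv.field_size_limit(10**9)


def load_patched_functions(csv_file, project=None):
    """Yield (project, patched_code) for rows marked vulnerable before the fix.

    Column names are assumptions about the Big-Vul export.
    """
    reader = csv.DictReader(csv_file)
    for row in reader:
        if row.get("vul") != "1":
            continue  # keep only functions that were actually patched
        name = row.get("project", "")
        if project and name.lower() != project.lower():
            continue
        patched = row.get("func_after", "")
        if patched.strip():
            yield name, patched


# Tiny in-memory stand-in for the real file, for demonstration only.
sample = io.StringIO(
    "project,vul,func_before,func_after\n"
    'Chrome,1,"int f() { return g(); }","int f() { if (!ok) return 0; return g(); }"\n'
    'Android,0,"void h() {}","void h() {}"\n'
)
rows = list(load_patched_functions(sample, project="chrome"))
print(rows)
```

For the real dataset, pass an open file handle instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`, where `path` points to your local copy.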

References

[1] Jiahao Fan, Yi Li, Shaohua Wang, and Tien Nguyen. A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries. MSR 2020.
