HYDRA

A Hybrid Heuristic-Guided Deep Representation Architecture for Predicting Latent Zero-Day Vulnerabilities in Patched Functions

HYDRA is a hybrid vulnerability analysis framework designed to uncover latent zero-day vulnerabilities in patched functions. It combines rule-based static analysis with deep learning techniques (specifically, GraphCodeBERT embeddings and a Variational Autoencoder, VAE) to identify vulnerabilities that persist after a fix because the patch was incomplete or a risk was overlooked. HYDRA operates in an unsupervised setting and was evaluated on three real-world projects: Chrome, Android, and ImageMagick. It flagged 13.7%, 20.6%, and 24% of their patched functions, respectively, as containing potential latent risks. HYDRA outperforms baselines that use only regex-based or hybrid symbolic models by surfacing deeply buried but risky patterns, which strengthens the case for hybrid vulnerability prediction in security audits. These results demonstrate HYDRA's capability to surface hidden, previously undetected risks, advancing software security validation and supporting proactive zero-day vulnerability discovery.

Architecture of HYDRA

HYDRA Framework Overview

HYDRA combines:

Heuristic Feature Extraction: Encodes a set of expert-defined vulnerability prediction rules derived from common insecure coding practices (e.g., missing null checks, unsafe memory allocation).
GraphCodeBERT Encoder: GraphCodeBERT, a pretrained deep learning model designed specifically for source code, transforms each input function's code into rich, context-aware embeddings that capture both syntax (e.g., control/data flow) and semantics (e.g., variable interactions).
VAE-Based Latent Space Projection: A Variational Autoencoder (VAE) compresses the high-dimensional code embeddings into a lower-dimensional latent space, making hidden structure in the code representations easier to observe.
K-Means Clustering: K-Means groups the patched functions by semantic similarity, highlighting similarities between risky examples and examples labeled "None" and enabling unsupervised discovery of latent vulnerability signals.
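
As an illustration of the heuristic feature extraction step, the sketch below turns two of the insecure practices named above (missing null checks and unsafe memory handling) into binary features using the standard `re` module. The rule names, the regular expressions, and the `heuristic_features` helper are hypothetical and chosen only for this example; they are not HYDRA's actual rule set, which lives in the `code` folder.

```python
import re

# Hypothetical rules for illustration only; not HYDRA's actual rule set.
RULES = {
    "alloc": re.compile(r"\b(?:malloc|calloc|realloc)\s*\("),
    "unsafe_copy": re.compile(r"\b(?:strcpy|strcat|sprintf|gets)\s*\("),
    "null_check": re.compile(r"(?:==|!=)\s*NULL\b|\bif\s*\(\s*!\s*\w+\s*\)"),
}


def heuristic_features(func_code: str) -> dict:
    """Return one binary feature per heuristic for a C/C++ function body."""
    has_alloc = bool(RULES["alloc"].search(func_code))
    has_null_check = bool(RULES["null_check"].search(func_code))
    return {
        # Calls to copy/format routines that do not bound the destination.
        "unsafe_copy": int(bool(RULES["unsafe_copy"].search(func_code))),
        # Heap allocation with no null check anywhere in the function.
        "alloc_without_null_check": int(has_alloc and not has_null_check),
    }


if __name__ == "__main__":
    patched = """
    void copy_name(const char *src) {
        char *buf = malloc(64);
        strcpy(buf, src);
    }
    """
    print(heuristic_features(patched))
```

Purely textual rules like these are coarse: they cannot tell whether a null check guards the right pointer, for example. A vector of such features would presumably be used alongside the learned embeddings rather than instead of them.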

Environment Requirements

Python>=3.8
Torch>=1.12
Transformers>=4.25
NumPy
Pandas
Scikit-learn
Matplotlib
Seaborn
tqdm
re (regular expressions; part of the Python standard library, no installation needed)
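
One way to check an environment against the list above is to query the installed package metadata. This is a minimal sketch, assuming the packages are installed under their usual PyPI distribution names (for example `scikit-learn` for Scikit-learn); the `installed_versions` helper is hypothetical and not part of the repository.

```python
import sys
from importlib import metadata

# Assumed PyPI distribution names for the packages listed above.
REQUIRED = ["torch", "transformers", "numpy", "pandas",
            "scikit-learn", "matplotlib", "seaborn", "tqdm"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

The script only reports versions; comparing them against the minimums (Torch 1.12, Transformers 4.25) is left to the reader.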

Dataset

The dataset used in the paper:

Big-Vul [https://drive.google.com/file/d/1-0VhnHBp9IGh90s2wCNjeCMuy70HPl8X/view?usp=sharing]
A large dataset of known vulnerabilities with paired vulnerable and fixed functions [1].
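
Since HYDRA analyzes patched functions, a natural first step is to pull the fixed side of each pair for the evaluated projects. The sketch below does this with the standard `csv` module. The column names `project` and `func_after`, the project label strings, and the `patched_functions` helper are assumptions for illustration; check them against the header of the downloaded file.

```python
import csv
import io

# Assumed project labels, matching the three projects evaluated in the paper.
TARGET_PROJECTS = frozenset({"Chrome", "Android", "ImageMagick"})


def patched_functions(csv_file, projects=TARGET_PROJECTS):
    """Yield (project, patched_code) pairs for rows in the selected projects.

    Assumes the CSV has "project" and "func_after" columns.
    """
    for row in csv.DictReader(csv_file):
        code = row["func_after"]
        if row["project"] in projects and code.strip():
            yield row["project"], code


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real file, for demonstration only.
    sample = io.StringIO(
        "project,func_before,func_after\n"
        'Chrome,"int f(){return g();}","int f(){if(!ok) return 0; return g();}"\n'
        'linux,"void h(){}","void h(){}"\n'
    )
    for project, code in patched_functions(sample):
        print(project, "->", code)
```

For the real file, pass an open file handle (for example `open(path, newline="", encoding="utf-8")`) instead of the in-memory sample.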

References

[1] Jiahao Fan, Yi Li, Shaohua Wang, and Tien Nguyen. A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries. MSR 2020.
