A Hybrid Heuristic-Guided Deep Representation Architecture for Predicting Latent Zero-Day Vulnerabilities in Patched Functions
HYDRA is a hybrid vulnerability analysis framework designed to uncover latent zero-day vulnerabilities in patched functions. It combines rule-based static analysis with deep learning techniques, specifically GraphCodeBERT embeddings and a Variational Autoencoder (VAE), to identify vulnerabilities that persist after fixes because of incomplete patches or overlooked risks. HYDRA operates in an unsupervised setting and was evaluated on three real-world projects (Chrome, Android, and ImageMagick), where it predicted 13.7%, 20.6%, and 24% of patched functions, respectively, as containing potential latent risks. HYDRA outperforms baselines that use only regex-based or hybrid symbolic models by surfacing deeply buried but risky patterns, strengthening the case for hybrid vulnerability prediction in security audits. These results demonstrate HYDRA’s capability to surface hidden, previously undetected risks, advancing software security validation and supporting proactive zero-day vulnerability discovery.
HYDRA combines:
- Heuristic Feature Extraction: Encodes a set of expert-defined vulnerability prediction rules derived from common insecure coding practices (e.g., missing null checks, unsafe memory allocation).
- GraphCodeBERT Encoder: A pretrained deep learning model designed specifically for source code, GraphCodeBERT transforms each input function's code into rich, context-aware embeddings that capture both syntax (e.g., control/data flow) and semantics (e.g., variable interactions).
- VAE-Based Latent Space Projection: HYDRA uses a Variational Autoencoder (VAE) to compress high-dimensional code embeddings into a lower-dimensional latent space, making it easier to observe hidden structure in code representations.
- K-Means Clustering: K-Means is used to reveal semantic groupings of patched functions, highlighting similarities between risky and "None"-labeled examples and enabling unsupervised discovery of latent vulnerability signals.
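
As a rough illustration of the heuristic feature extraction step, the sketch below counts a few regex-based indicators in a C function and derives one combined flag. The rule names, the patterns, and the `extract_heuristic_features` helper are hypothetical and chosen only for this example; they are not HYDRA's actual rule set.

```python
import re
from typing import Dict

# Hypothetical rules for illustration only (not HYDRA's actual rule set).
HEURISTIC_RULES: Dict[str, "re.Pattern[str]"] = {
    # Calls to C string functions that do no bounds checking.
    "unsafe_string_call": re.compile(r"\b(?:strcpy|strcat|sprintf|gets)\s*\("),
    # Raw heap allocations.
    "raw_allocation": re.compile(r"\b(?:malloc|calloc|realloc)\s*\("),
    # Very rough null-check detector: `if (!ptr)` or a comparison against NULL.
    "null_check": re.compile(r"\bif\s*\(\s*!\s*\w+\s*\)|[!=]=\s*NULL\b"),
}


def extract_heuristic_features(func_code: str) -> Dict[str, int]:
    """Count rule matches and derive one combined indicator."""
    features = {
        name: len(pattern.findall(func_code))
        for name, pattern in HEURISTIC_RULES.items()
    }
    # Flag functions that allocate memory but never check for NULL anywhere.
    features["alloc_without_null_check"] = int(
        features["raw_allocation"] > 0 and features["null_check"] == 0
    )
    return features


SAMPLE_FUNCTION = """
char *copy(const char *s) {
    char *buf = malloc(strlen(s) + 1);
    strcpy(buf, s);
    return buf;
}
"""

sample_features = extract_heuristic_features(SAMPLE_FUNCTION)
print(sample_features)
```

A feature dictionary like this could be turned into a numeric vector and combined with the learned embedding. How HYDRA actually fuses the two signals is not specified here, so treat that step as an open design choice.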
- Python>=3.8
- Torch>=1.12
- Transformers>=4.25
- NumPy
- Pandas
- Scikit-learn
- Matplotlib
- Seaborn
- tqdm
- re (regular expressions; part of the Python standard library)
The dataset used in the paper:
- Big-Vul [https://drive.google.com/file/d/1-0VhnHBp9IGh90s2wCNjeCMuy70HPl8X/view?usp=sharing]
A large dataset of known vulnerabilities with vulnerable and fixed code function pairs.
[1] Jiahao Fan, Yi Li, Shaohua Wang, and Tien Nguyen. A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries. MSR 2020.
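
A minimal sketch of selecting patched functions for the three evaluated projects from a CSV export of the dataset. The column names (`project`, `vul`, `func_before`, `func_after`), the project spellings, and the `iter_patched_functions` helper are assumptions made for illustration; check them against the header of the file you download. The inline sample stands in for the real file.

```python
import csv
import io
from typing import Dict, Iterable, Iterator

# Assumed project labels; verify against the values in the actual CSV.
TARGET_PROJECTS = frozenset({"Chrome", "Android", "ImageMagick"})


def iter_patched_functions(
    rows: Iterable[Dict[str, str]],
    projects: frozenset = TARGET_PROJECTS,
) -> Iterator[Dict[str, str]]:
    """Yield the post-fix version of each function marked vulnerable."""
    for row in rows:
        # Assumption: vul == "1" marks a pair where func_before is the
        # vulnerable version and func_after is the patched version.
        if row.get("project") in projects and row.get("vul") == "1":
            yield {"project": row["project"], "patched_code": row["func_after"]}


# Tiny inline stand-in for the real dataset file.
SAMPLE_CSV = '''project,vul,func_before,func_after
Chrome,1,"int f(char *p) { return *p; }","int f(char *p) { if (!p) return 0; return *p; }"
linux,1,"void g() {}","void g() { }"
Android,0,"void h() {}","void h() {}"
'''

patched = list(iter_patched_functions(csv.DictReader(io.StringIO(SAMPLE_CSV))))
for item in patched:
    print(item["project"], "->", item["patched_code"])
```

For the real file, pass `csv.DictReader(open(path, newline="", encoding="utf-8"))` instead of the in-memory sample. Very long function bodies may require raising the limit with `csv.field_size_limit`.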





