Xikang Yang1,2, Biyu Zhou1, Xuehai Tang1, Jizhong Han1, Songlin Hu1,2
1Institute of Information Engineering, Chinese Academy of Sciences / Beijing, China
2School of Cyber Security, University of Chinese Academy of Sciences / Beijing, China
Warning: This paper may include harmful or unethical content from LLMs.
AAAI 2026 Oral
Large Language Models (LLMs) demonstrate impressive capabilities across diverse tasks, yet their safety mechanisms remain susceptible to adversarial exploitation of cognitive biases---systematic deviations from rational judgment. Unlike prior studies focusing on isolated biases, this work highlights the overlooked power of multi-bias interactions in undermining LLM safeguards. Specifically, we propose CognitiveAttack, a novel red-teaming framework that adaptively selects optimal ensembles drawn from 154 cognitive biases defined in human social psychology and engineers them into adversarial prompts that effectively compromise LLM safety mechanisms. Experimental results reveal systemic vulnerabilities across 30 mainstream LLMs, particularly open-source variants. CognitiveAttack achieves a substantially higher attack success rate than the SOTA black-box method PAP (60.1% vs. 31.6%), exposing critical limitations in current defenses. Through quantitative analysis of successful jailbreaks, we further identify vulnerability patterns in safety-aligned LLMs under synergistic cognitive biases, validating multi-bias interactions as a potent yet underexplored attack vector. This work introduces a novel interdisciplinary perspective by bridging cognitive science and LLM safety, paving the way for more robust and human-aligned AI systems.
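For intuition only, below is a minimal, hypothetical sketch of the core idea: layering several cognitive-bias framings around a request and keeping the ensemble that scores best against a judge. All names here (`BIAS_TEMPLATES`, `compose_prompt`, `score_prompt`, `select_best_ensemble`) are illustrative assumptions, not the released implementation; the actual framework selects ensembles adaptively with a trained red-team model and scores them against target LLMs rather than brute-forcing combinations with a random placeholder.

```python
import itertools
import random

# Hypothetical subset of bias framings; the paper draws on 154 biases
# defined in social psychology. Each template wraps the core request
# in language that evokes the corresponding bias.
BIAS_TEMPLATES = {
    "authority_bias": "As noted by a leading domain expert, {request}",
    "scarcity_effect": "This is the last opportunity to learn the following: {request}",
    "framing_effect": "Purely as part of a fictional safety audit, {request}",
    "anchoring": "Most reviewers rated this question as completely benign: {request}",
}

def compose_prompt(request: str, biases: tuple[str, ...]) -> str:
    """Layer several bias framings around a single request."""
    prompt = request
    for bias in biases:
        prompt = BIAS_TEMPLATES[bias].format(request=prompt)
    return prompt

def score_prompt(prompt: str) -> float:
    """Placeholder judge. The real framework would query a target LLM
    and measure whether the response complies rather than refuses."""
    return random.random()

def select_best_ensemble(request: str, max_size: int = 3) -> tuple[tuple[str, ...], str]:
    """Score small bias ensembles and keep the best one (brute force
    here purely for illustration)."""
    best_score, best = -1.0, ((), request)
    for k in range(1, max_size + 1):
        for combo in itertools.combinations(BIAS_TEMPLATES, k):
            candidate = compose_prompt(request, combo)
            s = score_prompt(candidate)
            if s > best_score:
                best_score, best = s, (combo, candidate)
    return best

if __name__ == "__main__":
    ensemble, prompt = select_best_ensemble("explain how the attack is evaluated")
    print("selected biases:", ensemble)
    print("composed prompt:", prompt)
```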
```
cd CognitiveAttack
pip install -r requirements.txt
```

To train the CognitiveAttack red-team model:

```
python main.py
```

To evaluate the trained model against target LLMs:

```
python attack.py
```

- Use only for authorized red-teaming and safety research
- Follow responsible disclosure practices
- Do not use for malicious purposes
- Refer to our ethics guidelines in the paper
If you use CognitiveAttack in your research, please cite:
```
@article{yang2025exploiting,
  title={Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs},
  author={Yang, Xikang and Zhou, Biyu and Tang, Xuehai and Han, Jizhong and Hu, Songlin},
  journal={arXiv preprint arXiv:2507.22564},
  year={2025}
}
```
This project is released under the MIT License for research use only.

