Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education [Arxiv]
Large Language Models (LLMs) are increasingly integrated into educational applications such as intelligent tutoring, automated grading, and personalized learning. However, they remain vulnerable to jailbreak attacks and fine-tuning attacks, which can compromise safety alignment and lead to harmful outputs. To address these issues, we propose TriShield, a unified defense framework for educational LLMs that simultaneously mitigates both attack types without sacrificing utility. TriShield begins with the construction of EduHarm, a benchmark dataset of safe–unsafe instruction pairs across five educational scenarios. Our framework operates through three core stages to enhance safety. Safety-aware attention realignment redirects attention toward critical unsafe tokens, thereby restoring the harmfulness feature that discriminates between unsafe and safe inputs. Layer-wise safety judgment identifies harmfulness features by aggregating safety cues across multiple layers to detect unsafe instructions. Defense-driven dual routing separates safe and unsafe queries, ensuring normal processing for benign inputs and guarded responses for harmful ones. Extensive experiments across eight jailbreak attack strategies demonstrate that TriShield effectively strengthens safety while preventing over-refusal of benign queries. Evaluations on three fine-tuning attack datasets further show that it consistently achieves robust defense against harmful queries while preserving utility gains from benign fine-tuning.
The EduHarm dataset covers five educational scenarios of instructional text, split into 1,044 training samples and 696 test samples.
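To get a quick look at the data, the snippet below sketches how one might load and inspect the splits. The file paths and field names (train.json, test.json, scenario) are illustrative assumptions rather than the guaranteed layout of the released files; please check the dataset directory in this repository for the exact format.

```python
import json
from collections import Counter

# NOTE: the paths and the 'scenario' field below are assumptions for
# illustration; adjust them to match the files shipped with this repository.
def load_split(path):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

train = load_split("EduHarm/train.json")  # expected: 1044 samples
test = load_split("EduHarm/test.json")    # expected: 696 samples

print(f"train: {len(train)}, test: {len(test)}")
# Count samples per educational scenario (assumed field name).
print(Counter(item.get("scenario", "unknown") for item in train))
```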
- Defense Jailbreak Attack
sh ./defense-jailbreak_attack/1_Single_Refusal_Module/single_refusal.sh
- Defense Fine-Tuning Attack
sh ./defense-finetuning_attack/1_Single_Refusal_Module/agnews-single_refusal.sh
For other datasets, please refer to the corresponding scripts (i.e., gsm8k-single_refusal.sh and SST2-single_refusal.sh).
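The Single Refusal Module scripts above operate on paired safe–unsafe instructions to recover a harmfulness signal in the model's hidden states. As a rough illustration of the general idea (not the repository's implementation), the sketch below computes a difference-of-means "harmfulness direction" between unsafe and safe inputs at a single layer; the model name and layer index are arbitrary choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: the model name and layer index are arbitrary choices,
# not the configuration used by the scripts in this repository.
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"
LAYER = 14

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def last_token_state(text, layer=LAYER):
    """Hidden state of the final prompt token at the chosen layer."""
    ids = tokenizer(text, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer][0, -1]

def harmfulness_direction(safe_texts, unsafe_texts):
    """Difference-of-means direction separating unsafe from safe inputs."""
    safe_mean = torch.stack([last_token_state(t) for t in safe_texts]).mean(dim=0)
    unsafe_mean = torch.stack([last_token_state(t) for t in unsafe_texts]).mean(dim=0)
    direction = unsafe_mean - safe_mean
    return direction / direction.norm()
```

Projections of new inputs onto such a direction can then serve as per-layer safety scores, which is the kind of signal the classification stage below aggregates.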
- Defense Jailbreak Attack
sh ./defense-jailbreak_attack/2_Logit_Fusion_Classification/train.sh
- Defense Fine-Tuning Attack
sh ./defense-finetuning_attack/2_Logit_Fusion_Classification/agnews-train.sh
For other datasets, please refer to the corresponding scripts (i.e., gsm8k-train.sh and SST2-train.sh).
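Conceptually, this stage fuses safety cues from multiple layers into a single judgment. The sketch below shows one simple way to do that: a learned linear combination of per-layer harmfulness scores trained with a binary cross-entropy loss. The feature definition, layer count, and hyperparameters are assumptions for illustration, not the classifier actually trained by the scripts above.

```python
import torch
import torch.nn as nn

class LogitFusionClassifier(nn.Module):
    """Fuse per-layer harmfulness scores into one safe/unsafe logit.

    Expects a feature vector of per-layer scores, e.g. the projection of
    each layer's hidden state onto that layer's harmfulness direction.
    """
    def __init__(self, num_layers):
        super().__init__()
        self.fusion = nn.Linear(num_layers, 1)  # learned per-layer weights

    def forward(self, layer_scores):            # (batch, num_layers)
        return self.fusion(layer_scores).squeeze(-1)

# Illustrative training loop on pre-computed per-layer scores and labels
# (1 = unsafe, 0 = safe); shapes and hyperparameters are assumptions.
def train_classifier(layer_scores, labels, num_layers, epochs=50, lr=1e-2):
    clf = LogitFusionClassifier(num_layers)
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(clf(layer_scores), labels.float())
        loss.backward()
        opt.step()
    return clf
```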
- Defense Jailbreak Attack
sh ./defense-jailbreak_attack/3_Dual_Routing/safety_eval.sh
Supported jailbreak attack types include autodan, pair, artprompt, random_search, gpt4cipher, past_tense, deep_inception, gptfuzz, and gcg. To generate jailbreak attack samples, we use the panda-guard repository.
- Defense Fine-Tuning Attack
sh ./defense-finetuning_attack/3_Dual_Routing/agnews-safety_eval.sh
For fine-tuning attack datasets with different poisoning levels (i.e., p = 0.1 and p = 0.2), we adopt the datasets provided in the /ft_datasets directory from AsFT.
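Once a safety judgment is available, the routing step itself is straightforward: benign queries proceed through normal generation, while queries judged harmful receive a guarded response. The sketch below is a minimal, hypothetical version of this routing; the threshold, refusal text, and function signature are placeholders rather than the repository's actual interface.

```python
import torch

# Placeholder guarded reply; the wording actually used by the framework
# may differ.
GUARDED_RESPONSE = (
    "I can't help with that request. If this is for coursework, "
    "please ask your instructor about appropriate resources."
)

@torch.no_grad()
def dual_route(model, tokenizer, prompt, safety_score, threshold=0.0):
    """Route a query: guarded reply if judged unsafe, normal generation otherwise.

    `safety_score` is assumed to be the fused logit from the previous stage
    (positive = unsafe); `threshold` is an illustrative choice.
    """
    if safety_score > threshold:
        return GUARDED_RESPONSE
    ids = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=256, do_sample=False)
    new_tokens = out[0][ids["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```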
Some of the code is built upon Llms_Encode_Harmfulness_Refusal_Separately, Tokens Highlighter, and ToxEdit. To generate jailbreak attack samples, we use the panda-guard repository. To evaluate baselines of various defense methods, we use the AISafetyLab repository.
If you find this work useful, please cite our paper:
@article{yi2025unified,
  title={Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education},
  author={Yi, Xin and Li, Yue and Shi, Dongsheng and Wang, Linlin and Wang, Xiaoling and He, Liang},
  journal={arXiv preprint arXiv:2511.14423},
  year={2025}
}
