
Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education [Arxiv]

Abstract

Large Language Models (LLMs) are increasingly integrated into educational applications such as intelligent tutoring, automated grading, and personalized learning. However, they remain vulnerable to jailbreak attacks and fine-tuning attacks, which can compromise safety alignment and lead to harmful outputs. To address these issues, we propose TriShield, a unified defense framework for educational LLMs that simultaneously mitigates both attack types without sacrificing utility. TriShield begins with the construction of EduHarm, a benchmark dataset of safe–unsafe instruction pairs across five educational scenarios. Our framework operates through three core stages to enhance safety. Safety-aware attention realignment redirects attention toward critical unsafe tokens, thereby restoring the harmfulness feature that discriminates between unsafe and safe inputs. Layer-wise safety judgment identifies harmfulness features by aggregating safety cues across multiple layers to detect unsafe instructions. Defense-driven dual routing separates safe and unsafe queries, ensuring normal processing for benign inputs and guarded responses for harmful ones. Extensive experiments across eight jailbreak attack strategies demonstrate that TriShield effectively strengthens safety while preventing over-refusal of benign queries. Evaluations on three fine-tuning attack datasets further show that it consistently achieves robust defense against harmful queries while preserving utility gains from benign fine-tuning.

Overview

(Figure: overview of the TriShield three-stage defense framework.)

Dataset

EduHarm (available on Hugging Face)

This dataset covers five educational scenarios of instructional text, with 1,044 training samples and 696 test samples.
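The dataset is organized as safe–unsafe instruction pairs per scenario. The record below is an illustrative sketch of what such a pair might look like; the field names and example text are hypothetical, not EduHarm's actual schema:

```python
# Hypothetical sketch of one EduHarm-style safe–unsafe instruction pair.
# Field names ("scenario", "safe", "unsafe") are illustrative assumptions.
safe_unsafe_pair = {
    "scenario": "automated_grading",
    "safe": "Explain how to grade this essay fairly against the rubric.",
    "unsafe": "Write grading feedback that humiliates the student.",
}

def is_instruction_pair(record):
    """Return True if a record carries both sides of an instruction pair."""
    return {"safe", "unsafe"} <= record.keys()

print(is_instruction_pair(safe_unsafe_pair))
```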

Three-Stage Defense Framework

Construction of Edited FFN

  • Defense against Jailbreak Attacks

```shell
sh ./defense-jailbreak_attack/1_Single_Refusal_Module/single_refusal.sh
```

  • Defense against Fine-Tuning Attacks

```shell
sh ./defense-finetuning_attack/1_Single_Refusal_Module/agnews-single_refusal.sh
```

For other datasets, please refer to the corresponding scripts (i.e., gsm8k-single_refusal.sh and SST2-single_refusal.sh).
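The idea of an edited FFN that encodes refusal can be pictured as a rank-one weight update: grafting a refusal direction onto one hidden unit of the FFN output projection, so that firing that unit pushes the residual stream toward the refusal feature. This is a toy NumPy sketch of that general idea, not the repository's implementation; the matrix sizes and the trigger unit index are arbitrary assumptions:

```python
import numpy as np

# Toy sketch (not the paper's code): edit an FFN down-projection so that
# one hypothetical "trigger" unit writes a refusal direction into the
# residual stream.
rng = np.random.default_rng(0)
d_model, d_ffn = 8, 32
W_down = rng.normal(size=(d_model, d_ffn)) * 0.01  # small random FFN weights

refusal_dir = rng.normal(size=d_model)
refusal_dir /= np.linalg.norm(refusal_dir)   # unit-norm refusal feature

trigger_unit = 5                             # arbitrary illustrative index
W_edited = W_down.copy()
W_edited[:, trigger_unit] += refusal_dir     # rank-one edit on one column

h = np.zeros(d_ffn)
h[trigger_unit] = 1.0                        # the trigger unit fires
out = W_edited @ h                           # FFN output for this activation
print(float(np.dot(out, refusal_dir)))       # alignment with refusal_dir
```

Because the original column is near zero, the edited unit's output is dominated by the refusal direction (alignment close to 1).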

Training of Layer-wise Safety Judgment

  • Defense against Jailbreak Attacks

```shell
sh ./defense-jailbreak_attack/2_Logit_Fusion_Classification/train.sh
```

  • Defense against Fine-Tuning Attacks

```shell
sh ./defense-finetuning_attack/2_Logit_Fusion_Classification/agnews-train.sh
```

For other datasets, please refer to the corresponding scripts (i.e., gsm8k-train.sh and SST2-train.sh).
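Layer-wise safety judgment aggregates safety cues from multiple layers into one decision. A minimal sketch of that fusion idea, assuming each layer emits a harmfulness logit and learned layer weights are combined by softmax (the numbers below are made up for illustration):

```python
import numpy as np

# Toy sketch of logit fusion across layers (an assumption about the idea,
# not the repository's training code).
def fuse_layer_logits(layer_logits, layer_weights):
    """Softmax-weighted fusion of per-layer harmfulness logits."""
    w = np.exp(layer_weights - np.max(layer_weights))
    w /= w.sum()                         # normalized layer weights
    return float(np.dot(w, layer_logits))

# Hypothetical logits from 4 layers for one query; later layers are
# assumed more safety-discriminative, so they carry larger weights.
logits = np.array([-0.2, 0.5, 2.0, 3.1])
weights = np.array([0.1, 0.3, 1.0, 1.5])

score = fuse_layer_logits(logits, weights)
print(score > 0.0)  # a positive fused score flags the query as unsafe
```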

Safety-aware Attention Realignment & Dual Routing

  • Defense against Jailbreak Attacks

```shell
sh ./defense-jailbreak_attack/3_Dual_Routing/safety_eval.sh
```

Supported jailbreak attack types include autodan, pair, artprompt, random_search, gpt4cipher, past_tense, deep_inception, gptfuzz, and gcg. To generate jailbreak attack samples, we use the panda-guard repository.

  • Defense against Fine-Tuning Attacks

```shell
sh ./defense-finetuning_attack/3_Dual_Routing/agnews-safey_eval.sh
```

For fine-tuning attack datasets with different poisoning levels (i.e., p = 0.1 and p = 0.2), we adopt the datasets provided in the /ft_datasets directory from AsFT.
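Defense-driven dual routing sends each query down one of two paths depending on the safety judgment. A minimal sketch of that routing logic, with the threshold, handlers, and refusal message all being illustrative placeholders:

```python
# Toy sketch of defense-driven dual routing (illustrative only): queries
# judged safe take the normal generation path; queries judged unsafe take
# the guarded (refusal) path.
def dual_route(query, safety_score, threshold=0.0,
               answer=lambda q: f"ANSWER: {q}",
               refuse=lambda q: "I can't help with that request."):
    """Route by the fused safety score: above threshold => guarded path."""
    return refuse(query) if safety_score > threshold else answer(query)

print(dual_route("Explain photosynthesis.", safety_score=-1.2))
print(dual_route("How do I leak the exam answer key?", safety_score=2.0))
```

A benign query passes through unchanged, while a flagged query receives the guarded refusal.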

Acknowledgement

Some code is built upon Llms_Encode_Harmfulness_Refusal_Separately, Tokens Highlighter, and ToxEdit. To generate jailbreak attack samples, we use the panda-guard repository. To evaluate baselines of various defense methods, we use the AISafetyLab repository.

Citation

If you find this work useful, please cite our paper:

@article{yi2025unified,
  title={Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education},
  author={Yi, Xin and Li, Yue and Shi, Dongsheng and Wang, Linlin and Wang, Xiaoling and He, Liang},
  journal={arXiv preprint arXiv:2511.14423},
  year={2025}
}
