Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education [Arxiv]
Large Language Models (LLMs) are increasingly integrated into educational applications such as intelligent tutoring, automated grading, and personalized learning. However, they remain vulnerable to jailbreak attacks and fine-tuning attacks, which can compromise safety alignment and lead to harmful outputs. To address these issues, we propose TriShield, a unified defense framework for educational LLMs that simultaneously mitigates both attack types without sacrificing utility. TriShield begins with the construction of EduHarm, a benchmark dataset of safe–unsafe instruction pairs across five educational scenarios. Our framework operates through three core stages to enhance safety. Safety-aware attention realignment redirects attention toward critical unsafe tokens, thereby restoring the harmfulness feature that discriminates between unsafe and safe inputs. Layer-wise safety judgment identifies harmfulness features by aggregating safety cues across multiple layers to detect unsafe instructions. Defense-driven dual routing separates safe and unsafe queries, ensuring normal processing for benign inputs and guarded responses for harmful ones. Extensive experiments across eight jailbreak attack strategies demonstrate that TriShield effectively strengthens safety while preventing over-refusal of benign queries. Evaluations on three fine-tuning attack datasets further show that it consistently achieves robust defense against harmful queries while preserving utility gains from benign fine-tuning.
The EduHarm dataset covers five educational scenarios of instructional text, split into 1,044 training samples and 696 test samples.
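To get a quick look at the data, the snippet below sketches how one might load and inspect the splits. The file paths and field names (train.json, test.json, scenario) are illustrative assumptions rather than the guaranteed layout of the released files; please check the dataset directory in this repository for the exact format.

```python
import json
from collections import Counter

# NOTE: the paths and the 'scenario' field below are assumptions for
# illustration; adjust them to match the files shipped with this repository.
def load_split(path):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

train = load_split("EduHarm/train.json")  # expected: 1044 samples
test = load_split("EduHarm/test.json")    # expected: 696 samples

print(f"train: {len(train)}, test: {len(test)}")
# Count samples per educational scenario (assumed field name).
print(Counter(item.get("scenario", "unknown") for item in train))
```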
- Defense Jailbreak Attack
sh ./defense-jailbreak_attack/1_Single_Refusal_Module/single_refusal.sh
- Defense Fine-Tuning Attack
sh ./defense-finetuning_attack/1_Single_Refusal_Module/agnews-single_refusal.sh
For other datasets, please refer to the corresponding scripts (i.e., gsm8k-single_refusal.sh and SST2-single_refusal.sh).
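The Single Refusal Module scripts above operate on paired safe–unsafe instructions to recover a harmfulness signal in the model's hidden states. As a rough illustration of the general idea (not the repository's implementation), the sketch below computes a difference-of-means "harmfulness direction" between unsafe and safe inputs at a single layer; the model name and layer index are arbitrary choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: the model name and layer index are arbitrary choices,
# not the configuration used by the scripts in this repository.
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"
LAYER = 14

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def last_token_state(text, layer=LAYER):
    """Hidden state of the final prompt token at the chosen layer."""
    ids = tokenizer(text, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer][0, -1]

def harmfulness_direction(safe_texts, unsafe_texts):
    """Difference-of-means direction separating unsafe from safe inputs."""
    safe_mean = torch.stack([last_token_state(t) for t in safe_texts]).mean(dim=0)
    unsafe_mean = torch.stack([last_token_state(t) for t in unsafe_texts]).mean(dim=0)
    direction = unsafe_mean - safe_mean
    return direction / direction.norm()
```

Projections of new inputs onto such a direction can then serve as per-layer safety scores, which is the kind of signal the classification stage below aggregates.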
- Defense Jailbreak Attack
sh ./defense-jailbreak_attack/2_Logit_Fusion_Classification/train.sh
- Defense Fine-Tuning Attack
sh ./defense-finetuning_attack/2_Logit_Fusion_Classification/agnews-train.sh
For other datasets, please refer to the corresponding scripts (i.e., gsm8k-train.sh and SST2-train.sh).
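Conceptually, this stage fuses safety cues from multiple layers into a single judgment. The sketch below shows one simple way to do that: a learned linear combination of per-layer harmfulness scores trained with a binary cross-entropy loss. The feature definition, layer count, and hyperparameters are assumptions for illustration, not the classifier actually trained by the scripts above.

```python
import torch
import torch.nn as nn

class LogitFusionClassifier(nn.Module):
    """Fuse per-layer harmfulness scores into one safe/unsafe logit.

    Expects a feature vector of per-layer scores, e.g. the projection of
    each layer's hidden state onto that layer's harmfulness direction.
    """
    def __init__(self, num_layers):
        super().__init__()
        self.fusion = nn.Linear(num_layers, 1)  # learned per-layer weights

    def forward(self, layer_scores):            # (batch, num_layers)
        return self.fusion(layer_scores).squeeze(-1)

# Illustrative training loop on pre-computed per-layer scores and labels
# (1 = unsafe, 0 = safe); shapes and hyperparameters are assumptions.
def train_classifier(layer_scores, labels, num_layers, epochs=50, lr=1e-2):
    clf = LogitFusionClassifier(num_layers)
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(clf(layer_scores), labels.float())
        loss.backward()
        opt.step()
    return clf
```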
- Defense Jailbreak Attack
sh ./defense-jailbreak_attack/3_Dual_Routing/safety_eval.sh
Supported jailbreak attack types include autodan, pair, artprompt, random_search, gpt4cipher, past_tense, deep_inception, gptfuzz, and gcg. To generate jailbreak attack samples, we use the panda-guard repository.
- Defense Fine-Tuning Attack
sh ./defense-finetuning_attack/3_Dual_Routing/agnews-safety_eval.sh
For fine-tuning attack datasets with different poisoning levels (i.e., p = 0.1 and p = 0.2), we adopt the datasets provided in the /ft_datasets directory from AsFT.
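Once a safety judgment is available, the routing step itself is straightforward: benign queries proceed through normal generation, while queries judged harmful receive a guarded response. The sketch below is a minimal, hypothetical version of this routing; the threshold, refusal text, and function signature are placeholders rather than the repository's actual interface.

```python
import torch

# Placeholder guarded reply; the wording actually used by the framework
# may differ.
GUARDED_RESPONSE = (
    "I can't help with that request. If this is for coursework, "
    "please ask your instructor about appropriate resources."
)

@torch.no_grad()
def dual_route(model, tokenizer, prompt, safety_score, threshold=0.0):
    """Route a query: guarded reply if judged unsafe, normal generation otherwise.

    `safety_score` is assumed to be the fused logit from the previous stage
    (positive = unsafe); `threshold` is an illustrative choice.
    """
    if safety_score > threshold:
        return GUARDED_RESPONSE
    ids = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=256, do_sample=False)
    new_tokens = out[0][ids["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```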
Some of the code is built upon Llms_Encode_Harmfulness_Refusal_Separately, Tokens Highlighter, and ToxEdit. To generate jailbreak attack samples, we use the panda-guard repository. To evaluate baselines of various defense methods, we use the AISafetyLab repository.
If you find this work useful, please cite our paper:
@article{yi2025unified,
  title={Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education},
  author={Yi, Xin and Li, Yue and Shi, Dongsheng and Wang, Linlin and Wang, Xiaoling and He, Liang},
  journal={arXiv preprint arXiv:2511.14423},
  year={2025}
}
