TrustAIRLab


30 repositories

Unsafe-LLM-Based-Search
Public
Python
•
Apache License 2.0
•0•5•0•0•Updated Jan 4, 2026
JAIL-CON
Public
[NeurIPS'25] Adjacent Words, Divergent Intents: Jailbreaking Large Language Models via Task Concurrency (https://arxiv.org/abs/2510.21189)
Python
•
Creative Commons Attribution 4.0 International
•0•2•0•0•Updated Dec 24, 2025
HatefulIllusion
Public
Python
•0•2•0•0•Updated Dec 9, 2025
jades.github.io
Public
Official Website of JADES
SCSS
•
MIT License
•0•0•0•0•Updated Sep 12, 2025
T-GPS
Public
Python
•
Apache License 2.0
•0•3•0•0•Updated Sep 7, 2025
JADES
Public
This is the public code repository of the paper 'JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring'
0•6•0•0•Updated Aug 27, 2025
GPTracker
Public
[S&P'25] GPTracker: A Large-Scale Measurement of Misused GPTs
Python
•
GNU General Public License v3.0
•1•9•0•0•Updated Jul 25, 2025
SaferVLM
Public
Python
•0•7•0•0•Updated Jul 19, 2025
JailbreakRadar
Public
Python
•7•84•0•0•Updated Jun 8, 2025
AIGT_on_Social_Media
Public
[ACL2025] Official repository for "Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media"
Python
•1•8•0•0•Updated May 29, 2025
Conversation_Reconstruction_Attack
Public
This is the public code repository for the paper 'Reconstruct Your Previous Conversations! Comprehensively Investigating Privacy Leakage Risks in Conversations with GPT Models'
Python
•1•10•0•0•Updated May 21, 2025
HateBench
Public
[USENIX'25] HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns
hatespeech hatespeech-detection llm
Apache License 2.0
•3•13•0•0•Updated Mar 1, 2025
synthetic_artifact_auditing
Public
[Usenix Security 2025] Synthetic Artifact Auditing: Tracing LLM-Generated Synthetic Data Usage in Downstream Applications
synthetic-data synthetic-dataset-generation llm synthetic-artifact-auditing
Python
•
Apache License 2.0
•1•5•0•0•Updated Jan 29, 2025
proactive_unsafe_generation
Public
[Usenix Security 2025] On the Proactive Generation of Unsafe Images From Text-To-Image Models Using Benign Prompts
poisoning-attacks text-to-image-generation unsafe-image
Python
•
Apache License 2.0
•0•5•1•0•Updated Jan 29, 2025
Hateful_Memes_in_VLM
Public
Apache License 2.0
•0•1•0•0•Updated Jan 28, 2025
ModSCAN
Public
An official public repository of the paper "ModSCAN: Measuring Stereotypical Bias in Large Vision-Language Models from Vision and Language Modalities" (https://arxiv.org/abs/2410.06967).
Python
•
MIT License
•1•3•0•0•Updated Jan 8, 2025
ICL-MIA
Public
Python
•0•5•1•0•Updated Dec 19, 2024
importance-in-mlattacks
Public
Python
•0•9•0•0•Updated Dec 18, 2024
SecurityNet
Public
JavaScript
•
MIT License
•0•8•1•0•Updated Oct 30, 2024
ZeroFake
Public
Python
•2•11•1•0•Updated Oct 30, 2024
homepage
Public
JavaScript
•
MIT License
•0•0•0•0•Updated Oct 14, 2024
T2I_Model_Evolution
Public
MIT License
•0•0•0•0•Updated Aug 28, 2024
ML-Doctor
Public
Code for ML Doctor
Python
•
MIT License
•0•6•0•0•Updated Aug 14, 2024
VoiceJailbreakAttack
Public
Code for Voice Jailbreak Attacks Against GPT-4o.
Python
•
MIT License
•1•36•1•0•Updated May 31, 2024
easy-bib
Public
TeX
•
MIT License
•1•5•0•1•Updated Mar 9, 2024
.github
Public
0•0•0•0•Updated Feb 28, 2024
Label-Only-MIA
Public
Python
•
MIT License
•0•6•0•0•Updated Feb 23, 2024
JailbreakLLMs
Public
A dataset consisting of 6,387 ChatGPT prompts from Reddit, Discord, websites, and open-source datasets (including 666 jailbreak prompts).
MIT License
•2•14•0•0•Updated Feb 21, 2024
Link-Stealing-Attack
Public
Python
•0•2•0•0•Updated Feb 21, 2024
MGTBench
Public
Python
•
MIT License
•0•7•0•0•Updated Feb 21, 2024