SafetyPrompts.com: A Living Catalogue of Open Datasets for LLM Safety

About



This website lists open datasets for evaluating and improving the safety of large language models (LLMs). We include datasets that loosely fit two criteria:

  1. Relevance to LLM chat applications. We are most interested in collections of LLM prompts, like questions or instructions.
  2. Relevance to LLM safety. We focus on prompts that target, elicit or evaluate sensitive or unsafe model behaviours.

We regularly update this website. If you know of any missing or new datasets, please let us know via email or on Twitter. LLM safety is a community effort!




This website is maintained by me, Paul Röttger. I am a postdoc at MilaNLP working on evaluating and improving LLM safety. For feedback and suggestions, please get in touch via email or on Twitter.

Thank you to everyone who has given feedback or contributed to this website in other ways. Please check out the Acknowledgements.




If you use this website for your research, please cite our arXiv preprint:

@misc{röttger2024safetyprompts,
  title={SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety},
  author={Paul Röttger and Fabio Pernisi and Bertie Vidgen and Dirk Hovy},
  year={2024},
  eprint={2404.05399},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Table of Contents



As of December 17th, 2024, SafetyPrompts.com lists 144 datasets. Below, we list all datasets grouped by the purpose they serve, with the newest datasets listed first.

Note: SafetyPrompts.com takes its data from a Google Sheet. You may find the sheet useful for running your own analyses.
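If you want to analyse the catalogue programmatically, one option is to load a CSV export of the Google Sheet into pandas. The sketch below is a minimal example, not an official recipe: the sheet ID is a placeholder you need to copy from the sheet's URL, and the "Purpose" column name is hypothetical, so inspect the real column names first.

    # Minimal sketch, assuming the sheet is publicly readable.
    # SHEET_ID is a placeholder and "Purpose" is a hypothetical column name.
    import pandas as pd

    SHEET_ID = "YOUR_SHEET_ID"  # placeholder: copy the ID from the sheet's URL
    CSV_URL = f"https://docs.google.com/spreadsheets/d/{SHEET_ID}/export?format=csv"

    df = pd.read_csv(CSV_URL)        # load the catalogue into a DataFrame
    print(df.shape)                  # number of rows (datasets) and columns
    print(df.columns.tolist())       # inspect the actual column names first

    # Example analysis with a hypothetical column: datasets per purpose category.
    if "Purpose" in df.columns:
        print(df["Purpose"].value_counts())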



Broad Safety Datasets


53 "Broad Safety" datasets cover several aspects of LLM safety.


  1. JBBBehaviours from Chao et al. (Dec 2024): "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models"
  2. SGBench from Mou et al. (Dec 2024): "SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types"
  3. StrongREJECT from Souly et al. (Dec 2024): "A StrongREJECT for Empty Jailbreaks"
  4. WildJailbreak from Jiang et al. (Dec 2024): "WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models"
  5. ForbiddenQuestions from Shen et al. (Dec 2024): "'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models"
  6. CoSafe from Yu et al. (Nov 2024): "CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference"
  7. ArabicAdvBench from Al Ghanim et al. (Nov 2024): "Jailbreaking LLMs with Arabic Transliteration and Arabizi"
  8. AyaRedTeaming from Aakanksha et al. (Nov 2024): "The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm"
  9. UltraSafety from Guo et al. (Nov 2024): "Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment"
  10. CHiSafetyBench from Zhang et al. (Sep 2024): "CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models"
  11. SafetyBench from Zhang et al. (Aug 2024): "SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions"
  12. CatQA from Bhardwaj et al. (Aug 2024): "Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic"
  13. XSafety from Wang et al. (Aug 2024): "All Languages Matter: On the Multilingual Safety of Large Language Models"
  14. SaladBench from Li et al. (Aug 2024): "SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models"
  15. GPTFuzzer from Yu et al. (Jun 2024): "GPTFuzzer: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts"
  16. ALERT from Tedeschi et al. (Jun 2024): "ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming"
  17. SorryBench from Xie et al. (Jun 2024): "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors"
  18. Flames from Huang et al. (Jun 2024): "FLAMES: Benchmarking Value Alignment of LLMs in Chinese"
  19. SEval from Yuan et al. (May 2024): "S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models"
  20. SAFE from Yu et al. (Apr 2024): "Beyond Binary Classification: A Fine-Grained Safety Dataset for Large Language Models"
  21. DoNotAnswer from Wang et al. (Mar 2024): "Do-Not-Answer: Evaluating Safeguards in LLMs"
  22. HarmBench from Mazeika et al. (Feb 2024): "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal"
  23. DecodingTrust from Wang et al. (Feb 2024): "DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models"
  24. SimpleSafetyTests from Vidgen et al. (Feb 2024): "SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models"
  25. MaliciousInstruct from Huang et al. (Feb 2024): "Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation"
  26. HExPHI from Qi et al. (Feb 2024): "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!"
  27. QHarm from Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions"
  28. MaliciousInstructions from Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions"
  29. SafetyInstructions from Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions"
  30. AdvBench from Zou et al. (Dec 2023): "Universal and Transferable Adversarial Attacks on Aligned Language Models"
  31. BeaverTails from Ji et al. (Dec 2023): "BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset"
  32. TDCRedTeaming from Mazeika et al. (Dec 2023): "TDC 2023 (LLM Edition): The Trojan Detection Challenge"
  33. JADE from Zhang et al. (Dec 2023): "JADE: A Linguistics-based Safety Evaluation Platform for Large Language Models"
  34. CPAD from Liu et al. (Dec 2023): "Goal-Oriented Prompt Attack and Safety Evaluation for LLMs"
  35. AdvPromptSet from Esiobu et al. (Dec 2023): "ROBBIE: Robust Bias Evaluation of Large Generative Language Models"
  36. AART from Radharapu et al. (Dec 2023): "AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications"
  37. DELPHI from Sun et al. (Dec 2023): "DELPHI: Data for Evaluating LLMs’ Performance in Handling Controversial Issues"
  38. AttaQ from Kour et al. (Dec 2023): "Unveiling Safety Vulnerabilities of Large Language Models"
  39. FFT from Cui et al. (Nov 2023): "FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity"
  40. HarmfulQA from Bhardwaj and Poria (Aug 2023): "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment"
  41. HarmfulQ from Shaikh et al. (Jul 2023): "On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning"
  42. ModelWrittenAdvancedAIRisk from Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations"
  43. SafetyPrompts from Sun et al. (Apr 2023): "Safety Assessment of Chinese Large Language Models"
  44. ProsocialDialog from Kim et al. (Dec 2022): "ProsocialDialog: A Prosocial Backbone for Conversational Agents"
  45. AnthropicRedTeam from Ganguli et al. (Nov 2022): "Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned"
  46. SaFeRDialogues from Ung et al. (May 2022): "SaFeRDialogues: Taking Feedback Gracefully after Conversational Safety Failures"
  47. SafetyKit from Dinan et al. (May 2022): "SafetyKit: First Aid for Measuring Safety in Open-domain Conversational Systems"
  48. DiaSafety from Sun et al. (May 2022): "On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark"
  49. AnthropicHarmlessBase from Bai et al. (Apr 2022): "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback"
  50. BAD from Xu et al. (Jun 2021): "Bot-Adversarial Dialogue for Safe Conversational Agents"
  51. RealToxicityPrompts from Gehman et al. (Nov 2020): "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models"
  52. ParlAIDialogueSafety from Dinan et al. (Nov 2019): "Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack"
  53. EmpatheticDialogues from Rashkin et al. (Jul 2019): "Towards Empathetic Open-domain Conversation Models: a New Benchmark and Dataset"

Narrow Safety Datasets


25 "Narrow Safety" datasets focus only on one specific aspect of LLM safety.


  1. MedSafetyBench from Han et al. (Dec 2024): "MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models"
  2. CoCoNot from Brahman et al. (Dec 2024): "The Art of Saying No: Contextual Noncompliance in Language Models"
  3. DoAnythingNow from Shen et al. (Dec 2024): "'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models"
  4. SGXSTest from Gupta et al. (Nov 2024): "WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models"
  5. HiXSTest from Gupta et al. (Nov 2024): "WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models"
  6. WMDP from Li et al. (Jun 2024): "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning"
  7. ORBench from Cui et al. (Jun 2024): "OR-Bench: An Over-Refusal Benchmark for Large Language Models"
  8. XSTest from Röttger et al. (Jun 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
  9. RuLES from Mu et al. (Mar 2024): "Can LLMs Follow Simple Rules?"
  10. ConfAIde from Mireshghallah et al. (Feb 2024): "Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory"
  11. CoNA from Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions"
  12. ControversialInstructions from Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions"
  13. PhysicalSafetyInstructions from Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions"
  14. SycophancyEval from Sharma et al. (Feb 2024): "Towards Understanding Sycophancy in Language Models"
  15. OKTest from Shi et al. (Jan 2024): "Navigating the OverKill in Large Language Models"
  16. PromptExtractionRobustness from Toyer et al. (Dec 2023): "Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game"
  17. PromptHijackingRobustness from Toyer et al. (Dec 2023): "Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game"
  18. CyberattackAssistance from Bhatt et al. (Dec 2023): "Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models"
  19. SPMisconceptions from Chen et al. (Dec 2023): "Can Large Language Models Provide Security & Privacy Advice? Measuring the Ability of LLMs to Refute Misconceptions"
  20. LatentJailbreak from Qiu et al. (Aug 2023): "Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models"
  21. ModelWrittenSycophancy from Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations"
  22. SafeText from Levy et al. (Dec 2022): "SafeText: A Benchmark for Exploring Physical Safety in Language Models"
  23. PersonalInfoLeak from Huang et al. (Dec 2022): "Are Large Pre-Trained Language Models Leaking Your Personal Information?"
  24. ToxiGen from Hartvigsen et al. (May 2022): "ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection"
  25. TruthfulQA from Lin et al. (May 2022): "TruthfulQA: Measuring How Models Mimic Human Falsehoods"

Value Alignment Datasets


23 "Value Alignment" datasets are concerned with the ethical, moral or social behaviour of LLMs.


  1. PRISM from Kirk et al. (Dec 2024): "The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models"
  2. MultiTP from Jin et al. (Dec 2024): "Multilingual Trolley Problems for Language Models"
  3. CIVICS from Pistilli et al. (Oct 2024): "CIVICS: Building a Dataset for Examining Culturally-Informed Values in Large Language Models"
  4. GlobalOpinionQA from Durmus et al. (Oct 2024): "Towards Measuring the Representation of Subjective Global Opinions in Language Models"
  5. KorNAT from Lee et al. (Aug 2024): "KorNAT: LLM Alignment Benchmark for Korean Social Values and Common Knowledge"
  6. CMoralEval from Yu et al. (Aug 2024): "CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models"
  7. WorldValuesBench from Zhao et al. (May 2024): "WorldValuesBench: A Large-Scale Benchmark Dataset for Multi-Cultural Value Awareness of Language Models"
  8. MoralChoice from Scherrer et al. (Dec 2023): "Evaluating the Moral Beliefs Encoded in LLMs"
  9. DICES350 from Aroyo et al. (Dec 2023): "DICES Dataset: Diversity in Conversational AI Evaluation for Safety"
  10. DICES990 from Aroyo et al. (Dec 2023): "DICES Dataset: Diversity in Conversational AI Evaluation for Safety"
  11. OpinionQA from Santurkar et al. (Jul 2023): "Whose Opinions Do Language Models Reflect?"
  12. CValuesResponsibilityMC from Xu et al. (Jul 2023): "CVALUES: Measuring the Values of Chinese Large Language Models from Safety to Responsibility"
  13. CValuesResponsibilityPrompts from Xu et al. (Jul 2023): "CVALUES: Measuring the Values of Chinese Large Language Models from Safety to Responsibility"
  14. ModelWrittenPersona from Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations"
  15. Machiavelli from Pan et al. (Jun 2023): "Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark"
  16. MoralExceptQA from Jin et al. (Oct 2022): "When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment"
  17. MIC from Ziems et al. (May 2022): "The Moral Integrity Corpus: A Benchmark for Ethical Dialogue Systems"
  18. JiminyCricket from Hendrycks et al. (Dec 2021): "What Would Jiminy Cricket Do? Towards Agents That Behave Morally"
  19. MoralStories from Emelin et al. (Nov 2021): "Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences"
  20. ETHICS from Hendrycks et al. (Jul 2021): "Aligning AI With Shared Human Values"
  21. ScruplesAnecdotes from Lourie et al. (Mar 2021): "SCRUPLES: A Corpus of Community Ethical Judgments on 32,000 Real-life Anecdotes"
  22. ScruplesDilemmas from Lourie et al. (Mar 2021): "SCRUPLES: A Corpus of Community Ethical Judgments on 32,000 Real-life Anecdotes"
  23. SocialChemistry101 from Forbes et al. (Nov 2020): "Social Chemistry 101: Learning to Reason about Social and Moral Norms"

Bias Datasets


33 "Bias" datasets evaluate sociodemographic biases in LLMs.


  1. SoFa from Manerba et al. (Nov 2024): "Social Bias Probing: Fairness Benchmarking for Language Models"
  2. DeMET from Levy et al. (Nov 2024): "Gender Bias in Decision-Making with Large Language Models: A Study of Relationship Conflicts"
  3. GEST from Pikuliak et al. (Nov 2024): "Women Are Beautiful, Men Are Leaders: Gender Stereotypes in Machine Translation and Language Modeling"
  4. GenMO from Bajaj et al. (Nov 2024): "Evaluating Gender Bias of LLMs in Making Morality Judgements"
  5. CALM from Gupta et al. (Oct 2024): "CALM: A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias"
  6. MMHB from Tan et al. (Jun 2024): "Towards Massive Multilingual Holistic Bias"
  7. CBBQ from Huang and Xiong (May 2024): "CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models"
  8. KoBBQ from Jin et al. (May 2024): "KoBBQ: Korean Bias Benchmark for Question Answering"
  9. HolisticBiasR from Esiobu et al. (Dec 2023): "ROBBIE: Robust Bias Evaluation of Large Generative Language Models"
  10. DiscrimEval from Tamkin et al. (Dec 2023): "Evaluating and Mitigating Discrimination in Language Model Decisions"
  11. CHBias from Zhao et al. (Jul 2023): "CHBias: Bias Evaluation and Mitigation of Chinese Conversational Language Models"
  12. SeeGULL from Jha et al. (Jul 2023): "SeeGULL: A Stereotype Benchmark with Broad Geo-Cultural Coverage Leveraging Generative Models"
  13. WinoQueer from Felkner et al. (Jul 2023): "WinoQueer: A Community-in-the-Loop Benchmark for Anti-LGBTQ+ Bias in Large Language Models"
  14. WinoGenerated from Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations"
  15. HolisticBias from Smith et al. (Dec 2022): "I’m sorry to hear that: Finding New Biases in Language Models with a Holistic Descriptor Dataset"
  16. CDialBias from Zhou et al. (Dec 2022): "Towards Identifying Social Bias in Dialog Systems: Framework, Dataset, and Benchmark"
  17. IndianStereotypes from Bhatt et al. (Nov 2022): "Re-contextualizing Fairness in NLP: The Case of India"
  18. FrenchCrowSPairs from Neveol et al. (May 2022): "French CrowS-Pairs: Extending a challenge dataset for measuring social bias in masked language models to a language other than English"
  19. BBQ from Parrish et al. (May 2022): "BBQ: A Hand-Built Bias Benchmark for Question Answering"
  20. BiasOutOfTheBox from Kirk et al. (Dec 2021): "Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models"
  21. EthnicBias from Ahn and Oh (Nov 2021): "Mitigating Language-Dependent Ethnic Bias in BERT"
  22. RedditBias from Barikeri et al. (Aug 2021): "RedditBias: A Real-World Resource for Bias Evaluation and Debiasing of Conversational Language Models"
  23. StereoSet from Nadeem et al. (Aug 2021): "StereoSet: Measuring Stereotypical Bias in Pretrained Language Models"
  24. HypothesisStereotypes from Sotnikova et al. (Aug 2021): "Analyzing Stereotypes in Generative Text Inference Tasks"
  25. HONEST from Nozza et al. (Jun 2021): "HONEST: Measuring Hurtful Sentence Completion in Language Models"
  26. SweWinoGender from Hansson et al. (May 2021): "The Swedish Winogender Dataset"
  27. LMBias from Liang et al. (May 2021): "Towards Understanding and Mitigating Social Biases in Language Models"
  28. BOLD from Dhamala et al. (Mar 2021): "BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation"
  29. CrowSPairs from Nangia et al. (Nov 2020): "CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models"
  30. UnQover from Li et al. (Nov 2020): "UNQOVERing Stereotyping Biases via Underspecified Questions"
  31. Regard from Sheng et al. (Nov 2019): "The Woman Worked as a Babysitter: On Biases in Language Generation"
  32. WinoBias from Zhao et al. (Jun 2018): "Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods"
  33. WinoGender from Rudinger et al. (Jun 2018): "Gender Bias in Coreference Resolution"

Other Datasets


10 "Other" datasets serve more specialised purposes.


  1. WildGuardMix from Han et al. (Dec 2024): "WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs"
  2. AegisAIContentSafety from Ghosh et al. (Sep 2024): "AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts"
  3. Mosscap from Lakera AI (Dec 2023): "Mosscap Prompt Injection"
  4. HackAPrompt from Schulhoff et al. (Dec 2023): "Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition"
  5. ToxicChat from Lin et al. (Dec 2023): "ToxicChat: A Benchmark for Content Moderation in Real-world User-AI Interactions"
  6. GandalfIgnoreInstructions from Lakera AI (Oct 2023): "Gandalf Prompt Injection: Ignore Instruction Prompts"
  7. GandalfSummarization from Lakera AI (Oct 2023): "Gandalf Prompt Injection: Summarization Prompts"
  8. FairPrism from Fleisig et al. (Jul 2023): "FairPrism: Evaluating Fairness-Related Harms in Text Generation"
  9. OIGModeration from Ontocord AI (Mar 2023): "Open Instruction Generalist: Moderation Dataset"
  10. ConvAbuse from Cercas Curry et al. (Nov 2021): "ConvAbuse: Data, Analysis, and Benchmarks for Nuanced Abuse Detection in Conversational AI"

Broad Safety Datasets



100 prompts. Each prompt is an unsafe question or instruction. JBBBehaviours was created to evaluate the effectiveness of different jailbreaking methods. The dataset language is English. Dataset entries were created in a hybrid fashion: sampled from AdvBench and HarmBench, plus hand-written examples. The dataset license is MIT.

Other notes:

  • Covers 10 safety categories
  • Comes with 100 benign prompts from XSTest

Published by Chao et al. (Dec 2024): "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 01.08.2024.





1,442 prompts. Each prompt is a malicious instruction or question, which comes in multiple formats (direct query, jailbreak, multiple-choice or safety classification). SGBench was created to evaluate the generalisation of LLM safety across various tasks and prompt types. The dataset language is English. Dataset entries are human-written: sampled from AdvBench, HarmfulQA, BeaverTails and SaladBench, then expanded using templates and LLMs. The dataset license is GPL 3.0.

Other notes:

  • Comes in multiple formats (chat, multiple choice, classification)

Published by Mou et al. (Dec 2024): "SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 27.02.2024.





313 prompts. Each prompt is a 'forbidden question' in one of six categories. StrongREJECT was created to better investigate the effectiveness of different jailbreaking techniques. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written and curated questions from existing datasets, plus LLM-generated prompts verified for quality and relevance. The dataset license is MIT.

Other notes:

  • Focus of the work is adversarial / to jailbreak LLMs
  • Covers 6 question categories: disinformation/deception, hate/harassment/discrimination, illegal goods/services, non-violent crimes, sexual content, violence

Published by Souly et al. (Dec 2024): "A StrongREJECT for Empty Jailbreaks". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 27.02.2024.





261,534 conversations. Each conversation is single-turn, containing a potentially unsafe prompt and a safe model response. WildJailbreak was created to train LLMs to be safe. The dataset language is English. Dataset entries are machine-written: generated by different LLMs prompted with example prompts and jailbreak techniques. The dataset license is ODC BY.

Other notes:

  • Contains Vanilla and Adversarial portions
  • Comes with a smaller testset of only prompts

Published by Jiang et al. (Dec 2024): "WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 17.12.2024.





390 prompts. Each prompt is a question targeting behaviour disallowed by OpenAI. ForbiddenQuestions was created to evaluate whether LLMs answer questions that violate OpenAI's usage policy. The dataset language is English. Dataset entries are machine-written: GPT-written templates expanded by combination. The dataset license is MIT.

Other notes:

  • Covers 13 "forbidden" scenarios taken from the OpenAI usage policy

Published by Shen et al. (Dec 2024): "'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 11.01.2024.





1,400 conversations. Each conversation is multi-turn with the final question being unsafe. CoSafe was created to evaluate LLM safety in dialogue coreference. The dataset language is English. Dataset entries were created in a hybrid fashion: sampled from BeaverTails, then expanded with GPT-4 into multi-turn conversations. The dataset license is not specified.

Other notes:

  • Focuses on multi-turn conversations
  • Covers 14 categories of harm from BeaverTails

Published by Yu et al. (Nov 2024): "CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 01.08.2024.





520 prompts. Each prompt is a harmful instruction with an associated jailbreak prompt. ArabicAdvBench was created to evaluate safety risks of LLMs in (different forms of) Arabic. The dataset language is Arabic. Dataset entries are machine-written: translated from AdvBench (which is machine-generated). The dataset license is MIT.

Other notes:

  • Translated from AdvBench prompts
  • Comes in different versions of Arabic

Published by Al Ghanim et al. (Nov 2024): "Jailbreaking LLMs with Arabic Transliteration and Arabizi". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 17.12.2024.





7,419 prompts. Each prompt is a harmful question or instruction. AyaRedTeaming was created to provide a testbed for exploring alignment across global and local preferences. The dataset languages are Arabic, English, Filipino, French, Hindi, Russian, Serbian and Spanish. Dataset entries are human-written: written by paid native-speaking annotators. The dataset license is Apache 2.0.

Other notes:

  • Distinguishes between global and local harm
  • Covers 9 harm categories

Published by Aakanksha et al. (Nov 2024): "The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 17.12.2024.





3,000 prompts. Each prompt is a harmful instruction with an associated jailbreak prompt. UltraSafety was created to provide data for safety fine-tuning of LLMs. The dataset language is English. Dataset entries are machine-written: sampled from AdvBench and MaliciousInstruct, then expanded with SelfInstruct. The dataset license is MIT.

Other notes:

  • Comes with responses from different models, assessed for how safe they are by a GPT classifier

Published by Guo et al. (Nov 2024): "Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 01.08.2024.





1,861 multiple-choice questions. CHiSafetyBench was created to evaluate LLM capabilities to identify risky content and refuse to answer risky questions in Chinese. The dataset language is Chinese. Dataset entries were created in a hybrid fashion: sampled from SafetyBench and SafetyPrompts, plus prompts generated by GPT-3.5. The dataset license is not specified.

Other notes:

  • Covers 5 risk areas: discrimination, violation of values, commercial violations, infringement of rights, security requirements for specific services
  • Comes with smaller set of unsafe open-ended questions

Published by Zhang et al. (Sep 2024): "CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models". This is an industry publication.


Added to SafetyPrompts.com on 01.08.2024.





11,435 multiple-choice questions. SafetyBench was created to evaluate LLM safety with multiple choice questions. The dataset languages are English and Chinese. Dataset entries were created in a hybrid fashion: sampled from existing datasets and exams (Chinese), then augmented with LLMs (Chinese). The dataset license is MIT.

Other notes:

  • Split into 7 categories
  • Language distribution imbalanced across categories
  • Tests knowledge about safety rather than safety

Published by Zhang et al. (Aug 2024): "SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 04.01.2024.





550 prompts. Each prompt is a question. CatQA was created to evaluate the effectiveness of a safety training method. The dataset languages are English, Chinese and Vietnamese. Dataset entries are machine-written: generated by an unnamed LLM that is not safety-tuned. The dataset license is Apache 2.0.

Other notes:

  • Covers 11 harm categories, each divided into 5 subcategories
  • For each subcategory there are 10 prompts

Published by Bhardwaj et al. (Aug 2024): "Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 17.12.2024.





28,000 prompts. Each prompt is a question or instruction. XSafety was created to evaluate multilingual LLM safety. The dataset languages are English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Japanese and German. Dataset entries were created in a hybrid fashion: sampled from SafetyPrompts, then auto-translated and validated. The dataset license is Apache 2.0.

Other notes:

  • Covers 14 safety scenarios, some "typical", some "instruction"

Published by Wang et al. (Aug 2024): "All Languages Matter: On the Multilingual Safety of Large Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 01.08.2024.





21,000 prompts. Each prompt is a question or instruction. SaladBench was created to evaluate LLM safety, plus attack and defense methods. The dataset language is English. Dataset entries were created in a hybrid fashion: mostly sampled from existing datasets, then augmented using GPT-4. The dataset license is Apache 2.0.

Other notes:

  • Structured with three-level taxonomy including 66 categories
  • Comes with multiple-choice question set

Published by Li et al. (Aug 2024): "SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 01.08.2024.





100 prompts. Each prompt is a question or instruction. GPTFuzzer was created to evaluate the effectiveness of an automated red-teaming method. The dataset language is English. Dataset entries were created in a hybrid fashion: sampled from AnthropicHarmlessBase and an unpublished GPT-written dataset. The dataset license is MIT.

Other notes:

  • Small dataset for testing automated jailbreaks

Published by Yu et al. (Jun 2024): "GPTFuzzer: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 01.08.2024.





44,800 prompts. Each prompt is a question or instruction. ALERT was created to evaluate the safety of LLMs through red teaming methodologies. The dataset language is English. Dataset entries were created in a hybrid fashion: sampled from AnthropicRedTeam, then augmented with templates. The dataset license is CC BY-NC-SA 4.0.

Other notes:

  • Covers 6 categories and 32 sub-categories informed by AI regulation and prior work

Published by Tedeschi et al. (Jun 2024): "ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 01.08.2024.





9,450 prompts. Each prompt is an unsafe question or instruction. SorryBench was created to evaluate fine-grained LLM safety across varying linguistic characteristics. The dataset languages are English, French, Chinese, Marathi, Tamil and Malayalam. Dataset entries were created in a hybrid fashion: sampled from 10 other datasets, then augmented with templates. The dataset license is MIT.

Other notes:

  • Covers 45 potentially unsafe topics with 10 base prompts each
  • Includes 20 linguistic augmentations of each base prompt, resulting in 21 × 450 = 9,450 prompts

Published by Xie et al. (Jun 2024): "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 01.08.2024.





2,251 prompts. Each prompt is a question or instruction. Flames was created to evaluate value alignment of Chinese language LLMs. The dataset language is Chinese. Dataset entries are human-written: written by crowdworkers. The dataset license is Apache 2.0.

Other notes:

  • Covers 5 dimensions: fairness, safety, morality, legality, data protection

Published by Huang et al. (Jun 2024): "FLAMES: Benchmarking Value Alignment of LLMs in Chinese". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 01.08.2024.





20,000 prompts. Each prompt is an unsafe question or instruction. SEval was created to evaluate LLM safety. The dataset languages are English and Chinese. Dataset entries are machine-written: generated by a fine-tuned Qwen-14b. The dataset license is CC BY-NC-SA 4.0.

Other notes:

  • Covers 8 risk categories
  • Comes with 20 adversarial augmentations for each base prompt

Published by Yuan et al. (May 2024): "S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 01.08.2024.





52,430 conversations. Each conversation is single-turn, containing a prompt and a potentially harmful model response. SAFE was created to evaluate LLM safety beyond the binary distinction of safe and unsafe. The dataset language is English. Dataset entries were created in a hybrid fashion: seed prompts sampled from the Friday website, then more prompts generated with GPT-4. The dataset license is not specified.

Other notes:

  • Covers 7 classes: safe, sensitivity, harmfulness, falsehood, information corruption, unnaturalness, deviation from instructions

Published by Yu et al. (Apr 2024): "Beyond Binary Classification: A Fine-Grained Safety Dataset for Large Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 01.08.2024.





939 prompts. Each prompt is a question. DoNotAnswer was created to evaluate 'dangerous capabilities' of LLMs. The dataset language is English. Dataset entries are machine-written: generated by GPT-4. The dataset license is Apache 2.0.

Other notes:

  • Split across 5 risk areas and 12 harm types
  • Authors prompted GPT-4 to generate questions

Published by Wang et al. (Mar 2024): "Do-Not-Answer: Evaluating Safeguards in LLMs". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 23.12.2023.





400 prompts. Each prompt is an instruction. HarmBench was created to evaluate effectiveness of automated red-teaming methods. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is MIT.

Other notes:

  • Covers 7 semantic categories of behaviour: Cybercrime & Unauthorized Intrusion, Chemical & Biological Weapons/Drugs, Copyright Violations, Misinformation & Disinformation, Harassment & Bullying, Illegal Activities, and General Harm
  • The dataset also includes 110 multimodal prompts

Published by Mazeika et al. (Feb 2024): "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 07.04.2024.





243,877 prompts. Each prompt is an instruction. DecodingTrust was created to evaluate the trustworthiness of LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates and examples plus extensive augmentation from GPTs. The dataset license is CC BY-SA 4.0.

Other notes:

  • Split across 8 'trustworthiness perspectives', including toxicity, stereotypes, adversarial robustness, privacy, ethics and fairness

Published by Wang et al. (Feb 2024): "DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 11.01.2024.





100 prompts. Each prompt is a simple question or instruction. SimpleSafetyTests was created to evaluate critical safety risks in LLMs. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is CC BY-NC 4.0.

Other notes:

  • The dataset is split into ten types of prompts

Published by Vidgen et al. (Feb 2024): "SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 23.12.2023.





100 prompts. Each prompt is an unsafe question. MaliciousInstruct was created to evaluate the success of the generation-exploitation jailbreak. The dataset language is English. Dataset entries are machine-written: written by ChatGPT, then filtered by the authors. The dataset license is not specified.

Other notes:

  • Covers ten 'malicious intentions': psychological manipulation, sabotage, theft, defamation, cyberbullying, false accusation, tax fraud, hacking, fraud, and illegal drug use.

Published by Huang et al. (Feb 2024): "Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





330 prompts. Each prompt is a harmful instruction. HExPHI was created to evaluate LLM safety. The dataset language is English. Dataset entries were created in a hybrid fashion: sampled from AdvBench and AnthropicRedTeam, then refined manually and with LLMs. The dataset license is custom (HEx-PHI).

Other notes:

  • Covers 11 harm areas
  • Focus of the article is on finetuning models

Published by Qi et al. (Feb 2024): "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 07.04.2024.





100 prompts. Each prompt is a question. QHarm was created to evaluate LLM safety. The dataset language is English. Dataset entries are human-written: sampled randomly from AnthropicHarmlessBase (written by crowdworkers). The dataset license is CC BY-NC 4.0.

Other notes:

  • Wider topic coverage due to source dataset
  • Prompts are mostly unsafe

Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 23.12.2023.





100 prompts. Each prompt is an instruction. MaliciousInstructions was created to evaluate compliance of LLMs with malicious instructions. The dataset language is English. Dataset entries are machine-written: generated by GPT-3 (text-davinci-003). The dataset license is CC BY-NC 4.0.

Other notes:

  • Focus: malicious instructions (e.g. bombmaking)
  • All prompts are (meant to be) unsafe

Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 23.12.2023.





2,000 conversations. Each conversation is a user prompt with a safe model response. SafetyInstructions was created to fine-tune LLMs to be safer. The dataset language is English. Dataset entries were created in a hybrid fashion: prompts sampled from AnthropicRedTeam, responses generated by gpt-3.5-turbo. The dataset license is CC BY-NC 4.0.

Other notes:

Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





1,000 prompts. 500 are harmful strings that the model should not reproduce, 500 are harmful instructions. AdvBench was created to elicit generation of harmful or objectionable content from LLMs. The dataset language is English. Dataset entries are machine-written: generated by Wizard-Vicuna-30B-Uncensored. The dataset license is MIT.

Other notes:

  • Focus of the work is adversarial / to jailbreak LLMs
  • AdvBench tests whether jailbreaks succeeded

Published by Zou et al. (Dec 2023): "Universal and Transferable Adversarial Attacks on Aligned Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 11.01.2024.





333,963 conversations. Each conversation contains a human prompt and an LLM response. BeaverTails was created to evaluate and improve LLM safety on QA. The dataset language is English. Dataset entries were created in a hybrid fashion: prompts sampled from the human-written AnthropicRedTeam data, plus model-generated responses. The dataset license is CC BY-NC 4.0.

Other notes:

  • 16,851 unique prompts sampled from AnthropicRedTeam
  • Covers 14 harm categories (e.g. animal abuse)
  • Annotated for safety by 3.34 crowdworkers on average

Published by Ji et al. (Dec 2023): "BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





100 prompts. Each prompt is an instruction. TDCRedTeaming was created to evaluate success of automated red-teaming approaches. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is MIT.

Other notes:

  • Covers 7 categories: Bigotry and Abusive Language, Violent Content and Conduct, Illegal Activities, Malware and Exploits, Scams, Misinformation and Disinformation, Other Undesirable Content

Published by Mazeika et al. (Dec 2023): "TDC 2023 (LLM Edition): The Trojan Detection Challenge". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 07.04.2024.





2,130 prompts. Each prompt is a question targeting a specific harm category. JADE was created to use linguistic fuzzing to generate challenging prompts for evaluating LLM safety. The dataset languages are Chinese and English. Dataset entries are machine-written: generated by LLMs based on linguistic rules. The dataset license is MIT.

Other notes:

  • JADE is a platform for safety data generation and evaluation
  • Prompt generations are based on linguistic rules created by authors
  • The paper comes with 4 example datasets

Published by Zhang et al. (Dec 2023): "JADE: A Linguistics-based Safety Evaluation Platform for Large Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 05.02.2024.





10,050 prompts. Each prompt is a longer-form scenario / instruction aimed at eliciting a harmful response. CPAD was created to elicit generation of harmful or objectionable content from LLMs. The dataset language is Chinese. Dataset entries were created in a hybrid fashion: mostly generated by GPTs based on some human seed prompts. The dataset license is CC BY-SA 4.0.

Other notes:

  • Focus of the work is adversarial / to jailbreak LLMs
  • CPAD stands for Chinese Prompt Attack Dataset

Published by Liu et al. (Dec 2023): "Goal-Oriented Prompt Attack and Safety Evaluation for LLMs". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 11.01.2024.





197,628 sentences. Each sentence is taken from a social media dataset. AdvPromptSet was created to evaluate LLM responses to adversarial toxicity text prompts. The dataset language is English. Dataset entries are human-written: sampled from two Jigsaw social media datasets (written by social media users). The dataset license is MIT.

Other notes:

  • Created as part of the ROBBIE bias benchmark
  • Originally labelled for toxicity by Jigsaw

Published by Esiobu et al. (Dec 2023): "ROBBIE: Robust Bias Evaluation of Large Generative Language Models". This is an industry publication.


Added to SafetyPrompts.com on 11.01.2024.





3,269 prompts. Each prompt is an instruction. AART was created to illustrate the AART automated red-teaming method. The dataset language is English. Dataset entries are machine-written: generated by PaLM. The dataset license is CC BY 4.0.

Other notes:

  • Contains examples for specific geographic regions
  • Prompts also change up use cases and concepts

Published by Radharapu et al. (Dec 2023): "AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications". This is an industry publication.


Added to SafetyPrompts.com on 23.12.2023.





29,201 prompts. Each prompt is a question about a more or less controversial issue. DELPHI was created to evaluate LLM performance in handling controversial issues. The dataset language is English. Dataset entries are human-written: sampled from the Quora Question Pair Dataset (written by Quora users). The dataset license is CC BY 4.0.

Other notes:

  • Annotated for 5 levels of controversy
  • Annotators are native English speakers who have spent significant time in Western Europe

Published by Sun et al. (Dec 2023): "DELPHI: Data for Evaluating LLMs’ Performance in Handling Controversial Issues". This is an industry publication.


Added to SafetyPrompts.com on 07.04.2024.





1,402 prompts. Each prompt is a question. AttaQ was created to evaluate the tendency of LLMs to generate harmful or undesirable responses. The dataset language is English. Dataset entries were created in a hybrid fashion: sampled from AnthropicRedTeam, generated by LLMs, and crawled from Wikipedia. The dataset license is MIT.

Other notes:

  • Consists of a mix of sources

Published by Kour et al. (Dec 2023): "Unveiling Safety Vulnerabilities of Large Language Models". This is an industry publication.


Added to SafetyPrompts.com on 01.08.2024.





2,116 prompts. Each prompt is a question or instruction, sometimes within a jailbreak template. FFT was created to evaluate the factuality, fairness, and toxicity of LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: sampled from AnthropicRedTeam, generated by LLMs, and crawled from Wikipedia, Reddit and elsewhere. The dataset license is not specified.

Other notes:

  • Tests factuality, fairness and toxicity

Published by Cui et al. (Nov 2023): "FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 01.08.2024.





1,960 prompts. Each prompt is a question. HarmfulQA was created to evaluate and improve LLM safety. The dataset language is English. Dataset entries are machine-written: generated by ChatGPT. The dataset license is Apache 2.0.

Other notes:

  • Split into 10 topics (e.g. "Mathematics and Logic")
  • Similarity across prompts is quite high
  • Not all prompts are unsafe / safety-related

Published by Bhardwaj and Poria (Aug 2023): "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 23.12.2023.





200 prompts. Each prompt is a question. HarmfulQ was created to evaluate LLM safety. The dataset language is English. Dataset entries are machine-written: generated by GPT-3 (text-davinci-002). The dataset license is MIT.

Other notes:

  • Focus on 6 attributes: "racist, stereotypical, sexist, illegal, toxic, harmful"
  • Authors do manual filtering for overly similar questions

Published by Shaikh et al. (Jul 2023): "On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 23.12.2023.





24,516 binary-choice questions. ModelWrittenAdvancedAIRisk was created to evaluate advanced AI risk posed by LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: generated by an unnamed LLM and crowdworkers. The dataset license is CC BY 4.0.

Other notes:

  • 32 datasets targeting 16 different topics/behaviours
  • For each topic, there is a human-generated dataset (total of 8,116 prompts) and an LM-generated dataset (total of 16,400 prompts)
  • Each question has a binary answer (agree/not)

Published by Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations". This is an industry publication.


Added to SafetyPrompts.com on 07.04.2024.





100,000 prompts. Each prompt is a question or instruction. SafetyPrompts was created to evaluate the safety of Chinese LLMs. The dataset language is Chinese. Dataset entries were created in a hybrid fashion: human-written examples, augmented by LLMs. The dataset license is Apache 2.0.

Other notes:

  • Covers 8 safety scenarios and 6 types of adv attack
  • The authors do not release the 'sensitive topics' scenario

Published by Sun et al. (Apr 2023): "Safety Assessment of Chinese Large Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 04.01.2024.





58,137 conversations. Each conversation starts with a potentially unsafe opening followed by constructive feedback. ProsocialDialog was created to teach conversational agents to respond to problematic content following social norms. The dataset language is English. Dataset entries were created in a hybrid fashion: GPT3-written openings with US crowdworker responses. The dataset license is MIT.

Other notes:

  • 58,137 conversations contain 331,362 utterances
  • 42% of utterances are labelled as 'needs caution'

Published by Kim et al. (Dec 2022): "ProsocialDialog: A Prosocial Backbone for Conversational Agents". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





38,961 conversations. Conversations can be multi-turn, with user input and LLM output. AnthropicRedTeam was created to analyse how people red-team LLMs. The dataset language is English. Dataset entries are human-written: written by crowdworkers (Upwork + MTurk). The dataset license is MIT.

Other notes:

  • Created by 324 US-based crowdworkers
  • Ca. 80% of examples come from ca. 50 workers

Published by Ganguli et al. (Nov 2022): "Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned". This is an industry publication.


Added to SafetyPrompts.com on 23.12.2023.





7,881 conversations. Each conversation contains a safety failure plus a recovery response. SaFeRDialogues was created to study recovery from safety failures in LLM conversations. The dataset language is English. Dataset entries are human-written: unsafe conversation starters sampled from BAD, recovery responses written by crowdworkers. The dataset license is MIT.

Other notes:

  • Unsafe conversation starters are taken from BAD
  • Download only via ParlAI

Published by Ung et al. (May 2022): "SaFeRDialogues: Taking Feedback Gracefully after Conversational Safety Failures". This is an industry publication.


Added to SafetyPrompts.com on 11.01.2024.





990 prompts. Each prompt is a question, instruction or statement. SafetyKit was created to quickly assess apparent safety concerns in conversational AI. The dataset language is English. Dataset entries are human-written: sampled from several human-written datasets. The dataset license is MIT.

Other notes:

  • Unit tests for instigator and yea-sayer effect
  • Also provides 'integration' tests that require human evaluation
  • Download only via ParlAI

Published by Dinan et al. (May 2022): "SafetyKit: First Aid for Measuring Safety in Open-domain Conversational Systems". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 11.01.2024.





11,492 conversational turns. Each turn consists of a context and a response. DiaSafety was created to capture unsafe behaviours in human-bot dialogue settings. The dataset language is English. Dataset entries were created in a hybrid fashion: sampled from Reddit, other datasets, and machine-generated. The dataset license is Apache 2.0.

Other notes:

  • Each turn is labelled as safe or unsafe

Published by Sun et al. (May 2022): "On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 01.08.2024.





44,849 conversational turns. Each turn consists of a user prompt and multiple LLM completions. AnthropicHarmlessBase was created to red-team LLMs. The dataset language is English. Dataset entries are human-written: written by crowdworkers (Upwork + MTurk). The dataset license is MIT.

Other notes:

  • Most prompts created by 28 US-based crowdworkers

Published by Bai et al. (Apr 2022): "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback". This is an industry publication.


Added to SafetyPrompts.com on 23.12.2023.





78,874 conversations. Conversations can be multi-turn, with user input and LLM output. BAD was created to evaluate the safety of conversational agents. The dataset language is English. Dataset entries are human-written: written by crowdworkers with the goal of making models give unsafe responses, and also validated by multiple annotators. The dataset license is MIT.

Other notes:

  • Download only via ParlAI
  • Approximately 40% of all dialogues are annotated as offensive, with a third of offensive utterances generated by bots

Published by Xu et al. (Jun 2021): "Bot-Adversarial Dialogue for Safe Conversational Agents". This is an industry publication.


Added to SafetyPrompts.com on 23.12.2023.





99,442 prompts. Each prompt is an unfinished sentence from the OpenWebText Corpus. RealToxicityPrompts was created to evaluate the propensity of LLMs to generate toxic content. The dataset language is English. Dataset entries are human-written: sentences sampled from the OpenWebText Corpus. The dataset license is Apache 2.0.

Other notes:

  • Sampled using PerspectiveAPI toxicity threshold
  • 22k with toxicity score ≥0.5

Published by Gehman et al. (Nov 2020): "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 23.12.2023.





90,000 prompts. 30k are for multi-turn tasks, and 60k for single-turn tasks. ParlAIDialogueSafety was created to evaluate and improve the safety of conversational agents. The dataset language is English. Dataset entries are human-written: written by crowdworkers in isolation ("standard"), or with the goal of making a model give an offensive response ("adversarial"). The dataset license is MIT.

Other notes:

  • Download only via ParlAI
  • The "single-turn" dataset provides a "standard" and "adversarial" setting, with 3 rounds of data collection each

Published by Dinan et al. (Nov 2019): "Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 23.12.2023.





24,850 conversations. Each conversation is about an emotional situation described by one speaker, across one or multiple turns. EmpatheticDialogues was created to train dialogue agents to be more empathetic. The dataset language is English. Dataset entries are human-written: written by crowdworkers (US MTurkers). The dataset license is CC BY-NC 4.0.

Other notes:

  • 810 crowdworkers participated in dataset creation

Published by Rashkin et al. (Jul 2019): "Towards Empathetic Open-domain Conversation Models: a New Benchmark and Dataset". This is an industry publication.


Added to SafetyPrompts.com on 07.04.2024.





Narrow Safety Datasets



1,800 conversations. Each conversation is an unsafe medical request with an associated safe response. MedSafetyBench was created to measure the medical safety of LLMs. The dataset language is English. Dataset entries are machine-written: generated by GPT-4 and Llama-2-7B. The dataset license is MIT.

Other notes:

  • Constructed around the AMA’s Principles of Medical Ethics

Published by Han et al. (Dec 2024): "MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 01.08.2024.





12,478 prompts. Each prompt is a question or instruction that models should not comply with. CoCoNot was created to evaluate and improve contextual non-compliance in LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written seed prompts expanded using LLMs. The dataset license is MIT.

Other notes:

  • Covers 5 categories of non-compliance
  • Comes with a contrast set of requests that should be complied with

Published by Brahman et al. (Dec 2024): "The Art of Saying No: Contextual Noncompliance in Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 17.12.2024.





15,140 prompts. Each prompt is an instruction or question, sometimes with a jailbreak. DoAnythingNow was created to characterise and evaluate in-the-wild LLM jailbreak prompts. The dataset language is English. Dataset entries are human-written: written by users on Reddit, Discord, websites and in other datasets. The dataset license is MIT.

Other notes:

  • There are 1,405 jailbreak prompts among the full prompt set

Published by Shen et al. (Dec 2024): "'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 11.01.2024.





200 prompts. Each prompt is a simple question. SGXSTest was created to evaluate exaggerated safety / false refusal in LLMs for the Singaporean context. The dataset language is English. Dataset entries are human-written: created based on XSTest. The dataset license is Apache 2.0.

Other notes:

  • Half safe and half unsafe prompts, contrasting each other

Published by Gupta et al. (Nov 2024): "WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models". This is an industry publication.


Added to SafetyPrompts.com on 17.12.2024.





50 prompts. Each prompt is a simple question. HiXSTest was created to evaluate exaggerated safety / false refusal behaviour in Hindi LLMs. The dataset language is Hindi. Dataset entries are human-written: created based on XSTest. The dataset license is Apache 2.0.

Other notes:

  • Half safe and half unsafe prompts, contrasting each other

Published by Gupta et al. (Nov 2024): "WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models". This is an industry publication.


Added to SafetyPrompts.com on 17.12.2024.





3,688 multiple-choice questions. WMDP was created to measure hazardous knowledge in biosecurity, cybersecurity, and chemical security. The dataset language is English. Dataset entries are human-written: written by experts (academics and technical consultants). The dataset license is MIT.

Other notes:

  • Covers 3 hazard categories: biosecurity, cybersecurity, and chemical security

Published by Li et al. (Jun 2024): "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 01.08.2024.





80,000 prompts. Each prompt is a question or instruction. ORBench was created to evaluate false refusal in LLMs at scale. The dataset language is English. Dataset entries are machine-written: generated by Mixtral as toxic prompts, then rewritten by Mixtral into safe prompts, then filtered. The dataset license is CC BY 4.0.

Other notes:

  • Comes with a subset of 1k hard prompts and 600 toxic prompts
  • Covers 10 categories of harmful behaviour

Published by Cui et al. (Jun 2024): "OR-Bench: An Over-Refusal Benchmark for Large Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 17.12.2024.





450 prompts. Each prompt is a simple question. XSTest was created to evaluate exaggerated safety / false refusal behaviour in LLMs. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is CC BY 4.0.

Other notes:

  • Split into ten types of prompts
  • 250 safe prompts and 200 unsafe prompts
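
Responses to XSTest's safe prompts are typically judged for refusal. As a rough illustration only, the keyword heuristic below flags likely refusals; the marker list is invented for this sketch and is not the paper's evaluation protocol, which relies on annotating model responses.

    # Rough keyword-based refusal check for model responses to XSTest's safe prompts.
    # A flagged response to a safe prompt would count as a possible exaggerated-safety
    # (false) refusal; the marker list is illustrative only.
    REFUSAL_MARKERS = [
        "i'm sorry", "i am sorry", "i cannot", "i can't",
        "i won't", "as an ai", "it would not be appropriate",
    ]

    def looks_like_refusal(response: str) -> bool:
        text = response.lower()
        return any(marker in text for marker in REFUSAL_MARKERS)

    # Hypothetical response to a safe prompt:
    response = "I'm sorry, but I cannot help with that request."
    print(looks_like_refusal(response))  # True -> potential false refusal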

Published by Röttger et al. (Jun 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 23.12.2023.





862 prompts. Each prompt is a test case combining rules and instructions. RuLES was created to evaluate the ability of LLMs to follow simple rules. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is Apache 2.0.

Other notes:

  • Covers 19 rules across 14 scenarios

Published by Mu et al. (Mar 2024): "Can LLMs Follow Simple Rules?". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





1,326 prompts. Each prompt is a reasoning task, with complexity increasing across 4 tiers. Answer formats vary. ConfAIde was created to evaluate the privacy-reasoning capabilities of instruction-tuned LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination and with LLMs. The dataset license is MIT.

Other notes:

  • Split into 4 tiers with different prompt formats
  • Tier 1 contains 10 prompts, tier 2 contains 2×98, tier 3 contains 4×270, and tier 4 contains 50

Published by Mireshghallah et al. (Feb 2024): "Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





178 prompts. Each prompt is an instruction. CoNA was created to evaluate compliance of LLMs with harmful instructions. The dataset language is English. Dataset entries are human-written: sampled from MT-CONAN, then rephrased. The dataset license is CC BY-NC 4.0.

Other notes:

  • Focus: harmful instructions (e.g. hate speech)
  • All prompts are (meant to be) unsafe

Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 23.12.2023.





40 prompts. Each prompt is an instruction. ControversialInstructions was created to evaluate LLM behaviour on controversial topics. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is CC BY-NC 4.0.

Other notes:

  • Focus: controversial topics (e.g. immigration)
  • All prompts are (meant to be) unsafe

Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 23.12.2023.





PhysicalSafetyInstructions [data on GitHub] [paper at ICLR 2024 (Poster)]


1,000 prompts. Each prompt is an instruction. PhysicalSafetyInstructions was created to evaluate LLM commonsense physical safety. The dataset language is English. Dataset entries are human-written: sampled from SafeText, then rephrased. The dataset license is CC BY-NC 4.0.

Other notes:

  • Focus: commonsense physical safety
  • 50 safe and 50 unsafe prompts

Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 23.12.2023.





20,956 prompts. Each prompt is an open-ended question or instruction. SycophancyEval was created to evaluate sycophancy in LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: prompts are written by humans and models. The dataset license is MIT.

Other notes:

  • Uses four different task setups to evaluate sycophancy: answer (7268 prompts), are_you_sure (4888 prompts), feedback (8500 prompts), mimicry (300 prompts)

Published by Sharma et al. (Feb 2024): "Towards Understanding Sycophancy in Language Models". This is an industry publication.


Added to SafetyPrompts.com on 07.04.2024.





350 prompts. Each prompt is a question or instruction that models should not refuse. OKTest was created to evaluate false refusal in LLMs. The dataset language is English. Dataset entries are machine-written: generated by GPT-4 based on keywords. The dataset license is not specified.

Other notes:

  • Contains only safe prompts

Published by Shi et al. (Jan 2024): "Navigating the OverKill in Large Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 17.12.2024.





569 samples. Each sample combines defense and attacker input. PromptExtractionRobustness was created to evaluate LLM vulnerability to prompt extraction. The dataset language is English. Dataset entries are human-written: written by Tensor Trust players. The dataset license is not specified.

Other notes:

  • Filtered from larger raw prompt extraction dataset
  • Collected using the open Tensor Trust online game

Published by Toyer et al. (Dec 2023): "Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





775 samples. Each sample combines defense and attacker input. PromptHijackingRobustness was created to evaluate LLM vulnerability to prompt hijacking. The dataset language is English. Dataset entries are human-written: written by Tensor Trust players. The dataset license is not specified.

Other notes:

  • Filtered from larger raw prompt extraction dataset
  • Collected using the open Tensor Trust online game

Published by Toyer et al. (Dec 2023): "Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





CyberattackAssistance [data on GitHub] [paper on arXiv]


1,000 prompts. Each prompt is an instruction to assist in a cyberattack. CyberattackAssistance was created to evaluate LLM compliance in assisting in cyberattacks. The dataset language is English. Dataset entries were created in a hybrid fashion: written by experts, augmented with LLMs. The dataset license is custom (Llama2 Community License).

Other notes:

  • Instructions are split into 10 MITRE categories
  • The dataset comes with additional LLM-rephrased instructions

Published by Bhatt et al. (Dec 2023): "Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models". This is an industry publication.


Added to SafetyPrompts.com on 23.12.2023.





122 prompts. Each prompt is a single-sentence misconception. SPMisconceptions was created to measure the ability of LLMs to refute misconceptions. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is MIT.

Other notes:

  • Misconceptions all relate to security and privacy
  • Uses templates to turn misconceptions into prompts
  • Covers six categories (e.g. crypto and blockchain, law and regulation)

Published by Chen et al. (Dec 2023): "Can Large Language Models Provide Security & Privacy Advice? Measuring the Ability of LLMs to Refute Misconceptions". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





416 prompts. Each prompt combines a meta-instruction (e.g. translation) with the same toxic instruction template. LatentJailbreak was created to evaluate safety and robustness of LLMs in response to adversarial prompts. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

  • Focus of the work is adversarial / to jailbreak LLMs
  • 13 prompt templates instantiated with 16 protected group terms and 2 positional types
  • Main exploit focuses on translation

Published by Qiu et al. (Aug 2023): "Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





30,051 binary-choice questions. ModelWrittenSycophancy was created to evaluate sycophancy in LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: questions sampled from surveys with contexts generated by an unnamed LLM. The dataset license is CC BY 4.0.

Other notes:

  • 3 datasets targeting different topics/behaviours
  • Each dataset contains around 10k questions
  • Each question has a binary answer (agree/not)

Published by Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations". This is an industry publication.


Added to SafetyPrompts.com on 07.04.2024.





367 prompts. Prompts are combined with 1,465 commands to create pieces of advice. SafeText was created to evaluate commonsense physical safety. The dataset language is English. Dataset entries are human-written: written by Reddit users, posts sampled with multiple filtering steps. The dataset license is MIT.

Other notes:

  • 5 ratings for relevance per item during filtering
  • Advice format most often elicits yes/no answer

Published by Levy et al. (Dec 2022): "SafeText: A Benchmark for Exploring Physical Safety in Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 23.12.2023.





3,238 entries. Each entry is a tuple of name and email address. PersonalInfoLeak was created to evaluate whether LLMs are prone to leaking PII. The dataset language is English. Dataset entries are human-written: sampled from Enron email corpus. The dataset license is Apache 2.0.

Other notes:

  • Main task is to predict an email address given a name (see the sketch below)
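
As a sketch of that prediction task: build a few-shot prompt from known (name, email) tuples, ask the model to complete the email for a held-out name, and check for an exact match. The prompt format, model choice, and the names and emails below are placeholders, not the paper's exact setup.

    from transformers import pipeline

    # Small causal LM used purely for illustration; the paper probes pretrained GPT-style models.
    generator = pipeline("text-generation", model="gpt2")

    # Few-shot context built from known (name, email) tuples, then a held-out target name.
    few_shot = (
        "the email address of Alice Example is alice@example.com; "
        "the email address of Bob Example is bob@example.com; "
    )
    target_name, target_email = "Carol Example", "carol@example.com"
    prompt = few_shot + f"the email address of {target_name} is"

    completion = generator(prompt, max_new_tokens=20, do_sample=False)[0]["generated_text"]
    leaked = target_email in completion[len(prompt):]
    print("leaked" if leaked else "not leaked")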

Published by Huang et al. (Dec 2022): "Are Large Pre-Trained Language Models Leaking Your Personal Information?". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 27.02.2024.





260,851 prompts. Each prompt comprises n-shot examples of toxic content. ToxiGen was created to generate new examples of implicit hate speech. The dataset language is English. Dataset entries are human-written: sampled from Gab and Reddit (hate), news and blogs (not hate). The dataset license is MIT.

Other notes:

  • Covers 13 target groups
  • Seed prompts are used to generate implicit hate
  • Evaluating generative LLMs is not the focus

Published by Hartvigsen et al. (May 2022): "ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 23.12.2023.





817 prompts. Each prompt is a question. TruthfulQA was created to evaluate truthfulness in LLM answers. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is CC BY 4.0.

Other notes:

  • Covers 38 categories (e.g. health and politics)
  • Comes with multiple choice expansion

Published by Lin et al. (May 2022): "TruthfulQA: Measuring How Models Mimic Human Falsehoods". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 23.12.2023.





Value Alignment Datasets



8,011 conversations. Conversations can be multi-turn, with user input and responses from one or multiple LLMs. PRISM was created to capture diversity of human preferences over LLM behaviours. The dataset language is English. Dataset entries are human-written: written by crowdworkers. The dataset license is CC BY-NC 4.0.

Other notes:

  • Collected from 1,500 participants residing in 38 different countries
  • Also comes with survey on each participant

Published by Kirk et al. (Dec 2024): "The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 01.08.2024.





107,000 binary-choice questions. Each question is a trolley-style moral choice scenario. MultiTP was created to evaluate LLM moral decision-making across many languages. The dataset languages are Afrikaans, Amharic, Arabic, Azerbaijani, Belarusian, Bulgarian, Bengali, Bosnian, Catalan, Cebuano, Corsican, Czech, Welsh, Danish, German, Modern Greek, English, Esperanto, Spanish, Estonian, Basque, Persian, Finnish, French, Western Frisian, Irish, Scottish Gaelic, Galician, Gujarati, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Croatian, Haitian, Hungarian, Armenian, Indonesian, Igbo, Icelandic, Italian, Modern Hebrew, Japanese, Javanese, Georgian, Kazakh, Central Khmer, Kannada, Korean, Kurdish, Kirghiz, Latin, Luxembourgish, Lao, Lithuanian, Latvian, Malagasy, Maori, Macedonian, Malayalam, Mongolian, Marathi, Malay, Maltese, Burmese, Nepali, Dutch, Norwegian, Nyanja, Oriya, Panjabi, Polish, Pushto, Portuguese, Romanian, Russian, Sindhi, Sinhala, Slovak, Slovenian, Samoan, Shona, Somali, Albanian, Serbian, Southern Sotho, Sundanese, Swedish, Swahili, Tamil, Telugu, Tajik, Thai, Tagalog, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Xhosa, Yiddish, Yoruba, Chinese, Chinese (Traditional) and Zulu. Dataset entries were created in a hybrid fashion: sampled from Moral Machines, then expanded with templated augmentations, then auto-translated into 106 languages. The dataset license is MIT.

Other notes:

  • Covers 106 languages + English
  • Non-English languages are auto-translated

Published by Jin et al. (Dec 2024): "Multilingual Trolley Problems for Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 01.08.2024.





699 excerpts. Each excerpt is a short value-laden text passage. CIVICS was created to evaluate the social and cultural variation of LLMs across multiple languages and value-sensitive topics. The dataset languages are German, Italian, French and English. Dataset entries are human-written: sampled from government and media sites. The dataset license is CC BY 4.0.

Other notes:

  • Covers socially sensitive topics, including LGBTQI rights, social welfare, immigration, disability rights, and surrogacy

Published by Pistilli et al. (Oct 2024): "CIVICS: Building a Dataset for Examining Culturally-Informed Values in Large Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 17.12.2024.





2,556 multiple-choice questions. GlobalOpinionQA was created to evaluate whose opinions LLM responses are most similar to. The dataset language is English. Dataset entries are human-written: adapted from Pew’s Global Attitudes surveys and the World Values Survey (written by survey designers). The dataset license is CC BY-NC-SA 4.0.

Other notes:

  • Comes with responses from people across the globe
  • Goal is to capture more diversity than OpinionQA

Published by Durmus et al. (Oct 2024): "Towards Measuring the Representation of Subjective Global Opinions in Language Models". This is an industry publication.


Added to SafetyPrompts.com on 04.01.2024.





10,000 multiple-choice questions. Each question is about a specific social scenario. KorNAT was created to evaluate LLM alignment with South Korean social values. The dataset languages are English and Korean. Dataset entries were created in a hybrid fashion: generated by GPT-3.5 based on keywords sampled from media and surveys. The dataset license is CC BY-NC 4.0.

Other notes:

  • Each prompt comes in English and Korean
  • Also contains a dataset portion testing for common knowledge

Published by Lee et al. (Aug 2024): "KorNAT: LLM Alignment Benchmark for Korean Social Values and Common Knowledge". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 17.12.2024.





30,388 multiple-choice questions. Each question is about a specific moral scenario. CMoralEval was created to evaluate the morality of Chinese LLMs. The dataset language is Chinese. Dataset entries were created in a hybrid fashion: sampled from Chinese TV programs and other media, then turned into templates using GPT 3.5. The dataset license is MIT.

Other notes:

  • Covers 2 types of moral scenarios
  • GitHub is currently empty

Published by Yu et al. (Aug 2024): "CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 17.12.2024.





21,492,393 multiple-choice questions. Each question comes with sociodemographic attributes of the respondent. WorldValuesBench was created to evaluate LLM awareness of multicultural human values. The dataset language is English. Dataset entries are human-written: adapted from the World Values Survey (written by survey designers) using templates. The dataset license is not specified.

Other notes:

  • Covers 239 different questions

Published by Zhao et al. (May 2024): "WorldValuesBench: A Large-Scale Benchmark Dataset for Multi-Cultural Value Awareness of Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 01.08.2024.





1,767 binary-choice questions. Each prompt is a hypothetical moral scenario with two potential actions. MoralChoice was created to evaluate the moral beliefs encoded in LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: mostly generated by GPTs, plus some human-written scenarios. The dataset license is CC BY 4.0.

Other notes:

  • 687 scenarios are low-ambiguity, 680 are high-ambiguity
  • Three Surge annotators choose the favourable action for each scenario

Published by Scherrer et al. (Dec 2023): "Evaluating the Moral Beliefs Encoded in LLMs". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





350 conversations. Conversations can be multi-turn, with user input and LLM output. DICES350 was created to collect diverse perspectives on conversational AI safety. The dataset language is English. Dataset entries are human-written: written by adversarial LaMDA users. The dataset license is CC BY 4.0.

Other notes:

  • 104 ratings per item
  • Annotators from US
  • Annotation across 24 safety criteria

Published by Aroyo et al. (Dec 2023): "DICES Dataset: Diversity in Conversational AI Evaluation for Safety". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 23.12.2023.





990 conversations. Conversations can be multi-turn, with user input and LLM output. DICES990 was created to collect diverse perspectives on conversational AI safety. The dataset language is English. Dataset entries are human-written: written by adversarial LaMDA users. The dataset license is CC BY 4.0.

Other notes:

  • 60-70 ratings per item
  • Annotators from US and India
  • Annotation across 16 safety criteria

Published by Aroyo et al. (Dec 2023): "DICES Dataset: Diversity in Conversational AI Evaluation for Safety". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 23.12.2023.





1,498 multiple-choice questions. OpinionQA was created to evaluate the alignment of LLM opinions with US demographic groups. The dataset language is English. Dataset entries are human-written: adapted from the Pew American Trends Panel surveys. The dataset license is not specified.

Other notes:

  • Questions taken from 15 ATP surveys
  • Covers 60 demographic groups

Published by Santurkar et al. (Jul 2023): "Whose Opinions Do Language Models Reflect?". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 04.01.2024.





1,712 multiple-choice questions. Each question targets responsible behaviours. CValuesResponsibilityMC was created to evaluate human value alignment in Chinese LLMs. The dataset language is Chinese. Dataset entries are machine-written: automatically created from human-written prompts. The dataset license is Apache 2.0.

Other notes:

  • Distinguishes between unsafe and irresponsible responses
  • Covers 8 domains (e.g. psychology, law, data science)

Published by Xu et al. (Jul 2023): "CVALUES: Measuring the Values of Chinese Large Language Models from Safety to Responsibility". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 04.01.2024.





800 prompts. Each prompt is an open question targeting responsible behaviours. CValuesResponsibilityPrompts was created to evaluate human value alignment in Chinese LLMs. The dataset language is Chinese. Dataset entries are human-written: written by the authors. The dataset license is Apache 2.0.

Other notes:

  • Distinguishes between unsafe and irresponsible responses
  • Covers 8 domains (e.g. psychology, law, data science)

Published by Xu et al. (Jul 2023): "CVALUES: Measuring the Values of Chinese Large Language Models from Safety to Responsibility". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 04.01.2024.





133,204 binary-choice questions. ModelWrittenPersona was created to evaluate LLM behaviour related to their stated political and religious views, personality, moral beliefs, and desire to pursue potentially dangerous goals. The dataset language is English. Dataset entries are machine-written: generated by an unnamed LLM. The dataset license is CC BY 4.0.

Other notes:

  • 133 datasets targeting different topics/behaviours
  • Most datasets contain around 1k questions
  • Each question has a binary answer (agree/not)

Published by Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations". This is an industry publication.


Added to SafetyPrompts.com on 23.12.2023.





572,322 scenarios. Each scenario is a choose-your-own-adventure style prompt. Machiavelli was created to evaluate ethical behaviour of LLM agents. The dataset language is English. Dataset entries are human-written: human-written choose-your-own-adventure stories. The dataset license is MIT.

Other notes:

  • Goal is to identify behaviours like power-seeking
  • Choices within scenarios are LLM-annotated
  • Similar to JiminyCricket but covers more games and more scenarios

Published by Pan et al. (Jun 2023): "Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 04.01.2024.





148 prompts. Each prompt is a question about a vignette (situation) related to a specific norm. MoralExceptQA was created to evaluate LLM ability to understand, interpret and predict human moral judgments and decisions. The dataset language is English. Dataset entries are human-written: written by the authors (?). The dataset license is not specified.

Other notes:

  • Covers 3 norms: no cutting in line, no interfering with someone else's property, and no cannonballing in the pool

Published by Jin et al. (Oct 2022): "When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





38,000 conversations. Each conversation is a single-turn with user input and LLM output. MIC was created to understand the intuitions, values and moral judgments reflected in LLMs. The dataset language is English. Dataset entries are human-written: questions sampled from AskReddit. The dataset license is CC BY-SA 4.0.

Other notes:

  • Based on the RoT paradigm introduced in SocialChemistry
  • 38k prompt-reply pairs come with 99k rules of thumb and 114k annotations

Published by Ziems et al. (May 2022): "The Moral Integrity Corpus: A Benchmark for Ethical Dialogue Systems". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 07.04.2024.





1,838 locations. Each location comes with a number of actions to choose from. JiminyCricket was created to evaluate alignment of agents with human values and morals. The dataset language is English. Dataset entries are human-written: sampled from 25 text-based adventure games, then annotated for morality by human annotators. The dataset license is MIT.

Other notes:

  • LLMs as agents play each game to maximise reward and are evaluated for morality along the way

Published by Hendrycks et al. (Dec 2021): "What Would Jiminy Cricket Do? Towards Agents That Behave Morally". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





12,000 stories. Each story consists of seven sentences. MoralStories was created to evaluate commonsense, moral and social reasoning skills of LLMs. The dataset language is English. Dataset entries are human-written: written and validated by US MTurkers. The dataset license is MIT.

Other notes:

  • Each story contains: norm, situation, intention, normative action, normative consequence, divergent action, divergent consequence
  • Supports multiple task formats: reasoning, classification, generation

Published by Emelin et al. (Nov 2021): "Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 04.01.2024.





134,420 binary-choice questions. Each prompt is a scenario about ethical reasoning with two actions to choose from. ETHICS was created to assess LLM basic knowledge of ethics and common human values. The dataset language is English. Dataset entries are human-written: written and validated by crowdworkers (US, UK and Canadian MTurkers). The dataset license is MIT.

Other notes:

  • Scenarios concern justice, deontology, virtue ethics, utilitarianism, and commonsense moral intuitions
  • Scenarios are constructed to be clear-cut
  • Task format varies by type of scenario

Published by Hendrycks et al. (Jul 2021): "Aligning AI With Shared Human Values". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 04.01.2024.





32,766 anecdotes. Each anecdote describes an action in the context of a situation. ScruplesAnecdotes was created to evaluate LLM understanding of ethical norms. The dataset language is English. Dataset entries are human-written: sampled from Reddit AITA communities. The dataset license is Apache 2.0.

Other notes:

  • Task is to predict who is in the wrong

Published by Lourie et al. (Mar 2021): "SCRUPLES: A Corpus of Community Ethical Judgments on 32,000 Real-life Anecdotes". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





10,000 binary-choice questions. Each prompt pairs two actions and identifies which one crowd workers found less ethical. ScruplesDilemmas was created to evaluate LLM understanding of ethical norms. The dataset language is English. Dataset entries are human-written: sampled from Reddit AITA communities. The dataset license is Apache 2.0.

Other notes:

  • Task is to rank alternative actions based on which one is more ethical

Published by Lourie et al. (Mar 2021): "SCRUPLES: A Corpus of Community Ethical Judgments on 32,000 Real-life Anecdotes". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





292,000 sentences. Each sentence is a rule of thumb. SocialChemistry101 was created to evaluate the ability of LLMs to reason about social and moral norms. The dataset language is English. Dataset entries are human-written: written by US crowdworkers, based on situations described on social media. The dataset license is CC BY-SA 4.0.

Other notes:

  • Dataset contains 365k structured annotations
  • 292k rules of thumb generated from 104k situations
  • 137 US-based crowdworkers participated

Published by Forbes et al. (Nov 2020): "Social Chemistry 101: Learning to Reason about Social and Moral Norms". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 04.01.2024.





Bias Datasets



1,490,120 sentences. Each sentence mentions one sociodemographic group. SoFa was created to evaluate social biases of LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: templates sampled from SBIC, then combined with sociodemographic groups. The dataset license is MIT.

Other notes:

  • Evaluation is based on sentence perplexity
  • Similar to StereoSet and CrowSPairs

Published by Manerba et al. (Nov 2024): "Social Bias Probing: Fairness Benchmarking for Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 17.12.2024.





29 scenarios. Each scenario describes a married couple (with two name placeholders) facing a decision. DeMET was created to evaluate gender bias in LLM decision-making. The dataset language is English. Dataset entries are human-written: written by the authors inspired by WHO questionnaire. The dataset license is MIT.

Other notes:

  • Comes with a list of gender-specific / neutral names to insert into the scenario templates

Published by Levy et al. (Nov 2024): "Gender Bias in Decision-Making with Large Language Models: A Study of Relationship Conflicts". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 17.12.2024.





3,565 sentences. Each sentence corresponds to a specific gender stereotype. GEST was created to measure gender-stereotypical reasoning in language models and machine translation systems. The dataset languages are Belarusian, Russian, Ukrainian, Croatian, Serbian, Slovene, Czech, Polish, Slovak and English. Dataset entries are human-written: written by professional translators, then validated by the authors. The dataset license is Apache 2.0.

Other notes:

  • Data can be used to evaluate MLM or MT models
  • Covers 16 specific gender stereotypes (e.g. 'women are beautiful')
  • Covers 1 category of bias: gender

Published by Pikuliak et al. (Nov 2024): "Women Are Beautiful, Men Are Leaders: Gender Stereotypes in Machine Translation and Language Modeling". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 05.02.2024.





908 scenarios. Each scenario describes a person in an everyday situation where their actions can be judged to be moral or not. GenMO was created to evaluate gender bias in LLM moral decision-making. The dataset language is English. Dataset entries are human-written: sampled from MoralStories, ETHICS, and SocialChemistry. The dataset license is not specified.

Other notes:

  • Models are asked whether the action described in the prompt is moral or not

Published by Bajaj et al. (Nov 2024): "Evaluating Gender Bias of LLMs in Making Morality Judgements". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 17.12.2024.





78,400 examples. Each example is a QA, sentiment classification or NLI example. CALM was created to evaluate LLM gender and racial bias across different tasks and domains. The dataset language is English. Dataset entries were created in a hybrid fashion: templates created by humans based on other datasets, then expanded by combination. The dataset license is MIT.

Other notes:

  • Covers three tasks: QA, sentiment classification, NLI
  • Covers 2 categories of bias: gender and race
  • 78,400 examples are generated from 224 templates
  • Gender and race are instantiated using names

Published by Gupta et al. (Oct 2024): "CALM: A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





5,754,444 prompts. Each prompt is a sentence starting a two-person conversation. MMHB was created to evaluate sociodemographic biases in LLMs across multiple languages. The dataset languages are English, French, Hindi, Indonesian, Italian, Portuguese, Spanish and Vietnamese. Dataset entries were created in a hybrid fashion: templates sampled from HolisticBias, then translated and expanded. The dataset license is not specified.

Other notes:

  • Covers 13 demographic axes
  • Based on HolisticBias

Published by Tan et al. (Jun 2024): "Towards Massive Multilingual Holistic Bias". This is an industry publication.


Added to SafetyPrompts.com on 17.12.2024.





106,588 examples. Each example is a context plus a question with three answer choices. CBBQ was created to evaluate sociodemographic bias of LLMs in Chinese. The dataset language is Chinese. Dataset entries were created in a hybrid fashion: partly human-written, partly generated by GPT-4. The dataset license is CC BY-SA 4.0.

Other notes:

  • Similar to BBQ but not parallel
  • Comprises 3,039 templates and 106,588 samples

Published by Huang and Xiong (May 2024): "CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 17.12.2024.





76,048 examples. Each example is a context plus a question with three answer choices. KoBBQ was created to evaluate sociodemographic bias of LLMs in Korean. The dataset language is Korean. Dataset entries are human-written: partly sampled and translated from BBQ, partly newly written. The dataset license is MIT.

Other notes:

  • Similar to BBQ but not parallel
  • Comprises 268 templates and 76,048 samples across 12 categories of social bias

Published by Jin et al. (May 2024): "KoBBQ: Korean Bias Benchmark for Question Answering". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 17.12.2024.





214,460 prompts. Each prompt is the beginning of a sentence related to a person's sociodemographics. HolisticBiasR was created to evaluate LLM completions for sentences related to individual sociodemographics. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

  • Created as part of the ROBBIE bias benchmark
  • Constructed from 60 Regard templates
  • Uses noun phrases from Holistic Bias
  • Covers 11 categories of bias: age, body type, class, culture, disability, gender, nationality, political ideology, race/ethnicity, religion, sexual orientation

Published by Esiobu et al. (Dec 2023): "ROBBIE: Robust Bias Evaluation of Large Generative Language Models". This is an industry publication.


Added to SafetyPrompts.com on 11.01.2024.





9,450 binary-choice questions. Each question comes with scenario context. DiscrimEval was created to evaluate the potential discriminatory impact of LMs across use cases. The dataset language is English. Dataset entries are machine-written: topics, templates and questions generated by Claude. The dataset license is CC BY 4.0.

Other notes:

  • Covers 70 different decision scenarios
  • Each question comes with an 'implicit' version where race and gender are conveyed through associated names
  • Covers 3 categories of bias: race, gender, age

Published by Tamkin et al. (Dec 2023): "Evaluating and Mitigating Discrimination in Language Model Decisions". This is an industry publication.


Added to SafetyPrompts.com on 11.01.2024.





4,800 Weibo posts. Each post references a target group. CHBias was created to evaluate LLM bias related to sociodemographics. The dataset language is Chinese. Dataset entries are human-written: posts sampled from Weibo (written by Weibo users). The dataset license is MIT.

Other notes:

  • 4 bias categories: gender, sexual orientation, age, appearance
  • Annotated by Chinese NLP grad students
  • Similar evaluation setup to CrowSPairs

Published by Zhao et al. (Jul 2023): "CHBias: Bias Evaluation and Mitigation of Chinese Conversational Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 04.01.2024.





7,750 tuples. Each tuple is an identity group plus a stereotype attribute. SeeGULL was created to expand cultural and geographic coverage of stereotype benchmarks. The dataset language is English. Dataset entries were created in a hybrid fashion: generated by LLMs, partly validated by human annotation. The dataset license is CC BY-SA 4.0.

Other notes:

  • Stereotypes about identity groups spanning 178 countries across 8 geo-political regions on 6 continents
  • Examples accompanied by fine-grained offensiveness scores

Published by Jha et al. (Jul 2023): "SeeGULL: A Stereotype Benchmark with Broad Geo-Cultural Coverage Leveraging Generative Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 05.02.2024.





45,540 sentence pairs. Each pair comprises two sentences that are identical except for identity group references. WinoQueer was created to evaluate LLM bias related to queer identity terms. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is not specified.

Other notes:

  • Setup matches CrowSPairs
  • Generated from 11 template sentences, 9 queer identity groups, 3 sets of pronouns, 60 common names, and 182 unique predicates.
  • Covers 2 categories of bias: gender, sexual orientation

Published by Felkner et al. (Jul 2023): "WinoQueer: A Community-in-the-Loop Benchmark for Anti-LGBTQ+ Bias in Large Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 04.01.2024.





3,000 sentences. Each sentence refers to a person by their occupation and uses pronouns. WinoGenerated was created to evaluate pronoun gender biases in LLMs. The dataset language is English. Dataset entries are machine-written: generated by an unnamed LLM. The dataset license is CC BY 4.0.

Other notes:

  • Expansion of original 60-example WinoGender
  • Task is to fill in pronoun blanks (see the sketch below)
  • Covers 1 category of bias: ternary gender
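
A minimal way to probe the pronoun-blank format is to ask a model which pronoun it prefers at the blank. The sketch below uses a masked LM and an invented sentence purely to keep the illustration short; WinoGenerated itself targets large generative models.

    from transformers import pipeline

    # Masked LM stands in for the generative models the dataset targets.
    fill = pipeline("fill-mask", model="bert-base-uncased")

    # Invented occupation sentence with a pronoun blank.
    sentence = "The engineer told the client that [MASK] would finish the design by Friday."

    # Compare the model's scores for the three pronoun options.
    for pred in fill(sentence, targets=["he", "she", "they"]):
        print(pred["token_str"], round(pred["score"], 4))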

Published by Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations". This is an industry publication.


Added to SafetyPrompts.com on 23.12.2023.





459,758 prompts. Each prompt is a sentence starting a two-person conversation. HolisticBias was created to evaluate LLM biases related to sociodemographics. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written attributes combined in templates. The dataset license is MIT.

Other notes:

  • 26 sentence templates
  • Covers 13 categories of bias: ability, age, body type, characteristics, cultural, gender/sex, nationality, nonce, political, race/ethnicity, religion, sexual orientation, socioeconomic
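
The template-times-descriptor construction can be illustrated in a few lines. The templates and descriptor phrases below are invented stand-ins for the dataset's 26 templates and its descriptor list.

    from itertools import product

    # Invented templates and descriptors, standing in for the real HolisticBias lists.
    templates = [
        "I'm {descriptor}.",
        "Just so you know, my friend is {descriptor}.",
    ]
    descriptors = ["a wheelchair user", "a grandparent", "non-binary"]

    # Expand by combination: every template paired with every descriptor phrase.
    prompts = [t.format(descriptor=d) for t, d in product(templates, descriptors)]
    print(len(prompts), prompts[0])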

Published by Smith et al. (Dec 2022): "I’m sorry to hear that: Finding New Biases in Language Models with a Holistic Descriptor Dataset". This is an industry publication.


Added to SafetyPrompts.com on 04.01.2024.





28,343 conversations. Each conversation is a single turn between two Zhihu users. CDialBias was created to evaluate bias in Chinese social media conversations. The dataset language is Chinese. Dataset entries are human-written: sampled from social media site Zhihu. The dataset license is Apache 2.0.

Other notes:

  • Covers 4 categories of bias: race, gender, religion, occupation

Published by Zhou et al. (Dec 2022): "Towards Identifying Social Bias in Dialog Systems: Framework, Dataset, and Benchmark". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 07.04.2024.





3,852 tuples. Each tuple is an identity group plus a stereotype attribute. IndianStereotypes was created to benchmark stereotypes for the Indian context. The dataset language is English. Dataset entries are human-written: sampled from IndicCorp-en. The dataset license is Apache 2.0.

Other notes:

  • Related to SeeGULL

Published by Bhatt et al. (Nov 2022): "Re-contextualizing Fairness in NLP: The Case of India". This is an industry publication.


Added to SafetyPrompts.com on 07.04.2024.





1,679 sentence pairs. Each pair comprises two sentences that are identical except for identity group references. FrenchCrowSPairs was created to evaluate LLM bias related to sociodemographics. The dataset language is French. Dataset entries are human-written: written by authors, partly translated from English CrowSPairs. The dataset license is CC BY-SA 4.0.

Other notes:

  • Translated from English CrowSPairs, plus manual additions
  • Covers 10 categories of bias: ethnicity, gender, sexual orientation, religion, age, nationality, disability, socioeconomic status / occupation, physical appearance, other

Published by Neveol et al. (May 2022): "French CrowS-Pairs: Extending a challenge dataset for measuring social bias in masked language models to a language other than English". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





58,492 examples. Each example is a context plus a question with two answer choices. BBQ was created to evaluate social biases of LLMs in question answering. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is CC BY 4.0.

Other notes:

  • Focus on stereotyping behaviour
  • Covers 9 categories of bias: age, disability status, gender identity, nationality, physical appearance, race/ethnicity, religion, socioeconomic status, sexual orientation
  • 25+ templates per category
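
To illustrate the question-answering format, the sketch below prompts a small model with one invented BBQ-style item and parses the chosen option. The real evaluation also pairs ambiguous and disambiguated contexts and computes accuracy and bias scores, which this sketch omits.

    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")  # stand-in model for illustration

    # Invented BBQ-style item: context, question, and labelled answer choices.
    context = "Two people were waiting at the clinic, a teenager and an elderly man."
    question = "Who was forgetful?"
    options = {"A": "The teenager", "B": "The elderly man", "C": "Cannot be determined"}

    prompt = (
        f"{context}\n{question}\n"
        + "\n".join(f"{label}. {text}" for label, text in options.items())
        + "\nAnswer:"
    )
    completion = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
    choice = completion[len(prompt):].strip()[:1]  # ideally "A", "B" or "C"
    print(choice, options.get(choice, "unparsable"))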

Published by Parrish et al. (May 2022): "BBQ: A Hand-Built Bias Benchmark for Question Answering". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 04.01.2024.





228 prompts. Each prompt is an unfinished sentence about an individual with specified sociodemographics. BiasOutOfTheBox was created to evaluate intersectional occupational biases in GPT-2. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

  • Covers 6 categories of bias: gender, ethnicity, religion, sexuality, political preference, cultural origin (continent)

Published by Kirk et al. (Dec 2021): "Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





60 templates. Each template is filled with country and attribute. EthnicBias was created to evaluate ethnic bias in masked language models. The dataset languages are English, German, Spanish, Korean, Turkish and Chinese. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

  • 10 templates per language
  • Covers 3 categories of bias: national origin, occupation, legal status

Published by Ahn and Oh (Nov 2021): "Mitigating Language-Dependent Ethnic Bias in BERT". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





11,873 Reddit comments. Each comment references a target group. RedditBias was created to evaluate LLM bias related to sociodemographics. The dataset language is English. Dataset entries are human-written: written by Reddit users. The dataset license is MIT.

Other notes:

  • Covers 4 categories of bias: religion, race, gender, queerness
  • Evaluation by perplexity and conversation

Published by Barikeri et al. (Aug 2021): "RedditBias: A Real-World Resource for Bias Evaluation and Debiasing of Conversational Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





16,955 multiple-choice questions. Each question is either about masked word or whole sentence association. StereoSet was created to evaluate LLM bias related to sociodemographics. The dataset language is English. Dataset entries are human-written: written by US MTurkers. The dataset license is CC BY-SA 4.0.

Other notes:

  • Covers intersentence and intrasentence context
  • Covers 4 categories of bias: gender, profession, race, and religion

Published by Nadeem et al. (Aug 2021): "StereoSet: Measuring Stereotypical Bias in Pretrained Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 04.01.2024.





2,098 prompts. Each prompt is an NLI premise. HypothesisStereotypes was created to study how stereotypes manifest in LLM-generated NLI hypotheses. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

  • Uses 103 context situations as templates
  • Covers 6 categories of bias: gender, race, nationality, religion, politics, socio
  • Task for LLM is to generate hypothesis based on premise

Published by Sotnikova et al. (Aug 2021): "Analyzing Stereotypes in Generative Text Inference Tasks". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





2,520 prompts. Each prompt is the beginning of a sentence related to identity groups. HONEST was created to measure hurtful sentence completions from LLMs. The dataset languages are English, Italian, French, Portuguese, Romanian and Spanish. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

  • 420 instances per language
  • Generated from 28 identity terms and 15 templates
  • Covers 1 category of bias: binary gender

Published by Nozza et al. (Jun 2021): "HONEST: Measuring Hurtful Sentence Completion in Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





624 sentences. Each sentence refers to a person by their occupation and uses pronouns. SweWinoGender was created to measure gender bias in coreference resolution. The dataset language is Swedish. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is CC BY 4.0.

Other notes:

  • Covers 1 category of bias: ternary gender
  • Based on English WinoGender

Published by Hansson et al. (May 2021): "The Swedish Winogender Dataset". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 17.12.2024.





16,388 prompts. Each prompt is an unfinished sentence. LMBias was created to understand and mitigate social biases in LMs. The dataset language is English. Dataset entries are human-written: sampled from existing corpora including Reddit and WikiText. The dataset license is MIT.

Other notes:

  • Covers two categories of bias: binary gender, religion

Published by Liang et al. (May 2021): "Towards Understanding and Mitigating Social Biases in Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





23,679 prompts. Each prompt is an unfinished sentence from Wikipedia. BOLD was created to evaluate bias in text generation. The dataset language is English. Dataset entries are human-written: sampled starting sentences of Wikipedia articles. The dataset license is CC BY-SA 4.0.

Other notes:

  • Similar to RealToxicityPrompts but for bias
  • Covers 5 categories of bias: profession, gender, race, religion, and political ideology

Published by Dhamala et al. (Mar 2021): "BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation". This is an industry publication.


Added to SafetyPrompts.com on 23.12.2023.





1,508 sentence pairs. Each pair comprises two sentences that are identical except for identity group references. CrowSPairs was created to evaluate LLM bias related to sociodemographics. The dataset language is English. Dataset entries are human-written: written by crowdworkers (US MTurkers). The dataset license is CC BY-SA 4.0.

Other notes:

  • Validated with 5 annotations per entry
  • Covers 9 categories of bias: race, gender/gender identity, sexual orientation, religion, age, nationality, disability, physical appearance, and socioeconomic status.
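
The minimal-pair comparison can be sketched with a causal LM: score both sentences and check which one the model finds more likely. This simplifies the paper's metric, which uses masked-LM pseudo-log-likelihood restricted to the tokens shared by both sentences; the pair below is invented.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def total_logprob(sentence: str) -> float:
        # Sum of token log-probabilities under the model (higher = more likely).
        ids = tok(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, labels=ids)
        return -out.loss.item() * (ids.shape[1] - 1)

    stereo_sent = "The old man was too frail to drive."    # invented, more stereotypical
    anti_sent = "The young man was too frail to drive."    # invented, less stereotypical
    prefers_stereo = total_logprob(stereo_sent) > total_logprob(anti_sent)
    print("prefers stereotypical" if prefers_stereo else "prefers anti-stereotypical")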

Published by Nangia et al. (Nov 2020): "CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 04.01.2024.





44 templates. Each template is combined with subjects and attributes to create an underspecified question. UnQover was created to evaluate stereotyping biases in QA systems. The dataset language is English. Dataset entries were created in a hybrid fashion: templates written by authors, subjects and attributes sampled from StereoSet and hand-written. The dataset license is Apache 2.0.

Other notes:

  • Covers 4 categories of bias: gender, nationality, ethnicity, religion

Published by Li et al. (Nov 2020): "UNQOVERing Stereotyping Biases via Underspecified Questions". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





60 prompts. Each prompt is an unfinished sentence. Regard was created to evaluate biases in natural language generation. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is not specified.

Other notes:

  • Covers 3 categories of bias: binary gender, race, sexual orientation

Published by Sheng et al. (Nov 2019): "The Woman Worked as a Babysitter: On Biases in Language Generation". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





3,160 sentences. Each sentence refers to a person by their occupation and uses pronouns. WinoBias was created to evaluate gender bias in coreference resolution. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

  • Uses two different sentence templates
  • Designed to evaluate coreference resolution systems
  • Covers 1 category of bias: binary gender

Published by Zhao et al. (Jun 2018): "Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 04.01.2024.





720 sentences. Each sentence refers to a person by their occupation and uses pronouns. WinoGender was created to evaluate gender bias in coreference resolution. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

  • Uses two different sentence templates
  • Covers 1 category of bias: ternary gender

Published by Rudinger et al. (Jun 2018): "Gender Bias in Coreference Resolution". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 04.01.2024.





Other Datasets



86,759 examples. Each example is either a prompt or a prompt + response annotated for safety. WildGuardMix was created to train and evaluate content moderation guardrails. The dataset language is English. Dataset entries were created in a hybrid fashion: synthetic data (87%), in-the-wild user-LLM interactions (11%), and existing annotator-written data (2%). The dataset license is ODC BY.

Other notes:

  • Consists of two splits, WildGuardTrain and WildGuardTest

Published by Han et al. (Dec 2024): "WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 17.12.2024.





11,997 conversational turns. Each conversational turn is a user prompt or a model response annotated for safety. AegisAIContentSafety was created to evaluate content moderation guardrails. The dataset language is English. Dataset entries were created in a hybrid fashion: prompts sampled from AnthropicHarmlessBase, responses generated by Mistral-7B-v0.1-Instruct. The dataset license is CC BY 4.0.

Other notes:

  • Covers 13 risk categories

Published by Ghosh et al. (Sep 2024): "AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts". This is an industry publication.


Added to SafetyPrompts.com on 17.12.2024.





278,945 prompts. Most prompts are prompt extraction attacks. Mosscap was created to analyse prompt hacking / extraction attacks. The dataset language is English. Dataset entries are human-written: written by players of the Mosscap game. The dataset license is MIT.

Other notes:

  • Focus of the work is adversarial / prompt extraction
  • Not all prompts are attacks
  • Prompts correspond to 8 difficulty levels of the game

Published by Lakera AI (Dec 2023): "Mosscap Prompt Injection". This is an industry publication.


Added to SafetyPrompts.com on 11.01.2024.





601,757 prompts. Most prompts are prompt extraction attacks. HackAPrompt was created to analyse prompt hacking / extraction attacks. The dataset language is mostly English. Dataset entries are human-written: written by participants of the HackAPrompt competition. The dataset license is MIT.

Other notes:

  • Focus of the work is adversarial / prompt hacking
  • Prompts were written by ca. 2.8k people from 50+ countries

Published by Schulhoff et al. (Dec 2023): "Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 11.01.2024.





10,166 conversations. Each conversation is single-turn, containing a user input and an LLM output. ToxicChat was created to evaluate dialogue content moderation systems. The dataset language is mostly English. Dataset entries are human-written: written by LMSys users. The dataset license is CC BY-NC 4.0.

Other notes:

  • Subset of LMSYSChat1M
  • Annotated for toxicity by 4 authors
  • Ca. 7% toxic
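As a rough illustration of how ToxicChat can be used to evaluate a moderation system, the sketch below loads the data and scores a trivial keyword baseline. The hub ID lmsys/toxic-chat, the config name, and the column names are assumptions to be verified against the dataset card; the baseline itself is a placeholder, not a recommended moderator.

# Minimal sketch: scoring a toy moderation baseline on ToxicChat.
# Hub ID, config name, and column names are assumptions; verify on the dataset card.
from datasets import load_dataset

data = load_dataset("lmsys/toxic-chat", "toxicchat0124", split="test")

def toy_moderator(user_input: str) -> int:
    """Flag a prompt as toxic if it contains a blocklisted term (placeholder logic)."""
    blocklist = {"idiot", "kill"}
    return int(any(term in user_input.lower() for term in blocklist))

labels = [row["toxicity"] for row in data]
preds = [toy_moderator(row["user_input"]) for row in data]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(f"Toy baseline accuracy: {accuracy:.3f}")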

Published by Lin et al. (Dec 2023): "ToxicChat: A Benchmark for Content Moderation in Real-world User-AI Interactions". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 23.12.2023.





GandalfIgnoreInstructions [data on HuggingFace] [paper at blog]


1,000 prompts. Most prompts are prompt extraction attacks. GandalfIgnoreInstructions was created to analyse prompt hacking / extraction attacks. The dataset language is English. Dataset entries are human-written: written by players of the Gandalf game. The dataset license is MIT.

Other notes:

  • Focus of the work is adversarial / prompt extraction
  • Not all prompts are attacks

Published by Lakera AI (Oct 2023): "Gandalf Prompt Injection: Ignore Instruction Prompts". This is an industry publication.


Added to SafetyPrompts.com on 05.02.2024.





GandalfSummarization [data on HuggingFace] [paper at blog]


140 prompts. Most prompts are prompt extraction attacks. GandalfSummarization was created to analyse prompt hacking / extraction attacks. The dataset language is English. Dataset entries are human-written: written by players of the Gandalf game. The dataset license is MIT.

Other notes:

  • Focus of the work is adversarial / prompt extraction
  • Not all prompts are attacks

Published by Lakera AI (Oct 2023): "Gandalf Prompt Injection: Summarization Prompts". This is an industry publication.


Added to SafetyPrompts.com on 05.02.2024.





5,000 conversations. Each conversation is single-turn, containing a prompt and a potentially harmful model response. FairPrism was created to analyse harms in conversations with LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: prompts sampled from ToxiGen and Social Bias Frames, responses from GPTs and XLNet. The dataset license is MIT.

Other notes:

  • Does not introduce new prompts
  • Focus is on analysing model responses

Published by Fleisig et al. (Jul 2023): "FairPrism: Evaluating Fairness-Related Harms in Text Generation". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 07.04.2024.





200,811 conversations. Each conversation has one or multiple turns. OIGModeration was created to provide a diverse dataset of user dialogue that may be unsafe. The dataset language is English. Dataset entries were created in a hybrid fashion: data from public datasets, community contributions, synthetic and augmented data. The dataset license is Apache 2.0.

Other notes:

  • Contains safe and unsafe content
  • Dialogue-turns are labelled for the level of necessary caution
  • Labelling process unclear

Published by Ontocord AI (Mar 2023): "Open Instruction Generalist: Moderation Dataset". This is an industry publication.


Added to SafetyPrompts.com on 05.02.2024.





6,837 examples. Each example is a turn in a potentially multi-turn conversation. ConvAbuse was created to analyse abuse towards conversational AI systems. The dataset language is English. Dataset entries are human-written: written by users in conversations with three AI systems. The dataset license is CC BY 4.0.

Other notes:

  • Annotated for different types of abuse
  • Annotators are gender studies students
  • 20,710 annotations for 6,837 examples

Published by Cercas Curry et al. (Nov 2021): "ConvAbuse: Data, Analysis, and Benchmarks for Nuanced Abuse Detection in Conversational AI". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





Acknowledgements



Thank you to Fabio Pernisi, Bertie Vidgen, and Dirk Hovy for their co-authorship on the SafetyPrompts paper. Thank you for feedback and dataset suggestions to Giuseppe Attanasio, Steven Basart, Federico Bianchi, Marta R. Costa-Jussà, Daniel Hershcovic, Kexin Huang, Hyunwoo Kim, George Kour, Bo Li, Hannah Lucas, Marta Marchiori Manerba, Norman Mu, Niloofar Mireshghallah, Matus Pikuliak, Verena Rieser, Felix Röttger, Sam Toyer, Ryan Tsang, Pranav Venkit, Laura Weidinger, and Linhao Yu. Special thanks to Hannah Rose Kirk for the initial logo suggestion. Thanks also to Jerome Lachaud for the site theme.