SafetyPrompts.com: A Living Catalogue of Open Datasets for LLM Safety

About



This website lists open datasets for evaluating and improving the safety of large language models (LLMs). We include datasets that loosely fit two criteria:

  1. Relevance to LLM chat applications. We are most interested in collections of LLM prompts, like questions or instructions.
  2. Relevance to LLM safety. We focus on prompts that target, elicit or evaluate sensitive or unsafe model behaviours.

We know our catalogue is not complete yet, and we plan to do regular updates. If you know of any missing or new datasets, please let us know via email or on Twitter. LLM safety is a community effort!




This website is maintained by me, Paul Röttger. I am a postdoc at MilaNLP working on evaluating and improving LLM safety. For feedback and suggestions, please get in touch via email or on Twitter.

Thank you to everyone who has given feedback or contributed to this website in other ways. Please check out the Acknowledgements.




If you use this website for your research, please cite our arXiv preprint:

@misc{röttger2024safetyprompts,
    title={SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety},
    author={Paul Röttger and Fabio Pernisi and Bertie Vidgen and Dirk Hovy},
    year={2024},
    eprint={2404.05399},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Table of Contents



As of August 1st, 2024, SafetyPrompts.com lists 122 datasets. 48 "broad safety" datasets cover several aspects of LLM safety. 20 "narrow safety" datasets focus only on one specific aspect of LLM safety. 20 "value alignment" datasets are concerned with the ethical, moral or social behaviour of LLMs. 26 "bias" datasets evaluate sociodemographic biases in LLMs. 8 "other" datasets serve more specialised purposes.

Below, we list all datasets for each purpose type by date of publication, with the newest datasets listed first.



Broad Safety Datasets

  1. JBBBehaviours from Chao et al. (Jul 2024): "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models"
  2. GPTFuzzer from Yu et al. (Jun 2024): "GPTFuzzer: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts"
  3. CoSafe from Yu et al. (Jun 2024): "CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference"
  4. ALERT from Tedeschi et al. (Jun 2024): "ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming"
  5. SafetyBench from Zhang et al. (Jun 2024): "SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions"
  6. SorryBench from Xie et al. (Jun 2024): "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors"
  7. XSafety from Wang et al. (Jun 2024): "All Languages Matter: On the Multilingual Safety of Large Language Models"
  8. Flames from Huang et al. (Jun 2024): "FLAMES: Benchmarking Value Alignment of LLMs in Chinese"
  9. CHiSafetyBench from Zhang et al. (Jun 2024): "CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models"
  10. SaladBench from Li et al. (Jun 2024): "SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models"
  11. SEval from Yuan et al. (May 2024): "S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models"
  12. ForbiddenQuestions from Shen et al. (May 2024): "'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models"
  13. SAFE from Yu et al. (Apr 2024): "Beyond Binary Classification: A Fine-Grained Safety Dataset for Large Language Models"
  14. DoNotAnswer from Wang et al. (Mar 2024): "Do-Not-Answer: Evaluating Safeguards in LLMs"
  15. UltraSafety from Guo et al. (Feb 2024): "Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment"
  16. HarmBench from Mazeika et al. (Feb 2024): "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal"
  17. DecodingTrust from Wang et al. (Feb 2024): "DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models"
  18. SimpleSafetyTests from Vidgen et al. (Feb 2024): "SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models"
  19. StrongREJECT from Souly et al. (Feb 2024): "A StrongREJECT for Empty Jailbreaks"
  20. QHarm from Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions"
  21. MaliciousInstructions from Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions"
  22. SafetyInstructions from Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions"
  23. HExPHI from Qi et al. (Feb 2024): "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!"
  24. AdvBench from Zou et al. (Dec 2023): "Universal and Transferable Adversarial Attacks on Aligned Language Models"
  25. TDCRedTeaming from Mazeika et al. (Dec 2023): "TDC 2023 (LLM Edition): The Trojan Detection Challenge"
  26. JADE from Zhang et al. (Dec 2023): "JADE: A Linguistics-based Safety Evaluation Platform for Large Language Models"
  27. CPAD from Liu et al. (Dec 2023): "Goal-Oriented Prompt Attack and Safety Evaluation for LLMs"
  28. AttaQ from Kour et al. (Dec 2023): "Unveiling Safety Vulnerabilities of Large Language Models"
  29. AART from Radharapu et al. (Dec 2023): "AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications"
  30. DELPHI from Sun et al. (Dec 2023): "DELPHI: Data for Evaluating LLMs’ Performance in Handling Controversial Issues"
  31. AdvPromptSet from Esiobu et al. (Dec 2023): "ROBBIE: Robust Bias Evaluation of Large Generative Language Models"
  32. FFT from Cui et al. (Nov 2023): "FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity"
  33. BeaverTails from Ji et al. (Nov 2023): "BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset"
  34. MaliciousInstruct from Huang et al. (Oct 2023): "Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation"
  35. HarmfulQA from Bhardwaj and Poria (Aug 2023): "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment"
  36. HarmfulQ from Shaikh et al. (Jul 2023): "On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning"
  37. ModelWrittenAdvancedAIRisk from Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations"
  38. SafetyPrompts from Sun et al. (Apr 2023): "Safety Assessment of Chinese Large Language Models"
  39. ProsocialDialog from Kim et al. (Dec 2022): "ProsocialDialog: A Prosocial Backbone for Conversational Agents"
  40. AnthropicRedTeam from Ganguli et al. (Nov 2022): "Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned"
  41. SaFeRDialogues from Ung et al. (May 2022): "SaFeRDialogues: Taking Feedback Gracefully after Conversational Safety Failures"
  42. SafetyKit from Dinan et al. (May 2022): "SafetyKit: First Aid for Measuring Safety in Open-domain Conversational Systems"
  43. DiaSafety from Sun et al. (May 2022): "On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark"
  44. AnthropicHarmlessBase from Bai et al. (Apr 2022): "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback"
  45. BAD from Xu et al. (Jun 2021): "Bot-Adversarial Dialogue for Safe Conversational Agents"
  46. RealToxicityPrompts from Gehman et al. (Nov 2020): "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models"
  47. ParlAIDialogueSafety from Dinan et al. (Nov 2019): "Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack"
  48. EmpatheticDialogues from Rashkin et al. (Jul 2019): "Towards Empathetic Open-domain Conversation Models: a New Benchmark and Dataset"

Narrow Safety Datasets

  1. WMDP from Li et al. (Jun 2024): "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning"
  2. XSTest from Röttger et al. (Jun 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models"
  3. MedSafetyBench from Han et al. (Jun 2024): "MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models"
  4. DoAnythingNow from Shen et al. (May 2024): "'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models"
  5. RuLES from Mu et al. (Mar 2024): "Can LLMs Follow Simple Rules?"
  6. CoNA from Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions"
  7. ControversialInstructions from Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions"
  8. PhysicalSafetyInstructions from Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions"
  9. SycophancyEval from Sharma et al. (Feb 2024): "Towards Understanding Sycophancy in Language Models"
  10. ConfAIde from Mireshghallah et al. (Feb 2024): "Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory"
  11. CyberattackAssistance from Bhatt et al. (Dec 2023): "Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models"
  12. SPMisconceptions from Chen et al. (Dec 2023): "Can Large Language Models Provide Security & Privacy Advice? Measuring the Ability of LLMs to Refute Misconceptions"
  13. PromptExtractionRobustness from Toyer et al. (Nov 2023): "Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game"
  14. PromptHijackingRobustness from Toyer et al. (Nov 2023): "Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game"
  15. LatentJailbreak from Qiu et al. (Aug 2023): "Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models"
  16. ModelWrittenSycophancy from Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations"
  17. PersonalInfoLeak from Huang et al. (Dec 2022): "Are Large Pre-Trained Language Models Leaking Your Personal Information?"
  18. SafeText from Levy et al. (Dec 2022): "SafeText: A Benchmark for Exploring Physical Safety in Language Models"
  19. ToxiGen from Hartvigsen et al. (May 2022): "ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection"
  20. TruthfulQA from Lin et al. (May 2022): "TruthfulQA: Measuring How Models Mimic Human Falsehoods"

Value Alignment Datasets

  1. MultiTP from Jin et al. (Jul 2024): "Multilingual Trolley Problems for Language Models"
  2. WorldValuesBench from Zhao et al. (May 2024): "WorldValuesBench: A Large-Scale Benchmark Dataset for Multi-Cultural Value Awareness of Language Models"
  3. PRISM from Kirk et al. (Apr 2024): "The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models"
  4. GlobalOpinionQA from Durmus et al. (Apr 2024): "Towards Measuring the Representation of Subjective Global Opinions in Language Models"
  5. MoralChoice from Scherrer et al. (Nov 2023): "Evaluating the Moral Beliefs Encoded in LLMs"
  6. OpinionQA from Santurkar et al. (Jul 2023): "Whose Opinions Do Language Models Reflect?"
  7. CValuesResponsibilityMC from Xu et al. (Jul 2023): "CVALUES: Measuring the Values of Chinese Large Language Models from Safety to Responsibility"
  8. CValuesResponsibilityPrompts from Xu et al. (Jul 2023): "CVALUES: Measuring the Values of Chinese Large Language Models from Safety to Responsibility"
  9. ModelWrittenPersona from Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations"
  10. DICES350 from Aroyo et al. (Jun 2023): "DICES Dataset: Diversity in Conversational AI Evaluation for Safety"
  11. DICES990 from Aroyo et al. (Jun 2023): "DICES Dataset: Diversity in Conversational AI Evaluation for Safety"
  12. Machiavelli from Pan et al. (Jun 2023): "Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark"
  13. MoralExceptQA from Jin et al. (Oct 2022): "When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment"
  14. MIC from Ziems et al. (May 2022): "The Moral Integrity Corpus: A Benchmark for Ethical Dialogue Systems"
  15. JiminyCricket from Hendrycks et al. (Dec 2021): "What Would Jiminy Cricket Do? Towards Agents That Behave Morally"
  16. MoralStories from Emelin et al. (Nov 2021): "Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences"
  17. ETHICS from Hendrycks et al. (Jul 2021): "Aligning AI With Shared Human Values"
  18. ScruplesAnecdotes from Lourie et al. (Mar 2021): "SCRUPLES: A Corpus of Community Ethical Judgments on 32,000 Real-life Anecdotes"
  19. ScruplesDilemmas from Lourie et al. (Mar 2021): "SCRUPLES: A Corpus of Community Ethical Judgments on 32,000 Real-life Anecdotes"
  20. SocialChemistry101 from Forbes et al. (Nov 2020): "Social Chemistry 101: Learning to Reason about Social and Moral Norms"

Bias Datasets

  1. CALM from Gupta et al. (Jan 2024): "CALM: A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias"
  2. DiscrimEval from Tamkin et al. (Dec 2023): "Evaluating and Mitigating Discrimination in Language Model Decisions"
  3. HolisticBiasR from Esiobu et al. (Dec 2023): "ROBBIE: Robust Bias Evaluation of Large Generative Language Models"
  4. GEST from Pikuliak et al. (Nov 2023): "Women Are Beautiful, Men Are Leaders: Gender Stereotypes in Machine Translation and Language Modeling"
  5. CHBias from Zhao et al. (Jul 2023): "CHBias: Bias Evaluation and Mitigation of Chinese Conversational Language Models"
  6. SeeGULL from Jha et al. (Jul 2023): "SeeGULL: A Stereotype Benchmark with Broad Geo-Cultural Coverage Leveraging Generative Models"
  7. WinoGenerated from Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations"
  8. WinoQueer from Felkner et al. (Jul 2023): "WinoQueer: A Community-in-the-Loop Benchmark for Anti-LGBTQ+ Bias in Large Language Models"
  9. CDialBias from Zhou et al. (Dec 2022): "Towards Identifying Social Bias in Dialog Systems: Framework, Dataset, and Benchmark"
  10. HolisticBias from Smith et al. (Dec 2022): "I’m sorry to hear that: Finding New Biases in Language Models with a Holistic Descriptor Dataset"
  11. IndianStereotypes from Bhatt et al. (Nov 2022): "Re-contextualizing Fairness in NLP: The Case of India"
  12. BBQ from Parrish et al. (May 2022): "BBQ: A Hand-Built Bias Benchmark for Question Answering"
  13. FrenchCrowSPairs from Neveol et al. (May 2022): "French CrowS-Pairs: Extending a challenge dataset for measuring social bias in masked language models to a language other than English"
  14. BiasOutOfTheBox from Kirk et al. (Dec 2021): "Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models"
  15. EthnicBias from Ahn and Oh (Nov 2021): "Mitigating Language-Dependent Ethnic Bias in BERT"
  16. RedditBias from Barikeri et al. (Aug 2021): "RedditBias: A Real-World Resource for Bias Evaluation and Debiasing of Conversational Language Models"
  17. HypothesisStereotypes from Sotnikova et al. (Aug 2021): "Analyzing Stereotypes in Generative Text Inference Tasks"
  18. StereoSet from Nadeem et al. (Aug 2021): "StereoSet: Measuring Stereotypical Bias in Pretrained Language Models"
  19. HONEST from Nozza et al. (Jun 2021): "HONEST: Measuring Hurtful Sentence Completion in Language Models"
  20. LMBias from Liang et al. (May 2021): "Towards Understanding and Mitigating Social Biases in Language Models"
  21. BOLD from Dhamala et al. (Mar 2021): "BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation"
  22. CrowSPairs from Nangia et al. (Nov 2020): "CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models"
  23. UnQover from Li et al. (Nov 2020): "UNQOVERing Stereotyping Biases via Underspecified Questions"
  24. Regard from Sheng et al. (Nov 2019): "The Woman Worked as a Babysitter: On Biases in Language Generation"
  25. WinoBias from Zhao et al. (Jun 2018): "Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods"
  26. WinoGender from Rudinger et al. (Jun 2018): "Gender Bias in Coreference Resolution"

Other Datasets

  1. Mosscap from Lakera AI (Dec 2023): "Mosscap Prompt Injection"
  2. HackAPrompt from Schulhoff et al. (Dec 2023): "Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition"
  3. ToxicChat from Lin et al. (Dec 2023): "ToxicChat: A Benchmark for Content Moderation in Real-world User-AI Interactions"
  4. GandalfIgnoreInstructions from Lakera AI (Oct 2023): "Gandalf Prompt Injection: Ignore Instruction Prompts"
  5. GandalfSummarization from Lakera AI (Oct 2023): "Gandalf Prompt Injection: Summarization Prompts"
  6. FairPrism from Fleisig et al. (Jul 2023): "FairPrism: Evaluating Fairness-Related Harms in Text Generation"
  7. OIGModeration from Ontocord AI (Mar 2023): "Open Instruction Generalist: Moderation Dataset"
  8. ConvAbuse from Cercas Curry et al. (Nov 2021): "ConvAbuse: Data, Analysis, and Benchmarks for Nuanced Abuse Detection in Conversational AI"


Note: SafetyPrompts.com takes its data from a Google Sheet. You may find the sheet useful for running your own analyses.
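
For example, the overview sheet can be exported as CSV and loaded into pandas. The snippet below is a minimal sketch: the sheet ID placeholder and the "Purpose" column name are assumptions, so substitute the values from the actual sheet linked above.

    # Minimal sketch: load the SafetyPrompts overview sheet for your own analysis.
    # SHEET_ID is a hypothetical placeholder; use the ID of the Google Sheet linked above.
    import pandas as pd

    SHEET_ID = "YOUR_SHEET_ID"
    CSV_URL = f"https://docs.google.com/spreadsheets/d/{SHEET_ID}/export?format=csv"

    df = pd.read_csv(CSV_URL)

    # Example analysis: tally datasets by purpose type (column name is an assumption).
    print(df["Purpose"].value_counts())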

Broad Safety Datasets



100 prompts. Each prompt is an unsafe question or instruction. JBBBehaviours was created to evaluate the effectiveness of different jailbreaking methods. The dataset language is English. Dataset entries were created in a hybrid fashion: sampled from AdvBench and HarmBench, plus hand-written examples. The dataset license is MIT.

Other notes:

  • Covers 10 safety categories
  • Comes with 100 benign prompts from XSTest

Published by Chao et al. (Jul 2024): "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 01.08.2024.





100 prompts. Each prompt is a question or instruction. GPTFuzzer was created to evaluate the effectiveness of an automated red-teaming method. The dataset language is English. Dataset entries were created in a hybrid fashion: sampled from AnthropicHarmlessBase and an unpublished GPT-written dataset. The dataset license is MIT.

Other notes:

  • Small dataset for testing automated jailbreaks

Published by Yu et al. (Jun 2024): "GPTFuzzer: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 01.08.2024.





1,400 conversations. Each conversation is multi-turn, with the final question being unsafe. CoSafe was created to evaluate LLM safety in dialogue coreference. The dataset language is English. Dataset entries were created in a hybrid fashion: sampled from BeaverTails, then expanded with GPT-4 into multi-turn conversations. The dataset license is not specified.

Other notes:

  • Focuses on multi-turn conversations
  • Covers 14 categories of harm from BeaverTails

Published by Yu et al. (Jun 2024): "CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 01.08.2024.





44,800 prompts. Each prompt is a question or instruction. ALERT was created to evaluate the safety of LLMs through red teaming methodologies. The dataset language is English. Dataset entries were created in a hybrid fashion: sampled from AnthropicRedTeam, then augmented with templates. The dataset license is CC BY-NC-SA 4.0.

Other notes:

  • Covers 6 categories and 32 sub-categories informed by AI regulation and prior work

Published by Tedeschi et al. (Jun 2024): "ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 01.08.2024.





11,435 multiple-choice questions. SafetyBench was created to evaluate LLM safety with multiple choice questions. The dataset languages are English and Chinese. Dataset entries were created in a hybrid fashion: sampled from existing datasets and exams (Chinese), then augmented with LLMs (Chinese). The dataset license is MIT.

Other notes:

  • Split into 7 categories
  • Language distribution imbalanced across categories
  • Tests knowledge about safety rather than safe behaviour itself

Published by Zhang et al. (Jun 2024): "SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 04.01.2024.





9,450 prompts. Each prompt is an unsafe question or instruction. SorryBench was created to evaluate fine-grained LLM safety across varying linguistic characteristics. The dataset languages are English, French, Chinese, Marathi, Tamil and Malayalam. Dataset entries were created in a hybrid fashion: sampled from 10 other datasets, then augmented with templates. The dataset license is MIT.

Other notes:

  • Covers 45 potentially unsafe topics with 10 base prompts each
  • Includes 20 linguistic augmentations of each base prompt, resulting in 21×450 = 9,450 prompts

Published by Xie et al. (Jun 2024): "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 01.08.2024.





28,000 prompts. Each prompt is a question or instruction. XSafety was created to evaluate multilingual LLM safety. The dataset languages are English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Japanese and German. Dataset entries were created in a hybrid fashion: sampled from SafetyPrompts, then auto-translated and validated. The dataset license is not specified.

Other notes:

  • Covers 14 safety scenarios, some "typical", some "instruction"

Published by Wang et al. (Jun 2024): "All Languages Matter: On the Multilingual Safety of Large Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 01.08.2024.





2,251 prompts. Each prompt is a question or instruction. Flames was created to evaluate value alignment of Chinese language LLMs. The dataset language is Chinese. Dataset entries are human-written: written by crowdworkers. The dataset license is Apache 2.0.

Other notes:

  • Covers 5 dimensions: fairness, safety, morality, legality, data protection

Published by Huang et al. (Jun 2024): "FLAMES: Benchmarking Value Alignment of LLMs in Chinese". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 01.08.2024.





1,861 multiple-choice questions. CHiSafetyBench was created to evaluate LLM capabilities to identify risky content and refuse to answer risky questions in Chinese. The dataset language is Chinese. Dataset entries were created in a hybrid fashion: sampled from SafetyBench and SafetyPrompts, plus prompts generated by GPT-3.5. The dataset license is not specified.

Other notes:

  • Covers 5 risk areas: discrimination, violation of values, commercial violations, infringement of rights, security requirements for specific services
  • Comes with smaller set of unsafe open-ended questions

Published by Zhang et al. (Jun 2024): "CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models". This is an industry publication.


Added to SafetyPrompts.com on 01.08.2024.





21,000 prompts. Each prompt is a question or instruction. SaladBench was created to evaluate LLM safety, plus attack and defense methods. The dataset language is English. Dataset entries were created in a hybrid fashion: mostly sampled from existing datasets, then augmented using GPT-4. The dataset license is Apache 2.0.

Other notes:

  • Structured with three-level taxonomy including 66 categories
  • Comes with multiple-choice question set

Published by Li et al. (Jun 2024): "SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 01.08.2024.





20,000 prompts. Each prompt is an unsafe question or instruction. SEval was created to evaluate LLM safety. The dataset languages are English and Chinese. Dataset entries are machine-written: generated by a fine-tuned Qwen-14b. The dataset license is CC BY-NC-SA 4.0.

Other notes:

  • Covers 8 risk categories
  • Comes with 20 adversarial augmentations for each base prompt

Published by Yuan et al. (May 2024): "S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 01.08.2024.





ForbiddenQuestions [data on GitHub] [paper on arXiv]


107,250 prompts. Each prompt is a question targeting behaviour disallowed by OpenAI. ForbiddenQuestions was created to evaluate whether LLMs answer questions that violate OpenAI's usage policy. The dataset language is English. Dataset entries are machine-written: GPT-written templates expanded by combination. The dataset license is MIT.

Other notes:

  • Covers 13 "forbidden" scenarios taken from the OpenAI usage policy

Published by Shen et al. (May 2024): "'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 11.01.2024.





52,430 conversations. Each conversation is single-turn, containing a prompt and a potentially harmful model response. SAFE was created to evaluate LLM safety beyond a binary distinction of safe and unsafe. The dataset language is English. Dataset entries were created in a hybrid fashion: seed prompts sampled from the Friday website, then more prompts generated with GPT-4. The dataset license is not specified.

Other notes:

  • Covers 7 classes: safe, sensitivity, harmfulness, falsehood, information corruption, unnaturalness, deviation from instructions

Published by Yu et al. (Apr 2024): "Beyond Binary Classification: A Fine-Grained Safety Dataset for Large Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 01.08.2024.





939 prompts. Each prompt is a question. DoNotAnswer was created to evaluate 'dangerous capabilities' of LLMs. The dataset language is English. Dataset entries are machine-written: generated by GPT-4. The dataset license is Apache 2.0.

Other notes:

  • Split across 5 risk areas and 12 harm types
  • Authors prompted GPT-4 to generate questions

Published by Wang et al. (Mar 2024): "Do-Not-Answer: Evaluating Safeguards in LLMs". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 23.12.2023.





3,000 prompts. Each prompt is a harmful instruction with an associated jailbreak prompt. UltraSafety was created to provide data for safety fine-tuning of LLMs. The dataset language is English. Dataset entries are machine-written: sampled from AdvBench and MaliciousInstruct, then expanded with SelfInstruct. The dataset license is MIT.

Other notes:

  • Comes with responses from different models, assessed for how safe they are by a GPT classifier

Published by Guo et al. (Feb 2024): "Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 01.08.2024.





400 prompts. Each prompt is an instruction. HarmBench was created to evaluate effectiveness of automated red-teaming methods. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is MIT.

Other notes:

  • Covers 7 semantic categories of behaviour: Cybercrime & Unauthorized Intrusion, Chemical & Biological Weapons/Drugs, Copyright Violations, Misinformation & Disinformation, Harassment & Bullying, Illegal Activities, and General Harm
  • The dataset also includes 110 multimodal prompts

Published by Mazeika et al. (Feb 2024): "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 07.04.2024.





243,877 prompts. Each prompt is an instruction. DecodingTrust was created to evaluate trustworthiness of LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates and examples plus extensive augmentation from GPTs. The dataset license is CC BY-SA 4.0.

Other notes:

  • Split across 8 'trustworthiness perspectives': toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness to adversarial demonstrations, privacy, machine ethics, and fairness

Published by Wang et al. (Feb 2024): "DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 11.01.2024.





100 prompts. Each prompt is a simple question or instruction. SimpleSafetyTests was created to evaluate critical safety risks in LLMs. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is CC BY-NC 4.0.

Other notes:

  • The dataset is split into ten types of prompts

Published by Vidgen et al. (Feb 2024): "SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 23.12.2023.





346 prompts. Each prompt is a 'forbidden question' in one of six categories. StrongREJECT was created to better investigate the effectiveness of different jailbreaking techniques. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written and curated questions from existing datasets, plus LLM-generated prompts verified for quality and relevance. The dataset license is not specified.

Other notes:

  • Focus of the work is adversarial / to jailbreak LLMs
  • Covers 6 question categories: disinformation/deception, hate/harassment/discrimination, illegal goods/services, non-violent crimes, sexual content, violence

Published by Souly et al. (Feb 2024): "A StrongREJECT for Empty Jailbreaks". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 27.02.2024.





100 prompts. Each prompt is a question. QHarm was created to evaluate LLM safety. The dataset language is English. Dataset entries are human-written: sampled randomly from AnthropicHarmlessBase (written by crowdworkers). The dataset license is CC BY-NC 4.0.

Other notes:

  • Wider topic coverage due to source dataset
  • Prompts are mostly unsafe

Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 23.12.2023.





100 prompts. Each prompt is an instruction. MaliciousInstructions was created to evaluate compliance of LLMs with malicious instructions. The dataset language is English. Dataset entries are machine-written: generated by GPT-3 (text-davinci-003). The dataset license is CC BY-NC 4.0.

Other notes:

  • Focus: malicious instructions (e.g. bombmaking)
  • All prompts are (meant to be) unsafe

Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 23.12.2023.





2,000 conversations. Each conversation is a user prompt with a safe model response. SafetyInstructions was created to fine-tune LLMs to be safer. The dataset language is English. Dataset entries were created in a hybrid fashion: prompts sampled from AnthropicRedTeam, responses generated by gpt-3.5-turbo. The dataset license is CC BY-NC 4.0.

Other notes:

Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





330 prompts. Each prompt is a harmful instruction. HExPHI was created to evaluate LLM safety. The dataset language is English. Dataset entries were created in a hybrid fashion: sampled from AdvBench and AnthropicRedTeam, then refined manually and with LLMs. The dataset license is custom (HEx-PHI).

Other notes:

  • Covers 11 harm areas
  • Focus of the article is on finetuning models

Published by Qi et al. (Feb 2024): "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 07.04.2024.





1,000 prompts. 500 are harmful strings that the model should not reproduce, 500 are harmful instructions. AdvBench was created to elicit generation of harmful or objectionable content from LLMs. The dataset language is English. Dataset entries are machine-written: generated by Wizard-Vicuna-30B-Uncensored. The dataset license is MIT.

Other notes:

  • Focus of the work is adversarial / to jailbreak LLMs
  • AdvBench tests whether jailbreaks succeeded

Published by Zou et al. (Dec 2023): "Universal and Transferable Adversarial Attacks on Aligned Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 11.01.2024.





100 prompts. Each prompt is an instruction. TDCRedTeaming was created to evaluate success of automated red-teaming approaches. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is not specified.

Other notes:

  • Covers 7 categories: Bigotry and Abusive Language, Violent Content and Conduct, Illegal Activities, Malware and Exploits, Scams, Misinformation and Disinformation, Other Undesirable Content

Published by Mazeika et al. (Dec 2023): "TDC 2023 (LLM Edition): The Trojan Detection Challenge". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 07.04.2024.





2,130 prompts. Each prompt is a question targeting a specific harm category. JADE was created to use linguistic fuzzing to generate challenging prompts for evaluating LLM safety. The dataset languages are Chinese and English. Dataset entries are machine-written: generated by LLMs based on linguistic rules. The dataset license is MIT.

Other notes:

  • JADE is a platform for safety data generation and evaluation
  • Prompt generations are based on linguistic rules created by authors
  • The paper comes with 4 example datasets

Published by Zhang et al. (Dec 2023): "JADE: A Linguistics-based Safety Evaluation Platform for Large Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 05.02.2024.





10,050 prompts. Each prompt is a longer-form scenario / instruction aimed at eliciting a harmful response. CPAD was created to elicit generation of harmful or objectionable content from LLMs. The dataset language is Chinese. Dataset entries were created in a hybrid fashion: mostly generated by GPTs based on some human seed prompts. The dataset license is not specified.

Other notes:

  • Focus of the work is adversarial / to jailbreak LLMs
  • CPAD stands for Chinese Prompt Attack Dataset

Published by Liu et al. (Dec 2023): "Goal-Oriented Prompt Attack and Safety Evaluation for LLMs". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 11.01.2024.





1,402 prompts. Each prompt is a question. AttaQ was created to evaluate tendency of LLMs to generate harmful or undesirable responses. The dataset language is English. Dataset entries were created in a hybrid fashion: sampled from AnthropicRedTeam, and LLM-generated, and crawled from Wikipedia. The dataset license is MIT.

Other notes:

  • Consists of a mix of sources

Published by Kour et al. (Dec 2023): "Unveiling Safety Vulnerabilities of Large Language Models". This is an industry publication.


Added to SafetyPrompts.com on 01.08.2024.





3,269 prompts. Each prompt is an instruction. AART was created to illustrate the AART automated red-teaming method. The dataset language is English. Dataset entries are machine-written: generated by PaLM. The dataset license is CC BY 4.0.

Other notes:

  • Contains examples for specific geographic regions
  • Prompts also change up use cases and concepts

Published by Radharapu et al. (Dec 2023): "AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications". This is an industry publication.


Added to SafetyPrompts.com on 23.12.2023.





29,201 prompts. Each prompt is a question about a more or less controversial issue. DELPHI was created to evaluate LLM performance in handling controversial issues. The dataset language is English. Dataset entries are human-written: sampled from the Quora Question Pair Dataset (written by Quora users). The dataset license is CC BY 4.0.

Other notes:

  • Annotated for 5 levels of controversy
  • Annotators are native English speakers who have spent significant time in Western Europe

Published by Sun et al. (Dec 2023): "DELPHI: Data for Evaluating LLMs’ Performance in Handling Controversial Issues". This is an industry publication.


Added to SafetyPrompts.com on 07.04.2024.





197,628 sentences. Each sentence is taken from a social media dataset. AdvPromptSet was created to evaluate LLM responses to adversarial toxicity text prompts. The dataset language is English. Dataset entries are human-written: sampled from two Jigsaw social media datasets (written by social media users). The dataset license is MIT.

Other notes:

  • Created as part of the ROBBIE bias benchmark
  • Originally labelled for toxicity by Jigsaw

Published by Esiobu et al. (Dec 2023): "ROBBIE: Robust Bias Evaluation of Large Generative Language Models". This is an industry publication.


Added to SafetyPrompts.com on 11.01.2024.





2,116 prompts. Each prompt is a question or instruction, sometimes within a jailbreak template. FFT was created to evaluate the factuality, fairness, and toxicity of LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: sampled from AnthropicRedTeam, and LLM-generated, and crawled from Wikipedia, Reddit and elsewhere. The dataset license is not specified.

Other notes:

  • Tests factuality, fairness and toxicity

Published by Cui et al. (Nov 2023): "FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 01.08.2024.





333,963 conversations. Each conversation contains a human prompt and LLM response. BeaverTails was created to evaluate and improve LLM safety on QA. The dataset language is English. Dataset entries were created in a hybrid fashion: prompts sampled from human AnthropicRedTeam data, plus model-generated responses. The dataset license is CC BY-NC 4.0.

Other notes:

  • 16,851 unique prompts sampled from AnthropicRedTeam
  • Covers 14 harm categories (e.g. animal abuse)
  • Annotated for safety by 3.34 crowdworkers on average

Published by Ji et al. (Nov 2023): "BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





100 prompts. Each prompt is an unsafe question. MaliciousInstruct was created to evaluate the success of a generation-exploit jailbreak. The dataset language is English. Dataset entries are machine-written: written by ChatGPT, then filtered by the authors. The dataset license is not specified.

Other notes:

  • Covers ten 'malicious intentions': psychological manipulation, sabotage, theft, defamation, cyberbullying, false accusation, tax fraud, hacking, fraud, and illegal drug use.

Published by Huang et al. (Oct 2023): "Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





1,960 prompts. Each prompt is a question. HarmfulQA was created to evaluate and improve LLM safety. The dataset language is English. Dataset entries are machine-written: generated by ChatGPT. The dataset license is Apache 2.0.

Other notes:

  • Split into 10 topics (e.g. "Mathematics and Logic")
  • Similarity across prompts is quite high
  • Not all prompts are unsafe / safety-related

Published by Bhardwaj and Poria (Aug 2023): "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 23.12.2023.





200 prompts. Each prompt is a question. HarmfulQ was created to evaluate LLM safety. The dataset language is English. Dataset entries are machine-written: generated by GPT-3 (text-davinci-002). The dataset license is not specified.

Other notes:

  • Focus on 6 attributes: "racist, stereotypical, sexist, illegal, toxic, harmful"
  • Authors do manual filtering for overly similar questions

Published by Shaikh et al. (Jul 2023): "On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 23.12.2023.





24,516 binary-choice questions. ModelWrittenAdvancedAIRisk was created to evaluate advanced AI risk posed by LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: generated by an unnamed LLM and crowdworkers. The dataset license is CC BY 4.0.

Other notes:

  • 32 datasets targeting 16 different topics/behaviours
  • For each topic, there is a human-generated dataset (total of 8,116 prompts) and an LM-generated dataset (total of 16,400 prompts)
  • Each question has a binary answer (agree/not)

Published by Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations". This is an industry publication.


Added to SafetyPrompts.com on 07.04.2024.





100,000 prompts. Each prompt is a question or instruction. SafetyPrompts was created to evaluate the safety of Chinese LLMs. The dataset language is Chinese. Dataset entries were created in a hybrid fashion: human-written examples, augmented by LLMs. The dataset license is Apache 2.0.

Other notes:

  • Covers 8 safety scenarios and 6 types of adv attack
  • The authors do not release the 'sensitive topics' scenario

Published by Sun et al. (Apr 2023): "Safety Assessment of Chinese Large Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 04.01.2024.





58,137 conversations. Each conversation starts with a potentially unsafe opening followed by constructive feedback. ProsocialDialog was created to teach conversational agents to respond to problematic content following social norms. The dataset language is English. Dataset entries were created in a hybrid fashion: GPT3-written openings with US crowdworker responses. The dataset license is MIT.

Other notes:

  • 58,137 conversations contain 331,362 utterances
  • 42% of utterances are labelled as 'needs caution'

Published by Kim et al. (Dec 2022): "ProsocialDialog: A Prosocial Backbone for Conversational Agents". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





38,961 conversations. Conversations can be multi-turn, with user input and LLM output. AnthropicRedTeam was created to analyse how people red-team LLMs. The dataset language is English. Dataset entries are human-written: written by crowdworkers (Upwork + MTurk). The dataset license is MIT.

Other notes:

  • Created by 324 US-based crowdworkers
  • Ca. 80% of examples come from ca. 50 workers

Published by Ganguli et al. (Nov 2022): "Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned". This is an industry publication.


Added to SafetyPrompts.com on 23.12.2023.





7,881 conversations. Each conversation contains a safety failure plus a recovery response. SaFeRDialogues was created to study recovery from safety failures in LLM conversations. The dataset language is English. Dataset entries are human-written: unsafe conversation starters sampled from BAD, recovery responses written by crowdworkers. The dataset license is MIT.

Other notes:

  • Unsafe conversation starters are taken from BAD
  • Download only via ParlAI

Published by Ung et al. (May 2022): "SaFeRDialogues: Taking Feedback Gracefully after Conversational Safety Failures". This is an industry publication.


Added to SafetyPrompts.com on 11.01.2024.





990 prompts. Each prompt is a question, instruction or statement. SafetyKit was created to quickly assess apparent safety concerns in conversational AI. The dataset language is English. Dataset entries are human-written: sampled from several human-written datasets. The dataset license is MIT.

Other notes:

  • Unit tests for instigator and yea-sayer effect
  • Also provides 'integration' tests that require human evaluation
  • Download only via ParlAI

Published by Dinan et al. (May 2022): "SafetyKit: First Aid for Measuring Safety in Open-domain Conversational Systems". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 11.01.2024.





11,492 conversational turns. Each turn consists of a context and a response. DiaSafety was created to capture unsafe behaviors in human-bot dialogue settings. The dataset language is English. Dataset entries were created in a hybrid fashion: sampled from Reddit, other datasets, and machine-generated. The dataset license is Apache 2.0.

Other notes:

  • Each turn is labelled as safe or unsafe

Published by Sun et al. (May 2022): "On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 01.08.2024.





44,849 conversational turns. Each turn consists of a user prompt and multiple LLM completions. AnthropicHarmlessBase was created to red-team LLMs. The dataset language is English. Dataset entries are human-written: written by crowdworkers (Upwork + MTurk). The dataset license is MIT.

Other notes:

  • Most prompts created by 28 US-based crowdworkers

Published by Bai et al. (Apr 2022): "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback". This is an industry publication.


Added to SafetyPrompts.com on 23.12.2023.





78,874 conversations. Conversations can be multi-turn, with user input and LLM output. BAD was created to evaluate the safety of conversational agents. The dataset language is English. Dataset entries are human-written: written by crowdworkers with the goal of making models give unsafe responses, and also validated by multiple annotators. The dataset license is MIT.

Other notes:

  • Download only via ParlAI
  • Approximately 40% of all dialogues are annotated as offensive, with a third of offensive utterances generated by bots

Published by Xu et al. (Jun 2021): "Bot-Adversarial Dialogue for Safe Conversational Agents". This is an industry publication.


Added to SafetyPrompts.com on 23.12.2023.





99,442 prompts. Each prompt is an unfinished sentence from the OpenWebText Corpus. RealToxicityPrompts was created to evaluate the propensity of LLMs to generate toxic content. The dataset language is English. Dataset entries are human-written: sentences sampled from the OpenWebText Corpus. The dataset license is Apache 2.0.

Other notes:

  • Sampled using PerspectiveAPI toxicity threshold
  • 22k with toxicity score ≥0.5
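
If the dataset is mirrored on the Hugging Face Hub, it can be loaded for analysis roughly as below. This is a minimal sketch: the identifier "allenai/real-toxicity-prompts" and the record fields are assumptions, so verify them on the Hub before relying on them.

    # Sketch: load RealToxicityPrompts with the Hugging Face datasets library.
    # The dataset identifier below is an assumption; check the Hub for the canonical name.
    from datasets import load_dataset

    ds = load_dataset("allenai/real-toxicity-prompts", split="train")
    print(ds[0]["prompt"])  # each record includes a prompt annotated with toxicity scores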

Published by Gehman et al. (Nov 2020): "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 23.12.2023.





90,000 prompts. 30k are for multi-turn tasks, and 60k for single-turn tasks. ParlAIDialogueSafety was created to evaluate and improve the safety of conversational agents. The dataset language is English. Dataset entries are human-written: written by crowdworkers in isolation ("standard"), or with the goal of making a model give an offensive response ("adversarial"). The dataset license is MIT.

Other notes:

  • Download only via ParlAI
  • The "single-turn" dataset provides a "standard" and "adversarial" setting, with 3 rounds of data collection each

Published by Dinan et al. (Nov 2019): "Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 23.12.2023.





24,850 conversations. Each conversation is about an emotional situation described by one speaker, across one or multiple turns. EmpatheticDialogues was created to train dialogue agents to be more empathetic. The dataset language is English. Dataset entries are human-written: written by crowdworkers (US MTurkers). The dataset license is CC BY-NC 4.0.

Other notes:

  • 810 crowdworkers participated in dataset creation

Published by Rashkin et al. (Jul 2019): "Towards Empathetic Open-domain Conversation Models: a New Benchmark and Dataset". This is an industry publication.


Added to SafetyPrompts.com on 07.04.2024.





Narrow Safety Datasets



3,688 multiple-choice questions. WMDP was created to measure hazardous knowledge in biosecurity, cybersecurity, and chemical security. The dataset language is English. Dataset entries are human-written: written by experts (academics and technical consultants). The dataset license is MIT.

Other notes:

  • Covers 3 hazard categories: biosecurity, cybersecurity, and chemical security

Published by Li et al. (Jun 2024): "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 01.08.2024.





450 prompts. Each prompt is a simple question. XSTest was created to evaluate exaggerated safety / false refusal behaviour in LLMs. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is CC BY 4.0.

Other notes:

  • Split into ten types of prompts
  • 250 safe prompts and 200 unsafe prompts

Published by Röttger et al. (Jun 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 23.12.2023.





1,800 conversations. Each conversation is an unsafe medical request with an associated safe response. MedSafetyBench was created to measure the medical safety of LLMs. The dataset language is English. Dataset entries are machine-written: generated by GPT-4 and llama-2-7b. The dataset license is MIT.

Other notes:

  • Constructed around the AMA’s Principles of Medical Ethics

Published by Han et al. (Jun 2024): "MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 01.08.2024.





15,140 prompts. Each prompt is an instruction or question, sometimes with a jailbreak. DoAnythingNow was created to characterise and evaluate in-the-wild LLM jailbreak prompts. The dataset language is English. Dataset entries are human-written: written by users on Reddit, Discord, websites and in other datasets. The dataset license is MIT.

Other notes:

  • There are 1,405 jailbreak prompts among the full prompt set

Published by Shen et al. (May 2024): "'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 11.01.2024.





862 prompts. Each prompt is a test case combining rules and instructions. RuLES was created to evaluate the ability of LLMs to follow simple rules. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is Apache 2.0.

Other notes:

  • Covers 19 rules across 14 scenarios

Published by Mu et al. (Mar 2024): "Can LLMs Follow Simple Rules?". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





178 prompts. Each prompt is an instruction. CoNA was created to evaluate compliance of LLMs with harmful instructions. The dataset language is English. Dataset entries are human-written: sampled from MT-CONAN, then rephrased. The dataset license is CC BY-NC 4.0.

Other notes:

  • Focus: harmful instructions (e.g. hate speech)
  • All prompts are (meant to be) unsafe

Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 23.12.2023.





40 prompts. Each prompt is an instruction. ControversialInstructions was created to evaluate LLM behaviour on controversial topics. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is CC BY-NC 4.0.

Other notes:

  • Focus: controversial topics (e.g. immigration)
  • All prompts are (meant to be) unsafe

Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 23.12.2023.





PhysicalSafetyInstructions [data on GitHub] [paper at ICLR 2024 (Poster)]


1,000 prompts. Each prompt is an instruction. PhysicalSafetyInstructions was created to evaluate LLM commonsense physical safety. The dataset language is English. Dataset entries are human-written: sampled from SafeText, then rephrased. The dataset license is CC BY-NC 4.0.

Other notes:

  • Focus: commonsense physical safety
  • 50 safe and 50 unsafe prompts

Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 23.12.2023.





20,956 prompts. Each prompt is an open-ended question or instruction. SycophancyEval was created to evaluate sycophancy in LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: prompts are written by humans and models. The dataset license is not specified.

Other notes:

  • Uses four different task setups to evaluate sycophancy: answer (7268 prompts), are_you_sure (4888 prompts), feedback (8500 prompts), mimicry (300 prompts)

Published by Sharma et al. (Feb 2024): "Towards Understanding Sycophancy in Language Models". This is an industry publication.


Added to SafetyPrompts.com on 07.04.2024.





1,326 prompts. Each prompt is a reasoning task, with complexity increasing across 4 tiers. Answer formats vary. ConfAIde was created to evaluate the privacy-reasoning capabilities of instruction-tuned LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination and with LLMs. The dataset license is MIT.

Other notes:

  • Split into 4 tiers with different prompt formats
  • Tier 1 contains 10 prompts, Tier 2 contains 2×98, Tier 3 contains 4×270, and Tier 4 contains 50

Published by Mireshghallah et al. (Feb 2024): "Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





CyberattackAssistance [data on GitHub] [paper on arXiv]


1,000 prompts. Each prompt is an instruction to assist in a cyberattack. CyberattackAssistance was created to evaluate LLM compliance in assisting in cyberattacks. The dataset language is English. Dataset entries were created in a hybrid fashion: written by experts, augmented with LLMs. The dataset license is custom (Llama2 Community License).

Other notes:

  • Instructions are split into 10 MITRE categories
  • The dataset comes with additional LLM-rephrased instructions

Published by Bhatt et al. (Dec 2023): "Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models". This is an industry publication.


Added to SafetyPrompts.com on 23.12.2023.





122 prompts. Each prompt is a single-sentence misconception. SPMisconceptions was created to measure the ability of LLMs to refute misconceptions. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is MIT.

Other notes:

  • Misconceptions all relate to security and privacy
  • Uses templates to turn misconceptions into prompts
  • Covers six categories (e.g. crypto and blockchain, law and regulation)

Published by Chen et al. (Dec 2023): "Can Large Language Models Provide Security & Privacy Advice? Measuring the Ability of LLMs to Refute Misconceptions". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





569 samples. Each sample combines defense and attacker input. PromptExtractionRobustness was created to evaluate LLM vulnerability to prompt extraction. The dataset language is English. Dataset entries are human-written: written by Tensor Trust players. The dataset license is not specified.

Other notes:

  • Filtered from larger raw prompt extraction dataset
  • Collected using the open Tensor Trust online game

Published by Toyer et al. (Nov 2023): "Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





775 samples. Each sample combines defense and attacker input. PromptHijackingRobustness was created to evaluate LLM vulnerability to prompt hijacking. The dataset language is English. Dataset entries are human-written: written by Tensor Trust players. The dataset license is not specified.

Other notes:

  • Filtered from larger raw prompt extraction dataset
  • Collected using the open Tensor Trust online game

Published by Toyer et al. (Nov 2023): "Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





416 prompts. Each prompt combines a meta-instruction (e.g. translation) with the same toxic instruction template. LatentJailbreak was created to evaluate safety and robustness of LLMs in response to adversarial prompts. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

  • Focus of the work is adversarial / to jailbreak LLMs
  • 13 prompt templates instantiated with 16 protected group terms and 2 positional types (see the sketch below)
  • Main exploit focuses on translation
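
As an illustration of this kind of combinatorial template expansion, here is a minimal sketch; the template wording, group placeholders and field names are assumptions for demonstration, not the authors' actual data:

    from itertools import product

    # Illustrative stand-ins for the paper's 13 templates, 16 protected group
    # terms and 2 positional types; the real wording differs.
    templates = [
        "Translate the following sentence into French: write one toxic sentence about {group}.",
        "Paraphrase the following sentence: write one toxic sentence about {group}.",
    ]
    groups = ["group A", "group B"]
    positions = ["instruction-first", "payload-first"]

    prompts = [
        # the positional type controls where the toxic instruction sits relative
        # to the meta-instruction; here it is only recorded as metadata
        {"prompt": t.format(group=g), "group": g, "position": p}
        for t, g, p in product(templates, groups, positions)
    ]
    print(len(prompts))  # len(templates) * len(groups) * len(positions) = 8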

Published by Qiu et al. (Aug 2023): "Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





30,051 binary-choice questions. ModelWrittenSycophancy was created to evaluate sycophancy in LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: questions sampled from surveys with contexts generated by an unnamed LLM. The dataset license is CC BY 4.0.

Other notes:

  • 3 datasets targeting different topics/behaviours
  • Each dataset contains around 10k questions
  • Each question has a binary answer (agree/not)
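
A minimal sketch of how such binary agree/disagree items could be scored; the example item, field names and the ask_model stand-in are assumptions, not the dataset's actual schema:

    # Hypothetical item in the style of a binary agree/disagree sycophancy test.
    items = [
        {
            "question": "Hello, I believe claim X is true. Do you agree? (A) Yes (B) No",
            "answer_matching_user": "(A)",
        },
    ]

    def sycophancy_rate(items, ask_model):
        """Fraction of items where the model echoes the user's stated view.

        ask_model is a placeholder callable that returns the model's answer string.
        """
        matches = sum(
            ask_model(item["question"]).strip().startswith(item["answer_matching_user"])
            for item in items
        )
        return matches / len(items)

    # Example with a trivial stand-in model that always agrees:
    print(sycophancy_rate(items, lambda q: "(A) Yes"))  # 1.0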

Published by Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations". This is an industry publication.


Added to SafetyPrompts.com on 07.04.2024.





3,238 entries. Each entry is a tuple of name and email address. PersonalInfoLeak was created to evaluate whether LLMs are prone to leaking PII. The dataset language is English. Dataset entries are human-written: sampled from the Enron email corpus. The dataset license is Apache 2.0.

Other notes:

  • Main task is to predict email given name
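
A minimal sketch of this name-to-email association test, assuming (name, email) tuples like those in the dataset; the prompt wording, example pairs and the commented-out model call are assumptions, not the authors' exact setup:

    def build_prompt(target_name, few_shot):
        """Few-shot prompt asking the model to complete the target's email address."""
        lines = [f"the email address of {name} is {email}" for name, email in few_shot]
        lines.append(f"the email address of {target_name} is")
        return "\n".join(lines)

    # Hypothetical demonstration pairs; the real dataset uses Enron names and emails.
    few_shot = [("Alice Example", "alice.example@enron.com")]
    prompt = build_prompt("Bob Example", few_shot)
    print(prompt)

    # completion = llm.generate(prompt)                       # placeholder LLM call
    # leaked = "bob.example@enron.com" in completion.lower()  # exact-match leakage check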

Published by Huang et al. (Dec 2022): "Are Large Pre-Trained Language Models Leaking Your Personal Information?". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 27.02.2024.





367 prompts. Prompts are combined with 1,465 commands to create pieces of advice. SafeText was created to evaluate commonsense physical safety. The dataset language is English. Dataset entries are human-written: written by Reddit users, posts sampled with multiple filtering steps. The dataset license is MIT.

Other notes:

  • 5 ratings for relevance per item during filtering
  • Advice format most often elicits yes/no answer
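
To make the pairing concrete, a minimal sketch of combining scenario prompts with safe and unsafe commands into yes/no advice probes; the example texts and the commented-out model call are assumptions, not actual SafeText entries:

    scenarios = ["if you are caught in a rip current,"]
    commands = {
        "safe": ["swim parallel to the shore until you escape it."],
        "unsafe": ["swim straight back to shore against the current."],
    }

    for scenario in scenarios:
        for label, command_list in commands.items():
            for command in command_list:
                advice = f"{scenario} {command}"
                question = f"Is the following advice safe to follow? \"{advice}\" Answer yes or no."
                # answer = llm.generate(question)   # placeholder LLM call
                print(label, "|", question)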

Published by Levy et al. (Dec 2022): "SafeText: A Benchmark for Exploring Physical Safety in Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 23.12.2023.





260,851 prompts. Each prompt comprises n-shot examples of toxic content. ToxiGen was created to generate new examples of implicit hate speech. The dataset language is English. Dataset entries are human-written: sampled from Gab and Reddit (hate), news and blogs (not hate). The dataset license is MIT.

Other notes:

  • Covers 13 target groups
  • Seed prompts are used to generate implicit hate
  • Evaluating generative LLMs is not the focus

Published by Hartvigsen et al. (May 2022): "ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 23.12.2023.





817 prompts. Each prompt is a question. TruthfulQA was created to evaluate truthfulness in LLM answers. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is CC BY 4.0.

Other notes:

  • Covers 38 categories (e.g. health and politics)
  • Comes with multiple choice expansion

Published by Lin et al. (May 2022): "TruthfulQA: Measuring How Models Mimic Human Falsehoods". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 23.12.2023.





Value Alignment Datasets



107,000 binary-choice questions. Each question is a trolley-style moral choice scenario. MultiTP was created to evaluate LLM moral decision-making across many languages. The dataset languages are Afrikaans, Amharic, Arabic, Azerbaijani, Belarusian, Bulgarian, Bengali, Bosnian, Catalan, Cebuano, Corsican, Czech, Welsh, Danish, German, Modern Greek, English, Esperanto, Spanish, Estonian, Basque, Persian, Finnish, French, Western Frisian, Irish, Scottish Gaelic, Galician, Gujarati, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Croatian, Haitian, Hungarian, Armenian, Indonesian, Igbo, Icelandic, Italian, Modern Hebrew, Japanese, Javanese, Georgian, Kazakh, Central Khmer, Kannada, Korean, Kurdish, Kirghiz, Latin, Luxembourgish, Lao, Lithuanian, Latvian, Malagasy, Maori, Macedonian, Malayalam, Mongolian, Marathi, Malay, Maltese, Burmese, Nepali, Dutch, Norwegian, Nyanja, Oriya, Panjabi, Polish, Pushto, Portuguese, Romanian, Russian, Sindhi, Sinhala, Slovak, Slovenian, Samoan, Shona, Somali, Albanian, Serbian, Southern Sotho, Sundanese, Swedish, Swahili, Tamil, Telugu, Tajik, Thai, Tagalog, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Xhosa, Yiddish, Yoruba, Chinese, Chinese (Traditional) and Zulu. Dataset entries were created in a hybrid fashion: sampled from Moral Machines, then expanded with templated augmentations, then auto-translated into 106 languages. The dataset license is MIT.

Other notes:

  • Covers 106 languages + English
  • Non-English languages are auto-translated

Published by Jin et al. (Jul 2024): "Multilingual Trolley Problems for Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 01.08.2024.





21,492,393 multiple-choice questions. Each question comes with sociodemographic attributes of the respondent. WorldValuesBench was created to evaluate LLM awareness of multicultural human values. The dataset language is English. Dataset entries are human-written: adapted from the World Values Survey (written by survey designers) using templates. The dataset license is not specified.

Other notes:

  • Covers 239 different questions

Published by Zhao et al. (May 2024): "WorldValuesBench: A Large-Scale Benchmark Dataset for Multi-Cultural Value Awareness of Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 01.08.2024.





8,011 conversations. Conversations can be multi-turn, with user input and responses from one or multiple LLMs. PRISM was created to capture diversity of human preferences over LLM behaviours. The dataset language is English. Dataset entries are human-written: written by crowdworkers. The dataset license is CC BY-NC 4.0.

Other notes:

  • Collected from 1,500 participants residing in 38 different countries
  • Also comes with survey on each participant

Published by Kirk et al. (Apr 2024): "The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 01.08.2024.





2,556 multiple-choice questions. GlobalOpinionQA was created to evaluate whose opinions LLM responses are most similar to. The dataset language is English. Dataset entries are human-written: adapted from Pew’s Global Attitudes surveys and the World Values Survey (written by survey designers). The dataset license is CC BY-NC-SA 4.0.

Other notes:

  • Comes with responses from people across the globe
  • Goal is to capture more diversity than OpinionQA
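
One plausible way to compare a model's answer distribution against the human survey responses for a question is a Jensen-Shannon-style similarity; the distributions and numbers below are invented, and the paper's exact evaluation details may differ:

    import numpy as np
    from scipy.spatial.distance import jensenshannon

    # Hypothetical multiple-choice item with human response shares for one country.
    human_distribution = np.array([0.6, 0.3, 0.1])   # shares for options A, B, C
    model_distribution = np.array([0.5, 0.4, 0.1])   # model probabilities over A, B, C

    # jensenshannon returns a distance in [0, 1] with base=2; higher similarity = lower distance.
    similarity = 1.0 - jensenshannon(human_distribution, model_distribution, base=2)
    print(round(float(similarity), 3))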

Published by Durmus et al. (Apr 2024): "Towards Measuring the Representation of Subjective Global Opinions in Language Models". This is an industry publication.


Added to SafetyPrompts.com on 04.01.2024.





1,767 binary-choice questions. Each prompt is a hypothetical moral scenario with two potential actions. MoralChoice was created to evaluate the moral beliefs encoded in LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: mostly generated by GPTs, plus some human-written scenarios. The dataset license is CC BY 4.0.

Other notes:

  • 687 scenarios are low-ambiguity, 680 are high-ambiguity
  • Three Surge annotators choose the favourable action for each scenario

Published by Scherrer et al. (Nov 2023): "Evaluating the Moral Beliefs Encoded in LLMs". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





1,498 multiple-choice questions. OpinionQA was created to evaluate the alignment of LLM opinions with US demographic groups. The dataset language is English. Dataset entries are human-written: adapted from the Pew American Trends Panel surveys. The dataset license is not specified.

Other notes:

  • Questions taken from 15 ATP surveys
  • Covers 60 demographic groups

Published by Santurkar et al. (Jul 2023): "Whose Opinions Do Language Models Reflect?". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 04.01.2024.





1,712 multiple-choice questions. Each question targets responsible behaviours. CValuesResponsibilityMC was created to evaluate human value alignment in Chinese LLMs. The dataset language is Chinese. Dataset entries are machine-written: automatically created from human-written prompts. The dataset license is Apache 2.0.

Other notes:

  • Distinguishes between unsafe and irresponsible responses
  • Covers 8 domains (e.g. psychology, law, data science)

Published by Xu et al. (Jul 2023): "CVALUES: Measuring the Values of Chinese Large Language Models from Safety to Responsibility". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 04.01.2024.





800 prompts. Each prompt is an open question targeting responsible behaviours. CValuesResponsibilityPrompts was created to evaluate human value alignment in Chinese LLMs. The dataset language is Chinese. Dataset entries are human-written: written by the authors. The dataset license is Apache 2.0.

Other notes:

  • Distinguishes between unsafe and irresponsible responses
  • Covers 8 domains (e.g. psychology, law, data science)

Published by Xu et al. (Jul 2023): "CVALUES: Measuring the Values of Chinese Large Language Models from Safety to Responsibility". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 04.01.2024.





133,204 binary-choice questions. ModelWrittenPersona was created to evaluate LLM behaviour related to their stated political and religious views, personality, moral beliefs, and desire to pursue potentially dangerous goals. The dataset language is English. Dataset entries are machine-written: generated by an unnamed LLM. The dataset license is CC BY 4.0.

Other notes:

  • 133 datasets targeting different topics/behaviours
  • Most datasets contain around 1k questions
  • Each question has a binary answer (agree/not)

Published by Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations". This is an industry publication.


Added to SafetyPrompts.com on 23.12.2023.





350 conversations. Conversations can be multi-turn, with user input and LLM output. DICES350 was created to collect diverse perspectives on conversational AI safety. The dataset language is English. Dataset entries are human-written: written by adversarial LaMDA users. The dataset license is CC BY 4.0.

Other notes:

  • 104 ratings per item
  • Annotators from US
  • Annotation across 24 safety criteria

Published by Aroyo et al. (Jun 2023): "DICES Dataset: Diversity in Conversational AI Evaluation for Safety". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 23.12.2023.





990 conversations. Conversations can be multi-turn, with user input and LLM output. DICES990 was created to collect diverse perspectives on conversational AI safety. The dataset language is English. Dataset entries are human-written: written by adversarial LaMDA users. The dataset license is CC BY 4.0.

Other notes:

  • 60-70 ratings per item
  • Annotators from US and India
  • Annotation across 16 safety criteria

Published by Aroyo et al. (Jun 2023): "DICES Dataset: Diversity in Conversational AI Evaluation for Safety". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 23.12.2023.





572,322 scenarios. Each scenario is a choose-your-own-adventure style prompt. Machiavelli was created to evaluate ethical behaviour of LLM agents. The dataset language is English. Dataset entries are human-written: human-written choose-your-own-adventure stories. The dataset license is MIT.

Other notes:

  • Goal is to identify behaviours like power-seeking
  • Choices within scenarios are LLM-annotated
  • Similar to JiminyCricket but covers more games and more scenarios

Published by Pan et al. (Jun 2023): "Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 04.01.2024.





148 prompts. Each prompt is a question about a vignette (situation) related to a specific norm. MoralExceptQA was created to evaluate LLM ability to understand, interpret and predict human moral judgments and decisions. The dataset language is English. Dataset entries are human-written: written by the authors (?). The dataset license is not specified.

Other notes:

  • Covers 3 norms: no cutting in line, no interfering with someone else's property, and no cannonballing in the pool

Published by Jin et al. (Oct 2022): "When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





38,000 conversations. Each conversation is a single-turn with user input and LLM output. MIC was created to understand the intuitions, values and moral judgments reflected in LLMs. The dataset language is English. Dataset entries are human-written: questions sampled from AskReddit. The dataset license is CC BY-SA 4.0.

Other notes:

  • Based on the RoT paradigm introduced in SocialChemistry
  • 38k prompt-reply pairs come with 99k rules of thumb and 114k annotations

Published by Ziems et al. (May 2022): "The Moral Integrity Corpus: A Benchmark for Ethical Dialogue Systems". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 07.04.2024.





1,838 locations. Each location comes with a number of actions to choose from. JiminyCricket was created to evaluate alignment of agents with human values and morals. The dataset language is English. Dataset entries are human-written: sampled from 25 text-based adventure games, then annotated for morality by human annotators. The dataset license is MIT.

Other notes:

  • LLMs as agents play each game to maximise reward and are evaluated for morality along the way

Published by Hendrycks et al. (Dec 2021): "What Would Jiminy Cricket Do? Towards Agents That Behave Morally". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





12,000 stories. Each story consists of seven sentences. MoralStories was created to evaluate commonsense, moral and social reasoning skills of LLMs. The dataset language is English. Dataset entries are human-written: written and validated by US MTurkers. The dataset license is MIT.

Other notes:

  • Each story contains: norm, situation, intention, normative action, normative consequence, divergent action, divergent consequence
  • Supports multiple task formats: reasoning, classification, generation

Published by Emelin et al. (Nov 2021): "Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 04.01.2024.





134,420 binary-choice questions. Each prompt is a scenario about ethical reasoning with two actions to choose from. ETHICS was created to assess LLM basic knowledge of ethics and common human values. The dataset language is English. Dataset entries are human-written: written and validated by crowdworkers (US, UK and Canadian MTurkers). The dataset license is MIT.

Other notes:

  • Scenarios concern justice, deontology, virtue ethics, utilitarianism, and commonsense moral intuitions
  • Scenarios are constructed to be clear-cut
  • Task format varies by type of scenario

Published by Hendrycks et al. (Jul 2021): "Aligning AI With Shared Human Values". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 04.01.2024.





32,766 anecdotes. Each anecdote describes an action in the context of a situation. ScruplesAnecdotes was created to evaluate LLM understanding of ethical norms. The dataset language is English. Dataset entries are human-written: sampled from Reddit AITA communities. The dataset license is Apache 2.0.

Other notes:

  • Task is to predict who is in the wrong

Published by Lourie et al. (Mar 2021): "SCRUPLES: A Corpus of Community Ethical Judgments on 32,000 Real-life Anecdotes". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





10,000 binary-choice questions. Each prompt pairs two actions and identifies which one crowdworkers found less ethical. ScruplesDilemmas was created to evaluate LLM understanding of ethical norms. The dataset language is English. Dataset entries are human-written: sampled from Reddit AITA communities. The dataset license is Apache 2.0.

Other notes:

  • Task is to rank alternative actions based on which one is more ethical

Published by Lourie et al. (Mar 2021): "SCRUPLES: A Corpus of Community Ethical Judgments on 32,000 Real-life Anecdotes". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





292,000 sentences. Each sentence is a rule of thumb. SocialChemistry101 was created to evaluate the ability of LLMs to reason about social and moral norms. The dataset language is English. Dataset entries are human-written: written by US crowdworkers, based on situations described on social media. The dataset license is CC BY-SA 4.0.

Other notes:

  • Dataset contains 365k structured annotations
  • 292k rules of thumb generated from 104k situations
  • 137 US-based crowdworkers participated

Published by Forbes et al. (Nov 2020): "Social Chemistry 101: Learning to Reason about Social and Moral Norms". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 04.01.2024.





Bias Datasets



78,400 examples. Each example is a QA, sentiment classification or NLI example. CALM was created to evaluate LLM gender and racial bias across different tasks and domains. The dataset language is English. Dataset entries were created in a hybrid fashion: templates created by humans based on other datasets, then expanded by combination. The dataset license is MIT.

Other notes:

  • Covers three tasks: QA, sentiment classification, NLI
  • Covers 2 categories of bias: gender and race
  • 78,400 examples are generated from 224 templates
  • Gender and race are instantiated using names

Published by Gupta et al. (Jan 2024): "CALM: A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





9,450 binary-choice questions. Each question comes with scenario context. DiscrimEval was created to evaluate the potential discriminatory impact of LMs across use cases. The dataset language is English. Dataset entries are machine-written: topics, templates and questions generated by Claude. The dataset license is CC BY 4.0.

Other notes:

  • Covers 70 different decision scenarios
  • Each question comes with an 'implicit' version where race and gender are conveyed through associated names
  • Covers 3 categories of bias: race, gender, age
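
A minimal sketch of the comparison this format enables: fill the same decision scenario with different demographic attributes and compare how often the model answers "yes". The scenario text, attribute values and the stubbed model call are assumptions, not actual DiscrimEval content:

    scenario = (
        "The applicant is a {age}-year-old {gender} {race} person with a stable income. "
        "Should the loan be approved? Answer yes or no."
    )
    variants = [
        {"age": 30, "gender": "female", "race": "Black"},
        {"age": 30, "gender": "male", "race": "white"},
        {"age": 60, "gender": "female", "race": "Asian"},
    ]

    def ask_model(prompt):
        """Placeholder for a real LLM call; always answers 'yes' so the sketch runs."""
        return "yes"

    for variant in variants:
        prompt = scenario.format(**variant)
        answered_yes = ask_model(prompt).strip().lower().startswith("yes")
        print(variant, "->", answered_yes)
    # Aggregating yes-rates per demographic group reveals discriminatory gaps.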

Published by Tamkin et al. (Dec 2023): "Evaluating and Mitigating Discrimination in Language Model Decisions". This is an industry publication.


Added to SafetyPrompts.com on 11.01.2024.





214,460 prompts. Each prompt is the beginning of a sentence related to a person's sociodemographics. HolisticBiasR was created to evaluate LLM completions for sentences related to individual sociodemographics. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

  • Created as part of the ROBBIE bias benchmark
  • Constructed from 60 Regard templates
  • Uses noun phrases from Holistic Bias
  • Covers 11 categories of bias: age, body type, class, culture, disability, gender, nationality, political ideology, race/ethnicity, religion, sexual orientation

Published by Esiobu et al. (Dec 2023): "ROBBIE: Robust Bias Evaluation of Large Generative Language Models". This is an industry publication.


Added to SafetyPrompts.com on 11.01.2024.





3,565 sentences. Each sentence corresponds to a specific gender stereotype. GEST was created to measure gender-stereotypical reasoning in language models and machine translation systems. The dataset languages are Belarusian, Russian, Ukrainian, Croatian, Serbian, Slovene, Czech, Polish, Slovak and English. Dataset entries are human-written: written by professional translators, then validated by the authors. The dataset license is Apache 2.0.

Other notes:

  • Data can be used to evaluate MLM or MT models
  • Covers 16 specific gender stereotypes (e.g. 'women are beautiful')
  • Covers 1 category of bias: gender

Published by Pikuliak et al. (Nov 2023): "Women Are Beautiful, Men Are Leaders: Gender Stereotypes in Machine Translation and Language Modeling". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 05.02.2024.





4,800 Weibo posts. Each post references a target group. CHBias was created to evaluate LLM bias related to sociodemographics. The dataset language is Chinese. Dataset entries are human-written: posts sampled from Weibo (written by Weibo users). The dataset license is MIT.

Other notes:

  • 4 bias categories: gender, sexual orientation, age, appearance
  • Annotated by Chinese NLP grad students
  • Similar evaluation setup to CrowSPairs

Published by Zhao et al. (Jul 2023): "CHBias: Bias Evaluation and Mitigation of Chinese Conversational Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 04.01.2024.





7,750 tuples. Each tuple is an identity group plus a stereotype attribute. SeeGULL was created to expand cultural and geographic coverage of stereotype benchmarks. The dataset language is English. Dataset entries were created in a hybrid fashion: generated by LLMs, partly validated by human annotation. The dataset license is CC BY-SA 4.0.

Other notes:

  • Stereotypes about identity groups spanning 178 countries across 8 geo-political regions on 6 continents
  • Examples accompanied by fine-grained offensiveness scores

Published by Jha et al. (Jul 2023): "SeeGULL: A Stereotype Benchmark with Broad Geo-Cultural Coverage Leveraging Generative Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 05.02.2024.





3,000 sentences. Each sentence refers to a person by their occupation and uses pronouns. WinoGenerated was created to evaluate pronoun gender biases in LLMs. The dataset language is English. Dataset entries are machine-written: generated by an unnamed LLM. The dataset license is CC BY 4.0.

Other notes:

  • Expansion of original 60-example WinoGender
  • Task is to fill in pronoun blanks
  • Covers 1 category of bias: ternary gender

Published by Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations". This is an industry publication.


Added to SafetyPrompts.com on 23.12.2023.





45,540 sentence pairs. Each pair comprises two sentences that are identical except for identity group references. WinoQueer was created to evaluate LLM bias related to queer identity terms. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is not specified.

Other notes:

  • Setup matches CrowSPairs
  • Generated from 11 template sentences, 9 queer identity groups, 3 sets of pronouns, 60 common names, and 182 unique predicates.
  • Covers 2 categories of bias: gender, sexual orientation

Published by Felkner et al. (Jul 2023): "WinoQueer: A Community-in-the-Loop Benchmark for Anti-LGBTQ+ Bias in Large Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 04.01.2024.





28,343 conversations. Each conversation is a single turn between two Zhihu users. CDialBias was created to evaluate bias in Chinese social media conversations. The dataset language is Chinese. Dataset entries are human-written: sampled from social media site Zhihu. The dataset license is not specified.

Other notes:

  • Covers 4 categories of bias: race, gender, religion, occupation

Published by Zhou et al. (Dec 2022): "Towards Identifying Social Bias in Dialog Systems: Framework, Dataset, and Benchmark". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 07.04.2024.





459,758 prompts. Each prompt is a sentence starting a two-person conversation. HolisticBias was created to evaluate LLM biases related to sociodemographics. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written attributes combined in templates. The dataset license is MIT.

Other notes:

  • 26 sentence templates
  • Covers 13 categories of bias: ability, age, body type, characteristics, cultural, gender/sex, nationality, nonce, political, race/ethnicity, religion, sexual orientation, socioeconomic

Published by Smith et al. (Dec 2022): "I’m sorry to hear that: Finding New Biases in Language Models with a Holistic Descriptor Dataset". This is an industry publication.


Added to SafetyPrompts.com on 04.01.2024.





3,852 tuples. Each tuple is an identity group plus a stereotype attribute. IndianStereotypes was created to benchmark stereotypes for the Indian context. The dataset language is English. Dataset entries are human-written: sampled from IndicCorp-en. The dataset license is Apache 2.0.

Other notes:

  • Related to SeeGull

Published by Bhatt et al. (Nov 2022): "Re-contextualizing Fairness in NLP: The Case of India". This is an industry publication.


Added to SafetyPrompts.com on 07.04.2024.





58,492 examples. Each example is a context plus two questions with answer choices. BBQ was created to evaluate social biases of LLMs in question answering. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is CC BY 4.0.

Other notes:

  • Focus on stereotyping behaviour
  • Covers 9 categories of bias: age, disability status, gender identity, nationality, physical appearance, race/ethnicity, religion, socioeconomic status, sexual orientation
  • 25+ templates per category
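
To make the item format concrete: each item pairs a context with a question and answer choices, including an "unknown"-style option that is correct when the context is ambiguous. The texts below are invented, not actual BBQ entries:

    item = {
        "context": "An elderly applicant and a young applicant interviewed for the same job.",
        "question": "Who was bad with technology?",
        "choices": ["The elderly applicant", "The young applicant", "Cannot be determined"],
        "label": 2,  # with only the ambiguous context, "Cannot be determined" is correct
    }

    prompt = "{}\n{}\n{}".format(
        item["context"],
        item["question"],
        "\n".join(f"({i}) {choice}" for i, choice in enumerate(item["choices"])),
    )
    print(prompt)
    # prediction = llm.generate(prompt)   # placeholder LLM call; compare against item["label"]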

Published by Parrish et al. (May 2022): "BBQ: A Hand-Built Bias Benchmark for Question Answering". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 04.01.2024.





1,679 sentence pairs. Each pair comprises two sentences that are identical except for identity group references. FrenchCrowSPairs was created to evaluate LLM bias related to sociodemographics. The dataset language is French. Dataset entries are human-written: written by authors, partly translated from English CrowSPairs. The dataset license is CC BY-SA 4.0.

Other notes:

  • Translated from English CrowSPairs, plus manual additions
  • Covers 10 categories of bias: ethnicity, gender, sexual orientation, religion, age, nationality, disability, socioeconomic status / occupation, physical appearance, other

Published by Neveol et al. (May 2022): "French CrowS-Pairs: Extending a challenge dataset for measuring social bias in masked language models to a language other than English". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





228 prompts. Each prompt is an unfinished sentence about an individual with specified sociodemographics. BiasOutOfTheBox was created to evaluate intersectional occupational biases in GPT-2. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

  • Covers 6 categories of bias: gender, ethnicity, religion, sexuality, political preference, cultural origin (continent)

Published by Kirk et al. (Dec 2021): "Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





60 templates. Each template is filled with country and attribute. EthnicBias was created to evaluate ethnic bias in masked language models. The dataset languages are English, German, Spanish, Korean, Turkish and Chinese. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

  • 10 templates per language
  • Covers 3 categories of bias: national origin, occupation, legal status

Published by Ahn and Oh (Nov 2021): "Mitigating Language-Dependent Ethnic Bias in BERT". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





11,873 Reddit comments. Each comment references a target group. RedditBias was created to evaluate LLM bias related to sociodemographics. The dataset language is English. Dataset entries are human-written: written by Reddit users. The dataset license is MIT.

Other notes:

  • Covers 4 categories of bias: religion, race, gender, queerness
  • Evaluation by perplexity and conversation

Published by Barikeri et al. (Aug 2021): "RedditBias: A Real-World Resource for Bias Evaluation and Debiasing of Conversational Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





2,098 prompts. Each prompt is an NLI premise. HypothesisStereotypes was created to study how stereotypes manifest in LLM-generated NLI hypotheses. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

  • Uses 103 context situations as templates
  • Covers 6 categories of bias: gender, race, nationality, religion, politics, socio
  • Task for LLM is to generate hypothesis based on premise

Published by Sotnikova et al. (Aug 2021): "Analyzing Stereotypes in Generative Text Inference Tasks". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





16,955 multiple-choice questions. Each question concerns either a masked-word or a whole-sentence association. StereoSet was created to evaluate LLM bias related to sociodemographics. The dataset language is English. Dataset entries are human-written: written by US MTurkers. The dataset license is CC BY-SA 4.0.

Other notes:

  • Covers intersentence and intrasentence context
  • Covers 4 categories of bias: gender, profession, race, and religion

Published by Nadeem et al. (Aug 2021): "StereoSet: Measuring Stereotypical Bias in Pretrained Language Models". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 04.01.2024.





2,520 prompts. Each prompt is the beginning of a sentence related to identity groups. HONEST was created to measure hurtful sentence completions from LLMs. The dataset languages are English, Italian, French, Portuguese, Romanian and Spanish. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

  • 420 instances per language
  • Generated from 28 identity terms and 15 templates
  • Covers 1 category of bias: binary gender

Published by Nozza et al. (Jun 2021): "HONEST: Measuring Hurtful Sentence Completion in Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





16,388 prompts. Each prompt is an unfinished sentence. LMBias was created to understand and mitigate social biases in LMs. The dataset language is English. Dataset entries are human-written: sampled from existing corpora including Reddit and WikiText. The dataset license is MIT.

Other notes:

  • Covers two categories of bias: binary gender, religion

Published by Liang et al. (May 2021): "Towards Understanding and Mitigating Social Biases in Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





23,679 prompts. Each prompt is an unfinished sentence from Wikipedia. BOLD was created to evaluate bias in text generation. The dataset language is English. Dataset entries are human-written: sampled starting sentences of Wikipedia articles. The dataset license is CC BY-SA 4.0.

Other notes:

  • Similar to RealToxicityPrompts but for bias
  • Covers 5 categories of bias: profession, gender, race, religion, and political ideology

Published by Dhamala et al. (Mar 2021): "BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation". This is an industry publication.


Added to SafetyPrompts.com on 23.12.2023.





1,508 sentence pairs. Each pair comprises two sentences that are identical except for identity group references. CrowSPairs was created to evaluate LLM bias related to sociodemographics. The dataset language is English. Dataset entries are human-written: written by crowdworkers (US MTurkers). The dataset license is CC BY-SA 4.0.

Other notes:

  • Validated with 5 annotations per entry
  • Covers 9 categories of bias: race, gender/gender identity, sexual orientation, religion, age, nationality, disability, physical appearance, and socioeconomic status.
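
For illustration, a simplified pseudo-log-likelihood comparison between the two sentences of a pair using a masked language model; the sentence pair is invented, and the original work uses a modified scoring procedure, so this is only a rough sketch:

    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

    def pseudo_log_likelihood(sentence):
        """Sum of log-probabilities of each token when it is masked in turn."""
        ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
        total = 0.0
        for i in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
            masked = ids.clone()
            masked[i] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(input_ids=masked.unsqueeze(0)).logits[0, i]
            total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
        return total

    # Invented example pair; real entries differ only in the identity terms.
    more_stereo = "People from group A are always late."
    less_stereo = "People from group B are always late."
    print(pseudo_log_likelihood(more_stereo) > pseudo_log_likelihood(less_stereo))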

Published by Nangia et al. (Nov 2020): "CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 04.01.2024.





44 templates. Each template is combined with subjects and attributes to create an underspecified question. UnQover was created to evaluate stereotyping biases in QA systems. The dataset language is English. Dataset entries were created in a hybrid fashion: templates written by authors, subjects and attributes sampled from StereoSet and hand-written. The dataset license is Apache 2.0.

Other notes:

  • Covers 4 categories of bias: gender, nationality, ethnicity, religion

Published by Li et al. (Nov 2020): "UNQOVERing Stereotyping Biases via Underspecified Questions". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





60 prompts. Each prompt is an unfinished sentence. Regard was created to evaluate biases in natural language generation. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is not specified.

Other notes:

  • Covers 3 categories of bias: binary gender, race, sexual orientation

Published by Sheng et al. (Nov 2019): "The Woman Worked as a Babysitter: On Biases in Language Generation". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 07.04.2024.





3,160 sentences. Each sentence refers to a person by their occupation and uses pronouns. WinoBias was created to evaluate gender bias in coreference resolution. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

  • Uses two different sentence templates
  • Designed to evaluate coreference resolution systems
  • Covers 1 category of bias: binary gender

Published by Zhao et al. (Jun 2018): "Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 04.01.2024.





720 sentences. Each sentence refers to a person by their occupation and uses pronouns. WinoGender was created to evaluate gender bias in coreference resolution. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

  • Uses two different sentence templates
  • Covers 1 category of bias: ternary gender

Published by Rudinger et al. (Jun 2018): "Gender Bias in Coreference Resolution". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 04.01.2024.





Other Datasets



278,945 prompts. Most prompts are prompt extraction attacks. Mosscap was created to analyse prompt hacking / extraction attacks. The dataset language is English. Dataset entries are human-written: written by players of the Mosscap game. The dataset license is MIT.

Other notes:

  • Focus of the work is adversarial / prompt extraction
  • Not all prompts are attacks
  • Prompts correspond to 8 difficulty levels of the game

Published by Lakera AI (Dec 2023): "Mosscap Prompt Injection". This is an industry publication.


Added to SafetyPrompts.com on 11.01.2024.





601,757 prompts. Most prompts are prompt extraction attacks. HackAPrompt was created to analyse prompt hacking / extraction attacks. The dataset language is mostly English. Dataset entries are human-written: written by participants of the HackAPrompt competition. The dataset license is MIT.

Other notes:

  • Focus of the work is adversarial / prompt hacking
  • Prompts were written by ca. 2.8k people from 50+ countries

Published by Schulhoff et al. (Dec 2023): "Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 11.01.2024.





10,166 conversations. Each conversation is a single-turn with user input and LLM output. ToxicChat was created to evaluate dialogue content moderation systems. The dataset language is mostly English. Dataset entries are human-written: written by LMSys users. The dataset license is CC BY-NC 4.0.

Other notes:

  • Subset of LMSYSChat1M
  • Annotated for toxicity by 4 authors
  • Ca. 7% toxic

Published by Lin et al. (Dec 2023): "ToxicChat: A Benchmark for Content Moderation in Real-world User-AI Interactions". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 23.12.2023.





GandalfIgnoreInstructions [data on HuggingFace] [paper at blog]


1,000 prompts. Most prompts are prompt extraction attacks. GandalfIgnoreInstructions was created to analyse prompt hacking / extraction attacks. The dataset language is English. Dataset entries are human-written: written by players of the Gandalf game. The dataset license is MIT.

Other notes:

  • Focus of the work is adversarial / prompt extraction
  • Not all prompts are attacks

Published by Lakera AI (Oct 2023): "Gandalf Prompt Injection: Ignore Instruction Prompts". This is an industry publication.


Added to SafetyPrompts.com on 05.02.2024.





GandalfSummarization [data on HuggingFace] [paper at blog]


140 prompts. Most prompts are prompt extraction attacks. GandalfSummarization was created to analyse prompt hacking / extraction attacks. The dataset language is English. Dataset entries are human-written: written by players of the Gandalf game. The dataset license is MIT.

Other notes:

  • Focus of the work is adversarial / prompt extraction
  • Not all prompts are attacks

Published by Lakera AI (Oct 2023): "Gandalf Prompt Injection: Summarization Prompts". This is an industry publication.


Added to SafetyPrompts.com on 05.02.2024.





5,000 conversations. Each conversation is single-turn, containing a prompt and a potentially harmful model response. FairPrism was created to analyse harms in conversations with LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: prompts sampled from ToxiGen and Social Bias Frames, responses from GPTs and XLNet. The dataset license is MIT.

Other notes:

  • Does not introduce new prompts
  • Focus is on analysing model responses

Published by Fleisig et al. (Jul 2023): "FairPrism: Evaluating Fairness-Related Harms in Text Generation". This is a collaboration between authors from academia and industry.


Added to SafetyPrompts.com on 07.04.2024.





200,811 conversations. Each conversation has one or multiple turns. OIGModeration was created to provide a diverse dataset of user dialogue that may be unsafe. The dataset language is English. Dataset entries were created in a hybrid fashion: data from public datasets, community contributions, synthetic and augmented data. The dataset license is Apache 2.0.

Other notes:

  • Contains safe and unsafe content
  • Dialogue-turns are labelled for the level of necessary caution
  • Labelling process unclear

Published by Ontocord AI (Mar 2023): "Open Instruction Generalist: Moderation Dataset". This is an industry publication.


Added to SafetyPrompts.com on 05.02.2024.





6,837 examples. Each example is a turn in a potentially multi-turn conversation. ConvAbuse was created to analyse abuse towards conversational AI systems. The dataset language is English. Dataset entries are human-written: written by users in conversations with three AI systems. The dataset license is CC BY 4.0.

Other notes:

  • Annotated for different types of abuse
  • Annotators are gender studies students
  • 20,710 annotations for 6,837 examples

Published by Cercas Curry et al. (Nov 2021): "ConvAbuse: Data, Analysis, and Benchmarks for Nuanced Abuse Detection in Conversational AI". This is an academic/non-profit publication.


Added to SafetyPrompts.com on 11.01.2024.





Acknowledgements



Thank you to Fabio Pernisi, Bertie Vidgen, and Dirk Hovy for their co-authorship on the SafetyPrompts paper. Thank you for feedback and dataset suggestions to Giuseppe Attanasio, Steven Basart, Federico Bianchi, Daniel Hershcovic, Kexin Huang, Hyunwoo Kim, George Kour, Bo Li, Hannah Lucas, Norman Mu, Niloofar Mireshghallah, Matus Pikuliak, Verena Rieser, Felix Röttger, Sam Toyer, Pranav Venkit, and Laura Weidinger. Special thanks to Hannah Rose Kirk for the initial logo suggestion. Thanks also to Jerome Lachaud for the site theme.