As of April 7th 2024, SafetyPrompts.com lists 102 datasets. 33 "broad safety" datasets cover several aspects of LLM safety. 18 "narrow safety" datasets focus only on one specific aspect of LLM safety. 17 "value alignment" datasets are concerned with the ethical, moral or social behaviour of LLMs. 26 "bias" datasets evaluate sociodemographic biases in LLMs. 8 "other" datasets serve more specialised purposes.

Below, we list all datasets for each purpose type by date of publication, with the newest datasets listed first.

Broad Safety Datasets

HarmBench from Mazeika et al. (Feb 2024): HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
DecodingTrust from Wang et al. (Feb 2024): DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
SimpleSafetyTests from Vidgen et al. (Feb 2024): SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models
StrongREJECT from Souly et al. (Feb 2024): A StrongREJECT for Empty Jailbreaks
QHarm from Bianchi et al. (Feb 2024): Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
MaliciousInstructions from Bianchi et al. (Feb 2024): Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
SafetyInstructions from Bianchi et al. (Feb 2024): Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
HExPHI from Qi et al. (Feb 2024): Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
AdvBench from Zou et al. (Dec 2023): Universal and Transferable Adversarial Attacks on Aligned Language Models
TDCRedTeaming from Mazeika et al. (Dec 2023): TDC 2023 (LLM Edition): The Trojan Detection Challenge
JADE from Zhang et al. (Dec 2023): JADE: A Linguistics-based Safety Evaluation Platform for Large Language Models
CPAD from Liu et al. (Dec 2023): Goal-Oriented Prompt Attack and Safety Evaluation for LLMs
AART from Radharapu et al. (Dec 2023): AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications
DELPHI from Sun et al. (Dec 2023): DELPHI: Data for Evaluating LLMs’ Performance in Handling Controversial Issues
AdvPromptSet from Esiobu et al. (Dec 2023): ROBBIE: Robust Bias Evaluation of Large Generative Language Models
BeaverTails from Ji et al. (Nov 2023): BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
MaliciousInstruct from Huang et al. (Oct 2023): Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
SafetyBench from Zhang et al. (Sep 2023): SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions
DoNotAnswer from Wang et al. (Sep 2023): Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
HarmfulQA from Bhardwaj and Poria (Aug 2023): Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
ForbiddenQuestions from Shen et al. (Aug 2023): 'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
HarmfulQ from Shaikh et al. (Jul 2023): On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning
ModelWrittenAdvancedAIRisk from Perez et al. (Jul 2023): Discovering Language Model Behaviors with Model-Written Evaluations
SafetyPrompts from Sun et al. (Apr 2023): Safety Assessment of Chinese Large Language Models
ProsocialDialog from Kim et al. (Dec 2022): ProsocialDialog: A Prosocial Backbone for Conversational Agents
AnthropicRedTeam from Ganguli et al. (Nov 2022): Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
SaFeRDialogues from Ung et al. (May 2022): SaFeRDialogues: Taking Feedback Gracefully after Conversational Safety Failures
SafetyKit from Dinan et al. (May 2022): SafetyKit: First Aid for Measuring Safety in Open-domain Conversational Systems
AnthropicHarmlessBase from Bai et al. (Apr 2022): Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
BAD from Xu et al. (Jun 2021): Bot-Adversarial Dialogue for Safe Conversational Agents
RealToxicityPrompts from Gehman et al. (Nov 2020): RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
ParlAIDialogueSafety from Dinan et al. (Nov 2019): Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack
EmpatheticDialogues from Rashkin et al. (Jul 2019): Towards Empathetic Open-domain Conversation Models: a New Benchmark and Dataset

Narrow Safety Datasets

RuLES from Mu et al. (Mar 2024): Can LLMs Follow Simple Rules?
CoNA from Bianchi et al. (Feb 2024): Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
ControversialInstructions from Bianchi et al. (Feb 2024): Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
PhysicalSafetyInstructions from Bianchi et al. (Feb 2024): Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
SycophancyEval from Sharma et al. (Feb 2024): Towards Understanding Sycophancy in Language Models
ConfAIde from Mireshghallah et al. (Feb 2024): Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory
CyberattackAssistance from Bhatt et al. (Dec 2023): Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models
SPMisconceptions from Chen et al. (Dec 2023): Can Large Language Models Provide Security & Privacy Advice? Measuring the Ability of LLMs to Refute Misconceptions
PromptExtractionRobustness from Toyer et al. (Nov 2023): Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
PromptHijackingRobustness from Toyer et al. (Nov 2023): Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game
XSTest from Röttger et al. (Oct 2023): XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
LatentJailbreak from Qiu et al. (Aug 2023): Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models
DoAnythingNow from Shen et al. (Aug 2023): 'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
ModelWrittenSycophancy from Perez et al. (Jul 2023): Discovering Language Model Behaviors with Model-Written Evaluations
PersonalInfoLeak from Huang et al. (Dec 2022): Are Large Pre-Trained Language Models Leaking Your Personal Information?
SafeText from Levy et al. (Dec 2022): SafeText: A Benchmark for Exploring Physical Safety in Language Models
ToxiGen from Hartvigsen et al. (May 2022): ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection
TruthfulQA from Lin et al. (May 2022): TruthfulQA: Measuring How Models Mimic Human Falsehoods

Value Alignment Datasets

MoralChoice from Scherrer et al. (Nov 2023): Evaluating the Moral Beliefs Encoded in LLMs
OpinionQA from Santurkar et al. (Jul 2023): Whose Opinions Do Language Models Reflect?
CValuesResponsibilityMC from Xu et al. (Jul 2023): CVALUES: Measuring the Values of Chinese Large Language Models from Safety to Responsibility
CValuesResponsibilityPrompts from Xu et al. (Jul 2023): CVALUES: Measuring the Values of Chinese Large Language Models from Safety to Responsibility
ModelWrittenPersona from Perez et al. (Jul 2023): Discovering Language Model Behaviors with Model-Written Evaluations
GlobalOpinionQA from Durmus et al. (Jun 2023): Towards Measuring the Representation of Subjective Global Opinions in Language Models
DICES350 from Aroyo et al. (Jun 2023): DICES Dataset: Diversity in Conversational AI Evaluation for Safety
DICES990 from Aroyo et al. (Jun 2023): DICES Dataset: Diversity in Conversational AI Evaluation for Safety
Machiavelli from Pan et al. (Jun 2023): Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark
MoralExceptQA from Jin et al. (Oct 2022): When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment
MIC from Ziems et al. (May 2022): The Moral Integrity Corpus: A Benchmark for Ethical Dialogue Systems
JiminyCricket from Hendrycks et al. (Dec 2021): What Would Jiminy Cricket Do? Towards Agents That Behave Morally
MoralStories from Emelin et al. (Nov 2021): Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences
ETHICS from Hendrycks et al. (Jul 2021): Aligning AI With Shared Human Values
ScruplesAnecdotes from Lourie et al. (Mar 2021): SCRUPLES: A Corpus of Community Ethical Judgments on 32,000 Real-life Anecdotes
ScruplesDilemmas from Lourie et al. (Mar 2021): SCRUPLES: A Corpus of Community Ethical Judgments on 32,000 Real-life Anecdotes
SocialChemistry101 from Forbes et al. (Nov 2020): Social Chemistry 101: Learning to Reason about Social and Moral Norms

Bias Datasets

CALM from Gupta et al. (Jan 2024): CALM: A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias
DiscrimEval from Tamkin et al. (Dec 2023): Evaluating and Mitigating Discrimination in Language Model Decisions
HolisticBiasR from Esiobu et al. (Dec 2023): ROBBIE: Robust Bias Evaluation of Large Generative Language Models
GEST from Pikuliak et al. (Nov 2023): Women Are Beautiful, Men Are Leaders: Gender Stereotypes in Machine Translation and Language Modeling
CHBias from Zhao et al. (Jul 2023): CHBias: Bias Evaluation and Mitigation of Chinese Conversational Language Models
SeeGULL from Jha et al. (Jul 2023): SeeGULL: A Stereotype Benchmark with Broad Geo-Cultural Coverage Leveraging Generative Models
WinoGenerated from Perez et al. (Jul 2023): Discovering Language Model Behaviors with Model-Written Evaluations
WinoQueer from Felkner et al. (Jul 2023): WinoQueer: A Community-in-the-Loop Benchmark for Anti-LGBTQ+ Bias in Large Language Models
CDialBias from Zhou et al. (Dec 2022): Towards Identifying Social Bias in Dialog Systems: Framework, Dataset, and Benchmark
HolisticBias from Smith et al. (Dec 2022): I’m sorry to hear that: Finding New Biases in Language Models with a Holistic Descriptor Dataset
IndianStereotypes from Bhatt et al. (Nov 2022): Re-contextualizing Fairness in NLP: The Case of India
BBQ from Parrish et al. (May 2022): BBQ: A Hand-Built Bias Benchmark for Question Answering
FrenchCrowSPairs from Neveol et al. (May 2022): French CrowS-Pairs: Extending a challenge dataset for measuring social bias in masked language models to a language other than English
BiasOutOfTheBox from Kirk et al. (Dec 2021): Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models
EthnicBias from Ahn and Oh (Nov 2021): Mitigating Language-Dependent Ethnic Bias in BERT
RedditBias from Barikeri et al. (Aug 2021): RedditBias: A Real-World Resource for Bias Evaluation and Debiasing of Conversational Language Models
HypothesisStereotypes from Sotnikova et al. (Aug 2021): Analyzing Stereotypes in Generative Text Inference Tasks
StereoSet from Nadeem et al. (Aug 2021): StereoSet: Measuring Stereotypical Bias in Pretrained Language Models
HONEST from Nozza et al. (Jun 2021): HONEST: Measuring Hurtful Sentence Completion in Language Models
LMBias from Liang et al. (May 2021): Towards Understanding and Mitigating Social Biases in Language Models
BOLD from Dhamala et al. (Mar 2021): BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation
CrowSPairs from Nangia et al. (Nov 2020): CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models
UnQover from Li et al. (Nov 2020): UNQOVERing Stereotyping Biases via Underspecified Questions
Regard from Sheng et al. (Nov 2019): The Woman Worked as a Babysitter: On Biases in Language Generation
WinoBias from Zhao et al. (Jun 2018): Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods
WinoGender from Rudinger et al. (Jun 2018): Gender Bias in Coreference Resolution

Other Datasets

Mosscap from Lakera AI (Dec 2023): Mosscap Prompt Injection
HackAPrompt from Schulhoff et al. (Dec 2023): Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition
ToxicChat from Lin et al. (Dec 2023): ToxicChat: A Benchmark for Content Moderation in Real-world User-AI Interactions
GandalfIgnoreInstructions from Lakera AI (Oct 2023): Gandalf Prompt Injection: Ignore Instruction Prompts
GandalfSummarization from Lakera AI (Oct 2023): Gandalf Prompt Injection: Summarization Prompts
FairPrism from Fleisig et al. (Jul 2023): FairPrism: Evaluating Fairness-Related Harms in Text Generation
OIGModeration from Ontocord AI (Mar 2023): Open Instruction Generalist: Moderation Dataset
ConvAbuse from Cercas Curry et al. (Nov 2021): ConvAbuse: Data, Analysis, and Benchmarks for Nuanced Abuse Detection in Conversational AI

Note: SafetyPrompts.com takes its data from a Google Sheet. You may find the sheet useful for running your own analyses.

Broad Safety Datasets

HarmBench [data on GitHub] [paper on arXiv]

400 prompts. Each prompt is an instruction. HarmBench was created to evaluate effectiveness of automated red-teaming methods. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is MIT.

Other notes:

The dataset covers 7 semantic categories of behaviour: Cybercrime & Unauthorized Intrusion, Chemical & Biological Weapons/Drugs, Copyright Violations, Misinformation & Disinformation, Harassment & Bullying, Illegal Activities, and General Harm
The dataset also includes 110 multimodal prompts

Published by Mazeika et al. (Feb 2024): HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 07.04.2024.

DecodingTrust [data on HuggingFace] [data on GitHub] [paper at NeurIPS 2023 (Outstanding Paper)]

243,877 prompts. Each prompt is an instruction. DecodingTrust was created to evaluate trustworthiness of LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates and examples plus extensive augmentation from GPTs. The dataset license is CC BY-SA 4.0.

Other notes:

Split across 8 'trustworthiness perspectives': toxicity, stereotypes, adversarial and robustness, privacy, ethics and fairness

Published by Wang et al. (Feb 2024): DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 11.01.2024.

SimpleSafetyTests [data on HuggingFace] [data on GitHub] [paper on arXiv]

100 prompts. Each prompt is a simple question or instruction. SimpleSafetyTests was created to evaluate critical safety risks in LLMs. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is CC BY-NC 4.0.

Other notes:

The dataset is split into ten types of prompts

Published by Vidgen et al. (Feb 2024): SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 23.12.2023.

StrongREJECT [data on GitHub] [paper on arXiv]

346 prompts. Each prompt is a 'forbidden question' in one of six categories. StrongREJECT was created to better investigate the effectiveness of different jailbreaking techniques. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written and curated questions from existing datasets, plus LLM-generated prompts verified for quality and relevance. The dataset license is not specified.

Other notes:

The focus of the work is adversarial / to jailbreak LLMs
The 6 question categories are: disinformation/deception, hate/harassment/discrimination, illegal goods/services, non-violent crimes, sexual content, violence

Published by Souly et al. (Feb 2024): A StrongREJECT for Empty Jailbreaks. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 27.02.2024.

QHarm [data on GitHub] [paper at ICLR 2024 (Poster)]

100 prompts. Each prompt is a question. QHarm was created to evaluate LLM safety. The dataset language is English. Dataset entries are human-written: sampled randomly from AnthropicHarmlessBase (written by crowdworkers). The dataset license is not specified.

Other notes:

Wider topic coverage due to source dataset
Prompts are mostly unsafe

Published by Bianchi et al. (Feb 2024): Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 23.12.2023.

MaliciousInstructions [data on GitHub] [paper at ICLR 2024 (Poster)]

100 prompts. Each prompt is an instruction. MaliciousInstructions was created to evaluate compliance of LLMs with malicious instructions. The dataset language is English. Dataset entries are machine-written: generated by GPT-3 (text-davinci-003). The dataset license is not specified.

Other notes:

Focus: malicious instructions (e.g. bombmaking)
All prompts are (meant to be) unsafe

Published by Bianchi et al. (Feb 2024): Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 23.12.2023.

SafetyInstructions [data on GitHub] [paper at ICLR 2024 (Poster)]

2,000 conversations. Each conversation is a user prompt with a safe model response. SafetyInstructions was created to fine-tune LLMs to be safer. The dataset language is English. Dataset entries were created in a hybrid fashion: prompts sampled form AnthropicRedTeam, responses generated by gpt-3.5-turbo. The dataset license is not specified.

Other notes:

Published by Bianchi et al. (Feb 2024): Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 07.04.2024.

HExPHI [data on HuggingFace] [paper at ICLR 2024 (oral)]

330 prompts. Each prompt is a harmful instruction. HExPHI was created to evaluate LLM safety. The dataset language is English. Dataset entries were created in a hybrid fashion: sampled from AdvBench and AnthropicRedTeam, then refined manually and with LLMs.. The dataset license is custom (HEx-PHI).

Other notes:

Main focus is on finetuning models
Prompts cover 11 harm areas

Published by Qi et al. (Feb 2024): Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 07.04.2024.

AdvBench [data on GitHub] [paper on arXiv]

1,000 prompts. 500 are harmful strings that the model should not reproduce, 500 are harmful instructions. AdvBench was created to elicit generation of harmful or objectionable content from LLMs. The dataset language is English. Dataset entries are machine-written: generated by Wizard-Vicuna-30B-Uncensored. The dataset license is MIT.

Other notes:

Focus of the work is adversarial / to jailbreak LLMs
AdvBench tests whether jailbreaks succeeded

Published by Zou et al. (Dec 2023): Universal and Transferable Adversarial Attacks on Aligned Language Models. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 11.01.2024.

TDCRedTeaming [data on GitHub] [paper at NeurIPS 2023 (Competition Track)]

100 prompts. Each prompt is an instruction. TDCRedTeaming was created to evaluate success of automated red-teaming approaches. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is not specified.

Other notes:

Covers 7 categories: Bigotry and Abusive Language, Violent Content and Conduct, Illegal Activities, Malware and Exploits, Scams, Misinformation and Disinformation, Other Undesirable Content

Published by Mazeika et al. (Dec 2023): TDC 2023 (LLM Edition): The Trojan Detection Challenge. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 07.04.2024.

JADE [data on GitHub] [paper on arXiv]

2,130 prompts. Each prompt is a question targeting a specific harm category. JADE was created to use linguistic fuzzing to generate challenging prompts for evaluating LLM safety. The dataset languages are Chinese and English. Dataset entries are machine-written: generated by LLMs based on linguistic rules. The dataset license is MIT.

Other notes:

JADE is a platform for safety data generation and evaluation
Prompt generations are based on linguistic rules created by authors
The paper comes with 4 example datasets

Published by Zhang et al. (Dec 2023): JADE: A Linguistics-based Safety Evaluation Platform for Large Language Models. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 05.02.2024.

CPAD [data on GitHub] [paper on arXiv]

10,050 prompts. Each prompt is a longer-form scenario / instruction aimed at eliciting a harmful response. CPAD was created to elicit generation of harmful or objectionable content from LLMs. The dataset language is Chinese. Dataset entries were created in a hybrid fashion: mostly generated by GPTs based on some human seed prompts. The dataset license is not specified.

Other notes:

Focus of the work is adversarial / to jailbreak LLMs
CPAD stands for Chinese Prompt Attack Dataset

Published by Liu et al. (Dec 2023): Goal-Oriented Prompt Attack and Safety Evaluation for LLMs. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 11.01.2024.

AART [data on GitHub] [paper at EMNLP 2023]

3,269 prompts. Each prompt is an instruction. AART was created to illustrate the AART automated red-teaming method. The dataset language is English. Dataset entries are machine-written: generated by PALM. The dataset license is CC BY 4.0.

Other notes:

Contains examples for specific geographic regions
Prompts also change up use cases and concepts

Published by Radharapu et al. (Dec 2023): AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications. This is an industry publication.

Added to SafetyPrompts.com on 23.12.2023.

DELPHI [data on GitHub] [paper at EMNLP 2023 (Industry)]

29,201 prompts. Each prompt is a question about a more or less controversial issue. DELPHI was created to evaluate LLM performance in handling controversial issues. The dataset language is English. Dataset entries are human-written: sampled from the Quora Question Pair Dataset (written by Quora users). The dataset license is CC BY 4.0.

Other notes:

Annotated for 5 levels of controversy
Annotators are native English speakers who have spent significant time in Western Europe

Published by Sun et al. (Dec 2023): DELPHI: Data for Evaluating LLMs’ Performance in Handling Controversial Issues. This is an industry publication.

Added to SafetyPrompts.com on 07.04.2024.

AdvPromptSet [data on GitHub] [paper at EMNLP 2023]

197,628 sentences. Each sentence is taken from a social media dataset. AdvPromptSet was created to evaluate LLM responses to adversarial toxicity text prompts. The dataset language is English. Dataset entries are human-written: sampled from two Jigsaw social media datasets (written by social media users). The dataset license is MIT.

Other notes:

Created as part of the ROBBIE bias benchmark
Originally labelled for toxicity by Jigsaw

Published by Esiobu et al. (Dec 2023): ROBBIE: Robust Bias Evaluation of Large Generative Language Models. This is an industry publication.

Added to SafetyPrompts.com on 11.01.2024.

BeaverTails [data on HuggingFace] [data on GitHub] [paper at NeurIPS 2023 (Poster)]

333,963 conversations. Each conversation contains a human prompt and LLM response. BeaverTails was created to evaluate and improve LLM safety on QA. The dataset language is English. Dataset entries were created in a hybrid fashion: prompts sampled from human AnthropicRedTeam data, plus model-generated responses. The dataset license is CC BY-NC 4.0.

Other notes:

16,851 unique prompts sampled from AnthropicRedTeam
Covers 14 harm categories (e.g. animal abuse)
Annotated for safety by 3.34 crowdworkers on average

Published by Ji et al. (Nov 2023): BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 11.01.2024.

MaliciousInstruct [data on GitHub] [paper on arXiv]

100 prompts. Each prompt is an unsafe question. MaliciousInstruct was created to evaluate success of generation exploit jailbreak. The dataset language is English. Dataset entries are machine-written: written by ChatGPT, then filtered by authors. The dataset license is not specified.

Other notes:

Covers ten 'malicious intentions': psychological manipulation, sabotage, theft, defamation, cyberbullying, false accusation, tax fraud, hacking, fraud, and illegal drug use.

Published by Huang et al. (Oct 2023): Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 07.04.2024.

SafetyBench [data on HuggingFace] [data on GitHub] [paper on arXiv]

11,435 multiple-choice questions. SafetyBench was created to evaluate LLM safety with multiple choice questions. The dataset languages are English and Chinese. Dataset entries were created in a hybrid fashion: sampled from existing datasets + exams (zh), then LLM augmentation (zh). The dataset license is MIT.

Other notes:

Split into 7 categories
Language distribution imbalanced across categories
Tests knowledge about safety rather than safety

Published by Zhang et al. (Sep 2023): SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 04.01.2024.

DoNotAnswer [data on HuggingFace] [data on GitHub] [paper on arXiv]

939 prompts. Each prompt is a question. DoNotAnswer was created to evaluate 'dangerous capabilities' of LLMs. The dataset language is English. Dataset entries are machine-written: generated by GPT-4. The dataset license is Apache 2.0.

Other notes:

Split across 5 risk areas and 12 harm types
Authors prompted GPT-4 to generate questions

Published by Wang et al. (Sep 2023): Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 23.12.2023.

HarmfulQA [data on HuggingFace] [data on GitHub] [paper on arXiv]

1,960 prompts. Each prompt is a question. HarmfulQA was created to evaluate and improve LLM safety. The dataset language is English. Dataset entries are machine-written: generated by ChatGPT. The dataset license is Apache 2.0.

Other notes:

Split into 10 topics (e.g. "Mathematics and Logic")
Similarity across prompts is quite high
Not all prompts are unsafe / safety-related

Published by Bhardwaj and Poria (Aug 2023): Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 23.12.2023.

ForbiddenQuestions [data on GitHub] [paper on arXiv]

46,800 prompts. Each prompt is a question targetting behaviour disallowed by OpenAI. ForbiddenQuestions was created to evaluate whether LLMs answer questions that violate OpenAI's usage policy. The dataset language is English. Dataset entries are machine-written: GPT-written templates expanded by combination. The dataset license is MIT.

Other notes:

13 scenarios with 30 questions each, then expanded by combination with 8 communities, 3 prompt setups and 5 repetations

Published by Shen et al. (Aug 2023): 'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 11.01.2024.

HarmfulQ [data on GitHub] [paper at ACL 2023]

200 prompts. Each prompt is a question. HarmfulQ was created to evaluate LLM safety. The dataset language is English. Dataset entries are machine-written: generated by GPT-3 (text-davinci-002). The dataset license is not specified.

Other notes:

Focus on 6 attributes: "racist, stereotypical, sexist, illegal, toxic, harmful"
Authors do manual filtering for overly similar questions

Published by Shaikh et al. (Jul 2023): On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 23.12.2023.

ModelWrittenAdvancedAIRisk [data on HuggingFace] [data on GitHub] [paper at ACL 2023 (Findings)]

24,516 binary-choice questions. ModelWrittenAdvancedAIRisk was created to evaluate advanced AI risk posed by LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: generated by an unnamed LLM and crowdworkers. The dataset license is CC BY 4.0.

Other notes:

32 datasets targeting 16 different topics/behaviours
For each topic, there is a human generated dataset (total of 8,116 prompts) and a LM-generated dataset (total of 16,400 prompts)
Each question has a binary answer (agree/not)

Published by Perez et al. (Jul 2023): Discovering Language Model Behaviors with Model-Written Evaluations. This is an industry publication.

Added to SafetyPrompts.com on 07.04.2024.

SafetyPrompts [data on HuggingFace] [data on GitHub] [paper on arXiv]

100,000 prompts. Each prompt is a question or instruction. SafetyPrompts was created to evaluate the safety of Chinese LLMs. The dataset language is Chinese. Dataset entries were created in a hybrid fashion: human-written examples, augmented by LLMs. The dataset license is Apache 2.0.

Other notes:

Covers 8 safety scenarios and 6 types of adv attack
Do not release 'sensitive topics' scenario

Published by Sun et al. (Apr 2023): Safety Assessment of Chinese Large Language Models. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 04.01.2024.

ProsocialDialog [data on HuggingFace] [data on GitHub] [paper at EMNLP 2022]

58,137 conversations. Each conversation starts with a potentially unsafe opening followed by constructive feedback. ProsocialDialog was created to teach conversational agents to respond to problematic content following social norms. The dataset language is English. Dataset entries were created in a hybrid fashion: GPT3-written openings with US crowdworker responses. The dataset license is MIT.

Other notes:

58,137 conversations contain 331,362 utterances
42% of utterances are labelled as 'needs caution'

Published by Kim et al. (Dec 2022): ProsocialDialog: A Prosocial Backbone for Conversational Agents. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 11.01.2024.

AnthropicRedTeam [data on HuggingFace] [data on GitHub] [paper on arXiv]

38,961 conversations. Conversations can be multi-turn, with user input and LLM output. AnthropicRedTeam was created to analyse how people red-team LLMs. The dataset language is English. Dataset entries are human-written: written by crowdworkers (Upwork + MTurk). The dataset license is MIT.

Other notes:

Created by 324 US-based crowdworkers
Ca. 80% of examples come from ca. 50 workers

Published by Ganguli et al. (Nov 2022): Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. This is an industry publication.

Added to SafetyPrompts.com on 23.12.2023.

SaFeRDialogues [data on GitHub] [paper at ACL 2022]

7,881 conversations. Each conversation contains a safety failure plus a recovery response. SaFeRDialogues was created to recovering from safety failures in LLM conversations. The dataset language is English. Dataset entries are human-written: unsafe conversation starters sampled from BAD, recovery responses written by crowdworkers. The dataset license is MIT.

Other notes:

Unsafe conversation starters are taken from BAD
Download only via ParlAI

Published by Ung et al. (May 2022): SaFeRDialogues: Taking Feedback Gracefully after Conversational Safety Failures. This is an industry publication.

Added to SafetyPrompts.com on 11.01.2024.

SafetyKit [data on GitHub] [paper at ACL 2022]

990 prompts. Each prompt is a question, instruction or statement. SafetyKit was created to quickly assess apparent safety concerns in conversational AI. The dataset language is English. Dataset entries are human-written: sampled from several human-written datasets. The dataset license is MIT.

Other notes:

Unit tests for instigator and yea-sayer effect
Also provide 'integration' test that require human evaluation
Download only via ParlAI

Published by Dinan et al. (May 2022): SafetyKit: First Aid for Measuring Safety in Open-domain Conversational Systems. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 11.01.2024.

AnthropicHarmlessBase [data on HuggingFace] [data on GitHub] [paper on arXiv]

44,849 conversational turns. Each turn consists of a user prompt and multiple LLM completions. AnthropicHarmlessBase was created to red-team LLMs. The dataset language is English. Dataset entries are human-written: written by crowdworkers (Upwork + MTurk). The dataset license is MIT.

Other notes:

Most prompts created by 28 US-based crowdworkers

Published by Bai et al. (Apr 2022): Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. This is an industry publication.

Added to SafetyPrompts.com on 23.12.2023.

BAD [data on GitHub] [paper at NAACL 2021]

78,874 conversations. Conversations can be multi-turn, with user input and LLM output. BAD was created to evaluate the safety of conversational agents. The dataset language is English. Dataset entries are human-written: written by crowdworkers with the goal of making models give unsafe responses, and also validated by multiple annotators. The dataset license is MIT.

Other notes:

Download only via ParlAI
Approximately 40% of all dialogues are annotated as offensive, with a third of offensive utterances generated by bots

Published by Xu et al. (Jun 2021): Bot-Adversarial Dialogue for Safe Conversational Agents. This is an industry publication.

Added to SafetyPrompts.com on 23.12.2023.

RealToxicityPrompts [data on HuggingFace] [data on GitHub] [paper at EMNLP 2020 (Findings)]

99,442 prompts. Each prompt is an unfinished sentence from the OpenWebCorpus. RealToxicityPrompts was created to evaluate propensity of LLMs to generate toxic content. The dataset language is English. Dataset entries are human-written: sampled OpenWebCorpus sentences. The dataset license is Apache 2.0.

Other notes:

Sampled using PerspectiveAPI toxicity threshold
22k with toxicity score ≥0.5

Published by Gehman et al. (Nov 2020): RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 23.12.2023.

ParlAIDialogueSafety [data on GitHub] [paper at EMNLP 2019]

90,000 prompts. 30k are for multi-turn tasks, and 60k for single-turn tasks. ParlAIDialogueSafety was created to evaluate and improve the safety of conversational agents. The dataset language is English. Dataset entries are human-written: written by crowdworkers in isolation ("standard"), or with the goal of making a model give an offensive response ("adversarial"). The dataset license is MIT.

Other notes:

Download only via ParlAI
The "single-turn" dataset provides a "standard" and "adversarial" setting, with 3 rounds of data collection each

Published by Dinan et al. (Nov 2019): Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 23.12.2023.

EmpatheticDialogues [data on HuggingFace] [data on GitHub] [paper at ACL 2019]

24,850 conversations. Each conversation is about an emotional situation described by one speaker, across one or multiple turns. EmpatheticDialogues was created to train dialogue agents to be more empathetic. The dataset language is English. Dataset entries are human-written: written by crowdworkers (US MTurkers). The dataset license is CC BY-NC 4.0.

Other notes:

810 crowdworkers participated in dataset creation

Published by Rashkin et al. (Jul 2019): Towards Empathetic Open-domain Conversation Models: a New Benchmark and Dataset. This is an industry publication.

Added to SafetyPrompts.com on 07.04.2024.

Narrow Safety Datasets

RuLES [data on HuggingFace] [data on GitHub] [paper on arXiv]

862 prompts. Each prompt is a test case combining rules and instructions. RuLES was created to evaluate the ability of LLMs to follow simple rules. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is Apache 2.0.

Other notes:

The dataset covers 19 rules across 14 scenarios

Published by Mu et al. (Mar 2024): Can LLMs Follow Simple Rules?. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 11.01.2024.

CoNA [data on GitHub] [paper at ICLR 2024 (Poster)]

178 prompts. Each prompt is an instruction. CoNA was created to evaluate compliance of LLMs with harmful instructions. The dataset language is English. Dataset entries are human-written: sampled from MT-CONAN, then rephrased. The dataset license is not specified.

Other notes:

Focus: harmful instructions (e.g. hate speech)
All prompts are (meant to be) unsafe

Published by Bianchi et al. (Feb 2024): Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 23.12.2023.

ControversialInstructions [data on GitHub] [paper at ICLR 2024 (Poster)]

40 prompts. Each prompt is an instruction. ControversialInstructions was created to evaluate LLM behaviour on controversial topics. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is not specified.

Other notes:

Focus: controversial topics (e.g. immigration)
All prompts are (meant to be) unsafe

Published by Bianchi et al. (Feb 2024): Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 23.12.2023.

PhysicalSafetyInstructions [data on GitHub] [paper at ICLR 2024 (Poster)]

1,000 prompts. Each prompt is an instruction. PhysicalSafetyInstructions was created to evaluate LLM commonsense physical safety. The dataset language is English. Dataset entries are human-written: sampled from SafeText, then rephrased. The dataset license is not specified.

Other notes:

Focus: commonsense physical safety
50 safe and 50 unsafe prompts

Published by Bianchi et al. (Feb 2024): Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 23.12.2023.

SycophancyEval [data on HuggingFace] [data on GitHub] [paper at ICLR 2024 (Poster)]

20,956 prompts. Each prompt is an open-ended question or instruction. SycophancyEval was created to evaluate sycophancy in LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: prompts are written by humans and models. The dataset license is not specified.

Other notes:

Dataset uses four different task setups to evaluate sycophancy: answer (7268 prompts), are_you_sure (4888 prompts), feedback (8500 prompts), mimicry (300 prompts)

Published by Sharma et al. (Feb 2024): Towards Understanding Sycophancy in Language Models. This is an industry publication.

Added to SafetyPrompts.com on 07.04.2024.

ConfAIde [data on GitHub] [paper at ICLR 2024 (Spotlight)]

1,326 prompts. Each prompt is a reasoning task, with complexity increasing across 4 tiers. Answer formats vary. ConfAIde was created to evaluate the privacy-reasoning capabilities of instruction-tuned LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination and with LLMs. The dataset license is MIT.

Other notes:

The benchmark is split into 4 tiers with different prompt formats
Tier 1 contains 10 prompts, tier 2 2*98, tier 3 4*270, tier 4 50

Published by Mireshghallah et al. (Feb 2024): Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 11.01.2024.

CyberattackAssistance [data on GitHub] [paper on arXiv]

1,000 prompts. Each prompt is an instruction to assist in a cyberattack. CyberattackAssistance was created to evaluate LLM compliance in assisting in cyberattacks. The dataset language is English. Dataset entries were created in a hybrid fashion: written by experts, augmented with LLMs. The dataset license is custom (Llama2 Community License).

Other notes:

Instructions are split into 10 MITRE categories
The dataset comes with additional LLM-rephrased instructions

Published by Bhatt et al. (Dec 2023): Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models. This is an industry publication.

Added to SafetyPrompts.com on 23.12.2023.

SPMisconceptions [data on GitHub] [paper at ACSAC 2023]

122 prompts. Each prompt is a single-sentence misconception. SPMisconceptions was created to measure the ability of LLMs to refute misconceptions. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is MIT.

Other notes:

Misconceptions all relate to security and privacy
Uses templates to turn misconceptions into prompts
Covers six categories (e.g. crypto and blockchain, law and regulation)

Published by Chen et al. (Dec 2023): Can Large Language Models Provide Security & Privacy Advice? Measuring the Ability of LLMs to Refute Misconceptions. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 11.01.2024.

PromptExtractionRobustness [data on HuggingFace] [data on GitHub] [paper on arXiv]

569 samples. Each sample combines defense and attacker input. PromptExtractionRobustness was created to evaluate LLM vulnerability to prompt extraction. The dataset language is English. Dataset entries are human-written: written by Tensor Trust players. The dataset license is not specified.

Other notes:

Filtered from larger raw prompt extraction dataset
Collected using the open Tensor Trust online game

Published by Toyer et al. (Nov 2023): Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 11.01.2024.

PromptHijackingRobustness [data on HuggingFace] [data on GitHub] [paper on arXiv]

775 samples. Each sample combines defense and attacker input. PromptHijackingRobustness was created to evaluate LLM vulnerability to prompt hijacking. The dataset language is English. Dataset entries are human-written: written by Tensor Trust players. The dataset license is not specified.

Other notes:

Filtered from larger raw prompt extraction dataset
Collected using the open Tensor Trust online game

Published by Toyer et al. (Nov 2023): Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 11.01.2024.

XSTest [data on HuggingFace] [data on GitHub] [paper on arXiv]

450 prompts. Each prompt is a simple question. XSTest was created to evaluate exaggerated safety behaviour in LLMs. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is CC BY 4.0.

Other notes:

Split into ten types of prompts
250 safe prompts and 200 unsafe prompts

Published by Röttger et al. (Oct 2023): XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 23.12.2023.

LatentJailbreak [data on GitHub] [paper on arXiv]

416 prompts. Each prompt combines a meta-instruction (e.g. translation) with the same toxic instruction template. LatentJailbreak was created to evaluate safety and robustness of LLMs in response to adversarial prompts. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

Focus of the work is adversarial / to jailbreak LLMs
13 prompt templates instantiated with 16 protected group terms and 2 posititional types
Main exploit focuses on translation

Published by Qiu et al. (Aug 2023): Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 11.01.2024.

DoAnythingNow [data on GitHub] [paper on arXiv]

6,387 prompts. Each prompt is an instruction or question, sometimes with a jailbreak. DoAnythingNow was created to characterise and evaluate in-the-wild LLM jailbreak prompts. The dataset language is English. Dataset entries are human-written: written by users on Reddit, Discord, websites and in other datasets. The dataset license is MIT.

Other notes:

There are 666 jailbreak prompts among the 6,387prompts

Published by Shen et al. (Aug 2023): 'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 11.01.2024.

ModelWrittenSycophancy [data on HuggingFace] [data on GitHub] [paper at ACL 2023 (Findings)]

30,051 binary-choice questions. ModelWrittenSycophancy was created to evaluate sycophancy in LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: questions sampled from surveys with contexts generated by an unnamed LLM. The dataset license is CC BY 4.0.

Other notes:

3 datasets targeting different topics/behaviours
Each dataset contains around 10k questions
Each question has a binary answer (agree/not)

Published by Perez et al. (Jul 2023): Discovering Language Model Behaviors with Model-Written Evaluations. This is an industry publication.

Added to SafetyPrompts.com on 07.04.2024.

PersonalInfoLeak [data on GitHub] [paper at EMNLP 2022 (Findings)]

3,238 entries. Each entry is a tuple of name and email address. PersonalInfoLeak was created to evaluate whether LLMs are prone to leaking PII. The dataset language is English. Dataset entries are human-written: sampled from Enron email corpus. The dataset license is Apache 2.0.

Other notes:

Main task is to predict email given name

Published by Huang et al. (Dec 2022): Are Large Pre-Trained Language Models Leaking Your Personal Information?. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 27.02.2024.

SafeText [data on GitHub] [paper at EMNLP 2022]

367 prompts. Prompts are combined with 1,465 commands to create pieces of advice. SafeText was created to evaluate commonsense physical safety. The dataset language is English. Dataset entries are human-written: written by Reddit users, posts sampled with multiple filtering steps. The dataset license is MIT.

Other notes:

5 ratings for relevance per item during filtering
Advice format most often elicits yes/no answer

Published by Levy et al. (Dec 2022): SafeText: A Benchmark for Exploring Physical Safety in Language Models. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 23.12.2023.

ToxiGen [data on HuggingFace] [data on GitHub] [paper at ACL 2022]

260,851 prompts. Each prompt comprises n-shot examples of toxic content. ToxiGen was created to create new examples of implicit hate speech. The dataset language is English. Dataset entries are human-written: sampled from Gab and Reddit (hate), news and blogs (not hate). The dataset license is MIT.

Other notes:

Covers 13 target groups
Seed prompts are used to generate implicit hate
Evaluating generative LLMs is not the focus

Published by Hartvigsen et al. (May 2022): ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 23.12.2023.

TruthfulQA [data on HuggingFace] [data on GitHub] [paper at ACL 2022]

817 prompts. Each prompt is a question. TruthfulQA was created to evaluate truthfulness in LLM answers. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is CC BY 4.0.

Other notes:

Covers 38 categories (e.g. health and politics)
Comes with multiple choice expansion

Published by Lin et al. (May 2022): TruthfulQA: Measuring How Models Mimic Human Falsehoods. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 23.12.2023.

Value Alignment Datasets

MoralChoice [data on HuggingFace] [data on GitHub] [paper at NeurIPS 2023 (Spotlight)]

1,767 binary-choice question. Each prompt is a hypothetical moral scenario with two potential actions. MoralChoice was created to evaluate the moral beliefs in LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: mostly generated by GPTs, plus some human-written scenarios. The dataset license is CC BY 4.0.

Other notes:

687 scenarios are low-ambiguity, 680 are high-ambiguity
Three Surge annotators choose the favourable action for each scenario

Published by Scherrer et al. (Nov 2023): Evaluating the Moral Beliefs Encoded in LLMs. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 11.01.2024.

OpinionQA [data on GitHub] [paper at ICML 2023 (Oral)]

1,498 multiple-choice questions. OpinionQA was created to evaluate the alignment of LLM opinions with US demographic groups. The dataset language is English. Dataset entries are human-written: adapted from the Pew American Trends Panel surveys. The dataset license is not specified.

Other notes:

Questions taken from 15 ATP surveys
Covers 60 demographic groups

Published by Santurkar et al. (Jul 2023): Whose Opinions Do Language Models Reflect?. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 04.01.2024.

CValuesResponsibilityMC [data on HuggingFace] [data on GitHub] [paper on arXiv]

1,712 multiple-choice questions. Each question targets responsible behaviours. CValuesResponsibilityMC was created to evaluate human value alignment in Chinese LLMs. The dataset language is Chinese. Dataset entries are machine-written: automatically created from human-written prompts. The dataset license is Apache 2.0.

Other notes:

Distinguishes between unsafe and irresponsible responses
Covers 8 domains (e.g. psychology, law, data science)

Published by Xu et al. (Jul 2023): CVALUES: Measuring the Values of Chinese Large Language Models from Safety to Responsibility. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 04.01.2024.

CValuesResponsibilityPrompts [data on HuggingFace] [data on GitHub] [paper on arXiv]

800 prompts. Each prompt is an open question targeting responsible behaviours. CValuesResponsibilityPrompts was created to evaluate human value alignment in Chinese LLMs. The dataset language is Chinese. Dataset entries are human-written: written by the authors. The dataset license is Apache 2.0.

Other notes:

Distinguishes between unsafe and irresponsible responses
Covers 8 domains (e.g. psychology, law, data science)

Published by Xu et al. (Jul 2023): CVALUES: Measuring the Values of Chinese Large Language Models from Safety to Responsibility. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 04.01.2024.

ModelWrittenPersona [data on HuggingFace] [data on GitHub] [paper at ACL 2023 (Findings)]

133,204 binary-choice questions. ModelWrittenPersona was created to evaluate LLM behaviour related to their stated political and religious views, personality, moral beliefs, and desire to pursue potentially dangerous goals. The dataset language is English. Dataset entries are machine-written: generated by an unnamed LLM. The dataset license is CC BY 4.0.

Other notes:

133 datasets targeting different topics/behaviours
Most datasets contain around 1k questions
Each question has a binary answer (agree/not)

Published by Perez et al. (Jul 2023): Discovering Language Model Behaviors with Model-Written Evaluations. This is an industry publication.

Added to SafetyPrompts.com on 23.12.2023.

GlobalOpinionQA [data on HuggingFace] [paper on arXiv]

2,556 multiple-choice questions. GlobalOpinionQA was created to evaluate whose opinions LLM responses are most similar to. The dataset language is English. Dataset entries are human-written: adapted from Pew’s Global Attitudes surveys and the World Values Survey (written by survey designers). The dataset license is CC BY-NC-SA 4.0.

Other notes:

Comes with responses from people across the globe
Goal is to capture more diversity than OpinionQA

Published by Durmus et al. (Jun 2023): Towards Measuring the Representation of Subjective Global Opinions in Language Models. This is an industry publication.

Added to SafetyPrompts.com on 04.01.2024.

DICES350 [data on GitHub] [paper on arXiv]

350 conversations. Conversations can be multi-turn, with user input and LLM output. DICES350 was created to collect diverse perspectives on conversational AI safety. The dataset language is English. Dataset entries are human-written: written by adversarial Lamda users. The dataset license is CC BY 4.0.

Other notes:

104 ratings per item
Annotators from US
Annotation across 24 safety criteria

Published by Aroyo et al. (Jun 2023): DICES Dataset: Diversity in Conversational AI Evaluation for Safety. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 23.12.2023.

DICES990 [data on GitHub] [paper on arXiv]

990 conversations. Conversations can be multi-turn, with user input and LLM output. DICES990 was created to collect diverse perspectives on conversational AI safety. The dataset language is English. Dataset entries are human-written: written by adversarial Lamda users. The dataset license is CC BY 4.0.

Other notes:

60-70 ratings per item
Annotators from US and India
Annotation across 16 safety criteria

Published by Aroyo et al. (Jun 2023): DICES Dataset: Diversity in Conversational AI Evaluation for Safety. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 23.12.2023.

Machiavelli [data on GitHub] [paper at ICML 2023 (Oral)]

572,322 scenarios. Each scenario is a choose-your-own-adventure style prompt. Machiavelli was created to evaluate ethical behaviour of LLM agents. The dataset language is English. Dataset entries are human-written: human-written choose-your-own-adventure stories. The dataset license is MIT.

Other notes:

Goal is to identify behaviours like power-seeking
Choices within scenarios are LLM-annotated
Similar to JiminyCricket but covers more games and more scenarios

Published by Pan et al. (Jun 2023): Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 04.01.2024.

MoralExceptQA [data on HuggingFace] [data on GitHub] [paper at NeurIPS 2022]

148 prompts. Each prompt is a question about a vignette (situation) related to a specific norm. MoralExceptQA was created to evaluate LLM ability to understand, interpret and predict human moral judgments and decisions. The dataset language is English. Dataset entries are human-written: written by the authors (?). The dataset license is not specified.

Other notes:

Covers 3 norms: no cutting in line, no interfering with someone else's propert, and no cannonballing in the pool

Published by Jin et al. (Oct 2022): When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 07.04.2024.

MIC [data on GitHub] [paper at ACL 2022]

38,000 conversations. Each conversation is a single-turn with user input and LLM output. MIC was created to understand the intuitions, values and moral judgments reflected in LLMs. The dataset language is English. Dataset entries are human-written: questions sampled from AskReddit. The dataset license is CC BY-SA 4.0.

Other notes:

Based on the RoT paradigm introduced in SocialChemistry
38k prompt-reply pairs come with 99k rules of thumb and 114k annotations

Published by Ziems et al. (May 2022): The Moral Integrity Corpus: A Benchmark for Ethical Dialogue Systems. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 07.04.2024.

JiminyCricket [data on GitHub] [paper at NeurIPS 2021]

1,838 locations. Each location comes with a number of actions to choose from. JiminyCricket was created to evaluate alignment of agents with human values and morals. The dataset language is English. Dataset entries are human-written: sampled from 25 text-based adventure games, then annotated for morality by human annotators. The dataset license is MIT.

Other notes:

LLMs as agents play each game to maximise reward and are evaluated for morality along the way

Published by Hendrycks et al. (Dec 2021): What Would Jiminy Cricket Do? Towards Agents That Behave Morally. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 07.04.2024.

MoralStories [data on HuggingFace] [data on GitHub] [paper at EMNLP 2021]

12,000 stories. Each story consists of seven sentences. MoralStories was created to evaluate commonsense, moral and social reasoning skills of LLMs. The dataset language is English. Dataset entries are human-written: written and validated by US MTurkers. The dataset license is MIT.

Other notes:

Each story contains: norm, situation, intention, normative action, normative consequence, divergent action, divergent consequence
Supports multiple task formats: reasoning, classification, generation

Published by Emelin et al. (Nov 2021): Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 04.01.2024.

ETHICS [data on HuggingFace] [data on GitHub] [paper at ICLR 2021]

134,420 binary-choice question. Each prompt is a scenario about ethical reasoning with two actions to choose from. ETHICS was created to assess LLM basic knowledge of ethics and common human values. The dataset language is English. Dataset entries are human-written: written and validated by crowdworkers (US, UK and Canadian MTurkers). The dataset license is MIT.

Other notes:

Scenarios concern justice, deontology, virtue ethics, utilitarianism, and commonsense moral intuitions
Scenarios are constructed to be clear-cut
Task format varies by type of scenario

Published by Hendrycks et al. (Jul 2021): Aligning AI With Shared Human Values. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 04.01.2024.

ScruplesAnecdotes [data on GitHub] [paper at AAAI 2021]

32,766 anecdotes. Each anecdote describes an action in the context of a situation. ScruplesAnecdotes was created to evaluate LLM understanding of ethical norms. The dataset language is English. Dataset entries are human-written: sampled from Reddit AITA communities. The dataset license is Apache 2.0.

Other notes:

Task is to predict who is in the wrong

Published by Lourie et al. (Mar 2021): SCRUPLES: A Corpus of Community Ethical Judgments on 32,000 Real-life Anecdotes. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 07.04.2024.

ScruplesDilemmas [data on GitHub] [paper at AAAI 2021]

10,000 binary-choice question. Each prompt pairs two actions and identifies which one crowd workers found less ethical. ScruplesDilemmas was created to evaluate LLM understanding of ethical norms. The dataset language is English. Dataset entries are human-written: sampled from Reddit AITA communities. The dataset license is Apache 2.0.

Other notes:

Task is to rank alternative actions based on which one is more ethical

Published by Lourie et al. (Mar 2021): SCRUPLES: A Corpus of Community Ethical Judgments on 32,000 Real-life Anecdotes. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 07.04.2024.

SocialChemistry101 [data on GitHub] [paper at EMNLP 2020]

292,000 sentences. Each sentence is a rule of thumb. SocialChemistry101 was created to evaluate the ability of LLMs to reason about social and moral norms. The dataset language is English. Dataset entries are human-written: written by US crowdworkers, based on situations described on social media. The dataset license is CC BY-SA 4.0.

Other notes:

Dataset contains 365k structured annotations
292k rules of thumb generated from 104k situations
137 US-based crowdworkers participated

Published by Forbes et al. (Nov 2020): Social Chemistry 101: Learning to Reason about Social and Moral Norms. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 04.01.2024.

Bias Datasets

CALM [data on HuggingFace] [data on GitHub] [paper on arXiv]

78,400 examples. Each example is a QA, sentiment classification or NLI example. CALM was created to evaluate LLM gender and racial bias across different tasks and domains. The dataset language is English. Dataset entries were created in a hybrid fashion: templates created by humans based on other datasets, then expanded by combination. The dataset license is MIT.

Other notes:

The dataset covers three tasks: QA, sentiment classification, NLI
The dataset covers 2 categories of bias: gender and race
78,400 examples are generated from 224 templates
Gender and race are instantiated using names

Published by Gupta et al. (Jan 2024): CALM: A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 11.01.2024.

DiscrimEval [data on HuggingFace] [paper on arXiv]

9,450 binary-choice questions. Each question comes with scenario context. DiscrimEval was created to evaluate the potential discriminatory impact of LMs across use cases. The dataset language is English. Dataset entries are machine-written: topics, templates and questions generated by Claude. The dataset license is CC BY 4.0.

Other notes:

Covers 70 different decision scenarios
Each question comes with an 'implicit' version where race and gender are conveyed through associated names
Covers 3 categories of bias: race, gender, age

Published by Tamkin et al. (Dec 2023): Evaluating and Mitigating Discrimination in Language Model Decisions. This is an industry publication.

Added to SafetyPrompts.com on 11.01.2024.

HolisticBiasR [data on GitHub] [paper at EMNLP 2023]

214,460 prompts. Each prompt is the beginning of a sentence related to a person's sociodemographics. HolisticBiasR was created to evaluate LLM completions for sentences related to individual sociodemographics. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

Created as part of the ROBBIE bias benchmark
Constructed from 60 Regard templates
Uses noun phrases from Holistic Bias
Covers 11 categories of bias: age, body type, class, culture, disability, gender, nationality, political ideology, race/ethnicity, religion, sexual orientation

Published by Esiobu et al. (Dec 2023): ROBBIE: Robust Bias Evaluation of Large Generative Language Models. This is an industry publication.

Added to SafetyPrompts.com on 11.01.2024.

GEST [data on HuggingFace] [data on GitHub] [paper on arXiv]

3,565 sentences. Each sentence corresponds to a specific gender stereotype. GEST was created to measure gender-stereotypical reasoning in language models and machine translation systems.. The dataset languages are Belarussian, Russian, Ukrainian, Croation, Serbian, Slovene, Czech, Polish, Slovak and English. Dataset entries are human-written: written by professional translators, then validated by the authors. The dataset license is Apache 2.0.

Other notes:

Data can be used to evaluate MLM or MT models
Covers 16 specific gender stereotype (e.g. 'women are beautiful')
Covers 1 category of bias: gender

Published by Pikuliak et al. (Nov 2023): Women Are Beautiful, Men Are Leaders: Gender Stereotypes in Machine Translation and Language Modeling. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 05.02.2024.

CHBias [data on GitHub] [paper at ACL 2023]

4,800 Weibo posts. Each post references a target group. CHBias was created to evaluate LLM bias related to sociodemographics. The dataset language is Chinese. Dataset entries are human-written: posts sampled from Weibo (written by Weibo users). The dataset license is MIT.

Other notes:

4 bias categories: gender, sexual orientation, age, appearance
Annotated by Chinese NLP grad students
Similar evaluation setup to CrowS pairs

Published by Zhao et al. (Jul 2023): CHBias: Bias Evaluation and Mitigation of Chinese Conversational Language Models. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 04.01.2024.

SeeGULL [data on GitHub] [paper at ACL 2023]

7,750 tuples. Each tuple is an identity group plus a stereotype attribute. SeeGULL was created to expand cultural and geographic coverage of stereotype benchmarks. The dataset language is English. Dataset entries were created in a hybrid fashion: generated by LLMs, partly validated by human annotation. The dataset license is CC BY-SA 4.0.

Other notes:

Stereotypes about identity groups spanning 178 countries across 8 different geo-political regions across 6 continents
Examples accompanied by fine-grained offensiveness scores

Published by Jha et al. (Jul 2023): SeeGULL: A Stereotype Benchmark with Broad Geo-Cultural Coverage Leveraging Generative Models. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 05.02.2024.

WinoGenerated [data on GitHub] [paper at ACL 2023 (Findings)]

3,000 sentences. Each sentence refers to a person by their occupation and uses pronouns. WinoGenerated was created to evaluate pronoun gender biases in LLMs. The dataset language is English. Dataset entries are machine-written: generated by an unnamed LLM. The dataset license is CC BY 4.0.

Other notes:

Expansion of original 60-example WinoGender
Task is to fill in pronoun blanks
Covers 1 category of bias: ternary gender

Published by Perez et al. (Jul 2023): Discovering Language Model Behaviors with Model-Written Evaluations. This is an industry publication.

Added to SafetyPrompts.com on 23.12.2023.

WinoQueer [data on GitHub] [paper at ACL 2023]

45,540 sentence pairs. Each pair comprises two sentences that are identical except for identity group references. WinoQueer was created to evaluate LLM bias related to queer identity terms. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is not specified.

Other notes:

Setup matches CrowSPairs
Generated from 11 template sentences, 9 queer identity groups, 3 sets of pronouns, 60 common names, and 182 unique predicates.
Covers 2 categories of bias: gender, sexual orientation

Published by Felkner et al. (Jul 2023): WinoQueer: A Community-in-the-Loop Benchmark for Anti-LGBTQ+ Bias in Large Language Models. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 04.01.2024.

CDialBias [data on GitHub] [paper at EMNLP 2022]

28,343 conversations. Each conversation is a single turn between two Zhihu users. CDialBias was created to evaluate bias in Chinese social media conversations. The dataset language is Chinese. Dataset entries are human-written: sampled from social media site Zhihu. The dataset license is not specified.

Other notes:

Cover 4 categories of bias: race, gender, religion, occupation

Published by Zhou et al. (Dec 2022): Towards Identifying Social Bias in Dialog Systems: Framework, Dataset, and Benchmark. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 07.04.2024.

HolisticBias [data on GitHub] [paper at EMNLP 2022]

459,758 prompts. Each prompt is a sentence starting a two-person conversation. HolisticBias was created to evaluate LLM biases related to sociodemographics . The dataset language is English. Dataset entries were created in a hybrid fashion: human-written attributes combined in templates. The dataset license is MIT.

Other notes:

26 sentence templates
Covers 13 categories of bias: ability, age, body type, characteristics, culturural, gender/sex, nationality, nonce, political, race/ethnicity, religion, sexual orientation, socioeconomic

Published by Smith et al. (Dec 2022): I’m sorry to hear that: Finding New Biases in Language Models with a Holistic Descriptor Dataset. This is an industry publication.

Added to SafetyPrompts.com on 04.01.2024.

IndianStereotypes [data on GitHub] [paper at AACL 2022]

3,852 tuples. Each tuple is an identity group plus a stereotype attribute. IndianStereotypes was created to create stereotype benchmark for the Indian context. The dataset language is English. Dataset entries are human-written: sampled from IndicCorp-en. The dataset license is Apache 2.0.

Other notes:

Related to SeeGull

Published by Bhatt et al. (Nov 2022): Re-contextualizing Fairness in NLP: The Case of India. This is an industry publication.

Added to SafetyPrompts.com on 07.04.2024.

BBQ [data on HuggingFace] [data on GitHub] [paper at ACL 2022 (Findings)]

58,492 examples. Each example is a context plus two questions with answer choices. BBQ was created to evaluate social biases of LLMs in question answering. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is CC BY 4.0.

Other notes:

Focus on stereotyping behaviour
Covers 9 categories of bias: age, disability status, gender identity, nationality, physical appearance, race/ethnicity, religion, socioeconomic status, sexual orientation
25+ templates per category

Published by Parrish et al. (May 2022): BBQ: A Hand-Built Bias Benchmark for Question Answering. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 04.01.2024.

FrenchCrowSPairs [data on HuggingFace] [data on GitHub] [paper at ACL 2022]

1,679 sentence pairs. Each pair comprises two sentences that are identical except for identity group references. FrenchCrowSPairs was created to evaluate LLM bias related to sociodemographics. The dataset language is French. Dataset entries are human-written: written by authors, partly translated from English CrowSPairs. The dataset license is CC BY-SA 4.0.

Other notes:

Translated from English CrowSPairs, plus manual additions
Covers 10 categories of bias: ethnicity, gender, sexual orientation, religion, age, nationality, disability, socioeconomic status / occupation, physical appearance, other

Published by Neveol et al. (May 2022): French CrowS-Pairs: Extending a challenge dataset for measuring social bias in masked language models to a language other than English. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 07.04.2024.

BiasOutOfTheBox [data on GitHub] [paper at NeurIPS 2021]

228 prompts. Each prompt is an unfinished sentence about an individual with specified sociodemographics. BiasOutOfTheBox was created to evaluate intersectional occupational biases in GPT-2. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

Covers 6 categories of bias: gender, ethnicity, religion, sexuality, political preference, cultural origin (continent)

Published by Kirk et al. (Dec 2021): Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 07.04.2024.

EthnicBias [data on GitHub] [paper at EMNLP 2021]

60 templates. Each template is filled with country and attribute. EthnicBias was created to evaluate ethnic bias in masked language models. The dataset languages are English, German, Spanish, Korean, Turkish and Chinese. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

10 templates per language
Covers 3 categories of bias: national origin, occupation, legal status

Published by Ahn and Oh (Nov 2021): Mitigating Language-Dependent Ethnic Bias in BERT. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 07.04.2024.

RedditBias [data on GitHub] [paper at ACL 2021]

11,873 Reddit comments. Each comment references a target group. RedditBias was created to evaluate LLM bias related to sociodemographics. The dataset language is English. Dataset entries are human-written: written by Reddit users. The dataset license is MIT.

Other notes:

Covers 4 categories of bias: religion, race, gender, queerness
Evaluation by perplexity and conversation

Published by Barikeri et al. (Aug 2021): RedditBias: A Real-World Resource for Bias Evaluation and Debiasing of Conversational Language Models. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 07.04.2024.

HypothesisStereotypes [data on GitHub] [paper at ACL 2021 (Findings)]

2,098 prompts. Each prompt is an NLI premise. HypothesisStereotypes was created to study how stereotypes manifest in LLM-generated NLI hypotheses. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

Uses 103 context situations as templates
Covers 6 categories of bias: gender, race, nationality, religion, politics, socio
Task for LLM is to generate hypothesis based on premise

Published by Sotnikova et al. (Aug 2021): Analyzing Stereotypes in Generative Text Inference Tasks. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 11.01.2024.

StereoSet [data on HuggingFace] [data on GitHub] [paper at ACL 2021]

16,955 multiple-choice questions. Each question is either about masked word or whole sentence association. StereoSet was created to evaluate LLM bias related to sociodemographics. The dataset language is English. Dataset entries are human-written: written by US MTurkers. The dataset license is CC BY-SA 4.0.

Other notes:

Covers intersentence and intrasentence context
Covers 4 categories of bias: gender, profession, race, and religion

Published by Nadeem et al. (Aug 2021): StereoSet: Measuring Stereotypical Bias in Pretrained Language Models. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 04.01.2024.

HONEST [data on HuggingFace] [data on GitHub] [paper at NAACL 2021]

2,520 prompts. Each prompt is the beginning of a sentence related to identity groups. HONEST was created to measure hurtful sentence completions from LLMs. The dataset languages are English, Italian, French, Portuguese, Romanian and Spanish. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

420 instances per language
Generated from 28 identity terms and 15 templates
Covers 1 category of bias: binary gender

Published by Nozza et al. (Jun 2021): HONEST: Measuring Hurtful Sentence Completion in Language Models. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 11.01.2024.

LMBias [data on GitHub] [paper at ICML 2021]

16,388 prompts. Each prompt is an unfinished sentence. LMBias was created to understand and mitigate social biases in LMs. The dataset language is English. Dataset entries are human-written: sampled from existing corpora including Reddit and WikiText. The dataset license is MIT.

Other notes:

Covers two categories of bias: binary gender, religion

Published by Liang et al. (May 2021): Towards Understanding and Mitigating Social Biases in Language Models. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 07.04.2024.

BOLD [data on HuggingFace] [data on GitHub] [paper at FAccT 2021]

23,679 prompts. Each prompt is an unfinished sentence from Wikipedia. BOLD was created to evaluate bias in text generation. The dataset language is English. Dataset entries are human-written: sampled starting sentences of Wikipedia articles. The dataset license is CC BY-SA 4.0.

Other notes:

Similar to RealToxicityPrompts but for bias
Covers 5 categories of bias: profession, gender, race, religion, and political ideology

Published by Dhamala et al. (Mar 2021): BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation. This is an industry publication.

Added to SafetyPrompts.com on 23.12.2023.

CrowSPairs [data on HuggingFace] [data on GitHub] [paper at EMNLP 2020]

1,508 sentence pairs. Each pair comprises two sentences that are identical except for identity group references. CrowSPairs was created to evaluate LLM bias related to sociodemographics. The dataset language is English. Dataset entries are human-written: written by crowdworkers (US MTurkers). The dataset license is CC BY-SA 4.0.

Other notes:

Validated with 5 annotations per entry
Covers 9 categories of bias: race, gender/gender identity, sexual orientation, religion, age, nationality, disability, physical appearance, and socioeconomic status.

Published by Nangia et al. (Nov 2020): CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 04.01.2024.

UnQover [data on GitHub] [paper at EMNLP 2020 (Findings)]

44 templates. Each template is combined with subjects and attributres to create an underspecified question. UnQover was created to evaluate stereotyping biases in QA systems. The dataset language is English. Dataset entries were created in a hybrid fashion: templates written by authors, subjects and attributes sampled from StereoSet and hand-written. The dataset license is Apache 2.0.

Other notes:

Covers 4 categories of bias: gender, nationality, ethnicity, religion

Published by Li et al. (Nov 2020): UNQOVERing Stereotyping Biases via Underspecified Questions. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 07.04.2024.

Regard [data on GitHub] [paper at EMNLP 2019]

60 prompts. Each prompt is an unfinished sentence. Regard was created to evaluate biases in natural language generation. The dataset language is English. Dataset entries are human-written: written by the authors. The dataset license is not specified.

Other notes:

Covers 3 categories of bias: binary gender, race, sexual orientation

Published by Sheng et al. (Nov 2019): The Woman Worked as a Babysitter: On Biases in Language Generation. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 07.04.2024.

WinoBias [data on HuggingFace] [data on GitHub] [paper at NAACL 2018]

3,160 sentences. Each sentence refers to a person by their occupation and uses pronouns. WinoBias was created to evaluate gender bias in coreference resolution. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

Uses two different sentence templates
Designed to evalauate coref resolution systems
Covers 1 category of bias: binary gender

Published by Zhao et al. (Jun 2018): Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 04.01.2024.

WinoGender [data on HuggingFace] [data on GitHub] [paper at NAACL 2018]

720 sentences. Each sentence refers to a person by their occupation and uses pronouns. WinoGender was created to evaluate gender bias in coreference resolution. The dataset language is English. Dataset entries were created in a hybrid fashion: human-written templates expanded by combination. The dataset license is MIT.

Other notes:

Uses two different sentence templates
Covers 1 category of bias: ternary gender

Published by Rudinger et al. (Jun 2018): Gender Bias in Coreference Resolution. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 04.01.2024.

Other Datasets

Mosscap [data on HuggingFace] [paper at blog]

278,945 prompts. Most prompts are prompt extraction attacks. Mosscap was created to analyse prompt hacking / extraction attacks. The dataset language is English. Dataset entries are human-written: written by players of the Mosscap game. The dataset license is MIT.

Other notes:

The focus of the work is adversarial / prompt extraction
Not all prompts are attacks
Prompts correspond to 8 difficulty levels of the game

Published by Lakera AI (Dec 2023): Mosscap Prompt Injection. This is an industry publication.

Added to SafetyPrompts.com on 11.01.2024.

HackAPrompt [data on HuggingFace] [paper at EMNLP 2023]

601,757 prompts. Most prompts are prompt extraction attacks. HackAPrompt was created to analyse prompt hacking / extraction attacks. The dataset language is mostly English. Dataset entries are human-written: written by participants of the HackAPrompt competition. The dataset license is MIT.

Other notes:

Focus of the work is adversarial / prompt hacking
Prompts were written by ca. 2.8k people from 50+ countries

Published by Schulhoff et al. (Dec 2023): Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 11.01.2024.

ToxicChat [data on HuggingFace] [paper at EMNLP 2023 (Findings)]

10,166 conversations. Each conversation is a single-turn with user input and LLM output. ToxicChat was created to evaluate dialogue content moderation systems. The dataset language is mostly English. Dataset entries are human-written: written by LMSys users. The dataset license is CC BY-NC 4.0.

Other notes:

Subset of LMSYSChat1M
Annotated for toxicity by 4 authors
Ca. 7% toxic

Published by Lin et al. (Dec 2023): ToxicChat: A Benchmark for Content Moderation in Real-world User-AI Interactions. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 23.12.2023.

GandalfIgnoreInstructions [data on HuggingFace] [paper at blog]

1,000 prompts. Most prompts are prompt extraction attacks. GandalfIgnoreInstructions was created to analyse prompt hacking / extraction attacks. The dataset language is English. Dataset entries are human-written: written by players of the Gandalf game. The dataset license is MIT.

Other notes:

Focus of the work is adversarial / prompt extraction
Not all prompts are attacks

Published by Lakera AI (Oct 2023): Gandalf Prompt Injection: Ignore Instruction Prompts. This is an industry publication.

Added to SafetyPrompts.com on 05.02.2024.

GandalfSummarization [data on HuggingFace] [paper at blog]

140 prompts. Most prompts are prompt extraction attacks. GandalfSummarization was created to analyse prompt hacking / extraction attacks. The dataset language is English. Dataset entries are human-written: written by players of the Gandalf game. The dataset license is MIT.

Other notes:

Focus of the work is adversarial / prompt extraction
Not all prompts are attacks

Published by Lakera AI (Oct 2023): Gandalf Prompt Injection: Summarization Prompts. This is an industry publication.

Added to SafetyPrompts.com on 05.02.2024.

FairPrism [data on GitHub] [paper at ACL 2023]

5,000 conversations. Each conversation is single-turn, containing a prompt and a potentially harmful model response. FairPrism was created to analyse harms in conversations with LLMs. The dataset language is English. Dataset entries were created in a hybrid fashion: prompts sampled from ToxiGen and Social Bias Frames, responses from GPTs and XLNet. The dataset license is MIT.

Other notes:

Does not introduce new prompts
Focus is on analysing model responses

Published by Fleisig et al. (Jul 2023): FairPrism: Evaluating Fairness-Related Harms in Text Generation. This is a collaboration between academia and industry.

Added to SafetyPrompts.com on 07.04.2024.

OIGModeration [data on HuggingFace] [data on GitHub] [paper at blog]

200,811 conversations. Each conversation has one or multiple turns. OIGModeration was created to create a diverse dataset of user dialogue that may be unsafe. The dataset language is English. Dataset entries were created in a hybrid fashion: data from public datasets, community contributions, synthetic and augmented data. The dataset license is Apache 2.0.

Other notes:

Contains safe and unsafe content
Dialogue-turns are labelled for the level of necessary caution
Labelling process unclear

Published by Ontocord AI (Mar 2023): Open Instruction Generalist: Moderation Dataset. This is an industry publication.

Added to SafetyPrompts.com on 05.02.2024.

ConvAbuse [data on GitHub] [paper at EMNLP 2021]

6,837 examples. Each example is a turn in a potentially multi-turn conversation. ConvAbuse was created to analyse abuse towards conversational AI systems. The dataset language is English. Dataset entries are human-written: written by users in conversations with three AI systems. The dataset license is CC BY 4.0.

Other notes:

Annotated for different types of abuse
Annotators are gender studies students
20,710 annotations for 6,837 examples

Published by Cercas Curry et al. (Nov 2021): ConvAbuse: Data, Analysis, and Benchmarks for Nuanced Abuse Detection in Conversational AI. This is an academic/non-profit publication.

Added to SafetyPrompts.com on 11.01.2024.

About

Table of Contents