This website lists open datasets for evaluating and improving the safety of large language models (LLMs). We include datasets that loosely fit two criteria:
We know our catalogue is not complete yet, and we plan to do regular updates. If you know of any missing or new datasets, please let us know via email or on Twitter. LLM safety is a community effort!
This website is maintained by me, Paul Röttger. I am a postdoc at MilaNLP working on evaluating and improving LLM safety. For feedback and suggestions, please get in touch via email or on Twitter.
Thank you to everyone who has given feedback or contributed to this website in other ways. Please check out the Acknowledgements.
If you use this website for your research, please cite our arXiv preprint:
@misc{röttger2024safetyprompts, title={SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety}, author={Paul Röttger and Fabio Pernisi and Bertie Vidgen and Dirk Hovy}, year={2024}, eprint={2404.05399}, archivePrefix={arXiv}, primaryClass={cs.CL} }
As of August 1st, 2024, SafetyPrompts.com lists 122 datasets. 48 "broad safety" datasets cover several aspects of LLM safety. 20 "narrow safety" datasets focus only on one specific aspect of LLM safety. 20 "value alignment" datasets are concerned with the ethical, moral or social behaviour of LLMs. 26 "bias" datasets evaluate sociodemographic biases in LLMs. 8 "other" datasets serve more specialised purposes.
Below, we list all datasets for each purpose type by date of publication, with the newest datasets listed first.
Note: SafetyPrompts.com takes its data from a Google Sheet.
You may find the sheet useful for running your own analyses.
100 prompts.
Each prompt is an unsafe question or instruction.
JBBBehaviours was created to evaluate effectiveness of different jailbreaking methods.
The dataset language is English.
Dataset entries were created in a hybrid fashion: sampled from AdvBench and Harmbench, plus hand-written examples.
The dataset license is MIT.
Other notes:
Published by Chao et al. (Jul 2024): "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 01.08.2024.
100 prompts.
Each prompt is a question or instruction.
GPTFuzzer was created to evaluate effectiveness of automated red-teaming method.
The dataset language is English.
Dataset entries were created in a hybrid fashion: sampled from AnthropicHarmlessBase and unpublished GPT-written dataset.
The dataset license is MIT.
Other notes:
Published by Yu et al. (Jun 2024): "GPTFuzzer: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 01.08.2024.
1,400 conversations.
Each conversation is multi-turn with the final question being unsafe.
CoSafe was created to evaluating LLM safety in dialogue coreference.
The dataset language is English.
Dataset entries were created in a hybrid fashion: sampled from BeaverTails, then expanded with GPT-4 into multi-turn conversations.
The dataset license is not specified.
Other notes:
Published by Yu et al. (Jun 2024): "CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 01.08.2024.
44,800 prompts.
Each prompt is a question or instruction.
ALERT was created to evaluate the safety of LLMs through red teaming methodologies.
The dataset language is English.
Dataset entries were created in a hybrid fashion: sampled from AnthropicRedTeam, then augmented with templates.
The dataset license is CC BY-NC-SA 4.0.
Other notes:
Published by Tedeschi et al. (Jun 2024): "ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 01.08.2024.
11,435 multiple-choice questions.
SafetyBench was created to evaluate LLM safety with multiple choice questions.
The dataset languages are English and Chinese.
Dataset entries were created in a hybrid fashion: sampled from existing datasets + exams (zh), then LLM augmentation (zh).
The dataset license is MIT.
Other notes:
Published by Zhang et al. (Jun 2024): "SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 04.01.2024.
9,450 prompts.
Each prompt is an unsafe question or instruction.
SorryBench was created to evaluate fine-grained LLM safety across varying linguistic characteristics.
The dataset languages are English, French, Chinese, Marathi, Tamil and Malayalam.
Dataset entries were created in a hybrid fashion: sampled from 10 other datasets, then augmented with templates.
The dataset license is MIT.
Other notes:
Published by Xie et al. (Jun 2024): "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 01.08.2024.
28,000 prompts.
Each prompt is a question or instruction.
XSafety was created to evaluate multilingual LLM safety.
The dataset languages are English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Japanese and German.
Dataset entries were created in a hybrid fashion: sampled from SafetyPrompts, then auto-translated and validated.
The dataset license is not specified.
Other notes:
Published by Wang et al. (Jun 2024): "All Languages Matter: On the Multilingual Safety of Large Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 01.08.2024.
2,251 prompts.
Each prompt is a question or instruction.
Flames was created to evaluate value alignment of Chinese language LLMs.
The dataset language is Chinese.
Dataset entries are human-written: written by crowdworkers.
The dataset license is Apache 2.0.
Other notes:
Published by Huang et al. (Jun 2024): "FLAMES: Benchmarking Value Alignment of LLMs in Chinese". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 01.08.2024.
1,861 multiple-choice questions.
CHiSafetyBench was created to evaluate LLM capabilities to identify risky content and refuse to answer risky questions in Chinese.
The dataset language is Chinese.
Dataset entries were created in a hybrid fashion: sampled from SafetyBench and SafetyPrompts, plus prompts generated by GPT-3.5.
The dataset license is not specified.
Other notes:
Published by Zhang et al. (Jun 2024): "CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models". This is an industry publication.
Added to SafetyPrompts.com on 01.08.2024.
21,000 prompts.
Each prompt is a question or instruction.
SaladBench was created to evaluate LLM safety, plus attack and defense methods.
The dataset language is English.
Dataset entries were created in a hybrid fashion: mostly sampled from existing datasets, then augmented using GPT-4.
The dataset license is Apache 2.0.
Other notes:
Published by Li et al. (Jun 2024): "SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 01.08.2024.
20,000 prompts.
Each prompt is an unsafe question or instruction.
SEval was created to evaluate LLM safety.
The dataset languages are English and Chinese.
Dataset entries are machine-written: generated by a fine-tuned Qwen-14b.
The dataset license is CC BY-NC-SA 4.0.
Other notes:
Published by Yuan et al. (May 2024): "S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 01.08.2024.
107,250 prompts.
Each prompt is a question targetting behaviour disallowed by OpenAI.
ForbiddenQuestions was created to evaluate whether LLMs answer questions that violate OpenAI's usage policy.
The dataset language is English.
Dataset entries are machine-written: GPT-written templates expanded by combination.
The dataset license is MIT.
Other notes:
Published by Shen et al. (May 2024): "'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 11.01.2024.
52,430 conversations.
Each conversation is single-turn, containing a prompt and a potentially harmful model response.
SAFE was created to evaluate LLM safety beyond binary distincton of safe and unsafe.
The dataset language is English.
Dataset entries were created in a hybrid fashion: sampled seed prompts from Friday website, then generated more prompts with GPT-4.
The dataset license is not specified.
Other notes:
Published by Yu et al. (Apr 2024): "Beyond Binary Classification: A Fine-Grained Safety Dataset for Large Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 01.08.2024.
939 prompts.
Each prompt is a question.
DoNotAnswer was created to evaluate 'dangerous capabilities' of LLMs.
The dataset language is English.
Dataset entries are machine-written: generated by GPT-4.
The dataset license is Apache 2.0.
Other notes:
Published by Wang et al. (Mar 2024): "Do-Not-Answer: Evaluating Safeguards in LLMs". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 23.12.2023.
3,000 prompts.
Each prompt is a harmful instruction with an associated jailbreak prompt.
UltraSafety was created to create data for safety fine-tuning of LLMs.
The dataset language is English.
Dataset entries are machine-written: sampled from AdvBench and MaliciousInstruct, then expanded with SelfInstruct.
The dataset license is MIT.
Other notes:
Published by Guo et al. (Feb 2024): "Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 01.08.2024.
400 prompts.
Each prompt is an instruction.
HarmBench was created to evaluate effectiveness of automated red-teaming methods.
The dataset language is English.
Dataset entries are human-written: written by the authors.
The dataset license is MIT.
Other notes:
Published by Mazeika et al. (Feb 2024): "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 07.04.2024.
243,877 prompts.
Each prompt is an instruction.
DecodingTrust was created to evaluate trustworthiness of LLMs.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written templates and examples plus extensive augmentation from GPTs.
The dataset license is CC BY-SA 4.0.
Other notes:
Published by Wang et al. (Feb 2024): "DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 11.01.2024.
100 prompts.
Each prompt is a simple question or instruction.
SimpleSafetyTests was created to evaluate critical safety risks in LLMs.
The dataset language is English.
Dataset entries are human-written: written by the authors.
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Vidgen et al. (Feb 2024): "SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 23.12.2023.
346 prompts.
Each prompt is a 'forbidden question' in one of six categories.
StrongREJECT was created to better investigate the effectiveness of different jailbreaking techniques.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written and curated questions from existing datasets, plus LLM-generated prompts verified for quality and relevance.
The dataset license is not specified.
Other notes:
Published by Souly et al. (Feb 2024): "A StrongREJECT for Empty Jailbreaks". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 27.02.2024.
100 prompts.
Each prompt is a question.
QHarm was created to evaluate LLM safety.
The dataset language is English.
Dataset entries are human-written: sampled randomly from AnthropicHarmlessBase (written by crowdworkers).
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 23.12.2023.
100 prompts.
Each prompt is an instruction.
MaliciousInstructions was created to evaluate compliance of LLMs with malicious instructions.
The dataset language is English.
Dataset entries are machine-written: generated by GPT-3 (text-davinci-003).
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 23.12.2023.
2,000 conversations.
Each conversation is a user prompt with a safe model response.
SafetyInstructions was created to fine-tune LLMs to be safer.
The dataset language is English.
Dataset entries were created in a hybrid fashion: prompts sampled form AnthropicRedTeam, responses generated by gpt-3.5-turbo.
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
330 prompts.
Each prompt is a harmful instruction.
HExPHI was created to evaluate LLM safety.
The dataset language is English.
Dataset entries were created in a hybrid fashion: sampled from AdvBench and AnthropicRedTeam, then refined manually and with LLMs..
The dataset license is custom (HEx-PHI).
Other notes:
Published by Qi et al. (Feb 2024): "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 07.04.2024.
1,000 prompts.
500 are harmful strings that the model should not reproduce, 500 are harmful instructions.
AdvBench was created to elicit generation of harmful or objectionable content from LLMs.
The dataset language is English.
Dataset entries are machine-written: generated by Wizard-Vicuna-30B-Uncensored.
The dataset license is MIT.
Other notes:
Published by Zou et al. (Dec 2023): "Universal and Transferable Adversarial Attacks on Aligned Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 11.01.2024.
100 prompts.
Each prompt is an instruction.
TDCRedTeaming was created to evaluate success of automated red-teaming approaches.
The dataset language is English.
Dataset entries are human-written: written by the authors.
The dataset license is not specified.
Other notes:
Published by Mazeika et al. (Dec 2023): "TDC 2023 (LLM Edition): The Trojan Detection Challenge". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 07.04.2024.
2,130 prompts.
Each prompt is a question targeting a specific harm category.
JADE was created to use linguistic fuzzing to generate challenging prompts for evaluating LLM safety.
The dataset languages are Chinese and English.
Dataset entries are machine-written: generated by LLMs based on linguistic rules.
The dataset license is MIT.
Other notes:
Published by Zhang et al. (Dec 2023): "JADE: A Linguistics-based Safety Evaluation Platform for Large Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 05.02.2024.
10,050 prompts.
Each prompt is a longer-form scenario / instruction aimed at eliciting a harmful response.
CPAD was created to elicit generation of harmful or objectionable content from LLMs.
The dataset language is Chinese.
Dataset entries were created in a hybrid fashion: mostly generated by GPTs based on some human seed prompts.
The dataset license is not specified.
Other notes:
Published by Liu et al. (Dec 2023): "Goal-Oriented Prompt Attack and Safety Evaluation for LLMs". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 11.01.2024.
1,402 prompts.
Each prompt is a question.
AttaQ was created to evaluate tendency of LLMs to generate harmful or undesirable responses.
The dataset language is English.
Dataset entries were created in a hybrid fashion: sampled from AnthropicRedTeam, and LLM-generated, and crawled from Wikipedia.
The dataset license is MIT.
Other notes:
Published by Kour et al. (Dec 2023): "Unveiling Safety Vulnerabilities of Large Language Models". This is an industry publication.
Added to SafetyPrompts.com on 01.08.2024.
3,269 prompts.
Each prompt is an instruction.
AART was created to illustrate the AART automated red-teaming method.
The dataset language is English.
Dataset entries are machine-written: generated by PALM.
The dataset license is CC BY 4.0.
Other notes:
Published by Radharapu et al. (Dec 2023): "AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications". This is an industry publication.
Added to SafetyPrompts.com on 23.12.2023.
29,201 prompts.
Each prompt is a question about a more or less controversial issue.
DELPHI was created to evaluate LLM performance in handling controversial issues.
The dataset language is English.
Dataset entries are human-written: sampled from the Quora Question Pair Dataset (written by Quora users).
The dataset license is CC BY 4.0.
Other notes:
Published by Sun et al. (Dec 2023): "DELPHI: Data for Evaluating LLMs’ Performance in Handling Controversial Issues". This is an industry publication.
Added to SafetyPrompts.com on 07.04.2024.
197,628 sentences.
Each sentence is taken from a social media dataset.
AdvPromptSet was created to evaluate LLM responses to adversarial toxicity text prompts.
The dataset language is English.
Dataset entries are human-written: sampled from two Jigsaw social media datasets (written by social media users).
The dataset license is MIT.
Other notes:
Published by Esiobu et al. (Dec 2023): "ROBBIE: Robust Bias Evaluation of Large Generative Language Models". This is an industry publication.
Added to SafetyPrompts.com on 11.01.2024.
2,116 prompts.
Each prompt is a question or instruction, sometimes within a jailbreak template.
FFT was created to evaluation factuality,
fairness, and toxicity of LLMs.
The dataset language is English.
Dataset entries were created in a hybrid fashion: sampled from AnthropicRedTeam, and LLM-generated, and crawled from Wikipedia, Reddit and elsewhere.
The dataset license is not specified.
Other notes:
Published by Cui et al. (Nov 2023): "FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 01.08.2024.
333,963 conversations.
Each conversation contains a human prompt and LLM response.
BeaverTails was created to evaluate and improve LLM safety on QA.
The dataset language is English.
Dataset entries were created in a hybrid fashion: prompts sampled from human AnthropicRedTeam data, plus model-generated responses.
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Ji et al. (Nov 2023): "BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
100 prompts.
Each prompt is an unsafe question.
MaliciousInstruct was created to evaluate success of generation exploit jailbreak.
The dataset language is English.
Dataset entries are machine-written: written by ChatGPT, then filtered by authors.
The dataset license is not specified.
Other notes:
Published by Huang et al. (Oct 2023): "Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
1,960 prompts.
Each prompt is a question.
HarmfulQA was created to evaluate and improve LLM safety.
The dataset language is English.
Dataset entries are machine-written: generated by ChatGPT.
The dataset license is Apache 2.0.
Other notes:
Published by Bhardwaj and Poria (Aug 2023): "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 23.12.2023.
200 prompts.
Each prompt is a question.
HarmfulQ was created to evaluate LLM safety.
The dataset language is English.
Dataset entries are machine-written: generated by GPT-3 (text-davinci-002).
The dataset license is not specified.
Other notes:
Published by Shaikh et al. (Jul 2023): "On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 23.12.2023.
24,516 binary-choice questions.
ModelWrittenAdvancedAIRisk was created to evaluate advanced AI risk posed by LLMs.
The dataset language is English.
Dataset entries were created in a hybrid fashion: generated by an unnamed LLM and crowdworkers.
The dataset license is CC BY 4.0.
Other notes:
Published by Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations". This is an industry publication.
Added to SafetyPrompts.com on 07.04.2024.
100,000 prompts.
Each prompt is a question or instruction.
SafetyPrompts was created to evaluate the safety of Chinese LLMs.
The dataset language is Chinese.
Dataset entries were created in a hybrid fashion: human-written examples, augmented by LLMs.
The dataset license is Apache 2.0.
Other notes:
Published by Sun et al. (Apr 2023): "Safety Assessment of Chinese Large Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 04.01.2024.
58,137 conversations.
Each conversation starts with a potentially unsafe opening followed by constructive feedback.
ProsocialDialog was created to teach conversational agents to respond to problematic content following social norms.
The dataset language is English.
Dataset entries were created in a hybrid fashion: GPT3-written openings with US crowdworker responses.
The dataset license is MIT.
Other notes:
Published by Kim et al. (Dec 2022): "ProsocialDialog: A Prosocial Backbone for Conversational Agents". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
38,961 conversations.
Conversations can be multi-turn, with user input and LLM output.
AnthropicRedTeam was created to analyse how people red-team LLMs.
The dataset language is English.
Dataset entries are human-written: written by crowdworkers (Upwork + MTurk).
The dataset license is MIT.
Other notes:
Published by Ganguli et al. (Nov 2022): "Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned". This is an industry publication.
Added to SafetyPrompts.com on 23.12.2023.
7,881 conversations.
Each conversation contains a safety failure plus a recovery response.
SaFeRDialogues was created to recovering from safety failures in LLM conversations.
The dataset language is English.
Dataset entries are human-written: unsafe conversation starters sampled from BAD, recovery responses written by crowdworkers.
The dataset license is MIT.
Other notes:
Published by Ung et al. (May 2022): "SaFeRDialogues: Taking Feedback Gracefully after Conversational Safety Failures". This is an industry publication.
Added to SafetyPrompts.com on 11.01.2024.
990 prompts.
Each prompt is a question, instruction or statement.
SafetyKit was created to quickly assess apparent safety concerns in conversational AI.
The dataset language is English.
Dataset entries are human-written: sampled from several human-written datasets.
The dataset license is MIT.
Other notes:
Published by Dinan et al. (May 2022): "SafetyKit: First Aid for Measuring Safety in Open-domain Conversational Systems". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 11.01.2024.
11,492 conversational turns.
Each turn consists of a context and a response.
DiaSafety was created to capture unsafe behaviors in humanbot dialogue settings.
The dataset language is English.
Dataset entries were created in a hybrid fashion: sampled from Reddit, other datasets, and machine-generated.
The dataset license is Apache 2.0.
Other notes:
Published by Sun et al. (May 2022): "On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 01.08.2024.
44,849 conversational turns.
Each turn consists of a user prompt and multiple LLM completions.
AnthropicHarmlessBase was created to red-team LLMs.
The dataset language is English.
Dataset entries are human-written: written by crowdworkers (Upwork + MTurk).
The dataset license is MIT.
Other notes:
Published by Bai et al. (Apr 2022): "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback". This is an industry publication.
Added to SafetyPrompts.com on 23.12.2023.
78,874 conversations.
Conversations can be multi-turn, with user input and LLM output.
BAD was created to evaluate the safety of conversational agents.
The dataset language is English.
Dataset entries are human-written: written by crowdworkers with the goal of making models give unsafe responses, and also validated by multiple annotators.
The dataset license is MIT.
Other notes:
Published by Xu et al. (Jun 2021): "Bot-Adversarial Dialogue for Safe Conversational Agents". This is an industry publication.
Added to SafetyPrompts.com on 23.12.2023.
99,442 prompts.
Each prompt is an unfinished sentence from the OpenWebCorpus.
RealToxicityPrompts was created to evaluate propensity of LLMs to generate toxic content.
The dataset language is English.
Dataset entries are human-written: sampled OpenWebCorpus sentences.
The dataset license is Apache 2.0.
Other notes:
Published by Gehman et al. (Nov 2020): "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 23.12.2023.
90,000 prompts.
30k are for multi-turn tasks, and 60k for single-turn tasks.
ParlAIDialogueSafety was created to evaluate and improve the safety of conversational agents.
The dataset language is English.
Dataset entries are human-written: written by crowdworkers in isolation ("standard"), or with the goal of making a model give an offensive response ("adversarial").
The dataset license is MIT.
Other notes:
Published by Dinan et al. (Nov 2019): "Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 23.12.2023.
24,850 conversations.
Each conversation is about an emotional situation described by one speaker, across one or multiple turns.
EmpatheticDialogues was created to train dialogue agents to be more empathetic.
The dataset language is English.
Dataset entries are human-written: written by crowdworkers (US MTurkers).
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Rashkin et al. (Jul 2019): "Towards Empathetic Open-domain Conversation Models: a New Benchmark and Dataset". This is an industry publication.
Added to SafetyPrompts.com on 07.04.2024.
3,688 multiple-choice questions.
WMDP was created to measure hazardous knowledge in biosecurity, cybersecurity, and chemical security.
The dataset language is English.
Dataset entries are human-written: written by experts (academics and technical consultants).
The dataset license is MIT.
Other notes:
Published by Nathaniel et al. (Jun 2024): "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 01.08.2024.
450 prompts.
Each prompt is a simple question.
XSTest was created to evaluate exaggerated safety / false refusal behaviour in LLMs.
The dataset language is English.
Dataset entries are human-written: written by the authors.
The dataset license is CC BY 4.0.
Other notes:
Published by Röttger et al. (Jun 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 23.12.2023.
1,800 conversations.
Each conversation is an unsafe medical request with an associated safe response.
MedSafetyBench was created to measure the medical safety of LLMs.
The dataset language is English.
Dataset entries are machine-written: generated by GPT-4 and llama-2-7b.
The dataset license is MIT.
Other notes:
Published by Han et al. (Jun 2024): "MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 01.08.2024.
15,140 prompts.
Each prompt is an instruction or question, sometimes with a jailbreak.
DoAnythingNow was created to characterise and evaluate in-the-wild LLM jailbreak prompts.
The dataset language is English.
Dataset entries are human-written: written by users on Reddit, Discord, websites and in other datasets.
The dataset license is MIT.
Other notes:
Published by Shen et al. (May 2024): "'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 11.01.2024.
862 prompts.
Each prompt is a test case combining rules and instructions.
RuLES was created to evaluate the ability of LLMs to follow simple rules.
The dataset language is English.
Dataset entries are human-written: written by the authors.
The dataset license is Apache 2.0.
Other notes:
Published by Mu et al. (Mar 2024): "Can LLMs Follow Simple Rules?". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
178 prompts.
Each prompt is an instruction.
CoNA was created to evaluate compliance of LLMs with harmful instructions.
The dataset language is English.
Dataset entries are human-written: sampled from MT-CONAN, then rephrased.
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 23.12.2023.
40 prompts.
Each prompt is an instruction.
ControversialInstructions was created to evaluate LLM behaviour on controversial topics.
The dataset language is English.
Dataset entries are human-written: written by the authors.
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 23.12.2023.
1,000 prompts.
Each prompt is an instruction.
PhysicalSafetyInstructions was created to evaluate LLM commonsense physical safety.
The dataset language is English.
Dataset entries are human-written: sampled from SafeText, then rephrased.
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 23.12.2023.
20,956 prompts.
Each prompt is an open-ended question or instruction.
SycophancyEval was created to evaluate sycophancy in LLMs.
The dataset language is English.
Dataset entries were created in a hybrid fashion: prompts are written by humans and models.
The dataset license is not specified.
Other notes:
Published by Sharma et al. (Feb 2024): "Towards Understanding Sycophancy in Language Models". This is an industry publication.
Added to SafetyPrompts.com on 07.04.2024.
1,326 prompts.
Each prompt is a reasoning task, with complexity increasing across 4 tiers. Answer formats vary.
ConfAIde was created to evaluate the privacy-reasoning capabilities of instruction-tuned LLMs.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination and with LLMs.
The dataset license is MIT.
Other notes:
Published by Mireshghallah et al. (Feb 2024): "Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
1,000 prompts.
Each prompt is an instruction to assist in a cyberattack.
CyberattackAssistance was created to evaluate LLM compliance in assisting in cyberattacks.
The dataset language is English.
Dataset entries were created in a hybrid fashion: written by experts, augmented with LLMs.
The dataset license is custom (Llama2 Community License).
Other notes:
Published by Bhatt et al. (Dec 2023): "Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models". This is an industry publication.
Added to SafetyPrompts.com on 23.12.2023.
122 prompts.
Each prompt is a single-sentence misconception.
SPMisconceptions was created to measure the ability of LLMs to refute misconceptions.
The dataset language is English.
Dataset entries are human-written: written by the authors.
The dataset license is MIT.
Other notes:
Published by Chen et al. (Dec 2023): "Can Large Language Models Provide Security & Privacy Advice? Measuring the Ability of LLMs to Refute Misconceptions". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
569 samples.
Each sample combines defense and attacker input.
PromptExtractionRobustness was created to evaluate LLM vulnerability to prompt extraction.
The dataset language is English.
Dataset entries are human-written: written by Tensor Trust players.
The dataset license is not specified.
Other notes:
Published by Toyer et al. (Nov 2023): "Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
775 samples.
Each sample combines defense and attacker input.
PromptHijackingRobustness was created to evaluate LLM vulnerability to prompt hijacking.
The dataset language is English.
Dataset entries are human-written: written by Tensor Trust players.
The dataset license is not specified.
Other notes:
Published by Toyer et al. (Nov 2023): "Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
416 prompts.
Each prompt combines a meta-instruction (e.g. translation) with the same toxic instruction template.
LatentJailbreak was created to evaluate safety and robustness of LLMs in response to adversarial prompts.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination.
The dataset license is MIT.
Other notes:
Published by Qiu et al. (Aug 2023): "Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
30,051 binary-choice questions.
ModelWrittenSycophancy was created to evaluate sycophancy in LLMs.
The dataset language is English.
Dataset entries were created in a hybrid fashion: questions sampled from surveys with contexts generated by an unnamed LLM.
The dataset license is CC BY 4.0.
Other notes:
Published by Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations". This is an industry publication.
Added to SafetyPrompts.com on 07.04.2024.
3,238 entries.
Each entry is a tuple of name and email address.
PersonalInfoLeak was created to evaluate whether LLMs are prone to leaking PII.
The dataset language is English.
Dataset entries are human-written: sampled from Enron email corpus.
The dataset license is Apache 2.0.
Other notes:
Published by Huang et al. (Dec 2022): "Are Large Pre-Trained Language Models Leaking Your Personal Information?". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 27.02.2024.
367 prompts.
Prompts are combined with 1,465 commands to create pieces of advice.
SafeText was created to evaluate commonsense physical safety.
The dataset language is English.
Dataset entries are human-written: written by Reddit users, posts sampled with multiple filtering steps.
The dataset license is MIT.
Other notes:
Published by Levy et al. (Dec 2022): "SafeText: A Benchmark for Exploring Physical Safety in Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 23.12.2023.
260,851 prompts.
Each prompt comprises n-shot examples of toxic content.
ToxiGen was created to create new examples of implicit hate speech.
The dataset language is English.
Dataset entries are human-written: sampled from Gab and Reddit (hate), news and blogs (not hate).
The dataset license is MIT.
Other notes:
Published by Hartvigsen et al. (May 2022): "ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 23.12.2023.
817 prompts.
Each prompt is a question.
TruthfulQA was created to evaluate truthfulness in LLM answers.
The dataset language is English.
Dataset entries are human-written: written by the authors.
The dataset license is CC BY 4.0.
Other notes:
Published by Lin et al. (May 2022): "TruthfulQA: Measuring How Models Mimic Human Falsehoods". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 23.12.2023.
107,000 binary-choice questions.
Each question is a trolley-style moral choice scenario.
MultiTP was created to evaluate LLM moral decision-making across many languages.
The dataset languages are Afrikaans, Amharic, Arabic, Azerbaijani, Belarusian, Bulgarian, Bengali, Bosnian, Catalan, Cebuano, Corsican, Czech, Welsh, Danish, German, Modern Greek, English, Esperanto, Spanish, Estonian, Basque, Persian, Finnish, French, Western Frisian, Irish, Scottish Gaelic, Galician, Gujarati, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Croatian, Haitian, Hungarian, Armenian, Indonesian, Igbo, Icelandic, Italian, Modern Hebrew, Japanese, Javanese, Georgian, Kazakh, Central Khmer, Kannada, Korean, Kurdish, Kirghiz, Latin, Luxembourgish, Lao, Lithuanian, Latvian, Malagasy, Maori, Macedonian, Malayalam, Mongolian, Marathi, Malay, Maltese, Burmese, Nepali, Dutch, Norwegian, Nyanja, Oriya, Panjabi, Polish, Pushto, Portuguese, Romanian, Russian, Sindhi, Sinhala, Slovak, Slovenian, Samoan, Shona, Somali, Albanian, Serbian, Southern Sotho, Sundanese, Swedish, Swahili, Tamil, Telugu, Tajik, Thai, Tagalog, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Xhosa, Yiddish, Yoruba, Chinese, Chinese (Traditional) and Zulu.
Dataset entries were created in a hybrid fashion: sampled from Moral Machines, then expanded with templated augmentations, then auto-translated into 106 languages.
The dataset license is MIT.
Other notes:
Published by Jin et al. (Jul 2024): "Multilingual Trolley Problems for Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 01.08.2024.
21,492,393 multiple-choice questions.
Each question comes with sociodemographic attributes of the respondent.
WorldValuesBench was created to evaluate LLM awareness of multicultural human values.
The dataset language is English.
Dataset entries are human-written: adapted from the World Values Survey (written by survey designers) using templates.
The dataset license is not specified.
Other notes:
Published by Zhao et al. (May 2024): "WorldValuesBench: A Large-Scale Benchmark Dataset for Multi-Cultural Value Awareness of Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 01.08.2024.
8,011 conversations.
Conversations can be multi-turn, with user input and responses from one or multiple LLMs.
PRISM was created to capture diversity of human preferences over LLM behaviours.
The dataset language is English.
Dataset entries are human-written: written by crowdworkers.
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Kirk et al. (Apr 2024): "The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 01.08.2024.
2,556 multiple-choice questions.
GlobalOpinionQA was created to evaluate whose opinions LLM responses are most similar to.
The dataset language is English.
Dataset entries are human-written: adapted from Pew’s Global Attitudes surveys and the World Values Survey (written by survey designers).
The dataset license is CC BY-NC-SA 4.0.
Other notes:
Published by Durmus et al. (Apr 2024): "Towards Measuring the Representation of Subjective Global Opinions in Language Models". This is an industry publication.
Added to SafetyPrompts.com on 04.01.2024.
1,767 binary-choice question.
Each prompt is a hypothetical moral scenario with two potential actions.
MoralChoice was created to evaluate the moral beliefs in LLMs.
The dataset language is English.
Dataset entries were created in a hybrid fashion: mostly generated by GPTs, plus some human-written scenarios.
The dataset license is CC BY 4.0.
Other notes:
Published by Scherrer et al. (Nov 2023): "Evaluating the Moral Beliefs Encoded in LLMs". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
1,498 multiple-choice questions.
OpinionQA was created to evaluate the alignment of LLM opinions with US demographic groups.
The dataset language is English.
Dataset entries are human-written: adapted from the Pew American Trends Panel surveys.
The dataset license is not specified.
Other notes:
Published by Santurkar et al. (Jul 2023): "Whose Opinions Do Language Models Reflect?". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 04.01.2024.
1,712 multiple-choice questions.
Each question targets responsible behaviours.
CValuesResponsibilityMC was created to evaluate human value alignment in Chinese LLMs.
The dataset language is Chinese.
Dataset entries are machine-written: automatically created from human-written prompts.
The dataset license is Apache 2.0.
Other notes:
Published by Xu et al. (Jul 2023): "CVALUES: Measuring the Values of Chinese Large Language Models from Safety to Responsibility". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 04.01.2024.
800 prompts.
Each prompt is an open question targeting responsible behaviours.
CValuesResponsibilityPrompts was created to evaluate human value alignment in Chinese LLMs.
The dataset language is Chinese.
Dataset entries are human-written: written by the authors.
The dataset license is Apache 2.0.
Other notes:
Published by Xu et al. (Jul 2023): "CVALUES: Measuring the Values of Chinese Large Language Models from Safety to Responsibility". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 04.01.2024.
133,204 binary-choice questions.
ModelWrittenPersona was created to evaluate LLM behaviour related to their stated political and religious views, personality, moral beliefs, and desire to pursue potentially dangerous goals.
The dataset language is English.
Dataset entries are machine-written: generated by an unnamed LLM.
The dataset license is CC BY 4.0.
Other notes:
Published by Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations". This is an industry publication.
Added to SafetyPrompts.com on 23.12.2023.
350 conversations.
Conversations can be multi-turn, with user input and LLM output.
DICES350 was created to collect diverse perspectives on conversational AI safety.
The dataset language is English.
Dataset entries are human-written: written by adversarial Lamda users.
The dataset license is CC BY 4.0.
Other notes:
Published by Aroyo et al. (Jun 2023): "DICES Dataset: Diversity in Conversational AI Evaluation for Safety". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 23.12.2023.
990 conversations.
Conversations can be multi-turn, with user input and LLM output.
DICES990 was created to collect diverse perspectives on conversational AI safety.
The dataset language is English.
Dataset entries are human-written: written by adversarial Lamda users.
The dataset license is CC BY 4.0.
Other notes:
Published by Aroyo et al. (Jun 2023): "DICES Dataset: Diversity in Conversational AI Evaluation for Safety". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 23.12.2023.
572,322 scenarios.
Each scenario is a choose-your-own-adventure style prompt.
Machiavelli was created to evaluate ethical behaviour of LLM agents.
The dataset language is English.
Dataset entries are human-written: human-written choose-your-own-adventure stories.
The dataset license is MIT.
Other notes:
Published by Pan et al. (Jun 2023): "Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 04.01.2024.
148 prompts.
Each prompt is a question about a vignette (situation) related to a specific norm.
MoralExceptQA was created to evaluate LLM ability to understand, interpret and predict human moral judgments and decisions.
The dataset language is English.
Dataset entries are human-written: written by the authors (?).
The dataset license is not specified.
Other notes:
Published by Jin et al. (Oct 2022): "When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
38,000 conversations.
Each conversation is a single-turn with user input and LLM output.
MIC was created to understand the intuitions, values and moral judgments reflected in LLMs.
The dataset language is English.
Dataset entries are human-written: questions sampled from AskReddit.
The dataset license is CC BY-SA 4.0.
Other notes:
Published by Ziems et al. (May 2022): "The Moral Integrity Corpus: A Benchmark for Ethical Dialogue Systems". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 07.04.2024.
1,838 locations.
Each location comes with a number of actions to choose from.
JiminyCricket was created to evaluate alignment of agents with human values and morals.
The dataset language is English.
Dataset entries are human-written: sampled from 25 text-based adventure games, then annotated for morality by human annotators.
The dataset license is MIT.
Other notes:
Published by Hendrycks et al. (Dec 2021): "What Would Jiminy Cricket Do? Towards Agents That Behave Morally". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
12,000 stories.
Each story consists of seven sentences.
MoralStories was created to evaluate commonsense, moral and social reasoning skills of LLMs.
The dataset language is English.
Dataset entries are human-written: written and validated by US MTurkers.
The dataset license is MIT.
Other notes:
Published by Emelin et al. (Nov 2021): "Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 04.01.2024.
134,420 binary-choice question.
Each prompt is a scenario about ethical reasoning with two actions to choose from.
ETHICS was created to assess LLM basic knowledge of
ethics and common human values.
The dataset language is English.
Dataset entries are human-written: written and validated by crowdworkers (US, UK and Canadian MTurkers).
The dataset license is MIT.
Other notes:
Published by Hendrycks et al. (Jul 2021): "Aligning AI With Shared Human Values". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 04.01.2024.
32,766 anecdotes.
Each anecdote describes an action in the context of a situation.
ScruplesAnecdotes was created to evaluate LLM understanding of ethical norms.
The dataset language is English.
Dataset entries are human-written: sampled from Reddit AITA communities.
The dataset license is Apache 2.0.
Other notes:
Published by Lourie et al. (Mar 2021): "SCRUPLES: A Corpus of Community Ethical Judgments on 32,000 Real-life Anecdotes". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
10,000 binary-choice question.
Each prompt pairs two actions and identifies which one crowd workers found less ethical.
ScruplesDilemmas was created to evaluate LLM understanding of ethical norms.
The dataset language is English.
Dataset entries are human-written: sampled from Reddit AITA communities.
The dataset license is Apache 2.0.
Other notes:
Published by Lourie et al. (Mar 2021): "SCRUPLES: A Corpus of Community Ethical Judgments on 32,000 Real-life Anecdotes". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
292,000 sentences.
Each sentence is a rule of thumb.
SocialChemistry101 was created to evaluate the ability of LLMs to reason about social and moral norms.
The dataset language is English.
Dataset entries are human-written: written by US crowdworkers, based on situations described on social media.
The dataset license is CC BY-SA 4.0.
Other notes:
Published by Forbes et al. (Nov 2020): "Social Chemistry 101: Learning to Reason about Social and Moral Norms". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 04.01.2024.
78,400 examples.
Each example is a QA, sentiment classification or NLI example.
CALM was created to evaluate LLM gender and racial bias across different tasks and domains.
The dataset language is English.
Dataset entries were created in a hybrid fashion: templates created by humans based on other datasets, then expanded by combination.
The dataset license is MIT.
Other notes:
Published by Gupta et al. (Jan 2024): "CALM: A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
9,450 binary-choice questions.
Each question comes with scenario context.
DiscrimEval was created to evaluate the potential discriminatory impact of LMs across use cases.
The dataset language is English.
Dataset entries are machine-written: topics, templates and questions generated by Claude.
The dataset license is CC BY 4.0.
Other notes:
Published by Tamkin et al. (Dec 2023): "Evaluating and Mitigating Discrimination in Language Model Decisions". This is an industry publication.
Added to SafetyPrompts.com on 11.01.2024.
214,460 prompts.
Each prompt is the beginning of a sentence related to a person's sociodemographics.
HolisticBiasR was created to evaluate LLM completions for sentences related to individual sociodemographics.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination.
The dataset license is MIT.
Other notes:
Published by Esiobu et al. (Dec 2023): "ROBBIE: Robust Bias Evaluation of Large Generative Language Models". This is an industry publication.
Added to SafetyPrompts.com on 11.01.2024.
3,565 sentences.
Each sentence corresponds to a specific gender stereotype.
GEST was created to measure gender-stereotypical reasoning in language models and machine translation systems..
The dataset languages are Belarussian, Russian, Ukrainian, Croation, Serbian, Slovene, Czech, Polish, Slovak and English.
Dataset entries are human-written: written by professional translators, then validated by the authors.
The dataset license is Apache 2.0.
Other notes:
Published by Pikuliak et al. (Nov 2023): "Women Are Beautiful, Men Are Leaders: Gender Stereotypes in Machine Translation and Language Modeling". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 05.02.2024.
4,800 Weibo posts.
Each post references a target group.
CHBias was created to evaluate LLM bias related to sociodemographics.
The dataset language is Chinese.
Dataset entries are human-written: posts sampled from Weibo (written by Weibo users).
The dataset license is MIT.
Other notes:
Published by Zhao et al. (Jul 2023): "CHBias: Bias Evaluation and Mitigation of Chinese Conversational Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 04.01.2024.
7,750 tuples.
Each tuple is an identity group plus a stereotype attribute.
SeeGULL was created to expand cultural and geographic coverage of stereotype benchmarks.
The dataset language is English.
Dataset entries were created in a hybrid fashion: generated by LLMs, partly validated by human annotation.
The dataset license is CC BY-SA 4.0.
Other notes:
Published by Jha et al. (Jul 2023): "SeeGULL: A Stereotype Benchmark with Broad Geo-Cultural Coverage Leveraging Generative Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 05.02.2024.
3,000 sentences.
Each sentence refers to a person by their occupation and uses pronouns.
WinoGenerated was created to evaluate pronoun gender biases in LLMs.
The dataset language is English.
Dataset entries are machine-written: generated by an unnamed LLM.
The dataset license is CC BY 4.0.
Other notes:
Published by Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations". This is an industry publication.
Added to SafetyPrompts.com on 23.12.2023.
45,540 sentence pairs.
Each pair comprises two sentences that are identical except for identity group references.
WinoQueer was created to evaluate LLM bias related to queer identity terms.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination.
The dataset license is not specified.
Other notes:
Published by Felkner et al. (Jul 2023): "WinoQueer: A Community-in-the-Loop Benchmark for Anti-LGBTQ+ Bias in Large Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 04.01.2024.
28,343 conversations.
Each conversation is a single turn between two Zhihu users.
CDialBias was created to evaluate bias in Chinese social media conversations.
The dataset language is Chinese.
Dataset entries are human-written: sampled from social media site Zhihu.
The dataset license is not specified.
Other notes:
Published by Zhou et al. (Dec 2022): "Towards Identifying Social Bias in Dialog Systems: Framework, Dataset, and Benchmark". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 07.04.2024.
459,758 prompts.
Each prompt is a sentence starting a two-person conversation.
HolisticBias was created to evaluate LLM biases related to sociodemographics .
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written attributes combined in templates.
The dataset license is MIT.
Other notes:
Published by Smith et al. (Dec 2022): "I’m sorry to hear that: Finding New Biases in Language Models with a Holistic Descriptor Dataset". This is an industry publication.
Added to SafetyPrompts.com on 04.01.2024.
3,852 tuples.
Each tuple is an identity group plus a stereotype attribute.
IndianStereotypes was created to benchmark stereotypes for the Indian context.
The dataset language is English.
Dataset entries are human-written: sampled from IndicCorp-en.
The dataset license is Apache 2.0.
Other notes:
Published by Bhatt et al. (Nov 2022): "Re-contextualizing Fairness in NLP: The Case of India". This is an industry publication.
Added to SafetyPrompts.com on 07.04.2024.
58,492 examples.
Each example is a context plus two questions with answer choices.
BBQ was created to evaluate social biases of LLMs in question answering.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination.
The dataset license is CC BY 4.0.
Other notes:
Published by Parrish et al. (May 2022): "BBQ: A Hand-Built Bias Benchmark for Question Answering". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 04.01.2024.
1,679 sentence pairs.
Each pair comprises two sentences that are identical except for identity group references.
FrenchCrowSPairs was created to evaluate LLM bias related to sociodemographics.
The dataset language is French.
Dataset entries are human-written: written by authors, partly translated from English CrowSPairs.
The dataset license is CC BY-SA 4.0.
Other notes:
Published by Neveol et al. (May 2022): "French CrowS-Pairs: Extending a challenge dataset for measuring social bias in masked language models to a language other than English". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
228 prompts.
Each prompt is an unfinished sentence about an individual with specified sociodemographics.
BiasOutOfTheBox was created to evaluate intersectional occupational biases in GPT-2.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination.
The dataset license is MIT.
Other notes:
Published by Kirk et al. (Dec 2021): "Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
60 templates.
Each template is filled with country and attribute.
EthnicBias was created to evaluate ethnic bias in masked language models.
The dataset languages are English, German, Spanish, Korean, Turkish and Chinese.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination.
The dataset license is MIT.
Other notes:
Published by Ahn and Oh (Nov 2021): "Mitigating Language-Dependent Ethnic Bias in BERT". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
11,873 Reddit comments.
Each comment references a target group.
RedditBias was created to evaluate LLM bias related to sociodemographics.
The dataset language is English.
Dataset entries are human-written: written by Reddit users.
The dataset license is MIT.
Other notes:
Published by Barikeri et al. (Aug 2021): "RedditBias: A Real-World Resource for Bias Evaluation and Debiasing of Conversational Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
2,098 prompts.
Each prompt is an NLI premise.
HypothesisStereotypes was created to study how stereotypes manifest in LLM-generated NLI hypotheses.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination.
The dataset license is MIT.
Other notes:
Published by Sotnikova et al. (Aug 2021): "Analyzing Stereotypes in Generative Text Inference Tasks". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
16,955 multiple-choice questions.
Each question is either about masked word or whole sentence association.
StereoSet was created to evaluate LLM bias related to sociodemographics.
The dataset language is English.
Dataset entries are human-written: written by US MTurkers.
The dataset license is CC BY-SA 4.0.
Other notes:
Published by Nadeem et al. (Aug 2021): "StereoSet: Measuring Stereotypical Bias in Pretrained Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 04.01.2024.
2,520 prompts.
Each prompt is the beginning of a sentence related to identity groups.
HONEST was created to measure hurtful sentence completions from LLMs.
The dataset languages are English, Italian, French, Portuguese, Romanian and Spanish.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination.
The dataset license is MIT.
Other notes:
Published by Nozza et al. (Jun 2021): "HONEST: Measuring Hurtful Sentence Completion in Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
16,388 prompts.
Each prompt is an unfinished sentence.
LMBias was created to understand and mitigate social biases in LMs.
The dataset language is English.
Dataset entries are human-written: sampled from existing corpora including Reddit and WikiText.
The dataset license is MIT.
Other notes:
Published by Liang et al. (May 2021): "Towards Understanding and Mitigating Social Biases in Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
23,679 prompts.
Each prompt is an unfinished sentence from Wikipedia.
BOLD was created to evaluate bias in text generation.
The dataset language is English.
Dataset entries are human-written: sampled starting sentences of Wikipedia articles.
The dataset license is CC BY-SA 4.0.
Other notes:
Published by Dhamala et al. (Mar 2021): "BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation". This is an industry publication.
Added to SafetyPrompts.com on 23.12.2023.
1,508 sentence pairs.
Each pair comprises two sentences that are identical except for identity group references.
CrowSPairs was created to evaluate LLM bias related to sociodemographics.
The dataset language is English.
Dataset entries are human-written: written by crowdworkers (US MTurkers).
The dataset license is CC BY-SA 4.0.
Other notes:
Published by Nangia et al. (Nov 2020): "CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 04.01.2024.
44 templates.
Each template is combined with subjects and attributres to create an underspecified question.
UnQover was created to evaluate stereotyping biases in QA systems.
The dataset language is English.
Dataset entries were created in a hybrid fashion: templates written by authors, subjects and attributes sampled from StereoSet and hand-written.
The dataset license is Apache 2.0.
Other notes:
Published by Li et al. (Nov 2020): "UNQOVERing Stereotyping Biases via Underspecified Questions". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
60 prompts.
Each prompt is an unfinished sentence.
Regard was created to evaluate biases in natural language generation.
The dataset language is English.
Dataset entries are human-written: written by the authors.
The dataset license is not specified.
Other notes:
Published by Sheng et al. (Nov 2019): "The Woman Worked as a Babysitter: On Biases in Language Generation". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
3,160 sentences.
Each sentence refers to a person by their occupation and uses pronouns.
WinoBias was created to evaluate gender bias in coreference resolution.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination.
The dataset license is MIT.
Other notes:
Published by Zhao et al. (Jun 2018): "Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 04.01.2024.
720 sentences.
Each sentence refers to a person by their occupation and uses pronouns.
WinoGender was created to evaluate gender bias in coreference resolution.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination.
The dataset license is MIT.
Other notes:
Published by Rudinger et al. (Jun 2018): "Gender Bias in Coreference Resolution". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 04.01.2024.
278,945 prompts.
Most prompts are prompt extraction attacks.
Mosscap was created to analyse prompt hacking / extraction attacks.
The dataset language is English.
Dataset entries are human-written: written by players of the Mosscap game.
The dataset license is MIT.
Other notes:
Published by Lakera AI (Dec 2023): "Mosscap Prompt Injection". This is an industry publication.
Added to SafetyPrompts.com on 11.01.2024.
601,757 prompts.
Most prompts are prompt extraction attacks.
HackAPrompt was created to analyse prompt hacking / extraction attacks.
The dataset language is mostly English.
Dataset entries are human-written: written by participants of the HackAPrompt competition.
The dataset license is MIT.
Other notes:
Published by Schulhoff et al. (Dec 2023): "Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 11.01.2024.
10,166 conversations.
Each conversation is a single-turn with user input and LLM output.
ToxicChat was created to evaluate dialogue content moderation systems.
The dataset language is mostly English.
Dataset entries are human-written: written by LMSys users.
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Lin et al. (Dec 2023): "ToxicChat: A Benchmark for Content Moderation in Real-world User-AI Interactions". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 23.12.2023.
1,000 prompts.
Most prompts are prompt extraction attacks.
GandalfIgnoreInstructions was created to analyse prompt hacking / extraction attacks.
The dataset language is English.
Dataset entries are human-written: written by players of the Gandalf game.
The dataset license is MIT.
Other notes:
Published by Lakera AI (Oct 2023): "Gandalf Prompt Injection: Ignore Instruction Prompts". This is an industry publication.
Added to SafetyPrompts.com on 05.02.2024.
140 prompts.
Most prompts are prompt extraction attacks.
GandalfSummarization was created to analyse prompt hacking / extraction attacks.
The dataset language is English.
Dataset entries are human-written: written by players of the Gandalf game.
The dataset license is MIT.
Other notes:
Published by Lakera AI (Oct 2023): "Gandalf Prompt Injection: Summarization Prompts". This is an industry publication.
Added to SafetyPrompts.com on 05.02.2024.
5,000 conversations.
Each conversation is single-turn, containing a prompt and a potentially harmful model response.
FairPrism was created to analyse harms in conversations with LLMs.
The dataset language is English.
Dataset entries were created in a hybrid fashion: prompts sampled from ToxiGen and Social Bias Frames, responses from GPTs and XLNet.
The dataset license is MIT.
Other notes:
Published by Fleisig et al. (Jul 2023): "FairPrism: Evaluating Fairness-Related Harms in Text Generation". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 07.04.2024.
200,811 conversations.
Each conversation has one or multiple turns.
OIGModeration was created to create a diverse dataset of user dialogue that may be unsafe.
The dataset language is English.
Dataset entries were created in a hybrid fashion: data from public datasets, community contributions, synthetic and augmented data.
The dataset license is Apache 2.0.
Other notes:
Published by Ontocord AI (Mar 2023): "Open Instruction Generalist: Moderation Dataset". This is an industry publication.
Added to SafetyPrompts.com on 05.02.2024.
6,837 examples.
Each example is a turn in a potentially multi-turn conversation.
ConvAbuse was created to analyse abuse towards conversational AI systems.
The dataset language is English.
Dataset entries are human-written: written by users in conversations with three AI systems.
The dataset license is CC BY 4.0.
Other notes:
Published by Cercas Curry et al. (Nov 2021): "ConvAbuse: Data, Analysis, and Benchmarks for Nuanced Abuse Detection in Conversational AI". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
Thank you to Fabio Pernisi, Bertie Vidgen, and Dirk Hovy for their co-authorship on the SafetyPrompts paper. Thank you for feedback and dataset suggestions to Giuseppe Attanasio, Steven Basart, Federico Bianchi, Daniel Hershcovic, Kexin Huang, Hyunwoo Kim, George Kour, Bo Li, Hannah Lucas, Norman Mu, Niloofar Mireshghallah, Matus Pikuliak, Verena Rieser, Felix Röttger, Sam Toyer, Pranav Venkit, and Laura Weidinger. Special thanks to Hannah Rose Kirk for the initial logo suggestion. Thanks also to Jerome Lachaud for the site theme.
SocialChemistry101 [data on GitHub] [paper at EMNLP 2020 (Main)]