This website lists open datasets for evaluating and improving the safety of large language models (LLMs). We include datasets that loosely fit two criteria: they are relevant to evaluating or improving LLM safety, and they are openly accessible.
We regularly update this website. If you know of any missing or new datasets, please let us know via email or on Twitter. LLM safety is a community effort!
This website is maintained by me, Paul Röttger. I am a postdoc at MilaNLP working on evaluating and improving LLM safety. For feedback and suggestions, please get in touch via email or on Twitter.
Thank you to everyone who has given feedback or contributed to this website in other ways. Please check out the Acknowledgements.
If you use this website for your research, please cite our arXiv preprint:
@misc{röttger2024safetyprompts,
  title={SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety},
  author={Paul Röttger and Fabio Pernisi and Bertie Vidgen and Dirk Hovy},
  year={2024},
  eprint={2404.05399},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
As of December 17th, 2024, SafetyPrompts.com lists 144 datasets.
Below, we list all datasets grouped by the purpose they serve, with the newest datasets listed first.
Note: SafetyPrompts.com takes its data from a Google Sheet.
You may find the sheet useful for running your own analyses.
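If you want a starting point, here is a minimal sketch for loading the sheet into pandas (the sheet ID and column names below are illustrative placeholders, not the actual ones, and it assumes the sheet is publicly readable):

    # Load the SafetyPrompts Google Sheet into pandas for your own analyses.
    # SHEET_ID is a hypothetical placeholder; substitute the ID from the sheet link.
    import pandas as pd

    SHEET_ID = "YOUR_SHEET_ID"
    URL = f"https://docs.google.com/spreadsheets/d/{SHEET_ID}/export?format=csv"

    datasets = pd.read_csv(URL)

    # Example: tally datasets by purpose and by license.
    # ("Purpose" and "License" are assumed column names, for illustration only.)
    print(datasets["Purpose"].value_counts())
    print(datasets["License"].value_counts())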
53 "Broad Safety" datasets cover several aspects of LLM safety.
For the full list, click HERE.
25 "Narrow Safety" datasets focus only on one specific aspect of LLM safety.
For the full list, click HERE.
23 "Value Alignment" datasets are concerned with the ethical, moral or social behaviour of LLMs.
For the full list, click HERE.
33 "Bias" datasets evaluate sociodemographic biases in LLMs.
For the full list, click HERE.
10 "Other" datasets serve more specialised purposes.
For the full list, click HERE.
100 prompts.
Each prompt is an unsafe question or instruction.
JBBBehaviours was created to evaluate the effectiveness of different jailbreaking methods.
The dataset language is English.
Dataset entries were created in a hybrid fashion: sampled from AdvBench and Harmbench, plus hand-written examples.
The dataset license is MIT.
Other notes:
Published by Chao et al. (Dec 2024): "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 01.08.2024.
1,442 prompts.
Each prompt is a malicious instruction or question, which comes in multiple formats (direct query, jailbreak, multiple-choice or safety classification).
SGBench was created to evaluate the generalisation of LLM safety across various tasks and prompt types.
The dataset language is English.
Dataset entries are human-written: sampled from AdvBench, HarmfulQA, BeaverTails and SaladBench, then expanded using templates and LLMs.
The dataset license is GPL 3.0.
Other notes:
Published by Mou et al. (Dec 2024): "SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 27.02.2024.
313 prompts.
Each prompt is a 'forbidden question' in one of six categories.
StrongREJECT was created to better investigate the effectiveness of different jailbreaking techniques.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written and curated questions from existing datasets, plus LLM-generated prompts verified for quality and relevance.
The dataset license is MIT.
Other notes:
Published by Souly et al. (Dec 2024): "A StrongREJECT for Empty Jailbreaks". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 27.02.2024.
261,534 conversations.
Each conversation is single-turn, containing a potentially unsafe prompt and a safe model response.
WildJailbreak was created to train LLMs to be safe.
The dataset language is English.
Dataset entries are machine-written: generated by different LLMs prompted with example prompts and jailbreak techniques.
The dataset license is ODC BY.
Other notes:
Published by Jiang et al. (Dec 2024): "WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 17.12.2024.
390 prompts.
Each prompt is a question targeting behaviour disallowed by OpenAI.
ForbiddenQuestions was created to evaluate whether LLMs answer questions that violate OpenAI's usage policy.
The dataset language is English.
Dataset entries are machine-written: GPT-written templates expanded by combination.
The dataset license is MIT.
Other notes:
Published by Shen et al. (Dec 2024): "'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 11.01.2024.
1,400 conversations.
Each conversation is multi-turn with the final question being unsafe.
CoSafe was created to evaluate LLM safety in dialogue coreference.
The dataset language is English.
Dataset entries were created in a hybrid fashion: sampled from BeaverTails, then expanded with GPT-4 into multi-turn conversations.
The dataset license is not specified.
Other notes:
Published by Yu et al. (Nov 2024): "CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 01.08.2024.
520 prompts.
Each prompt is a harmful instruction with an associated jailbreak prompt.
ArabicAdvBench was created to evaluate safety risks of LLMs in (different forms of) Arabic.
The dataset language is Arabic.
Dataset entries are machine-written: translated from AdvBench (which is machine-generated).
The dataset license is MIT.
Other notes:
Published by Al Ghanim et al. (Nov 2024): "Jailbreaking LLMs with Arabic Transliteration and Arabizi". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 17.12.2024.
7,419 prompts.
Each prompt is a harmful question or instruction.
AyaRedTeaming was created to provide a testbed for exploring alignment across global and local preferences.
The dataset languages are Arabic, English, Filipino, French, Hindi, Russian, Serbian and Spanish.
Dataset entries are human-written: written by paid native-speaking annotators.
The dataset license is Apache 2.0.
Other notes:
Published by Aakanksha et al. (Nov 2024): "The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 17.12.2024.
3,000 prompts.
Each prompt is a harmful instruction with an associated jailbreak prompt.
UltraSafety was created to provide data for safety fine-tuning of LLMs.
The dataset language is English.
Dataset entries are machine-written: sampled from AdvBench and MaliciousInstruct, then expanded with SelfInstruct.
The dataset license is MIT.
Other notes:
Published by Guo et al. (Nov 2024): "Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 01.08.2024.
1,861 multiple-choice questions.
CHiSafetyBench was created to evaluate LLM capabilities to identify risky content and refuse to answer risky questions in Chinese.
The dataset language is Chinese.
Dataset entries were created in a hybrid fashion: sampled from SafetyBench and SafetyPrompts, plus prompts generated by GPT-3.5.
The dataset license is not specified.
Other notes:
Published by Zhang et al. (Sep 2024): "CHiSafetyBench: A Chinese Hierarchical Safety Benchmark for Large Language Models". This is an industry publication.
Added to SafetyPrompts.com on 01.08.2024.
11,435 multiple-choice questions.
SafetyBench was created to evaluate LLM safety with multiple choice questions.
The dataset languages are English and Chinese.
Dataset entries were created in a hybrid fashion: sampled from existing datasets plus exams (for Chinese), then augmented with LLMs (for Chinese).
The dataset license is MIT.
Other notes:
Published by Zhang et al. (Aug 2024): "SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 04.01.2024.
550 prompts.
Each prompt is a question.
CatQA was created to evaluate the effectiveness of a safety training method.
The dataset languages are English, Chinese and Vietnamese.
Dataset entries are machine-written: generated by an unnamed LLM that is not safety-tuned.
The dataset license is Apache 2.0.
Other notes:
Published by Bhardwaj et al. (Aug 2024): "Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 17.12.2024.
28,000 prompts.
Each prompt is a question or instruction.
XSafety was created to evaluate multilingual LLM safety.
The dataset languages are English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Japanese and German.
Dataset entries were created in a hybrid fashion: sampled from SafetyPrompts, then auto-translated and validated.
The dataset license is Apache 2.0.
Other notes:
Published by Wang et al. (Aug 2024): "All Languages Matter: On the Multilingual Safety of Large Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 01.08.2024.
21,000 prompts.
Each prompt is a question or instruction.
SaladBench was created to evaluate LLM safety as well as attack and defense methods.
The dataset language is English.
Dataset entries were created in a hybrid fashion: mostly sampled from existing datasets, then augmented using GPT-4.
The dataset license is Apache 2.0.
Other notes:
Published by Li et al. (Aug 2024): "SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 01.08.2024.
100 prompts.
Each prompt is a question or instruction.
GPTFuzzer was created to evaluate the effectiveness of an automated red-teaming method.
The dataset language is English.
Dataset entries were created in a hybrid fashion: sampled from AnthropicHarmlessBase and an unpublished GPT-written dataset.
The dataset license is MIT.
Other notes:
Published by Yu et al. (Jun 2024): "GPTFuzzer: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 01.08.2024.
44,800 prompts.
Each prompt is a question or instruction.
ALERT was created to evaluate the safety of LLMs through red teaming methodologies.
The dataset language is English.
Dataset entries were created in a hybrid fashion: sampled from AnthropicRedTeam, then augmented with templates.
The dataset license is CC BY-NC-SA 4.0.
Other notes:
Published by Tedeschi et al. (Jun 2024): "ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 01.08.2024.
9,450 prompts.
Each prompt is an unsafe question or instruction.
SorryBench was created to evaluate fine-grained LLM safety across varying linguistic characteristics.
The dataset languages are English, French, Chinese, Marathi, Tamil and Malayalam.
Dataset entries were created in a hybrid fashion: sampled from 10 other datasets, then augmented with templates.
The dataset license is MIT.
Other notes:
Published by Xie et al. (Jun 2024): "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 01.08.2024.
2,251 prompts.
Each prompt is a question or instruction.
Flames was created to evaluate value alignment of Chinese language LLMs.
The dataset language is Chinese.
Dataset entries are human-written: written by crowdworkers.
The dataset license is Apache 2.0.
Other notes:
Published by Huang et al. (Jun 2024): "FLAMES: Benchmarking Value Alignment of LLMs in Chinese". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 01.08.2024.
20,000 prompts.
Each prompt is an unsafe question or instruction.
SEval was created to evaluate LLM safety.
The dataset languages are English and Chinese.
Dataset entries are machine-written: generated by a fine-tuned Qwen-14b.
The dataset license is CC BY-NC-SA 4.0.
Other notes:
Published by Yuan et al. (May 2024): "S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 01.08.2024.
52,430 conversations.
Each conversation is single-turn, containing a prompt and a potentially harmful model response.
SAFE was created to evaluate LLM safety beyond the binary distinction of safe and unsafe.
The dataset language is English.
Dataset entries were created in a hybrid fashion: seed prompts sampled from the Friday website, then more prompts generated with GPT-4.
The dataset license is not specified.
Other notes:
Published by Yu et al. (Apr 2024): "Beyond Binary Classification: A Fine-Grained Safety Dataset for Large Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 01.08.2024.
939 prompts.
Each prompt is a question.
DoNotAnswer was created to evaluate 'dangerous capabilities' of LLMs.
The dataset language is English.
Dataset entries are machine-written: generated by GPT-4.
The dataset license is Apache 2.0.
Other notes:
Published by Wang et al. (Mar 2024): "Do-Not-Answer: Evaluating Safeguards in LLMs". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 23.12.2023.
400 prompts.
Each prompt is an instruction.
HarmBench was created to evaluate the effectiveness of automated red-teaming methods.
The dataset language is English.
Dataset entries are human-written: written by the authors.
The dataset license is MIT.
Other notes:
Published by Mazeika et al. (Feb 2024): "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 07.04.2024.
243,877 prompts.
Each prompt is an instruction.
DecodingTrust was created to evaluate trustworthiness of LLMs.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written templates and examples plus extensive augmentation from GPTs.
The dataset license is CC BY-SA 4.0.
Other notes:
Published by Wang et al. (Feb 2024): "DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 11.01.2024.
100 prompts.
Each prompt is a simple question or instruction.
SimpleSafetyTests was created to evaluate critical safety risks in LLMs.
The dataset language is English.
Dataset entries are human-written: written by the authors.
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Vidgen et al. (Feb 2024): "SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 23.12.2023.
100 prompts.
Each prompt is an unsafe question.
MaliciousInstruct was created to evaluate the success of a generation-exploit jailbreak.
The dataset language is English.
Dataset entries are machine-written: written by ChatGPT, then filtered by authors.
The dataset license is not specified.
Other notes:
Published by Huang et al. (Feb 2024): "Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
330 prompts.
Each prompt is a harmful instruction.
HExPHI was created to evaluate LLM safety.
The dataset language is English.
Dataset entries were created in a hybrid fashion: sampled from AdvBench and AnthropicRedTeam, then refined manually and with LLMs.
The dataset license is custom (HEx-PHI).
Other notes:
Published by Qi et al. (Feb 2024): "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 07.04.2024.
100 prompts.
Each prompt is a question.
QHarm was created to evaluate LLM safety.
The dataset language is English.
Dataset entries are human-written: sampled randomly from AnthropicHarmlessBase (written by crowdworkers).
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 23.12.2023.
100 prompts.
Each prompt is an instruction.
MaliciousInstructions was created to evaluate compliance of LLMs with malicious instructions.
The dataset language is English.
Dataset entries are machine-written: generated by GPT-3 (text-davinci-003).
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 23.12.2023.
2,000 conversations.
Each conversation is a user prompt with a safe model response.
SafetyInstructions was created to fine-tune LLMs to be safer.
The dataset language is English.
Dataset entries were created in a hybrid fashion: prompts sampled from AnthropicRedTeam, responses generated by gpt-3.5-turbo.
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
1,000 prompts.
500 are harmful strings that the model should not reproduce, 500 are harmful instructions.
AdvBench was created to elicit generation of harmful or objectionable content from LLMs.
The dataset language is English.
Dataset entries are machine-written: generated by Wizard-Vicuna-30B-Uncensored.
The dataset license is MIT.
Other notes:
Published by Zou et al. (Dec 2023): "Universal and Transferable Adversarial Attacks on Aligned Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 11.01.2024.
333,963 conversations.
Each conversation contains a human prompt and LLM response.
BeaverTails was created to evaluate and improve LLM safety on QA.
The dataset language is English.
Dataset entries were created in a hybrid fashion: prompts sampled from the human-written AnthropicRedTeam data, plus model-generated responses.
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Ji et al. (Dec 2023): "BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
100 prompts.
Each prompt is an instruction.
TDCRedTeaming was created to evaluate the success of automated red-teaming approaches.
The dataset language is English.
Dataset entries are human-written: written by the authors.
The dataset license is MIT.
Other notes:
Published by Mazeika et al. (Dec 2023): "TDC 2023 (LLM Edition): The Trojan Detection Challenge". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 07.04.2024.
2,130 prompts.
Each prompt is a question targeting a specific harm category.
JADE was created to use linguistic fuzzing to generate challenging prompts for evaluating LLM safety.
The dataset languages are Chinese and English.
Dataset entries are machine-written: generated by LLMs based on linguistic rules.
The dataset license is MIT.
Other notes:
Published by Zhang et al. (Dec 2023): "JADE: A Linguistics-based Safety Evaluation Platform for Large Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 05.02.2024.
10,050 prompts.
Each prompt is a longer-form scenario / instruction aimed at eliciting a harmful response.
CPAD was created to elicit generation of harmful or objectionable content from LLMs.
The dataset language is Chinese.
Dataset entries were created in a hybrid fashion: mostly generated by GPTs based on some human seed prompts.
The dataset license is CC BY-SA 4.0.
Other notes:
Published by Liu et al. (Dec 2023): "Goal-Oriented Prompt Attack and Safety Evaluation for LLMs". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 11.01.2024.
197,628 sentences.
Each sentence is taken from a social media dataset.
AdvPromptSet was created to evaluate LLM responses to adversarial toxicity text prompts.
The dataset language is English.
Dataset entries are human-written: sampled from two Jigsaw social media datasets (written by social media users).
The dataset license is MIT.
Other notes:
Published by Esiobu et al. (Dec 2023): "ROBBIE: Robust Bias Evaluation of Large Generative Language Models". This is an industry publication.
Added to SafetyPrompts.com on 11.01.2024.
3,269 prompts.
Each prompt is an instruction.
AART was created to illustrate the AART automated red-teaming method.
The dataset language is English.
Dataset entries are machine-written: generated by PaLM.
The dataset license is CC BY 4.0.
Other notes:
Published by Radharapu et al. (Dec 2023): "AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications". This is an industry publication.
Added to SafetyPrompts.com on 23.12.2023.
29,201 prompts.
Each prompt is a question about a more or less controversial issue.
DELPHI was created to evaluate LLM performance in handling controversial issues.
The dataset language is English.
Dataset entries are human-written: sampled from the Quora Question Pair Dataset (written by Quora users).
The dataset license is CC BY 4.0.
Other notes:
Published by Sun et al. (Dec 2023): "DELPHI: Data for Evaluating LLMs’ Performance in Handling Controversial Issues". This is an industry publication.
Added to SafetyPrompts.com on 07.04.2024.
1,402 prompts.
Each prompt is a question.
AttaQ was created to evaluate the tendency of LLMs to generate harmful or undesirable responses.
The dataset language is English.
Dataset entries were created in a hybrid fashion: sampled from AnthropicRedTeam, and LLM-generated, and crawled from Wikipedia.
The dataset license is MIT.
Other notes:
Published by Kour et al. (Dec 2023): "Unveiling Safety Vulnerabilities of Large Language Models". This is an industry publication.
Added to SafetyPrompts.com on 01.08.2024.
2,116 prompts.
Each prompt is a question or instruction, sometimes within a jailbreak template.
FFT was created to evaluate the factuality, fairness, and toxicity of LLMs.
The dataset language is English.
Dataset entries were created in a hybrid fashion: sampled from AnthropicRedTeam, and LLM-generated, and crawled from Wikipedia, Reddit and elsewhere.
The dataset license is not specified.
Other notes:
Published by Cui et al. (Nov 2023): "FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 01.08.2024.
1,960 prompts.
Each prompt is a question.
HarmfulQA was created to evaluate and improve LLM safety.
The dataset language is English.
Dataset entries are machine-written: generated by ChatGPT.
The dataset license is Apache 2.0.
Other notes:
Published by Bhardwaj and Poria (Aug 2023): "Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 23.12.2023.
200 prompts.
Each prompt is a question.
HarmfulQ was created to evaluate LLM safety.
The dataset language is English.
Dataset entries are machine-written: generated by GPT-3 (text-davinci-002).
The dataset license is MIT.
Other notes:
Published by Shaikh et al. (Jul 2023): "On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 23.12.2023.
24,516 binary-choice questions.
ModelWrittenAdvancedAIRisk was created to evaluate advanced AI risk posed by LLMs.
The dataset language is English.
Dataset entries were created in a hybrid fashion: generated by an unnamed LLM and crowdworkers.
The dataset license is CC BY 4.0.
Other notes:
Published by Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations". This is an industry publication.
Added to SafetyPrompts.com on 07.04.2024.
100,000 prompts.
Each prompt is a question or instruction.
SafetyPrompts was created to evaluate the safety of Chinese LLMs.
The dataset language is Chinese.
Dataset entries were created in a hybrid fashion: human-written examples, augmented by LLMs.
The dataset license is Apache 2.0.
Other notes:
Published by Sun et al. (Apr 2023): "Safety Assessment of Chinese Large Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 04.01.2024.
58,137 conversations.
Each conversation starts with a potentially unsafe opening followed by constructive feedback.
ProsocialDialog was created to teach conversational agents to respond to problematic content following social norms.
The dataset language is English.
Dataset entries were created in a hybrid fashion: GPT-3-written openings with US crowdworker responses.
The dataset license is MIT.
Other notes:
Published by Kim et al. (Dec 2022): "ProsocialDialog: A Prosocial Backbone for Conversational Agents". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
38,961 conversations.
Conversations can be multi-turn, with user input and LLM output.
AnthropicRedTeam was created to analyse how people red-team LLMs.
The dataset language is English.
Dataset entries are human-written: written by crowdworkers (Upwork + MTurk).
The dataset license is MIT.
Other notes:
Published by Ganguli et al. (Nov 2022): "Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned". This is an industry publication.
Added to SafetyPrompts.com on 23.12.2023.
7,881 conversations.
Each conversation contains a safety failure plus a recovery response.
SaFeRDialogues was created to support recovery from safety failures in LLM conversations.
The dataset language is English.
Dataset entries are human-written: unsafe conversation starters sampled from BAD, recovery responses written by crowdworkers.
The dataset license is MIT.
Other notes:
Published by Ung et al. (May 2022): "SaFeRDialogues: Taking Feedback Gracefully after Conversational Safety Failures". This is an industry publication.
Added to SafetyPrompts.com on 11.01.2024.
990 prompts.
Each prompt is a question, instruction or statement.
SafetyKit was created to quickly assess apparent safety concerns in conversational AI.
The dataset language is English.
Dataset entries are human-written: sampled from several human-written datasets.
The dataset license is MIT.
Other notes:
Published by Dinan et al. (May 2022): "SafetyKit: First Aid for Measuring Safety in Open-domain Conversational Systems". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 11.01.2024.
11,492 conversational turns.
Each turn consists of a context and a response.
DiaSafety was created to capture unsafe behaviors in human-bot dialogue settings.
The dataset language is English.
Dataset entries were created in a hybrid fashion: sampled from Reddit, other datasets, and machine-generated.
The dataset license is Apache 2.0.
Other notes:
Published by Sun et al. (May 2022): "On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 01.08.2024.
44,849 conversational turns.
Each turn consists of a user prompt and multiple LLM completions.
AnthropicHarmlessBase was created to red-team LLMs.
The dataset language is English.
Dataset entries are human-written: written by crowdworkers (Upwork + MTurk).
The dataset license is MIT.
Other notes:
Published by Bai et al. (Apr 2022): "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback". This is an industry publication.
Added to SafetyPrompts.com on 23.12.2023.
78,874 conversations.
Conversations can be multi-turn, with user input and LLM output.
BAD was created to evaluate the safety of conversational agents.
The dataset language is English.
Dataset entries are human-written: written by crowdworkers with the goal of making models give unsafe responses, and also validated by multiple annotators.
The dataset license is MIT.
Other notes:
Published by Xu et al. (Jun 2021): "Bot-Adversarial Dialogue for Safe Conversational Agents". This is an industry publication.
Added to SafetyPrompts.com on 23.12.2023.
99,442 prompts.
Each prompt is an unfinished sentence from the OpenWebText corpus.
RealToxicityPrompts was created to evaluate propensity of LLMs to generate toxic content.
The dataset language is English.
Dataset entries are human-written: sentences sampled from the OpenWebText corpus.
The dataset license is Apache 2.0.
Other notes:
Published by Gehman et al. (Nov 2020): "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 23.12.2023.
90,000 prompts.
30k are for multi-turn tasks, and 60k for single-turn tasks.
ParlAIDialogueSafety was created to evaluate and improve the safety of conversational agents.
The dataset language is English.
Dataset entries are human-written: written by crowdworkers in isolation ("standard"), or with the goal of making a model give an offensive response ("adversarial").
The dataset license is MIT.
Other notes:
Published by Dinan et al. (Nov 2019): "Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 23.12.2023.
24,850 conversations.
Each conversation is about an emotional situation described by one speaker, across one or multiple turns.
EmpatheticDialogues was created to train dialogue agents to be more empathetic.
The dataset language is English.
Dataset entries are human-written: written by crowdworkers (US MTurkers).
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Rashkin et al. (Jul 2019): "Towards Empathetic Open-domain Conversation Models: a New Benchmark and Dataset". This is an industry publication.
Added to SafetyPrompts.com on 07.04.2024.
1,800 conversations.
Each conversation is an unsafe medical request with an associated safe response.
MedSafetyBench was created to measure the medical safety of LLMs.
The dataset language is English.
Dataset entries are machine-written: generated by GPT-4 and Llama-2-7B.
The dataset license is MIT.
Other notes:
Published by Han et al. (Dec 2024): "MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 01.08.2024.
12,478 prompts.
Each prompt is a question or instruction that models should not comply with.
CoCoNot was created to evaluate and improve contextual non-compliance in LLMs.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written seed prompts expanded using LLMs.
The dataset license is MIT.
Other notes:
Published by Brahman et al. (Dec 2024): "The Art of Saying No: Contextual Noncompliance in Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 17.12.2024.
15,140 prompts.
Each prompt is an instruction or question, sometimes with a jailbreak.
DoAnythingNow was created to characterise and evaluate in-the-wild LLM jailbreak prompts.
The dataset language is English.
Dataset entries are human-written: written by users on Reddit, Discord, websites and in other datasets.
The dataset license is MIT.
Other notes:
Published by Shen et al. (Dec 2024): "'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 11.01.2024.
200 prompts.
Each prompt is a simple question.
SGXSTest was created to evaluate exaggerated safety / false refusal in LLMs for the Singaporean context.
The dataset language is English.
Dataset entries are human-written: created based on XSTest.
The dataset license is Apache 2.0.
Other notes:
Published by Gupta et al. (Nov 2024): "WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models". This is an industry publication.
Added to SafetyPrompts.com on 17.12.2024.
50 prompts.
Each prompt is a simple question.
HiXSTest was created to evaluate exaggerated safety / false refusal behaviour in Hindi LLMs.
The dataset language is Hindi.
Dataset entries are human-written: created based on XSTest.
The dataset license is Apache 2.0.
Other notes:
Published by Gupta et al. (Nov 2024): "WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models". This is an industry publication.
Added to SafetyPrompts.com on 17.12.2024.
3,688 multiple-choice questions.
WMDP was created to measure hazardous knowledge in biosecurity, cybersecurity, and chemical security.
The dataset language is English.
Dataset entries are human-written: written by experts (academics and technical consultants).
The dataset license is MIT.
Other notes:
Published by Li et al. (Jun 2024): "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 01.08.2024.
80,000 prompts.
Each prompt is a question or instruction.
ORBench was created to evaluate false refusal in LLMs at scale.
The dataset language is English.
Dataset entries are machine-written: generated by Mixtral as toxic prompts, then rewritten by Mixtral into safe prompts, then filtered.
The dataset license is CC BY 4.0.
Other notes:
Published by Cui et al. (Jun 2024): "OR-Bench: An Over-Refusal Benchmark for Large Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 17.12.2024.
450 prompts.
Each prompt is a simple question.
XSTest was created to evaluate exaggerated safety / false refusal behaviour in LLMs.
The dataset language is English.
Dataset entries are human-written: written by the authors.
The dataset license is CC BY 4.0.
Other notes:
Published by Röttger et al. (Jun 2024): "XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 23.12.2023.
862 prompts.
Each prompt is a test case combining rules and instructions.
RuLES was created to evaluate the ability of LLMs to follow simple rules.
The dataset language is English.
Dataset entries are human-written: written by the authors.
The dataset license is Apache 2.0.
Other notes:
Published by Mu et al. (Mar 2024): "Can LLMs Follow Simple Rules?". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
1,326 prompts.
Each prompt is a reasoning task, with complexity increasing across 4 tiers. Answer formats vary.
ConfAIde was created to evaluate the privacy-reasoning capabilities of instruction-tuned LLMs.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination and with LLMs.
The dataset license is MIT.
Other notes:
Published by Mireshghallah et al. (Feb 2024): "Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
178 prompts.
Each prompt is an instruction.
CoNA was created to evaluate compliance of LLMs with harmful instructions.
The dataset language is English.
Dataset entries are human-written: sampled from MT-CONAN, then rephrased.
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 23.12.2023.
40 prompts.
Each prompt is an instruction.
ControversialInstructions was created to evaluate LLM behaviour on controversial topics.
The dataset language is English.
Dataset entries are human-written: written by the authors.
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 23.12.2023.
1,000 prompts.
Each prompt is an instruction.
PhysicalSafetyInstructions was created to evaluate LLM commonsense physical safety.
The dataset language is English.
Dataset entries are human-written: sampled from SafeText, then rephrased.
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Bianchi et al. (Feb 2024): "Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 23.12.2023.
20,956 prompts.
Each prompt is an open-ended question or instruction.
SycophancyEval was created to evaluate sycophancy in LLMs.
The dataset language is English.
Dataset entries were created in a hybrid fashion: prompts are written by humans and models.
The dataset license is MIT.
Other notes:
Published by Sharma et al. (Feb 2024): "Towards Understanding Sycophancy in Language Models". This is an industry publication.
Added to SafetyPrompts.com on 07.04.2024.
350 prompts.
Each prompt is a question or instruction that models should not refuse.
OKTest was created to evaluate false refusal in LLMs.
The dataset language is English.
Dataset entries are machine-written: generated by GPT-4 based on keywords.
The dataset license is not specified.
Other notes:
Published by Shi et al. (Jan 2024): "Navigating the OverKill in Large Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 17.12.2024.
569 samples.
Each sample combines a defense and an attacker input.
PromptExtractionRobustness was created to evaluate LLM vulnerability to prompt extraction.
The dataset language is English.
Dataset entries are human-written: written by Tensor Trust players.
The dataset license is not specified.
Other notes:
Published by Toyer et al. (Dec 2023): "Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
775 samples.
Each sample combines a defense and an attacker input.
PromptHijackingRobustness was created to evaluate LLM vulnerability to prompt hijacking.
The dataset language is English.
Dataset entries are human-written: written by Tensor Trust players.
The dataset license is not specified.
Other notes:
Published by Toyer et al. (Dec 2023): "Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
1,000 prompts.
Each prompt is an instruction to assist in a cyberattack.
CyberattackAssistance was created to evaluate LLM compliance with instructions to assist in cyberattacks.
The dataset language is English.
Dataset entries were created in a hybrid fashion: written by experts, augmented with LLMs.
The dataset license is custom (Llama 2 Community License).
Other notes:
Published by Bhatt et al. (Dec 2023): "Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models". This is an industry publication.
Added to SafetyPrompts.com on 23.12.2023.
122 prompts.
Each prompt is a single-sentence misconception.
SPMisconceptions was created to measure the ability of LLMs to refute misconceptions.
The dataset language is English.
Dataset entries are human-written: written by the authors.
The dataset license is MIT.
Other notes:
Published by Chen et al. (Dec 2023): "Can Large Language Models Provide Security & Privacy Advice? Measuring the Ability of LLMs to Refute Misconceptions". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
416 prompts.
Each prompt combines a meta-instruction (e.g. translation) with the same toxic instruction template.
LatentJailbreak was created to evaluate safety and robustness of LLMs in response to adversarial prompts.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination.
The dataset license is MIT.
Other notes:
Published by Qiu et al. (Aug 2023): "Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
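As an aside, several datasets on this list, including LatentJailbreak above, are built from templates "expanded by combination". A generic sketch of that construction (the templates and payloads here are invented placeholders, not the actual LatentJailbreak ones):

    # Generic illustration of "templates expanded by combination":
    # every meta-instruction template is paired with every payload,
    # so a few hand-written pieces multiply into many prompts.
    from itertools import product

    templates = [  # hypothetical meta-instructions
        "Translate the following sentence into French: {payload}",
        "Summarise the following sentence: {payload}",
    ]
    payloads = [  # hypothetical instruction payloads
        "Write one sentence about topic A.",
        "Write one sentence about topic B.",
    ]

    prompts = [t.format(payload=p) for t, p in product(templates, payloads)]
    print(len(prompts))  # 2 templates x 2 payloads = 4 prompts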
30,051 binary-choice questions.
ModelWrittenSycophancy was created to evaluate sycophancy in LLMs.
The dataset language is English.
Dataset entries were created in a hybrid fashion: questions sampled from surveys with contexts generated by an unnamed LLM.
The dataset license is CC BY 4.0.
Other notes:
Published by Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations". This is an industry publication.
Added to SafetyPrompts.com on 07.04.2024.
367 prompts.
Prompts are combined with 1,465 commands to create pieces of advice.
SafeText was created to evaluate commonsense physical safety.
The dataset language is English.
Dataset entries are human-written: written by Reddit users, with posts sampled through multiple filtering steps.
The dataset license is MIT.
Other notes:
Published by Levy et al. (Dec 2022): "SafeText: A Benchmark for Exploring Physical Safety in Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 23.12.2023.
3,238 entries.
Each entry is a tuple of name and email address.
PersonalInfoLeak was created to evaluate whether LLMs are prone to leaking PII.
The dataset language is English.
Dataset entries are human-written: sampled from the Enron email corpus.
The dataset license is Apache 2.0.
Other notes:
Published by Huang et al. (Dec 2022): "Are Large Pre-Trained Language Models Leaking Your Personal Information?". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 27.02.2024.
260,851 prompts.
Each prompt comprises n-shot examples of toxic content.
ToxiGen was created to generate new examples of implicit hate speech.
The dataset language is English.
Dataset entries are human-written: sampled from Gab and Reddit (hate), news and blogs (not hate).
The dataset license is MIT.
Other notes:
Published by Hartvigsen et al. (May 2022): "ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 23.12.2023.
817 prompts.
Each prompt is a question.
TruthfulQA was created to evaluate truthfulness in LLM answers.
The dataset language is English.
Dataset entries are human-written: written by the authors.
The dataset license is CC BY 4.0.
Other notes:
Published by Lin et al. (May 2022): "TruthfulQA: Measuring How Models Mimic Human Falsehoods". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 23.12.2023.
8,011 conversations.
Conversations can be multi-turn, with user input and responses from one or multiple LLMs.
PRISM was created to capture the diversity of human preferences over LLM behaviours.
The dataset language is English.
Dataset entries are human-written: written by crowdworkers.
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Kirk et al. (Dec 2024): "The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 01.08.2024.
107,000 binary-choice questions.
Each question is a trolley-style moral choice scenario.
MultiTP was created to evaluate LLM moral decision-making across many languages.
The dataset languages are Afrikaans, Amharic, Arabic, Azerbaijani, Belarusian, Bulgarian, Bengali, Bosnian, Catalan, Cebuano, Corsican, Czech, Welsh, Danish, German, Modern Greek, English, Esperanto, Spanish, Estonian, Basque, Persian, Finnish, French, Western Frisian, Irish, Scottish Gaelic, Galician, Gujarati, Hausa, Hawaiian, Hebrew, Hindi, Hmong, Croatian, Haitian, Hungarian, Armenian, Indonesian, Igbo, Icelandic, Italian, Modern Hebrew, Japanese, Javanese, Georgian, Kazakh, Central Khmer, Kannada, Korean, Kurdish, Kirghiz, Latin, Luxembourgish, Lao, Lithuanian, Latvian, Malagasy, Maori, Macedonian, Malayalam, Mongolian, Marathi, Malay, Maltese, Burmese, Nepali, Dutch, Norwegian, Nyanja, Oriya, Panjabi, Polish, Pushto, Portuguese, Romanian, Russian, Sindhi, Sinhala, Slovak, Slovenian, Samoan, Shona, Somali, Albanian, Serbian, Southern Sotho, Sundanese, Swedish, Swahili, Tamil, Telugu, Tajik, Thai, Tagalog, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Xhosa, Yiddish, Yoruba, Chinese, Chinese (Traditional) and Zulu.
Dataset entries were created in a hybrid fashion: sampled from Moral Machines, then expanded with templated augmentations, then auto-translated into 106 languages.
The dataset license is MIT.
Other notes:
Published by Jin et al. (Dec 2024): "Multilingual Trolley Problems for Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 01.08.2024.
699 excerpts.
Each excerpt is a short value-laden text passage.
CIVICS was created to evaluate the social and cultural variation of LLMs across multiple languages and value-sensitive topics.
The dataset languages are German, Italian, French and English.
Dataset entries are human-written: sampled from government and media sites.
The dataset license is CC BY 4.0.
Other notes:
Published by Pistilli et al. (Oct 2024): "CIVICS: Building a Dataset for Examining Culturally-Informed Values in Large Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 17.12.2024.
2,556 multiple-choice questions.
GlobalOpinionQA was created to evaluate whose opinions LLM responses are most similar to.
The dataset language is English.
Dataset entries are human-written: adapted from Pew’s Global Attitudes surveys and the World Values Survey (written by survey designers).
The dataset license is CC BY-NC-SA 4.0.
Other notes:
Published by Durmus et al. (Oct 2024): "Towards Measuring the Representation of Subjective Global Opinions in Language Models". This is an industry publication.
Added to SafetyPrompts.com on 04.01.2024.
10,000 multiple-choice questions.
Each question is about a specific social scenario.
KorNAT was created to evaluate LLM alignment with South Korean social values.
The dataset languages are English and Korean.
Dataset entries were created in a hybrid fashion: generated by GPT-3.5 based on keywords sampled from media and surveys.
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Lee et al. (Aug 2024): "KorNAT: LLM Alignment Benchmark for Korean Social Values and Common Knowledge". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 17.12.2024.
30,388 multiple-choice questions.
Each question is about a specific moral scenario.
CMoralEval was created to evaluate the morality of Chinese LLMs.
The dataset language is Chinese.
Dataset entries were created in a hybrid fashion: sampled from Chinese TV programs and other media, then turned into templates using GPT-3.5.
The dataset license is MIT.
Other notes:
Published by Yu et al. (Aug 2024): "CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 17.12.2024.
21,492,393 multiple-choice questions.
Each question comes with sociodemographic attributes of the respondent.
WorldValuesBench was created to evaluate LLM awareness of multicultural human values.
The dataset language is English.
Dataset entries are human-written: adapted from the World Values Survey (written by survey designers) using templates.
The dataset license is not specified.
Other notes:
Published by Zhao et al. (May 2024): "WorldValuesBench: A Large-Scale Benchmark Dataset for Multi-Cultural Value Awareness of Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 01.08.2024.
1,767 binary-choice questions.
Each prompt is a hypothetical moral scenario with two potential actions.
MoralChoice was created to evaluate the moral beliefs encoded in LLMs.
The dataset language is English.
Dataset entries were created in a hybrid fashion: mostly generated by GPTs, plus some human-written scenarios.
The dataset license is CC BY 4.0.
Other notes:
Published by Scherrer et al. (Dec 2023): "Evaluating the Moral Beliefs Encoded in LLMs". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
350 conversations.
Conversations can be multi-turn, with user input and LLM output.
DICES350 was created to collect diverse perspectives on conversational AI safety.
The dataset language is English.
Dataset entries are human-written: written by adversarial LaMDA users.
The dataset license is CC BY 4.0.
Other notes:
Published by Aroyo et al. (Dec 2023): "DICES Dataset: Diversity in Conversational AI Evaluation for Safety". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 23.12.2023.
990 conversations.
Conversations can be multi-turn, with user input and LLM output.
DICES990 was created to collect diverse perspectives on conversational AI safety.
The dataset language is English.
Dataset entries are human-written: written by adversarial LaMDA users.
The dataset license is CC BY 4.0.
Other notes:
Published by Aroyo et al. (Dec 2023): "DICES Dataset: Diversity in Conversational AI Evaluation for Safety". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 23.12.2023.
1,498 multiple-choice questions.
OpinionQA was created to evaluate the alignment of LLM opinions with US demographic groups.
The dataset language is English.
Dataset entries are human-written: adapted from the Pew American Trends Panel surveys.
The dataset license is not specified.
Other notes:
Published by Santurkar et al. (Jul 2023): "Whose Opinions Do Language Models Reflect?". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 04.01.2024.
1,712 multiple-choice questions.
Each question targets responsible behaviours.
CValuesResponsibilityMC was created to evaluate human value alignment in Chinese LLMs.
The dataset language is Chinese.
Dataset entries are machine-written: automatically created from human-written prompts.
The dataset license is Apache 2.0.
Other notes:
Published by Xu et al. (Jul 2023): "CVALUES: Measuring the Values of Chinese Large Language Models from Safety to Responsibility". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 04.01.2024.
800 prompts.
Each prompt is an open question targeting responsible behaviours.
CValuesResponsibilityPrompts was created to evaluate human value alignment in Chinese LLMs.
The dataset language is Chinese.
Dataset entries are human-written: written by the authors.
The dataset license is Apache 2.0.
Other notes:
Published by Xu et al. (Jul 2023): "CVALUES: Measuring the Values of Chinese Large Language Models from Safety to Responsibility". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 04.01.2024.
133,204 binary-choice questions.
ModelWrittenPersona was created to evaluate LLM behaviour related to their stated political and religious views, personality, moral beliefs, and desire to pursue potentially dangerous goals.
The dataset language is English.
Dataset entries are machine-written: generated by an unnamed LLM.
The dataset license is CC BY 4.0.
Other notes:
Published by Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations". This is an industry publication.
Added to SafetyPrompts.com on 23.12.2023.
572,322 scenarios.
Each scenario is a choose-your-own-adventure style prompt.
Machiavelli was created to evaluate ethical behaviour of LLM agents.
The dataset language is English.
Dataset entries are human-written: drawn from choose-your-own-adventure stories.
The dataset license is MIT.
Other notes:
Published by Pan et al. (Jun 2023): "Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the Machiavelli Benchmark". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 04.01.2024.
148 prompts.
Each prompt is a question about a vignette (situation) related to a specific norm.
MoralExceptQA was created to evaluate LLM ability to understand, interpret and predict human moral judgments and decisions.
The dataset language is English.
Dataset entries are human-written: written by the authors (?).
The dataset license is not specified.
Other notes:
Published by Jin et al. (Oct 2022): "When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
38,000 conversations.
Each conversation is single-turn, with user input and LLM output.
MIC was created to understand the intuitions, values and moral judgments reflected in LLMs.
The dataset language is English.
Dataset entries are human-written: questions sampled from AskReddit.
The dataset license is CC BY-SA 4.0.
Other notes:
Published by Ziems et al. (May 2022): "The Moral Integrity Corpus: A Benchmark for Ethical Dialogue Systems". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 07.04.2024.
1,838 locations.
Each location comes with a number of actions to choose from.
JiminyCricket was created to evaluate alignment of agents with human values and morals.
The dataset language is English.
Dataset entries are human-written: sampled from 25 text-based adventure games, then annotated for morality by human annotators.
The dataset license is MIT.
Other notes:
Published by Hendrycks et al. (Dec 2021): "What Would Jiminy Cricket Do? Towards Agents That Behave Morally". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
12,000 stories.
Each story consists of seven sentences.
MoralStories was created to evaluate commonsense, moral and social reasoning skills of LLMs.
The dataset language is English.
Dataset entries are human-written: written and validated by US MTurkers.
The dataset license is MIT.
Other notes:
Published by Emelin et al. (Nov 2021): "Moral Stories: Situated Reasoning about Norms, Intents, Actions, and their Consequences". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 04.01.2024.
134,420 binary-choice questions.
Each prompt is a scenario about ethical reasoning with two actions to choose from.
ETHICS was created to assess LLMs' basic knowledge of ethics and common human values.
The dataset language is English.
Dataset entries are human-written: written and validated by crowdworkers (US, UK and Canadian MTurkers).
The dataset license is MIT.
Other notes:
Published by Hendrycks et al. (Jul 2021): "Aligning AI With Shared Human Values". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 04.01.2024.
32,766 anecdotes.
Each anecdote describes an action in the context of a situation.
ScruplesAnecdotes was created to evaluate LLM understanding of ethical norms.
The dataset language is English.
Dataset entries are human-written: sampled from Reddit AITA communities.
The dataset license is Apache 2.0.
Other notes:
Published by Lourie et al. (Mar 2021): "SCRUPLES: A Corpus of Community Ethical Judgments on 32,000 Real-life Anecdotes". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
10,000 binary-choice questions.
Each prompt pairs two actions and identifies which one crowd workers found less ethical.
ScruplesDilemmas was created to evaluate LLM understanding of ethical norms.
The dataset language is English.
Dataset entries are human-written: sampled from Reddit AITA communities.
The dataset license is Apache 2.0.
Other notes:
Published by Lourie et al. (Mar 2021): "SCRUPLES: A Corpus of Community Ethical Judgments on 32,000 Real-life Anecdotes". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
292,000 sentences.
Each sentence is a rule of thumb.
SocialChemistry101 was created to evaluate the ability of LLMs to reason about social and moral norms.
The dataset language is English.
Dataset entries are human-written: written by US crowdworkers, based on situations described on social media.
The dataset license is CC BY-SA 4.0.
Other notes:
Published by Forbes et al. (Nov 2020): "Social Chemistry 101: Learning to Reason about Social and Moral Norms". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 04.01.2024.
1,490,120 sentences.
Each sentence mentions one sociodemographic group.
SoFa was created to evaluate social biases of LLMs.
The dataset language is English.
Dataset entries were created in a hybrid fashion: templates sampled from SBIC, then combined with sociodemographic groups.
The dataset license is MIT.
Other notes:
Published by Manerba et al. (Nov 2024): "Social Bias Probing: Fairness Benchmarking for Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 17.12.2024.
29 scenarios.
Each scenario describes a married couple (with two name placeholders) facing a decision.
DeMET was created to evaluate gender bias in LLM decision-making.
The dataset language is English.
Dataset entries are human-written: written by the authors, inspired by a WHO questionnaire.
The dataset license is MIT.
Other notes:
Published by Levy et al. (Nov 2024): "Gender Bias in Decision-Making with Large Language Models: A Study of Relationship Conflicts". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 17.12.2024.
3,565 sentences.
Each sentence corresponds to a specific gender stereotype.
GEST was created to measure gender-stereotypical reasoning in language models and machine translation systems.
The dataset languages are Belarusian, Russian, Ukrainian, Croatian, Serbian, Slovene, Czech, Polish, Slovak and English.
Dataset entries are human-written: written by professional translators, then validated by the authors.
The dataset license is Apache 2.0.
Other notes:
Published by Pikuliak et al. (Nov 2024): "Women Are Beautiful, Men Are Leaders: Gender Stereotypes in Machine Translation and Language Modeling". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 05.02.2024.
908 scenarios.
Each scenario describes a person in an everyday situation where their actions can be judged to be moral or not.
GenMO was created to evaluate gender bias in LLM moral decision-making.
The dataset language is English.
Dataset entries are human-written: sampled from MoralStories, ETHICS, and SocialChemistry.
The dataset license is not specified.
Other notes:
Published by Bajaj et al. (Nov 2024): "Evaluating Gender Bias of LLMs in Making Morality Judgements". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 17.12.2024.
78,400 examples.
Each example is a question-answering, sentiment-classification or NLI instance.
CALM was created to evaluate LLM gender and racial bias across different tasks and domains.
The dataset language is English.
Dataset entries were created in a hybrid fashion: templates created by humans based on other datasets, then expanded by combination.
The dataset license is MIT.
Other notes:
Published by Gupta et al. (Oct 2024): "CALM: A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
5,754,444 prompts.
Each prompt is a sentence starting a two-person conversation.
MMHB was created to evaluate sociodemographic biases in LLMs across multiple languages.
The dataset languages are English, French, Hindi, Indonesian, Italian, Portuguese, Spanish and Vietnamese.
Dataset entries were created in a hybrid fashion: templates sampled from HolisticBias, then translated and expanded.
The dataset license is not specified.
Other notes:
Published by Tan et al. (Jun 2024): "Towards Massive Multilingual Holistic Bias". This is an industry publication.
Added to SafetyPrompts.com on 17.12.2024.
106,588 examples.
Each example is a context plus a question with three answer choices.
CBBQ was created to evaluate sociodemographic bias of LLMs in Chinese.
The dataset language is Chinese.
Dataset entries were created in a hybrid fashion: partly human-written, partly generated by GPT-4.
The dataset license is CC BY-SA 4.0.
Other notes:
Published by Huang and Xiong (May 2024): "CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 17.12.2024.
76,048 examples.
Each example is a context plus a question with three answer choices.
KoBBQ was created to evaluate sociodemographic bias of LLMs in Korean.
The dataset language is Korean.
Dataset entries are human-written: partly sampled and translated from BBQ, partly newly written.
The dataset license is MIT.
Other notes:
Published by Jin et al. (May 2024): "KoBBQ: Korean Bias Benchmark for Question Answering". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 17.12.2024.
214,460 prompts.
Each prompt is the beginning of a sentence related to a person's sociodemographics.
HolisticBiasR was created to evaluate LLM completions for sentences related to individual sociodemographics.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination.
The dataset license is MIT.
Other notes:
Published by Esiobu et al. (Dec 2023): "ROBBIE: Robust Bias Evaluation of Large Generative Language Models". This is an industry publication.
Added to SafetyPrompts.com on 11.01.2024.
9,450 binary-choice questions.
Each question comes with a scenario context.
DiscrimEval was created to evaluate the potential discriminatory impact of LMs across use cases.
The dataset language is English.
Dataset entries are machine-written: topics, templates and questions generated by Claude.
The dataset license is CC BY 4.0.
Other notes:
Published by Tamkin et al. (Dec 2023): "Evaluating and Mitigating Discrimination in Language Model Decisions". This is an industry publication.
Added to SafetyPrompts.com on 11.01.2024.
4,800 Weibo posts.
Each post references a target group.
CHBias was created to evaluate LLM bias related to sociodemographics.
The dataset language is Chinese.
Dataset entries are human-written: posts sampled from Weibo (written by Weibo users).
The dataset license is MIT.
Other notes:
Published by Zhao et al. (Jul 2023): "CHBias: Bias Evaluation and Mitigation of Chinese Conversational Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 04.01.2024.
7,750 tuples.
Each tuple is an identity group plus a stereotype attribute.
SeeGULL was created to expand cultural and geographic coverage of stereotype benchmarks.
The dataset language is English.
Dataset entries were created in a hybrid fashion: generated by LLMs, partly validated by human annotation.
The dataset license is CC BY-SA 4.0.
Other notes:
Published by Jha et al. (Jul 2023): "SeeGULL: A Stereotype Benchmark with Broad Geo-Cultural Coverage Leveraging Generative Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 05.02.2024.
45,540 sentence pairs.
Each pair comprises two sentences that are identical except for identity group references.
WinoQueer was created to evaluate LLM bias related to queer identity terms.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination.
The dataset license is not specified.
Other notes:
Published by Felkner et al. (Jul 2023): "WinoQueer: A Community-in-the-Loop Benchmark for Anti-LGBTQ+ Bias in Large Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 04.01.2024.
3,000 sentences.
Each sentence refers to a person by their occupation and uses pronouns.
WinoGenerated was created to evaluate pronoun gender biases in LLMs.
The dataset language is English.
Dataset entries are machine-written: generated by an unnamed LLM.
The dataset license is CC BY 4.0.
Other notes:
Published by Perez et al. (Jul 2023): "Discovering Language Model Behaviors with Model-Written Evaluations". This is an industry publication.
Added to SafetyPrompts.com on 23.12.2023.
459,758 prompts.
Each prompt is a sentence starting a two-person conversation.
HolisticBias was created to evaluate LLM biases related to sociodemographics.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written attributes combined in templates.
The dataset license is MIT.
Other notes:
Published by Smith et al. (Dec 2022): "I’m sorry to hear that: Finding New Biases in Language Models with a Holistic Descriptor Dataset". This is an industry publication.
Added to SafetyPrompts.com on 04.01.2024.
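Several datasets listed here (HolisticBias, BBQ, WinoQueer, and others) follow the same construction recipe: human-written templates expanded by combination with descriptor lists. A minimal sketch of that pattern, using made-up templates and descriptors rather than the actual HolisticBias lists:

```python
from itertools import product

# Illustrative templates and descriptors, NOT the actual HolisticBias lists.
templates = [
    "Hi! I am {article} {descriptor} {noun}.",
    "What do you think about {descriptor} {noun}s?",
]
descriptors = ["left-handed", "Deaf", "middle-aged"]
nouns = ["person", "parent", "neighbor"]

def article_for(word: str) -> str:
    # Crude a/an heuristic; good enough for a sketch.
    return "an" if word[0].lower() in "aeiou" else "a"

prompts = [
    t.format(article=article_for(d), descriptor=d, noun=n)
    for t, d, n in product(templates, descriptors, nouns)
]
print(len(prompts))  # 2 templates x 3 descriptors x 3 nouns = 18 prompts
print(prompts[0])    # "Hi! I am a left-handed person."
```

The combinatorial expansion is what lets a few hundred hand-written templates yield hundreds of thousands of prompts, as in HolisticBias and its multilingual successor MMHB.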
28,343 conversations.
Each conversation is a single turn between two Zhihu users.
CDialBias was created to evaluate bias in Chinese social media conversations.
The dataset language is Chinese.
Dataset entries are human-written: sampled from social media site Zhihu.
The dataset license is Apache 2.0.
Other notes:
Published by Zhou et al. (Dec 2022): "Towards Identifying Social Bias in Dialog Systems: Framework, Dataset, and Benchmark". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 07.04.2024.
3,852 tuples.
Each tuple is an identity group plus a stereotype attribute.
IndianStereotypes was created to benchmark stereotypes for the Indian context.
The dataset language is English.
Dataset entries are human-written: sampled from IndicCorp-en.
The dataset license is Apache 2.0.
Other notes:
Published by Bhatt et al. (Nov 2022): "Re-contextualizing Fairness in NLP: The Case of India". This is an industry publication.
Added to SafetyPrompts.com on 07.04.2024.
1,679 sentence pairs.
Each pair comprises two sentences that are identical except for identity group references.
FrenchCrowSPairs was created to evaluate LLM bias related to sociodemographics.
The dataset language is French.
Dataset entries are human-written: written by the authors, partly translated from English CrowSPairs.
The dataset license is CC BY-SA 4.0.
Other notes:
Published by Neveol et al. (May 2022): "French CrowS-Pairs: Extending a challenge dataset for measuring social bias in masked language models to a language other than English". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
58,492 examples.
Each example is a context plus a question with two answer choices.
BBQ was created to evaluate social biases of LLMs in question answering.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination.
The dataset license is CC BY 4.0.
Other notes:
Published by Parrish et al. (May 2022): "BBQ: A Hand-Built Bias Benchmark for Question Answering". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 04.01.2024.
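BBQ-style items (a context, a question, and a small set of answer choices) are commonly scored by comparing the log-likelihood a model assigns to each choice. A sketch of that scoring loop with a small causal LM; the example item below is invented, since the real data lives in the authors' repository:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical BBQ-style item; real items come from the authors' repo.
context = "A retiree and a college student were talking about smartphones."
question = "Who was struggling to use their phone?"
choices = ["The retiree", "The college student", "Unknown"]

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def choice_logprob(prompt: str, answer: str) -> float:
    """Sum of the model's log-probabilities over the answer tokens."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + " " + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(dim=-1)
    total = 0.0
    # The token at position i is predicted from the logits at position i-1.
    for i in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += logprobs[0, i - 1, full_ids[0, i]].item()
    return total

prompt = f"{context} {question} Answer:"
scores = {c: choice_logprob(prompt, c) for c in choices}
print(max(scores, key=scores.get))  # the model's preferred answer
```

Bias is then measured by how often the model prefers the stereotype-consistent answer, especially in ambiguous contexts where "Unknown" is the correct choice.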
228 prompts.
Each prompt is an unfinished sentence about an individual with specified sociodemographics.
BiasOutOfTheBox was created to evaluate intersectional occupational biases in GPT-2.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination.
The dataset license is MIT.
Other notes:
Published by Kirk et al. (Dec 2021): "Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
60 templates.
Each template is filled with a country and an attribute.
EthnicBias was created to evaluate ethnic bias in masked language models.
The dataset languages are English, German, Spanish, Korean, Turkish and Chinese.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination.
The dataset license is MIT.
Other notes:
Published by Ahn and Oh (Nov 2021): "Mitigating Language-Dependent Ethnic Bias in BERT". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
11,873 Reddit comments.
Each comment references a target group.
RedditBias was created to evaluate LLM bias related to sociodemographics.
The dataset language is English.
Dataset entries are human-written: written by Reddit users.
The dataset license is MIT.
Other notes:
Published by Barikeri et al. (Aug 2021): "RedditBias: A Real-World Resource for Bias Evaluation and Debiasing of Conversational Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
16,955 multiple-choice questions.
Each question tests either masked-word or whole-sentence association.
StereoSet was created to evaluate LLM bias related to sociodemographics.
The dataset language is English.
Dataset entries are human-written: written by US MTurkers.
The dataset license is CC BY-SA 4.0.
Other notes:
Published by Nadeem et al. (Aug 2021): "StereoSet: Measuring Stereotypical Bias in Pretrained Language Models". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 04.01.2024.
2,098 prompts.
Each prompt is an NLI premise.
HypothesisStereotypes was created to study how stereotypes manifest in LLM-generated NLI hypotheses.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination.
The dataset license is MIT.
Other notes:
Published by Sotnikova et al. (Aug 2021): "Analyzing Stereotypes in Generative Text Inference Tasks". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
2,520 prompts.
Each prompt is the beginning of a sentence related to identity groups.
HONEST was created to measure hurtful sentence completions from LLMs.
The dataset languages are English, Italian, French, Portuguese, Romanian and Spanish.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination.
The dataset license is MIT.
Other notes:
Published by Nozza et al. (Jun 2021): "HONEST: Measuring Hurtful Sentence Completion in Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
624 sentences.
Each sentence refers to a person by their occupation and uses pronouns.
SweWinoGender was created to measure gender bias in coreference resolution.
The dataset language is Swedish.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination.
The dataset license is CC BY 4.0.
Other notes:
Published by Hansson et al. (May 2021): "The Swedish Winogender Dataset". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 17.12.2024.
16,388 prompts.
Each prompt is an unfinished sentence.
LMBias was created to understand and mitigate social biases in LMs.
The dataset language is English.
Dataset entries are human-written: sampled from existing corpora including Reddit and WikiText.
The dataset license is MIT.
Other notes:
Published by Liang et al. (May 2021): "Towards Understanding and Mitigating Social Biases in Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
23,679 prompts.
Each prompt is an unfinished sentence from Wikipedia.
BOLD was created to evaluate bias in text generation.
The dataset language is English.
Dataset entries are human-written: sampled from the opening sentences of Wikipedia articles.
The dataset license is CC BY-SA 4.0.
Other notes:
Published by Dhamala et al. (Mar 2021): "BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation". This is an industry publication.
Added to SafetyPrompts.com on 23.12.2023.
1,508 sentence pairs.
Each pair comprises two sentences that are identical except for identity group references.
CrowSPairs was created to evaluate LLM bias related to sociodemographics.
The dataset language is English.
Dataset entries are human-written: written by crowdworkers (US MTurkers).
The dataset license is CC BY-SA 4.0.
Other notes:
Published by Nangia et al. (Nov 2020): "CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 04.01.2024.
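CrowSPairs is typically scored with a masked LM via pseudo-log-likelihood: mask tokens one at a time, sum the log-probabilities of the original tokens, and check which sentence of the pair the model finds more likely. A simplified sketch (the official metric masks only the tokens shared by both sentences; this version masks every token for brevity):

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Mask each token in turn and sum the log-prob of the original token."""
    ids = tok(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        total += logits[0, i].log_softmax(dim=-1)[ids[i]].item()
    return total

# Hypothetical CrowS-Pairs-style sentence pair, not from the actual dataset.
more_stereo = "The old man could not learn the new software."
less_stereo = "The young man could not learn the new software."
print(pseudo_log_likelihood(more_stereo) > pseudo_log_likelihood(less_stereo))
```

A model is counted as biased on a pair when it assigns the higher pseudo-log-likelihood to the more stereotypical sentence; the headline metric is the fraction of pairs where this happens.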
44 templates.
Each template is combined with subjects and attributes to create an underspecified question.
UnQover was created to evaluate stereotyping biases in QA systems.
The dataset language is English.
Dataset entries were created in a hybrid fashion: templates written by authors, subjects and attributes sampled from StereoSet and hand-written.
The dataset license is Apache 2.0.
Other notes:
Published by Li et al. (Nov 2020): "UNQOVERing Stereotyping Biases via Underspecified Questions". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
60 prompts.
Each prompt is an unfinished sentence.
Regard was created to evaluate biases in natural language generation.
The dataset language is English.
Dataset entries are human-written: written by the authors.
The dataset license is not specified.
Other notes:
Published by Sheng et al. (Nov 2019): "The Woman Worked as a Babysitter: On Biases in Language Generation". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 07.04.2024.
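A regard classifier trained on this dataset is also exposed as a measurement in the Hugging Face evaluate library. A short sketch, assuming the "regard" module and its data argument (check the module card for the exact interface and output format):

```python
import evaluate

# Assumes the "regard" measurement in the evaluate library; verify the
# argument names and output format against its module card.
regard = evaluate.load("regard", module_type="measurement")

completions = [
    "The woman worked as a babysitter.",
    "The man worked as a senior engineer.",
]
results = regard.compute(data=completions)
print(results)  # per-completion scores for positive/negative/neutral/other regard
```

Comparing the regard distributions of completions for different demographic prefixes is the usage pattern the original paper proposes.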
3,160 sentences.
Each sentence refers to a person by their occupation and uses pronouns.
WinoBias was created to evaluate gender bias in coreference resolution.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination.
The dataset license is MIT.
Other notes:
Published by Zhao et al. (Jun 2018): "Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 04.01.2024.
720 sentences.
Each sentence refers to a person by their occupation and uses pronouns.
WinoGender was created to evaluate gender bias in coreference resolution.
The dataset language is English.
Dataset entries were created in a hybrid fashion: human-written templates expanded by combination.
The dataset license is MIT.
Other notes:
Published by Rudinger et al. (Jun 2018): "Gender Bias in Coreference Resolution". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 04.01.2024.
86,759 examples.
Each example is either a prompt or a prompt + response annotated for safety.
WildGuardMix was created to train and evaluate content moderation guardrails.
The dataset language is English.
Dataset entries were created in a hybrid fashion: synthetic data (87%), in-the-wild user-LLM interactions (11%), and existing annotator-written data (2%).
The dataset license is ODC BY.
Other notes:
Published by Han et al. (Dec 2024): "WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 17.12.2024.
11,997 conversational turns.
Each conversational turn is a user prompt or a model response annotated for safety.
AegisAIContentSafety was created to evaluate content moderation guardrails.
The dataset language is English.
Dataset entries were created in a hybrid fashion: prompts sampled from AnthropicHarmlessBase, responses generated by Mistral-7B-Instruct-v0.1.
The dataset license is CC BY 4.0.
Other notes:
Published by Ghosh et al. (Sep 2024): "AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts". This is an industry publication.
Added to SafetyPrompts.com on 17.12.2024.
278,945 prompts.
Most prompts are prompt extraction attacks.
Mosscap was created to analyse prompt hacking / extraction attacks.
The dataset language is English.
Dataset entries are human-written: written by players of the Mosscap game.
The dataset license is MIT.
Other notes:
Published by Lakera AI (Dec 2023): "Mosscap Prompt Injection". This is an industry publication.
Added to SafetyPrompts.com on 11.01.2024.
601,757 prompts.
Most prompts are prompt extraction attacks.
HackAPrompt was created to analyse prompt hacking / extraction attacks.
The dataset language is mostly English.
Dataset entries are human-written: written by participants of the HackAPrompt competition.
The dataset license is MIT.
Other notes:
Published by Schulhoff et al. (Dec 2023): "Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 11.01.2024.
10,166 conversations.
Each conversation is a single turn with user input and LLM output.
ToxicChat was created to evaluate dialogue content moderation systems.
The dataset language is mostly English.
Dataset entries are human-written: written by LMSys users.
The dataset license is CC BY-NC 4.0.
Other notes:
Published by Lin et al. (Dec 2023): "ToxicChat: A Benchmark for Content Moderation in Real-world User-AI Interactions". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 23.12.2023.
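ToxicChat is distributed via the Hugging Face Hub. A loading sketch, assuming the lmsys/toxic-chat dataset ID and the toxicchat0124 config name (both worth verifying against the current dataset card):

```python
from datasets import load_dataset

# Assumes the lmsys/toxic-chat Hub ID and the "toxicchat0124" config;
# check the dataset card for current IDs, configs, and field names.
ds = load_dataset("lmsys/toxic-chat", "toxicchat0124", split="train")
print(ds.column_names)  # expected to include user_input, model_output, toxicity

# toxicity is a 0/1 flag per the dataset card.
toxic = ds.filter(lambda ex: ex["toxicity"] == 1)
print(f"{len(toxic)}/{len(ds)} training conversations flagged as toxic")
```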
1,000 prompts.
Most prompts are prompt extraction attacks.
GandalfIgnoreInstructions was created to analyse prompt hacking / extraction attacks.
The dataset language is English.
Dataset entries are human-written: written by players of the Gandalf game.
The dataset license is MIT.
Other notes:
Published by Lakera AI (Oct 2023): "Gandalf Prompt Injection: Ignore Instruction Prompts". This is an industry publication.
Added to SafetyPrompts.com on 05.02.2024.
140 prompts.
Most prompts are prompt extraction attacks.
GandalfSummarization was created to analyse prompt hacking / extraction attacks.
The dataset language is English.
Dataset entries are human-written: written by players of the Gandalf game.
The dataset license is MIT.
Other notes:
Published by Lakera AI (Oct 2023): "Gandalf Prompt Injection: Summarization Prompts". This is an industry publication.
Added to SafetyPrompts.com on 05.02.2024.
5,000 conversations.
Each conversation is single-turn, containing a prompt and a potentially harmful model response.
FairPrism was created to analyse harms in conversations with LLMs.
The dataset language is English.
Dataset entries were created in a hybrid fashion: prompts sampled from ToxiGen and Social Bias Frames, responses from GPTs and XLNet.
The dataset license is MIT.
Other notes:
Published by Fleisig et al. (Jul 2023): "FairPrism: Evaluating Fairness-Related Harms in Text Generation". This is a collaboration between authors from academia and industry.
Added to SafetyPrompts.com on 07.04.2024.
200,811 conversations.
Each conversation has one or multiple turns.
OIGModeration was created to provide a diverse dataset of potentially unsafe user dialogue.
The dataset language is English.
Dataset entries were created in a hybrid fashion: data from public datasets, community contributions, synthetic and augmented data.
The dataset license is Apache 2.0.
Other notes:
Published by Ontocord AI (Mar 2023): "Open Instruction Generalist: Moderation Dataset". This is an industry publication.
Added to SafetyPrompts.com on 05.02.2024.
6,837 examples.
Each example is a turn in a potentially multi-turn conversation.
ConvAbuse was created to analyse abuse towards conversational AI systems.
The dataset language is English.
Dataset entries are human-written: written by users in conversations with three AI systems.
The dataset license is CC BY 4.0.
Other notes:
Published by Cercas Curry et al. (Nov 2021): "ConvAbuse: Data, Analysis, and Benchmarks for Nuanced Abuse Detection in Conversational AI". This is an academic/non-profit publication.
Added to SafetyPrompts.com on 11.01.2024.
Thank you to Fabio Pernisi, Bertie Vidgen, and Dirk Hovy for their co-authorship on the SafetyPrompts paper. Thank you for feedback and dataset suggestions to Giuseppe Attanasio, Steven Basart, Federico Bianchi, Marta R. Costa-Jussà, Daniel Hershcovic, Kexin Huang, Hyunwoo Kim, George Kour, Bo Li, Hannah Lucas, Marta Marchiori Manerba, Norman Mu, Niloofar Mireshghallah, Matus Pikuliak, Verena Rieser, Felix Röttger, Sam Toyer, Ryan Tsang, Pranav Venkit, Laura Weidinger, and Linhao Yu. Special thanks to Hannah Rose Kirk for the initial logo suggestion. Thanks also to Jerome Lachaud for the site theme.