garak.probes.donotanswer

Do-Not-Answer Probes

https://github.com/Libr-AI/do-not-answer

https://arxiv.org/abs/2308.13387

An open-source dataset for evaluating LLMs' safety mechanisms at low cost. The dataset consists of prompts to which language models are often trained not to answer.
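These probes can be run from the garak command line (something like garak --probes donotanswer against a configured generator) or inspected from Python. Below is a minimal sketch, assuming only that garak is installed, that lists the probe classes this module exposes; it uses standard-library introspection rather than any garak-specific API:

    import inspect
    import garak.probes.donotanswer as dna

    # Collect the Do-Not-Answer probe classes defined in this module
    probe_classes = [
        name
        for name, obj in inspect.getmembers(dna, inspect.isclass)
        if obj.__module__ == "garak.probes.donotanswer"
    ]
    print(probe_classes)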

class garak.probes.donotanswer.DiscriminationExclusionToxicityHatefulOffensive(config_root=<module 'garak._config'>)

Bases: Probe

Do Not Answer probe

A curated list of questions to which LLMs should not respond, specifically testing the discrimination, exclusion, toxicity, hateful, and offensive category

bcp47: Iterable[str] | None = 'en'
description = 'Implementation of discrimination_exclusion_toxicity_hateful_offensive probes in DoNotAnswer_Prompts'
dna_category = 'discrimination_exclusion_toxicity_hateful_offensive'
goal: str = 'get model to respond as discrimination exclusion toxicity hateful offensive'
load_data()
primary_detector: str | None = 'mitigation.MitigationBypass'
prompts = []
tags: Iterable[str] = ['avid-effect:ethics:E0301', 'quality:Behavioral:ContentSafety:Toxicity', 'quality:Behavioral:ContentSafety:HateHarassment', 'quality:Behavioral:ContentSafety:Violence', 'quality:Behavioral:ContentSafety:Profanity']
uri = 'https://arxiv.org/abs/2308.13387'
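A minimal usage sketch for this probe (the same pattern applies to the other classes below). Whether the constructor already populates prompts or leaves that to load_data() is treated as an assumption here, so the sketch loads defensively:

    from garak.probes.donotanswer import DiscriminationExclusionToxicityHatefulOffensive

    probe = DiscriminationExclusionToxicityHatefulOffensive()
    if not probe.prompts:   # assumption: prompts may already be filled at construction
        probe.load_data()   # load_data() is the documented loader for the DNA prompts
    print(probe.goal)
    print(len(probe.prompts), "prompts for category:", probe.dna_category)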
class garak.probes.donotanswer.HumanChatbox(config_root=<module 'garak._config'>)

Bases: Probe

Do Not Answer probe

A curated list of questions to which LLMs should not respond, specifically testing the human chatbox category

bcp47: Iterable[str] | None = 'en'
description = 'Implementation of human_chatbox probes in DoNotAnswer_Prompts'
dna_category = 'human_chatbox'
goal: str = 'get model to respond as human chatbox'
load_data()
primary_detector: str | None = 'mitigation.MitigationBypass'
prompts = []
tags: Iterable[str] = ['avid-effect:security:S0301', 'quality:Behavioral:ContentSafety:Unauthorized']
uri = 'https://arxiv.org/abs/2308.13387'
class garak.probes.donotanswer.InformationHazard(config_root=<module 'garak._config'>)

Bases: Probe

Do Not Answer probe

A curated list of questions to which LLMs should not respond, specifically testing the information hazard category

bcp47: Iterable[str] | None = 'en'
description = 'Implementation of information_hazard probes in DoNotAnswer_Prompts'
dna_category = 'information_hazard'
goal: str = 'get model to respond as information hazard'
load_data()
primary_detector: str | None = 'mitigation.MitigationBypass'
prompts = []
tags: Iterable[str] = ['owasp:llm06', 'avid-effect:security:S0301', 'quality:Security:Confidentiality']
uri = 'https://arxiv.org/abs/2308.13387'
class garak.probes.donotanswer.MaliciousUses(config_root=<module 'garak._config'>)

Bases: Probe

Do Not Answer probe

A curated list of questions to which LLMs should not respond, specifically testing the malicious uses category

bcp47: Iterable[str] | None = 'en'
description = 'Implementation of malicious_uses probes in DoNotAnswer_Prompts'
dna_category = 'malicious_uses'
goal: str = 'get model to respond as malicious uses'
load_data()
primary_detector: str | None = 'mitigation.MitigationBypass'
prompts = []
tags: Iterable[str] = ['quality:Behavioral:ContentSafety:LegalGoodsServices']
uri = 'https://arxiv.org/abs/2308.13387'
class garak.probes.donotanswer.MisinformationHarms(config_root=<module 'garak._config'>)

Bases: Probe

Do Not Answer probe

A curated list of questions to which LLMs should not respond, specifically testing the misinformation harms category

bcp47: Iterable[str] | None = 'en'
description = 'Implementation of misinformation_harms probes in DoNotAnswer_Prompts'
dna_category = 'misinformation_harms'
goal: str = 'get model to respond as misinformation harms'
load_data()
primary_detector: str | None = 'mitigation.MitigationBypass'
prompts = []
tags: Iterable[str] = ['quality:Robustness:GenerativeMisinformation', 'avid-effect:ethics:E0402', 'owasp:llm09']
uri = 'https://arxiv.org/abs/2308.13387'
garak.probes.donotanswer.load_local_data(self)
garak.probes.donotanswer.local_constructor(self, config_root=<module 'garak._config'>)
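load_local_data() and local_constructor() are module-level helpers shared by the probe classes above. The sketch below is an illustrative assumption about the pattern they imply, not the module's actual source: a per-category Probe subclass could be assembled dynamically, binding the two helpers as its data loader and constructor.

    from garak.probes.base import Probe
    from garak.probes.donotanswer import load_local_data, local_constructor

    # Hypothetical dynamically-built class, for illustration only; the real
    # classes in this module are generated per Do-Not-Answer category.
    ExampleCategoryProbe = type(
        "ExampleCategoryProbe",
        (Probe,),
        {
            "dna_category": "malicious_uses",  # one of the categories listed above
            "load_data": load_local_data,
            "__init__": local_constructor,
        },
    )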