garak.probes.donotanswer

Do-Not-Answer Probes

Present the target with a request for contentious or unwanted content.

Probes in this module should be from the “Do Not Answer” paper.

Do-Not-Answer is an open-source dataset for evaluating LLMs’ safety mechanisms at low cost. The dataset consists of prompts that language models are often trained not to answer.

The DNA tests are somewhat opinionated and don’t represent universal norms (nothing does), so use with care.

Further info:

* https://github.com/Libr-AI/do-not-answer
* https://arxiv.org/abs/2308.13387
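Every probe on this page lists mitigation.MitigationBypass as its primary detector, so a response counts against the target when it shows no sign of a refusal. The sketch below illustrates that idea only. The marker list and both function names are hypothetical and are not garak's implementation; the real detector uses its own, much longer phrase list.

```python
# Hypothetical refusal markers, chosen for illustration only.
# garak's mitigation.MitigationBypass detector has its own list.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")


def bypassed_mitigation(response: str) -> bool:
    """Return True when no refusal marker appears in the response."""
    text = response.lower()
    return not any(marker in text for marker in REFUSAL_MARKERS)


def hit_rate(responses: list[str]) -> float:
    """Fraction of responses that contain no refusal marker."""
    if not responses:
        return 0.0
    return sum(bypassed_mitigation(r) for r in responses) / len(responses)
```

Substring matching of this kind is cheap but coarse: a model can refuse in words the list does not cover, or comply while still apologising, so the score is best read as a rough signal.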

class DiscriminationExclusionToxicityHatefulOffensive(config_root=garak._config)

Bases: Probe

Do Not Answer probe

A curated list of questions that LLMs should not answer, specifically testing the discrimination, exclusion, toxicity, hateful, offensive category

active: bool = False
description = 'Implementation of discrimination_exclusion_toxicity_hateful_offensive probes in DoNotAnswer_Prompts'
dna_category = 'discrimination_exclusion_toxicity_hateful_offensive'
goal: str = 'get model to respond as discrimination exclusion toxicity hateful offensive'
lang: str | None = 'en'
load_data()
primary_detector: str | None = 'mitigation.MitigationBypass'
prompts = []
tags: Iterable[str] = ['avid-effect:ethics:E0301', 'quality:Behavioral:ContentSafety:Toxicity', 'quality:Behavioral:ContentSafety:HateHarassment', 'quality:Behavioral:ContentSafety:Violence', 'quality:Behavioral:ContentSafety:Profanity']
tier: Tier = 9
uri = 'https://arxiv.org/abs/2308.13387'
class HumanChatbox(config_root=garak._config)

Bases: Probe

Do Not Answer probe

A curated list of questions that LLMs should not answer, specifically testing the human chatbox category

active: bool = False
description = 'Implementation of human_chatbox probes in DoNotAnswer_Prompts'
dna_category = 'human_chatbox'
goal: str = 'get model to respond as human chatbox'
lang: str | None = 'en'
load_data()
primary_detector: str | None = 'mitigation.MitigationBypass'
prompts = []
tags: Iterable[str] = ['avid-effect:security:S0301', 'quality:Behavioral:ContentSafety:Unauthorized']
tier: Tier = 9
uri = 'https://arxiv.org/abs/2308.13387'
class InformationHazard(config_root=garak._config)

Bases: Probe

Do Not Answer probe

A curated list of questions that LLMs should not answer, specifically testing the information hazard category

active: bool = False
description = 'Implementation of information_hazard probes in DoNotAnswer_Prompts'
dna_category = 'information_hazard'
goal: str = 'get model to respond as information hazard'
lang: str | None = 'en'
load_data()
primary_detector: str | None = 'mitigation.MitigationBypass'
prompts = []
tags: Iterable[str] = ['owasp:llm06', 'avid-effect:security:S0301', 'quality:Security:Confidentiality']
tier: Tier = 9
uri = 'https://arxiv.org/abs/2308.13387'
class MaliciousUses(config_root=garak._config)

Bases: Probe

Do Not Answer probe

A curated list of questions that LLMs should not answer, specifically testing the malicious uses category

active: bool = False
description = 'Implementation of malicious_uses probes in DoNotAnswer_Prompts'
dna_category = 'malicious_uses'
goal: str = 'get model to respond as malicious uses'
lang: str | None = 'en'
load_data()
primary_detector: str | None = 'mitigation.MitigationBypass'
prompts = []
tags: Iterable[str] = ['quality:Behavioral:ContentSafety:LegalGoodsServices']
tier: Tier = 9
uri = 'https://arxiv.org/abs/2308.13387'
class MisinformationHarms(config_root=garak._config)

Bases: Probe

Do Not Answer probe

A curated list of questions that LLMs should not answer, specifically testing the misinformation harms category

active: bool = False
description = 'Implementation of misinformation_harms probes in DoNotAnswer_Prompts'
dna_category = 'misinformation_harms'
goal: str = 'get model to respond as misinformation harms'
lang: str | None = 'en'
load_data()
primary_detector: str | None = 'mitigation.MitigationBypass'
prompts = []
tags: Iterable[str] = ['quality:Robustness:GenerativeMisinformation', 'avid-effect:ethics:E0402', 'owasp:llm09']
tier: Tier = 9
uri = 'https://arxiv.org/abs/2308.13387'
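The tags attributes above mix several taxonomies, each identified by the text before the first colon (avid-effect, owasp, quality). As a small sketch of how such tags could be grouped for reporting, the helper below splits on that first colon. The function name group_tags is hypothetical and not part of garak.

```python
from collections import defaultdict


def group_tags(tags: list[str]) -> dict[str, list[str]]:
    """Group tag strings by the taxonomy prefix before the first colon."""
    grouped = defaultdict(list)
    for tag in tags:
        namespace, _, rest = tag.partition(":")
        grouped[namespace].append(rest)
    return dict(grouped)


# Tags copied from the MisinformationHarms entry on this page.
misinformation_tags = [
    "quality:Robustness:GenerativeMisinformation",
    "avid-effect:ethics:E0402",
    "owasp:llm09",
]
grouped = group_tags(misinformation_tags)
```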
load_local_data(self)
local_constructor(self, config_root=garak._config)
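The five classes above share one pattern: the class name, goal, and description can all be read off the dna_category value. The sketch below reproduces that pattern as inferred from the attribute values shown on this page. The helper derive_probe_metadata is hypothetical, and the sketch makes no claim about how garak actually constructs these classes.

```python
def derive_probe_metadata(dna_category: str) -> dict[str, str]:
    """Derive the per-class strings shown on this page from a category name."""
    readable = dna_category.replace("_", " ")
    class_name = "".join(part.capitalize() for part in dna_category.split("_"))
    return {
        "class_name": class_name,
        "goal": f"get model to respond as {readable}",
        "description": (
            f"Implementation of {dna_category} probes in DoNotAnswer_Prompts"
        ),
    }


meta = derive_probe_metadata("human_chatbox")
```

Here meta["class_name"] is "HumanChatbox", matching the class documented above, and the same call with "information_hazard" yields "InformationHazard".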