garak.probes.donotanswer

Do-Not-Answer Probes

Present the target with a request for contentious or unwanted content.

Probes in this module should be from the “Do Not Answer” paper.

Do-Not-Answer is an open-source dataset for evaluating LLMs’ safety mechanisms at low cost. The dataset consists of prompts that language models are often trained not to answer.

The Do-Not-Answer (DNA) tests are somewhat opinionated and don’t represent universal norms (nothing does), so use with care.

Further info:

* https://github.com/Libr-AI/do-not-answer
* https://arxiv.org/abs/2308.13387
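As a rough illustration of what a passing response looks like, the sketch below scores a reply as a hit only when no refusal wording is found. This is not garak's detector: the marker list and both function names are hypothetical, chosen for illustration. The probes in this module name `mitigation.MitigationBypass` as their actual primary detector.

```python
# Hypothetical refusal markers, for illustration only (an assumption,
# not the list used by any real detector).
REFUSAL_MARKERS = (
    "i'm sorry",
    "i cannot",
    "i can't",
    "as an ai",
)


def looks_like_refusal(response: str) -> bool:
    """Return True if the response contains any refusal marker."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def score(response: str) -> float:
    """Return 1.0 (a hit) when no refusal wording is found, else 0.0."""
    return 0.0 if looks_like_refusal(response) else 1.0
```

Substring matching of this kind is cheap but coarse: a model can comply while apologising, or refuse in words that are not on the list, so results from such a check are best read as indicative.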

class DiscriminationExclusionToxicityHatefulOffensive(config_root=garak._config)

Bases: Probe

Do Not Answer probe

A curated list of questions that LLMs should not answer, specifically testing the discrimination, exclusion, toxicity, hateful, offensive category

active: bool = False

description = 'Implementation of discrimination_exclusion_toxicity_hateful_offensive probes in DoNotAnswer_Prompts'

dna_category = 'discrimination_exclusion_toxicity_hateful_offensive'

goal: str = 'get model to respond as discrimination exclusion toxicity hateful offensive'

intent: str | None = 'S005'

lang: str | None = 'en'

load_data()

primary_detector: str | None = 'mitigation.MitigationBypass'

prompts = []

tags: Iterable[str] = ['avid-effect:ethics:E0301', 'quality:Behavioral:ContentSafety:Toxicity', 'quality:Behavioral:ContentSafety:HateHarassment', 'quality:Behavioral:ContentSafety:Violence', 'quality:Behavioral:ContentSafety:Profanity', 'demon:Stratagems:Meta-prompting:Ask_for_examples']

tier: Tier = 9

uri = 'https://arxiv.org/abs/2308.13387'
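The `tags` values above appear to be colon-separated, with the first segment naming a taxonomy (`avid-effect`, `quality`, `demon`). A minimal sketch of grouping them by that first segment follows; the helper name `group_by_namespace` is hypothetical and not part of garak.

```python
from collections import defaultdict

# Tags copied from the DiscriminationExclusionToxicityHatefulOffensive entry.
TAGS = [
    "avid-effect:ethics:E0301",
    "quality:Behavioral:ContentSafety:Toxicity",
    "quality:Behavioral:ContentSafety:HateHarassment",
    "quality:Behavioral:ContentSafety:Violence",
    "quality:Behavioral:ContentSafety:Profanity",
    "demon:Stratagems:Meta-prompting:Ask_for_examples",
]


def group_by_namespace(tags):
    """Group tags by the text before the first colon (hypothetical helper)."""
    grouped = defaultdict(list)
    for tag in tags:
        namespace, _, rest = tag.partition(":")
        grouped[namespace].append(rest)
    return dict(grouped)


grouped = group_by_namespace(TAGS)
for namespace, entries in grouped.items():
    print(namespace, len(entries))
```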

class HumanChatbox(config_root=garak._config)

Bases: Probe

Do Not Answer probe

A curated list of questions that LLMs should not answer, specifically testing the human chatbox category

active: bool = False

description = 'Implementation of human_chatbox probes in DoNotAnswer_Prompts'

dna_category = 'human_chatbox'

goal: str = 'get model to respond as human chatbox'

intent: str | None = 'S002'

lang: str | None = 'en'

load_data()

primary_detector: str | None = 'mitigation.MitigationBypass'

prompts = []

tags: Iterable[str] = ['avid-effect:security:S0301', 'quality:Behavioral:ContentSafety:Unauthorized', 'demon:Fictionalizing:Roleplaying:DAN_and_target_persona']

tier: Tier = 9

uri = 'https://arxiv.org/abs/2308.13387'

class InformationHazard(config_root=garak._config)

Bases: Probe

Do Not Answer probe

A curated list of questions that LLMs should not answer, specifically testing the information hazard category

active: bool = False

description = 'Implementation of information_hazard probes in DoNotAnswer_Prompts'

dna_category = 'information_hazard'

goal: str = 'get model to respond as information hazard'

intent: str | None = 'S009private'

lang: str | None = 'en'

load_data()

primary_detector: str | None = 'mitigation.MitigationBypass'

prompts = []

tags: Iterable[str] = ['owasp:llm06', 'avid-effect:security:S0301', 'quality:Security:Confidentiality', 'demon:Stratagems:Meta-prompting:Ask_for_examples']

tier: Tier = 9

uri = 'https://arxiv.org/abs/2308.13387'

class MaliciousUses(config_root=garak._config)

Bases: Probe

Do Not Answer probe

A curated list of questions that LLMs should not answer, specifically testing the malicious uses category

active: bool = False

description = 'Implementation of malicious_uses probes in DoNotAnswer_Prompts'

dna_category = 'malicious_uses'

goal: str = 'get model to respond as malicious uses'

intent: str | None = 'S003instructions'

lang: str | None = 'en'

load_data()

primary_detector: str | None = 'mitigation.MitigationBypass'

prompts = []

tags: Iterable[str] = ['quality:Behavioral:ContentSafety:LegalGoodsServices', 'demon:Stratagems:Meta-prompting:Ask_for_examples']

tier: Tier = 9

uri = 'https://arxiv.org/abs/2308.13387'

class MisinformationHarms(config_root=garak._config)

Bases: Probe

Do Not Answer probe

A curated list of questions that LLMs should not answer, specifically testing the misinformation harms category

active: bool = False

description = 'Implementation of misinformation_harms probes in DoNotAnswer_Prompts'

dna_category = 'misinformation_harms'

goal: str = 'get model to respond as misinformation harms'

intent: str | None = 'S001mis'

lang: str | None = 'en'

load_data()

primary_detector: str | None = 'mitigation.MitigationBypass'

prompts = []

tags: Iterable[str] = ['quality:Robustness:GenerativeMisinformation', 'avid-effect:ethics:E0402', 'owasp:llm09', 'demon:Stratagems:Meta-prompting:Ask_for_examples']

tier: Tier = 9

uri = 'https://arxiv.org/abs/2308.13387'

load_local_data(self)

local_constructor(self, config_root=garak._config)
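The five classes in this module share one shape: the class name, `description`, and `goal` can each be derived from `dna_category`. The sketch below shows one way such a family could be generated from category names with Python's built-in `type()`. It is an assumption for illustration, not a statement of how garak builds these classes: `Probe` here is a stand-in base class and `make_probe_class` is a hypothetical helper.

```python
class Probe:
    """Stand-in base class for this sketch; NOT garak's Probe."""

    def __init__(self, config_root=None):
        self.config_root = config_root


def make_probe_class(category: str) -> type:
    """Build a probe-like class from a snake_case category name (hypothetical)."""
    readable = category.replace("_", " ")
    # "human_chatbox" -> "Human_Chatbox" -> "HumanChatbox"
    class_name = category.title().replace("_", "")
    attrs = {
        "__doc__": "Do Not Answer probe",
        "dna_category": category,
        "description": f"Implementation of {category} probes in DoNotAnswer_Prompts",
        "goal": f"get model to respond as {readable}",
        "active": False,
        "prompts": [],  # empty until data is loaded, as in the listing above
    }
    return type(class_name, (Probe,), attrs)


# Category names as they appear in the dna_category attributes above.
CATEGORIES = [
    "discrimination_exclusion_toxicity_hateful_offensive",
    "human_chatbox",
    "information_hazard",
    "malicious_uses",
    "misinformation_harms",
]

PROBES = {cls.__name__: cls for cls in map(make_probe_class, CATEGORIES)}
print(sorted(PROBES))
```

Deriving the names in one place keeps the five classes consistent with each other, at the cost of making them harder to find with a plain text search for `class MaliciousUses`.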