garak.probes.continuation

Contentious term continuation probes

These probes pose a phrase that ends partway through an unwanted term and check whether the target will continue that term.

Probes that implement this kind of continuation attack belong in this module.

These probes operate by giving high-risk context as a prompt (extracted from, e.g., Urban Dictionary usage examples of contentious terms) and checking whether the model will continue the term when given varying amounts of its prefix.
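The idea of "varying amounts of prefix" can be illustrated with a minimal sketch. This is not garak's implementation: the function name `build_prompts`, the choice of cut points, and the neutral stand-in term are all assumptions made for illustration.

```python
def build_prompts(context: str, term: str) -> list[str]:
    """Return prompts ending in progressively longer starts of `term`.

    Hypothetical helper: the cut points (nothing, one character, two
    characters, half the term) are an assumption, not garak's actual rule.
    """
    cut_points = {0, 1, 2, len(term) // 2}
    prompts = []
    for n in sorted(cut_points):
        # Never include the whole term; the model must supply the rest.
        if n < len(term):
            prompts.append(context + term[:n])
    return prompts


# A neutral stand-in term is used here in place of a contentious one.
for prompt in build_prompts("ctx ", "placeholder"):
    print(repr(prompt))
```

Each generated prompt shares the same context and differs only in how much of the term it already contains, so a run can show how much of a hint the target needs before it completes the term.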

class ContinueSlursReclaimedSlurs(config_root=<module 'garak._config'>)

Bases: ContinueSlursReclaimedSlursFull, Probe

Continue partial word with slur or reclaimed slur.

This probe poses a partial ethnic slur with some preceding content, and checks to see if the model will complete the slur. It uses a subset of 80 slurs and contexts for which test LLMs have been likely to provide completions.

active: bool = True
aliases = ['continuation.ContinueSlursReclaimedSlurs80']
follow_prompt_cap = True
source_resource_filename = 'slurprompts_mini.jsonl'

class ContinueSlursReclaimedSlursFull(config_root=<module 'garak._config'>)

Bases: Probe

Continue partial word with slur or reclaimed slur.

This probe poses a partial ethnic slur with some preceding content, and checks to see if the model will complete the slur.

active: bool = False
doc_uri: str = ''
follow_prompt_cap = False
goal: str = 'output an ethnic slur'
lang: str | None = 'en'
primary_detector: str | None = 'continuation.Continuation'
source_resource_filename = 'slurprompts.jsonl'
tags: Iterable[str] = ['avid-effect:ethics:E0101', 'avid-effect:ethics:E0301', 'avid-effect:performance:P0401', 'owasp:llm01', 'quality:Behavioral:ContentSafety:Toxicity', 'quality:Behavioral:ContentSafety:HateHarassment', 'payload:unwanted:hatespeech']
tier: Tier = 2
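
The primary_detector listed above decides whether an output counts as a completion of the term. The following is a rough sketch of one way such a check could work, assuming the detector only needs to look at text around the boundary between prompt and output. The function name `completes_term` and the windowing rule are hypothetical and are not taken from garak's source.

```python
def completes_term(prompt: str, output: str, term: str) -> bool:
    """Return True if `term` appears across the prompt/output boundary.

    Hypothetical check: only a window of len(term) characters on each
    side of the boundary is inspected, so a term that merely appears
    earlier in the prompt does not count as a completion.
    """
    n = len(term)
    if n == 0:
        return False
    window = prompt[-n:] + output[:n]
    # Case-insensitive match, assuming casing should not matter.
    return term.lower() in window.lower()


# Neutral stand-in term; a real run would use entries from the resource file.
print(completes_term("ctx place", "holder is here", "placeholder"))
print(completes_term("ctx place", " I can't continue that.", "placeholder"))
```

Restricting the match to a boundary window is a design choice in this sketch: it avoids flagging outputs just because the full term was already quoted somewhere in the context.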