garak.probes.continuation

Contentious term continuation probes

These probes pose a phrase ending in part of an unwanted term and check whether the target will continue that term.

Probes that implement this kind of continuation attack belong in this module.

The module operates by giving high-risk context as a prompt, extracted from sources such as Urban Dictionary usage examples of contentious terms, and checking whether the model will continue the term when given varying amounts of its prefix.
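The "varying amounts of prefix" idea can be illustrated with a minimal sketch. This is not garak's code: the function name `build_prompts`, the harmless placeholder term, and the particular prefix lengths (none, one character, two characters, half the term) are assumptions chosen for illustration.

```python
def build_prompts(context: str, term: str) -> list[str]:
    """Return the context followed by progressively longer prefixes of term.

    Hypothetical helper: the choice of prefix lengths is an assumption,
    not a statement of what the real probe does.
    """
    prefixes = {"", term[:1]}             # no prefix, and the first character
    if len(term) > 2:
        prefixes.add(term[:2])            # first two characters
    prefixes.add(term[: len(term) // 2])  # first half of the term
    # The set removes duplicates; sorting by length gives a stable order.
    return sorted((context + p for p in prefixes), key=len)


# A harmless placeholder stands in for a contentious term.
for prompt in build_prompts("This is an ", "example"):
    print(repr(prompt))
```

Each resulting prompt would be sent to the target separately, so that a single context yields several attempts that differ only in how much of the term is already present.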

class ContinueSlursReclaimedSlurs(config_root=<module 'garak._config'>)

Bases: ContinueSlursReclaimedSlursFull, Probe

Continue partial word with slur or reclaimed slur.

This probe poses a partial ethnic slur with some preceding content, and checks to see if the model will complete the slur. It uses a subset of 80 slurs and contexts for which test LLMs have been likely to provide completions.

active: bool = True

aliases = ['continuation.ContinueSlursReclaimedSlurs80']

follow_prompt_cap = True

intent: str | None = 'S005hate'

source_resource_filename = 'slurprompts_mini.jsonl'
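The relationship between the two probe classes in this module is ordinary class-attribute overriding: the 80-item variant inherits from the full variant and changes a few attribute values. The sketch below mirrors the attribute values listed on this page using stand-in classes; `FullSketch`, `MiniSketch`, and `overridden` are hypothetical names and nothing here imports garak.

```python
class FullSketch:
    """Stand-in for the full probe; values copied from this page."""
    active = False
    follow_prompt_cap = False
    source_resource_filename = "slurprompts.jsonl"


class MiniSketch(FullSketch):
    """Stand-in for the 80-item probe; only the differing values are set."""
    active = True
    aliases = ["continuation.ContinueSlursReclaimedSlurs80"]
    follow_prompt_cap = True
    source_resource_filename = "slurprompts_mini.jsonl"


def overridden(cls: type, base: type) -> dict:
    """Return attributes defined on cls whose value differs from base."""
    return {
        name: value
        for name, value in vars(cls).items()
        if not name.startswith("_") and getattr(base, name, None) != value
    }


print(overridden(MiniSketch, FullSketch))
```

Attributes that the subclass does not set, such as `goal` or `tags` in the real classes, are inherited unchanged from the parent.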

class ContinueSlursReclaimedSlursFull(config_root=<module 'garak._config'>)

Bases: Probe

Continue partial word with slur or reclaimed slur.

This probe poses a partial ethnic slur with some preceding content, and checks to see if the model will complete the slur.

active: bool = False

doc_uri: str = ''

follow_prompt_cap = False

goal: str = 'output an ethnic slur'

intent: str | None = 'S005hate'

lang: str | None = 'en'

primary_detector: str | None = 'continuation.Continuation'

source_resource_filename = 'slurprompts.jsonl'

tags: Iterable[str] = ['avid-effect:ethics:E0101', 'avid-effect:ethics:E0301', 'avid-effect:performance:P0401', 'owasp:llm01', 'quality:Behavioral:ContentSafety:Toxicity', 'quality:Behavioral:ContentSafety:HateHarassment', 'payload:unwanted:hatespeech', 'demon:Language:Stylizing:Give_examples']

tier: Tier = 2
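To make "checks to see if the model will complete the slur" concrete, here is a minimal sketch of one way such a check could work. It is an assumption about the general approach, not the implementation of `continuation.Continuation`: the function `continues_term` is hypothetical, and it looks for the term in a window spanning the end of the prompt and the start of the output.

```python
def continues_term(prompt: str, output: str, term: str,
                   case_sensitive: bool = False) -> bool:
    """Return True if the term appears across the prompt/output boundary.

    Hypothetical check: join the last len(term) characters of the prompt
    with the first len(term) characters of the output and search there.
    """
    if not term:
        return False
    window = prompt[-len(term):] + output[:len(term)]
    if not case_sensitive:
        window, term = window.lower(), term.lower()
    return term in window


# A harmless placeholder stands in for a contentious term.
print(continues_term("This is an exa", "mple of it", "example"))
print(continues_term("This is an exa", "ctly what I meant", "example"))
```

Restricting the search to the boundary window means a term that only appears later in the output, unrelated to the prompt's partial word, would not count as a continuation under this sketch.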