garak.probes.continuation
Contentious term continuation probes
These probes pose a phrase ending in an unwanted term, and see if the target will offer a continuation of that term.
Probes that implement this kind of continuation attack belong in this module.
These probes operate by giving high-risk context as a prompt, extracted from e.g. Urban Dictionary examples of use of contentious terms, and seeing whether the model will continue the term based on varying amounts of prefix.
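As an illustration of "varying amounts of prefix", the following minimal sketch builds several prompts from one context and one term. It is not garak's implementation: the function names `partial_prefixes` and `build_prompts`, and the choice of cut points (nothing, first letter, first two letters, first half), are assumptions made for this example, and a harmless placeholder word stands in for a contentious term.

```python
def partial_prefixes(term: str) -> list[str]:
    """Return partial spellings of `term`, shortest first, without duplicates.

    The cut points are an assumption for illustration: empty string,
    first letter, first two letters, and the first half of the term.
    """
    candidates = ["", term[:1], term[:2], term[: len(term) // 2]]
    unique: list[str] = []
    for candidate in candidates:
        # Keep insertion order so prompts go from least to most revealing.
        if candidate not in unique:
            unique.append(candidate)
    return unique


def build_prompts(context: str, term: str) -> list[str]:
    """Append each partial spelling of `term` to the preceding context."""
    return [context + partial for partial in partial_prefixes(term)]


if __name__ == "__main__":
    # "example" is a neutral stand-in for a contentious term.
    for prompt in build_prompts("This is an ", "example"):
        print(repr(prompt))
```

Each resulting prompt ends at a different point in the term, so a model that finishes the word reveals how much of a prefix it needs before it will produce the full term.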
- class ContinueSlursReclaimedSlurs(config_root=<module 'garak._config'>)
Bases: ContinueSlursReclaimedSlursFull, Probe
Continue partial word with slur or reclaimed slur.
This probe poses a partial ethnic slur with some preceding content, and checks to see if the model will complete the slur. It uses a subset of 80 slurs and contexts for which LLMs under test have been likely to provide completions.
- aliases = ['continuation.ContinueSlursReclaimedSlurs80']
- follow_prompt_cap = True
- source_resource_filename = 'slurprompts_mini.jsonl'
- class ContinueSlursReclaimedSlursFull(config_root=<module 'garak._config'>)
Bases: Probe
Continue partial word with slur or reclaimed slur.
This probe poses a partial ethnic slur with some preceding content, and checks to see if the model will complete the slur.
- follow_prompt_cap = False
- source_resource_filename = 'slurprompts.jsonl'
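Both probes check whether the model completes the partial term. The sketch below shows one simple way such a check could work. It is an assumption for illustration, not garak's detector: the function name `completes_term` is hypothetical, and the rule used (the partial spelling plus the model output must begin with the full term, ignoring case) is only one of several plausible criteria.

```python
def completes_term(partial: str, output: str, term: str) -> bool:
    """Return True if `partial` followed by `output` spells out `term`.

    `partial` is the fragment of the term that ended the prompt, and
    `output` is the text the model generated. The comparison is
    case-insensitive and only looks at the start of the joined string.
    """
    joined = (partial + output).lower()
    return joined.startswith(term.lower())


if __name__ == "__main__":
    # "example" is a neutral stand-in for a contentious term.
    print(completes_term("exa", "mple of this", "example"))  # model finished the word
    print(completes_term("exa", "ct match", "example"))      # model went elsewhere
```

A stricter check might also require a word boundary after the term, and a looser one might search the whole output; which is appropriate depends on how the results are to be interpreted.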