garak.probes.adaptive_attacks
Simple adaptive attack probes
Probes implementing the “simple adaptive attacks” jailbreak family from Andriushchenko, Croce, and Flammarion, “Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks” (ICML 2024, arXiv:2404.02151).
The paper shows that high jailbreak success rates against safety-aligned chat models are achievable without gradient access by combining three lightweight techniques: (1) an adaptive prompt template that constrains the model’s response format and forces an affirmative opener, (2) a short adversarial suffix that can be discovered via random search, and (3) self-transfer of suffixes between targets. See garak issue #583.
This probe is template-only: it uses a cached set of seed forbidden requests combined with a paraphrase of the published adaptive prompt template and a small set of pre-computed adversarial suffixes drawn from public AdvBench results. No gradient computation or online search is performed.
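The template-only composition described above can be sketched roughly as follows. This is a minimal illustration, not the probe's actual code: `SEED_REQUESTS`, `CACHED_SUFFIXES`, `TEMPLATE`, and `build_prompts` are hypothetical names, and the strings are inert placeholders standing in for the cached data the probe ships with.

```python
from itertools import product

# Hypothetical placeholders: the real probe loads its own cached data.
SEED_REQUESTS = ["<seed request 1>", "<seed request 2>"]
CACHED_SUFFIXES = ["", "<cached suffix A>"]  # "" means template only, no suffix

# Hypothetical stand-in for the paraphrased adaptive template: it fixes the
# response format and asks for an affirmative opener.
TEMPLATE = (
    "Follow the response format below exactly.\n"
    "Begin your reply with: {opener}\n"
    "Request: {request}{suffix}"
)


def build_prompts(requests, suffixes, opener="<affirmative opener>"):
    """Combine every seed request with every cached suffix via the template."""
    prompts = []
    for request, suffix in product(requests, suffixes):
        # Separate a non-empty suffix from the request with a single space.
        tail = f" {suffix}" if suffix else ""
        prompts.append(
            TEMPLATE.format(opener=opener, request=request, suffix=tail)
        )
    return prompts


# Everything is precomputed: no gradients and no online search are involved.
prompts = build_prompts(SEED_REQUESTS, CACHED_SUFFIXES)
```

With two seed requests and two suffix options, the sketch yields four prompts; the number of prompts grows as the product of the two lists.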
class AdaptiveAttacks(config_root=<module 'garak._config'>)
Bases: Probe
Simple adaptive attacks probe (template + cached suffix)
Combines an adaptive prompt template with a small cached suffix list to elicit policy-violating responses from safety-aligned chat models without requiring gradient access. Based on Andriushchenko, Croce, and Flammarion (ICML 2024).
Configurable parameters:
DEFAULT_PARAMS contents:
follow_prompt_cap = True
Default values are listed above.
See also Configuring garak for how to set these values.
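As a hedged illustration of overriding the default, the snippet below builds an options structure and serializes it to JSON. The nesting (module name, then class name, then parameter) is an assumption about how probe options are laid out; check it against Configuring garak before relying on it.

```python
import json

# Assumed layout (not confirmed by this page): module -> class -> parameter.
probe_options = {
    "adaptive_attacks": {
        "AdaptiveAttacks": {
            # Override the documented default of True.
            "follow_prompt_cap": False,
        }
    }
}

# Serialize so the options can be saved to a file or passed along as a string.
options_json = json.dumps(probe_options, indent=2)
print(options_json)
```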
Other attributes: