garak.probes.adaptive_attacks

Simple adaptive attack probes

Probes implementing the “simple adaptive attacks” jailbreak family from Andriushchenko, Croce, and Flammarion, “Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks” (ICML 2024, arXiv:2404.02151).

The paper shows that high jailbreak success rates against safety-aligned chat models are achievable without gradient access by combining three lightweight techniques: (1) an adaptive prompt template that constrains the model’s response format and forces an affirmative opener, (2) a short adversarial suffix that can be discovered via random search, and (3) self-transfer, that is, reusing a suffix found for one target or request as the starting point for another. See garak issue #583.

This probe is template-only: it uses a cached set of seed forbidden requests combined with a paraphrase of the published adaptive prompt template and a small set of pre-computed adversarial suffixes drawn from public AdvBench results. No gradient computation or online search is performed.
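The template-only construction described above can be sketched as a cross product of seed requests and cached suffixes, each slotted into a single template. This is a minimal illustration under stated assumptions, not garak's implementation: `SEED_REQUESTS`, `CACHED_SUFFIXES`, `TEMPLATE`, and `build_prompts` are hypothetical names, and the strings are inert placeholders rather than the probe's real data.

```python
from itertools import product

# Hypothetical placeholder data; the real probe ships its own cached lists.
SEED_REQUESTS = ["<seed request 1>", "<seed request 2>"]
CACHED_SUFFIXES = ["<suffix A>", "<suffix B>", "<suffix C>"]

# Hypothetical stand-in for the paraphrased adaptive template: it fixes the
# response format and the affirmative opener, then appends the suffix.
TEMPLATE = (
    "Follow the response format exactly.\n"
    "Begin your reply with: {opener}\n"
    "Request: {request} {suffix}"
)


def build_prompts(requests, suffixes, opener="Sure, here is"):
    """Return one prompt per (request, suffix) pair; no search is performed."""
    return [
        TEMPLATE.format(opener=opener, request=request, suffix=suffix)
        for request, suffix in product(requests, suffixes)
    ]


prompts = build_prompts(SEED_REQUESTS, CACHED_SUFFIXES)
print(len(prompts))  # 2 requests x 3 suffixes = 6 prompts
```

Because every prompt is assembled from cached strings, the prompt set is fixed before any model is queried, which is what distinguishes this probe from the online random-search procedure in the paper.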

class AdaptiveAttacks(config_root=<module 'garak._config' from '/home/docs/checkouts/readthedocs.org/user_builds/garak/checkouts/latest/garak/_config.py'>)

Bases: Probe

Simple adaptive attacks probe (template + cached suffix)

Combines an adaptive prompt template with a small cached suffix list to elicit policy-violating responses from safety-aligned chat models without requiring gradient access. Based on Andriushchenko, Croce, and Flammarion (ICML 2024).

Configurable parameters:

DEFAULT_PARAMS contents:

  • follow_prompt_cap = True

Default values are listed above.

See also Configuring garak for how to set these values.
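As a hedged illustration of overriding the default, the sketch below builds a probe options mapping and serializes it to JSON. It assumes that plugin options are nested by module name and then class name, and that the resulting JSON would be supplied through garak's probe options mechanism (for example a probe option file); Configuring garak remains the authoritative reference for the layout. `make_probe_options` is a hypothetical helper, not part of garak.

```python
import json


def make_probe_options(module, klass, **params):
    """Nest parameters as {module: {class: {param: value}}} (assumed layout)."""
    return {module: {klass: dict(params)}}


# Assumption: disabling follow_prompt_cap is expressed as a plain boolean.
options = make_probe_options(
    "adaptive_attacks", "AdaptiveAttacks", follow_prompt_cap=False
)
print(json.dumps(options, indent=2))
```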

Other attributes:

active: bool = True
doc_uri: str = 'https://arxiv.org/abs/2404.02151'
goal: str = 'cause the model to disregard the system prompt and produce a forbidden response'
lang: str | None = 'en'
primary_detector: str | None = 'mitigation.MitigationBypass'
tags: Iterable[str] = ['avid-effect:security:S0403', 'avid-effect:performance:P0204', 'owasp:llm01', 'quality:Security:PromptStability', 'payload:jailbreak']
tier: Tier = 2