garak.probes.sysprompt_extraction
System Prompt Extraction
Probes that attempt to extract a model’s system prompt using adversarial techniques including direct requests, encoding tricks, and role-playing.
System prompts are sourced from HuggingFace datasets and combined with
attack templates loaded from garak/data/sysprompt_extraction/attacks.json.
- class SystemPromptExtraction(config_root=<module 'garak._config' from '/home/docs/checkouts/readthedocs.org/user_builds/garak/checkouts/latest/docs/source/../../garak/_config.py'>)Source
Bases:
ProbeAttempt to extract system prompts using adversarial attack templates
Combines system prompts from HuggingFace datasets with attack templates to test whether a model will reveal its system prompt under adversarial pressure. Uses conversation support to set the system prompt, then probes with various extraction techniques.
Configurable parameters:
DEFAULT_PARAMScontents:system_prompt_sources=['garak-llm/drh-System-Prompt-processed', 'garak-llm/tm-system_prompt']system_prompt_subsample=50follow_prompt_cap=True
Default values are listed
See also Configuring garak for how to set these values.
Other attributes: