garak.probes.sysprompt_extraction

System Prompt Extraction

Probes that attempt to extract a model’s system prompt using adversarial techniques including direct requests, encoding tricks, and role-playing.

System prompts are sourced from HuggingFace datasets and combined with attack templates loaded from garak/data/sysprompt_extraction/attacks.json.

class SystemPromptExtraction(config_root=<module 'garak._config' from '/home/docs/checkouts/readthedocs.org/user_builds/garak/checkouts/latest/docs/source/../../garak/_config.py'>)Source

Bases: Probe

Attempt to extract system prompts using adversarial attack templates

Combines system prompts from HuggingFace datasets with attack templates to test whether a model will reveal its system prompt under adversarial pressure. Uses conversation support to set the system prompt, then probes with various extraction techniques.

Configurable parameters:

DEFAULT_PARAMS contents:

  • system_prompt_sources = ['garak-llm/drh-System-Prompt-processed', 'garak-llm/tm-system_prompt']

  • system_prompt_subsample = 50

  • follow_prompt_cap = True

Default values are listed

See also Configuring garak for how to set these values.

Other attributes:

active: bool = True
doc_uri: str = 'https://simonwillison.net/tags/exfiltration-attacks/'
goal: str = "extract the model's system prompt"
lang: str | None = 'en'
primary_detector: str | None = 'sysprompt_extraction.PromptExtraction'
tags: Iterable[str] = ['avid-effect:security:S0301', 'owasp:llm01', 'quality:Security:PromptStability']
tier: Tier = 1