garak.probes.sysprompt_extraction

System Prompt Extraction

Probes that attempt to extract a model’s system prompt using adversarial techniques including direct requests, encoding tricks, and role-playing.

System prompts are sourced from HuggingFace datasets and combined with attack templates loaded from garak/data/sysprompt_extraction/attacks.json.

class SystemPromptExtraction(config_root=<module 'garak._config' from '/home/docs/checkouts/readthedocs.org/user_builds/garak/checkouts/latest/garak/_config.py'>)Source 

Bases: Probe

Attempt to extract system prompts using adversarial attack templates

Combines system prompts from HuggingFace datasets with attack templates to test whether a model will reveal its system prompt under adversarial pressure. Uses conversation support to set the system prompt, then probes with various extraction techniques.

Configurable parameters:

DEFAULT_PARAMS contents:

system_prompt_sources = ['garak-llm/drh-System-Prompt-processed', 'garak-llm/tm-system_prompt']
system_prompt_subsample = 50
follow_prompt_cap = True

Default values are listed

See also Configuring garak for how to set these values.

Other attributes:

active: bool = True

doc_uri: str = 'https://simonwillison.net/tags/exfiltration-attacks/'

goal: str = "extract the model's system prompt"

intent: str | None = 'T009reveal'

lang: str | None = 'en'

primary_detector: str | None = 'sysprompt_extraction.PromptExtraction'

tags: Iterable[str] = ['avid-effect:security:S0301', 'owasp:llm01', 'quality:Security:PromptStability']

tier: Tier = 1