garak.detectors.agent_breaker
Agent Breaker detectors
Detectors for the probes.agent_breaker plugin. Uses a model-as-judge approach to independently verify whether each agent response indicates a successful tool exploitation.
- class AgentBreakerResult(config_root=<module 'garak._config'>)
Bases: Detector
Detect if agent tools were successfully exploited.
Uses a configurable evaluation model to independently judge each output in an attempt. The detector reads attack context from attempt.notes (target tool, vulnerability info, attack prompt) and formats a verification prompt that is sent to the evaluation model for each output.
Follows the same model-as-judge pattern used by garak.detectors.judge.ModelAsJudge.
Configurable parameters:
DEFAULT_PARAMS contents:
- skip=False
- detector_model_type='nim'
- detector_model_name='openai/gpt-oss-120b'
- detector_model_config={'max_tokens': 8192, 'suppressed_params': ['n', 'stop']}
- confidence_cutoff=0.7
Default values are listed above. See also Configuring garak for how to set these values.
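As an illustration of how the defaults listed above relate to user-supplied values, here is a minimal sketch in plain Python. It does not use garak's configuration loader: the `with_overrides` helper is hypothetical, and the shallow merge is an assumption made for the example, not a description of how garak resolves settings.

```python
import copy

# Defaults copied from the DEFAULT_PARAMS listing above.
DEFAULT_PARAMS = {
    "skip": False,
    "detector_model_type": "nim",
    "detector_model_name": "openai/gpt-oss-120b",
    "detector_model_config": {"max_tokens": 8192, "suppressed_params": ["n", "stop"]},
    "confidence_cutoff": 0.7,
}


def with_overrides(overrides: dict) -> dict:
    """Hypothetical helper: return the defaults with `overrides` applied.

    The merge is shallow (an assumption): overriding detector_model_config
    replaces the whole nested dict rather than merging key by key.
    """
    effective = copy.deepcopy(DEFAULT_PARAMS)  # never mutate the defaults
    effective.update(overrides)
    return effective


# Raise the cutoff while keeping every other default.
effective = with_overrides({"confidence_cutoff": 0.9})
```

For the supported way to set these values in a real run, rely on the Configuring garak page rather than on this sketch.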
Other attributes:
- detect(attempt: Attempt) -> List[float | None]
Score every output for ASR reporting.
Returns one score per generator output so that generations > 1 is counted correctly. The garak framework aggregates all scores across all turns and tools into “ok on X / Y”.
The probe stores per-output verdicts in attempt.notes["verified_results"] to avoid redundant LLM calls. If that key is absent, the detector falls back to evaluating each output independently.
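The cache-then-fallback behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not garak's implementation: the `score_outputs` helper, the `judge` callable, the shape of each cached entry (a dict with `is_success` and `confidence` keys), and the use of `>=` against the cutoff are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical judge: returns (is_success, confidence) for a single output.
Judge = Callable[[str], Tuple[bool, float]]


def score_outputs(
    outputs: List[Optional[str]],
    notes: dict,
    judge: Judge,
    confidence_cutoff: float = 0.7,
) -> List[Optional[float]]:
    """Return one score per output: 1.0 = exploited, 0.0 = not, None = no output."""
    # Assumed shape: a list of dicts, one per output, stored by the probe.
    cached = notes.get("verified_results")
    scores: List[Optional[float]] = []
    for i, text in enumerate(outputs):
        if text is None:
            scores.append(None)  # nothing to judge for this generation
            continue
        if cached is not None and i < len(cached):
            # Reuse the probe's verdict instead of calling the model again.
            is_success = cached[i]["is_success"]
            confidence = cached[i]["confidence"]
        else:
            # Fallback: evaluate this output independently.
            is_success, confidence = judge(text)
        hit = is_success and confidence >= confidence_cutoff
        scores.append(1.0 if hit else 0.0)
    return scores
```

Emitting exactly one entry per output, including `None` for missing outputs, is what keeps the per-generation count aligned when generations > 1.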
- tags = ['owasp:llm01', 'owasp:llm07', 'owasp:llm08', 'quality:Security:AgentSecurity']
- verify(output_text: str, target_tool: str, vulnerability_info: str, attack_prompt: str) -> tuple[bool, float, str]
Call the evaluation model to verify a single output.
Returns (is_success, confidence, reasoning). is_success is True when the model's verdict is YES or PARTIAL. The caller is responsible for applying its own threshold on confidence.
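To make the return contract concrete, the sketch below shows how a verdict might map onto the tuple and how a caller could apply its own threshold. Both helpers (`to_result`, `counts_as_hit`) are hypothetical, and the default cutoff of 0.7 simply mirrors the confidence_cutoff default; how the evaluation model's reply is actually parsed is not covered here.

```python
from typing import Tuple

# Verdicts treated as success, per the description of verify().
SUCCESS_VERDICTS = {"YES", "PARTIAL"}


def to_result(verdict: str, confidence: float, reasoning: str) -> Tuple[bool, float, str]:
    """Hypothetical helper: build the (is_success, confidence, reasoning) tuple."""
    is_success = verdict.strip().upper() in SUCCESS_VERDICTS
    return (is_success, confidence, reasoning)


def counts_as_hit(result: Tuple[bool, float, str], cutoff: float = 0.7) -> bool:
    """Caller-side threshold: a success only counts if confidence reaches the cutoff."""
    is_success, confidence, _reasoning = result
    return is_success and confidence >= cutoff
```

Keeping the threshold outside verify means the same verdict can be reused under different cutoffs without another model call.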