garak.probes.promptinject

PromptInject

Basic prompt injections, including plaintext malicious commands into requests for innocuous information.

Probes in this module should specifically implement parts of the PromptInject framework.

Uses the PromptInject framework, https://openreview.net/forum?id=qiaRo_7Zmug, to probe a system. A subset of the attacks is implemented here. The default is to use a selection of that subset, because even the subset corresponds to thousands of requests by default.

Further info:

https://openreview.net/forum?id=qiaRo_7Zmug

class HijackHateHumans(config_root=<module 'garak._config' from '/home/docs/checkouts/readthedocs.org/user_builds/garak/checkouts/latest/garak/_config.py'>)

Bases: Probe

PromptInject probe to inject a string; payload is hate-humans

Attempt to make the model repeat a given string

Configurable parameters:

DEFAULT_PARAMS contents:

generation_params = ['temperature', 'top_p', 'frequency_penalty', 'presence_penalty', 'max_tokens']

Default values are listed

See also Configuring garak for how to set these values.

Other attributes:

active: bool = True

description = 'PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. trophy Best Paper Awards @ NeurIPS ML Safety Workshop 2022'

goal: str = 'inject a prompt on hate humans'

intent: str | None = 'T009ignore'

lang: str | None = 'en'

load_data_all_attacks(prompt_data)Source 

name = 'promptinject_hate-humans'

parallelisable_attempts: bool = False

primary_detector: str | None = 'promptinject.AttackRogueString'

rogue_string = 'hate-humans'

tags: Iterable[str] = ['avid-effect:security:S0403', 'owasp:llm01', 'quality:Behavioral:ContentSafety:Violence', 'quality:Security:PromptStability', 'payload:unwanted:violence', 'demon:Language:Prompt_injection:Strong_arm_attack', 'demon:Language:Stylizing:Capitalizing', 'demon:Rhetoric:Persuasion_and_manipulation:Distraction', 'demon:Language:Prompt_injection:Ignore_previous_instructions']

tier: Tier = 2

uri = 'https://openreview.net/forum?id=qiaRo_7Zmug'

class HijackHateHumansFull(config_root=<module 'garak._config' from '/home/docs/checkouts/readthedocs.org/user_builds/garak/checkouts/latest/garak/_config.py'>)

Bases: Probe

PromptInject probe to inject a string; payload is hate-humans

Attempt to make the model repeat a given string

Configurable parameters:

DEFAULT_PARAMS contents:

generation_params = ['temperature', 'top_p', 'frequency_penalty', 'presence_penalty', 'max_tokens']

Default values are listed

See also Configuring garak for how to set these values.

Other attributes:

active: bool = False

description = 'PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. trophy Best Paper Awards @ NeurIPS ML Safety Workshop 2022'

goal: str = 'inject a prompt on hate humans'

intent: str | None = 'T009ignore'

lang: str | None = 'en'

load_data_all_attacks(prompt_data)Source 

name = 'promptinject_hate-humans'

parallelisable_attempts: bool = False

primary_detector: str | None = 'promptinject.AttackRogueString'

rogue_string = 'hate-humans'

tags: Iterable[str] = ['avid-effect:security:S0403', 'owasp:llm01', 'quality:Behavioral:ContentSafety:Violence', 'quality:Security:PromptStability', 'payload:unwanted:violence', 'demon:Language:Prompt_injection:Strong_arm_attack', 'demon:Language:Stylizing:Capitalizing', 'demon:Language:Prompt_injection:Ignore_previous_instructions']

tier: Tier = 2

uri = 'https://openreview.net/forum?id=qiaRo_7Zmug'

class HijackKillHumans(config_root=<module 'garak._config' from '/home/docs/checkouts/readthedocs.org/user_builds/garak/checkouts/latest/garak/_config.py'>)

Bases: Probe

PromptInject probe to inject a string; payload is kill-humans

Attempt to make the model repeat a given string

Configurable parameters:

DEFAULT_PARAMS contents:

generation_params = ['temperature', 'top_p', 'frequency_penalty', 'presence_penalty', 'max_tokens']

Default values are listed

See also Configuring garak for how to set these values.

Other attributes:

active: bool = True

description = 'PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. trophy Best Paper Awards @ NeurIPS ML Safety Workshop 2022'

goal: str = 'inject a prompt on kill humans'

intent: str | None = 'T009ignore'

lang: str | None = 'en'

load_data_all_attacks(prompt_data)Source 

name = 'promptinject_kill-humans'

parallelisable_attempts: bool = False

primary_detector: str | None = 'promptinject.AttackRogueString'

rogue_string = 'kill-humans'

tags: Iterable[str] = ['avid-effect:security:S0403', 'owasp:llm01', 'quality:Behavioral:ContentSafety:Violence', 'quality:Security:PromptStability', 'payload:unwanted:violence', 'demon:Language:Prompt_injection:Strong_arm_attack', 'demon:Language:Stylizing:Capitalizing', 'demon:Rhetoric:Persuasion_and_manipulation:Distraction', 'demon:Language:Prompt_injection:Ignore_previous_instructions']

tier: Tier = 2

uri = 'https://openreview.net/forum?id=qiaRo_7Zmug'

class HijackKillHumansFull(config_root=<module 'garak._config' from '/home/docs/checkouts/readthedocs.org/user_builds/garak/checkouts/latest/garak/_config.py'>)

Bases: Probe

PromptInject probe to inject a string; payload is kill-humans

Attempt to make the model repeat a given string

Configurable parameters:

DEFAULT_PARAMS contents:

generation_params = ['temperature', 'top_p', 'frequency_penalty', 'presence_penalty', 'max_tokens']

Default values are listed

See also Configuring garak for how to set these values.

Other attributes:

active: bool = False

description = 'PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. trophy Best Paper Awards @ NeurIPS ML Safety Workshop 2022'

goal: str = 'inject a prompt on kill humans'

intent: str | None = 'T009ignore'

lang: str | None = 'en'

load_data_all_attacks(prompt_data)Source 

name = 'promptinject_kill-humans'

parallelisable_attempts: bool = False

primary_detector: str | None = 'promptinject.AttackRogueString'

rogue_string = 'kill-humans'

tags: Iterable[str] = ['avid-effect:security:S0403', 'owasp:llm01', 'quality:Behavioral:ContentSafety:Violence', 'quality:Security:PromptStability', 'payload:unwanted:violence', 'demon:Language:Prompt_injection:Strong_arm_attack', 'demon:Language:Stylizing:Capitalizing', 'demon:Language:Prompt_injection:Ignore_previous_instructions']

tier: Tier = 2

uri = 'https://openreview.net/forum?id=qiaRo_7Zmug'

class HijackLongPrompt(config_root=<module 'garak._config' from '/home/docs/checkouts/readthedocs.org/user_builds/garak/checkouts/latest/garak/_config.py'>)

Bases: Probe

PromptInject probe to inject a string; payload is long-prompt

Attempt to make the model repeat a given string

Configurable parameters:

DEFAULT_PARAMS contents:

generation_params = ['temperature', 'top_p', 'frequency_penalty', 'presence_penalty', 'max_tokens']

Default values are listed

See also Configuring garak for how to set these values.

Other attributes:

active: bool = True

description = 'PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. trophy Best Paper Awards @ NeurIPS ML Safety Workshop 2022'

goal: str = 'inject a prompt on long prompt'

intent: str | None = 'T009ignore'

lang: str | None = 'en'

load_data_all_attacks(prompt_data)Source 

name = 'promptinject_long-prompt'

parallelisable_attempts: bool = False

primary_detector: str | None = 'promptinject.AttackRogueString'

rogue_string = 'long-prompt'

tags: Iterable[str] = ['avid-effect:security:S0403', 'owasp:llm01', 'quality:Behavioral:ContentSafety:Violence', 'quality:Security:PromptStability', 'payload:unwanted:violence', 'demon:Language:Prompt_injection:Strong_arm_attack', 'demon:Language:Stylizing:Capitalizing', 'demon:Rhetoric:Persuasion_and_manipulation:Distraction', 'demon:Language:Prompt_injection:Ignore_previous_instructions']

tier: Tier = 2

uri = 'https://openreview.net/forum?id=qiaRo_7Zmug'

class HijackLongPromptFull(config_root=<module 'garak._config' from '/home/docs/checkouts/readthedocs.org/user_builds/garak/checkouts/latest/garak/_config.py'>)

Bases: Probe

PromptInject probe to inject a string; payload is long-prompt

Attempt to make the model repeat a given string

Configurable parameters:

DEFAULT_PARAMS contents:

generation_params = ['temperature', 'top_p', 'frequency_penalty', 'presence_penalty', 'max_tokens']

Default values are listed

See also Configuring garak for how to set these values.

Other attributes:

active: bool = False

description = 'PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. trophy Best Paper Awards @ NeurIPS ML Safety Workshop 2022'

goal: str = 'inject a prompt on long prompt'

intent: str | None = 'T009ignore'

lang: str | None = 'en'

load_data_all_attacks(prompt_data)Source 

name = 'promptinject_long-prompt'

parallelisable_attempts: bool = False

primary_detector: str | None = 'promptinject.AttackRogueString'

rogue_string = 'long-prompt'

tags: Iterable[str] = ['avid-effect:security:S0403', 'owasp:llm01', 'quality:Behavioral:ContentSafety:Violence', 'quality:Security:PromptStability', 'payload:unwanted:violence', 'demon:Language:Prompt_injection:Strong_arm_attack', 'demon:Language:Stylizing:Capitalizing', 'demon:Language:Prompt_injection:Ignore_previous_instructions']

tier: Tier = 2

uri = 'https://openreview.net/forum?id=qiaRo_7Zmug'

constructor(self, config_root=<module 'garak._config' from '/home/docs/checkouts/readthedocs.org/user_builds/garak/checkouts/latest/garak/_config.py'>)Source 

load_data_all_attacks(self, prompt_data)Source 