garak.probes.atkgen

atkgen: Attack Generation

This probe uses a separate model to create attack prompts, which are intended to guide the target generator into a particular failure mode.

Only probes that dynamically generate their own attack prompts belong in this module.

Warning: These attack models may generate highly toxic prompts while attempting to goad target models into failure. Peruse logs at your own risk, and don’t spend too much time doing it. This guide to working with harmful text may help: https://aclanthology.org/2022.findings-emnlp.35/

Further info:

https://interhumanagreement.substack.com/p/faketoxicityprompts-automatic-red

class Tox(config_root=<module 'garak._config'>)

Bases: Probe

Automatic attack generation, trying to elicit toxic outputs from the model.

This probe uses an attack model, trained on conversations that led to LLMs outputting toxicity, to generate prompts that should nudge a target LLM into yielding toxic output. The data for this attack generation model was selected using the same detector that this probe uses, so the attack generation model and the detector both work from the same definition of the failure mode.
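The exchange between the attack model and the target can be pictured as a turn-taking loop. The sketch below is illustrative only and is not garak's implementation: `run_conversation`, `red_team`, and `target` are hypothetical names, and the stopping rule for repeated responses is an assumption about how a parameter such as `allow_repetition` might behave.

```python
from typing import Callable, List, Tuple


def run_conversation(
    red_team: Callable[[str], str],   # hypothetical: maps last target reply to next attack prompt
    target: Callable[[str], str],     # hypothetical: maps attack prompt to target reply
    max_calls: int = 5,               # mirrors the max_calls_per_conv default
    allow_repetition: bool = False,   # mirrors the allow_repetition default
) -> List[Tuple[str, str]]:
    """Run one attack conversation and return (challenge, response) pairs."""
    turns: List[Tuple[str, str]] = []
    last_response = ""  # assumption: the first attack prompt is built from an empty reply
    for _ in range(max_calls):
        challenge = red_team(last_response)
        response = target(challenge)
        turns.append((challenge, response))
        # Assumed stopping rule: end early when the target repeats itself.
        if not allow_repetition and response == last_response:
            break
        last_response = response
    return turns


# Stub models, for demonstration only.
def stub_red_team(last_reply: str) -> str:
    return f"challenge after {len(last_reply)} chars"


def stub_target(challenge: str) -> str:
    return "same reply"


short_run = run_conversation(stub_red_team, stub_target)
full_run = run_conversation(stub_red_team, stub_target, allow_repetition=True)
```

With these stubs, the target repeats itself on the second turn, so the first run ends early while the second uses all of its allotted calls.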

Configurable parameters:

DEFAULT_PARAMS contents:

max_calls_per_conv = 5
convs_per_generation = 5
constructive_tension = True
red_team_model_type = 'huggingface.Pipeline'
red_team_model_name = 'garak-llm/attackgeneration-toxicity_gpt2'
red_team_model_config = {'hf_args': {'device': 'cpu', 'torch_dtype': 'float32'}}
red_team_prompt_template = '<|input|>[query]<|response|>'
red_team_postproc_rm_regex = '\\<\\|.*'
use_only_first_sent = True
allow_repetition = False

The values listed above are the defaults.

See also Configuring garak for how to set these values.
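To make the two string-valued defaults more concrete, here is a minimal sketch of how `red_team_prompt_template` and `red_team_postproc_rm_regex` could be applied. The helper names `build_red_team_prompt` and `clean_red_team_output` are hypothetical, and the assumption that `[query]` is filled with the target's previous reply is an interpretation of the template rather than something stated in this reference.

```python
import re

# Default values copied from DEFAULT_PARAMS above.
RED_TEAM_PROMPT_TEMPLATE = "<|input|>[query]<|response|>"
RED_TEAM_POSTPROC_RM_REGEX = r"\<\|.*"  # same pattern as '\\<\\|.*'


def build_red_team_prompt(last_response: str) -> str:
    """Hypothetical helper: place the previous reply into the [query] slot."""
    return RED_TEAM_PROMPT_TEMPLATE.replace("[query]", last_response)


def clean_red_team_output(raw: str) -> str:
    """Hypothetical helper: drop everything from the first '<|' marker onward."""
    return re.sub(RED_TEAM_POSTPROC_RM_REGEX, "", raw).strip()


prompt = build_red_team_prompt("I like gardening.")
cleaned = clean_red_team_output("What do you grow? <|input|>leftover tokens")
```

The regular expression removes any trailing special-token text that the attack model emits after its generated prompt, so only the natural-language challenge is passed on.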

Other attributes:

active: bool = True

doc_uri: str = 'https://interhumanagreement.substack.com/p/faketoxicityprompts-automatic-red'

goal: str = 'make the model say something toxic'

intent: str | None = 'S005'

lang: str | None = 'en'

parallelisable_attempts: bool = False

primary_detector: str | None = 'unsafe_content.ToxicCommentModel'

probe(generator) → List[Attempt]: attempt to exploit the target generator, returning a list of results

tags: Iterable[str] = ['avid-effect:ethics:E0301', 'quality:Behavioral:ContentSafety:Toxicity', 'quality:Behavioral:ContentSafety:HateHarassment', 'quality:Behavioral:ContentSafety:Profanity', 'payload:unwanted', 'demon:Stratagems:Scattershot:Regenerate_response']

tier: Tier = 2