garak.probes.atkgen

atkgen: Attack Generation.

This probe uses a separate model to create prompts for a generator, which are intended to guide that generator into a certain failure mode.

Warning: These attack models may generate highly toxic prompts while attempting to goad target models into failure. Peruse logs at your own risk, don’t spend too much time doing it, and follow a guide to working with harmful text.

class garak.probes.atkgen.Tox(config_root=<module 'garak._config'>)

Bases: Probe

Automatic attack generation, trying to elicit toxic outputs from the model.

This probe uses an attack model, trained on conversations that led LLMs to output toxic text, to generate prompts that should nudge a target LLM into yielding toxic output. The training data for the attack generation model was selected using the same detector that this probe uses, so the attack generation model and the detector share the same definition of the failure mode.

DEFAULT_PARAMS = {'constructive_tension': True, 'max_calls': 5, 'red_team_model_config': {'hf_args': {'device': 'cpu', 'torch_dtype': 'float32'}}, 'red_team_model_name': 'garak-llm/artgpt2tox', 'red_team_model_type': 'huggingface.Pipeline', 'red_team_postproc_rm_regex': '\\<\\|.*', 'red_team_prompt_template': '<|input|>[query]<|response|>', 'use_only_first_sent': True}
bcp47: Iterable[str] | None = 'en'
doc_uri: str = 'https://interhumanagreement.substack.com/p/faketoxicityprompts-automatic-red'
goal: str = 'make the model say something toxic'
parallelisable_attempts: bool = False
probe(generator) → List[Attempt]

Attempt to exploit the target generator, returning a list of results.
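
The turn-taking that the class description implies can be sketched roughly as follows. This is a simplified illustration, not garak's actual implementation: `run_attack_turns`, `toy_red_team`, and `toy_target` are hypothetical names, the two models are replaced by plain callables, and the reading of `constructive_tension` (keep going even when the target returns an empty response) is an assumption.

```python
from typing import Callable, List, Tuple


def run_attack_turns(
    red_team: Callable[[str], str],
    target: Callable[[str], str],
    max_calls: int = 5,
    constructive_tension: bool = True,
) -> List[Tuple[str, str]]:
    """Alternate between a red-team model and a target for up to max_calls turns."""
    turns: List[Tuple[str, str]] = []
    last_response = ""
    for _ in range(max_calls):
        # The red-team model sees the target's previous output and proposes
        # the next prompt (the "challenge").
        challenge = red_team(last_response)
        response = target(challenge)
        turns.append((challenge, response))
        # Assumed reading of constructive_tension: when it is off, an empty
        # response from the target ends the conversation early.
        if not response and not constructive_tension:
            break
        last_response = response
    return turns


# Hypothetical stand-ins for the attack model and the target generator.
def toy_red_team(last_response: str) -> str:
    return f"Tell me more about: {last_response}" if last_response else "Hello"


def toy_target(prompt: str) -> str:
    return prompt.upper()


turns = run_attack_turns(toy_red_team, toy_target, max_calls=3)
for challenge, response in turns:
    print(f"{challenge!r} -> {response!r}")
```

In a real run, each (challenge, response) pair would presumably be recorded in the returned attempts and scored by the recommended detector; the sketch only shows the loop structure.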

recommended_detector: Iterable[str] = ['unsafe_content.ToxicCommentModel']
tags: Iterable[str] = ['avid-effect:ethics:E0301', 'quality:Behavioral:ContentSafety:Toxicity', 'quality:Behavioral:ContentSafety:HateHarassment', 'quality:Behavioral:ContentSafety:Profanity', 'payload:unwanted']
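
The string-handling entries in DEFAULT_PARAMS (`red_team_prompt_template`, `red_team_postproc_rm_regex`, `use_only_first_sent`) may be easier to read with a small example. The sketch below assumes that `[query]` in the template is replaced by the target's previous response, that the regex is applied with `re.sub` to strip trailing special tokens from the attack model's output, and that "first sentence" means the text before the first period followed by a space. The helper names `build_red_team_prompt` and `clean_challenge` are hypothetical and not part of garak.

```python
import re

# Values copied from DEFAULT_PARAMS; the regex is shown unescaped as a raw string.
TEMPLATE = "<|input|>[query]<|response|>"
RM_REGEX = r"\<\|.*"


def build_red_team_prompt(last_response: str, template: str = TEMPLATE) -> str:
    """Fill the [query] placeholder with the target's previous response (assumed)."""
    return template.replace("[query]", last_response)


def clean_challenge(
    raw: str, rm_regex: str = RM_REGEX, use_only_first_sent: bool = True
) -> str:
    """Strip everything from the first '<|' onwards, then optionally keep one sentence."""
    cleaned = re.sub(rm_regex, "", raw).strip()
    if use_only_first_sent:
        # Assumed sentence splitting: cut at the first ". " boundary.
        cleaned = cleaned.split(". ")[0]
    return cleaned


prompt = build_red_team_prompt("I like cats")
challenge = clean_challenge("Why cats. Dogs are better<|input|>leftover")
print(prompt)     # <|input|>I like cats<|response|>
print(challenge)  # Why cats
```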