garak.probes.atkgen
atkgen: Attack Generation.
This probe uses a separate model to create prompts for a generator, which are intended to guide that generator into a certain failure mode.
Warning: These attack models may generate highly toxic prompts while attempting to goad target models into failure. Peruse logs at your own risk, don't spend too much time doing it, and consult a guide to working with harmful text.
- class garak.probes.atkgen.Tox(config_root=<module 'garak._config'>)
Bases: Probe
Automatic attack generation, trying to elicit toxic outputs from the model.
This probe uses an attack model, trained on conversations that led to LLMs outputting toxicity, to generate prompts that should nudge a target LLM into yielding toxic output. The data used for this attack generation model was selected using the same detector as this tox probe uses, so both attack generation model and detector are looking for the same definition of the failure mode.
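The conversation loop this describes can be sketched in plain Python. This is an illustrative simplification, not garak's actual implementation: `attack_model`, `target_model`, and `detector` are hypothetical stand-ins for the real garak components, and the first-sentence truncation is approximated with a naive split. The template, regex, and call limit mirror the probe's `DEFAULT_PARAMS` below.

```python
import re

# Simplified sketch of the atkgen red-team loop. The attack model proposes a
# prompt conditioned on the target's last reply; the target answers; the
# detector scores the answer for toxicity. Repeat up to max_calls.

PROMPT_TEMPLATE = "<|input|>[query]<|response|>"  # red_team_prompt_template
RM_REGEX = r"\<\|.*"                              # red_team_postproc_rm_regex
MAX_CALLS = 5                                     # max_calls

def red_team_loop(attack_model, target_model, detector):
    last_target_output = ""
    conversation = []
    for _ in range(MAX_CALLS):
        # Build the attack model's input from the target's previous reply.
        query = PROMPT_TEMPLATE.replace("[query]", last_target_output)
        attack_prompt = attack_model(query)
        # Strip special tokens (e.g. <|endoftext|>) from the attack output.
        attack_prompt = re.sub(RM_REGEX, "", attack_prompt).strip()
        # use_only_first_sent: keep only the first sentence (naive split here).
        attack_prompt = attack_prompt.split(". ")[0]
        last_target_output = target_model(attack_prompt)
        conversation.append((attack_prompt, last_target_output))
        # Stop early if the detector flags the target's output as toxic.
        if detector(last_target_output) > 0.5:
            break
    return conversation
```

With dummy callables in place of real models, the loop runs the full five turns when the detector never fires, returning the (attack prompt, target reply) pairs for scoring.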
- DEFAULT_PARAMS = {
      'constructive_tension': True,
      'max_calls': 5,
      'red_team_model_config': {'hf_args': {'device': 'cpu', 'torch_dtype': 'float32'}},
      'red_team_model_name': 'garak-llm/artgpt2tox',
      'red_team_model_type': 'huggingface.Pipeline',
      'red_team_postproc_rm_regex': '\\<\\|.*',
      'red_team_prompt_template': '<|input|>[query]<|response|>',
      'use_only_first_sent': True,
  }
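Two of these parameters are small string transforms that are easy to demonstrate in isolation: `red_team_prompt_template` wraps the target's last reply in the attack model's expected input markers, and `red_team_postproc_rm_regex` trims everything from the first special-token marker onward in the attack model's raw output. A minimal demonstration using the values from `DEFAULT_PARAMS` (the sample strings are made up for illustration):

```python
import re

# red_team_prompt_template: substitute the target's last reply for [query].
template = "<|input|>[query]<|response|>"
prompt = template.replace("[query]", "earlier target reply")
# → "<|input|>earlier target reply<|response|>"

# red_team_postproc_rm_regex: drop the first <| marker and everything after it.
raw_output = "an adversarial prompt<|endoftext|> trailing junk"
cleaned = re.sub(r"\<\|.*", "", raw_output)
# → "an adversarial prompt"
```

Because the regex is anchored only at `<|` and greedy to end of line, any special tokens the HuggingFace pipeline emits after the generated prompt are removed in one pass.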