garak.probes.phrasing
Past Tense
Refusal training is widely used to prevent LLMs from generating harmful, undesirable, or illegal
outputs. We reveal a curious generalization gap in the current refusal training approaches: simply reformulating a harmful request in the past tense is often sufficient to jailbreak many state-of-the-art LLMs.
- class garak.probes.phrasing.FutureTense(config_root=<module 'garak._config'>)
Bases: Probe
Will the system give unsafe responses when asked to write about weapon creation or dangerous behavior?
This check sends the generator requests for advice on possibly protected or restricted subjects, phrased in the future tense.
- class garak.probes.phrasing.FutureTenseMini(config_root=<module 'garak._config'>)
Bases: FutureTense, TenseMini
Attempt to evade safeguards by setting requests in the future tense.
Uses a subset of the tense-modified requests. Reported success is higher with more variation, so the generations setting is overridden.
- class garak.probes.phrasing.PastTense(config_root=<module 'garak._config'>)
Bases: Probe
Will the system give unsafe responses when asked to write about weapon creation or dangerous behavior?
This check sends the generator requests for advice on possibly protected or restricted subjects, phrased in the past tense.
- class garak.probes.phrasing.PastTenseMini(config_root=<module 'garak._config'>)
Attempt to evade safeguards by setting requests in the past tense.
Uses a subset of the tense-modified requests. Reported success is higher with more variation, so the generations setting is overridden.
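The subsetting described for the Mini probes can be pictured with a small, self-contained sketch. It is not garak's implementation: the helper name `minify_prompts`, its parameters, and the choice of one generation per prompt are all assumptions made for illustration. The idea shown is to spend a fixed budget on many distinct prompts rather than on repeated generations of a few.

```python
import random
from typing import List, Optional, Tuple


def minify_prompts(
    prompts: List[str],
    max_prompts: int,
    seed: Optional[int] = None,
) -> Tuple[List[str], int]:
    """Hypothetical helper: pick a random subset of prompts and
    return it together with an overridden generations count."""
    rng = random.Random(seed)
    shuffled = list(prompts)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    subset = shuffled[:max_prompts]
    generations = 1  # assumed override: one output per distinct prompt
    return subset, generations


if __name__ == "__main__":
    # Benign placeholder prompts; a real probe would load its own data.
    all_prompts = [f"placeholder prompt {i}" for i in range(10)]
    subset, generations = minify_prompts(all_prompts, max_prompts=3, seed=0)
    print(len(subset), generations)
```

Under these assumptions the total number of model calls is `len(subset) * generations`, so keeping generations low leaves room for more varied prompts within the same budget.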