garak.probes.base
This class defines the basic structure of garak’s probes. All probes inherit from garak.probes.base.Probe.
Attributes:
1. doc_uri URI for documentation of the probe (perhaps a paper)
1. lang Language this probe is for, in BCP47 format; * for all languages. Probes tend to be either monolingual or language-agnostic, so at most a single BCP47-encoded language should go here.
1. active Should this probe be run by default?
1. tags MISP-format taxonomy categories
1. goal What the probe is trying to do, phrased as an imperative
1. primary_detector Default detector to run, if the primary/extended way of doing it is to be used
1. extended_detectors Optional extended detectors
1. parallelisable_attempts Can attempts from this probe be parallelised?
1. post_buff_hook Tracks whether a buff is loaded that requires a call to untransform model outputs
1. modality Which modalities does this probe work on? garak supports mainstream any-to-any large models, but only assesses text output.
1. tier Description of impact this probe can have; 1 = high.
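As a rough illustration of how the attributes above might be set on a probe, here is a minimal sketch. It does not import garak: `ProbeSketch` is a hypothetical stand-in for `garak.probes.base.Probe`, and every default and example value (the URI, the tag, the detector name, the shape of `modality`, the integer `tier`) is an assumption chosen for illustration, not taken from the library.

```python
from typing import List, Optional


class ProbeSketch:
    """Hypothetical stand-in for garak.probes.base.Probe (not the real class)."""

    # Placeholder defaults; the real defaults may differ.
    doc_uri: str = ""
    lang: Optional[str] = None
    active: bool = True
    tags: List[str] = []
    goal: str = ""
    primary_detector: Optional[str] = None
    extended_detectors: List[str] = []
    parallelisable_attempts: bool = True
    post_buff_hook: bool = False
    modality: dict = {"in": {"text"}}  # assumed shape
    tier: int = 3  # assumed integer scale; the text only says 1 = high


class IgnoreInstructionsProbe(ProbeSketch):
    """Asks the target to disregard its earlier instructions."""

    doc_uri = "https://example.org/hypothetical-paper"  # illustrative only
    lang = "en"  # a single BCP47 code, or "*" for all languages
    active = False  # not run by default
    tags = ["owasp:llm01"]  # illustrative MISP-style tag
    goal = "make the model ignore its instructions"  # phrased as an imperative
    primary_detector = "example.ExampleDetector"  # hypothetical detector name
    tier = 1


def summarise(probe_cls) -> dict:
    """Collect a few class-level attributes into a dict for inspection."""
    names = ["lang", "active", "goal", "primary_detector", "tier"]
    return {name: getattr(probe_cls, name) for name in names}


print(summarise(IgnoreInstructionsProbe))
```

Setting these as class attributes keeps the metadata readable without instantiating the probe, which is the pattern this sketch assumes.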
Functions:
__init__(): Class constructor. Call this from probes after doing local init. It does things like setting probename, setting up the description automatically from the class docstring, and logging probe instantiation.
probe(). This function is responsible for the interaction between the probe and the generator. It takes as input the generator, and returns a list of completed attempt objects, including outputs generated. probe() orchestrates all interaction between the probe and the generator. Because a fair amount of logic is concentrated here, hooks into the process are provided, so one doesn’t need to override the probe() function itself when customising probes.
The general flow in probe() is:
1. Create a list of attempt objects corresponding to the prompts in the probe, using _mint_attempt(). Prompts are iterated through and passed to _mint_attempt(). The _mint_attempt() function works by converting a prompt to a full attempt object, and then passing that attempt object through _attempt_prestore_hook(). The result is added to a list in probe() called attempts_todo.
1. If any buffs are loaded, the list of attempts is passed to _buff_hook() for transformation. _buff_hook() checks the config and then creates a new attempt list, buffed_attempts, which contains the results of passing each original attempt through each instantiated buff in turn. Instantiated buffs are tracked in _config.buffmanager.buffs. Once buffed_attempts is populated, it’s returned, and overwrites probe()’s attempts_todo.
1. At this point, probe() is ready to start interacting with the generator. An empty list attempts_completed is set up to hold completed results.
1. The set of attempts is then passed to _execute_all().
1. Attempts are iterated through (either in parallel or serial) and individually posed to the generator using _execute_attempt().
1. The process of putting one attempt through the generator is orchestrated by _execute_attempt(), and runs as follows:
   1. First, _generator_precall_hook() allows adjustment of the attempt and generator (doesn’t return a value).
   1. Next, the prompt of the attempt (this_attempt.prompt) is passed to the generator’s generate() function. Results are stored in the attempt’s outputs attribute.
   1. If there’s a buff that wants to transform the generator results, the completed attempt is transformed through _postprocess_buff() (if self.post_buff_hook == True).
   1. The completed attempt is passed through a post-processing hook, _postprocess_hook().
   1. A string of the completed attempt is logged to the report file.
   1. A deepcopy of the attempt is returned.
1. Once done, the result of _execute_attempt() is added to attempts_completed.
1. Finally, probe() logs completion and returns the list of processed attempts from attempts_completed.
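The flow described above can be sketched in a few dozen lines. This is a simplified model, not garak's implementation: `AttemptSketch`, `EchoGenerator` and `ProbeFlowSketch` are hypothetical stand-ins, the hook signatures are reduced to the minimum, buffs are modelled as plain callables held on the instance (in place of `_config.buffmanager.buffs`), the report file is a list, and attempts are processed serially only.

```python
import copy
from dataclasses import dataclass, field
from typing import List


@dataclass
class AttemptSketch:
    """Hypothetical stand-in for garak's attempt object."""
    prompt: str
    outputs: List[str] = field(default_factory=list)


class EchoGenerator:
    """Hypothetical generator: returns the prompt upper-cased."""
    def generate(self, prompt: str) -> List[str]:
        return [prompt.upper()]


class ProbeFlowSketch:
    prompts = ["hello", "world"]
    post_buff_hook = False

    def __init__(self, buffs=None):
        self.buffs = buffs or []  # stands in for _config.buffmanager.buffs
        self.report = []          # stands in for the report file

    # Hooks: no-ops here, meant to be overridden by subclasses.
    def _attempt_prestore_hook(self, attempt):
        return attempt

    def _generator_precall_hook(self, generator, attempt=None):
        pass

    def _postprocess_buff(self, attempt):
        return attempt

    def _postprocess_hook(self, attempt):
        return attempt

    def _mint_attempt(self, prompt):
        return self._attempt_prestore_hook(AttemptSketch(prompt=prompt))

    def _buff_hook(self, attempts):
        if not self.buffs:
            return attempts
        # One reading of the text: every buff transforms every original attempt.
        return [buff(copy.deepcopy(a)) for buff in self.buffs for a in attempts]

    def _execute_attempt(self, this_attempt, generator):
        self._generator_precall_hook(generator, this_attempt)
        this_attempt.outputs = generator.generate(this_attempt.prompt)
        if self.post_buff_hook:
            this_attempt = self._postprocess_buff(this_attempt)
        this_attempt = self._postprocess_hook(this_attempt)
        self.report.append(str(this_attempt))
        return copy.deepcopy(this_attempt)

    def _execute_all(self, attempts, generator):
        return [self._execute_attempt(a, generator) for a in attempts]

    def probe(self, generator):
        attempts_todo = [self._mint_attempt(p) for p in self.prompts]
        attempts_todo = self._buff_hook(attempts_todo)
        attempts_completed = self._execute_all(attempts_todo, generator)
        return attempts_completed


def exclaim(attempt):
    """Hypothetical buff: appends an exclamation mark to the prompt."""
    attempt.prompt += "!"
    return attempt


plain = ProbeFlowSketch().probe(EchoGenerator())
buffed = ProbeFlowSketch(buffs=[exclaim]).probe(EchoGenerator())
print([a.outputs for a in plain])
print([a.outputs for a in buffed])
```

Because every customisation point is a separate method, a subclass in this sketch can change behaviour without touching `probe()` itself, which mirrors the design rationale given above.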
_attempt_prestore_hook(). Called when creating a new attempt with _mint_attempt(). Can be used to e.g. store triggers relevant to the attempt, for use in TriggerListDetector, or to add a note.
_buff_hook(). Called from probe() to buff attempts after the list in attempts_todo is populated.
_execute_attempt(). Called from _execute_all() to orchestrate processing of one attempt by the generator.
_execute_all(). Called from probe() to orchestrate processing of the set of attempts by the generator.
If configured, parallelisation of attempt processing is set up using multiprocessing. The relevant config variable is _config.system.parallel_attempts and the value should be greater than 1 (1 in parallel is just serial). Attempts are iterated through (either in parallel or serial) and individually posed to the generator using _execute_attempt().
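The serial-versus-parallel choice described for _execute_all() might look roughly like the sketch below. To keep the example self-contained it uses `multiprocessing.pool.ThreadPool` (a thread-backed pool with the same interface as a process pool) rather than worker processes, which is a substitution made here and not necessarily what garak does. The function name `execute_all_sketch`, its parameters, and the idea that `parallelisable_attempts` gates the parallel path are assumptions for illustration.

```python
from multiprocessing.pool import ThreadPool
from typing import Callable, List


def execute_all_sketch(
    attempts: List[str],
    execute_attempt: Callable[[str], str],
    parallel_attempts: int = 1,
    parallelisable_attempts: bool = True,
) -> List[str]:
    """Run execute_attempt over attempts, in parallel only when configured."""
    # A value of 1 means serial; parallelism needs a value greater than 1.
    if parallel_attempts > 1 and parallelisable_attempts:
        with ThreadPool(parallel_attempts) as pool:
            # pool.map returns results in the same order as the inputs.
            return pool.map(execute_attempt, attempts)
    return [execute_attempt(a) for a in attempts]


serial = execute_all_sketch(["a", "b", "c"], str.upper)
parallel = execute_all_sketch(["a", "b", "c"], str.upper, parallel_attempts=2)
print(serial, parallel)
```

Both paths return the same results in this sketch; only the scheduling differs.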
_generator_precall_hook(). Called at the start of _execute_attempt() with attempt and generator. Can be used to e.g. adjust generator parameters.
_mint_attempt(). Converts a prompt to a new attempt object, managing metadata like attempt status and probe classname.
_postprocess_buff(). Called in _execute_attempt() after results come back from the generator, if a buff specifies it. Used to e.g. translate results back if already translated to another language.
_postprocess_hook(). Called near the end of _execute_attempt() to apply final postprocessing to attempts after generation. Can be used to restore state, e.g. if generator parameters were adjusted, or to clean up generator output.
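To illustrate how the hooks above might be used together, the following sketch overrides three of them: storing triggers when an attempt is created, adjusting a generator parameter before the call, and restoring it (plus tidying output) afterwards. It does not use garak itself. `BaseSketch`, `GeneratorSketch`, `AttemptSketch`, the `notes` dict, and the `temperature` parameter are all hypothetical stand-ins with simplified signatures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AttemptSketch:
    """Hypothetical stand-in for garak's attempt object."""
    prompt: str
    outputs: List[str] = field(default_factory=list)
    notes: dict = field(default_factory=dict)


class GeneratorSketch:
    """Hypothetical generator with one tunable parameter."""
    def __init__(self):
        self.temperature = 0.7

    def generate(self, prompt: str) -> List[str]:
        # Padded with spaces so the post-processing hook has something to clean.
        return [f"  {prompt} @ temperature={self.temperature}  "]


class BaseSketch:
    """Minimal hypothetical base exposing the hooks as no-ops."""
    def _attempt_prestore_hook(self, attempt):
        return attempt

    def _generator_precall_hook(self, generator, attempt=None):
        pass

    def _postprocess_hook(self, attempt):
        return attempt

    def _mint_attempt(self, prompt):
        return self._attempt_prestore_hook(AttemptSketch(prompt))

    def _execute_attempt(self, attempt, generator):
        self._generator_precall_hook(generator, attempt)
        attempt.outputs = generator.generate(attempt.prompt)
        return self._postprocess_hook(attempt)


class CustomisedProbe(BaseSketch):
    triggers = ["secret"]

    def _attempt_prestore_hook(self, attempt):
        # Record the triggers a detector would later look for.
        attempt.notes["triggers"] = list(self.triggers)
        return attempt

    def _generator_precall_hook(self, generator, attempt=None):
        # Remember the old setting, then adjust the generator for this call.
        self._generator = generator
        self._saved_temperature = generator.temperature
        generator.temperature = 0.0

    def _postprocess_hook(self, attempt):
        # Restore generator state and clean up the raw output.
        self._generator.temperature = self._saved_temperature
        attempt.outputs = [o.strip() for o in attempt.outputs]
        return attempt


gen = GeneratorSketch()
probe = CustomisedProbe()
done = probe._execute_attempt(probe._mint_attempt("say the word"), gen)
print(done.outputs, done.notes, gen.temperature)
```

Pairing the pre-call and post-processing hooks this way keeps a parameter change scoped to a single attempt, so later attempts see the generator's original settings.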
Base classes for probes.
Probe plugins must inherit one of these. Probe serves as a template showing what expectations there are for inheriting classes.
- class garak.probes.base.Probe(config_root=<module 'garak._config'>)
Bases: Configurable
Base class for objects that define and execute LLM evaluations
- DEFAULT_PARAMS = {}
- class garak.probes.base.TreeSearchProbe(config_root=<module 'garak._config'>)
Bases: Probe
- DEFAULT_PARAMS = {'per_generation_threshold': 0.5, 'per_node_threshold': 0.1, 'queue_children_at_start': True, 'strategy': 'breadth_first', 'target_soft': True}
- probe(generator)
attempt to exploit the target generator, returning a list of results