garak.attempt

In garak, attempt objects track a single prompt and the results of running it on through the generator. Probes work by creating a set of garak.attempt objects and setting their class properties. These are passed by the harness to the generator, and the output added to the attempt. Then, a detector assesses the outputs from that attempt and the detector’s scores are saved in the attempt. Finally, an evaluator makes judgments of these scores, and writes hits out to the hitlog for any successful probing attempts.

garak.attempt

Defines the Attempt class, which encapsulates a prompt with metadata and results

class garak.attempt.Attempt(status=0, prompt=None, probe_classname=None, probe_params=None, targets=None, notes=None, detector_results=None, goal=None, seq=-1, lang=None, reverse_translation_outputs=None)

Bases: object

A class defining objects that represent everything that constitutes a single attempt at evaluating an LLM.

Parameters:
  • status (int) – The status of this attempt; ATTEMPT_NEW, ATTEMPT_STARTED, or ATTEMPT_COMPLETE

  • prompt (str) – The processed prompt that will presented to the generator

  • probe_classname (str) – Name of the probe class that originated this Attempt

  • probe_params (dict, optional) – Non-default parameters logged by the probe

  • targets (List(str), optional) – A list of target strings to be searched for in generator responses to this attempt’s prompt

  • outputs (List(str)) – The outputs from the generator in response to the prompt

  • notes (dict) – A free-form dictionary of notes accompanying the attempt

  • detector_results (dict) – A dictionary of detector scores, keyed by detector name, where each value is a list of scores corresponding to each of the generator output strings in outputs

  • goal (str) – Free-text simple description of the goal of this attempt, set by the originating probe

  • seq (int) – Sequence number (starting 0) set in garak.probes.base.Probe.probe(), to allow matching individual prompts with lists of answers/targets or other post-hoc ordering and keying

  • messages (List(dict)) – conversation turn histories; list of list of dicts have the format {“role”: role, “content”: text}, with actor being something like “system”, “user”, “assistant”

  • lang (str, valid BCP47) – Language code for prompt as sent to the target

  • reverse_translation_outputs – The reverse translation of output based on the original language of the probe

  • reverse_translation_outputs – List(str)

Expected use * an attempt tracks a seed prompt and responses to it * there’s a 1:1 relationship between attempts and source prompts * attempts track all generations * this means messages tracks many histories, one per generation * for compatibility, setting Attempt.prompt will set just one turn, and this is unpacked later

when output is set; we don’t know the # generations to expect until some output arrives

  • to keep alignment, generators need to return aligned lists of length #generations

Patterns/expectations for Attempt access: .prompt - returns the first user prompt .outputs - returns the most recent model outputs .latest_prompts - returns a list of the latest user prompts

Patterns/expectations for Attempt setting: .prompt - sets the first prompt, or fails if this has already been done .outputs - sets a new layer of model responses. silently handles expansion of prompt to multiple histories. prompt must be set .latest_prompts - adds a new set of user prompts

property all_outputs
as_dict() dict

Converts the attempt to a dictionary.

property latest_prompts
property outputs
outputs_for(lang) List[str]

outputs for a known language

When “*” or None are passed returns the original model output

property prompt
prompt_for(lang) str

prompt for a known language

When “*” or None are passed returns the prompt passed to the model