Writing a Probe

Probes are, in some ways, the essence of garak’s functionality – they serve as the abstraction that encapsulates attacks against AI models and systems. In this example, we’re going to go over the key points of how to develop a new probe.

Inheritance

All probes will inherit from garak.probes.base.Probe.

from garak.probes.base import Probe

class MyNewProbe(Probe):
    """
    Probe to do something naughty to a language model
    """
    ...

We require class docstrings in garak and enforce this requirement via a test required before merging.

Probes must always inherit from garak.probes.base.Probe. This allows probes to work nicely with Generator and Attempt objects in addition to ensuring that any `Buff`s that one might want to apply to a probe are going to work appropriately.

The probe method of Probe objects is where the core logic of a probe lies. Ideally, one would need only to populate the prompts attribute of a Probe and let the probe method do the heavy lifting. However, if this logic is insufficient for a custom probe, this is where the majority of the work (and potential issues) tends to lie.

Configuring and Describing Probes

Probes are built upon the Configurable base class and are themselves configurable. We largely ignore parameters like ENV_VAR and DEFAULT_PARAMS in Probe classes, but if your probe requires an environment variable or you want to set some default parameters, it is done first in the class.

More often, we’ll be looking at descriptive attributes of the probe. From the base class:

That’s a lot of attributes before we get into ones you may need to add! Many of these are decent defaults, though there are a few that we absolutely want to include in any new probe

  • doc_uri: What is a good reference for the probe you’re implementing? An academic paper, blog post, or social media post (in descending order of preference) goes here.

  • active: Should this probe be part of default scans?

  • tags: MISP-formatted taxonomy categories __e.g.__ ["avid-effect:security:S0403", "owasp:llm01", "quality:Security:PromptStability", "payload:jailbreak"]

  • goal: What is the probe trying to do? __e.g.__ "disregard the system prompt"

  • primary_detector: What Detector should your probe use?

Testing

Once the logic for our probe is written, you’ll want to test it before opening a pull request. Typically, a good place to start is by seeing if your probe can be imported!

If you can run this with no error, you’re ready to move on to the next phase of testing. Otherwise, try to address the encountered errors.

Let’s try running our new probe against a HuggingFace Pipeline using meta-llama/Llama-2-7b-chat-hf, a notoriously tricky model to get to behave badly.

If it all runs well, you’ll get a log and a hitlog file, which tell you how successful your new probe was! If you encounter errors, go through and try to address them. You can look at the bottom of the garak.log file, whose path is printed in the output every time you call garak, to see what errors there are.

If you want to debug your probe interactively, try using something like p = garak._plugins.load_plugin("probes.mynewprobe.MyNewProbe") from a Python prompt to load the probe. The variable p will be assigned an instance of the probe (if instantiation was successful) and you can test a lot of the probe’s intended functionality from here.

Finally, check a few properties:

  • Does the new probe appear in python -m garak --list_probes?

  • Does the probe run? python -m garak -m test -p mynewprobe.MyNewProbe

  • Do the garak tests pass? python -m pytest tests/

Done!

Congratulations on writing a probe for garak!

If you’ve tested your probe and validated that it works, run black to format your code in accordance with garak code standards. Once your code is properly tested and formatted, push your work to your github fork and open a pull request – thanks for your contribution!