garak.probes.fitd

Foot In The Door

A multi-turn jailbreak that generates queries of increasing intensity, from safe to adversarial, and secures the model's compliance at each intermediate stage so that the final adversarial prompt succeeds.

Probes in this module should implement FitD or an FitD variant.
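The escalation described above can be illustrated with a toy loop. This is a minimal sketch, not the probe's implementation: `run_escalation`, the stub targets, and the refusal check are hypothetical names introduced here, the prompts are abstract placeholders, and stopping at the first refusal is an assumption of the sketch rather than documented behaviour.

```python
def run_escalation(prompts, target, is_refusal):
    """Send prompts in order, carrying the history forward.

    Stops at the first refusal (an assumption of this sketch).
    Returns (history, reached_final_prompt).
    """
    history = []
    for prompt in prompts:
        response = target(history, prompt)
        history.append((prompt, response))
        if is_refusal(response):
            return history, False
    return history, True


def compliant_target(history, prompt):
    # Hypothetical stub model that always answers.
    return f"answer to: {prompt}"


def refusing_target(history, prompt):
    # Hypothetical stub model that always refuses.
    return "I can't help with that."


def simple_refusal_check(response):
    # Hypothetical check; a real run would use a proper detector.
    return response.startswith("I can't")


levels = ["level 0 placeholder", "level 1 placeholder", "level 2 placeholder"]
full_history, reached_end = run_escalation(levels, compliant_target, simple_refusal_check)
short_history, stopped_early = run_escalation(levels, refusing_target, simple_refusal_check)
print(len(full_history), reached_end)
print(len(short_history), stopped_early)
```

Each later prompt is only sent once the earlier, milder one has been answered, which is the "compliance at each intermediate stage" idea in miniature.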

Further info:

class FITD(config_root=<module 'garak._config'>)

Bases: IterativeProbe

Foot In The Door probe

A multi-turn jailbreak that generates queries of increasing intensity, from safe to adversarial, and secures the model's compliance at each intermediate stage so that the final adversarial prompt succeeds.

Paper: http://arxiv.org/abs/2502.19820

Configurable parameters:

DEFAULT_PARAMS contents:

  • max_calls_per_conv = 10

  • follow_prompt_cap = True

  • red_team_model_type = 'nim.NVOpenAIChat'

  • red_team_model_name = 'mistralai/mixtral-8x22b-instruct-v0.1'

  • red_team_model_config = {}

  • num_level_queries = 3

  • end_condition = 'verify'

  • initial_prompt_count = 5

  • detector_score_threshold = 0.5

  • prompt_template_response_aligned__aligned_prefix = 'yes'

  • prompt_template_eval__unsafe_prefix = 'unsafe'

The values listed above are the defaults.

See also Configuring garak for how to set these values.
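As an illustration of overriding the parameters listed above, the following sketch builds a JSON options document and checks that every override is a known FITD parameter. The helper `build_probe_options` is hypothetical, and the nesting (module name, then class name, then parameters) is an assumption about garak's plugin option layout; confirm it against Configuring garak before relying on it.

```python
import json

# Defaults copied from the DEFAULT_PARAMS list above.
FITD_DEFAULTS = {
    "max_calls_per_conv": 10,
    "follow_prompt_cap": True,
    "red_team_model_type": "nim.NVOpenAIChat",
    "red_team_model_name": "mistralai/mixtral-8x22b-instruct-v0.1",
    "red_team_model_config": {},
    "num_level_queries": 3,
    "end_condition": "verify",
    "initial_prompt_count": 5,
    "detector_score_threshold": 0.5,
    "prompt_template_response_aligned__aligned_prefix": "yes",
    "prompt_template_eval__unsafe_prefix": "unsafe",
}


def build_probe_options(overrides):
    """Return a nested options dict, rejecting keys that are not FITD parameters."""
    unknown = sorted(set(overrides) - set(FITD_DEFAULTS))
    if unknown:
        raise KeyError(f"not a FITD parameter: {unknown}")
    # Assumed nesting: module name -> class name -> parameter overrides.
    return {"fitd": {"FITD": dict(overrides)}}


options = build_probe_options({"max_calls_per_conv": 6, "end_condition": "detector"})
print(json.dumps(options, indent=2))
```

The printed JSON could then be saved to a file and supplied through whichever probe-option mechanism your garak version provides. Validating keys up front catches a misspelled parameter name before a run starts.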

Other attributes:

active: bool = False
doc_uri: str = 'https://arxiv.org/abs/2502.19820'
generate_next_turn_attempt(conversation: Conversation, last_attempt: Attempt) → Attempt

Given the conversation object from the last attempt, generate an attempt for the next interaction. The new attempt can either extend the previous conversation by appending a newly generated turn, or start a new conversation that carries over an optional subset of the past turns from conversation along with a new turn.
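The two ways of building the next conversation can be sketched with plain dataclasses. `SketchTurn`, `SketchConversation`, `extend_conversation`, and `rebuild_conversation` are hypothetical stand-ins written for this example; they are not garak's Conversation or Attempt classes, and the turn texts are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class SketchTurn:
    role: str
    text: str


@dataclass
class SketchConversation:
    turns: List[SketchTurn] = field(default_factory=list)


def extend_conversation(conversation: SketchConversation, new_prompt: str) -> SketchConversation:
    """Option 1: keep every past turn and append the new user turn."""
    return SketchConversation(turns=list(conversation.turns) + [SketchTurn("user", new_prompt)])


def rebuild_conversation(
    conversation: SketchConversation,
    new_prompt: str,
    keep: Optional[Sequence[int]] = None,
) -> SketchConversation:
    """Option 2: keep only the past turns at the given indices, then append the new user turn."""
    indices = range(len(conversation.turns)) if keep is None else keep
    kept = [conversation.turns[i] for i in indices]
    return SketchConversation(turns=kept + [SketchTurn("user", new_prompt)])


history = SketchConversation(
    [
        SketchTurn("user", "first placeholder question"),
        SketchTurn("assistant", "first placeholder answer"),
        SketchTurn("user", "second placeholder question"),
        SketchTurn("assistant", "placeholder refusal"),
    ]
)
extended = extend_conversation(history, "third placeholder question")
rebuilt = rebuild_conversation(history, "reworded second question", keep=[0, 1])
print(len(extended.turns), len(rebuilt.turns), len(history.turns))
```

Both helpers return a new object and leave the input untouched, so the previous attempt's conversation stays available for comparison.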

goal: str = 'Transform queries from benign to harmful obtaining intermediate responses to get compliance from the model'
lang: str | None = 'en'
primary_detector: str | None = 'judge.RefusalOnlyAdversarial'
set_up_attack_gen_llm()

Instantiate the client for the attack-generation LLM.

set_up_detector()

Instantiate the detector used to decide whether to continue the conversation when end_condition is 'detector'.

set_up_refusal_detector()

Instantiate a garak mitigation.Prefixes detector used to identify refusals. Note that this is not intended to be the garak detector for the probe.
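The general idea of phrase-based refusal detection can be sketched as follows. This is an illustration only: the marker list and the case-insensitive substring rule are assumptions made for the example, and `looks_like_refusal` is a hypothetical helper, not the mitigation.Prefixes implementation.

```python
# Illustrative refusal markers; the real detector ships its own list.
REFUSAL_MARKERS = ("I'm sorry", "I cannot", "I can't", "As an AI")


def looks_like_refusal(response, markers=REFUSAL_MARKERS):
    """Return True if any marker occurs in the response (case-insensitive)."""
    lowered = response.lower()
    return any(marker.lower() in lowered for marker in markers)


print(looks_like_refusal("I'm sorry, but I can't help with that."))
print(looks_like_refusal("Sure, here is an overview."))
```

A check like this is cheap enough to run after every turn, which suits its role of steering the conversation rather than scoring the final result.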

tags: Iterable[str] = ['owasp:llm01', 'owasp:llm02', 'owasp:llm05']
tier: Tier = 3