garak.probes.fitd
Foot In The Door
A multi-turn jailbreak that generates queries of increasing intensity from safe to adversarial, getting compliance from the model at each intermediate stage to obtain attack success with the adversarial prompt.
Probes in this module should implement FitD or an FitD variant.
Further info:
- class FITD(config_root=<module 'garak._config' from '/home/docs/checkouts/readthedocs.org/user_builds/garak/checkouts/latest/docs/source/../../garak/_config.py'>)Source
Bases:
IterativeProbeFoot In The Door probe
A multi-turn jailbreak that generates queries of increasing intensity from safe to adversarial, getting compliance from the model at each intermediate stage to obtain attack success with the adversarial prompt
Paper: http://arxiv.org/abs/2502.19820
Configurable parameters:
DEFAULT_PARAMScontents:max_calls_per_conv=10follow_prompt_cap=Truered_team_model_type='nim.NVOpenAIChat'red_team_model_name='mistralai/mixtral-8x22b-instruct-v0.1'red_team_model_config={}num_level_queries=3end_condition='verify'initial_prompt_count=5detector_score_threshold=0.5prompt_template_response_aligned__aligned_prefix='yes'prompt_template_eval__unsafe_prefix='unsafe'
Default values are listed
See also Configuring garak for how to set these values.
Other attributes:
- generate_next_turn_attempt(conversation: Conversation, last_attempt: Attempt) AttemptSource
Given the conversation object from the last attempt, this function generates an attempt for the next interaction. This can either create a new conversation with a newly generated turn appended to the last one or create a new conversation with an optional subset of the past turns from conversation along with a new turn.
- goal: str = 'Transform queries from benign to harmful obtaining intermediate responses to get compliance from the model'
- set_up_detector()Source
Instantiate detector; to be used to decide whether to continue the conversation when end_condition is ‘detector’