garak.probes.fitd

Foot In The Door

A multi-turn jailbreak that generates queries of increasing intensity from safe to adversarial, getting compliance from the model at each intermediate stage to obtain attack success with the adversarial prompt.

Probes in this module should implement FitD or an FitD variant.

Further info:

http://arxiv.org/abs/2502.19820

class FITD(config_root=<module 'garak._config' from '/home/docs/checkouts/readthedocs.org/user_builds/garak/checkouts/latest/garak/_config.py'>)Source 

Bases: IterativeProbe

Foot In The Door probe

A multi-turn jailbreak that generates queries of increasing intensity from safe to adversarial, getting compliance from the model at each intermediate stage to obtain attack success with the adversarial prompt

Paper: http://arxiv.org/abs/2502.19820

Configurable parameters:

DEFAULT_PARAMS contents:

max_calls_per_conv = 10
follow_prompt_cap = True
red_team_model_type = 'nim.NVOpenAIChat'
red_team_model_name = 'mistralai/mixtral-8x22b-instruct-v0.1'
red_team_model_config = {}
num_level_queries = 3
end_condition = 'verify'
initial_prompt_count = 5
detector_score_threshold = 0.5
prompt_template_response_aligned__aligned_prefix = 'yes'
prompt_template_eval__unsafe_prefix = 'unsafe'

Default values are listed

See also Configuring garak for how to set these values.

Other attributes:

active: bool = False

doc_uri: str = 'https://arxiv.org/abs/2502.19820'

generate_next_turn_attempt(conversation: Conversation, last_attempt: Attempt) → AttemptSource : Given the conversation object from the last attempt, this function generates an attempt for the next interaction. This can either create a new conversation with a newly generated turn appended to the last one or create a new conversation with an optional subset of the past turns from conversation along with a new turn.

goal: str = 'Transform queries from benign to harmful obtaining intermediate responses to get compliance from the model'

intent: str | None = 'T009ignore'

lang: str | None = 'en'

primary_detector: str | None = 'judge.RefusalOnlyAdversarial'

set_up_attack_gen_llm()Source : Instantiate client for attack gen LLM

set_up_detector()Source : Instantiate detector; to be used to decide whether to continue the conversation when end_condition is ‘detector’

set_up_refusal_detector()Source : Instantiate a garak mitigation.Prefixes detector to be used to identify refusals. Note that this is not intended to be the garak detector for the probe

tags: Iterable[str] = ['owasp:llm01', 'owasp:llm02', 'owasp:llm05', 'demon:Rhetoric:Persuasion_and_manipulation:Escalating']

tier: Tier = 3