A Bidirectional LLM Firewall: Architecture, Failure Modes, and Evaluation Results

Over the past few months I have been building and evaluating a stateful, bidirectional security layer that sits between clients and LLM APIs and enforces defense-in-depth in both directions: input → LLM and LLM → output.

This is not a prompt-template guardrail system.
It’s a full middleware with deterministic layers, semantic components, caching, and a formal threat model.

I’m sharing details here because many teams seem to be facing similar issues (prompt injection, tool abuse, hallucination safety), and I would appreciate peer feedback from engineers who operate LLMs in production.

1. Architecture Overview

Inbound (Human → LLM)

  • Normalization Layer (sketched after this list)

    • NFKC/Homoglyph normalization

    • Recursive Base64/URL decoding (max depth = 3)

    • Stripping of zero-width characters and bidi override controls

  • PatternGate (Regex Hardening)

    • 40+ deterministic detectors across 13 attack families

    • Used as the “first-hit layer” for known jailbreak primitives

  • VectorGuard + CUSUM Drift Detector

    • Embedding-based anomaly scoring

    • Sequential CUSUM to detect oscillating attacks

    • Protects against payload variants that bypass regex

  • Kids Policy / Context Classifier

    • Optional mode

    • Classifies fiction vs. real-world risk domains

    • Used to block high-risk contexts even when phrased innocently
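
To make the Normalization Layer and the CUSUM detector concrete, here is a minimal sketch, assuming Python, a tiny illustrative homoglyph table, and placeholder thresholds (the real mappings, depth limit, and CUSUM parameters are configuration, not fixed values):

```python
import base64
import unicodedata
from urllib.parse import unquote

# Illustrative homoglyph map (Cyrillic look-alikes); a production table is far larger.
HOMOGLYPHS = {"а": "a", "е": "e", "о": "o", "ѕ": "s"}
# Zero-width characters plus bidi embedding/override/isolate controls.
STRIP_CODEPOINTS = {0x200B, 0x200C, 0x200D, 0xFEFF,
                    0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
                    0x2066, 0x2067, 0x2068, 0x2069}

def normalize(text: str, max_depth: int = 3) -> str:
    """NFKC + homoglyph folding + zero-width/bidi stripping + bounded recursive decoding."""
    for _ in range(max_depth):
        text = unicodedata.normalize("NFKC", text)
        text = "".join(HOMOGLYPHS.get(ch, ch) for ch in text)
        text = "".join(ch for ch in text if ord(ch) not in STRIP_CODEPOINTS)
        decoded = _try_base64(unquote(text))   # URL decoding, then best-effort Base64
        if decoded == text:                    # fixed point reached, stop early
            break
        text = decoded
    return text

def _try_base64(text: str) -> str:
    candidate = text.strip()
    if len(candidate) < 8 or len(candidate) % 4 != 0:
        return text
    try:
        return base64.b64decode(candidate, validate=True).decode("utf-8")
    except ValueError:
        return text

class CusumDrift:
    """One-sided CUSUM over per-request anomaly scores: S_t = max(0, S_{t-1} + x_t - k).
    Alarms once slow, oscillating attacks push the cumulative sum past h."""

    def __init__(self, k: float = 0.05, h: float = 1.0):
        self.k, self.h, self.s = k, h, 0.0

    def update(self, anomaly_score: float) -> bool:
        self.s = max(0.0, self.s + anomaly_score - self.k)
        return self.s > self.h   # True = drift alarm; the caller decides how to escalate
```

PatternGate and VectorGuard then run on the canonical text, so the regexes and embeddings see one representation instead of many encodings of it.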

Outbound (LLM → User)

  • Strict JSON Decoder

    • Rejects duplicate keys, unsafe structures, and parser differentials (sketched after this list)

    • Required for safe tool-calling / autonomous agents

  • ToolGuard

    • Detects and blocks attempts to trigger harmful tool calls

    • Works via pattern + semantic analysis

  • Truth Preservation Layer

    • Lightweight fact-checker against a canonical knowledge base

    • Flags high-risk hallucinations (medicine, security, chemistry)
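
A minimal sketch of the duplicate-key rejection at the core of the Strict JSON Decoder, using only the standard library; the size limit is an illustrative knob, not a fixed part of the design:

```python
import json

def _no_duplicate_keys(pairs):
    """object_pairs_hook that fails closed on duplicate keys instead of silently keeping the last value."""
    obj = {}
    for key, value in pairs:
        if key in obj:
            raise ValueError(f"duplicate key rejected: {key!r}")
        obj[key] = value
    return obj

def strict_loads(payload: str, max_bytes: int = 64_000):
    """Single strict parse of an LLM tool-call payload; using exactly one parser also avoids parser differentials."""
    if len(payload.encode("utf-8")) > max_bytes:
        raise ValueError("payload exceeds size limit")
    return json.loads(payload, object_pairs_hook=_no_duplicate_keys)

# strict_loads('{"tool": "search", "tool": "shell"}') raises instead of quietly keeping "shell".
```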

2. Decision Cache (Exact / Semantic / Hybrid)

A key performance component is a hierarchical decision cache:

  • Exact mode = hash-based lookup

  • Semantic mode = embedding similarity + risk tolerance

  • Hybrid mode = exact first, semantic fallback

In real workloads this cuts evaluation latency by 40–80%, depending on prompt diversity.
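
A sketch of the hybrid mode, assuming a caller-supplied embedding function and cosine similarity as the risk-tolerance knob; the names and the 0.93 default are illustrative:

```python
import hashlib
import numpy as np

class HybridDecisionCache:
    """Exact hash lookup first, embedding-similarity fallback second."""

    def __init__(self, embed_fn, sim_threshold: float = 0.93):
        self.embed_fn = embed_fn            # caller-supplied embedding function (assumption)
        self.sim_threshold = sim_threshold  # risk tolerance: higher = fewer semantic hits
        self.exact = {}                     # sha256(prompt) -> cached decision
        self.vectors = []                   # list of (unit-norm embedding, decision)

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str):
        hit = self.exact.get(self._key(prompt))
        if hit is not None or not self.vectors:
            return hit
        q = self.embed_fn(prompt)
        q = q / np.linalg.norm(q)
        sim, decision = max(((float(v @ q), d) for v, d in self.vectors), key=lambda t: t[0])
        return decision if sim >= self.sim_threshold else None

    def put(self, prompt: str, decision) -> None:
        self.exact[self._key(prompt)] = decision
        v = self.embed_fn(prompt)
        self.vectors.append((v / np.linalg.norm(v), decision))
```

A linear scan is fine for a sketch; a production cache would use an ANN index plus TTLs so stale verdicts expire.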

3. Evaluation Results (Internal Suite)

I tested the firewall against a synthetic adversarial suite (BABEL, NEMESIS, ORPHEUS, CMD-INJ).
This suite covers ~50 structured jailbreak families.

Results:

  • 0 / 50 bypasses on the current build

  • ~20–25% false positive rate on the Kids Policy (work in progress)

  • P99 latency: < 200 ms per request

  • Memory footprint: ~1.3 GB (mostly due to embedding model)

Important note:
These results apply only to the internal suite.
They do not imply general robustness, and I’m looking for external red-teaming.

4. Failure Modes Identified

The most problematic real-world cases so far:

  • Unicode abuse beyond standard homoglyph sets

  • “Role delegation” attacks that look benign until tool-level execution

  • Fictional prompts that drift into real harmful operational space

  • LLM hallucinations that fabricate APIs, functions, or credentials

  • Semantic near-misses, where regex detectors miss and the embedding-level signal is too ambiguous to block on

These informed several redesigns (especially the outbound layers).

5. Open Questions (Where I’d Appreciate Feedback)

  1. Best practices for low-FPR context classifiers in safety-critical tasks

  2. Efficient ways to detect tool-abuse intent when the LLM generates partial code

  3. Open-source adversarial suites larger than my internal one

  4. Integration patterns with LangChain / vLLM / FastAPI that don’t add excessive overhead

  5. Your experience with caching trade-offs under highly variable prompt workloads

If you operate LLMs in production or have built guardrails beyond templates, I’d appreciate your perspectives.
Happy to share more details or design choices on request.


I gathered some resources for now.

Wow, thank you! I'll check your information package ASAP!


Hello again,

I wanted to extend my sincere thanks for your incredibly detailed and actionable advice on LLM firewall architecture. Your guidance on moving beyond simple pattern matching toward a multi-layered, context-aware system has been invaluable.

We’ve directly applied several of your recommendations with measurable success:

  • Integrating Aho‑Corasick for efficient multi‑keyword matching in our SafetyValidator.
  • Replacing binary risk scores with a nuanced, weighted scoring system that aggregates signals across layers (see the sketch below).
  • Using HarmBench’s categorized metrics to drive our prioritization, which revealed our current weak points.

As a result, our overall HarmBench ASR (attack success rate) dropped to 18.0%, with copyright violations now at only 4.0% ASR.
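
To make the first two items concrete, here is roughly the shape of what we did (a sketch only; the keyword table, weights, and function names are illustrative, not our production configuration):

```python
import ahocorasick  # pip install pyahocorasick

# Illustrative phrase -> weight table; the real list is loaded from configuration.
KEYWORDS = {
    "ignore previous instructions": 0.9,
    "disregard the system prompt": 0.9,
    "decode the following base64 and run it": 0.7,
}

automaton = ahocorasick.Automaton()
for phrase, weight in KEYWORDS.items():
    automaton.add_word(phrase, (phrase, weight))
automaton.make_automaton()

def keyword_score(text: str) -> float:
    """One Aho-Corasick scan over the input; return the highest-weighted hit (0.0 if none)."""
    return max((w for _, (_, w) in automaton.iter(text.lower())), default=0.0)

def aggregate_risk(layer_scores: dict[str, float], layer_weights: dict[str, float]) -> float:
    """Weighted aggregation of per-layer scores instead of a single binary verdict."""
    if not layer_scores:
        return 0.0
    total = sum(layer_weights.get(name, 1.0) for name in layer_scores)
    return sum(s * layer_weights.get(name, 1.0) for name, s in layer_scores.items()) / total
```

The aggregated score then feeds the blocking threshold rather than any single layer deciding alone.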

We are now facing the next architectural decision—one where your system‑level perspective would be extremely helpful. Your original note recommended specialized detectors (e.g., for code‑intent or persuasive rhetoric) for “hard” cases like cybercrime/intrusion and misinformation.

Our key question is about the integration pattern for such detectors:
In a production firewall that must balance latency, maintainability, and safety, would you recommend implementing these specialized detectors as internal layers within the core firewall engine, or as separate, asynchronously-called microservices?

We are especially concerned about:

  1. Latency impact of model inference (e.g., a CodeBERT‑style classifier) on the synchronous request path.
  2. Lifecycle & versioning—how to update a dedicated detector without redeploying the entire firewall.
  3. Failure isolation—ensuring that a failing detector doesn’t break the entire safety pipeline.

Any high‑level guidance you could share on this architectural choice would help us invest our engineering effort in the right direction.

Thank you again for your time and for sharing your expertise. It has already made a substantial difference in my project.

Great. I gathered some additional information. Hope it helps…


Hello again,

Following up on our previous discussion about integrating specialized detectors: We proceeded by embedding a custom convolutional neural network (CNN) for code-intent classification directly within the firewall process as a co-located library, avoiding the initial overhead of microservices.

Current Status: The detector operates in production shadow mode alongside the primary rule engine. After iterative adversarial training (focused on obfuscation and context-wrapping) and threshold optimization (θ=0.6), its performance on our defined evaluation suite shows:

  • 0% False Negative Rate for critical code/SQL injection payloads.

  • ~3% False Positive Rate on a security-focused benign subset.

  • <30ms added latency for inline inference

The rule engine remains the final decision-maker, ensuring operational stability.
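
In code terms, the shadow-mode wiring is roughly the following; the rule_engine and cnn_detector interfaces are assumptions for the sketch:

```python
import logging
import time

logger = logging.getLogger("shadow_detector")

def guarded_decision(prompt: str, rule_engine, cnn_detector, theta: float = 0.6):
    """Shadow-mode pattern: the rule engine decides, the co-located CNN only scores and logs."""
    verdict = rule_engine.evaluate(prompt)            # authoritative path (assumed interface)
    start = time.perf_counter()
    try:
        score = cnn_detector.score(prompt)            # co-located inference, no network hop
        logger.info("shadow code-intent score=%.3f would_block=%s latency_ms=%.1f",
                    score, score >= theta, (time.perf_counter() - start) * 1000)
    except Exception:
        # A failing shadow detector must never affect the live verdict.
        logger.exception("shadow detector failed; rule-engine verdict unaffected")
    return verdict
```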

This internal hybrid pattern validated the core concept for our first detector. We are now planning to scale the architecture to incorporate additional specialized detectors (e.g., for persuasion, misinformation).

Based on your experience evolving such a system:

  1. Orchestration Pattern: For a multi-detector system, did you find a hierarchical router (dispatching to specific detectors) or a sequential pipeline (where all relevant detectors evaluate the prompt) to be more maintainable and performant in production?

  2. Continual Learning: For detectors that must adapt to new tactics, what has been a reliable operational pattern to retrain and safely deploy updated models without causing service disruption or regression in core safety metrics?

Your insights on scaling this architecture would be invaluable.


Follow-up Questions - Multi-Detector Architecture

Hello again,

Following up on our previous discussion about integrating specialized detectors: We proceeded by embedding a custom convolutional neural network (CNN) for code-intent classification directly within the firewall process as a co-located library, avoiding the initial overhead of microservices.

Current Status: The detector operates in production shadow mode alongside the primary rule engine. After iterative adversarial training (focused on obfuscation and context-wrapping) and threshold optimization (θ=0.6), its performance on our defined evaluation suite shows:

  • 0% False Positive Rate (0/1000 benign samples across 9 categories)

  • 95.7% Attack Detection Rate (557/582 adversarial samples)

    • Mathematical notation camouflage: 100% blocked (300/300)

    • Multilingual code-switching: 91.1% blocked (257/282, 25 bypasses)

  • <30ms added latency for inline inference

The rule engine remains the final decision-maker, ensuring operational stability.

This internal hybrid pattern validated the core concept for our first detector. We are now planning to scale the architecture to incorporate additional specialized detectors (e.g., for persuasion, misinformation).

Based on your experience evolving such a system:

**Orchestration Pattern:** For a multi-detector system, did you find a hierarchical router (dispatching to specific detectors) or a sequential pipeline (where all relevant detectors evaluate the prompt) to be more maintainable and performant in production?

**Continual Learning:** For detectors that must adapt to new tactics, what has been a reliable operational pattern to retrain and safely deploy updated models without causing service disruption or regression in core safety metrics?

**Critical Follow-up Questions:**

**1. Shadow Mode to Production Transition:**

We’re currently operating in shadow mode with the rule engine as fallback. What has been your experience transitioning detectors from shadow mode to active production? Are there specific metrics thresholds (e.g., FPR <1%, FNR <5%) or validation periods (e.g., 2-4 weeks) that you found reliable before making the switch? How do you handle the transition without disrupting existing safety guarantees?

**2. Handling Known Bypasses:**

We have 25 multilingual attacks bypassing detection (8.9% of the multilingual test suite) due to code embedded in string literals/comments that gets filtered out by preprocessing. Should we address these before production deployment, or is it acceptable to deploy with known limitations if they’re well-documented and monitored? What’s your threshold for “acceptable risk” when deploying security systems?

**3. Production FPR/FNR Monitoring:**

What monitoring infrastructure have you found most effective for tracking FPR/FNR in production? Do you use automated sampling, manual review queues, or a combination? How do you distinguish between legitimate false positives (user complaints) and actual system degradation? Any tools or frameworks you’d recommend?

**4. Sequential Pipeline at Scale:**

If we start with a sequential pipeline for 2-3 detectors, at what point does latency become a bottleneck? Have you found a practical limit (e.g., 3-4 detectors, 100ms total) before needing to switch to a router pattern? What were the key indicators that triggered your transition?

**5. Retraining Workflow:**

For establishing a retraining workflow, what’s your recommended validation process? We’re thinking: automated test suite (1,000+ samples), shadow mode deployment, regression testing (FPR/FNR thresholds), then gradual rollout. Is this reasonable, or are there critical steps we’re missing? How do you handle model versioning and rollback?
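
A minimal sketch of the regression-gate step we have in mind; the 1% / 5% thresholds are placeholders for discussion, not targets we have settled on:

```python
def promotion_gate(eval_results, max_fpr: float = 0.01, max_fnr: float = 0.05):
    """Only promote a retrained model if it clears both thresholds on the frozen eval suite.
    eval_results: iterable of {"malicious": bool, "blocked": bool} records."""
    records = list(eval_results)
    benign = [r for r in records if not r["malicious"]]
    attacks = [r for r in records if r["malicious"]]
    fpr = sum(r["blocked"] for r in benign) / len(benign) if benign else 0.0
    fnr = sum(not r["blocked"] for r in attacks) / len(attacks) if attacks else 0.0
    return fpr <= max_fpr and fnr <= max_fnr, {"fpr": fpr, "fnr": fnr}
```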

**6. Real-World Validation:**

Our test corpus is programmatically generated. How critical is it to validate with real-world production queries before scaling? Should we deploy the first detector to production first to collect real data, or can we proceed with synthetic test suites for additional detectors?

**7. Co-location Limits:**

With our current co-location approach adding <30ms per detector, how many detectors have you successfully co-located before hitting memory or latency constraints? At what point did you need to consider microservices or other architectural changes?

Your insights on these practical scaling challenges would be invaluable as we move toward a multi-detector system. Thank you! 🙂


I generated the continuation.


STATUS: Hybrid system with parallel execution of the Code-Intent CNN (100% accuracy) and the Content-Safety Transformer (100% accuracy). The rule engine remains the final decision layer. Overall attack detection: 100% on the core test set (101/101). False positive rate: 0% (0/1000 benign samples). Latency: <35ms for two parallel detectors.

Fixed the 25 multilingual bypasses via preprocessing improvements. Identified a new attack vector: poetic/metaphorical attacks (current detection: 83%, 20/24). Online learning active with 92 feedback samples. Conservative OR-logic: any single detector blocking results in an overall block.
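
For reference, the parallel OR-logic is essentially the following (the .blocks() detector interface is a placeholder name):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_block_decision(prompt: str, detectors) -> bool:
    """Run all detectors concurrently; conservative OR-logic: any single block => overall block."""
    with ThreadPoolExecutor(max_workers=max(1, len(detectors))) as pool:
        verdicts = list(pool.map(lambda d: d.blocks(prompt), detectors))
    return any(verdicts)
```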

Next: shadow-mode validation, router implementation for a third detector, and poetic-attack mitigation via metaphor-detection patterns. Thank you for your valuable help! 🙂


Current Status

The system has successfully migrated to a hexagonal service architecture. Three independent detector services (Code Intent, Persuasion, Content Safety) are operational, each following a consistent pattern with pure domain logic and standardized APIs. Core performance metrics are established: 100% attack detection on the primary test set (101/101) with a 3.6% false positive rate on 1,000 benign samples. Pipeline latency remains under 50ms per service. A feedback loop for online learning is active.

Planned Implementation

The next development phase will focus on implementing the intelligent orchestration layer. This includes building a hierarchical router to dynamically select detectors based on risk, context, and latency budget, thereby moving away from a fixed sequential pipeline. We will also formalize the full MLOps lifecycle with automated regression testing, shadow/canary deployment protocols, and systematic retraining triggers. Ongoing work includes improving detection of poetic/metaphorical attack vectors and establishing production monitoring for continuous validation of FPR/FNR.
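
The router we have in mind is roughly this shape; every attribute and the triage interface below are assumptions for the sketch, not an existing API:

```python
def select_detectors(prompt: str, context: dict, latency_budget_ms: float, registry: dict) -> list:
    """Hierarchical-router sketch: a cheap triage pass scores risk domains, then detectors
    are picked in descending-risk order until the latency budget is spent."""
    triage = registry["triage"]                      # cheap heuristic/pattern classifier (assumed)
    signals = triage.classify(prompt, context)       # e.g. {"code_intent": 0.8, "persuasion": 0.1}
    selected, remaining = [], latency_budget_ms
    for domain, score in sorted(signals.items(), key=lambda kv: kv[1], reverse=True):
        detector = registry.get(domain)
        if detector is None or score < detector.activation_threshold:
            continue
        if detector.expected_latency_ms > remaining:
            continue                                 # skip; a cheaper detector may still fit
        selected.append(detector)
        remaining -= detector.expected_latency_ms
    return selected
```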


Implementation of the intelligent orchestration layer is complete, including real-time text complexity analysis and dynamic policy evaluation. A hybrid repository (Redis and PostgreSQL) has been added for feedback storage, along with an automatic policy optimization system. Monitoring now includes distributed tracing and metric collection.

Identified technical debt: use of eval() without sandboxing in policy conditions, absence of circuit breakers for detector dependencies, synchronous learning optimization, and lack of database failover.

Next development items: replace eval() with an AST-based parser, implement exponential backoff for detector failures, move learning optimization to a background worker, and add database connection pooling and replication.
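
A minimal sketch of the planned eval() replacement, assuming policy conditions stay within names, literals, comparisons, boolean logic, and basic arithmetic:

```python
import ast
import operator

_BINOPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}
_CMPOPS = {ast.Gt: operator.gt, ast.GtE: operator.ge, ast.Lt: operator.lt,
           ast.LtE: operator.le, ast.Eq: operator.eq, ast.NotEq: operator.ne}

def eval_policy(expr: str, variables: dict) -> bool:
    """Evaluate a policy condition without eval(); any node type outside the whitelist raises."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.Name):
            return variables[node.id]
        if isinstance(node, ast.BoolOp):
            values = [walk(v) for v in node.values]
            return all(values) if isinstance(node.op, ast.And) else any(values)
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
            return not walk(node.operand)
        if isinstance(node, ast.BinOp) and type(node.op) in _BINOPS:
            return _BINOPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Compare) and all(type(op) in _CMPOPS for op in node.ops):
            left, ok = walk(node.left), True
            for op, comparator in zip(node.ops, node.comparators):
                right = walk(comparator)
                ok = ok and _CMPOPS[type(op)](left, right)
                left = right
            return ok
        raise ValueError(f"disallowed policy syntax: {type(node).__name__}")
    return bool(walk(ast.parse(expr, mode="eval")))

# eval_policy('risk_score > 0.7 and source_tool == "shell"',
#             {"risk_score": 0.82, "source_tool": "shell"})  -> True
```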


The security subsystem has been fully optimized and validated. All 21 tested attack vectors are now blocked (100% detection rate) while maintaining a 3.6% false positive rate on benign samples. The enhancements include 8 LDAP injection patterns, increased Unicode attack detection weights, and size-based attack recognition for inputs exceeding 10,000 characters. The multi-layer detection pipeline now processes 45+ security patterns with under 1ms performance overhead. The system employs dynamic threshold adjustments based on context (source tool, user risk tier) and integrates structural analysis for anomaly detection. All components are ready with complete monitoring, tracing, and alerting capabilities.
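
The context-dependent threshold adjustment is conceptually simple; a sketch with illustrative adjustment values, not the ones we actually deployed:

```python
def effective_threshold(base: float, source_tool: str, user_risk_tier: str) -> float:
    """Lower the blocking threshold (i.e. be stricter) for risky source tools and untrusted user tiers."""
    tool_adjust = {"shell": -0.15, "sql": -0.10, "http": -0.05}            # illustrative values
    tier_adjust = {"trusted": +0.05, "standard": 0.0, "restricted": -0.10}
    threshold = base + tool_adjust.get(source_tool, 0.0) + tier_adjust.get(user_risk_tier, 0.0)
    return min(0.95, max(0.05, threshold))
```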

THANK YOU FOR YOUR PRECIOUS HELP!

  • How do we architect a Zero Trust framework where continuous authentication replaces perimeter-based security?

  • Can we design detectors resilient to adversarial machine learning attacks that deliberately evade our pattern recognition?

  • What models are needed to quantify and enforce data privacy in training pipelines against model inversion or membership inference attacks?

  • How can we shift detection left to identify and block malicious prompts at the point of AI model interaction, not just in output?

  • What methodologies prove the robustness of AI-generated code against embedded backdoors or logic bombs?

  • How do we build a unified security data lake that correlates events across API, identity, and AI model layers for causal attack analysis?

  • What is the operational framework for implementing and testing post-quantum cryptographic algorithms in live AI security systems?

  • Can we develop formal verification techniques to mathematically prove the safety properties of autonomous security agents?

  • How do we create detection for AI-powered disinformation campaigns that manipulate model outputs through prompt poisoning or data drift?

  • What systems detect and manage “shadow AI”—unauthorized models or data pipelines operating outside governance?

Hmm… Like this?

Status Update: The adversarial robustness pipeline has been fully implemented and validated. The integrated adversarial detection layer achieves 100% detection rate on known attack patterns with 0% false positives in operational testing, adding under 5ms latency. A complete adversarial training pipeline was built and used to produce an enhanced model (V1), which reduces the false positive rate by 32.98% while maintaining a 94.51% detection rate on novel threats. This model is now backed by a validated monitoring and rollback framework for safe deployment. The system’s multi-layer defense now combines static pattern matching, structural anomaly analysis, and adaptive adversarial detection.
