
Bifrost: Hybrid TEE-FHE Inference for Privacy-Preserving Transformer and LLM Serving

Chenghao Chen, Kailun Qin, Xiaolin Zhang, Chi Zhang, Dawu Gu

Jun 16, 2026 · arXiv:2606.17421v1
cs.CR
#84 of 2618 · Cryptography & Security
Tournament Score: 1567 ± 45
Win Rate: 64% (14 wins, 8 losses, 22 matches)
Rating: 5/10
Significance 5.5 · Rigor 5 · Novelty 4.5 · Clarity 7.5

Abstract

Cloud-hosted transformer and large language model (LLM) inference creates a direct confidentiality problem: user prompts may contain sensitive code, business data, personal information, or regulated documents, yet remote serving exposes intermediate state to the cloud software stack and accelerator runtime. Fully homomorphic encryption (FHE) keeps accelerator-side execution ciphertext-only, but end-to-end LLM inference remains expensive because linear layers are interleaved with non-linear, cache-state, and refresh-sensitive operators. CPU trusted execution environments (TEEs) can execute those operators natively, but a CPU TEE alone does not define how an untrusted accelerator should participate. We present Bifrost, a hybrid TEE-FHE serving architecture in which secrets are provisioned only to an attested CPU TEE, while the accelerator, device memory, driver/runtime stack, and host software remain outside the trusted computing base. Bifrost uses FHE as a secure delegation mechanism for projection and feed-forward linear layers on accelerator-backed CKKS, while non-linear operators, attention-side control logic, KV-state transitions, and decrypt-then-encrypt refresh execute inside the CPU TEE. Bifrost+ further applies a prefill/decode split: prompt-side KV state is built inside the CPU TEE, and only decode-side state enters the hybrid ciphertext path. In an estimator-style comparison matching Euston's methodology, Bifrost reduces projected latency by 9.25x on GPT-2 (1.5B) and 9.91x on LLaMA 3 (8B). In direct CKKS/FHE deployments, Bifrost+ reduces TTFT by 14.6-45.8x on GPT-2 (124M) and 15.3-53.4x on Qwen3 (0.6B). The systems lesson is selective encrypted execution: use FHE only where ciphertext-only accelerator delegation is required, and keep non-linear, refresh, and prompt-side work inside the CPU TEE.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Bifrost

1. Core Contribution

Bifrost proposes a hybrid architecture that partitions privacy-preserving LLM inference between a CPU Trusted Execution Environment (TEE) and an untrusted accelerator running Fully Homomorphic Encryption (FHE). The key insight is operator-affinity splitting: linear layers (GEMMs) are delegated to the accelerator via CKKS-encrypted computation, while non-linear operators (LayerNorm, Softmax, GELU, RoPE), ciphertext refresh, and control logic execute in plaintext inside the CPU TEE. Bifrost+ extends this with a prefill/decode (PD) split, completing prompt processing entirely within the TEE and only routing decode-phase tokens through the hybrid FHE path.
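
The placement rule described above can be sketched as a small routing function. This is only an illustration of the operator-affinity idea as the review summarizes it, not the authors' implementation: the operator labels, the `Site` enum, and the `place` function are hypothetical names introduced here.

```python
from enum import Enum


class Site(Enum):
    TEE = "cpu_tee"            # plaintext execution inside the attested CPU TEE
    ACCEL_FHE = "accel_ckks"   # ciphertext-only execution on the untrusted accelerator


# Hypothetical operator labels; the grouping follows the review's description.
LINEAR_OPS = {"qkv_proj", "out_proj", "ffn_up", "ffn_down"}
TEE_OPS = {"layernorm", "softmax", "gelu", "rope", "kv_update", "dte_refresh"}


def place(op: str, phase: str = "decode", pd_split: bool = False) -> Site:
    """Return where an operator runs under the assumed placement policy."""
    if op not in LINEAR_OPS | TEE_OPS:
        raise ValueError(f"unknown operator: {op}")
    if phase not in ("prefill", "decode"):
        raise ValueError(f"unknown phase: {phase}")
    # Prefill/decode split (Bifrost+ as described): prompt processing stays in the TEE.
    if pd_split and phase == "prefill":
        return Site.TEE
    # Otherwise linear layers are delegated under CKKS; everything else stays in the TEE.
    return Site.ACCEL_FHE if op in LINEAR_OPS else Site.TEE


if __name__ == "__main__":
    for op in ("qkv_proj", "softmax", "ffn_down"):
        print(op, place(op).value, place(op, "prefill", pd_split=True).value)
```

With `pd_split=False` the function models the base hybrid path; with `pd_split=True` only decode-phase linear layers reach the accelerator.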

The central design principle, "selective encrypted execution," is intuitive and well-motivated. Rather than forcing the entire transformer through expensive homomorphic circuits, the system exploits the observation that non-linear operators are both the most expensive and the most numerically fragile under FHE, while linear layers dominate compute and map cleanly to CKKS. The decrypt-then-encrypt (DtE) refresh, which replaces homomorphic bootstrapping, is another pragmatic optimization enabled by the TEE trust boundary.
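
To make the role of the DtE refresh concrete, the toy model below tracks a CKKS-style multiplicative level budget. It performs no real cryptography: `ToyCiphertext`, `MAX_LEVEL`, and the one-level-per-linear-layer cost are assumptions chosen for illustration, and the "decryption" is a stand-in for what would happen inside the TEE.

```python
from dataclasses import dataclass

MAX_LEVEL = 3  # hypothetical level budget of a fresh ciphertext


@dataclass
class ToyCiphertext:
    value: float  # stand-in for the encrypted payload (NOT actually encrypted)
    level: int    # remaining multiplicative levels


def he_linear(ct: ToyCiphertext, weight: float) -> ToyCiphertext:
    """Model one homomorphic multiplication, assumed to consume one level."""
    if ct.level == 0:
        raise RuntimeError("level budget exhausted; refresh required")
    return ToyCiphertext(ct.value * weight, ct.level - 1)


def dte_refresh(ct: ToyCiphertext) -> ToyCiphertext:
    """Model decrypt-then-encrypt inside the TEE: the result is a fresh ciphertext."""
    plaintext = ct.value                       # decrypt (TEE-side in the real system)
    return ToyCiphertext(plaintext, MAX_LEVEL)  # re-encrypt at full level


def run_chain(x: float, weights: list[float]) -> tuple[ToyCiphertext, int]:
    """Apply the weights in sequence, refreshing whenever the budget is empty."""
    ct = ToyCiphertext(x, MAX_LEVEL)
    refreshes = 0
    for w in weights:
        if ct.level == 0:
            ct = dte_refresh(ct)
            refreshes += 1
        ct = he_linear(ct, w)
    return ct, refreshes


if __name__ == "__main__":
    ct, n = run_chain(1.0, [2.0] * 7)
    print(ct.value, ct.level, n)  # 128.0 2 2
```

In a pure-FHE pipeline the refresh step would be a homomorphic bootstrapping operation; the sketch only shows where a TEE-side refresh slots into the chain.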

2. Methodological Rigor

The paper demonstrates reasonable care in separating measurement types: direct CKKS/FHE measurements for smaller models, projected baselines for Pure FHE, and estimator-style rows for comparison with prior work (Euston). This transparency is commendable, as many papers in this space conflate projected and measured results.

However, several rigor concerns emerge:

  • Scale limitations are significant. Direct FHE measurements cover only GPT-2 (124M) and Qwen3 (0.6B), both very small models by modern standards. The larger-model results (GPT-2 1.5B, LLaMA 3 8B) are estimator-style projections only, which undermines the headline speedup claims (9.25×, 9.91×). The paper acknowledges this, but the distinction may be lost on casual readers.
  • The Pure FHE baseline is itself projected, not a direct deployment. Comparing a measured hybrid system against a projected baseline introduces compounding estimation uncertainties.
  • The memory wall is acknowledged but underexplored. Qwen3 (0.6B) already falls back to partial caching (23 of 28 layers), with 34.4% of runtime spent on fallback weight encoding. This strongly suggests that the approach faces fundamental scalability barriers at production-relevant model sizes (7B+), yet the paper does not analyze paths forward in depth beyond listing "future work."
  • No comparison with CPU+GPU TEE baselines. The paper positions itself against Pure FHE but does not benchmark against the arguably more practical alternative of confidential GPU computing (e.g., NVIDIA H100 CC mode), which would provide a much more competitive baseline.
  • Single-accelerator, single-tenant evaluation limits generalizability. Real serving systems handle concurrent requests, batching, and multi-GPU setups.
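
As a rough illustration of what the memory-wall figure implies, the sketch below applies Amdahl's law to the 34.4% share attributed to fallback weight encoding. It assumes that figure is a fraction of end-to-end runtime and that the rest of the pipeline is unaffected; `amdahl_speedup` is a helper written for this example.

```python
def amdahl_speedup(fraction: float, factor: float) -> float:
    """Overall speedup when `fraction` of runtime is accelerated by `factor`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


if __name__ == "__main__":
    fallback_share = 0.344  # figure quoted in the review for Qwen3 (0.6B)
    # Upper bound: fallback encoding removed entirely (factor -> infinity).
    print(round(amdahl_speedup(fallback_share, float("inf")), 2))  # 1.52
    # More modest case: fallback encoding made twice as fast.
    print(round(amdahl_speedup(fallback_share, 2.0), 2))  # 1.21
```

Under these assumptions, even fully caching all layers would yield at most about a 1.5× improvement for this model, so the memory wall matters but is not the only obstacle to practical latency.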

3. Potential Impact

The paper addresses a genuine and growing need: privacy-preserving LLM inference in cloud settings. The hybrid TEE-FHE design is architecturally clean and offers a middle ground between the impracticality of pure FHE inference and the limited availability/trust concerns of GPU TEEs.

Positive impact factors:

  • The operator-affinity framework is general and could guide future system designs combining different privacy technologies.
  • The PD split adaptation to the trust boundary is a natural but previously unexplored idea.
  • The prototype on nano-vLLM demonstrates end-to-end feasibility with a real serving stack.
  • DtE refresh as a TEE-assisted alternative to bootstrapping could be adopted by other hybrid systems.

Impact-limiting factors:

  • Absolute latencies remain impractical: ~4s TTFT and ~4s/token decode for GPT-2 (124M), ~23s TTFT and ~22s/token for Qwen3 (0.6B). These are orders of magnitude slower than plaintext inference and far from interactive use.
  • The memory wall at sub-1B parameters makes scaling to production LLMs (7B-70B+) deeply uncertain.
  • The competitive landscape includes GPU TEEs (NVIDIA CC, AMD MI300X SEV) that may offer better performance under slightly different trust assumptions.
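
The approximate latencies quoted above can be turned into end-to-end response times with simple arithmetic. The sketch assumes TTFT already covers the first output token and that per-token decode time stays constant; the 100-token response length is an arbitrary example, and `response_latency_s` is a helper written for this illustration.

```python
def response_latency_s(ttft_s: float, per_token_s: float, n_tokens: int) -> float:
    """Total time to produce n_tokens, assuming TTFT includes the first token."""
    if n_tokens < 1:
        raise ValueError("n_tokens must be at least 1")
    return ttft_s + (n_tokens - 1) * per_token_s


if __name__ == "__main__":
    # Approximate figures from the review (both are stated with "~").
    gpt2 = response_latency_s(4.0, 4.0, 100)     # 400.0 s
    qwen3 = response_latency_s(23.0, 22.0, 100)  # 2201.0 s
    print(f"GPT-2 (124M): {gpt2:.0f} s ({gpt2 / 60:.1f} min)")
    print(f"Qwen3 (0.6B): {qwen3:.0f} s ({qwen3 / 60:.1f} min)")
```

Under these assumptions a 100-token reply takes roughly 7 minutes on GPT-2 (124M) and roughly 37 minutes on Qwen3 (0.6B), which is what "far from interactive use" means in practice.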

4. Timeliness & Relevance

The paper is timely: LLM privacy is a pressing concern as enterprises adopt cloud-hosted models for sensitive workloads. The community needs practical solutions beyond pure FHE (too slow) and pure TEE (limited accelerator support). The hybrid approach fills a real gap in the design space.

However, the rapid maturation of GPU TEEs (NVIDIA Confidential Computing, AMD SEV for GPUs) may narrow the window for TEE+FHE hybrid approaches. If GPU TEEs become widely available and performant, the complexity of the FHE delegation path becomes harder to justify.

5. Strengths & Limitations

Strengths:

  • Clear articulation of the trust model and explicit leakage contract
  • Systematic operator-affinity analysis with measured cost breakdowns
  • Honest separation of measured vs. projected results
  • Phase-aware PD split is a well-motivated serving optimization
  • End-to-end prototype rather than isolated kernel benchmarks
  • Transparent about limitations (memory wall, side channels, scale)

Limitations:

  • Direct results limited to very small models (124M, 0.6B parameters)
  • Absolute performance still far from practical for interactive use
  • Memory scalability is a fundamental barrier acknowledged but unresolved
  • No comparison with GPU TEE approaches (confidential computing GPUs)
  • Estimator-style results for larger models reduce confidence in headline claims
  • CKKS parameters (N=8192) are relatively conservative; impact of security parameter choices on performance not explored
  • Side-channel leakage (timing, access patterns) explicitly excluded but potentially significant in practice
  • No accuracy/quality evaluation showing output fidelity matches plaintext inference

Additional Observations

The paper is well-written and unusually transparent about what is measured versus projected, which is refreshing for this area. The framing as a "systems lesson" rather than a cryptographic breakthrough is appropriate. However, the contribution feels more like a well-executed systems integration study than a fundamental advance: the individual components (CKKS on GPU, TEE for non-linear ops, PD splitting) are all known ideas combined in a natural way.

Rating: 5/10
Significance 5.5 · Rigor 5 · Novelty 4.5 · Clarity 7.5

Generated Jun 17, 2026

Comparison History (22)

Won vs. From Efficiency to Leakage -- Privacy Backdoor in Federated Language Model Fine-Tuning

Paper 1 likely has higher impact: it proposes a concrete hybrid TEE–FHE serving architecture for privacy-preserving LLM inference with large projected speedups, addressing a timely, practical bottleneck for deploying confidential AI on untrusted accelerators. The approach is system-level and broadly applicable across models and cloud inference stacks, with clear real-world deployment pathways. Paper 2 is novel and important for FL security, but is narrower in scope (malicious server in PEFT-FL) and primarily exposes a vulnerability rather than providing a broadly enabling capability.

gpt-5.2 · Jun 19, 2026

Lost vs. How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition

Paper 2 likely has higher scientific impact due to timeliness and broad relevance: indirect prompt injection is a pressing, cross-industry safety issue for deployed agentic LLMs. Its large-scale public competition yields a substantial empirical dataset across many models and scenarios, enabling reproducible evaluation and follow-on research, and the planned quarterly updates increase lasting value. While Paper 1 is technically innovative, its impact is narrower (privacy-preserving inference infrastructure) and relies partly on estimator-style comparisons rather than full end-to-end validated deployments, which may limit immediate adoption and generalization.

gpt-5.2 · Jun 17, 2026

Lost vs. The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

Paper 2 establishes a fundamental theoretical impossibility result (a trilemma) regarding prompt injection defenses, mathematically proving that wrapper defenses cannot be simultaneously continuous, utility-preserving, and completely safe. Mechanically verified theorems that redirect an entire field's research focus away from dead ends generally have a broader, longer-lasting scientific impact than the system-level performance optimizations presented in Paper 1.

gemini-3.1-pro-preview · Jun 17, 2026

Lost vs. Beyond Indistinguishability: Measuring Extraction Risk in LLM APIs

Paper 2 addresses a fundamental theoretical gap in LLM privacy by formalizing the separation between indistinguishability and extractability, two widely conflated concepts. Its contributions span theory (formal privacy-game separations), practical metrics ((l,b)-inextractability), and actionable deployment guidelines. This has broader impact across ML privacy, security, and policy communities. Paper 1, while technically strong in combining TEE and FHE for private inference, is more incremental in its systems-engineering contribution, building on existing cryptographic primitives with narrower applicability to specific deployment scenarios.

claude-opus-4-6 · Jun 17, 2026

Lost vs. AgentRFC: Security Design Principles and Conformance Testing for Agent Protocols

Paper 2 likely has higher scientific impact: it introduces a unifying, formal security framework (stack model + TLA+ invariants) and a reusable conformance-testing pipeline that can apply across many rapidly deployed agent protocols, potentially influencing standards, implementations, and auditing practices broadly. Its “composition safety” principle targets emergent cross-protocol failures, a timely and general problem as agent ecosystems interconnect. Paper 1 is innovative and practically valuable for confidential LLM inference, but its impact is narrower (specific serving architecture and performance estimators) and more dependent on deployment assumptions and hardware/crypto tradeoffs.

gpt-5.2 · Jun 17, 2026

Lost vs. Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

Paper 2 demonstrates higher potential scientific impact due to its concrete, verified real-world results: discovering 10 zero-day vulnerabilities in Google Chrome, including two critical sandbox escapes with assigned CVEs. This provides undeniable evidence of practical impact in cybersecurity. The AgentFlow DSL and feedback-driven harness optimization framework also introduces a novel, generalizable methodology for multi-agent system design. While Paper 1 (Bifrost) presents a solid engineering contribution combining TEE and FHE for private LLM inference with impressive speedups, it remains largely an architectural optimization with estimator-based comparisons rather than demonstrated real-world deployment. Paper 2's breadth of impact across security, AI agents, and software engineering, combined with its immediate practical significance, gives it the edge.

claude-opus-4-6 · Jun 17, 2026

Lost vs. Undetectable Conversations Between AI Agents via Pseudorandom Noise-Resilient Key Exchange

Paper 1 introduces a fundamentally new cryptographic primitive (pseudorandom noise-resilient key exchange) and establishes a surprising impossibility result for AI safety: that transcript auditing alone cannot prevent covert coordination between AI agents. This has profound implications for AI governance, alignment, and multi-agent safety, touching policy and regulation beyond just cryptography. Paper 2, while technically solid and practically useful, is primarily a systems optimization combining existing primitives (TEE + FHE) for faster privacy-preserving inference, representing incremental engineering advances rather than foundational new insights.

claude-opus-4-6 · Jun 17, 2026

Lost vs. TALUS: Threshold ML-DSA with One-Round Online Signing via Boundary Clearance and Carry Elimination

Paper 2 (TALUS) likely has higher impact: it tackles an open, standard-relevant cryptographic problem, threshold ML-DSA (FIPS 204), and delivers the first one-round online signing with standard, drop-in verifiable signatures. It adds a formal impossibility result (Lattice Threshold Trilemma) plus generally useful techniques (BCC, CEF), with rigorous security reductions and concrete implementations across all parameter sets. This has broad applicability to post-quantum secure infrastructure (HSMs, wallets, distributed key management) and is highly timely given ML-DSA standardization and deployment.

gpt-5.2 · Jun 17, 2026

Won vs. SPARK: Security Knowledge Priming and Representation-Guided Knowledge Activation for LLM-based Secure Code Generation

Paper 2 addresses a fundamental computational bottleneck in privacy-preserving LLM inference. By innovatively combining CPU TEEs with FHE, it achieves massive latency reductions (9x to 53x) over pure FHE approaches. This architectural breakthrough has immense real-world applications for secure cloud computing and regulatory compliance. While Paper 1 provides a clever, lightweight solution for secure code generation, Paper 2's deep systemic integration and significant performance gains in a highly constrained area give it a broader and more transformative scientific impact.

gemini-3.1-pro-preview · Jun 17, 2026

Won vs. Guiding Symbolic Execution with Static Analysis and LLMs for Vulnerability Discovery

Paper 2 (Bifrost) likely has higher scientific impact due to broader cross-field relevance and timeliness: it targets a central, rapidly growing need (privacy-preserving LLM serving), bridging systems, cryptography (FHE/CKKS), and trusted computing (TEEs). The hybrid TEE–FHE partitioning and prefill/decode split offer a generally applicable architectural principle (“selective encrypted execution”) that could influence deployed cloud inference stacks. Paper 1 is strong and rigorous with impressive vulnerability yield, but its impact is more specialized to software security tooling and depends on LLM-assisted synthesis reliability.

gpt-5.2 · Jun 17, 2026