Chenghao Chen, Kailun Qin, Xiaolin Zhang, Chi Zhang, Dawu Gu
Cloud-hosted transformer and large language model (LLM) inference creates a direct confidentiality problem: user prompts may contain sensitive code, business data, personal information, or regulated documents, yet remote serving exposes intermediate state to the cloud software stack and accelerator runtime. Fully homomorphic encryption (FHE) keeps accelerator-side execution ciphertext-only, but end-to-end LLM inference remains expensive because linear layers are interleaved with non-linear, cache-state, and refresh-sensitive operators. CPU trusted execution environments (TEEs) can execute those operators natively, but a CPU TEE alone does not define how an untrusted accelerator should participate. We present Bifrost, a hybrid TEE-FHE serving architecture in which secrets are provisioned only to an attested CPU TEE, while the accelerator, device memory, driver/runtime stack, and host software remain outside the trusted computing base. Bifrost uses FHE as a secure delegation mechanism for projection and feed-forward linear layers on accelerator-backed CKKS, while non-linear operators, attention-side control logic, KV-state transitions, and decrypt-then-encrypt refresh execute inside the CPU TEE. Bifrost+ further applies a prefill/decode split: prompt-side KV state is built inside the CPU TEE, and only decode-side state enters the hybrid ciphertext path. In an estimator-style comparison matching Euston's methodology, Bifrost reduces projected latency by 9.25x on GPT-2 (1.5B) and 9.91x on LLaMA 3 (8B). In direct CKKS/FHE deployments, Bifrost+ reduces TTFT by 14.6-45.8x on GPT-2 (124M) and 15.3-53.4x on Qwen3 (0.6B). The systems lesson is selective encrypted execution: use FHE only where ciphertext-only accelerator delegation is required, and keep non-linear, refresh, and prompt-side work inside the CPU TEE.
Bifrost proposes a hybrid architecture that partitions privacy-preserving LLM inference between a CPU Trusted Execution Environment (TEE) and an untrusted accelerator running Fully Homomorphic Encryption (FHE). The key insight is operator-affinity splitting: linear layers (GEMMs) are delegated to the accelerator via CKKS-encrypted computation, while non-linear operators (LayerNorm, Softmax, GELU, RoPE), ciphertext refresh, and control logic execute in plaintext inside the CPU TEE. Bifrost+ extends this with a prefill/decode (PD) split, completing prompt processing entirely within the TEE and only routing decode-phase tokens through the hybrid FHE path.
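The placement rule summarized above can be sketched as a small routing function. This is an illustrative sketch only, not the paper's implementation: the operator names, the `Site` enum, and the `place` function are hypothetical, and the grouping simply mirrors the split described in the abstract and summary.

```python
from enum import Enum


class Site(Enum):
    TEE = "cpu_tee_plaintext"      # plaintext execution inside the attested CPU TEE
    ACCEL = "accelerator_ckks"     # ciphertext-only execution on the untrusted accelerator


# Hypothetical operator names; the grouping follows the review's summary.
LINEAR_OPS = {"qkv_proj", "out_proj", "ffn_up", "ffn_down"}
NONLINEAR_OPS = {"layernorm", "softmax", "gelu", "rope"}
TEE_ONLY_OPS = {"refresh", "kv_update", "attention_control"}


def place(op: str, phase: str, pd_split: bool = False) -> Site:
    """Return where an operator runs under the assumed placement rule."""
    if op not in LINEAR_OPS | NONLINEAR_OPS | TEE_ONLY_OPS:
        raise ValueError(f"unknown operator: {op}")
    if phase not in ("prefill", "decode"):
        raise ValueError(f"unknown phase: {phase}")
    # Bifrost+-style PD split (assumed form): all prompt-side work stays in the TEE.
    if pd_split and phase == "prefill":
        return Site.TEE
    # Otherwise only linear layers are delegated to the accelerator.
    return Site.ACCEL if op in LINEAR_OPS else Site.TEE
```

With `pd_split=False` the function models the base Bifrost split, where linear layers go to the accelerator in both phases. With `pd_split=True` it models the Bifrost+ variant, where the accelerator only sees decode-phase linear layers.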
The central design principle, "selective encrypted execution", is intuitive and well-motivated. Rather than forcing the entire transformer through expensive homomorphic circuits, the system exploits the observation that non-linear operators are both the costliest and the most numerically fragile to evaluate under FHE, whereas linear layers account for most of the arithmetic and map cleanly to CKKS. Replacing homomorphic bootstrapping with a decrypt-then-encrypt (DtE) refresh is another pragmatic optimization enabled by the TEE trust boundary.
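To make the DtE refresh concrete, the toy model below tracks only a ciphertext's remaining multiplicative depth. It performs no real cryptography and uses no FHE library: `ToyCiphertext`, `MAX_LEVEL`, and every function name are assumptions made for illustration. It shows the control flow in which an exhausted ciphertext is decrypted and re-encrypted, a step that is only acceptable because it would run inside the TEE, instead of being bootstrapped homomorphically.

```python
from dataclasses import dataclass

MAX_LEVEL = 3  # assumed depth budget of a fresh ciphertext (illustrative value)


@dataclass
class ToyCiphertext:
    value: float  # stand-in for the encrypted payload; NOT actually encrypted
    level: int    # remaining multiplicative depth


def encrypt(x: float) -> ToyCiphertext:
    return ToyCiphertext(x, MAX_LEVEL)


def decrypt(ct: ToyCiphertext) -> float:
    return ct.value


def he_mul_plain(ct: ToyCiphertext, w: float) -> ToyCiphertext:
    """Accelerator-side multiply by a plaintext weight; consumes one level."""
    if ct.level == 0:
        raise RuntimeError("depth exhausted; refresh required")
    return ToyCiphertext(ct.value * w, ct.level - 1)


def dte_refresh(ct: ToyCiphertext) -> ToyCiphertext:
    """TEE-side refresh: decrypt, then re-encrypt at full depth."""
    return encrypt(decrypt(ct))


def run_chain(x: float, weights: list[float]) -> tuple[float, int]:
    """Apply the weights in sequence, refreshing whenever depth runs out."""
    ct = encrypt(x)
    refreshes = 0
    for w in weights:
        if ct.level == 0:
            ct = dte_refresh(ct)
            refreshes += 1
        ct = he_mul_plain(ct, w)
    return decrypt(ct), refreshes
```

In a real deployment the refresh would also involve ciphertext transfer between the accelerator and the TEE, which this sketch does not model.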
The paper demonstrates reasonable care in separating measurement types: direct CKKS/FHE measurements for smaller models, projected baselines for Pure FHE, and estimator-style rows for comparison with prior work (Euston). This transparency is commendable, as many papers in this space conflate projected and measured results.
However, several rigor concerns emerge:
The paper addresses a genuine and growing need: privacy-preserving LLM inference in cloud settings. The hybrid TEE-FHE design is architecturally clean and offers a middle ground between the impracticality of pure FHE inference and the limited availability/trust concerns of GPU TEEs.
The paper is timely: LLM privacy is a pressing concern as enterprises adopt cloud-hosted models for sensitive workloads. The community needs practical solutions beyond pure FHE (too slow) and pure TEE (limited accelerator support). The hybrid approach fills a real gap in the design space.
However, the rapid maturation of GPU TEEs (NVIDIA Confidential Computing, AMD SEV for GPUs) may narrow the window for TEE+FHE hybrid approaches. If GPU TEEs become widely available and performant, the complexity of the FHE delegation path becomes harder to justify.
The paper is well-written and unusually transparent about what is measured versus projected, which is refreshing for this area. The framing as a "systems lesson" rather than a cryptographic breakthrough is appropriate. However, the contribution feels more like a well-executed systems integration study than a fundamental advance: the individual components (CKKS on GPU, TEE for non-linear ops, PD splitting) are all known ideas combined in a natural way.
Generated Jun 17, 2026
Paper 1 (Bifrost) likely has higher impact: it proposes a concrete hybrid TEE–FHE serving architecture for privacy-preserving LLM inference with large projected speedups, addressing a timely, practical bottleneck for deploying confidential AI on untrusted accelerators. The approach is system-level and broadly applicable across models and cloud inference stacks, with clear real-world deployment pathways. Paper 2 is novel and important for FL security, but is narrower in scope (malicious server in PEFT-FL) and primarily exposes a vulnerability rather than providing a broadly enabling capability.
Paper 2 likely has higher scientific impact due to timeliness and broad relevance: indirect prompt injection is a pressing, cross-industry safety issue for deployed agentic LLMs. Its large-scale public competition yields a substantial empirical dataset across many models and scenarios, enabling reproducible evaluation and follow-on research, and the planned quarterly updates increase lasting value. While Paper 1 (Bifrost) is technically innovative, its impact is narrower (privacy-preserving inference infrastructure) and relies partly on estimator-style comparisons rather than full end-to-end validated deployments, which may limit immediate adoption and generalization.
Paper 2 establishes a fundamental theoretical impossibility result (a trilemma) regarding prompt injection defenses, mathematically proving that wrapper defenses cannot be simultaneously continuous, utility-preserving, and completely safe. Mechanically verified theorems that redirect an entire field's research focus away from dead ends generally have a broader, longer-lasting scientific impact than the system-level performance optimizations presented in Paper 1 (Bifrost).
Paper 2 addresses a fundamental theoretical gap in LLM privacy by formalizing the separation between indistinguishability and extractability, two widely conflated concepts. Its contributions span theory (formal privacy-game separations), practical metrics ((l,b)-inextractability), and actionable deployment guidelines. This has broader impact across ML privacy, security, and policy communities. Paper 1 (Bifrost), while technically strong in combining TEE and FHE for private inference, is more incremental in its systems-engineering contribution, building on existing cryptographic primitives with narrower applicability to specific deployment scenarios.
Paper 2 likely has higher scientific impact: it introduces a unifying, formal security framework (stack model + TLA+ invariants) and a reusable conformance-testing pipeline that can apply across many rapidly deployed agent protocols, potentially influencing standards, implementations, and auditing practices broadly. Its “composition safety” principle targets emergent cross-protocol failures, a timely and general problem as agent ecosystems interconnect. Paper 1 (Bifrost) is innovative and practically valuable for confidential LLM inference, but its impact is narrower (specific serving architecture and performance estimators) and more dependent on deployment assumptions and hardware/crypto tradeoffs.
Paper 2 demonstrates higher potential scientific impact due to its concrete, verified real-world results: discovering 10 zero-day vulnerabilities in Google Chrome, including two critical sandbox escapes with assigned CVEs. This provides undeniable evidence of practical impact in cybersecurity. The AgentFlow DSL and feedback-driven harness optimization framework also introduces a novel, generalizable methodology for multi-agent system design. While Paper 1 (Bifrost) presents a solid engineering contribution combining TEE and FHE for private LLM inference with impressive speedups, it remains largely an architectural optimization with estimator-based comparisons rather than demonstrated real-world deployment. Paper 2's breadth of impact across security, AI agents, and software engineering, combined with its immediate practical significance, gives it the edge.
Paper 1 introduces a fundamentally new cryptographic primitive (pseudorandom noise-resilient key exchange) and establishes a surprising impossibility result for AI safety: transcript auditing alone cannot prevent covert coordination between AI agents. This has profound implications for AI governance, alignment, and multi-agent safety, touching policy and regulation beyond just cryptography. Paper 2 (Bifrost), while technically solid and practically useful, is primarily a systems optimization combining existing primitives (TEE + FHE) for faster privacy-preserving inference, representing incremental engineering advances rather than foundational new insights.
Paper 2 (TALUS) likely has higher impact: it tackles an open, standard-relevant cryptographic problem—threshold ML-DSA (FIPS 204)—and delivers the first one-round online signing with standard, drop-in verifiable signatures. It adds a formal impossibility result (Lattice Threshold Trilemma) plus generally useful techniques (BCC, CEF), with rigorous security reductions and concrete implementations across all parameter sets. This has broad applicability to post-quantum secure infrastructure (HSMs, wallets, distributed key management) and is highly timely given ML-DSA standardization and deployment.
Paper 2 (Bifrost) addresses a fundamental computational bottleneck in privacy-preserving LLM inference. By innovatively combining CPU TEEs with FHE, it achieves massive latency reductions (9x to 53x) over pure FHE approaches. This architectural breakthrough has immense real-world applications for secure cloud computing and regulatory compliance. While Paper 1 provides a clever, lightweight solution for secure code generation, Paper 2's deep systemic integration and significant performance gains in a highly constrained area give it a broader and more transformative scientific impact.
Paper 2 (Bifrost) likely has higher scientific impact due to broader cross-field relevance and timeliness: it targets a central, rapidly growing need—privacy-preserving LLM serving—bridging systems, cryptography (FHE/CKKS), and trusted computing (TEEs). The hybrid TEE–FHE partitioning and prefill/decode split offer a generally applicable architectural principle (“selective encrypted execution”) that could influence deployed cloud inference stacks. Paper 1 is strong and rigorous with impressive vulnerability yield, but its impact is more specialized to software security tooling and depends on LLM-assisted synthesis reliability.