End-to-End Encrypted AI Inference with Post-Quantum Cryptography

What This Means and Why It's Important
When you send a prompt to an AI provider, you're trusting them with the full content of your request. For casual questions or creative writing, that's fine. But for a growing number of real-world applications, it's a serious problem.

A physician using an AI assistant to reason about a patient's symptoms and lab results is transmitting protected health information. Under HIPAA, that data must be safeguarded, and "trust us, we don't look" is not a compliance strategy. Attorneys using AI to analyze contracts or summarize depositions are handling privileged communications; if the provider can read those prompts, privilege may be waived. Quantitative analysts prototyping trading strategies with AI are exposing proprietary intellectual property where a single leaked strategy could cost millions.

Beyond professional contexts, companies building internal AI tools over roadmaps, M&A plans, or board communications need guarantees that their data isn't being logged, trained on, or accessible to the provider's employees. And people discussing sensitive personal matters -- mental health, relationship issues, financial distress -- deserve the same privacy they'd expect from a therapist or attorney.

The uncomfortable reality is that most frontier AI labs not only can read your data, they often do, for training, safety filtering, abuse detection, and debugging. Their privacy policies typically grant broad rights to process and retain your inputs. And it's not just the provider's own employees: most large AI companies employ networks of contractors for data labeling, safety filtering, and quality assurance. Those contractors may in turn subcontract to other firms. Your sensitive prompt might pass through the hands of people you've never heard of, working for companies you didn't know existed, in jurisdictions with different privacy laws. You genuinely do not know who can see your data.

Even when they promise not to train on your data, the infrastructure itself has access: your prompts pass through load balancers, API gateways, logging pipelines, and application servers, any of which could be compromised or subpoenaed.

The right answer isn't "trust us." The right answer is architectural: make it so that reading your data is not physically possible. That's what we built.

Architecture Overview

Chutes E2EE provides true end-to-end encryption for AI inference. Your prompts are encrypted on your machine using post-quantum cryptography. They travel through our API as opaque ciphertext. They are decrypted only inside a hardware-isolated Trusted Execution Environment (Intel TDX) with encrypted memory and GPU VRAM. The response is encrypted inside that same enclave before it leaves, and decrypted only on your machine.

At no point can Chutes, the hosting provider, or any network intermediary read your data. The cryptographic design makes it impossible.

Here's the full flow:

Client Side

1. Fetch instance list + nonces

GET /e2e/instances/{chute_id}

← Returns: instance_ids, ML-KEM-768 public keys, single-use nonces

2. (Optional) Verify TEE attestation

GET /instances/{id}/attestation?nonce={your_random_nonce}

← Returns: TDX quote with SHA256(nonce || e2e_pubkey), NVIDIA evidence

→ Verify quote via Intel DCAP • Confirm report_data binds nonce to key

→ This proves the key was generated inside a genuine TEE

3. Encrypt request

a. Generate ephemeral ML-KEM-768 keypair

b. ML-KEM Encapsulate: use instance's public key → shared_secret

c. HKDF-SHA256(shared_secret, salt, info) → symmetric key

d. Inject client's ephemeral public key into JSON body

e. Gzip compress • ChaCha20-Poly1305 encrypt

4. Send encrypted request

POST /e2e/invoke

Headers: X-Chute-Id, X-Instance-Id, X-E2E-Nonce...

Body: encrypted blob (application/octet-stream)

Chutes API

5. Validate nonce (atomic Redis Lua script)

• Check nonce exists, matches instance_id, hasn't been used

• Delete nonce atomically (single-use enforcement)

• Reject with 403 if invalid/expired/reused

6. Re-encrypt for transport to instance (mTLS)

• The API cannot read the E2E payload; it's opaque ciphertext

• Wraps in transport-layer encryption for the mTLS tunnel

• Forwards to the specific GPU instance

NOTE: The API sees only ciphertext. It handles routing, nonce validation, billing, and rate limiting. It CANNOT decrypt the E2E payload.

GPU instance (Intel TDX TEE)

7. Decrypt request

a. Strip transport encryption • Extract ML-KEM ciphertext

b. ML-KEM Decapsulate with instance private key → shared_secret

c. ChaCha20-Poly1305 decrypt + verify auth tag

d. Extract client's ephemeral public key (e2e_response_pk)

8. Run inference (model executes on decrypted prompt)

9. Encrypt response

Non-Streaming

a. ML-KEM Encapsulate using client's public key

b. Derive response key via HKDF • ChaCha20-Poly1305 encrypt

Streaming

a. Send e2e_init SSE event with ML-KEM ciphertext

b. Each chunk: ChaCha20-Poly1305 encrypt with stream key

c. Stream end: wipe all key material

Client Side

10. Decrypt response

a. Extract ML-KEM ciphertext • Decapsulate shared_secret

b. Derive symmetric key • Decrypt • Gzip decompress

For streaming: process e2e_init, then decrypt each e2e chunk
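
As a rough illustration of step 4, the sketch below assembles the POST /e2e/invoke request from values obtained in steps 1 and 3. The path, the three header names, and the content type come from the flow above; the InvokeRequest container and build_invoke_request helper are hypothetical names used only for this example. Authentication and the additional headers elided in the flow are deliberately left out rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InvokeRequest:
    """Hypothetical container describing the request; nothing is sent."""
    method: str
    path: str
    headers: dict
    body: bytes


def build_invoke_request(chute_id: str, instance_id: str, nonce: str,
                         encrypted_blob: bytes) -> InvokeRequest:
    """Describe the step 4 request for an already-encrypted payload."""
    # Refuse anything that is not raw bytes, so a plaintext str or dict
    # cannot be passed through by accident.
    if not isinstance(encrypted_blob, (bytes, bytearray)):
        raise TypeError("encrypted_blob must be bytes produced by step 3")
    headers = {
        "X-Chute-Id": chute_id,
        "X-Instance-Id": instance_id,  # the nonce is bound to this instance
        "X-E2E-Nonce": nonce,          # single-use token from step 1
        "Content-Type": "application/octet-stream",
    }
    return InvokeRequest("POST", "/e2e/invoke", headers, bytes(encrypted_blob))
```

The nonce and the instance ID travel together because the API rejects a nonce presented for any instance other than the one it was issued for.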

The Cryptographic Stack

ML-KEM-768 (Kyber): Post-Quantum Key Encapsulation

Every E2EE request uses ML-KEM-768 (formerly CRYSTALS-Kyber), the NIST-standardized post-quantum key encapsulation mechanism. ML-KEM is a lattice-based scheme whose security rests on the hardness of the Module Learning With Errors (MLWE) problem, believed to be resistant to both classical and quantum attacks.

The key sizes for ML-KEM-768:

Parameter	Size
Public key	1,184 bytes
Private key	2,400 bytes
Ciphertext	1,088 bytes
Shared secret	32 bytes
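
Because these sizes are fixed by the parameter set, a client can reject a malformed key or ciphertext before doing any cryptography. The sketch below shows one way to do that; the constant and helper names are illustrative assumptions, not part of any Chutes library.

```python
# ML-KEM-768 sizes in bytes, copied from the table above.
ML_KEM_768_SIZES = {
    "public_key": 1184,
    "private_key": 2400,
    "ciphertext": 1088,
    "shared_secret": 32,
}


def check_size(kind: str, value: bytes) -> bytes:
    """Return value unchanged if its length matches the expected size."""
    expected = ML_KEM_768_SIZES[kind]
    if len(value) != expected:
        raise ValueError(f"{kind}: expected {expected} bytes, got {len(value)}")
    return value
```

A length check is only a sanity filter. It says nothing about whether the key is genuine, which is what attestation is for.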

Each request generates a fresh ephemeral ML-KEM keypair on the client side. The client encapsulates a shared secret using the instance's public key, then derives a symmetric key from that shared secret. The instance decapsulates with its private key to recover the same shared secret and thus the same symmetric key. The ephemeral keypair is discarded after the request completes.

For the response, the flow reverses: the instance encapsulates using the client's response public key (which was embedded in the encrypted request payload), and the client decapsulates with the corresponding private key.

This double key exchange means every request-response pair uses entirely independent key material. Compromising one exchange reveals nothing about any other.

HKDF-SHA256: Key Derivation

Raw shared secrets from ML-KEM are not used directly as encryption keys. Instead, we use HKDF (HMAC-based Key Derivation Function) with SHA-256 to derive purpose-specific symmetric keys:

request_key  = HKDF-SHA256(shared_secret, salt=CT[:16], info="e2e-req-v1")
response_key = HKDF-SHA256(shared_secret, salt=CT[:16], info="e2e-resp-v1")
stream_key   = HKDF-SHA256(shared_secret, salt=CT[:16], info="e2e-stream-v1")

The info parameter provides domain separation: even if the same shared secret were somehow reused (it isn't), the derived keys would be cryptographically independent across all three purposes.
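
To make the domain separation concrete, here is a minimal HKDF-SHA256 sketch built on the Python standard library, following the extract-then-expand construction from RFC 5869. The salt and info labels mirror the lines above; the function names are our own for this example, and deriving all three keys from one secret is done purely to demonstrate that different info strings yield unrelated keys.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF with SHA-256 (RFC 5869): extract a PRK, then expand it."""
    if length > 255 * 32:
        raise ValueError("requested length exceeds the HKDF-SHA256 limit")
    # Extract: PRK = HMAC-SHA256(key=salt, msg=ikm)
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()
    # Expand: T(i) = HMAC-SHA256(PRK, T(i-1) || info || i)
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_keys(shared_secret: bytes, kem_ciphertext: bytes) -> dict:
    """Derive one 32-byte key per purpose label, salted with CT[:16]."""
    salt = kem_ciphertext[:16]
    labels = ("e2e-req-v1", "e2e-resp-v1", "e2e-stream-v1")
    return {label: hkdf_sha256(shared_secret, salt, label.encode("ascii"))
            for label in labels}
```

Calling derive_keys twice with the same inputs returns the same keys, which is what lets both sides arrive at a matching symmetric key without ever transmitting it.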

ChaCha20-Poly1305: Authenticated Encryption

All payload encryption uses ChaCha20-Poly1305, an AEAD (Authenticated Encryption with Associated Data) cipher. ChaCha20 provides the encryption; Poly1305 provides the authentication tag, guaranteeing that any tampering with the ciphertext is detected.

Each encryption operation uses a random 12-byte nonce. The encrypted output is structured as:

[nonce (12 bytes)] [ciphertext (variable)] [auth tag (16 bytes)]

We chose ChaCha20-Poly1305 over AES-GCM because it performs well without hardware AES acceleration (relevant in some TEE contexts), it's resistant to timing side-channels by design, and it has a simpler implementation to audit.
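
The layout above is simple enough to parse by hand. The following sketch splits a blob into its three fields and reassembles it. It performs no encryption or authentication, and the Frame, split_blob, and join_blob names are assumptions made for the example.

```python
from typing import NamedTuple

NONCE_LEN = 12  # random per-message nonce
TAG_LEN = 16    # Poly1305 authentication tag


class Frame(NamedTuple):
    nonce: bytes
    ciphertext: bytes
    tag: bytes


def split_blob(blob: bytes) -> Frame:
    """Split [nonce][ciphertext][tag] into its parts."""
    if len(blob) < NONCE_LEN + TAG_LEN:
        raise ValueError("blob too short to contain a nonce and a tag")
    return Frame(
        nonce=blob[:NONCE_LEN],
        ciphertext=blob[NONCE_LEN:len(blob) - TAG_LEN],
        tag=blob[len(blob) - TAG_LEN:],
    )


def join_blob(frame: Frame) -> bytes:
    """Inverse of split_blob."""
    return frame.nonce + frame.ciphertext + frame.tag
```

Many AEAD libraries already return the ciphertext with the tag appended, so in practice the sender often only has to prepend the nonce.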

Gzip Compression
All payloads are gzip-compressed before encryption. This reduces bandwidth, which matters for large prompts and responses. Compression does not hide payload size, though: ciphertext length still tracks the compressed length, which in turn depends on the content, so length remains metadata visible to anyone who can observe the ciphertext.
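
A minimal sketch of the plaintext preparation (steps 3d and 3e in the flow) might look like this. The e2e_response_pk field name comes from the flow description; the base64 encoding of the key and both helper names are assumptions for illustration, since the wire encoding is not specified here.

```python
import base64
import gzip
import json


def pack_request(body: dict, response_public_key: bytes) -> bytes:
    """Inject the client's ephemeral public key, serialize, and gzip.

    The result is the plaintext handed to the AEAD cipher.
    """
    payload = dict(body)  # copy so the caller's dict is not modified
    payload["e2e_response_pk"] = base64.b64encode(response_public_key).decode("ascii")
    return gzip.compress(json.dumps(payload).encode("utf-8"))


def unpack_request(plaintext: bytes):
    """Inverse of pack_request: returns (body, response_public_key)."""
    payload = json.loads(gzip.decompress(plaintext).decode("utf-8"))
    response_pk = base64.b64decode(payload.pop("e2e_response_pk"))
    return payload, response_pk
```

Because the response key rides inside the encrypted payload, the API never sees it and cannot swap in a key of its own.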

Why Post-Quantum?

The cryptographic community increasingly operates under the assumption that large-scale quantum computers capable of breaking RSA and elliptic-curve cryptography will eventually be built. The timeline is debated (estimates range from 10 to 30+ years), but the threat model that matters today is "harvest now, decrypt later."

A well-resourced adversary could be recording encrypted traffic today with the intention of decrypting it once quantum computers become capable. For data with long-term sensitivity (trade secrets, medical records, legal communications, intelligence), this is a well-understood risk. NIST explicitly calls it out in their post-quantum guidance: "Even if an adversary can't crack the encryption that protects our secrets at the moment, it could still be beneficial to capture encrypted data and hold onto it, in the hopes that a quantum computer will break the encryption down the road."

By using ML-KEM-768 today, every Chutes E2EE request is protected against this attack vector. Even if a quantum computer capable of running Shor's algorithm at scale appears in 2035, traffic captured in 2025 remains secure. The lattice problems underlying ML-KEM are not vulnerable to known quantum algorithms.

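Ephemeral keys compound this protection: each request uses independent key material on the client side, and the instance's private key is generated inside the TEE and never leaves it. No long-lived key exists outside the enclave whose compromise would retroactively expose historical traffic.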

TEE Attestation: Verifying the Encryption Endpoint
A natural question arises: how do you know the public key you're encrypting to actually belongs to a genuine TEE instance, and not a man-in-the-middle?
This is where hardware attestation comes in.

Every Chutes GPU instance runs inside an Intel TDX (Trust Domain Extensions) confidential VM, with NVIDIA H100/H200 GPUs operating in CC (confidential compute) mode. The TEE provides hardware-rooted attestation that can be independently verified by any caller.

The Attestation Flow

You generate a random 32-byte nonce and request attestation evidence for a specific instance. The instance produces a TDX quote whose report_data field contains SHA256(your_nonce ‖ instance_e2e_pubkey), cryptographically binding your freshness nonce to the instance's ML-KEM public key.

You then verify the quote against Intel's DCAP (Data Center Attestation Primitives) infrastructure. Verification checks that:

• the quote's cryptographic signature is valid and chains to Intel's root of trust
• debug mode is disabled (td_attributes bit 0 is clear)
• report_data contains the expected hash
• the measurement registers (MRTD, RTMRs) match the expected values for the Chutes runtime
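
Two of these checks can be sketched without any attestation library. The example below recomputes the expected report_data hash and tests the debug bit; signature and measurement verification still require Intel's DCAP tooling and are not shown. Several details are assumptions on our part: that the hash is taken over the raw nonce bytes followed by the raw public key bytes, that the 32-byte digest occupies the first half of the 64-byte report_data field, and that td_attributes is an 8-byte little-endian field. All function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def new_attestation_nonce() -> bytes:
    """Fresh 32-byte random nonce for one attestation request."""
    return secrets.token_bytes(32)


def expected_report_data(nonce: bytes, e2e_pubkey: bytes) -> bytes:
    """SHA256(nonce || e2e_pubkey), as described in the attestation flow."""
    return hashlib.sha256(nonce + e2e_pubkey).digest()


def report_data_matches(report_data: bytes, nonce: bytes, e2e_pubkey: bytes) -> bool:
    """Compare the quote's report_data against the recomputed hash.

    Assumption: the digest sits in the first 32 bytes of report_data.
    """
    expected = expected_report_data(nonce, e2e_pubkey)
    return hmac.compare_digest(report_data[:32], expected)


def debug_disabled(td_attributes: bytes) -> bool:
    """True when bit 0 is clear (assuming a little-endian field)."""
    return td_attributes[0] & 0x01 == 0
```

If either check fails, the public key should be treated as untrusted and nothing should be encrypted to it.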

NVIDIA GPU attestation is also provided, verified through the NVIDIA Attestation SDK, proving the GPUs are genuine NVIDIA hardware with host-isolated VRAM.

What This Proves

The nonce binding in report_data proves that the key and the attestation came from the same environment. An attacker cannot substitute their own key without producing a valid TDX quote, which requires genuine Intel TDX hardware running the exact measured software stack. Your random nonce prevents replay of old attestation evidence, so an attacker cannot present a legitimate quote from a different session.

The measurement registers (MRTD for initial memory, RTMRs for runtime state) correspond to the known Chutes runtime image. Because the entire TEE infrastructure is open source with reproducible builds, you can rebuild the image yourself and verify that the measurements match. The NVIDIA attestation proves the GPUs are running with hardware-encrypted VRAM inaccessible to the host OS, hypervisor, or any other tenant.

You can verify all of this yourself. The attestation evidence is cryptographic proof, rooted in hardware, that your data can only be decrypted inside the environment you've verified. For implementation details, see the TEE verification documentation.

Nonce Mechanics: Replay Protection

Every E2EE request requires a single-use nonce token. Here's how the nonce lifecycle works.

The client calls GET /e2e/instances/{chute_id}, which returns up to 5 eligible TEE instances, each with a batch of 10 cryptographically random nonce tokens (secrets.token_urlsafe(24)). Each nonce is stored in Redis, keyed by e2e_nonces:{user_id}:{chute_id}, mapping the nonce to its specific instance ID. Nonces expire server-side after 75 seconds (the client sees a 60-second expiry to provide margin).

When a request arrives at POST /e2e/invoke, the API validates and consumes the nonce using a Redis Lua script that executes atomically:

lua

local val = redis.call('HGET', KEYS[1], ARGV[1])
if val == false then return nil end       -- nonce doesn't exist
if val ~= ARGV[2] then return nil end     -- wrong instance
redis.call('HDEL', KEYS[1], ARGV[1])      -- delete (one-time use)
return val

This guarantees that each nonce can be used at most once, for exactly one instance, by exactly one user. Because the check and the delete run inside a single atomic script, there is no window for a race condition.

The result:

• a captured request cannot be replayed (the nonce is already consumed)
• a nonce for instance A cannot be used against instance B
• nonces are namespaced per user and per chute
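
For readers who want to experiment with these semantics without Redis, here is an in-memory Python model of the same check-and-delete logic. It is a teaching sketch, not the Chutes implementation: a lock stands in for Lua script atomicity, one NonceStore instance stands in for a single per-user, per-chute hash, and per-nonce expiry is an assumption about how the 75-second lifetime is applied.

```python
import secrets
import threading
import time


class NonceStore:
    """In-memory stand-in for one e2e_nonces:{user_id}:{chute_id} hash."""

    def __init__(self, ttl_seconds: float = 75.0, clock=time.monotonic):
        self._lock = threading.Lock()
        self._entries = {}  # nonce -> (instance_id, expires_at)
        self._ttl = ttl_seconds
        self._clock = clock

    def issue(self, instance_id: str, count: int = 10) -> list:
        """Create a batch of random single-use nonces for one instance."""
        expires_at = self._clock() + self._ttl
        nonces = [secrets.token_urlsafe(24) for _ in range(count)]
        with self._lock:
            for nonce in nonces:
                self._entries[nonce] = (instance_id, expires_at)
        return nonces

    def consume(self, nonce: str, instance_id: str) -> bool:
        """Check and delete in one step; True only on first valid use."""
        with self._lock:
            entry = self._entries.get(nonce)
            if entry is None:
                return False              # unknown or already used
            bound_instance, expires_at = entry
            if self._clock() >= expires_at:
                del self._entries[nonce]  # expired
                return False
            if bound_instance != instance_id:
                return False              # wrong instance; nonce is kept
            del self._entries[nonce]      # one-time use
            return True
```

As in the Lua script, presenting a nonce to the wrong instance is rejected without consuming it, so a misrouted request does not burn a valid token.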

Integration: Two Ways to Use E2EE

We provide two integration paths. Both handle all the cryptographic complexity transparently.

Python HTTP Transport (for OpenAI SDK users)

The chutes-e2ee-transport library implements an httpx transport that intercepts requests at the HTTP layer. The OpenAI SDK (or any compatible client) is completely unaware that encryption is happening:

python

import httpx
from openai import OpenAI
from chutes_e2ee import ChutesE2EETransport

client = OpenAI(
    api_key="cpk_...",
    base_url="https://llm.chutes.ai/v1",
    http_client=httpx.Client(
        transport=ChutesE2EETransport(api_key="cpk_...")
    ),
)

# This request is end-to-end encrypted. The code is identical
# to a normal OpenAI API call.
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1-TEE",
    messages=[{"role": "user", "content": "Hello!"}],
)

The transport handles instance discovery, nonce management (with automatic refresh on expiry), ML-KEM key exchange, payload encryption/decryption, and streaming, all transparently. It also supports async via AsyncChutesE2EETransport.

Local E2EE Proxy (for any language or SDK)

The e2ee-proxy is a Docker-based local reverse proxy (OpenResty/Lua + native C) that provides E2EE without any code changes at all. Point your existing SDK at the proxy instead of the Chutes API:

bash

docker run -p 8443:443 parachutes/e2ee-proxy:latest

python

# OpenAI SDK
from openai import OpenAI
client = OpenAI(
    api_key="cpk_...",
    base_url="https://e2ee-local-proxy.chutes.dev:8443/v1",
)

# Anthropic SDK
import anthropic
client = anthropic.Anthropic(
    api_key="cpk_...",
    base_url="https://e2ee-local-proxy.chutes.dev:8443",
)

The proxy supports OpenAI Chat Completions, OpenAI Responses API, and Anthropic Messages API formats. It translates between formats internally, encrypts via the same ML-KEM + ChaCha20-Poly1305 stack (implemented in a native C .so with code-level obfuscation to protect key material), and handles nonce caching, instance discovery, and automatic retry on nonce expiry.

TLS is handled via a baked-in certificate for local HTTPS, with support for custom certificates in production or self-signed certificates for development. The crypto operations run in native C with xVMP (virtual machine protection) obfuscation so that key material is never exposed as plaintext in memory outside of the protected code paths.

Both options provide identical security guarantees. The transport is simpler if you're already using the OpenAI Python SDK; the proxy is more flexible if you're using a different language or multiple SDKs.

Inside the TEE: Key Management and Protection

Inside the GPU instance, encryption keys are managed by Aegis, Chutes' runtime integrity and cryptographic library.

ML-KEM-768 keypair generation happens at instance startup inside the TEE. The private key never leaves the enclave. Per-request E2E contexts are allocated for each incoming request, providing key isolation between concurrent requests. All derived keys and intermediate state are explicitly zeroed after use.
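
The idea of a per-request context whose key material is zeroed on exit can be illustrated in a few lines of Python. This is a conceptual sketch only and says nothing about how Aegis is written: Python cannot guarantee that no other copy of the bytes exists in memory (the immutable input passed in is one such copy), which is one reason real implementations do this in native code. The scoped_key name is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def scoped_key(key_material: bytes):
    """Yield a mutable copy of the key and overwrite it with zeros on exit."""
    buf = bytearray(key_material)
    try:
        yield buf
    finally:
        # Best-effort wipe of this buffer, even if the body raised.
        for i in range(len(buf)):
            buf[i] = 0
```

Tying the wipe to the end of a scope means the key is cleared on error paths as well as on success.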

Intel TDX provides hardware memory encryption (even physical access to the server's RAM cannot extract key material), and NVIDIA CC mode hardware-encrypts GPU VRAM so that model weights and inference state are inaccessible to the host.

The entire Chutes runtime (Aegis, the inference server, the encryption middleware) runs inside the TEE. Neither the host operating system, the hypervisor, nor the hosting provider can access the enclave's memory.

What the API Can and Cannot See

It's important to be precise about the trust boundaries:

Component	Can see plaintext?	What it sees
Your machine	Yes	Your prompt and the response
Chutes API	No	Opaque ciphertext, routing headers, nonce tokens, usage metadata for billing
Network intermediaries	No	TLS-protected traffic carrying the E2E ciphertext
GPU instance (TEE)	Yes	Your prompt (after decryption) and the response (before encryption)
Host OS / hypervisor	No	Hardware-encrypted memory; cannot inspect TEE contents
Chutes engineers / support	No	No access to TEE memory; no logging of plaintext; cannot decrypt traffic

The API does see usage metadata (token counts) for billing purposes. This metadata is extracted from the response inside the TEE and sent alongside the encrypted blob in a JSON envelope. The API can read {"usage": {"prompt_tokens": 50, "completion_tokens": 200}} but cannot read the actual prompt or response content.

Streaming

E2EE streaming uses a slightly different protocol to avoid the overhead of a full ML-KEM key exchange per chunk.

The instance performs a single ML-KEM encapsulation using the client's response public key and derives a stream-specific symmetric key. The ML-KEM ciphertext is sent as the first SSE event (e2e_init). Each subsequent chunk is encrypted with ChaCha20-Poly1305 using the stream key and a fresh random nonce, sent as e2e SSE events. The client decapsulates the e2e_init ciphertext to recover the stream key, then decrypts each chunk as it arrives.

This gives you real-time streaming with the same security guarantees as non-streaming: every byte is authenticated and encrypted, with the stream key derived from a post-quantum key exchange.
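
The client side of this protocol is a small state machine: wait for e2e_init, derive the stream key once, then decrypt chunks in order. The sketch below models that logic with the cryptography injected as callables, so it runs without any ML-KEM or ChaCha20-Poly1305 library. We assume events arrive as (event_name, data) pairs with base64-encoded data; the actual SSE framing and encoding may differ, and derive_stream_key and decrypt_chunk are hypothetical parameters.

```python
import base64


def decrypt_stream(events, derive_stream_key, decrypt_chunk):
    """Yield decrypted chunks from an iterable of (event_name, data) pairs.

    derive_stream_key(kem_ciphertext) -> key   (decapsulate, then HKDF)
    decrypt_chunk(key, blob) -> plaintext      (AEAD decrypt and verify)
    """
    stream_key = None
    for name, data in events:
        if name == "e2e_init":
            if stream_key is not None:
                raise ValueError("duplicate e2e_init event")
            stream_key = derive_stream_key(base64.b64decode(data))
        elif name == "e2e":
            if stream_key is None:
                raise ValueError("e2e chunk received before e2e_init")
            yield decrypt_chunk(stream_key, base64.b64decode(data))
        # Any other event type is ignored in this sketch.
```

Rejecting a chunk that arrives before e2e_init, or a second e2e_init, keeps a malformed or tampered stream from being silently processed with the wrong key.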

Open Source

The entire E2EE architecture is open source and auditable:

Repository	Purpose
chutes-api	API server: nonce management, routing, attestation verification
chutes	TEE-side runtime: encryption middleware, key management
chutes-e2ee-transport	Python SDK transport for OpenAI client
e2ee-proxy	Local Docker proxy for any language/SDK
sek8s	TEE VM creation, Kubernetes admission controller, reproducible builds
TEE verification docs	How to independently verify attestation

Reproducible builds are essential for this model to work. If you can't rebuild the exact TEE image and compare its measurements against what's running in production, attestation is meaningless: you'd be verifying that something is running in a TEE, but not what. The sek8s repo contains everything needed to build the TDX guest images, configure the Kubernetes admission controller, and verify that the MRTD and RTMR measurements match what you'd expect from a clean build.

We believe that security claims without source code are marketing claims. If you can't inspect the implementation, you can't verify the guarantees. Every component in the E2EE pipeline (from key generation to attestation to nonce validation to encryption) is available for review.

The Standard Should Be Higher

Today, when you use most AI services, you're handing over your data and hoping for the best. You're trusting that the provider won't train on it (despite having the technical ability to do so), that their employees and contractors won't access it (despite having the credentials to do so), that their logging pipeline won't capture it (despite being configured to capture everything else), and that their infrastructure won't be compromised (despite being a high-value target).

We wouldn't accept this trust model for banking, for messaging, or for medical records. We shouldn't accept it for AI inference either.

End-to-end encryption with hardware attestation removes the need to trust the provider. Your data is protected by mathematics and hardware isolation. The provider cannot access your data, and you can verify this yourself using standard cryptographic attestation.

This should be the baseline for any AI service handling sensitive workloads. We've made it available today, as open source, with drop-in integration for existing tools.

Get started:


Python transport	`pip install chutes-e2ee`
Docker proxy	`docker run -p 8443:443 parachutes/e2ee-proxy:latest`
Documentation	docs.chutes.ai
Source	github.com/chutesai

Timon Agar
