Token Minimization Architecture — xCloude.ai by MARCIANO

How it works

xCloude sits between you and the four major AI providers — Anthropic, OpenAI, Google, xAI — running every request through a compression and routing layer designed to extract more output per token spent.

The compression pipeline

A request comes in. Before it ever leaves our edge functions, it passes through a layered compression sequence: defined-terms substitution (turning recurring phrases into short tokens), corpus-adaptive context inclusion (pruning what the model already knows), duplicate-source detection (flagging when you've pasted the same document twice without realizing), and stuck-loop preemption (catching iterative prompts that aren't going to break through). The provider receives a request that's faithful to your intent but typically smaller in tokens than what you typed.

Multi-model routing

Different questions deserve different models. A factual lookup against current pricing data doesn't need Opus 4.7; Haiku 4.5 is faster, cheaper, and for that kind of question typically just as accurate. A nuanced legal analysis or strategic synthesis is where the heavier models earn their cost. The routing logic, itself a configurable layer on the Pro tier, picks based on your prompt's complexity, your historical preferences, and the per-model price and latency profile we keep current.
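A minimal sketch of how a complexity-and-price router could work. The profile names, prices, latencies, and the keyword heuristic are all placeholders invented for illustration, and historical preferences are left out; this is not the production routing logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    tier: int             # 1 = light, 3 = heavy (hypothetical scale)
    usd_per_mtok: float   # illustrative price, not a real quote
    p50_latency_ms: int   # illustrative latency


# Placeholder profiles; names and numbers are made up for the example.
PROFILES = [
    ModelProfile("light-model", tier=1, usd_per_mtok=1.0, p50_latency_ms=400),
    ModelProfile("mid-model", tier=2, usd_per_mtok=3.0, p50_latency_ms=900),
    ModelProfile("heavy-model", tier=3, usd_per_mtok=15.0, p50_latency_ms=2500),
]


def estimate_complexity(prompt: str) -> int:
    """Crude stand-in for a complexity score: 1 (lookup) to 3 (synthesis)."""
    heavy_cues = ("analyze", "synthesize", "strategy", "legal")
    lowered = prompt.lower()
    if any(cue in lowered for cue in heavy_cues):
        return 3
    return 2 if len(lowered.split()) > 60 else 1


def pick_model(prompt: str, profiles=PROFILES) -> ModelProfile:
    """Return the cheapest profile whose tier covers the estimated complexity."""
    needed = estimate_complexity(prompt)
    eligible = [p for p in profiles if p.tier >= needed]
    return min(eligible, key=lambda p: (p.usd_per_mtok, p.p50_latency_ms))
```

With these placeholder profiles, a short pricing question lands on the light model and a prompt containing "legal" or "analyze" escalates to the heavy one.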

Lifecycle awareness

Models change. Anthropic deprecates a Sonnet variant. Google retires Gemini 1.x and silently returns 404 to anyone still pointing at it. xAI announces a six-day window before Grok Imagine Pro becomes Grok Imagine Quality. Most platforms find out when their users complain. We find out when our model_lifecycle table tells us — and we route around the deprecation before it touches your prompt.
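Routing around a retired model can be sketched as a successor lookup. The table name model_lifecycle comes from the text above; the columns (retires_on, successor) and the sample rows are assumptions made for this example.

```python
from datetime import date

# Hypothetical rows; the real model_lifecycle schema is not described here.
MODEL_LIFECYCLE = {
    "old-model": {"retires_on": date(2025, 1, 15), "successor": "new-model"},
    "new-model": {"retires_on": None, "successor": None},
}


def resolve_model(requested: str, today: date, table=MODEL_LIFECYCLE) -> str:
    """Follow successor links until reaching a model that is still live."""
    seen = set()
    current = requested
    while current in table and current not in seen:
        seen.add(current)  # guards against a cycle in successor links
        row = table[current]
        retired = row["retires_on"] is not None and today >= row["retires_on"]
        if not retired or row["successor"] is None:
            return current
        current = row["successor"]
    return current  # unknown models pass through unchanged
```

Before the retirement date the requested model is returned unchanged; on or after it, the call is redirected to the successor.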

"Token efficiency isn't a feature. It's the entire business model."

Architecture stack

Vercel frontend (Next.js + static HTML), single-page interactive surfaces
Supabase Postgres with row-level security on every user-scoped table
Three core edge functions: chat, multimodel, imagegen — each instrumented for full event logging
RAG-ready knowledge base with pgvector for project memory and corpus reuse
OAuth via Google + Spotify; Apple in queue; SSO for enterprise on the roadmap
Failure tracking via model_events, model_lifecycle, and isolated CSAM alerting in csam_alerts

The patent posture

The compression-strategy layer is patent-pending. Several of the sixteen strategies are operable today against any provider with zero provider buy-in: they work for you whether or not Anthropic, OpenAI, Google, or xAI ever endorses them. The strategic frame for provider conversations is capacity extension, not licensing: a 40-80% efficiency improvement extends the useful life of every data center they've built. We welcome conversations with provider partnership teams.

What we measure

Every chat call, every compare query, every image generation, every refusal. Logged to a database you can query. Some reports are open to all signed-in users. The advanced and historical analytics are Pro-tier.

Live model health

Free

Real-time status across the four providers. Last-known-good model versions, average latency, recent error rates.

Your monthly token spend

Free

Your usage broken down by model, by surface (chat / compare / image), by date. Compared against what the same prompts would have cost going direct.

Per-prompt compression receipt

Free

After every call: original token count, compressed token count, savings in dollars. Surfaced inline in the chat UI.
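The arithmetic behind a receipt is simple. A sketch, assuming a single flat input price per million tokens (usd_per_mtok is a hypothetical field; real pricing may differ by model and by direction):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    original_tokens: int
    compressed_tokens: int
    usd_per_mtok: float  # assumed input price per million tokens

    @property
    def tokens_saved(self) -> int:
        return self.original_tokens - self.compressed_tokens

    @property
    def percent_saved(self) -> float:
        if self.original_tokens == 0:
            return 0.0
        return 100.0 * self.tokens_saved / self.original_tokens

    @property
    def usd_saved(self) -> float:
        return self.tokens_saved * self.usd_per_mtok / 1_000_000


# Example: a 12,000-token prompt compressed to 7,800 tokens at $3 per million.
receipt = Receipt(original_tokens=12_000, compressed_tokens=7_800, usd_per_mtok=3.0)
```

In this example the receipt reports 4,200 tokens saved, a 35% reduction.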

Asymmetric refusal analytics

Pro

Where do the four major providers refuse the same prompt differently? Which models refuse "female X" but permit "male X"? Which providers refuse historical content the others answer? Aggregated weekly with diff against each provider's posted policy on the relevant date.

Model deprecation early warning

Pro

When a provider's model starts behaving differently — answer drift, latency creep, refusal pattern shift — we see it in the event log before they announce it. Pro subscribers get the alert; everyone else finds out from the broken production app.

Custom benchmark suites

Pro

Run your own prompt set against all four providers, on a schedule, and watch the response quality drift over time. Useful if your domain (tax, medical, legal, insurance) has a sensitivity the public benchmarks don't capture.

Provider TOS snapshot timeline

Pro

Every change to every provider's published acceptable use policy, captured the day it changed, indexed by topic. Line up the policy timeline against your own refusal timeline. Defensible primary-source archive.

The sixteen strategies

Compression is not one trick. It's a portfolio. The strategies stack: applying the first five typically yields a 25-40% reduction; applying the full sixteen plus the routing layer is where the 40-80% range lives. Five are open and available to all signed-in users. Eleven are Pro-tier.

STR-01

Defined-Terms Dictionary

Free

Recurring multi-word phrases get bound to short tokens at session start. "Renewable energy investment tax credit" becomes {ITC}. The model sees the short form; you see the long one.
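A minimal sketch of the substitution step, assuming the dictionary is a plain phrase-to-token mapping and matching is case-insensitive. The function names are illustrative, not the product's API.

```python
import re


def build_dictionary(phrases_to_tokens: dict[str, str]):
    """Compile one pattern for all phrases, longest first to avoid shadowing."""
    ordered = sorted(phrases_to_tokens, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in ordered), re.IGNORECASE)
    lookup = {p.lower(): t for p, t in phrases_to_tokens.items()}
    return pattern, lookup


def compress(text: str, pattern, lookup) -> str:
    """Replace each known phrase with its short token (what the model sees)."""
    if not lookup:
        return text
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


def expand(text: str, phrases_to_tokens: dict[str, str]) -> str:
    """Restore the long form for display (what the user sees)."""
    for phrase, token in phrases_to_tokens.items():
        text = text.replace(token, phrase)
    return text


terms = {"renewable energy investment tax credit": "{ITC}"}
pattern, lookup = build_dictionary(terms)
short = compress("The Renewable Energy Investment Tax Credit applies.", pattern, lookup)
```

In practice the model also needs the binding stated once, for example a glossary line at session start, so that the short token is unambiguous.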

STR-02

Prompt Macros

Free

User-defined templates with variable injection. /audit-memo {company} expands to a structured 800-token request that consistently produces a usable analysis.
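Macro expansion can be sketched as a template lookup with named parameters. The template body below is a short stand-in; the real /audit-memo expansion is not shown in the text, and the MACROS structure is an assumption.

```python
import shlex

# Hypothetical macro table: name -> (parameter names, template body).
MACROS = {
    "audit-memo": (
        ["company"],
        "Write an audit memo for {company}. Cover scope, findings, and recommendations.",
    ),
}


def expand_macro(command: str, macros=MACROS) -> str:
    """Expand '/name arg ...' into its template; pass ordinary prompts through."""
    if not command.startswith("/"):
        return command
    name, *args = shlex.split(command[1:])  # quotes group multi-word arguments
    if name not in macros:
        raise KeyError(f"unknown macro: {name}")
    params, template = macros[name]
    if len(args) != len(params):
        raise ValueError(f"/{name} expects {len(params)} argument(s), got {len(args)}")
    return template.format(**dict(zip(params, args)))
```

Here `/audit-memo "Acme Corp"` expands to the full request with the company name injected, while a prompt that does not start with a slash is returned unchanged.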

STR-03

Duplicate Source Detection

Free

Hashes every uploaded document. If you paste the same PDF twice in a session, or a near-identical version, the second copy is dropped from the request and replaced with a short note telling the model it has already seen that document.
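A sketch of the exact-duplicate case, assuming documents arrive as extracted text and that whitespace and case differences should not count as changes. Catching genuinely near-identical versions would need a fuzzier comparison (for example shingling), which this example does not attempt. The class and method names are illustrative.

```python
import hashlib


class SessionSources:
    """Tracks documents seen in one session by content hash."""

    def __init__(self):
        self._seen = {}  # digest -> label of the first upload

    @staticmethod
    def _digest(text: str) -> str:
        # Normalize whitespace and case so trivial differences hash the same.
        normalized = " ".join(text.split()).lower()
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def add(self, label: str, text: str) -> str:
        """Return the text to send, or a short back-reference for a repeat."""
        digest = self._digest(text)
        if digest in self._seen:
            return f"[{label} duplicates {self._seen[digest]}, already provided above]"
        self._seen[digest] = label
        return text
```

The back-reference costs a handful of tokens instead of the full document.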

STR-04

Stuck-Loop Detection

Free

When a prompt gets refused or returns junk three times in a row with the same approach, the engine flags it. You won't burn another 2,000 tokens on the fourth attempt.
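One way to express the three-strikes rule is a fixed-size window over recent attempts. How "the same approach" is identified is not specified above; this sketch assumes each attempt carries a caller-supplied approach_key.

```python
from collections import deque


class StuckLoopDetector:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self._recent = deque(maxlen=threshold)  # keeps only the last N attempts

    def record(self, approach_key: str, succeeded: bool) -> None:
        """Log one attempt; approach_key identifies 'the same approach'."""
        self._recent.append((approach_key, succeeded))

    def is_stuck(self) -> bool:
        """True when the last N attempts all failed with the same approach."""
        if len(self._recent) < self.threshold:
            return False
        keys = {key for key, _ in self._recent}
        return len(keys) == 1 and not any(ok for _, ok in self._recent)
```

Calling is_stuck() before the fourth send is what would let the engine flag the loop instead of spending the tokens.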

STR-05

Connection Quality Pre-flight

Free

Before sending a 50K-token request, we check whether your network can sustain the round-trip. That keeps the call from timing out partway through while the provider still charges you for it.

STR-06

Corpus-Adaptive Inclusion

Pro

Pruning context the model already demonstrably knows. Never-remove rules guarantee critical anchors stay.

STR-07

Source-Quality Routing

Pro

Different sources are worth different amounts of context budget. Primary sources get full inclusion; aggregators get summary inclusion; low-quality sources get cited but not included.
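The three inclusion modes can be sketched as a simple dispatch on a quality label. The labels ("primary", "aggregator", "low") and the dictionary fields are assumptions for the example; how sources are actually graded is not described here.

```python
def include_source(source: dict) -> str:
    """Choose how much of a source enters the context, by quality label."""
    quality = source["quality"]  # assumed labels: "primary", "aggregator", "low"
    if quality == "primary":
        return source["text"]       # full inclusion
    if quality == "aggregator":
        return source["summary"]    # summary inclusion
    return f"[cited, not included: {source['title']}]"


def build_context(sources: list[dict]) -> str:
    """Join the chosen representation of every source into one context block."""
    return "\n\n".join(include_source(s) for s in sources)


sources = [
    {"title": "Statute", "quality": "primary", "text": "Full statute text.", "summary": "Statute summary."},
    {"title": "News roundup", "quality": "aggregator", "text": "Long roundup.", "summary": "Roundup summary."},
    {"title": "Forum post", "quality": "low", "text": "Long thread.", "summary": "Thread summary."},
]
context = build_context(sources)
```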

STR-08

Unnecessary Tool-Use Prevention

Pro

Models love to call search even when the answer is in their training data. The router suppresses the call when the question doesn't need fresh data.

STR-09

Abort Preservation

Pro

When a generation fails or is aborted partway, we save the partial work and the prompt context so the next attempt resumes instead of starting over.

STR-10

Deliverable-Type Enforcement

Pro

Asking for a memo? The model's tendency to drift into expository preamble is suppressed. You get the deliverable, not the warm-up.

STR-11

Defined-Term Hierarchy

Pro

Multi-tier dictionaries that compose. Project-level terms inherit from organization-level terms which inherit from industry-level terms.
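The inheritance chain maps naturally onto layered lookups. A sketch using Python's standard ChainMap, with made-up terms at each level; the most specific level wins when the same phrase is defined more than once.

```python
from collections import ChainMap


def resolve_terms(project: dict, organization: dict, industry: dict) -> ChainMap:
    """Layer the dictionaries so project overrides organization overrides industry."""
    return ChainMap(project, organization, industry)


# Illustrative terms only.
industry = {"investment tax credit": "{ITC}", "power purchase agreement": "{PPA}"}
organization = {"power purchase agreement": "{PPA-2024}"}
project = {"Sunrise Ridge solar farm": "{SR}"}

terms = resolve_terms(project, organization, industry)
```

Here the organization's definition of "power purchase agreement" shadows the industry default, while "investment tax credit" is inherited untouched.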

STR-12

Cross-Session Memory Compression

Pro

Long-running projects accumulate state. Instead of replaying it, we maintain a compressed running summary that gets refreshed when material new context arrives.

STR-13

Provider Preamble Stripping

Pro

"I'd be happy to help you with that..." Every provider has a tic. We strip them before they hit your bill.

STR-14

Speculative Routing

Pro

For ambiguous prompts, send a short test to a fast model first. If the answer is good enough, ship it. Otherwise escalate to the heavy tier with the test result as priming context.
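The escalate-on-failure flow can be sketched with the two models and the quality check passed in as plain callables. All three are stand-ins: how "good enough" is judged is not specified above, and the priming wording is invented for the example.

```python
from typing import Callable, Tuple


def speculative_route(
    prompt: str,
    fast_model: Callable[[str], str],
    heavy_model: Callable[[str], str],
    good_enough: Callable[[str, str], bool],
) -> Tuple[str, str]:
    """Try the fast model first; escalate with its draft as priming context."""
    draft = fast_model(prompt)
    if good_enough(prompt, draft):
        return draft, "fast"
    primed = (
        f"{prompt}\n\n"
        f"A faster model drafted this answer; improve on it:\n{draft}"
    )
    return heavy_model(primed), "heavy"
```

The second element of the result records which tier answered, which is the kind of detail a per-prompt receipt could surface.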

STR-15

Image Iteration Compression

Pro

Iterating on a generated image? We don't resend the seed prompt. We send the delta against the previous version.

STR-16

Refusal Reroute

Pro

When a model refuses a prompt that another provider would have permitted, the engine offers a one-click reroute to a model that can handle it. No retyping.

We make money when you save tokens.

Compression

Routing

Lifecycle

Receipts

This is what xCloude.ai is, and what it wants to be.