Pre-Launch · LLM Inference Optimization

LLM inference optimization.
Significantly cheaper.

Infer Layer AI is currently in development — an application-layer optimization stack built on top of llm-d, the Kubernetes-native distributed inference engine backed by Red Hat, IBM, and Google. We add prompt compression, semantic caching, intelligent model routing, and context pruning on top of llm-d's high-performance serving core.

Core infrastructure: ⚙ llm-d (Red Hat · IBM · Google)
Planned model support: Llama 4 · Mistral 7B · Mixtral 8x7B · DeepSeek V3 · Qwen3 · Phi-4 + more

Four techniques.
Powered by llm-d.

Infer Layer AI sits above llm-d — the battle-tested Kubernetes inference engine from Red Hat, IBM, and Google. Your requests pass through four application-layer optimizations before reaching llm-d's high-performance serving core.

01
Prompt Compression
Long prompts and system messages are semantically compressed before being sent to the model. Redundant phrasing, repetition, and low-signal content are removed. The model receives a shorter, structurally equivalent prompt — same task, fewer input tokens.
Reduces input tokens
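
To make the idea concrete, here is a deliberately naive Python sketch: drop exact-duplicate sentences and a few filler phrases, keep the task intact. The FILLER list and the compress_prompt helper are hypothetical illustrations, not our production compressor (which doesn't exist yet).

```python
import re

# Naive illustration of prompt compression: drop exact-duplicate
# sentences and low-signal filler, keep the task itself intact.
# The filler list below is a hypothetical example.
FILLER = [
    r"\bplease note that\b",
    r"\bit is important to note that\b",
    r"\bas mentioned (?:above|before|earlier),?\b",
]

def compress_prompt(prompt: str) -> str:
    seen: set[str] = set()
    kept: list[str] = []
    for sentence in re.split(r"(?<=[.!?])\s+", prompt):
        key = sentence.strip().lower()
        if key and key not in seen:          # drop exact repeats
            seen.add(key)
            kept.append(sentence.strip())
    text = " ".join(kept)
    for pattern in FILLER:                   # strip filler phrasing
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", text).strip()

print(compress_prompt(
    "Summarize the report. Please note that the report is long. "
    "Summarize the report. Focus on revenue."
))
# -> Summarize the report. the report is long. Focus on revenue.
```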
02
Semantic Caching
Responses are cached by semantic meaning, not just exact text. When a new prompt is semantically equivalent to a cached one — even if worded differently — the cached response is returned instantly. Zero tokens consumed for repeat queries.
Eliminates repeated cost
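
A toy version of the mechanism, for illustration only: the embed stand-in below hashes words into a bag-of-words vector, so it only catches near-duplicates. A real deployment would swap in a sentence-embedding model to catch genuine paraphrases.

```python
import re
import numpy as np

# Toy embedding: hash words into a normalized bag-of-words vector.
# A real cache would use a sentence-embedding model here.
def embed(text: str) -> np.ndarray:
    vec = np.zeros(256)
    for word in re.findall(r"\w+", text.lower()):
        vec[hash(word) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []

    def get(self, prompt: str) -> str | None:
        query = embed(prompt)
        for vec, response in self.entries:
            if float(query @ vec) >= self.threshold:  # cosine (unit vectors)
                return response                        # hit: zero tokens spent
        return None

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

cache = SemanticCache()
cache.put("What is the capital of France?", "Paris.")
# Same words, different casing and punctuation -> similarity 1.0 -> hit.
print(cache.get("what is the capital of France"))  # -> Paris.
```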
03
Model Routing
Not every task needs a 70B model. Infer Layer AI will classify each request by complexity and route simple tasks (classification, extraction, short Q&A) to smaller, cheaper models like Mistral 7B or Phi-4. Complex reasoning goes to larger models like Llama 4 or DeepSeek V3.
Right model for each task
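
Sketched crudely (the real classifier is yet to be designed), routing can be as simple as a complexity check in front of a model table. The hint list and model labels below are illustrative, not a shipped API.

```python
# Crude stand-in for the planned complexity classifier.
SMALL_MODEL = "mistral-7b"   # classification, extraction, short Q&A
LARGE_MODEL = "llama-4"      # multi-step reasoning, long-form generation

SIMPLE_HINTS = ("classify", "extract", "label", "yes or no", "which category")

def route(prompt: str) -> str:
    p = prompt.lower()
    # Short prompts with an obviously simple verb go to the cheap model.
    if len(p.split()) < 40 and any(hint in p for hint in SIMPLE_HINTS):
        return SMALL_MODEL
    return LARGE_MODEL

print(route("Classify this ticket as billing or support: I was charged twice."))
# -> mistral-7b
print(route("Draft a migration plan for moving our monolith to microservices."))
# -> llama-4
```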
04
Context Pruning
In multi-turn conversations, context windows grow fast and most history becomes irrelevant. Infer Layer AI prunes the conversation context to include only the turns and information that are semantically relevant to the current query — keeping windows lean without losing continuity.
Keeps context lean
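
A minimal sketch, using Jaccard word overlap as a cheap stand-in for semantic relevance; a production pruner would score turns with embeddings. The prune helper, its protected-turns policy, and the toy history are all illustrative.

```python
import re

def words(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def prune(history: list[dict], query: str, keep: int = 1) -> list[dict]:
    target = words(query)
    def score(turn: dict) -> float:
        w = words(turn["content"])
        return len(w & target) / (len(w | target) or 1)  # Jaccard overlap
    # System message and the two most recent turns always survive.
    protected = {id(t) for t in history if t["role"] == "system"}
    protected |= {id(t) for t in history[-2:]}
    # Older turns compete on relevance; keep only the top `keep`.
    candidates = [t for t in history if id(t) not in protected]
    top = sorted(candidates, key=score, reverse=True)[:keep]
    chosen = protected | {id(t) for t in top}
    return [t for t in history if id(t) in chosen]  # original order preserved

history = [
    {"role": "system", "content": "You are a billing assistant."},
    {"role": "user", "content": "How do I change my avatar?"},
    {"role": "assistant", "content": "Open profile settings."},
    {"role": "user", "content": "My invoice shows a duplicate charge."},
    {"role": "assistant", "content": "I can see two identical charges."},
    {"role": "user", "content": "Refund the duplicate invoice charge."},
]
# Avatar chit-chat is dropped; the relevant billing turn survives.
print([t["content"] for t in prune(history, "Refund the duplicate invoice charge.")])
```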
Request flow
01 · 📨 Your App: sends the request to the Infer Layer endpoint
02 · 🗜️ Compress: prompt and context reduced
03 · Cache Check: return instantly if matched
04 · 🔀 Route: select the optimal model size
05 · ⚙️ llm-d: disaggregated serving, KV-cache routing
06 · 🤖 Open Source Model: Llama, Mistral, DeepSeek, Qwen and more
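
Wired together, the flow might look like the hypothetical handler below. It assumes the llm-d gateway exposes an OpenAI-compatible /v1/chat/completions route, as vLLM-based stacks typically do; the URL and the one-line stage stubs are placeholders for the sketches above.

```python
import requests

LLM_D_URL = "http://llm-d-gateway.local/v1/chat/completions"  # placeholder URL

# One-line stubs standing in for the four sketches above.
def compress(p: str) -> str: return p.strip()
def cache_lookup(p: str) -> str | None: return None
def prune(h: list[dict], p: str) -> list[dict]: return h[-4:]
def route(p: str) -> str: return "mistral-7b" if len(p.split()) < 40 else "llama-4"

def handle(prompt: str, history: list[dict]) -> str:
    prompt = compress(prompt)                       # 02: fewer input tokens
    if (hit := cache_lookup(prompt)) is not None:   # 03: cache short-circuit
        return hit
    messages = prune(history, prompt) + [{"role": "user", "content": prompt}]
    resp = requests.post(                           # 04-05: routed call into llm-d
        LLM_D_URL,
        json={"model": route(prompt), "messages": messages},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```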

Built for teams
spending on tokens.

🚀
SaaS Startups
Building AI features and watching token costs grow as you scale — Infer Layer AI is being designed to sit between your app and the model, cutting waste at the application layer.
🏢
Enterprise AI Teams
Running high-volume LLM workloads internally. Cost per request compounds fast at scale — we're building specifically for teams where inference spend is already a budget line item.
🧑‍💻
Individual Developers
Self-hosting Llama, Mistral, or DeepSeek and hitting context and throughput limits faster than expected. Infer Layer AI will help you get more out of your existing hardware without upgrading.
🔬
ML Teams
Running inference pipelines, eval loops, or fine-tuning data generation at volume. We're building to eliminate the redundant token spend that silently inflates experiment costs.

Why we're
building this.

Architecture
Layer 2
Infer Layer AI — application-layer optimizations: prompt compression, semantic caching, model routing, context pruning.
Layer 1
llm-d — Kubernetes-native distributed inference. Disaggregated prefill/decode, KV-cache routing, high-throughput scheduling. Backed by Red Hat, IBM & Google.
Models
Llama 4, Mistral, Mixtral, DeepSeek V3, Qwen3, Phi-4 and more open-source models.

Token costs are quietly blocking AI adoption. Teams self-host open-source models to avoid proprietary API pricing — then discover that the inference cost problem follows them. Prompt bloat, redundant context, and poor model selection add up fast.

Infer Layer AI is being built on top of llm-d — the Kubernetes-native distributed inference engine backed by Red Hat, IBM, and Google. llm-d handles the hard infrastructure: disaggregated prefill/decode serving, KV-cache aware routing, and high-throughput scheduling. Infer Layer AI adds the application-layer intelligence on top: prompt compression, semantic caching, model routing, and context pruning.

We're pre-product. We're talking to engineers and teams now to understand the exact workloads that hurt most — before we write the first line of production code. Your inquiry directly shapes what we build.

We're just getting
started.

No open roles yet — we're a solo founder at the research stage. But if you're an ML engineer, systems developer, or researcher who cares deeply about inference efficiency, we'd still love to hear from you.

🌟
Lead AI Engineer
Lead the design and implementation of Infer Layer AI's core optimization stack. Strong background in LLMs, inference systems, and production ML. Comfortable owning technical direction from day one.
Future opening
⚙️
Inference Engineer
Deep knowledge of LLM serving, vLLM, or llm-d. Experience optimizing throughput and latency at the infrastructure level.
Future opening
🧠
ML Researcher
Background in prompt compression, context distillation, or semantic similarity. Interest in applying research to real production inference problems.
Future opening
🛠️
Backend / API Engineer
Experience building developer-facing APIs and SDKs. Comfortable working close to the metal with Python and Kubernetes-based systems.
Future opening

Think you'd be a good fit? Reach out even if there's no open role yet.

Interested?
Say hello.

We're in the research phase — talking to engineers and teams who deal with LLM token costs. If that's you, we'd love to hear about your setup.

We'll reply personally. No automated sequences.

01
Shape the product. We're pre-build. The workloads and pain points you describe today directly influence what we prioritize and how the system is designed.
02
First access. People who reach out now get priority when we launch — ahead of general availability.
03
Pricing input. We haven't set pricing yet. Early conversations directly inform what model makes sense for different usage patterns.
04
Direct line. You're talking to the person building this — not a sales rep.