Pre-Launch · LLM Inference Optimization

LLM inference optimization.
Significantly cheaper.

Infer Layer AI is currently in development — an application-layer optimization stack built on top of llm-d, the Kubernetes-native distributed inference engine backed by Red Hat, IBM, and Google. We add prompt compression, semantic caching, intelligent model routing, and context pruning on top of llm-d's high-performance serving core.

Core infrastructure: ⚙ llm-d (Red Hat · IBM · Google)
Planned model support: Llama 4 · Mistral 7B · Mixtral 8x7B · DeepSeek V3 · Qwen3 · Phi-4 + more

Four techniques.
Powered by llm-d.

Infer Layer AI sits above llm-d — the battle-tested Kubernetes inference engine from Red Hat, IBM, and Google. Your requests pass through four application-layer optimizations before reaching llm-d's high-performance serving core.

01
Prompt Compression
Long prompts and system messages are semantically compressed before being sent to the model. Redundant phrasing, repetition, and low-signal content are removed. The model receives a shorter, structurally equivalent prompt — same task, fewer input tokens.
Reduces input tokens
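
To make the idea concrete, here is a deliberately naive Python sketch: drop exact-duplicate sentences and a few filler phrases, keep the task intact. The FILLER list and the compress_prompt helper are hypothetical illustrations, not our production compressor (which doesn't exist yet).

```python
import re

# Naive illustration of prompt compression: drop exact-duplicate
# sentences and low-signal filler, keep the task itself intact.
# The filler list below is a hypothetical example.
FILLER = [
    r"\bplease note that\b",
    r"\bit is important to note that\b",
    r"\bas mentioned (?:above|before|earlier),?\b",
]

def compress_prompt(prompt: str) -> str:
    seen: set[str] = set()
    kept: list[str] = []
    for sentence in re.split(r"(?<=[.!?])\s+", prompt):
        key = sentence.strip().lower()
        if key and key not in seen:          # drop exact repeats
            seen.add(key)
            kept.append(sentence.strip())
    text = " ".join(kept)
    for pattern in FILLER:                   # strip filler phrasing
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", text).strip()

print(compress_prompt(
    "Summarize the report. Please note that the report is long. "
    "Summarize the report. Focus on revenue."
))
# -> Summarize the report. the report is long. Focus on revenue.
```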
02
Semantic Caching
Responses are cached by semantic meaning, not just exact text. When a new prompt is semantically equivalent to a cached one — even if worded differently — the cached response is returned instantly. Zero tokens consumed for repeat queries.
Eliminates repeated cost
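
A toy version of the mechanism, for illustration only: the embed stand-in below hashes words into a bag-of-words vector, so it only catches near-duplicates. A real deployment would swap in a sentence-embedding model to catch genuine paraphrases.

```python
import re
import numpy as np

# Toy embedding: hash words into a normalized bag-of-words vector.
# A real cache would use a sentence-embedding model here.
def embed(text: str) -> np.ndarray:
    vec = np.zeros(256)
    for word in re.findall(r"\w+", text.lower()):
        vec[hash(word) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []

    def get(self, prompt: str) -> str | None:
        query = embed(prompt)
        for vec, response in self.entries:
            if float(query @ vec) >= self.threshold:  # cosine (unit vectors)
                return response                        # hit: zero tokens spent
        return None

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

cache = SemanticCache()
cache.put("What is the capital of France?", "Paris.")
# Same words, different casing and punctuation -> similarity 1.0 -> hit.
print(cache.get("what is the capital of France"))  # -> Paris.
```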
03
Model Routing
Not every task needs a 70B model. Infer Layer AI will classify each request by complexity and route simple tasks (classification, extraction, short Q&A) to smaller, cheaper models like Mistral 7B or Phi-4. Complex reasoning goes to larger models like Llama 4 or DeepSeek V3.
Right model for each task
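
Sketched crudely (the real classifier is yet to be designed), routing can be as simple as a complexity check in front of a model table. The hint list and model labels below are illustrative, not a shipped API.

```python
# Crude stand-in for the planned complexity classifier.
SMALL_MODEL = "mistral-7b"   # classification, extraction, short Q&A
LARGE_MODEL = "llama-4"      # multi-step reasoning, long-form generation

SIMPLE_HINTS = ("classify", "extract", "label", "yes or no", "which category")

def route(prompt: str) -> str:
    p = prompt.lower()
    # Short prompts with an obviously simple verb go to the cheap model.
    if len(p.split()) < 40 and any(hint in p for hint in SIMPLE_HINTS):
        return SMALL_MODEL
    return LARGE_MODEL

print(route("Classify this ticket as billing or support: I was charged twice."))
# -> mistral-7b
print(route("Draft a migration plan for moving our monolith to microservices."))
# -> llama-4
```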
04
Context Pruning
In multi-turn conversations, context windows grow fast and most history becomes irrelevant. Infer Layer AI prunes the conversation context to include only the turns and information that are semantically relevant to the current query — keeping windows lean without losing continuity.
Keeps context lean
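
A minimal sketch, using Jaccard word overlap as a cheap stand-in for semantic relevance; a production pruner would score turns with embeddings. The prune helper, its protected-turns policy, and the toy history are all illustrative.

```python
import re

def words(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def prune(history: list[dict], query: str, keep: int = 1) -> list[dict]:
    target = words(query)
    def score(turn: dict) -> float:
        w = words(turn["content"])
        return len(w & target) / (len(w | target) or 1)  # Jaccard overlap
    # System message and the two most recent turns always survive.
    protected = {id(t) for t in history if t["role"] == "system"}
    protected |= {id(t) for t in history[-2:]}
    # Older turns compete on relevance; keep only the top `keep`.
    candidates = [t for t in history if id(t) not in protected]
    top = sorted(candidates, key=score, reverse=True)[:keep]
    chosen = protected | {id(t) for t in top}
    return [t for t in history if id(t) in chosen]  # original order preserved

history = [
    {"role": "system", "content": "You are a billing assistant."},
    {"role": "user", "content": "How do I change my avatar?"},
    {"role": "assistant", "content": "Open profile settings."},
    {"role": "user", "content": "My invoice shows a duplicate charge."},
    {"role": "assistant", "content": "I can see two identical charges."},
    {"role": "user", "content": "Refund the duplicate invoice charge."},
]
# Avatar chit-chat is dropped; the relevant billing turn survives.
print([t["content"] for t in prune(history, "Refund the duplicate invoice charge.")])
```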
Request flow
01 · 📨 Your App: sends the request to the Infer Layer endpoint
02 · 🗜️ Compress: prompt and context reduced
03 · Cache Check: return instantly if matched
04 · 🔀 Route: select the optimal model size
05 · ⚙️ llm-d: disaggregated serving, KV-cache routing
06 · 🤖 Open Source Model: Llama, Mistral, DeepSeek, Qwen and more
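
Wired together, the flow might look like the hypothetical handler below. It assumes the llm-d gateway exposes an OpenAI-compatible /v1/chat/completions route, as vLLM-based stacks typically do; the URL and the one-line stage stubs are placeholders for the sketches above.

```python
import requests

LLM_D_URL = "http://llm-d-gateway.local/v1/chat/completions"  # placeholder URL

# One-line stubs standing in for the four sketches above.
def compress(p: str) -> str: return p.strip()
def cache_lookup(p: str) -> str | None: return None
def prune(h: list[dict], p: str) -> list[dict]: return h[-4:]
def route(p: str) -> str: return "mistral-7b" if len(p.split()) < 40 else "llama-4"

def handle(prompt: str, history: list[dict]) -> str:
    prompt = compress(prompt)                       # 02: fewer input tokens
    if (hit := cache_lookup(prompt)) is not None:   # 03: cache short-circuit
        return hit
    messages = prune(history, prompt) + [{"role": "user", "content": prompt}]
    resp = requests.post(                           # 04-05: routed call into llm-d
        LLM_D_URL,
        json={"model": route(prompt), "messages": messages},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```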

Built for teams
spending on tokens.

🚀
SaaS Startups
Building AI features and watching token costs grow as you scale — Infer Layer AI is being designed to sit between your app and the model, cutting waste at the application layer.
🏢
Enterprise AI Teams
Running high-volume LLM workloads internally. Cost per request compounds fast at scale — we're building specifically for teams where inference spend is already a budget line item.
🧑‍💻
Individual Developers
Self-hosting Llama, Mistral, or DeepSeek and hitting context and throughput limits faster than expected. Infer Layer AI will help you get more out of your existing hardware without upgrading.
🔬
ML Teams
Running inference pipelines, eval loops, or fine-tuning data generation at volume. We're building to eliminate the redundant token spend that silently inflates experiment costs.

Why we're
building this.

Architecture
Layer 2
Infer Layer AI — application-layer optimizations: prompt compression, semantic caching, model routing, context pruning.
Layer 1
llm-d — Kubernetes-native distributed inference. Disaggregated prefill/decode, KV-cache routing, high-throughput scheduling. Backed by Red Hat, IBM & Google.
Models
Llama 4, Mistral, Mixtral, DeepSeek V3, Qwen3, Phi-4 and more open-source models.

Token costs are quietly blocking AI adoption. Teams self-host open-source models to avoid proprietary API pricing — then discover that the inference cost problem follows them. Prompt bloat, redundant context, and poor model selection add up fast.

Infer Layer AI is being built on top of llm-d — the Kubernetes-native distributed inference engine backed by Red Hat, IBM, and Google. llm-d handles the hard infrastructure: disaggregated prefill/decode serving, KV-cache aware routing, and high-throughput scheduling. Infer Layer AI adds the application-layer intelligence on top: prompt compression, semantic caching, model routing, and context pruning.

We're pre-product. We're talking to engineers and teams now to understand the exact workloads that hurt most — before we write the first line of production code. Your inquiry directly shapes what we build.

We're just getting
started.

No open roles yet — we're a solo founder at the research stage. But if you're an ML engineer, systems developer, or researcher who cares deeply about inference efficiency, we'd still love to hear from you.

🌟
Lead AI Engineer
Lead the design and implementation of Infer Layer AI's core optimization stack. Strong background in LLMs, inference systems, and production ML. Comfortable owning technical direction from day one.
Future opening
⚙️
Inference Engineer
Deep knowledge of LLM serving, vLLM, or llm-d. Experience optimizing throughput and latency at the infrastructure level.
Future opening
🧠
ML Researcher
Background in prompt compression, context distillation, or semantic similarity. Interest in applying research to real production inference problems.
Future opening
🛠️
Backend / API Engineer
Experience building developer-facing APIs and SDKs. Comfortable working close to the metal with Python and Kubernetes-based systems.
Future opening

Think you'd be a good fit? Reach out even if there's no open role yet.

Interested?
Say hello.

We're in the research phase — talking to engineers and teams who deal with LLM token costs. If that's you, we'd love to hear about your setup.

We'll reply personally. No automated sequences.

01
Shape the product. We're pre-build. The workloads and pain points you describe today directly influence what we prioritize and how the system is designed.
02
First access. People who reach out now get priority when we launch — ahead of general availability.
03
Pricing input. We haven't set pricing yet. Early conversations directly inform what model makes sense for different usage patterns.
04
Direct line. You're talking to the person building this — not a sales rep.