Infer Layer AI is currently in development — an application-layer optimization stack built on top of llm-d, the Kubernetes-native distributed inference engine backed by Red Hat, IBM, and Google. We add prompt compression, semantic caching, intelligent model routing, and context pruning on top of llm-d's high-performance serving core.
Infer Layer AI sits above llm-d, the open-source Kubernetes-native inference engine backed by Red Hat, IBM, and Google. Your requests will pass through four application-layer optimizations before reaching llm-d's high-performance serving core.
Token costs are quietly blocking AI adoption. Teams self-host open-source models to avoid proprietary API pricing — then discover that the inference cost problem follows them. Prompt bloat, redundant context, and poor model selection add up fast.
Infer Layer AI is being built on top of llm-d — the Kubernetes-native distributed inference engine backed by Red Hat, IBM, and Google. llm-d handles the hard infrastructure: disaggregated prefill/decode serving, KV-cache aware routing, and high-throughput scheduling. Infer Layer AI adds the application-layer intelligence on top: prompt compression, semantic caching, model routing, and context pruning.
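To make that split concrete, here is a rough sketch of what the application layer could look like in front of the serving core. Everything in it is illustrative: the function names, the toy word-overlap cache, the length-based routing rule, and the assumption that llm-d is reached through an OpenAI-compatible chat-completions endpoint are placeholders, not the actual implementation (context pruning is omitted for brevity).

    import json
    import math
    import re
    from urllib import request

    # Placeholder URL for the serving layer (assumed OpenAI-compatible).
    LLMD_URL = "http://llm-d.example.internal/v1/chat/completions"

    def compress_prompt(prompt: str) -> str:
        """Stand-in for prompt compression: collapse whitespace, drop repeated lines."""
        seen, kept = set(), []
        for line in prompt.splitlines():
            line = re.sub(r"\s+", " ", line).strip()
            if line and line not in seen:
                seen.add(line)
                kept.append(line)
        return "\n".join(kept)

    def _bag_of_words(text: str) -> dict[str, int]:
        counts: dict[str, int] = {}
        for token in re.findall(r"\w+", text.lower()):
            counts[token] = counts.get(token, 0) + 1
        return counts

    def similarity(a: str, b: str) -> float:
        """Toy cosine similarity over word counts; a real cache would use embeddings."""
        va, vb = _bag_of_words(a), _bag_of_words(b)
        dot = sum(va[t] * vb.get(t, 0) for t in va)
        norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
        return dot / norm if norm else 0.0

    class SemanticCache:
        """Returns a stored answer when a new prompt is similar enough to an old one."""
        def __init__(self, threshold: float = 0.9):
            self.threshold = threshold
            self.entries: list[tuple[str, str]] = []

        def get(self, prompt: str) -> str | None:
            for cached_prompt, answer in self.entries:
                if similarity(prompt, cached_prompt) >= self.threshold:
                    return answer
            return None

        def put(self, prompt: str, answer: str) -> None:
            self.entries.append((prompt, answer))

    def route_model(prompt: str) -> str:
        """Toy routing rule: short prompts go to a small model, long ones to a large one."""
        return "small-model" if len(prompt) < 2000 else "large-model"

    def infer(prompt: str, cache: SemanticCache) -> str:
        prompt = compress_prompt(prompt)           # 1. prompt compression
        cached = cache.get(prompt)                 # 2. semantic cache lookup
        if cached is not None:
            return cached
        model = route_model(prompt)                # 3. model routing
        payload = json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }).encode()
        req = request.Request(LLMD_URL, data=payload,
                              headers={"Content-Type": "application/json"})
        with request.urlopen(req) as resp:         # 4. hand off to the serving core
            answer = json.load(resp)["choices"][0]["message"]["content"]
        cache.put(prompt, answer)
        return answer

In practice the cache would use a real embedding model and the router would weigh cost, latency, and task difficulty, but the shape of the pipeline stays the same: optimize the request at the application layer, then let llm-d do the heavy lifting.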
We're pre-product, talking to engineers and teams now to understand the exact workloads that hurt most before we write the first line of production code. Your input directly shapes what we build.
No open roles yet; we're a solo-founder operation at the research stage. But if you're an ML engineer, systems developer, or researcher who cares deeply about inference efficiency, we'd still love to hear from you.
Think you'd be a good fit? Reach out even if there's no open role yet.
We're in the research phase — talking to engineers and teams who deal with LLM token costs. If that's you, we'd love to hear about your setup.
We'll reply personally. No automated sequences.