About ai&

ai& is a new global AI technology company dedicated to meeting the world's growing demand for AI. Our vision is twofold: to serve as a premier AI lab specializing in localization, and to act as a global infrastructure and compute provider. We are building a unified, optimized global platform that integrates next-generation data centers and infrastructure, heterogeneous compute serving, and advanced model services. We believe that the most effective way to build and scale AI is to own the stack from top to bottom.

At ai&, we empower small teams with the autonomy needed to tackle significant challenges. Our approach is to deconstruct large problems into manageable components and solve complex issues collaboratively. We seek highly motivated, mission-driven individuals who demonstrate strong personal agency. We value curiosity as the foundation of talent, and we are looking for people eager to develop alongside our evolving technology and expanding business.

We are actively hiring worldwide, with a presence in Tokyo, San Francisco, Austin, and Toronto. We are more than happy to meet exceptional talent where they are.

As an inference & serving engineer, your objective is to build a high-performance, multi-tenant serving stack that squeezes maximum utilization out of heterogeneous hardware. This involves navigating the trade-offs between various state-of-the-art inference frameworks and engines, selecting and optimizing the right runtime for the right workload. The scope of work is not limited to Large Language Models; it extends to the frontier of Generative AI, including high-throughput Video generation and complex Multimodal systems where memory pressure and compute requirements are significantly more demanding.

Beyond just deploying models at scale, this role is responsible for building a robust system that bridges the gap between boutique, high-performance clusters and massive, multi-node deployments as the company grows. This requires a deep understanding of the "Inference Triangle": constantly tuning the stack to find the optimal equilibrium between low latency (time to first token, TTFT, and inter-token latency, ITL), high throughput, and inference quality (Precision/Quantization). The ideal candidate is a hands-on engineer who views the entire GPU fleet as a single, programmable compute fabric and is eager to get their hands dirty at every level of the stack.

Responsibilities:

Runtime Selection & Deep Optimization: Lead the evaluation, integration, and continuous tuning of diverse inference frameworks to ensure best-in-class performance across LLM, Video, and Multimodal workloads.
Latency & Throughput Engineering: Own the end-to-end performance profile of the model lifecycle, implementing advanced strategies such as disaggregated prefill/decode, speculative decoding, and continuous batching to minimize TTFT and maximize tokens per second.
Scalable Systems Evolution: Design and implement serving architectures that function seamlessly on small experimental clusters while providing a clear, robust path to massive-scale, multi-node deployments.
Advanced Memory & Cache Orchestration: Implement and optimize memory management techniques to maximize KV-cache reuse and minimize redundant computations in multi-turn or high-concurrency scenarios.
Day 0 Model Support: Work with the ecosystem to craft a Day 0 model support strategy so that our stack provides stable, high-performance support for new models when they are released.
Cross-Stack Integration: Collaborate with the Backend/Gateway and Compute Orchestration teams to ensure the inference engine’s telemetry, failure domains, and lifecycle management are perfectly aligned with the global load balancer and API layers.
Hands-on Technical Leadership: Maintain a high level of personal agency by writing production code, debugging complex distributed system "hangs," and contributing to architectural decisions in a flat, fast-moving team environment.
Collaborative Communication: Function as a primary technical peer to engineering leads, translating complex hardware and model constraints into clear product and infrastructure strategies.
Inference Strategy & Trade-offs: Define the path forward when balancing model precision and quantization against the physical limits of HBM bandwidth and compute throughput.

You may be a fit if you have the following skills:

Inference Engine: Deep experience with the internals of modern runtimes. You are a prominent contributor to inference engine ecosystems, including but not limited to OSS projects or proprietary engines at top-tier AI labs.
Multimodal Domain Knowledge: Understanding of the specific challenges involved in serving Large Language Models alongside Video and Vision-based generative models.
Scale-First Engineering: A track record of building and managing distributed systems that have evolved from small-scale proofs-of-concept to large-scale production deployments.
Great Team Spirit: A mission-driven approach to engineering, valuing clear communication, hands-on execution, and collective success over individual silos.

Member of Technical Staff - Inference Serving
