About ai&

ai& is a new global AI technology company dedicated to meeting the world's growing demand for AI. Our vision is twofold: to serve as a premier AI lab specializing in localization, and to act as a global infrastructure and compute provider. We are building a unified, optimized global platform that integrates next-generation data centers and infrastructure, heterogeneous compute serving, and advanced model services. We believe that the most effective way to build and scale AI is to own the stack from top to bottom.

At ai&, we empower small teams with the autonomy needed to tackle significant challenges. Our approach is to deconstruct large problems into manageable components and solve complex issues collaboratively. We seek highly motivated, mission-driven individuals who demonstrate strong personal agency. We value curiosity as the foundation of talent, and we are looking for people eager to develop alongside our evolving technology and expanding business.

We are actively hiring worldwide, with a presence in Tokyo, San Francisco, Austin, and Toronto. We are more than happy to meet exceptional talent where they are.

Role overview

As our Data Center Facility Operations lead, you are responsible for the physical reliability and performance of ai&'s compute infrastructure in Japan. This is an execution-heavy role. You will manage the mechanical, electrical, and cooling systems that keep our high-density GPU clusters operational, own and operate the network operations center (NOC) for our Japan footprint, and be the person who ensures no facility issue reaches our compute fleet without being caught and addressed first.

You are not just overseeing a facility. You are running it. You will define the operational processes, build the monitoring stack, execute to strict SLAs, and be accountable for the uptime of some of the most demanding compute environments in the world. The ideal candidate has deep hands-on experience with mission-critical data center systems, specific expertise in high-density and liquid cooling environments, NOC operations experience, and the bilingual fluency to operate effectively in Japan while coordinating with a global team.

Responsibilities

NOC Ownership & Operations: Own and operate the NOC for ai&'s Japan data center footprint. Monitor infrastructure health around the clock, triage alerts, coordinate incident response, and ensure no facility issue reaches the compute fleet without being caught first. Establish NOC processes, tooling, escalation paths, and shift handover procedures from the ground up.
Build Observability & Logging: Build out the observability and logging systems required to track power, thermal, and environmental metrics in real time. Detect cooling excursions and power anomalies before they impact GPU health.
Execute to SLAs: Define, monitor, and execute to strict SLAs that ensure maximum availability for mission-critical GPU workloads. Own the numbers and be accountable for them.
High-Density Systems Operations: Manage the day-to-day operations of critical systems, including UPS, switchgear, generators, and chillers. Optimize for AI-specific workloads that require specialized cooling, including coolant distribution units (CDUs) and rear door heat exchangers (RDHx).
Incident & Risk Management: Respond to and resolve facility-level incidents. Establish the detection and response standards that protect GPU health and minimize workload disruption.
Vendor & Site Management: Manage colocation partners and hardware vendors to ensure they meet ai& standards. Execute physical audits and ensure every rack is delivered to spec.
Scale-Out Deployment Support: Support the deployment of new capacity in Japan, managing logistics and physical install processes in coordination with the electrical and systems teams.

You may be a fit if you have the following skills

Advanced Facility Operations Experience: 10 or more years of hands-on experience in data center facility operations, with deep knowledge of mission-critical electrical and mechanical systems.
NOC Operations Experience: Experience standing up or running a NOC in a data center or critical infrastructure environment. You know how to build monitoring workflows, alert thresholds, escalation paths, and shift handover processes. You have managed major incidents from the room and you know how to keep things moving when they go wrong.
AI & HPC Infrastructure Experience: Direct experience with high-density compute environments of 30 kW or more per rack. You understand the unique thermal and power challenges of AI workloads at scale.
Liquid Cooling Expertise: Hands-on experience with liquid cooling technologies, including direct-to-chip, immersion, and RDHx. You understand the infrastructure required to support them and the failure modes to watch for.
Japan Operations Track Record: Experience operating data center sites in Japan. You can navigate local building codes, utility relationships, and regulatory requirements while coordinating with a global team.
Bilingual Technical Fluency: Professional fluency in Japanese and English is required.
Great Team Spirit: A mission-driven approach to operations, valuing clear communication, hands-on execution, and collective success over individual silos.

Member of Technical Staff - Data Center Facility Operations
