Compensation
Salary undisclosed
Description
Your impact
This is a great opportunity to play a defining role in building and operating our inference platform. You will combine deep expertise in distributed systems programming with hands-on knowledge of operating Kubernetes-based platforms according to best practices, all while supporting the platform in production. You will be instrumental in serving cutting-edge machine learning models at massive scale for scientific applications, which often requires thinking from first principles. Success in this role demands excellent technical skills, independence, strong ownership, and a relentless focus on users.
What you will do
- Contribute to the development and operation of the inference platform, serving fleets of cutting-edge machine learning models to scientific applications.
- Deliver high-quality, well-tested, user-focused features.
- Provide support to users of the platform.
- Perform maintenance work and drive internal tech investments for platform stability, reliability and scalability.
- Build observability and alerting mechanisms for the platform.
- Improve the Continuous Integration/Continuous Deployment (CI/CD) setup of the platform.
- Operate effectively in a fast-paced and ambiguous environment, ensuring independent delivery.
- Provide great documentation and guidance for other contributors and users.
Skills and qualifications
Essential:
- Experience writing and maintaining Python code in production environments, with an emphasis on concurrent programming (strong knowledge of async, threads, processes, the GIL, etc.).
- Experience building, maintaining and operating Kubernetes services.
- Experience working with distributed systems.
- Experience maintaining APIs that serve a moderately large set of internal users.
- Experience working with ML models; an understanding of ML lifecycle and how serving and operating ML models differs from other kinds of workloads.
Nice to have:
- Experience working on an inference platform.
- Experience managing a fleet of ML models.
- Experience building and maintaining CI/CD processes for complex systems.
- Experience with GCP or other comparable clouds.
- Experience with building internal and user-focused dashboards.
Stack
Python, Distributed Systems, GCP, CI/CD, Kubernetes, Machine Learning
- Posted: Mar 4, 2026
- Last seen: Jun 25, 2026
- First seen: Jun 25, 2026
- Status: active