Kairos

AI Cluster Validation Student

On-site
NVIDIA · IL / Yokneam
Part-time

Compensation

Salary undisclosed

Description

We are looking for a motivated Student AI Cluster Validation Engineer to join the Networking Solution Validation (NSV) team within the Networking Cluster Solutions (NCS) organization. You will work on large-scale AI cluster solutions, helping validate infrastructure, monitor system health, analyze telemetry data, and improve reliability across hardware, software, and AI workloads.

This is an excellent opportunity to gain hands-on experience with AI infrastructure, cluster operations, system reliability, and advanced engineering workflows while working alongside experienced engineers on cutting-edge technologies.

What you'll be doing:

  • Support cluster owners in maintaining cluster health, readiness, and operational stability.
  • Participate in cluster bring-up, validation, monitoring, and reliability activities.
  • Monitor and analyze PHY health, telemetry, logs, and system metrics.
  • Assist in troubleshooting system-level, hardware-level, and PHY-related issues.
  • Support MTBI/MTBF analysis, reliability assessments, and long-term cluster health monitoring.
  • Assist in root-cause analysis and corrective action tracking.
  • Work with AI-based engineering tools to improve troubleshooting, analysis, and workflow efficiency.
  • Collaborate with hardware, infrastructure, software, validation, and AI teams.
  • Learn and work with advanced technologies including AI clusters, GPUs, telemetry systems, and high-speed interfaces.

What we need to see:

  • B.Sc. student in Electrical Engineering, Information Systems Engineering, Computer Engineering, Computer Science, or a related field.
  • Strong analytical, troubleshooting, and problem-solving skills.
  • Interest in system architecture, reliability engineering, PHY technologies, and AI infrastructure.
  • Ability to analyze logs, telemetry data, and monitoring metrics.
  • Experience using AI-based engineering tools for analysis, automation, or productivity improvements.
  • Strong communication, collaboration, and documentation skills.

Ways to stand out from the crowd:

  • Understanding of PHY concepts and high-speed communication systems.
  • Familiarity with telemetry, monitoring platforms, or data analysis.
  • Exposure to AI infrastructure, GPUs, HPC environments, or data center technologies.
  • Familiarity with InfiniBand, Ethernet, PCIe, NVLink, or similar high-speed interfaces.
  • Participation in technical projects, military technology units, hackathons, open-source projects, or personal engineering initiatives.

Stack: GPU
Posted: Jun 30, 2026
