What You Need to Know Before You Start

Starts 7 July 2025 01:14

Ends 7 July 2025

Monitoring GPUs at Scale for AI/ML and HPC Clusters

This session covers how NVIDIA monitors GPU clusters built for AI and machine learning workloads. It walks through applying open-source tools to the main challenges of deployment, maintenance, security, and scaling in heterogeneous user environments.
CNCF [Cloud Native Computing Foundation] via YouTube


36 minutes

Optional upgrade available

Not Specified

Progress at your own speed

Conference Talk

Syllabus

  • Introduction to GPU Monitoring
      • Importance of GPU monitoring in AI/ML and HPC clusters
      • Overview of NVIDIA's approach to GPU monitoring
  • Understanding GPU Architectures and Performance Metrics
      • Basics of GPU architecture relevant to AI/ML workloads
      • Key performance metrics for monitoring GPUs
  • Tools for Monitoring NVIDIA GPUs
      • Introduction to open-source tools
      • Overview of NVIDIA's specific tools
  • Deployment of Monitoring Solutions at Scale
      • Strategies for deploying monitoring tools in large clusters
      • Automation in deployment
  • Maintenance and Updates
      • Routine maintenance practices
      • Handling updates and upgrades in a monitored environment
  • Security Considerations in GPU Monitoring
      • Identifying potential security threats
      • Implementing security measures for monitoring solutions
  • Scaling GPU Monitoring Solutions
      • Challenges of scale in GPU monitoring
      • Solutions and best practices for scalable monitoring
  • Addressing User Personas in GPU Monitoring
      • Different user personas in GPU monitoring (e.g., Admins, Engineers, Data Scientists)
      • Tailoring monitoring solutions to different user needs
  • Case Studies and Real-world Examples
      • Examination of real-world implementations
      • Lessons learned from industry examples
  • Practical Exercises and Lab Sessions
      • Hands-on exercises with open-source monitoring tools
      • Setting up a small-scale monitoring solution
  • Conclusion and Future Trends
      • Summary of key takeaways
      • Emerging trends in GPU monitoring for AI/ML and HPC clusters
  • Q&A and Course Wrap-up

Subjects

Conference Talks