What You Need to Know Before You Start

Starts 7 June 2025 18:26

Ends 7 June 2025


Monitoring GPUs at Scale for AI/ML and HPC Clusters

CNCF [Cloud Native Computing Foundation] via YouTube

36 minutes

Optional upgrade available

Not Specified

Progress at your own speed

Conference Talk

Overview

Learn how NVIDIA monitors GPU clusters for AI/ML workloads using open-source tools, addressing deployment, maintenance, security, and scale challenges for various user personas.

Syllabus

  • Introduction to GPU Monitoring
      • Importance of GPU monitoring in AI/ML and HPC clusters
      • Overview of NVIDIA's approach to GPU monitoring
  • Understanding GPU Architectures and Performance Metrics
      • Basics of GPU architecture relevant to AI/ML workloads
      • Key performance metrics for monitoring GPUs
  • Tools for Monitoring NVIDIA GPUs
      • Introduction to open-source tools
      • Overview of NVIDIA's specific tools
  • Deployment of Monitoring Solutions at Scale
      • Strategies for deploying monitoring tools in large clusters
      • Automation in deployment
  • Maintenance and Updates
      • Routine maintenance practices
      • Handling updates and upgrades in a monitored environment
  • Security Considerations in GPU Monitoring
      • Identifying potential security threats
      • Implementing security measures for monitoring solutions
  • Scaling GPU Monitoring Solutions
      • Challenges of scale in GPU monitoring
      • Solutions and best practices for scalable monitoring
  • Addressing User Personas in GPU Monitoring
      • Different user personas in GPU monitoring (e.g., Admins, Engineers, Data Scientists)
      • Tailoring monitoring solutions to different user needs
  • Case Studies and Real-world Examples
      • Examination of real-world implementations
      • Lessons learned from industry examples
  • Practical Exercises and Lab Sessions
      • Hands-on exercises with open-source monitoring tools
      • Setting up a small-scale monitoring solution
  • Conclusion and Future Trends
      • Summary of key takeaways
      • Emerging trends in GPU monitoring for AI/ML and HPC clusters
  • Q&A and Course Wrap-up

Subjects

Conference Talks