What You Need to Know Before
You Start
Starts 7 July 2025 01:14
Ends 7 July 2025
Monitoring GPUs at Scale for AI - ML and HPC Clusters
CNCF [Cloud Native Computing Foundation]
2825 Courses
36 minutes
Optional upgrade avallable
Not Specified
Progress at your own speed
Conference Talk
Optional upgrade avallable
Overview
Join us as we delve into how NVIDIA effectively monitors GPU clusters tailored for AI and machine learning workloads. This comprehensive session will guide you through the application of open-source tools to address the major challenges of deployment, maintenance, security, and scaling in heterogeneous user environments.
Syllabus
- Introduction to GPU Monitoring
- Understanding GPU Architectures and Performance Metrics
- Tools for Monitoring NVIDIA GPUs
- Deployment of Monitoring Solutions at Scale
- Maintenance and Updates
- Security Considerations in GPU Monitoring
- Scaling GPU Monitoring Solutions
- Addressing User Personas in GPU Monitoring
- Case Studies and Real-world Examples
- Practical Exercises and Lab Sessions
- Conclusion and Future Trends
- Q&A and Course Wrap-up
Subjects
Conference Talks