What You Need to Know Before You Start
Starts 7 June 2025 18:26
Ends 7 June 2025
Monitoring GPUs at Scale for AI - ML and HPC Clusters
CNCF [Cloud Native Computing Foundation]
via YouTube
36 minutes
Optional upgrade available
Not Specified
Progress at your own speed
Conference Talk
Overview
Learn how NVIDIA monitors GPU clusters for AI/ML workloads using open-source tools, addressing deployment, maintenance, security, and scale challenges for various user personas.
Syllabus
- Introduction to GPU Monitoring
  - Importance of GPU monitoring in AI/ML and HPC clusters
  - Overview of NVIDIA's approach to GPU monitoring
- Understanding GPU Architectures and Performance Metrics
  - Basics of GPU architecture relevant to AI/ML workloads
  - Key performance metrics for monitoring GPUs
- Tools for Monitoring NVIDIA GPUs
  - Introduction to open-source tools
  - Overview of NVIDIA's specific tools
- Deployment of Monitoring Solutions at Scale
  - Strategies for deploying monitoring tools in large clusters
  - Automation in deployment
- Maintenance and Updates
  - Routine maintenance practices
  - Handling updates and upgrades in a monitored environment
- Security Considerations in GPU Monitoring
  - Identifying potential security threats
  - Implementing security measures for monitoring solutions
- Scaling GPU Monitoring Solutions
  - Challenges of scale in GPU monitoring
  - Solutions and best practices for scalable monitoring
- Addressing User Personas in GPU Monitoring
  - Different user personas in GPU monitoring (e.g., Admins, Engineers, Data Scientists)
  - Tailoring monitoring solutions to different user needs
- Case Studies and Real-world Examples
  - Examination of real-world implementations
  - Lessons learned from industry examples
- Practical Exercises and Lab Sessions
  - Hands-on exercises with open-source monitoring tools
  - Setting up a small-scale monitoring solution
- Conclusion and Future Trends
  - Summary of key takeaways
  - Emerging trends in GPU monitoring for AI/ML and HPC clusters
- Q&A and Course Wrap-up
Subjects
Conference Talks