What You Need to Know Before You Start

Starts 18 July 2026 15:56

Ends 18 July 2026

Monitoring GPUs at Scale for AI/ML and HPC Clusters

CNCF [Cloud Native Computing Foundation] via YouTube

36 minutes

Optional upgrade available

Not Specified

Progress at your own speed

Conference Talk

Overview

Learn how NVIDIA monitors GPU clusters for AI/ML workloads using open-source tools, addressing deployment, maintenance, security, and scale challenges for various user personas.

Syllabus

Introduction to GPU Monitoring

Importance of GPU monitoring in AI/ML and HPC clusters

Overview of NVIDIA's approach to GPU monitoring

Understanding GPU Architectures and Performance Metrics

Basics of GPU architecture relevant to AI/ML workloads

Key performance metrics for monitoring GPUs

Tools for Monitoring NVIDIA GPUs

Introduction to open-source tools

Overview of NVIDIA-specific tools

Deployment of Monitoring Solutions at Scale

Strategies for deploying monitoring tools in large clusters

Automation in deployment

Maintenance and Updates

Routine maintenance practices

Handling updates and upgrades in a monitored environment

Security Considerations in GPU Monitoring

Identifying potential security threats

Implementing security measures for monitoring solutions

Scaling GPU Monitoring Solutions

Challenges of scale in GPU monitoring

Solutions and best practices for scalable monitoring

Addressing User Personas in GPU Monitoring

Different user personas in GPU monitoring (e.g., Admins, Engineers, Data Scientists)

Tailoring monitoring solutions to different user needs

Case Studies and Real-world Examples

Examination of real-world implementations

Lessons learned from industry examples

Practical Exercises and Lab Sessions

Hands-on exercises with open-source monitoring tools

Setting up a small-scale monitoring solution

Conclusion and Future Trends

Summary of key takeaways

Emerging trends in GPU monitoring for AI/ML and HPC clusters

Q&A and Course Wrap-up

Subjects

Conference Talks
