What you need to know before you start

Start: 14 July 2026, 22:48

End: 14 July 2026

Production Deep Learning: Inference, Quantization & Edge Deployment

Master the full deep learning deployment lifecycle: quantize LLMs with AWQ/GPTQ, serve at scale with vLLM and Triton, and deploy to edge devices using ONNX, Llama.cpp, and TensorRT.

Board Infinity via Coursera

21 hours

Optional upgrade available

Advanced

Self-paced

Paid Course

Overview

Production Deep Learning: Inference, Quantization & Edge Deployment is designed for ML engineers and developers who want to master the full deployment lifecycle, from compressing and quantizing models to serving them at scale using vLLM, Triton, ONNX, and Llama.cpp.

Module 1 covers model compression fundamentals, including pruning, distillation, and INT8/INT4 quantization using AWQ and GPTQ, with a focus on the accuracy–latency tradeoff.

Module 2 dives into high-throughput serving architectures, exploring vLLM's PagedAttention, NVIDIA Triton, TensorRT, and scaling inference across GPU clusters with autoscaling patterns.

Module 3 focuses on CPU and edge deployment using ONNX Runtime, GGUF, and Llama.cpp, plus multimodal inference with CLIP and LLaVA on resource-constrained devices.

Module 4 is a capstone project where you'll quantize a fine-tuned LLM, build a production API with vLLM, benchmark performance, and containerize your model with Docker for cloud and edge deployment.

By the end of this course, you will:

- Apply INT4/INT8 quantization techniques (AWQ, GPTQ, GGUF) to compress LLMs for production
- Deploy high-throughput inference servers using vLLM, Triton, and ONNX Runtime
- Run optimized models on GPU, CPU, and edge devices using Llama.cpp and TensorRT
- Build, benchmark, and containerize an end-to-end production-ready inference API

Disclaimer:

This is an independent educational resource created by Board Infinity for informational and educational purposes only.

This course is not affiliated with, endorsed by, sponsored by, or officially associated with any company, organization, or certification body unless explicitly stated. The content provided is based on industry knowledge and best practices but does not constitute official training material for any specific employer or certification program.

All company names, trademarks, service marks, and logos referenced are the property of their respective owners and are used solely for educational identification and comparison purposes.

Syllabus

Model Compression, Quantization, and Latency Optimization

Learn the fundamentals of model compression, memory profiling, and modern INT8/INT4 quantization techniques, including AWQ and GPTQ, to optimize models for production inference.
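
To make the INT8 idea concrete, here is a minimal sketch of symmetric per-tensor quantization using plain round-to-nearest. It is not taken from the course material and it is not AWQ or GPTQ, which add calibration on top of this baseline; the function names `quantize_int8` and `dequantize_int8` are illustrative assumptions.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the signed range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any positive scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [v * scale for v in q]


# Hypothetical weights chosen only for illustration.
weights = [0.50, -1.27, 0.03, 1.00]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Round-to-nearest keeps the per-weight error within half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The stored tensor shrinks from 32-bit floats to 8-bit integers plus one scale value, and the reconstruction error is bounded by half the step size. That bound is the accuracy side of the accuracy-latency tradeoff.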

High-Throughput Serving - vLLM, PagedAttention, and Triton

Master production serving engines, including vLLM with PagedAttention and NVIDIA Triton, to scale inference across GPUs and nodes.
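
The general idea behind PagedAttention is to store each sequence's KV cache in fixed-size blocks that need not be contiguous, with a per-sequence block table mapping logical positions to physical blocks. The toy allocator below sketches only that bookkeeping. `ToyBlockAllocator` and its methods are hypothetical names for illustration and do not reflect vLLM's actual API or internals.

```python
from typing import Dict, List


class ToyBlockAllocator:
    """Hypothetical block-table bookkeeping; not vLLM's implementation."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}
        self.token_counts: Dict[str, int] = {}

    def append_token(self, seq_id: str) -> int:
        """Reserve room for one more token; return the physical block used."""
        table = self.block_tables.setdefault(seq_id, [])
        count = self.token_counts.get(seq_id, 0)
        # A new block is needed only when the last one is full (or none exists).
        if count % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop(0))
        self.token_counts[seq_id] = count + 1
        return table[-1]

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.token_counts.pop(seq_id, None)


allocator = ToyBlockAllocator(num_blocks=4, block_size=2)
allocator.append_token("a")
allocator.append_token("a")
allocator.append_token("b")
allocator.append_token("a")  # sequence "a" grows into a non-adjacent block

table_a = list(allocator.block_tables["a"])
table_b = list(allocator.block_tables["b"])
allocator.free("a")
free_after = list(allocator.free_blocks)
print(table_a, table_b, free_after)
```

Because blocks are handed out on demand, a sequence reserves memory only for tokens it has produced, and a finished sequence's blocks become reusable by other requests immediately.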

ONNX, Llama.cpp, and Edge / CPU Deployment

Export models to ONNX for interoperability, deploy LLMs on CPU and edge devices with Llama.cpp and GGUF, and build multimodal pipelines with CLIP and LLaVA.

Capstone Project - The Edge-Ready API (Quantize, Serve, Evaluate)

Apply all course concepts in a capstone project to quantize a fine-tuned model, serve it through vLLM, evaluate it, and package it for cloud and edge deployment.
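
For the evaluation step, one common approach is to time repeated calls and report latency percentiles rather than a single average. The sketch below is an assumption about what such a harness might look like, not the course's own tooling. `fake_inference` is a stand-in for a real model call, and `benchmark` and `percentile` are illustrative names.

```python
import math
import time
from typing import Callable, Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def benchmark(call: Callable[[], object], runs: int = 20, warmup: int = 3) -> Dict[str, float]:
    """Time `call` after a few warmup runs and summarize latency in seconds."""
    for _ in range(warmup):
        call()  # warmup runs are excluded from the measurements
    latencies: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        latencies.append(time.perf_counter() - start)
    return {
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "mean_s": sum(latencies) / len(latencies),
    }


def fake_inference() -> int:
    # Stand-in workload; replace with a request to the served model.
    return sum(i * i for i in range(10_000))


report = benchmark(fake_inference)
print(report)
```

Reporting p95 alongside the median exposes tail latency, which tends to matter more than the mean for a production API.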

Taught by

Board Infinity

Subjects

Programming
