What You Need to Know Before You Start
Starts 20 June 2025 10:29
Ends 20 June 2025
Duration: 46 minutes
Optional upgrade available
Progress at your own speed
Free Video
Overview
Explore probabilistic safety guarantees for language models through analysis of model internals, with Jacob Hilton from the Alignment Research Center.
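As a flavor of the course's central idea, here is a minimal sketch, assuming nothing beyond the Python standard library, of one of the simplest probabilistic safety guarantees: an exact binomial upper confidence bound on a model's failure rate estimated from behavioral sampling. The `is_unsafe` predicate and the sample list are hypothetical placeholders, not material from the course itself.

```python
import math

def failure_rate_upper_bound(n_trials: int, n_failures: int,
                             confidence: float = 0.95) -> float:
    """Exact (Clopper-Pearson) upper confidence bound on a failure
    probability, given n_failures seen in n_trials independent trials."""
    alpha = 1.0 - confidence
    if n_failures == 0:
        # Zero-failure case has a closed form: (1 - p)^n = alpha.
        return 1.0 - alpha ** (1.0 / n_trials)
    # Otherwise bisect for the p where the binomial CDF at n_failures
    # equals alpha; the CDF is decreasing in p.
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        cdf = sum(math.comb(n_trials, k) * mid**k * (1 - mid) ** (n_trials - k)
                  for k in range(n_failures + 1))
        if cdf >= alpha:
            lo = mid
        else:
            hi = mid
    return lo

def is_unsafe(output: str) -> bool:
    # Hypothetical stand-in for a real safety check on model outputs.
    return "UNSAFE" in output

samples = ["a harmless completion"] * 10_000  # pretend sampled model outputs
failures = sum(is_unsafe(s) for s in samples)
print(f"95% upper bound on failure rate: "
      f"{failure_rate_upper_bound(len(samples), failures):.2e}")
```

With zero failures in n trials, the 95% bound is roughly 3/n (the "rule of three"), which is why purely behavioral guarantees scale poorly and why the course turns to analysis of model internals.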
Syllabus
- Introduction to Probabilistic Safety
  - Overview of Safety in AI Systems
  - Understanding Probabilistic Guarantees
- Fundamentals of Model Internals
  - Architecture of Language Models
  - Key Components and Their Functions
- Analyzing Model Internals
  - Techniques for Internal Inspection (see the sketch after this syllabus)
  - Tools and Software for Analysis
- Probabilistic Methods in AI Safety
  - Basics of Probability Theory
  - Application of Probabilistic Methods in AI
- Developing Safety Guarantees
  - Criteria for Safety in Language Models
  - Constructing Safety Guarantees Using Probabilistic Approaches
- Case Studies and Practical Examples
  - Review of Past Research and Findings
  - Analysis of Real-World Language Model Scenarios
- Implementing Safety Frameworks
  - Designing Safety Mechanisms Based on Internals
  - Testing and Validating Safety Measures
- Evaluating Safety in Language Models
  - Metrics for Safety Assurance
  - Continuous Assessment and Improvement Strategies
- Tools and Resources
  - Software Libraries for Model Analysis
  - Datasets for Testing Safety Protocols
- Guest Lecture by Jacob Hilton
  - Insights from the Alignment Research Center
  - Q&A on Advanced Safety Topics
- Conclusion and Future Directions
  - Summary of Key Learnings
  - Future Challenges and Opportunities in AI Safety
- Final Project
  - Application of Course Concepts
  - Development of a Probabilistic Safety Framework for a Language Model
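To make the internal-inspection topics above concrete, here is a minimal sketch, assuming PyTorch, of capturing a hidden layer's activations with a forward hook and flagging inputs whose activation norms look anomalous. The toy two-layer network and the two-sigma norm threshold are purely illustrative assumptions, not techniques taken from the course.

```python
import torch
import torch.nn as nn

# A toy network standing in for a language model's internals.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 4),
)

captured = {}

def capture(name):
    # Forward hook: record the layer's output activations for inspection.
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

model[1].register_forward_hook(capture("hidden"))  # inspect the hidden layer

x = torch.randn(8, 16)   # a batch of stand-in inputs
_ = model(x)

acts = captured["hidden"]                    # shape: (8, 32)
norms = acts.norm(dim=1)
threshold = norms.mean() + 2 * norms.std()   # crude anomaly cut-off
print("flagged inputs:", (norms > threshold).nonzero().flatten().tolist())
```

In practice, hooks like this are one common way to read off intermediate activations for further statistical analysis; which statistics support genuine safety guarantees is exactly what the course examines.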
Subjects
Computer Science