What You Need to Know Before You Start
Starts 4 June 2026 05:30
Ends 4 June 2026
1 hour
Optional upgrade available
Intermediate
Progress at your own speed
Free Certificate
Optional upgrade available
Overview
This course covers tokenization techniques used in modern AI models, including rule-based methods, subword tokenization (BPE, WordPiece, SentencePiece), and vocabulary optimizations. Learners will implement these methods and understand their impact on NLP model performance.
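As a rough illustration of two of the techniques named above, the sketch below contrasts a regex-based rule tokenizer with a toy BPE merge learner. It is not taken from the course materials: the function names `rule_based_tokenize` and `learn_bpe_merges`, the regex pattern, and the sample words are all assumptions chosen for illustration, and production tokenizers (GPT-2, WordPiece, SentencePiece) add byte-level handling, special tokens, and other details omitted here.

```python
import re
from collections import Counter


def rule_based_tokenize(text):
    """Split text into words (keeping internal apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


def learn_bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair.

    Hypothetical helper for illustration only; ties are broken by
    first-seen order, which real implementations may handle differently.
    """
    # Start with each word as a tuple of single characters, weighted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite each word, fusing occurrences of the chosen pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


tokens = rule_based_tokenize("Don't panic, it costs $5.")
merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(tokens)
print(merges)
```

With this small sample, two merges are enough to fuse the shared stem "low" into a single symbol, which is the basic mechanism that lets subword vocabularies cover unseen words such as "lowering" from known pieces.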
Syllabus
- Unit 1: Introduction to Tokenization (Rule-Based Tokenization)
- Unit 2: Byte-Pair Encoding (BPE) – Subword Tokenization
- Unit 3: Comparing BPE, WordPiece, and SentencePiece in NLP
- Unit 4: Tokenization and Out-of-Vocabulary (OOV) Handling in NLP
Tokenize Text with NLTK
Sentence Tokenization with NLTK
Extract Monetary Values with Regex
Tokenization Showdown with NLTK and spaCy
Exploring Pre-trained Tokenizers with GPT-2
Using Pre-trained Tokenizers with RoBERTa
Comparing Tokenization with GPT-2 and RoBERTa
WordPiece Tokenization Challenge
Tokenization Techniques in Action
Tokenization Techniques in Action
Tokenization Techniques for Special Texts
Tokenization Showdown BERT vs GPT-2
Multilingual Tokenization Challenge
Multilingual Tokenization and OOV Reduction
Subjects
Computer Science