Tokenising LLM input
December 17, 2024
LLMs break down user input (prompts) into "tokens" to understand and process language.
Tokenization is the foundational process of converting text into smaller units that Large Language Models (LLMs) can process.
Token Types
Basic Units: Tokens can be:
Individual characters
Parts of words
Complete words
Phrases or symbols
For example, the sentence "Hello, world!" might be tokenized as ["Hello", ",", " world", "!"]
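As a hedged illustration, the snippet below uses OpenAI's tiktoken library with the cl100k_base encoding (an assumption; different models use different encodings) to show how that sentence splits into integer IDs and the text pieces they stand for:

```python
# A minimal sketch using the tiktoken library (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Hello, world!")

# Decode each ID on its own to see the text piece it represents.
pieces = [enc.decode([i]) for i in ids]

print(ids)     # e.g. [9906, 11, 1917, 0]
print(pieces)  # e.g. ['Hello', ',', ' world', '!']
```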
Tokenization Process
The process involves several key steps:
Breaking down input text into tokens from the model's established vocabulary
Assigning unique integer IDs to each token
Adding any necessary special tokens for processing
A helpful rule of thumb is that one token typically corresponds to about 4 characters in common English text
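To make that rule of thumb concrete, here is a small sketch comparing a character-based estimate with an exact count; the estimate_tokens helper is hypothetical, and cl100k_base is again just one possible encoding:

```python
import tiktoken

def estimate_tokens(text: str) -> int:
    """Rough estimate using the ~4 characters per token rule of thumb."""
    return max(1, len(text) // 4)

enc = tiktoken.get_encoding("cl100k_base")
text = "Tokenization is the foundational process of converting text into smaller units."

print(estimate_tokens(text))   # rough guess from the character count
print(len(enc.encode(text)))   # exact count for this particular encoding
```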
Common Methods
Byte Pair Encoding (BPE) is the predominant tokenization method used by modern LLMs such as the GPT models. It works by:
Counting the most frequently occurring pair of adjacent symbols (initially individual characters)
Merging that pair into a single new token
Repeating until the desired vocabulary size is reached; a toy sketch of this loop follows below
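The following is a minimal, from-scratch sketch of the BPE training loop on a toy corpus. The bpe_train function and the example words are illustrative assumptions; production tokenizers typically operate on bytes and learn tens of thousands of merges.

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merges from a toy corpus (simplified sketch)."""
    # Represent every word as a sequence of single-character symbols.
    words = [tuple(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across all words.
        pairs = Counter()
        for word in words:
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with a merged symbol.
        merged_words = []
        for word in words:
            new_word, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    new_word.append(word[i] + word[i + 1])
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            merged_words.append(tuple(new_word))
        words = merged_words
    return merges

# Toy usage: "lo" appears in most words, so ('l', 'o') is merged early.
print(bpe_train(["low", "lower", "lowest", "newest", "widest"], num_merges=5))
```

Each merge adds one entry to the vocabulary, which is why stopping after a fixed number of merges directly controls the vocabulary size.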
Other Methods include:
Word tokenization (splitting by delimiters)
Character tokenization (individual characters)
Subword tokenization (partial words)
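Word and character tokenization are simple enough to show directly. This toy sketch splits only on whitespace; real word tokenizers also handle punctuation and other delimiters:

```python
text = "Hello, world!"

# Word tokenization: split on whitespace delimiters.
word_tokens = text.split()   # ['Hello,', 'world!']

# Character tokenization: every character is its own token.
char_tokens = list(text)     # ['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', ...]

print(word_tokens)
print(char_tokens)
```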
Practical Considerations
Token Limits: LLMs have maximum token limits for their context windows. For example, the original GPT-4 model has an 8,192-token limit for combined input and output (see the sketch below)
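As a rough sketch of working within such a limit, one can count prompt tokens before sending a request; the fits_context helper and the 512-token output budget are illustrative assumptions, not part of any API:

```python
import tiktoken

CONTEXT_LIMIT = 8192  # e.g. the original GPT-4 context window

enc = tiktoken.get_encoding("cl100k_base")

def fits_context(prompt: str, max_output_tokens: int = 512) -> bool:
    """Check whether a prompt leaves room for the desired output length."""
    return len(enc.encode(prompt)) + max_output_tokens <= CONTEXT_LIMIT

print(fits_context("Summarise the following report ..."))
```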
Trade-offs:
Smaller tokens provide better handling of multiple languages and typos
Larger tokens allow for more efficient processing and better context understanding
The choice affects vocabulary size and model flexibility
Benefits:
Enables processing of large datasets
Helps manage memory and computational resources
Facilitates understanding of language patterns
Allows handling of unseen or rare words
Challenges:
Can lose contextual information
May struggle with languages that rely heavily on context
Has difficulty with unusual formats or out-of-vocabulary words