Tokenising LLM input

December 17, 2024

LLMs break down user input (prompts) into "tokens" in order to understand and process language.

Tokenization is the foundational process of converting text into smaller units that Large Language Models (LLMs) can process. 

Token Types
Basic Units: Tokens can be:

    Individual characters
    Parts of words
    Complete words
    Phrases or symbols

For example, the sentence "Hello, world!" might be tokenized as ["Hello", ",", " world", "!"].
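
You can see a real tokenizer's split using OpenAI's open-source tiktoken library (an assumption here, installed via pip install tiktoken; any BPE tokenizer exposes similar calls). A minimal sketch:

```python
# Minimal sketch using tiktoken; assumes the cl100k_base encoding,
# which GPT-4 and GPT-3.5-turbo use.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("Hello, world!")
# Decode each integer ID back to the text fragment it stands for.
tokens = [enc.decode([i]) for i in ids]
print(tokens)  # ['Hello', ',', ' world', '!']
```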

Tokenization Process
The process involves several key steps:

    Splitting the input text into tokens drawn from the model's fixed vocabulary
    Mapping each token to its unique integer ID
    Adding any special tokens the model needs (e.g., end-of-text markers)
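
A minimal sketch of these steps, again assuming tiktoken (other tokenizer libraries expose similar encode/decode calls):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Steps 1-2: split the text into vocabulary tokens and map them to integer IDs.
ids = enc.encode("Tokenization converts text to integers.")
print(ids)  # a list of integer token IDs

# Step 3: special tokens must be allowed explicitly in tiktoken.
eot = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
print(eot)  # [100257], the end-of-text token in cl100k_base

# Decoding the IDs reproduces the original text.
assert enc.decode(ids) == "Tokenization converts text to integers."
```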

A helpful rule of thumb is that one token corresponds to roughly four characters of common English text.
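
You can check the rule of thumb empirically; a quick sketch, again assuming tiktoken:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "A helpful rule of thumb is that one token is about four characters."
# Characters per token; usually close to 4 for plain English prose.
print(len(text) / len(enc.encode(text)))
```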

Common Methods
Byte Pair Encoding (BPE) is the predominant tokenization method used by modern LLMs like GPT models. It works by:

    Counting the most frequently occurring pair of adjacent symbols (initially individual characters)
    Merging that pair into a single new token
    Repeating until the vocabulary reaches a desired size
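
The loop below is a toy BPE trainer in plain Python that learns merge rules from a tiny corpus; it is a sketch of the algorithm, not any production tokenizer's code.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words (toy version)."""
    # Start with each word as a sequence of single characters.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # the most frequent pair
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(train_bpe(["low", "lower", "lowest", "low"], num_merges=3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```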

Other Methods include:

    Word tokenization (splitting on delimiters such as whitespace)
    Character tokenization (one token per character)
    Subword tokenization (splitting words into meaningful pieces)
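
A small comparison of the three approaches (the subword split shown is hypothetical; a real split depends on the trained vocabulary):

```python
text = "unbelievable results"

words = text.split()   # word tokenization: ['unbelievable', 'results']
chars = list(text)     # character tokenization: one token per character
# Subword tokenization: a plausible (hypothetical) split of the same text.
subwords = ["un", "believ", "able", " results"]
print(words, len(chars), subwords)
```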

Practical Considerations
Token Limits: LLMs can only attend to a maximum number of tokens in their context window. For example, the original GPT-4 model has an 8,192-token limit for combined input and output.
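
Before sending a prompt, you can count its tokens against the model's limit; a sketch assuming tiktoken, with a hypothetical fits_in_context helper:

```python
import tiktoken

MAX_TOKENS = 8192  # combined input + output budget for the original GPT-4
enc = tiktoken.encoding_for_model("gpt-4")

def fits_in_context(prompt: str, max_output_tokens: int = 512) -> bool:
    """Return True if the prompt leaves room for the desired output."""
    return len(enc.encode(prompt)) + max_output_tokens <= MAX_TOKENS

print(fits_in_context("Summarize the following article: ..."))
```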

Trade-offs:

    Smaller tokens (characters or bytes) handle multilingual text, typos, and rare words more gracefully
    Larger tokens shorten sequences, making processing more efficient and letting each token carry more meaning
    The choice trades off vocabulary size against model flexibility

Benefits:

    Enables processing of large datasets
    Helps manage memory and computational resources
    Facilitates understanding of language patterns
    Allows handling of unseen or rare words by falling back to smaller subword pieces

Challenges:

    Can lose contextual information when words are split into pieces
    May struggle with languages that lack explicit word boundaries or rely heavily on context
    Has difficulty with unusual formats or out-of-vocabulary words