Tokenising LLM input

December 17, 2024

LLMs break down user input (prompts) into "tokens" in order to understand and process language.

Tokenization is the foundational process of converting text into smaller units that Large Language Models (LLMs) can process. 

Token Types
Basic Units: Tokens can be:

    Individual characters
    Parts of words
    Complete words
    Phrases or symbols

For example, the sentence "Hello, world!" might be tokenized as ["Hello", ",", " world", "!"].
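
You can see a real tokenizer's split using OpenAI's open-source tiktoken library (an assumption here, installed via pip install tiktoken; any BPE tokenizer exposes similar calls). A minimal sketch:

```python
# Minimal sketch using tiktoken; assumes the cl100k_base encoding,
# which GPT-4 and GPT-3.5-turbo use.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("Hello, world!")
# Decode each integer ID back to the text fragment it stands for.
tokens = [enc.decode([i]) for i in ids]
print(tokens)  # ['Hello', ',', ' world', '!']
```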

Tokenization Process
The process involves several key steps:

    Splitting the input text into tokens drawn from the model's fixed vocabulary
    Mapping each token to its unique integer ID
    Adding any special tokens the model needs (e.g., end-of-text markers)
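
A minimal sketch of these steps, again assuming tiktoken (other tokenizer libraries expose similar encode/decode calls):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Steps 1-2: split the text into vocabulary tokens and map them to integer IDs.
ids = enc.encode("Tokenization converts text to integers.")
print(ids)  # a list of integer token IDs

# Step 3: special tokens must be allowed explicitly in tiktoken.
eot = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
print(eot)  # [100257], the end-of-text token in cl100k_base

# Decoding the IDs reproduces the original text.
assert enc.decode(ids) == "Tokenization converts text to integers."
```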

A helpful rule of thumb is that one token corresponds to roughly four characters of common English text.
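
You can check the rule of thumb empirically; a quick sketch, again assuming tiktoken:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "A helpful rule of thumb is that one token is about four characters."
# Characters per token; usually close to 4 for plain English prose.
print(len(text) / len(enc.encode(text)))
```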

Common Methods
Byte Pair Encoding (BPE) is the predominant tokenization method used by modern LLMs like GPT models. It works by:

    Counting the most frequently occurring pair of adjacent symbols (initially individual characters)
    Merging that pair into a single new token
    Repeating until the vocabulary reaches a desired size
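
The loop below is a toy BPE trainer in plain Python that learns merge rules from a tiny corpus; it is a sketch of the algorithm, not any production tokenizer's code.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words (toy version)."""
    # Start with each word as a sequence of single characters.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # the most frequent pair
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(train_bpe(["low", "lower", "lowest", "low"], num_merges=3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```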

Other Methods include:

    Word tokenization (splitting on delimiters such as whitespace)
    Character tokenization (one token per character)
    Subword tokenization (splitting words into meaningful pieces)
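
A small comparison of the three approaches (the subword split shown is hypothetical; a real split depends on the trained vocabulary):

```python
text = "unbelievable results"

words = text.split()   # word tokenization: ['unbelievable', 'results']
chars = list(text)     # character tokenization: one token per character
# Subword tokenization: a plausible (hypothetical) split of the same text.
subwords = ["un", "believ", "able", " results"]
print(words, len(chars), subwords)
```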

Practical Considerations
Token Limits: LLMs can only attend to a maximum number of tokens in their context window. For example, the original GPT-4 model has an 8,192-token limit for combined input and output.
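
Before sending a prompt, you can count its tokens against the model's limit; a sketch assuming tiktoken, with a hypothetical fits_in_context helper:

```python
import tiktoken

MAX_TOKENS = 8192  # combined input + output budget for the original GPT-4
enc = tiktoken.encoding_for_model("gpt-4")

def fits_in_context(prompt: str, max_output_tokens: int = 512) -> bool:
    """Return True if the prompt leaves room for the desired output."""
    return len(enc.encode(prompt)) + max_output_tokens <= MAX_TOKENS

print(fits_in_context("Summarize the following article: ..."))
```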

Trade-offs:

    Smaller tokens (characters or bytes) handle multilingual text, typos, and rare words more gracefully
    Larger tokens shorten sequences, making processing more efficient and letting each token carry more meaning
    The choice trades off vocabulary size against model flexibility

Benefits:

    Enables processing of large datasets
    Helps manage memory and computational resources
    Facilitates understanding of language patterns
    Allows handling of unseen or rare words by falling back to smaller subword pieces

Challenges:

    Can lose contextual information when words are split into pieces
    May struggle with languages that lack explicit word boundaries or rely heavily on context
    Has difficulty with unusual formats or out-of-vocabulary words