Tokenization Strategies for LLM Applications

Explore the mechanics of BPE and Unigram, and how to choose a suitable tokenizer for your LLM application


By Kuriko IWAI


Table of Contents

Introduction
LLM Foundations: Inside the Tokenization Pipeline
Inference and De-tokenization
Architectural Trade-Offs
Word-based Tokenizer
Variable Evaluation
Use Cases
Suitable Tasks
Character-Based Tokenizer
Variable Evaluation
Use Cases
Suitable Tasks
Subword-Based Tokenizers
Variable Evaluation
Subword Tokenization Algorithms & Tools
BPE Tokenizer
Vocabulary Building
Execution
Advantages
Limitations
Use Cases
Suitable Tasks
WordPiece Tokenizer
Vocabulary Building
Greedy Longest Match
Advantages
Limitations
Use Cases
Suitable Tasks
Unigram
Vocabulary Building
Redundancy Test
Advantages
Limitations
Use Cases
Suitable Tasks
SentencePiece
Advantages
Limitations
Use Cases
Suitable Tasks
Wrapping Up

Introduction

Large Language Models (LLMs) have transformed how we interact with digital information.

Tokenization is a critical preprocessing step that dictates how raw text is translated into a language LLMs can comprehend.

Choosing the right tokenizer can drastically impact an LLM’s memory usage, inference speed, and ability to handle specialized terms.

This guide breaks down the core architectures and provides a framework for choosing the right one for specific applications.

LLM Foundations: Inside the Tokenization Pipeline

Tokenization is the process of breaking down raw text into smaller units called tokens (words, characters, or subwords) to enable LLMs to process human language.

The diagram below illustrates how the sentence “I am backpacking now.” is processed in the embedding pipeline:

Figure A. Tokenization process of the word-based tokenizer (Created by Kuriko IWAI)


First, a tokenizer splits the input text and assigns a unique Token ID to each unit based on a predefined vocabulary.

Then, these IDs are converted into token embeddings—dense numerical vectors that capture the semantic meaning of the words. In Figure A, because the model has four dimensions, each token embedding has a shape of 1 × 4.

Because transformers (LLMs) process all tokens simultaneously, positional embeddings are added to the token embeddings to provide information about the order of the words.

Lastly, the combined embeddings are normalized and fed into the transformer layers (such as a decoder-based architecture, pink box in Figure A).
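As a rough sketch of this pipeline, the NumPy snippet below looks up token IDs, fetches 1 × 4 token embeddings as in Figure A, and adds positional embeddings. The vocabulary and the embedding values are hypothetical placeholders, not a real model's weights.

```python
import numpy as np

# Hypothetical toy vocabulary and embedding tables (values are illustrative).
vocab = {"I": 0, "am": 1, "backpacking": 2, "now": 3, ".": 4}
d_model = 4                                        # embedding dimension, as in Figure A
rng = np.random.default_rng(0)
token_emb = rng.normal(size=(len(vocab), d_model)) # one row per token ID
pos_emb = rng.normal(size=(16, d_model))           # one row per position

def embed(tokens):
    """Token IDs -> token embeddings + positional embeddings."""
    ids = [vocab[t] for t in tokens]               # step 1: look up token IDs
    x = token_emb[ids]                             # step 2: token embeddings, shape (len, 4)
    x = x + pos_emb[: len(ids)]                    # step 3: add order information
    return x

x = embed(["I", "am", "backpacking", "now", "."])
print(x.shape)  # (5, 4): five tokens, each a 1 x 4 vector
```

In a real transformer, the result would then be normalized and fed into the attention layers.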

Inference and De-tokenization

Once the embeddings are processed, the LLM performs an autoregressive process, calculating probabilities to predict the most likely next token in the sequence.

Finally, the tokenizer works in reverse to de-tokenize the predicted ID back into human-readable text.

In Figure A, the pipeline predicts the next word “so”.

Architectural Trade-Offs

Tokenizers have three major architectures based on their tokenization strategies:

  1. Word-based,

  2. Character-based, and

  3. Subword-based.

The diagram below compares how each architecture tokenizes a sentence:

Figure B. Comparison of tokenization by architecture (Subword-based is based on BPE) (Created by Kuriko IWAI)


Each architecture tokenizes differently, balancing three variables:

  1. Vocabulary size: Total number of unique tokens that the model knows,

  2. Sequence length: How many tokens it needs to represent a sentence, and

  3. Unknown word handling: How well it can tackle the Out-of-Vocabulary (OOV) problems.

These variables impact the model's overall capabilities:

  • Memory efficiency: Smaller tokens can save RAM and GPU memory.

  • Computational efficiency: Shorter sequence lengths can be processed much faster.

  • Learning efficiency: A well-designed vocabulary lets the model learn language patterns from less data.

I’ll take a sample sentence from Figure A:

I am backpacking now.

as an example and explore how each architecture handles it.

Word-based Tokenizer

A word-based tokenizer splits text by whitespace, punctuation, or specific linguistic rules.

Figure B shows it simply splits the sentence by whitespace (whitespace tokenizer).

Although simple, a word-based tokenizer struggles with a large vocabulary size and OOV problems because it must learn every single word, including words that share prefixes or suffixes.

For example, it cannot recognize that the words “backpacking” and “backpacker” share the same root word “backpack”.

So, when “backpacker” is not in its vocabulary, it maps “backpacker” to an [UNK] (Unknown) token and loses all meaning, leading to the OOV problem.
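A minimal sketch of this failure mode, using a tiny hypothetical vocabulary:

```python
# Minimal word-based tokenizer: whitespace split + [UNK] for OOV words.
# The vocabulary below is a hypothetical example, not a real model's.
vocab = {"i": 0, "am": 1, "backpacking": 2, "now": 3, "[UNK]": 4}

def word_tokenize(text):
    words = text.lower().replace(".", "").split()        # naive whitespace split
    return [w if w in vocab else "[UNK]" for w in words]

print(word_tokenize("I am backpacking now."))  # ['i', 'am', 'backpacking', 'now']
print(word_tokenize("I am a backpacker."))     # ['i', 'am', '[UNK]', '[UNK]']
```

Every word outside the fixed vocabulary collapses to [UNK], so “backpacker” loses its connection to “backpacking” entirely.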

Variable Evaluation

  • Vocabulary size: Massive (500,000+ tokens).

  • Sequence length: Moderate compared to character-based.

  • Unknown word (OOV) handling: Fails completely.

Use Cases

  • Whitespace split: The simplest form. Python's .split().

  • Keras text_to_word_sequence: A utility function used in deep learning for basic preprocessing.

  • NLTK word_tokenize: A Python library tool that uses rules to separate punctuation and words.

  • spaCy tokenizer: A tokenizer to handle prefixes, suffixes, and special cases like URLs.

Suitable Tasks

Tasks which don’t require a wide range of words are suitable for this tokenizer.

  • Keyword search systems: When the primary goal is exact matching of terms in a database.

  • Simple sentiment analysis: For small, domain-specific datasets where words don't change much.

  • Traditional information retrieval: Classic TF-IDF based systems that index whole words.

  • NLP learning: Ideal for understanding basic text processing before moving to complex models.

Character-Based Tokenizer

A character-based tokenizer splits text into individual letters.

Figure B shows that the sentence is tokenized into 18 tokens, each of which represents a single character.

Because its vocabulary is small and fixed (e.g., all letters plus punctuation), a character-based tokenizer avoids OOV problems completely.

Yet, its sequence length poses a challenge.

As Figure B shows, it needs 18 tokens to represent the sentence, while its counterparts take only 4 or 7 tokens.

LLMs have a context window that limits the total number of tokens the model can process at once during the autoregressive process.

When the sequence is too long, it gets truncated at the context window, degrading the model’s response.
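The 18-token count from Figure B can be reproduced with a trivial character tokenizer:

```python
def char_tokenize(text):
    # Every non-space character becomes its own token.
    return [c for c in text if c != " "]

tokens = char_tokenize("I am backpacking now.")
print(tokens[:5])   # ['I', 'a', 'm', 'b', 'a']
print(len(tokens))  # 18 tokens for a 4-word sentence, as in Figure B
```

The same sentence that a word-based tokenizer covers in 4 tokens consumes 18 slots of the context window here.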

Variable Evaluation

  • Vocabulary size: Very small.

  • Sequence length: Very long (might exceed the context window).

  • Unknown word handling: Very good. No OOV problem exists.

Use Cases

  • Python string iteration: Splits a string into a list of characters (e.g., list("Hi") → ['H', 'i']).

  • Custom regex tokenizers: re.findall(r'.', text) to capture every individual character in a sequence.

  • Traditional RNNs: Uses character-level mapping where the vocabulary is just the alphabet + punctuation.

Suitable Tasks

Good at handling tasks that require special attention to characters:

  • Spelling correction: Learning to fix words like hapy to happy.

  • OCR (Optical Character Recognition): Reading text from images where character-level precision is vital.

  • Noisy text handling: Social media data full of emojis or intentional misspellings.

  • Morphological analysis: Study how words are formed at the most granular level.

Subword-Based Tokenizers

A subword-based tokenizer splits text by sub-parts of the word.

Figure B shows that the word backpacking is tokenized into the three tokens: back, pack, and ing.

Capturing prefixes and suffixes, the subword-based tokenizer can recognize backpacking, backpacker, and backpacked share the same root word, backpack, without increasing its vocabulary size.

This enhances the parameter efficiency and learning capabilities of LLMs, which is why subword tokenization has become the industry standard.

Variable Evaluation

  • Vocabulary size: Smaller than word-based (30k - 100k tokens).

  • Sequence length: Moderate.

  • Unknown word handling: Avoids OOV problems by leveraging known prefixes and suffixes.

Subword Tokenization Algorithms & Tools

Subword tokenization has four primary algorithms and tools:

  • Byte Pair Encoding (BPE),

  • WordPiece,

  • Unigram, and

  • SentencePiece.

BPE, WordPiece, and Unigram are algorithms for tokenization, and SentencePiece is a tool to apply these algorithms:

  • BPE: Algorithm that merges the most frequent subword pairs into a single token.

  • WordPiece: Algorithm that merges subword pairs that maximize the likelihood of the training data.

  • Unigram: Algorithm that iteratively prunes the least likely subwords until reaching the desired vocabulary size.

  • SentencePiece: A tool for applying BPE or Unigram to multilingual models.

Taking the word “backpacking” as an example:

| Tokenizer | Key Strategy | Likely Split | Reasoning |
|---|---|---|---|
| BPE | Most frequent pairs | back, pack, ing | back and pack are very common, appearing frequently in text. |
| WordPiece | Maximize likelihood | back, ##pack, ##ing | back and pack are independent and versatile, with a high likelihood of appearing in different words (e.g., backbones, package). |
| Unigram | Best overall probability | backpack, ing | Among all possible segmentations of backpacking, backpack + ing uses the longest meaningful chunks. |
| SentencePiece | Space as a character (_) | _back, pack, ing (when used with BPE) | There is a space before "back", so the space marker _ is attached to it. |

Table 1. Comparison of subword-based tokenizers (Created by Kuriko IWAI)

BPE Tokenizer

A BPE tokenizer focuses on the raw frequency of subwords, merging the most frequent pairs into a single token while leaving less frequent pairs separate.

Vocabulary Building

In a massive training dataset of text, the algorithm sees millions of combinations.

It then ranks them by how many times they appear in the dataset, and stores the ranking in a merge table.

For example:

  1. Rank 1: t + h → th (millions of times)

  2. Rank 2: i + n → in

  3. Rank 3: in + g → ing

  4. ...

  5. Rank 1,000: back

  6. Rank 1,500: pack

  7. ...

  8. Rank 45,000: tion

  9. ...

  10. Rank 60,000: backpack

This merge table acts as a priority queue; the subword backpack ranks only 60,000th because it appears just 10,000 times in the dataset.

On the other hand, the subwords back and pack appear over a million times each (e.g., “backdrop”, “packaging”), ranking 1,000th and 1,500th, far higher than backpack.

So, the algorithm does not justify merging back and pack into backpack, due to the low frequency (and thus low rank) of backpack.
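The frequency ranking above can be sketched by counting adjacent symbol pairs over a toy corpus. The words and their counts here are illustrative, not real corpus statistics.

```python
from collections import Counter

# Toy corpus: each word is a tuple of symbols mapped to its frequency (illustrative).
corpus = {("t", "h", "i", "n"): 50, ("w", "i", "n"): 30, ("k", "i", "n", "g"): 20}

def pair_frequencies(corpus):
    """Count how often each adjacent symbol pair appears, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

pairs = pair_frequencies(corpus)
print(pairs.most_common(1))  # [(('i', 'n'), 100)]: 'i' + 'n' is the top-ranked merge
```

Real BPE training repeats this counting, merges the top pair everywhere, and counts again until the vocabulary limit is reached.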

Execution

In Table 1, initially, the tokenizer sees:

b a c k p a c k i n g

Then, it scans the entire vocabulary asking:

Which two symbols in b a c k p a c k i n g appear next to each other most often?

In Merge 1, BPE merges i and n into in because in has the highest ranking in the merge table (e.g., "thin," "win," "king").

In Merge 2, it merges in and g into ing because in + g → ing is the highest-ranked merge still applicable.

Finally, it keeps ing as a single token because it is among the highest-ranking (most frequent) subwords formed from b a c k p a c k i n g.

Applying the same logic to the remaining part, b a c k p a c k, it keeps back and pack as separate tokens because both have high rankings in the merge table.
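A minimal sketch of this merge procedure, using a hypothetical merge table that mirrors the ranking above:

```python
def bpe_encode(word, merges):
    """Apply ranked merges greedily: always merge the highest-ranked adjacent pair first."""
    symbols = list(word)
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    while True:
        # Find the adjacent pair with the best (lowest) rank in the merge table.
        candidates = [(ranks[p], i)
                      for i, p in enumerate(zip(symbols, symbols[1:])) if p in ranks]
        if not candidates:
            break
        _, i = min(candidates)
        symbols[i : i + 2] = [symbols[i] + symbols[i + 1]]  # merge the pair in place
    return symbols

# Hypothetical merge table, highest priority first.
merges = [("i", "n"), ("in", "g"), ("b", "a"), ("ba", "c"), ("bac", "k"),
          ("p", "a"), ("pa", "c"), ("pac", "k")]
print(bpe_encode("backpacking", merges))  # ['back', 'pack', 'ing']
```

Because back + pack is absent from the merge table, the two tokens are never fused into backpack, exactly as described above.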

Advantages

  • High utility: All subwords (back, pack, and ing) are very common building blocks in text, extendable to thousands of other words.

  • Greedy efficiency: As the merge table shows, the algorithm strictly follows the merge priority based on frequency. Pairs ranked beyond the vocabulary limit (e.g., beyond 50,000) are simply excluded from the vocabulary.

Limitations

  • Fragile to typos: A typo like “backpaking” can be split into back, pak, and ing, leaving pak (a rare, likely unknown piece) behind. The model then gets confused, thinking the user is "back-ing" something.

Use Cases

  • Used by the GPT family.

Suitable Tasks

  • Creative writing and chat: Compress common phrases, making generation faster and more efficient.

  • High-speed LLMs: Great for models where memory efficiency and inference speed are top priorities.

WordPiece Tokenizer

WordPiece merges tokens based on maximizing the likelihood of the training data.

Vocabulary Building

WordPiece measures how much more likely two subwords are to appear together than to co-occur by chance, using a likelihood score:

\text{Score} = \frac{P(uv)}{P(u)P(v)} \quad \cdots (1)

where

  • P(uv): The togetherness, indicating the probability of seeing the sequence uv (like back + pack) appear together in the text.

  • P(u) P(v): The coincidence, indicating the probability of u and v appearing next to each other by pure chance.

The logic behind the score is that if both u and v are common, they will naturally bump into each other often without truly belonging together as uv.

So, a high score indicates priority in merging u and v. For example:

  • uv = electromagnet (u = electro, v = magnet)

  • P(uv) = 0.25: Not very high because the word electromagnet is not common in general text.

  • P(u)P(v) = 0.005: Very low because both electro and magnet are very specific terms.

  • Score = 0.25/0.005 = 50: Very high score.

WordPiece sees this as a strong dependence on each other for meaning and prioritizes merging them.

A low score indicates that each component is independent. For example:

  • uv = th (u = t, v = h)

  • P(uv) = 0.85: Very high because th is in many words like "the", "this", "there".

  • P(u)P(v) = 0.95: Extremely high because t and h are in almost every English word.

  • Score = 0.85/0.95 = 0.89: Very low score.

WordPiece waits a long time before merging t and h because each is so common individually.
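Equation 1 with the illustrative probabilities above (the values come from the text's examples, not real corpus estimates):

```python
def wordpiece_score(p_uv, p_u_times_p_v):
    """Likelihood score from Eq. (1): P(uv) / (P(u) * P(v))."""
    return p_uv / p_u_times_p_v

# electro + magnet: rare together, but far rarer apart -> merge.
print(wordpiece_score(0.25, 0.005))           # 50.0: high score, strong dependence
# t + h: common together, but also common apart -> wait.
print(round(wordpiece_score(0.85, 0.95), 2))  # 0.89: low score, mere coincidence
```

The ratio, rather than the raw pair frequency, is what separates WordPiece's merge decisions from BPE's.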

Greedy Longest Match

Once the vocabulary is built, WordPiece uses a Greedy Longest Match (also called MaxMatch) strategy to split a word.

In Table 1, the tokenizer first searches for the whole word "backpacking" in its vocabulary and recognizes it’s not there.

So, it shortens the search, checking "backpackin"... "backpack"... "backpa"... until it finds the longest matching string in its vocabulary, “back”.

Then, it moves on to the remaining string, packing, searching for tokens starting with the continuation marker ##: ##packing, ##packin, …, and finds ##pack.

The same process is applied to the rest, ing, finding ##ing.
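A minimal sketch of Greedy Longest Match, assuming a tiny hypothetical vocabulary:

```python
def maxmatch(word, vocab):
    """WordPiece Greedy Longest Match: take the longest prefix found in the
    vocabulary, then continue on the remainder with the ## continuation marker."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # continuation pieces carry the ## marker
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1                   # shorten the candidate and retry
        else:
            return ["[UNK]"]           # no prefix matched at all
        start = end
    return tokens

# Hypothetical vocabulary containing the subwords from Table 1.
vocab = {"back", "##pack", "##ing"}
print(maxmatch("backpacking", vocab))  # ['back', '##pack', '##ing']
```

The greedy scan always commits to the longest match it can find, which is exactly where the suboptimal-split risk discussed below comes from.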

Advantages

  • Meaning preservation: The likelihood score, which measures how strongly components stick together, can identify more meaningful subwords than BPE, which considers P(uv) only.

  • Vocabulary efficiency: A perfect middle ground between word-based and character-based, ensuring every word can be represented by a smaller, fixed vocabulary.

  • Contextual awareness: The likelihood score can pre-optimize the vocabulary for LLMs to predict next tokens.

Limitations

  • Suboptimal split risk: The Greedy Longest Match can lead to suboptimal splits (e.g., hello might be split into he and llo because both are in the vocabulary, even though keeping hello as a single token would sometimes be better).

  • Dependency on spaces/language bias: WordPiece relies on spaces to pre-split sentences, making it ineffective for languages without spaces (e.g., Chinese, Japanese, Thai).

Use Cases

  • Used by BERT and DistilBERT.

Suitable Tasks

Tasks which require deep understanding of the given sentences are suitable:

  • Natural Language Understanding (NLU) such as Question Answering or Logic Reasoning.

  • Sequence classification: Determines the intent or category of a whole sentence.

  • Named entity recognition: Identifying people, places, and organizations where subword boundaries are meaningful.

Unigram

Unigram splits text using a vocabulary built through a pruning strategy.

Vocabulary Building

Unlike BPE or WordPiece, which build their vocabulary from the ground up, Unigram starts with a massive vocabulary containing every possible substring.

Then, Unigram prunes it down to the maximum vocabulary size it is allowed, asking:

Which combination of subwords gives this sentence the highest overall probability?

For a word like backpacking, Unigram first considers many possible segmentations S:

  • S1 = { back, pack, ing }

  • S2 = { backpack, ing }

  • S3 = { backpacking }

  • S4 = { b, a, c, …, g }

Then, based on a global probability model, it calculates the overall probability of the word backpacking under each segmentation:

P(S) = \prod_{x \in S} P(x) \quad \cdots (2)

where P(x) is the probability of token x in the segmentation S.

For example:

  • S1 = { back, pack, ing } → P(S1) = P(back) P(pack) P(ing) = 0.1 × 0.2 × 0.5 = 0.01

  • S2 = { backpack, ing } → P(S2) = P(backpack) P(ing) = 0.05 × 0.5 = 0.025

When Unigram splits the word into more subwords, the probability gets lower because it multiplies more numbers between 0 and 1.

Consequently, Unigram naturally favors the longest meaningful chunks in its vocabulary (S2 in this case) because they tend to achieve the highest probability.
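Equation 2 with the illustrative probabilities above (hypothetical values, not real model estimates):

```python
import math

# Hypothetical unigram probabilities from the text's example.
probs = {"back": 0.1, "pack": 0.2, "ing": 0.5, "backpack": 0.05}

def seg_probability(segmentation):
    """P(S) = product of P(x) over all tokens x in S (Eq. 2)."""
    return math.prod(probs[x] for x in segmentation)

p1 = seg_probability(["back", "pack", "ing"])  # 0.1 * 0.2 * 0.5 = 0.01
p2 = seg_probability(["backpack", "ing"])      # 0.05 * 0.5 = 0.025
print(p1 < p2)  # True: the shorter segmentation S2 wins
```

A real Unigram tokenizer evaluates all candidate segmentations efficiently with the Viterbi algorithm rather than enumerating them.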

Redundancy Test

After computing P’s for all possible S’s, Unigram runs the redundancy test, asking:

If I delete the single token backpacking from my vocabulary, how much do I need to compromise my ability to explain the dataset?

This compromise is quantified as a loss for every token.

For example:

  • Case A (Keep): If the probability from Equation 2 for S3 = { backpacking } is much higher than for S2 = { backpack, ing }, the loss of removing backpacking is huge. So, the model keeps it.

  • Case B (Discard): If the probability for S3 is only slightly higher than for S2, the loss of removing backpacking is limited.

When the model can explain the word "backpacking" almost as well using two tokens it already owns (backpack + ing), keeping the long token backpacking is a waste of a vocabulary slot.

So, the model discards it, allocating the slot to words that cannot be easily split, like rhythm or quartz.
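A toy version of the redundancy test, comparing the best log-likelihood with and without the candidate token. The probabilities are illustrative, and real Unigram training sums this loss over the whole corpus.

```python
import math

# Hypothetical token probabilities; 'backpacking' is a rare whole-word token.
probs = {"back": 0.1, "pack": 0.2, "ing": 0.5, "backpack": 0.05, "backpacking": 0.001}

def best_logprob(segmentations):
    """Log-probability of the best available segmentation."""
    return max(sum(math.log(probs[t]) for t in seg) for seg in segmentations)

with_token = best_logprob([["backpacking"], ["backpack", "ing"]])
without_token = best_logprob([["backpack", "ing"]])  # token deleted from vocabulary
loss = with_token - without_token
print(loss)  # 0.0: the split explains the word just as well, so the token is redundant
```

Because backpack + ing already beats the whole-word token here, deleting backpacking costs nothing, which is exactly Case B above.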

Advantages

  • Subword regularization: Shuffles segmentations during training to provide multiple possible segmentations, making the model more robust to typos and noise.

  • Optimal splits: Uses the Viterbi algorithm to find the segmentation that maximizes the probability of the entire sequence, making it more accurate at finding meaningful linguistic roots than greedy BPE/WordPiece.

  • Flexibility in vocabulary pruning: Removes tokens whose removal barely increases the loss, allowing the model to keep long but useful words as single tokens while discarding shorter, less useful subwords (which BPE might keep just because they appear frequently in the training dataset).

Limitations

  • Computational expense: calculating probabilities for all possible segmentations of a string is costly.

  • Needs a high-quality initial seed vocabulary to work effectively.

Use Cases

  • Used by ALBERT and T5.

Suitable Tasks

Specialized domains and high-precision modeling.

  • Domain-specific models: Technical domains like medicine, law, or code can leverage Unigram’s segmentations because a single word might have very specific internal structures.

  • Probabilistic research: Useful for making LLMs more robust to typos or unusual spellings.

SentencePiece

SentencePiece treats the entire sentence, including spaces, as a raw stream of characters.

For the sentence “I am backpacking now.”, SentencePiece produces the raw stream:

_I _am _backpacking _now.

Then, it applies a tokenization algorithm like BPE or Unigram to find the best subwords.
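The boundary-marking step alone can be sketched in a few lines. The real library is sentencepiece, which uses the meta-symbol ▁ (U+2581) rather than the plain underscore used here for readability:

```python
MARKER = "_"  # stands in for SentencePiece's real meta-symbol, U+2581

def to_raw_stream(text):
    """Prefix every word with the space marker, turning spaces into ordinary symbols."""
    return [MARKER + w for w in text.split(" ")]

def detokenize(pieces):
    """Lossless reverse: markers map back to spaces exactly.
    (Simplified: assumes the marker never occurs inside a word.)"""
    return "".join(pieces).replace(MARKER, " ").lstrip(" ")

stream = to_raw_stream("I am backpacking now.")
print(stream)              # ['_I', '_am', '_backpacking', '_now.']
print(detokenize(stream))  # 'I am backpacking now.'
```

Because the marker carries the space information inside the tokens themselves, the original sentence is recovered exactly, which is the lossless-detokenization property discussed below.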

Advantages

  • Lossless detokenization: Because SentencePiece uses a physical token _ for a space, it is easy to reconstruct the original sentence exactly.

  • No need for pre-tokenization: SentencePiece can map noisy raw text like emojis, double spacing, or code indentation into tokens (e.g., _ _ for two spaces), while other tokenizers discard them.

  • Language agnostic: SentencePiece can apply the same raw-stream treatment to any language, including those without spaces.

Limitations

  • SentencePiece itself is a wrapper, not an algorithm.

Use Cases

  • Standard for modern models. Used by XLM-RoBERTa, mBART, Llama 3, T5, and Mistral.

Suitable Tasks

Multilingual models benefit most from SentencePiece.

  • Multilingual applications: Treating a space just like a character using _ works perfectly for languages that don't use spaces such as Chinese, Japanese, or Thai.

  • Dirty/Web-Scraped Data: Robust for handling noisy text data like raw web text with full of emojis or code snippets.

Wrapping Up

Refer to the comparison table below to determine the most suitable tokenizer for your specific task.

| If your task is... | Use this Tokenizer | Logic | Example Models |
|---|---|---|---|
| Simple logic / small datasets | Word-based | Splits by whitespace. | Early RNNs / spaCy |
| Handling typos / emojis | Character-based | Splits into every letter. | CANINE / pixel-based models |
| Generating long English stories / chats | BPE | Most frequent byte pairs. | GPT-4 / RoBERTa |
| Classifying or deeply understanding text | WordPiece | Likelihood-based splits. | BERT / DistilBERT |
| Processing domain-specific (medical, legal, etc.) data | Unigram | Probabilistic pruning. | ALBERT |
| Building a multilingual translator | SentencePiece | Treats space as a character (_). | T5 / Llama 3 |

Table 2. Suitable tokenizers for tasks (Created by Kuriko IWAI)

Tokenization is a critical component in the LLM pipeline, directing how a model interprets and generates human language.

As the field evolves, new concepts like token-free models are changing traditional boundaries and further improving computational efficiency.


Related Books for Further Understanding

These books cover a wide range of theory and practice, from fundamentals to the PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps

Hands-On Large Language Models: Language Understanding and Generation



Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.