A Deep Dive into N-gram Language Models

Exploring the fundamentals of statistical language modeling, from basic concepts to evaluation.

What is a Language Model?

At its core, a language model (LM) is a probability distribution over sequences of words. Its most common job is to answer a simple question: given the words so far, how likely is each possible next word?

For example, for the phrase "The cat sat on the...", a good language model would assign a high probability to the word "mat" and a very low probability to a word like "skyscraper". Language models are the backbone of many Natural Language Processing (NLP) tasks like machine translation, speech recognition, and the autocomplete on your phone's keyboard.

1. N-Grams: The Building Blocks

An N-gram is a contiguous sequence of N items (in our case, words) from a given sample of text. The "N" can be any positive integer, defining the size of the word chunk.

1-gram (Unigram)

A single word. For "The quick brown fox", the unigrams are "The", "quick", "brown", "fox".

2-gram (Bigram)

A sequence of two words. The bigrams are ("The", "quick"), ("quick", "brown"), ("brown", "fox").

3-gram (Trigram)

A sequence of three words. The trigrams are ("The", "quick", "brown"), ("quick", "brown", "fox").
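
To make these definitions concrete, here is a minimal Python sketch (the function name `extract_ngrams` is just illustrative, not from any library) that slices a token list into n-grams:

```python
def extract_ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The quick brown fox".split()
print(extract_ngrams(tokens, 1))  # [('The',), ('quick',), ('brown',), ('fox',)]
print(extract_ngrams(tokens, 2))  # [('The', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(extract_ngrams(tokens, 3))  # [('The', 'quick', 'brown'), ('quick', 'brown', 'fox')]
```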

How N-Grams Create a Language Model

N-gram models are based on a simplifying assumption called the Markov assumption. It states that the probability of the next word depends only on the previous N-1 words.
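
In a bigram model (N = 2), for instance, the probability of an entire sentence factors into a product of conditionals that each look back only one word (treating \( w_0 \) as a start-of-sentence marker):

$$ P(w_1, w_2, ..., w_m) \approx \prod_{i=1}^{m} P(w_i | w_{i-1}) $$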

We estimate these conditional probabilities from a large body of text called a corpus. For bigrams, each one is a simple ratio of counts:

$$ P(w_i | w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i)}{\text{Count}(w_{i-1})} $$

This reads: "The probability of word \( w_i \) appearing after word \( w_{i-1} \) is the number of times we saw the pair \( (w_{i-1}, w_i) \) divided by the number of times we saw \( w_{i-1} \)."

Example: Building a Bigram Model

Let's use a tiny corpus:

"the cat sat on the mat the cat is cute the dog sat on the floor"

From this corpus, Count("the") = 5 and Count("the", "cat") = 2, so \( P(\text{"cat"} | \text{"the"}) = \frac{2}{5} = 0.4 \).
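
A minimal Python sketch of this counting (written here for illustration, not taken from any library) reproduces the calculation above:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat is cute the dog sat on the floor"
tokens = corpus.split()

# Count unigrams and bigrams in the corpus.
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev_word, word):
    """Maximum-likelihood estimate: Count(prev, word) / Count(prev)."""
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("the", "cat"))  # 2 / 5 = 0.4
print(bigram_prob("cat", "sat"))  # 1 / 2 = 0.5
```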

2. Generalization, Zeros, and Smoothing

A good language model must generalize from the training corpus to predict unseen but perfectly valid sentences. The biggest barrier to generalization is the problem of zeros. What happens if we encounter an n-gram that never appeared in our training data?

Problem: What is \( P(\text{"purrs"} | \text{"cat"}) \) using our corpus?

  • Count("cat", "purrs") = 0
  • Resulting Probability: \( \frac{0}{\text{Count("cat")}} = 0 \)

Our model states that the phrase "cat purrs" is impossible. This is the zero-probability problem, also known as data sparsity. It prevents our model from generalizing.

Solution: We use smoothing techniques to redistribute probability mass from seen events to unseen events. This ensures no n-gram has a probability of exactly zero.

Laplace (Add-One) Smoothing

The simplest technique. We add 1 to every numerator count and add the total vocabulary size (V) to the denominator.

$$ P_{\text{Laplace}}(w_i | w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i) + 1}{\text{Count}(w_{i-1}) + V} $$

Example: Applying Laplace Smoothing

From our corpus, the vocabulary size is V = 9 (the distinct words: the, cat, sat, on, mat, is, cute, dog, floor), and Count("cat") = 2. Revisiting the zero from above:

$$ P_{\text{Laplace}}(\text{"purrs"} | \text{"cat"}) = \frac{0 + 1}{2 + 9} = \frac{1}{11} \approx 0.09 $$

"Cat purrs" no longer gets zero probability; it simply gets a small one.
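
As a quick check, here is the same calculation in code, extending the earlier bigram sketch. It treats the 9 distinct corpus words as the whole vocabulary; a real system would also reserve an unknown-word token for out-of-vocabulary words like "purrs", a detail this sketch glosses over.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat is cute the dog sat on the floor"
tokens = corpus.split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)  # vocabulary size: 9 distinct words

def laplace_bigram_prob(prev_word, word):
    """Add-one smoothed estimate: (Count(prev, word) + 1) / (Count(prev) + V)."""
    return (bigram_counts[(prev_word, word)] + 1) / (unigram_counts[prev_word] + V)

print(laplace_bigram_prob("cat", "purrs"))  # (0 + 1) / (2 + 9) ≈ 0.091
print(laplace_bigram_prob("the", "cat"))    # (2 + 1) / (5 + 9) ≈ 0.214
```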

3. The Web and Stupid Backoff

When working with enormous, web-scale corpora (billions or trillions of words), sophisticated smoothing algorithms like Kneser-Ney can be computationally expensive. Stupid Backoff is a simple, efficient, and surprisingly effective alternative designed for this scale.

How Stupid Backoff Works

The idea is incredibly simple: if we have evidence for a high-order n-gram, we use it. If not, we "back off" to a lower-order n-gram, multiplying its score by a fixed weight (typically 0.4). We continue this until we reach unigrams.

It's called "stupid" because it doesn't bother producing a true probability distribution (the scores don't sum to 1). However, for tasks like ranking the most likely next word, it performs very well.

$$ S(w_i | w_{i-n+1}^{i-1}) = \begin{cases} \frac{\text{count}(w_{i-n+1}^{i})}{\text{count}(w_{i-n+1}^{i-1})} & \text{if count}(w_{i-n+1}^{i}) > 0 \\ \alpha \cdot S(w_i | w_{i-n+2}^{i-1}) & \text{otherwise} \end{cases} $$

Where \( S \) is the score and \( \alpha \) is the backoff factor (e.g., 0.4).
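
A compact recursive sketch of this rule (illustrative code, not any library's API) takes a dictionary of n-gram counts and a table of pre-computed smoothed unigram probabilities:

```python
def stupid_backoff_score(context, word, ngram_counts, unigram_probs, alpha=0.4):
    """Score `word` after the tuple `context` using Stupid Backoff.

    ngram_counts  : dict mapping word tuples of any length to their corpus counts.
    unigram_probs : dict of pre-computed smoothed unigram probabilities.
    """
    if not context:
        # Base case: the score of a bare unigram is its smoothed probability.
        return unigram_probs.get(word, 0.0)

    full_ngram = context + (word,)
    if ngram_counts.get(full_ngram, 0) > 0:
        # Evidence found for the full n-gram: use the plain ratio of counts.
        return ngram_counts[full_ngram] / ngram_counts[context]

    # No evidence: drop the earliest context word and apply the penalty alpha.
    return alpha * stupid_backoff_score(context[1:], word, ngram_counts, unigram_probs, alpha)
```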

Example: Calculating a Score with Stupid Backoff

Let's score two candidate words following the context "the cat", first "is" and then "runs", using a trigram model over our tiny corpus with \( \alpha = 0.4 \).

  1. Attempt Trigram: Check for "the cat is".
    count("the cat is") = 1 > 0
    count("the cat") = 2 > 0
    Score = 1 / 2 = 0.5

    Success! We found the trigram and calculated the score directly.

  2. Attempt another Trigram: Let's calculate \(S(\text{"runs"}|\text{"the cat"})\).
    count("the cat runs") = 0

    Trigram not found. We must back off.

  3. Back off to Bigram: Now we calculate \( \alpha \cdot S(\text{"runs"}|\text{"cat"}) \).
    count("cat runs") = 0

    Bigram not found. We back off again.

  4. Back off to Unigram: Now we calculate \( \alpha \cdot \alpha \cdot S(\text{"runs"}) \). The score for a unigram is its pre-computed (smoothed) probability. Let's assume \(P(\text{"runs"}) = 0.01\).
    Final Score = 0.4 * 0.4 * P("runs") = 0.16 * 0.01 = 0.0016

    We arrived at a non-zero score through backoff.
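
The same numbers fall out of the `stupid_backoff_score` sketch above when we feed it counts from our tiny corpus plus the assumed smoothed unigram value \( P(\text{"runs"}) = 0.01 \):

```python
from collections import Counter

tokens = "the cat sat on the mat the cat is cute the dog sat on the floor".split()

# Pool unigram, bigram, and trigram counts into one dictionary keyed by tuples.
ngram_counts = {}
for n in (1, 2, 3):
    ngram_counts.update(Counter(zip(*(tokens[i:] for i in range(n)))))

# Assumed smoothed unigram probability from the worked example; "is" never needs
# one because its trigram is found directly.
unigram_probs = {"runs": 0.01}

print(stupid_backoff_score(("the", "cat"), "is", ngram_counts, unigram_probs))    # 1 / 2 = 0.5
print(stupid_backoff_score(("the", "cat"), "runs", ngram_counts, unigram_probs))  # 0.4 * 0.4 * 0.01 = 0.0016
```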

4. Evaluating Language Models

How do we know if our model is any good? We measure its performance on a test set—a separate chunk of text the model was not trained on. The most common metric is Perplexity.

Perplexity (PP)

Intuition: Perplexity measures how "surprised" a model is by new text. A lower perplexity is better, indicating the model was less surprised and did a better job of predicting the text.

Perplexity is the inverse probability of the test set, normalized by the number of words (N).

$$ PP(W) = \sqrt[N]{\frac{1}{P(w_1, w_2, ..., w_N)}} $$
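
In practice, multiplying many probabilities below 1 quickly underflows, so perplexity is usually computed from the average log probability, which is mathematically equivalent:

$$ PP(W) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i | w_1, ..., w_{i-1}) \right) $$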

Example: Calculating Perplexity

Let's evaluate a bigram model on a new sentence: "the cat sat".

1. Calculate sentence probability:

$$ P(\text{"the cat sat"}) \approx P(\text{"the"}) \times P(\text{"cat"} | \text{"the"}) \times P(\text{"sat"} | \text{"cat"}) $$

Imagine our model gives these probabilities:

  • \( P(\text{"the"}) = 0.1 \)
  • \( P(\text{"cat"} | \text{"the"}) = 0.3 \)
  • \( P(\text{"sat"} | \text{"cat"}) = 0.4 \)

\( P(\text{sentence}) = 0.1 \times 0.3 \times 0.4 = 0.012 \)

2. Calculate Perplexity:

The sentence has N=3 words.

$$ PP(W) = \sqrt[3]{\frac{1}{0.012}} = \sqrt[3]{83.33} \approx 4.37 $$

The perplexity is 4.37. We would compare this to other models' perplexity on the same test set to determine which is better.
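
As a sanity check, a few lines of Python (using the assumed probabilities above, not a general evaluation harness) reproduce the same number:

```python
import math

# Bigram-model probabilities assumed in the worked example above.
word_probs = [0.1, 0.3, 0.4]  # P("the"), P("cat" | "the"), P("sat" | "cat")

sentence_prob = math.prod(word_probs)   # = 0.012
N = len(word_probs)                     # N = 3 words
perplexity = sentence_prob ** (-1 / N)  # cube root of 1 / 0.012

print(round(sentence_prob, 3))  # 0.012
print(round(perplexity, 2))     # 4.37
```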