Entropy, Surprise, and Coding Limits in Machine Learning

Shannon entropy as average surprise and minimum lossless code length, using dice and coding examples to link uncertainty, compressibility, and optimal coding. Covers uniform vs nonuniform distributions, Shannon’s noiseless coding theorem, maximum-entropy derivation via Lagrange multipliers, and connections to cross-entropy, KL divergence, decision trees, and data compression.

Fri, Nov 28th
Tags: information theory, entropy, machine learning, pattern recognition
Created: 2025-12-15 · Updated: 2025-12-15

Entropy measures the average surprise of the universe and the minimum cost of describing it.

Imagine you want to send the outcome of rolling a die to a friend using binary codes.

  • If the die is fair, each outcome is equally likely, so you’ll need enough bits to represent all 6 outcomes.
  • If the die is biased (say it lands on 6 most of the time), you can save space by using shorter codes for frequent outcomes and longer codes for rare ones.

Entropy tells you the minimum average number of bits required to communicate the outcome of a random variable without loss.
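
As a quick sanity check, here is a minimal Python sketch of this idea (the biased-die probabilities are made up for illustration):

```python
from math import log2

def entropy_bits(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in probs if p > 0)

fair_die = [1/6] * 6                               # every face equally likely
biased_die = [0.04, 0.04, 0.04, 0.04, 0.04, 0.80]  # lands on 6 most of the time (assumed values)

print(entropy_bits(fair_die))    # ~2.585 bits = log2(6)
print(entropy_bits(biased_die))  # ~1.19 bits: fewer bits needed on average
```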

Entropy thus quantifies:

  • Uncertainty: how unpredictable the variable is
  • Compressibility: how efficiently it can be encoded

In machine learning, this forms the basis of:

  • Cross-entropy loss
  • Decision tree splits (information gain)
  • Bayesian inference
  • Data compression

Example

Consider a random variable $x$ having 8 possible states, each of which is equally likely...

For a uniform variable:

$$p(x_i) = \frac{1}{8}, \quad i = 1, \dots, 8$$

You need $\log_2 8 = 3$ bits to represent each state (since $2^3 = 8$).

Entropy:

$$H[x] = -\sum_{i=1}^{8} \frac{1}{8}\log_2\frac{1}{8} = -8\left(\frac{1}{8}\right)\log_2\frac{1}{8} = 3~\text{bits}$$

Entropy = minimum average code length.


We see that the nonuniform distribution has a smaller entropy than the uniform one...

If probabilities are unequal, uncertainty decreases on average, and fewer bits are needed because predictable events carry less information.


We can take advantage of the nonuniform distribution by using shorter codes for the more probable events...

This principle underlies Huffman and arithmetic coding:

  • Frequent events → shorter binary codes (e.g., 0)
  • Rare events → longer codes (e.g., 1101)

This reduces the average message length while preserving all information.
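
As a sketch of how this works in practice, the snippet below builds a Huffman code for an assumed nonuniform distribution (the symbols and probabilities are illustrative, not taken from the text) and compares the average code length with the entropy:

```python
import heapq
from math import log2

def huffman_code(probs):
    """Build a Huffman code for a dict {symbol: probability}; returns {symbol: bitstring}."""
    # Heap entries: (probability, tie-breaker, partial code table)
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # the two least probable subtrees...
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))  # ...are merged under a new parent
        counter += 1
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}  # 'a' frequent, 'c'/'d' rare (illustrative)
code = huffman_code(probs)
avg_len = sum(probs[s] * len(code[s]) for s in probs)
H = -sum(p * log2(p) for p in probs.values())

print(code)        # {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
print(avg_len, H)  # 1.75 1.75
```

With these particular probabilities the average length equals the entropy exactly, because every probability is a power of 1/2; in general, Huffman coding lands within one bit of the entropy bound.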


Shannon’s Noiseless Coding Theorem (1948)

Entropy is a lower bound on the number of bits needed to transmit the state of a random variable.

Mathematically:

$$L_{\text{avg}} \ge H[x]$$

where $L_{\text{avg}}$ is the average number of bits per symbol.

🔹 Entropy is the theoretical limit of lossless compression.
No encoding scheme can achieve an average code length below $H[x]$.

Entropy sets the floor; clever coding only dances around it.


Entropy as Average Information

Entropy measures the expected surprise:

$$H[x] = \mathbb{E}[h(x)] = -\sum_x p(x)\log_2 p(x)$$
  • $h(x) = -\log_2 p(x)$: information from observing one event
  • $H[x]$: average information across all possible events

Interpretation:

  • $H[x]$ is largest when all outcomes are equally likely (maximum uncertainty)
  • $H[x] = 0$ when one outcome is certain
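
A minimal sketch of both quantities (NumPy is assumed available; the probabilities are illustrative):

```python
import numpy as np

def surprise_bits(p):
    """Information content h(x) of a single event with probability p, in bits."""
    return -np.log2(p)

def entropy_bits(probs):
    """Expected surprise: H[x] = -sum_x p(x) log2 p(x); zero-probability terms contribute 0."""
    probs = np.asarray(probs, dtype=float)
    nz = probs[probs > 0]
    return float(-(nz * np.log2(nz)).sum())

print(surprise_bits(0.5), surprise_bits(0.01))  # 1.0 vs ~6.64: rarer events are more surprising
print(entropy_bits([0.25] * 4))                 # 2.0 — uniform over 4 states: maximum uncertainty
print(entropy_bits([1.0, 0.0, 0.0, 0.0]))       # 0.0 — one outcome certain: no uncertainty
```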

Entropy and Distribution Shape

Distributions $p(x_i)$ that are sharply peaked have low entropy; those spread more evenly have higher entropy.

Formally:

$$H[x] = -\sum_i p_i \log p_i \ge 0$$

Equality ($H[x] = 0$) holds when one $p_k = 1$ and all others $p_j = 0$.
Uniform distributions, $p(x_i) = 1/M$, maximize entropy:

$$H_{\max} = \log_2 M$$
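
For example, with $M = 4$ states (a value chosen purely for illustration), a sharply peaked distribution such as $[0.7, 0.1, 0.1, 0.1]$ stays well below the uniform bound:

$$H_{\text{uniform}} = \log_2 4 = 2~\text{bits}, \qquad H_{\text{peaked}} = -\left[0.7\log_2 0.7 + 3 \times 0.1\log_2 0.1\right] \approx 1.36~\text{bits}$$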

Deriving Maximum Entropy via Lagrange Multipliers

We maximize:

$$H = -\sum_i p(x_i)\ln p(x_i)$$

subject to the constraint:

$$\sum_i p(x_i) = 1$$

Using a Lagrange multiplier $\lambda$:

$$\mathcal{L} = -\sum_i p_i\ln p_i + \lambda\left(\sum_i p_i - 1\right)$$

Setting the derivative with respect to $p_i$ to zero:

$$-\ln p_i - 1 + \lambda = 0 \quad\Rightarrow\quad p_i = e^{\lambda - 1}$$

Since $e^{\lambda - 1}$ is the same for every $i$, all the $p_i$ are equal, and the normalization constraint then gives $p_i = 1/M$.

Substituting back into $H$:

$$H = \ln M$$

or in bits:

$$H = \log_2 M$$

Maximum entropy occurs for the uniform distribution.
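
As a numerical cross-check of this derivation (a sketch assuming NumPy and SciPy are available; $M = 6$ is an arbitrary choice), we can maximize the entropy over the probability simplex directly:

```python
import numpy as np
from scipy.optimize import minimize

M = 6  # number of states; value chosen for illustration

def neg_entropy(p):
    # Negative entropy in nats (we minimize this); clipping guards against log(0).
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))

result = minimize(
    neg_entropy,
    x0=np.random.dirichlet(np.ones(M)),                            # random start on the simplex
    bounds=[(0.0, 1.0)] * M,                                       # each p_i in [0, 1]
    constraints={"type": "eq", "fun": lambda p: np.sum(p) - 1.0},  # sum_i p_i = 1
)

print(np.round(result.x, 3))   # ≈ [0.167, 0.167, ..., 0.167], i.e. the uniform distribution
print(-result.fun, np.log(M))  # both ≈ 1.792 nats = ln 6
```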


Concrete Examples

Example 1: Uniform variable (8 outcomes)

$$H = -8\left(\frac{1}{8}\right)\log_2\left(\frac{1}{8}\right) = 3~\text{bits}$$

→ Need 3 bits per symbol (no compression possible).


Example 2: Nonuniform variable

$$p = [0.5,\ 0.25,\ 0.25]$$

$$H = -\left[0.5\log_2 0.5 + 0.25\log_2 0.25 + 0.25\log_2 0.25\right] = 1.5~\text{bits}$$

→ Average message length of 1.5 bits, less than the $\log_2 3 \approx 1.58$ bits required if the three outcomes were equally likely: a more predictable system.
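
A quick numeric check of this example, including the ideal per-symbol code lengths $-\log_2 p_i$ (a sketch; the outcomes are left unnamed):

```python
from math import log2

p = [0.5, 0.25, 0.25]
H = -sum(pi * log2(pi) for pi in p)
code_lengths = [-log2(pi) for pi in p]                     # ideal code lengths in bits
avg_len = sum(pi * li for pi, li in zip(p, code_lengths))

print(H)             # 1.5 bits
print(code_lengths)  # [1.0, 2.0, 2.0] — e.g. the prefix codes '0', '10', '11'
print(avg_len)       # 1.5 — this prefix code exactly meets the entropy bound
print(log2(3))       # ~1.585 bits: the entropy if the three outcomes were equally likely
```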


Entropy and Coding Intuition

| Distribution Type | Description | Entropy | Predictability | Coding Efficiency |
| --- | --- | --- | --- | --- |
| Uniform | All outcomes equally likely | 🔺 High (max) | Very uncertain | No compression |
| Nonuniform | Some outcomes dominate | 🔻 Moderate | Some predictability | Compressible |
| Deterministic | One outcome certain | 🟢 Zero | Fully predictable | No info to send |

Common Pitfalls

  • Lower entropy ≠ better: It means more predictable, not necessarily more desirable.
  • Confusing entropy with disorder: In information theory, it’s about uncertainty, not chaos.
  • Ignoring the coding interpretation: Entropy literally means the minimum average number of bits per symbol.
  • Forgetting the log base (see the conversion sketch after this list):
    • log₂ → bits
    • ln → nats
  • Believing you can beat entropy:
    You can’t. It’s the fundamental limit of lossless compression.
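
A quick check of the base conversion, reusing the Example 2 probabilities:

```python
from math import log, log2

p = [0.5, 0.25, 0.25]  # Example 2 distribution

H_bits = -sum(pi * log2(pi) for pi in p)
H_nats = -sum(pi * log(pi) for pi in p)

print(H_bits)           # 1.5 bits
print(H_nats)           # ~1.0397 nats
print(H_bits * log(2))  # ~1.0397 — bits convert to nats by multiplying by ln 2
```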

Broader Connections

| Concept | Formula | Role / Interpretation |
| --- | --- | --- |
| Cross-Entropy | $-\sum_x q(x)\log p(x)$ | Measures how far the predicted distribution $p$ diverges from the true $q$ |
| KL Divergence | $\sum_x p(x)\log\frac{p(x)}{q(x)}$ | Extra bits needed when using $q$ instead of the true $p$ |
| Information Gain | $\Delta H = H_{\text{before}} - H_{\text{after}}$ | Reduction in uncertainty; used in decision trees |
| Model Confidence | Entropy of predicted probabilities | Lower entropy → more confident model |
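
A small sketch tying the first two rows together (the distributions are made up for illustration; to sidestep the p/q naming in the table, the code uses explicit `p_true` and `q_model` labels):

```python
import numpy as np

def entropy(dist):
    return -np.sum(dist * np.log2(dist))

def cross_entropy(p_true, q_model):
    # Average bits when events drawn from p_true are coded with a code optimized for q_model.
    return -np.sum(p_true * np.log2(q_model))

def kl_divergence(p_true, q_model):
    # Extra bits paid for using q_model instead of p_true; equals cross-entropy minus entropy.
    return np.sum(p_true * np.log2(p_true / q_model))

p_true = np.array([0.5, 0.25, 0.25])   # illustrative "true" distribution
q_model = np.array([0.4, 0.4, 0.2])    # illustrative model prediction

print(entropy(p_true))                 # 1.5 bits
print(cross_entropy(p_true, q_model))  # ~1.572 bits
print(kl_divergence(p_true, q_model))  # ~0.072 bits = cross-entropy - entropy
```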

Entropy links uncertainty, knowledge, and efficiency.
To learn is to reduce entropy, to turn surprise into understanding.

Links to

  • [[Information Theory: The Architecture of Uncertainty and Human Power]]