Entropy, Surprise, and Coding Limits in Machine Learning

Shannon entropy as average surprise and minimum lossless code length, using dice and coding examples to link uncertainty, compressibility, and optimal coding. Covers uniform vs nonuniform distributions, Shannon’s noiseless coding theorem, maximum-entropy derivation via Lagrange multipliers, and connections to cross-entropy, KL divergence, decision trees, and data compression.

Fri, Nov 28th
Tags: information theory, entropy, machine learning, pattern recognition
Created: 2025-12-15 · Updated: 2025-12-15

Entropy measures the average surprise of the universe and the minimum cost of describing it.

Imagine you want to send the outcome of rolling a die to a friend using binary codes.

  • If the die is fair, each outcome is equally likely, so you’ll need enough bits to represent all 6 outcomes.
  • If the die is biased (say it lands on 6 most of the time), you can save space by using shorter codes for frequent outcomes and longer codes for rare ones.

Entropy tells you the minimum average number of bits required to communicate the outcome of a random variable without loss.
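
As a quick sanity check, here is a minimal Python sketch of this idea (the biased-die probabilities are made up for illustration):

```python
from math import log2

def entropy_bits(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in probs if p > 0)

fair_die = [1/6] * 6                               # every face equally likely
biased_die = [0.04, 0.04, 0.04, 0.04, 0.04, 0.80]  # lands on 6 most of the time (assumed values)

print(entropy_bits(fair_die))    # ~2.585 bits = log2(6)
print(entropy_bits(biased_die))  # ~1.19 bits: fewer bits needed on average
```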

Entropy thus quantifies:

  • Uncertainty: how unpredictable the variable is
  • Compressibility: how efficiently it can be encoded

In machine learning, this forms the basis of:

  • Cross-entropy loss
  • Decision tree splits (information gain)
  • Bayesian inference
  • Data compression

Example

Consider a random variable $x$ having 8 possible states, each of which is equally likely...

For a uniform variable:

$$p(x_i) = \frac{1}{8}, \quad i = 1, \dots, 8$$

You need $\log_2 8 = 3$ bits to represent each state (since $2^3 = 8$).

Entropy:

$$H[x] = -\sum_{i=1}^{8} \frac{1}{8}\log_2\frac{1}{8} = -8\left(\frac{1}{8}\right)\log_2\frac{1}{8} = 3~\text{bits}$$

Entropy = minimum average code length.


We see that the nonuniform distribution has a smaller entropy than the uniform one...

If probabilities are unequal, uncertainty decreases on average, and fewer bits are needed because predictable events carry less information.


We can take advantage of the nonuniform distribution by using shorter codes for the more probable events...

This principle underlies Huffman and arithmetic coding:

  • Frequent events → shorter binary codes (e.g., 0)
  • Rare events → longer codes (e.g., 1101)

This reduces the average message length while preserving all information.
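
As a sketch of how this works in practice, the snippet below builds a Huffman code for an assumed nonuniform distribution (the symbols and probabilities are illustrative, not taken from the text) and compares the average code length with the entropy:

```python
import heapq
from math import log2

def huffman_code(probs):
    """Build a Huffman code for a dict {symbol: probability}; returns {symbol: bitstring}."""
    # Heap entries: (probability, tie-breaker, partial code table)
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # the two least probable subtrees...
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))  # ...are merged under a new parent
        counter += 1
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}  # 'a' frequent, 'c'/'d' rare (illustrative)
code = huffman_code(probs)
avg_len = sum(probs[s] * len(code[s]) for s in probs)
H = -sum(p * log2(p) for p in probs.values())

print(code)        # {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
print(avg_len, H)  # 1.75 1.75
```

With these particular probabilities the average length equals the entropy exactly, because every probability is a power of 1/2; in general, Huffman coding lands within one bit of the entropy bound.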


Shannon’s Noiseless Coding Theorem (1948)

Entropy is a lower bound on the number of bits needed to transmit the state of a random variable.

Mathematically:

$$L_{\text{avg}} \ge H[x]$$

where $L_{\text{avg}}$ is the average number of bits per symbol.

🔹 Entropy is the theoretical limit of lossless compression.
No encoding scheme can achieve an average code length below $H[x]$.

Entropy sets the floor; clever coding only dances around it.


Entropy as Average Information

Entropy measures the expected surprise:

$$H[x] = \mathbb{E}[h(x)] = -\sum_x p(x)\log_2 p(x)$$
  • $h(x) = -\log_2 p(x)$: information from observing one event
  • $H[x]$: average information across all possible events

Interpretation:

  • $H[x]$ is largest when all outcomes are equally likely (maximum uncertainty)
  • $H[x] = 0$ when one outcome is certain
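
A minimal sketch of both quantities (NumPy is assumed available; the probabilities are illustrative):

```python
import numpy as np

def surprise_bits(p):
    """Information content h(x) of a single event with probability p, in bits."""
    return -np.log2(p)

def entropy_bits(probs):
    """Expected surprise: H[x] = -sum_x p(x) log2 p(x); zero-probability terms contribute 0."""
    probs = np.asarray(probs, dtype=float)
    nz = probs[probs > 0]
    return float(-(nz * np.log2(nz)).sum())

print(surprise_bits(0.5), surprise_bits(0.01))  # 1.0 vs ~6.64: rarer events are more surprising
print(entropy_bits([0.25] * 4))                 # 2.0 — uniform over 4 states: maximum uncertainty
print(entropy_bits([1.0, 0.0, 0.0, 0.0]))       # 0.0 — one outcome certain: no uncertainty
```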

Entropy and Distribution Shape

Distributions $p(x_i)$ that are sharply peaked have low entropy; those spread more evenly have higher entropy.

Formally:

$$H[x] = -\sum_i p_i \log p_i \ge 0$$

Equality ($H[x] = 0$) holds when one $p_k = 1$ and all others $p_j = 0$.
Uniform distributions, $p(x_i) = 1/M$, maximize entropy:

$$H_{\max} = \log_2 M$$
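
For example, with $M = 4$ states (a value chosen purely for illustration), a sharply peaked distribution such as $[0.7, 0.1, 0.1, 0.1]$ stays well below the uniform bound:

$$H_{\text{uniform}} = \log_2 4 = 2~\text{bits}, \qquad H_{\text{peaked}} = -\left[0.7\log_2 0.7 + 3 \times 0.1\log_2 0.1\right] \approx 1.36~\text{bits}$$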

Deriving Maximum Entropy via Lagrange Multipliers

We maximize:

$$H = -\sum_i p(x_i)\ln p(x_i)$$

subject to the constraint:

$$\sum_i p(x_i) = 1$$

Using a Lagrange multiplier $\lambda$:

$$\mathcal{L} = -\sum_i p_i\ln p_i + \lambda\left(\sum_i p_i - 1\right)$$

Setting the derivative with respect to $p_i$ to zero:

$$-\ln p_i - 1 + \lambda = 0 \quad\Rightarrow\quad p_i = e^{\lambda - 1}$$

Since $e^{\lambda - 1}$ is the same for every $i$, all the $p_i$ are equal, and the normalization constraint then gives $p_i = 1/M$.

Substituting back into $H$:

$$H = \ln M$$

or in bits:

$$H = \log_2 M$$

Maximum entropy occurs for the uniform distribution.
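
As a numerical cross-check of this derivation (a sketch assuming NumPy and SciPy are available; $M = 6$ is an arbitrary choice), we can maximize the entropy over the probability simplex directly:

```python
import numpy as np
from scipy.optimize import minimize

M = 6  # number of states; value chosen for illustration

def neg_entropy(p):
    # Negative entropy in nats (we minimize this); clipping guards against log(0).
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))

result = minimize(
    neg_entropy,
    x0=np.random.dirichlet(np.ones(M)),                            # random start on the simplex
    bounds=[(0.0, 1.0)] * M,                                       # each p_i in [0, 1]
    constraints={"type": "eq", "fun": lambda p: np.sum(p) - 1.0},  # sum_i p_i = 1
)

print(np.round(result.x, 3))   # ≈ [0.167, 0.167, ..., 0.167], i.e. the uniform distribution
print(-result.fun, np.log(M))  # both ≈ 1.792 nats = ln 6
```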


Concrete Examples

Example 1: Uniform variable (8 outcomes)

$$H = -8\left(\frac{1}{8}\right)\log_2\left(\frac{1}{8}\right) = 3~\text{bits}$$

→ Need 3 bits per symbol (no compression possible).


Example 2: Nonuniform variable

$$p = [0.5,\ 0.25,\ 0.25]$$

$$H = -\left[0.5\log_2 0.5 + 0.25\log_2 0.25 + 0.25\log_2 0.25\right] = 1.5~\text{bits}$$

→ Average message length of 1.5 bits, less than the $\log_2 3 \approx 1.58$ bits required if the three outcomes were equally likely: a more predictable system.
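
A quick numeric check of this example, including the ideal per-symbol code lengths $-\log_2 p_i$ (a sketch; the outcomes are left unnamed):

```python
from math import log2

p = [0.5, 0.25, 0.25]
H = -sum(pi * log2(pi) for pi in p)
code_lengths = [-log2(pi) for pi in p]                     # ideal code lengths in bits
avg_len = sum(pi * li for pi, li in zip(p, code_lengths))

print(H)             # 1.5 bits
print(code_lengths)  # [1.0, 2.0, 2.0] — e.g. the prefix codes '0', '10', '11'
print(avg_len)       # 1.5 — this prefix code exactly meets the entropy bound
print(log2(3))       # ~1.585 bits: the entropy if the three outcomes were equally likely
```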


Entropy and Coding Intuition

| Distribution Type | Description | Entropy | Predictability | Coding Efficiency |
| --- | --- | --- | --- | --- |
| Uniform | All outcomes equally likely | 🔺 High (max) | Very uncertain | No compression |
| Nonuniform | Some outcomes dominate | 🔻 Moderate | Some predictability | Compressible |
| Deterministic | One outcome certain | 🟢 Zero | Fully predictable | No info to send |

Common Pitfalls

  • Lower entropy ≠ better: It means more predictable, not necessarily more desirable.
  • Confusing entropy with disorder: In information theory, it’s about uncertainty, not chaos.
  • Ignoring the coding interpretation: Entropy literally means the minimum average number of bits per symbol.
  • Forgetting the log base (see the conversion sketch after this list):
    • log₂ → bits
    • ln → nats
  • Believing you can beat entropy:
    You can’t. It’s the fundamental limit of lossless compression.
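
A quick check of the base conversion, reusing the Example 2 probabilities:

```python
from math import log, log2

p = [0.5, 0.25, 0.25]  # Example 2 distribution

H_bits = -sum(pi * log2(pi) for pi in p)
H_nats = -sum(pi * log(pi) for pi in p)

print(H_bits)           # 1.5 bits
print(H_nats)           # ~1.0397 nats
print(H_bits * log(2))  # ~1.0397 — bits convert to nats by multiplying by ln 2
```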

Broader Connections

| Concept | Formula | Role / Interpretation |
| --- | --- | --- |
| Cross-Entropy | $-\sum_x q(x)\log p(x)$ | Measures how far the predicted distribution $p$ diverges from the true $q$ |
| KL Divergence | $\sum_x p(x)\log\frac{p(x)}{q(x)}$ | Extra bits needed when using $q$ instead of the true $p$ |
| Information Gain | $\Delta H = H_{\text{before}} - H_{\text{after}}$ | Reduction in uncertainty; used in decision trees |
| Model Confidence | Entropy of predicted probabilities | Lower entropy → more confident model |
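
A small sketch tying the first two rows together (the distributions are made up for illustration; to sidestep the p/q naming in the table, the code uses explicit `p_true` and `q_model` labels):

```python
import numpy as np

def entropy(dist):
    return -np.sum(dist * np.log2(dist))

def cross_entropy(p_true, q_model):
    # Average bits when events drawn from p_true are coded with a code optimized for q_model.
    return -np.sum(p_true * np.log2(q_model))

def kl_divergence(p_true, q_model):
    # Extra bits paid for using q_model instead of p_true; equals cross-entropy minus entropy.
    return np.sum(p_true * np.log2(p_true / q_model))

p_true = np.array([0.5, 0.25, 0.25])   # illustrative "true" distribution
q_model = np.array([0.4, 0.4, 0.2])    # illustrative model prediction

print(entropy(p_true))                 # 1.5 bits
print(cross_entropy(p_true, q_model))  # ~1.572 bits
print(kl_divergence(p_true, q_model))  # ~0.072 bits = cross-entropy - entropy
```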

Entropy links uncertainty, knowledge, and efficiency.
To learn is to reduce entropy, to turn surprise into understanding.

Links to

  • [[Information Theory: The Architecture of Uncertainty and Human Power]]