Entropy measures the average surprise of a random variable and the minimum cost of describing its outcomes.
Imagine you want to send the outcome of rolling a die to a friend using binary codes.
- If the die is fair, each outcome is equally likely, so you’ll need enough bits to distinguish all 6 outcomes.
- If the die is biased (say it lands on 6 most of the time), you can save space by using shorter codes for frequent outcomes and longer codes for rare ones.
Entropy tells you the minimum average number of bits required to communicate the outcome of a random variable without loss.
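As a rough illustration (the biased probabilities below are made-up numbers, not taken from the text), a few lines of Python show how the minimum average bit count drops when the die is biased:

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: H = -sum p * log2(p), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_die = [1/6] * 6                                   # every face equally likely
biased_die = [0.05, 0.05, 0.05, 0.05, 0.05, 0.75]      # hypothetical die that mostly lands on 6

print(f"fair die:   {entropy_bits(fair_die):.3f} bits/roll")    # ≈ 2.585 bits
print(f"biased die: {entropy_bits(biased_die):.3f} bits/roll")  # ≈ 1.392 bits
```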
Entropy thus quantifies:
- Uncertainty: how unpredictable the variable is
- Compressibility: how efficiently it can be encoded
In machine learning, this forms the basis of:
- Cross-entropy loss
- Decision tree splits (information gain)
- Bayesian inference
- Data compression
Example
Consider a random variable $x$ having 8 possible states, each of which is equally likely.
For a uniform variable:
You need $\log_2 8 = 3$ bits to represent each state (since $2^3 = 8$).
Entropy:
$$H[x] = -\sum_{x} p(x)\log_2 p(x) = -8 \times \tfrac{1}{8}\log_2 \tfrac{1}{8} = 3 \text{ bits}$$
Entropy = minimum average code length.
A nonuniform distribution over the same 8 states has a smaller entropy than the uniform one (a concrete case is worked out under Concrete Examples below). When probabilities are unequal, average uncertainty decreases and fewer bits are needed, because predictable events carry less information.
We can take advantage of a nonuniform distribution by assigning shorter codes to the more probable events and longer codes to the rarer ones.
This principle underlies Huffman and arithmetic coding:
- Frequent events → shorter binary codes (e.g., 0)
- Rare events → longer codes (e.g., 1101)
This reduces the average message length while preserving all information.
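As a sketch (the probabilities and code words below are illustrative choices, not from the text), you can check that a variable-length prefix code beats the fixed 2-bit code for a skewed 4-symbol source, while never dropping below the entropy bound:

```python
import math

# Hypothetical skewed source over 4 symbols, with a prefix code chosen Huffman-style:
# frequent symbols get short code words, rare symbols get long ones.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code  = {"a": "0", "b": "10", "c": "110", "d": "111"}

entropy = -sum(p * math.log2(p) for p in probs.values())
avg_len = sum(probs[s] * len(code[s]) for s in probs)

print(f"entropy      : {entropy:.3f} bits/symbol")                # 1.750
print(f"avg code len : {avg_len:.3f} bits/symbol")                # 1.750
print(f"fixed-length : {math.log2(len(probs)):.0f} bits/symbol")  # 2
```

Because each probability here is a power of 1/2, the code meets the entropy bound exactly; for other distributions, Huffman coding gets close but can never go below it.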
Shannon’s Noiseless Coding Theorem (1948)
Entropy is a lower bound on the number of bits needed to transmit the state of a random variable.
Mathematically:
$$\bar{L} \;\ge\; H[x]$$
where $\bar{L}$ is the average number of bits per symbol used by the code.
🔹 Entropy is the theoretical limit of lossless compression.
No encoding scheme can achieve an average code length below $H[x]$.
Entropy sets the floor; clever coding only dances around it.
Entropy as Average Information
Entropy measures the expected surprise:
- $h(x) = -\log_2 p(x)$: information (surprise) from observing one event
- $H[x] = \mathbb{E}[h(x)] = -\sum_x p(x)\log_2 p(x)$: average information across all possible events
Interpretation:
- $H[x]$ is largest when all outcomes are equally likely (maximum uncertainty)
- $H[x] = 0$ when one outcome is certain
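A quick worked case for a binary variable (the numbers are my own illustration) makes both bullets concrete: for $p = (0.5, 0.5)$, $H[x] = -(0.5\log_2 0.5 + 0.5\log_2 0.5) = 1$ bit, the maximum for two outcomes; for $p = (1, 0)$, $H[x] = -(1\cdot\log_2 1 + 0\cdot\log_2 0) = 0$ bits, using the convention $0\log_2 0 = 0$.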
Entropy and Distribution Shape
Distributions that are sharply peaked have low entropy; those spread evenly have higher entropy.
Formally:
$$H[x] \ge 0$$
Equality $H[x] = 0$ holds when one $p(x_i) = 1$ and all others $p(x_j) = 0$.
Uniform distributions, $p(x_i) = 1/M$ over $M$ states, maximize entropy:
$$H[x] = \log_2 M$$
Deriving Maximum Entropy via Lagrange Multipliers
We maximize:
$$H = -\sum_{i=1}^{M} p(x_i)\,\ln p(x_i)$$
subject to the constraint:
$$\sum_{i=1}^{M} p(x_i) = 1$$
Using a Lagrange multiplier $\lambda$, form
$$\tilde{H} = -\sum_{i} p(x_i)\,\ln p(x_i) + \lambda\left(\sum_{i} p(x_i) - 1\right)$$
Taking the derivative with respect to each $p(x_i)$ and setting it to zero:
$$\frac{\partial \tilde{H}}{\partial p(x_i)} = -\ln p(x_i) - 1 + \lambda = 0 \;\Rightarrow\; p(x_i) = e^{\lambda - 1}$$
so every $p(x_i)$ equals the same constant, and the normalization constraint forces $p(x_i) = 1/M$.
Substitute into $H$:
$$H = -\sum_{i=1}^{M} \frac{1}{M}\ln\frac{1}{M} = \ln M$$
or in bits: $H = \log_2 M$.
Maximum entropy occurs for the uniform distribution.
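A minimal numerical sanity check (nothing here comes from the text; the random distributions are just samples) confirms that no distribution over $M$ states exceeds $\log_2 M$:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 8

def entropy_bits(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

uniform = np.full(M, 1 / M)
random_dists = rng.dirichlet(np.ones(M), size=10_000)   # 10k random distributions over M states

print(f"uniform entropy : {entropy_bits(uniform):.4f} bits (= log2 {M})")
print(f"max over samples: {max(entropy_bits(p) for p in random_dists):.4f} bits (< 3)")
```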
Concrete Examples
Example 1: Uniform variable (8 outcomes)
$H[x] = \log_2 8 = 3$ bits → need 3 bits per symbol (no compression possible).
Example 2: Nonuniform variable
→ Average message length smaller than 3 bits: more predictable system.
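For instance (this particular distribution is a classic textbook choice, not one given above), take 8 states with probabilities $(\tfrac12, \tfrac14, \tfrac18, \tfrac1{16}, \tfrac1{64}, \tfrac1{64}, \tfrac1{64}, \tfrac1{64})$:

```python
import math

# Hypothetical nonuniform distribution over 8 states (a standard textbook example).
probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
assert abs(sum(probs) - 1.0) < 1e-12

entropy = -sum(p * math.log2(p) for p in probs)
print(f"H = {entropy:.3f} bits")   # 2.000 bits, versus 3 bits for the uniform case
```

A matching prefix code (code lengths 1, 2, 3, 4, 6, 6, 6, 6) achieves exactly 2 bits per symbol on average, meeting the entropy bound.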
Entropy and Coding Intuition
| Distribution Type | Description | Entropy | Predictability | Coding Efficiency |
|---|---|---|---|---|
| Uniform | All outcomes equally likely | 🔺 High (max) | Very uncertain | No compression |
| Nonuniform | Some outcomes dominate | 🔻 Moderate | Some predictability | Compressible |
| Deterministic | One outcome certain | 🟢 Zero | Fully predictable | No info to send |
Common Pitfalls
- Lower entropy ≠ better: It means more predictable, not necessarily more desirable.
- Confusing entropy with disorder: In information theory, it’s about uncertainty, not chaos.
- Ignoring coding interpretation: Entropy literally means average bits per symbol.
- Forgetting the log base: log₂ → bits, ln → nats (a quick conversion check follows this list)
- Believing you can beat entropy: you can’t; it’s the fundamental limit of lossless compression.
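A tiny check of the base conversion (the distribution is arbitrary; the only fact used is the factor $\ln 2$) shows that entropy in nats is the entropy in bits scaled by $\ln 2$:

```python
import math

probs = [0.7, 0.2, 0.1]                                  # any distribution works here
h_bits = -sum(p * math.log2(p) for p in probs)           # entropy in bits (log base 2)
h_nats = -sum(p * math.log(p) for p in probs)            # entropy in nats (natural log)

assert abs(h_nats - h_bits * math.log(2)) < 1e-12        # H_nats = H_bits * ln 2
print(f"{h_bits:.4f} bits  =  {h_nats:.4f} nats")
```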
Broader Connections
| Concept | Formula | Role / Interpretation |
|---|---|---|
| Cross-Entropy | $H(p, q) = -\sum_x p(x)\log q(x)$ | Average bits needed when encoding the true distribution $p$ with codes optimized for the predicted distribution $q$ |
| KL Divergence | $D_{KL}(p \parallel q) = \sum_x p(x)\log\frac{p(x)}{q(x)}$ | Extra bits needed when using $q$ instead of $p$ |
| Information Gain | $IG = H(\text{parent}) - \sum_k \frac{N_k}{N} H(\text{child}_k)$ | Reduction in uncertainty; used for decision tree splits |
| Model Confidence | $H[\hat{p}(y \mid x)]$ | Entropy of predicted probabilities; lower entropy → more confident model |
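As a rough sketch tying these together (the distributions are chosen arbitrarily for illustration), cross-entropy decomposes as entropy plus KL divergence, $H(p, q) = H[p] + D_{KL}(p \parallel q)$:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution (illustrative)
q = np.array([0.5, 0.3, 0.2])   # "predicted" distribution (illustrative)

entropy       = -(p * np.log2(p)).sum()
cross_entropy = -(p * np.log2(q)).sum()
kl_divergence = (p * np.log2(p / q)).sum()

# Cross-entropy = entropy of p + extra bits paid for using q instead of p.
assert np.isclose(cross_entropy, entropy + kl_divergence)
print(f"H(p) = {entropy:.4f}, H(p, q) = {cross_entropy:.4f}, KL(p || q) = {kl_divergence:.4f} bits")
```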
Entropy links uncertainty, knowledge, and efficiency.
To learn is to reduce entropy, to turn surprise into understanding.
Links to
- [[Information Theory: The Architecture of Uncertainty and Human Power]]