Parametric and Non-Parametric Learning

How much to assume about the world before observing it? The parametric/non-parametric distinction is an epistemic stance toward knowledge itself.

Tue, Jan 6th
Tags: machine learning, pattern recognition, bias-variance tradeoff, tradeoff, parametric, generative model, regularization, Bayesian statistics
Created: 2026-01-06 · Updated: 2026-01-06

In statistical learning, every method we ever encounter, from the simplest linear regression to the most elaborate neural network, must make a choice at this fork. That choice concerns how much we are willing to assume about the shape of reality before we even look at the evidence.

Parametric methods are those that assume the data-generating process can be described by a model with a fixed, finite number of parameters. Before we see a single data point, we commit to a specific functional form. Linear regression, for instance, declares: "I believe the relationship between input and output is a weighted sum, plus some noise. My only job is to find the weights." Logistic regression says: "The probability of an event follows a particular S-shaped curve; I need only estimate its steepness and position." The Gaussian distribution announces: "Reality is bell-shaped; tell me the mean and variance, and I will tell you everything."

Non-parametric methods, by contrast, refuse this early commitment. They do not fix the number of parameters in advance. Instead, the complexity of the model is allowed to grow with the data. A k-nearest neighbors algorithm does not assume any functional form at all; it simply says, "Show me what is nearby, and I will guess accordingly." A decision tree carves the space into regions based on what it observes, adding branches as needed. Kernel density estimation builds a smooth curve by placing little bumps around each data point. The model's "memory" expands as more evidence arrives.

In mathematical terms: a parametric model lives in a finite-dimensional space (estimate $\theta_1, \theta_2, \ldots, \theta_k$ and we are done), while a non-parametric model lives in an infinite-dimensional function space, constrained only by the data and some regularization principle.
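To make the contrast concrete, here is a minimal sketch (assuming numpy and scikit-learn are available; the data is synthetic and the model choices are only illustrative): the linear model commits to two numbers before seeing anything, while the k-nearest neighbors regressor carries the whole training set around with it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, size=200)

# Parametric: commits to y = w*x + b; only a slope and an intercept to learn.
linear = LinearRegression().fit(X, y)

# Non-parametric: no functional form; predictions come from whatever training
# points happen to be nearby, so the "model" effectively grows with the data.
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)

x_new = np.array([[2.5]])
print(linear.predict(x_new), knn.predict(x_new))
```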

The soul of this distinction lies in a question that has haunted philosophers, scientists, and engineers for centuries:

How much should we trust our prior beliefs about the world, versus letting the world reveal itself to us?

When we choose a parametric method, we are making a wager. I am saying: "I believe I understand the underlying mechanism well enough to describe it with a simple formula. I am willing to be wrong about the shape in exchange for the power of simplicity—fewer things to estimate, more stability, more interpretability, and the ability to learn from small amounts of data."

When we choose a non-parametric method, we are making the opposite wager. I am saying: "I do not trust my assumptions about the world's structure. I would rather let the data speak, even if it means I need more evidence, even if the result is harder to interpret, even if I risk capturing noise along with signal."

This is the bias-variance tradeoff.

Parametric methods carry high bias (they may systematically miss the truth if reality does not match their assumed form) but low variance (they are stable and do not fluctuate wildly with new data). Non-parametric methods carry low bias (they can approximate almost any truth, given enough data) but high variance (they are sensitive, flexible, and prone to overfitting when data is scarce).
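The tradeoff can be watched directly. A rough simulation (again assuming numpy and scikit-learn; the nonlinear "truth" and noise level are invented for illustration): refit both kinds of model on many resampled datasets and see how their predictions at a single test point behave.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)

def true_f(x):
    return np.sin(x)  # the "truth" is nonlinear

x_test = np.array([[1.5]])
linear_preds, knn_preds = [], []

for _ in range(500):
    X = rng.uniform(0, 10, size=(50, 1))
    y = true_f(X).ravel() + rng.normal(0, 0.3, size=50)
    linear_preds.append(LinearRegression().fit(X, y).predict(x_test)[0])
    knn_preds.append(KNeighborsRegressor(n_neighbors=1).fit(X, y).predict(x_test)[0])

# The linear model is systematically off (bias) but its predictions cluster
# tightly from one dataset to the next (low variance); 1-NN is close to the
# truth on average (low bias) but scatters far more widely (high variance).
print("truth:", true_f(1.5))
print("linear mean/std:", np.mean(linear_preds), np.std(linear_preds))
print("1-NN   mean/std:", np.mean(knn_preds), np.std(knn_preds))
```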

The tension is epistemic. It asks: In a world of limited information and infinite complexity, what is the wisest posture for a learner to take? Should we impose structure and risk being wrong? Or should we remain agnostic and risk being lost?

This question echoes far beyond statistics. It lives in the debate between rationalists and empiricists in philosophy. It appears in how children learn language—with innate grammatical structures or purely from exposure? It surfaces in economics, where elegant models with strong assumptions collide with messy, data-driven approaches. It haunts the artist choosing between the constraint of a sonnet and the freedom of blank verse.

The Cartographer's Dilemma

Imagine you are a cartographer in an age before satellites, before aerial photography, before any god's-eye view of the terrain. You must produce a map of a coastline you have never fully seen. You have two choices.

The first choice: you decide, before setting out, that coastlines are fundamentally smooth curves. Perhaps gentle arcs, perhaps a series of connected circular segments. You carry with you a template and your job becomes one of fitting that template to the observations you gather. You walk the shore, take measurements at key points, and adjust your arcs until they pass reasonably close to the data. Your map will be elegant. It will be easy to describe ("the coast curves eastward with radius 12 kilometers, then bends north..."). It will be compact. But if the true coastline is jagged, fractal, full of inlets and peninsulas at every scale, your beautiful map will be a lie. A useful lie, perhaps, for navigation at a distance, but a lie nonetheless.

The second choice: you make no assumption about the coastline's shape. Instead, you walk every inch of it, recording each twist and turn with painstaking fidelity. Your map grows with every step. It has no formula, no template—it is simply an accumulation of observed points. If you walk long enough and record carefully enough, your map will approach the truth with arbitrary precision. But here is the trouble: the coastline, as Benoit Mandelbrot would later show, has no "true length." The closer you look, the more detail emerges. Your non-parametric map threatens to become infinite. And if you have only walked a portion of the shore, your map for the unvisited stretches is... silence. You have nothing to say about what you have not seen.

This is the cartographer's dilemma. And it is precisely the dilemma faced by anyone who wishes to learn from data.

Gauss and the Heavens

It is 1801. The astronomer Giuseppe Piazzi has just discovered Ceres, the first known asteroid, but after tracking it for forty days he loses it in the glare of the sun. The scientific world is desperate. Where will Ceres reappear? The orbit is unknown, the data is sparse, and the heavens offer no second chances.

Enter Carl Friedrich Gauss, then twenty-four years old, already whispered about as the prince of mathematicians. Gauss takes Piazzi's meager observations and performs a feat that will shape statistics for centuries: he assumes that Ceres moves in an ellipse (Kepler's parametric form), that the errors in observation are normally distributed (another parametric assumption), and that the best estimates are those that minimize the sum of squared errors. With these assumptions locked in place, the problem transforms from impossible to merely difficult. Gauss calculates. He predicts where Ceres will reappear.

On December 31, 1801, the astronomer Franz Xaver von Zach points his telescope to Gauss's coordinates. Ceres is there.

This is the triumph of parametric thinking. Gauss did not need to observe the entire orbit. He did not need infinite data. By committing to a structural form (the ellipse) he leveraged centuries of Keplerian physics. His assumptions were not arbitrary; they were grounded in the deepest knowledge available about celestial mechanics. And because the assumptions were correct (or nearly so), the payoff was immense: reliable prediction from minimal evidence.

The method of least squares, the normal distribution, the entire edifice of classical statistics, these are Gauss's children. And they share his philosophy: commit to a form, estimate its parameters, and let the structure do the heavy lifting.

The Rebellion of the Empiricists

But assumptions can betray us.

For two centuries after Gauss, the parametric worldview dominated. Statisticians built models: the normal distribution for heights, the Poisson for rare events, the exponential for waiting times. Economists assumed rational agents maximizing utility. Physicists assumed smooth, differentiable fields. The universe was modeled as a place governed by elegant equations with a handful of free parameters.

Then came the cracks.

In the early 20th century, philosophers of science began to worry. Karl Popper asked: How do we know our models are true? He answered that we cannot; we can only falsify them. But parametric models are slippery creatures; they can often accommodate contradictory data by adjusting parameters, never quite admitting failure.

In economics, the elegant models of perfect rationality began to clash with observed human behavior. People did not maximize expected utility; they were loss-averse, anchored by irrelevant numbers, swayed by framing. Daniel Kahneman and Amos Tversky documented these deviations as systematic patterns that the parametric models had assumed away.

And in the 1960s, a quiet revolution began in statistics itself. John Tukey, the great American statistician, championed what he called Exploratory Data Analysis, the radical idea that perhaps we should look at the data before imposing a model. His tools were non-parametric: boxplots, stem-and-leaf displays, smoothers that made no distributional assumptions. "The data," Tukey insisted, "may not be Gaussian. Let us see what it is before we decide what it should be."

Around the same time, Emanuel Parzen, Murray Rosenblatt, and others developed kernel density estimation, a way to estimate probability distributions without assuming any parametric form. Instead of declaring "this is a normal distribution," you let the data build its own shape: place a small smooth bump (a kernel) at each observation, then add them up. The result was a curve that emerged from the data rather than being imposed upon it.
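The recipe is short enough to write out in full. A from-scratch sketch in plain numpy, with a Gaussian kernel and a bandwidth picked by hand purely for illustration:

```python
import numpy as np

def kde(x_grid, data, h=0.5):
    # Place a Gaussian "bump" of width h at each observation, then average.
    diffs = (x_grid[:, None] - data[None, :]) / h
    bumps = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
    return bumps.mean(axis=1) / h  # the classic (1 / n h) * sum of kernels

data = np.array([1.1, 1.9, 2.3, 4.0, 4.2, 4.4])
x_grid = np.linspace(0, 6, 200)
density = kde(x_grid, data)  # a shape built by the data, not assumed in advance
```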

The k-nearest neighbors algorithm, formalized by Evelyn Fix and Joseph Hodges in 1951, took the idea even further. To classify a new point, simply look at the k closest points in your training data and vote. No functional form. No parameters to fit. Just memory and proximity.
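That really is the whole algorithm. A sketch in plain numpy (the toy points and labels below are invented): memorize the training set, then let the k nearest points vote.

```python
import numpy as np
from collections import Counter

def knn_classify(x_new, X_train, y_train, k=3):
    distances = np.linalg.norm(X_train - x_new, axis=1)    # proximity
    nearest = np.argsort(distances)[:k]                    # memory
    return Counter(y_train[nearest]).most_common(1)[0][0]  # the vote

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["red", "red", "blue", "blue"])
print(knn_classify(np.array([1.1, 0.9]), X_train, y_train, k=3))  # -> "red"
```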

These methods were the empiricists' answer to Gauss: Do not assume. Observe. Let the world speak.

The Parable of the Mapmaker and the Poet

A parametric method is like a poet who has chosen to write a sonnet. Before a single word is composed, the structure is fixed: fourteen lines, iambic pentameter, a specific rhyme scheme (ABAB CDCD EFEF GG, if we follow Shakespeare). The poet's task is to find the best words to fill this predetermined form. The constraint is severe, but it is also generative. The sonnet form has been refined over centuries; it creates a kind of compression, a tension between what must be said and how it can be said. When the subject fits the form, the result is sublime: Shakespeare's sonnets, Keats's "On First Looking into Chapman's Homer." When the subject does not fit, the sonnet feels forced, truncated, or baroque.

A non-parametric method is like a poet who writes in free verse. There is no predetermined line length, no rhyme scheme, no fixed stanza. The poem grows organically, taking whatever shape the subject demands. Walt Whitman's "Song of Myself" sprawls across pages; it could not exist within a sonnet's walls. Free verse can capture the jagged, the sprawling, the uncontainable. But it also places an enormous burden on the poet: without external structure, the poem must generate its own coherence. And in the hands of a lesser artist, free verse can become shapeless, self-indulgent, lost.

Neither form is superior. The question is always: What does the subject demand? What constraints are appropriate? And that question cannot be answered in the abstract. It requires judgment, taste, knowledge of the material.

Statistical learning demands the same judgment.

Nature's Vote

Here is a curious fact: nature herself seems to oscillate between parametric and non-parametric strategies.

Consider the human immune system. When a new pathogen enters your body, how does your immune system "learn" to recognize it? Does it assume the pathogen will take a certain parametric form, a template of what viruses "should" look like?

Not at all. The adaptive immune system is gloriously non-parametric. Your body generates an enormous random library of antibodies, billions of different shapes, without any prior knowledge of what threats exist. When a pathogen arrives, those antibodies that happen to bind to it are selected and amplified. The system explores, then remembers. It is k-nearest neighbors implemented in biochemistry: the antibody that is "closest" to the antigen's shape wins the vote.

But consider, by contrast, how your visual cortex processes edges and shapes. There, neuroscientists have discovered something remarkable: the early layers of visual processing seem to implement parametric assumptions. Neurons in V1 are tuned to detect oriented edges, as if the brain has "assumed" that the visual world is built from lines at various angles. This is not learned from scratch; it is present in newborns and even in animals raised in darkness. It is a prior belief, wired in by evolution, because edges and lines are so fundamental to natural scenes that it would be wasteful to learn their importance from scratch every generation.

Evolution has discovered what statisticians laboriously proved: when you have strong prior knowledge, parametric assumptions accelerate learning; when the environment is unpredictable, non-parametric flexibility is essential.

Your brain uses both. The edge detectors are parametric; the association cortex that learns your grandmother's face is non-parametric. The interplay between them is what makes cognition possible.

The Curse of Dimensionality

In 1961, Richard Bellman gave a name to a demon that haunts all non-parametric methods: the curse of dimensionality.

Suppose you want to estimate a function $f(x)$ using non-parametric methods, say, by averaging nearby observations. In one dimension, "nearby" is easy to define: points within some small distance $\epsilon$ of your query point. If your data is reasonably dense, you will find enough neighbors to make a stable estimate.

Now increase the dimension. In two dimensions, the "neighborhood" is a disk. In three, a sphere. In ten, a hypersphere. And here is the problem: the number of such small neighborhoods needed to cover the space grows exponentially with dimension, while the data you have remains finite. To maintain the same density of neighbors in a 10-dimensional space as you had in 1 dimension, you would need astronomically more data.

The mathematics is stark. If you need $n$ points to achieve a certain accuracy in 1 dimension, you need roughly $n^{10}$ points in 10 dimensions. If $n = 100$, then $n^{10} = 10^{20}$. No dataset on Earth is that large.
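A back-of-the-envelope experiment makes the emptiness tangible (plain numpy; the sample size and radius are arbitrary choices): scatter the same number of points in a unit cube of increasing dimension and count how many land near the center.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
for d in [1, 2, 5, 10, 20]:
    X = rng.uniform(0, 1, size=(n, d))
    center = np.full(d, 0.5)
    # How many points fall within distance 0.1 of the center?
    neighbors = np.sum(np.linalg.norm(X - center, axis=1) < 0.1)
    print(f"d={d:2d}: {neighbors} of {n} points within 0.1 of the center")

# Roughly 2,000 neighbors in one dimension; by ten dimensions, typically zero.
```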

This is the curse. Non-parametric methods, in high-dimensional spaces, starve for data. The "neighborhood" of any point becomes a vast empty desert. The method that promised to "let the data speak" finds that the data has nothing to say, because there is no data nearby to speak.

Parametric methods suffer less from this curse because their assumptions act as a dimensional reduction. If you assume the function is linear, you do not need to fill a 10-dimensional space with data; you only need to estimate 10 coefficients (plus an intercept). The assumption compresses the problem.

And this is the profound trade-off: parametric methods tame the curse of dimensionality by borrowing strength from assumptions; non-parametric methods remain assumption-free but pay the price in data hunger.

The Strange Middle Kingdom: Modern Machine Learning

If the story ended here, we would have a clean dichotomy: parametric for small data and strong assumptions, non-parametric for large data and weak assumptions. But the story does not end here, because the late 20th and early 21st centuries saw the emergence of methods that blur the boundary in fascinating ways.

Consider the support vector machine. It finds a linear decision boundary (parametric!) in a transformed feature space, and the transformation can be so radical that the resulting boundary in the original space is highly nonlinear. Is this parametric or non-parametric? In a sense, it is both: parametric in spirit (seeking a simple separator), non-parametric in power (capable of arbitrary complexity through the "kernel trick").

Consider decision trees and random forests. A single decision tree is non-parametric: it grows as needed, with no fixed form. But a random forest introduces a different kind of regularization. The averaging over many trees produces a smoother, more stable estimate, inheriting some of the virtues of parametric stability.

And then there are neural networks, the great disruptors. A neural network has a fixed architecture and so it seems parametric. The number of parameters is finite and determined before training. But modern networks have millions or billions of parameters. In practice, they are so flexible that they can approximate almost any function, behaving in many ways like non-parametric methods. And yet, the architecture itself imposes inductive biases: convolutional networks assume spatial locality in images; recurrent networks assume temporal dependence in sequences. These are parametric assumptions smuggled back in through architecture.

The philosopher might say: neural networks are parametric in form but non-parametric in spirit. Or perhaps: they occupy a strange middle kingdom, where the number of parameters is so vast that the distinction begins to dissolve.

The Mediator: Regularization

There is a concept that stands between the parametric and non-parametric worlds like a diplomat negotiating between warring kingdoms. That concept is regularization.

To understand regularization, we must first understand the fundamental anxiety of all statistical learning: overfitting. A model that is too flexible, too willing to accommodate the data, will contort itself to fit even the noise. It will memorize the training set rather than learn the underlying pattern. When confronted with new data, it will fail spectacularly, because the noise it memorized is not the noise that will appear tomorrow.

Non-parametric methods, with their infinite flexibility, are particularly vulnerable to this disease. The k-nearest neighbors algorithm with k=1 achieves perfect accuracy on its training data, but this perfection is a mirage. A decision tree grown without constraint will produce a leaf for every single observation, achieving perfect training accuracy and abysmal generalization.
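The mirage is easy to reproduce (a quick sketch assuming scikit-learn; the noisy synthetic labels are invented so that memorization has little signal to latch onto):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
# Labels depend only weakly on the first feature and heavily on noise.
y = (X[:, 0] + rng.normal(0, 2, size=300) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

print("train accuracy:", model.score(X_tr, y_tr))  # 1.0 -- pure memorization
print("test accuracy: ", model.score(X_te, y_te))  # much lower -- the noise does not repeat
```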

Regularization is the medicine. It is a penalty on complexity, a tax on flexibility, a force that pulls the model back toward simplicity. And here is the deep insight: regularization transforms non-parametric methods into something that behaves more like parametric methods.

Consider ridge regression. Ordinary least squares fits a linear model by minimizing the sum of squared errors. Ridge regression adds a penalty: it minimizes squared errors plus a term proportional to the sum of squared coefficients. What does this penalty do? It shrinks the coefficients toward zero. It says: "Yes, you may fit the data, but not too enthusiastically. Keep your parameters small. Be modest."

This penalty has a remarkable Bayesian interpretation. Adding a ridge penalty is mathematically equivalent to assuming that the true coefficients are drawn from a normal distribution centered at zero. The penalty is a prior belief. The non-parametric flexibility of fitting any linear combination is tempered by a parametric prior that favors small coefficients.
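In code, the penalty is a single extra term in the normal equations. A minimal sketch in plain numpy (the data and the penalty strength are invented for illustration); the added `lam * I` is exactly where the prior belief in small coefficients enters the algebra.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    # beta = (X^T X + lam * I)^{-1} X^T y
    # lam = 0 recovers ordinary least squares; lam > 0 shrinks toward zero,
    # equivalent to a MAP estimate under a zero-mean Gaussian prior on beta.
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, size=100)

print("OLS  :", ridge_fit(X, y, lam=0.0))
print("ridge:", ridge_fit(X, y, lam=50.0))  # same data, coefficients pulled toward zero
```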

The same logic extends everywhere. In spline smoothing, a penalty on the second derivative enforces smoothness. In neural networks, weight decay (L2 regularization) and dropout prevent the network from memorizing. In Gaussian processes, the choice of kernel encodes assumptions about smoothness, periodicity, or length scale.

Regularization is the bridge. It allows us to start with a flexible, non-parametric architecture and inject parametric assumptions through the penalty term. The strength of the regularization controls how much we trust the assumptions versus the data. A strong penalty says: "I believe in simplicity; ignore the data's noise." A weak penalty says: "Let the data speak; I will intervene only to prevent catastrophe."

The great statistician Leo Breiman called this "algorithmic modeling": using flexible methods but controlling their flexibility through regularization, cross-validation, and ensemble techniques. It is neither purely parametric nor purely non-parametric. It is the synthesis.

The Bayesian Illumination

In Bayesian statistics, every analysis begins with a prior distribution. The data then updates this prior, via Bayes' theorem, to produce a posterior distribution.
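In symbols, for parameters $\theta$ and data $\mathcal{D}$, the update is the familiar one:

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$$

The prior $p(\theta)$ is where the commitment lives; the likelihood $p(\mathcal{D} \mid \theta)$ carries the data's testimony.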

In Bayesian terms, the choice between parametric and non-parametric methods is a choice about the prior.

When you fit a linear regression, your prior is effectively: "I believe the true function is a straight line (or a hyperplane), and I need only learn its slope and intercept." This prior assigns zero probability to nonlinear functions. It is an infinitely confident prior on the function's form.

When you use a Gaussian process, your prior is over all possible functions consistent with certain smoothness properties. You do not commit to a specific form; you commit only to beliefs about continuity, differentiability, or periodicity. The prior is diffuse over the function space, and the data sculpts out the posterior.

The Bayesian framework thus reveals the hidden assumptions in every method. A parametric model is one with a concentrated prior (we believe strongly in a specific form). A non-parametric model is one with a diffuse prior (we believe only in general properties like smoothness). The strength of the prior relative to the likelihood determines how much the data can move us.

This casts the bias-variance tradeoff in a new light. A strong prior (parametric) introduces bias, but it stabilizes our estimates, reducing variance. A weak prior (non-parametric) reduces bias but allows variance to flourish, because the data alone must determine the function's shape.

The great Bayesian E.T. Jaynes loved to say that there is no such thing as a "non-informative" prior: every prior encodes assumptions, even if only assumptions about scale or smoothness. The non-parametric practitioner, who believes they are "letting the data speak," is in fact assuming something profound: that functions in nature are typically smooth, or local, or have bounded complexity. These are assumptions. They are merely more modest assumptions than "the truth is a straight line."

Gaussian Processes

If there is a method that perfectly embodies the marriage of parametric intuition and non-parametric flexibility, it is the Gaussian process.

A Gaussian process is a distribution over functions. Not a distribution over parameters but a distribution over entire functions. Before seeing any data, you specify a kernel function (also called a covariance function) that encodes your beliefs about how similar function values should be at nearby inputs. The squared exponential kernel, for instance, enforces smoothness: points close together in input space should have similar outputs. The periodic kernel encodes the belief that the function repeats.

Once you observe data, the Gaussian process posterior is simply a conditional distribution: "Given what I've seen, what functions are still plausible?" The result is not a single predicted function, but a distribution of functions, with uncertainty quantified at every point.

A Gaussian process is non-parametric in the sense that it can approximate any continuous function given enough data. The number of effective parameters grows with the data. Yet the kernel itself is parametric: it has hyperparameters (length scale, variance, periodicity) that control the prior beliefs. A Gaussian process with a linear kernel is Bayesian linear regression. A Gaussian process with a squared exponential kernel smoothly interpolates, shrinking toward the prior mean in regions without data.
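The whole machine fits in a page. A compact sketch of Gaussian process regression in plain numpy with a squared exponential kernel; the length scale, signal variance, and noise level below are hand-picked purely for illustration.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, variance=1.0):
    # Squared exponential kernel: nearby inputs get highly correlated outputs.
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq_dists / length_scale**2)

def gp_posterior(X_train, y_train, X_test, noise=0.1):
    K = rbf_kernel(X_train, X_train) + noise**2 * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    K_ss = rbf_kernel(X_test, X_test)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s.T @ alpha                            # posterior mean
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)    # posterior covariance
    return mean, cov

X_train = np.array([[1.0], [3.0], [5.0]])
y_train = np.sin(X_train).ravel()
X_test = np.linspace(0, 6, 50)[:, None]

mean, cov = gp_posterior(X_train, y_train, X_test)
std = np.sqrt(np.diag(cov))  # uncertainty is narrow near the data, wide far from it
```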

The Gaussian process thus exposes the false dichotomy between parametric and non-parametric methods. They are not opposites; they are endpoints on a spectrum. The spectrum is controlled by the prior, how much structure you impose, how much flexibility you allow.

David MacKay, the physicist and machine learning pioneer, wrote that Gaussian processes answer the question: "What would linear regression look like if it could dream of curves?" The answer is: it would dream of all possible curves, weighted by their plausibility given the prior, and then wake to the evidence and narrow its dreams accordingly.

Occam's Razor, Sharpened

Occam's Razor is the principle that simpler explanations should be preferred over complex ones, all else being equal.

But what does "simpler" mean? And why should nature favor simplicity?

In the context of statistical learning, simplicity has a precise meaning: a simpler model is one that can explain fewer possible datasets. A straight line can fit infinitely many datasets, but only those that lie roughly along a line. A degree-20 polynomial can fit any 21 points exactly; it is more flexible, hence more complex, hence less "simple."

More precisely: if two models explain the observed data equally well, prefer the one that could have explained fewer datasets overall. Why? Because the simpler model, by being more restrictive, is making a bolder claim. If that bold claim turns out to match reality, we should reward it. The complex model is hedging its bets, spreading its probability across many possibilities. It is less surprised by the data, and therefore less informative when it succeeds.

Bayesian model comparison embodies this principle automatically. The marginal likelihood of a model, the probability it assigns to the observed data, averaged over all parameter values, naturally penalizes complexity. A complex model spreads its prior probability thinly across many possible datasets; it assigns low probability to any specific dataset. A simple model concentrates its probability; if the data matches, the reward is high.
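Written out, the quantity doing this automatic penalizing is the marginal likelihood, which averages the fit over everything the model's prior allows:

$$p(\mathcal{D} \mid M) = \int p(\mathcal{D} \mid \theta, M)\, p(\theta \mid M)\, d\theta$$

A flexible model spreads $p(\theta \mid M)$ across many possible functions, so any particular dataset receives only a sliver of probability; a restrictive model concentrates its bets and is rewarded when the data land where it wagered.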

This is why parametric models, when they are correct, are epistemically virtuous. They make specific predictions. They take risks. They can be refuted. A linear model that predicts a nonlinear world will fail obviously and promptly. This falsifiability is a feature, not a bug.

Non-parametric models, by contrast, are epistemically cautious. They can accommodate almost any data. They are harder to refute, and therefore, in a subtle sense, they tell us less about the world. They describe; they do not explain.

The philosopher of science might say: parametric models are theories; non-parametric models are summaries. Both are valuable. But they serve different purposes.

The Interpretability Question

There is a dimension to this debate that has grown urgent in the age of machine learning: interpretability.

When a doctor uses a logistic regression model to predict the risk of heart disease, the model produces a story. Each coefficient says: "If cholesterol increases by one unit, the log-odds of disease increase by this much." The story may be simplified, but it is intelligible. The doctor can explain the prediction to the patient. The patient can ask: "Why?" And the model can answer.
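A short sketch of that story being told (assuming scikit-learn; the data, the feature names, and their effect sizes are entirely invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 2))  # pretend columns: standardized cholesterol, age
y = (0.8 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 1, size=500) > 0).astype(int)

model = LogisticRegression().fit(X, y)
for name, coef in zip(["cholesterol", "age"], model.coef_[0]):
    # Each coefficient is a sentence: a one-unit increase in this feature shifts
    # the log-odds of disease by `coef`, i.e. multiplies the odds by exp(coef).
    print(f"{name}: log-odds change {coef:.2f}, odds ratio {np.exp(coef):.2f}")
```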

When a deep neural network makes the same prediction, it often cannot answer "why?" The network has learned some function but that function is not a story. It is a tangled web of weights, a compressed representation of patterns too intricate for human language. The prediction may be accurate, perhaps more accurate than logistic regression, but it is opaque.

This is the interpretability trade-off, and it often aligns with the parametric/non-parametric divide. Parametric models, because they commit to a simple structure, are often interpretable. Non-parametric models (and their modern descendants like neural networks), because they allow arbitrary complexity, are often not.

The alignment is imperfect (decision trees are non-parametric yet interpretable; neural networks have fixed architectures yet are opaque), but the tendency is real. And it matters. In medicine, in criminal justice, in lending, in all the domains where algorithms touch human lives, interpretability is not a luxury. It is a demand of justice. If an algorithm denies you a loan, you have a right to know why. "The neural network said so" is not an acceptable answer.

The statistician Leo Breiman, in his famous 2001 paper "Statistical Modeling: The Two Cultures," lamented that his field had become obsessed with interpretable parametric models at the expense of predictive accuracy. He championed random forests and other "algorithmic" methods that prioritized prediction. But the tension he identified has not been resolved. We want both: the accuracy of non-parametric methods, the interpretability of parametric ones.

Modern efforts like SHAP values, LIME, and attention mechanisms attempt to retrofit interpretability onto complex models, to ask, after the fact, "Which inputs mattered most?" These are imperfect solutions, but they reveal the hunger for explanation. We do not want merely to predict; we want to understand.

A Tale of Two Economies

For most of the 20th century, economics was a proudly parametric discipline. The dominant paradigm assumed that economic agents were rational, that markets were in equilibrium, and that behavior could be derived from first principles. Macroeconomic models had a handful of equations: IS-LM curves, Phillips curves, production functions, with parameters to be estimated from aggregate data. The form was fixed by theory; the econometrician's job was merely to calibrate.

This approach had triumphs. It provided a common language. It enabled rigorous debate. It made predictions that could be tested and sometimes were confirmed.

But it also had failures. The rational expectations revolution assumed that agents knew the true model of the economy. The models failed to predict the stagflation of the 1970s. They failed to predict the financial crisis of 2008. The parametric forms, beautiful on paper, did not match the irregular, panic-prone, bounded-rationality-driven reality of actual economic life.

Meanwhile, a quieter revolution was underway. Econometricians developed non-parametric methods for treatment effect estimation, ways to measure the causal impact of policies without assuming a specific functional form. The "credibility revolution" championed by Joshua Angrist, Guido Imbens, and others favored designs (randomized experiments, natural experiments, regression discontinuity) that let the data reveal effects without heavy structural assumptions.

In finance, the efficient market hypothesis, a parametric claim about how information is incorporated into prices, clashed with observed anomalies: momentum, value effects, bubbles. Machine learning practitioners began to predict stock returns using random forests and neural networks, making no assumptions about market efficiency. These models often outperformed (slightly) the parametric alternatives, but they could not say why one stock would beat another. They were profitable but silent.

The economist Daron Acemoglu has written about the dangers of applying machine learning naively to economic questions. Prediction and causation are not the same. A non-parametric model might predict that people who own umbrellas tend to experience rain, but it will not tell you that umbrellas cause rain. Economic policy requires causal understanding, not just predictive accuracy. And causal understanding often requires the kind of structural (parametric) assumptions that machine learning eschews.

The tension remains unresolved. Economists increasingly use machine learning for prediction and measurement, but they rely on parametric structural models for policy analysis. The two approaches coexist uneasily, each suspicious of the other's assumptions, or lack thereof.

The Wisdom of the Ensemble

An ensemble method combines multiple models and aggregates their predictions. Random forests are ensembles of decision trees. Gradient boosting builds trees sequentially, each correcting the errors of its predecessors. Model stacking combines linear regressions, neural networks, and nearest neighbor methods into a single prediction.

Here is the profound insight of ensembles: you do not have to choose. If you are uncertain whether the truth is parametric or non-parametric, include both types of models in your ensemble. Under broad conditions, the aggregate performs at least as well as the average of its members, and in practice it often beats every one of them.
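A minimal sketch of that vote (assuming scikit-learn; the models, the synthetic data, and the equal weighting are arbitrary illustrative choices): fit one parametric worldview and one non-parametric worldview, then average their predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * X.ravel() + rng.normal(0, 0.3, size=200)

models = [
    LinearRegression().fit(X, y),                  # parametric worldview
    KNeighborsRegressor(n_neighbors=7).fit(X, y),  # non-parametric worldview
]

X_new = np.linspace(0, 10, 100)[:, None]
ensemble_pred = np.mean([m.predict(X_new) for m in models], axis=0)  # the consensus
```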

This is the wisdom of diversification, familiar from portfolio theory in finance. Just as a diversified portfolio reduces risk by combining assets that do not move together, a diverse ensemble reduces error by combining models that make different mistakes.

But there is something deeper here. An ensemble is an epistemological humility machine. It says: "I do not know which assumptions are correct. I will let multiple worldviews vote, and trust the consensus."

The success of ensembles in machine learning competitions suggests that the parametric/non-parametric choice may be, in practice, a false dilemma. The correct answer may be: both, together, in tension and collaboration.