Data-Centric, Not Data-Driven

Why “data-centric” is a more appropriate term than “data-driven”.

Mon, Nov 24th
datascience · strategy · problem-framing · perspective
Created: 2025-12-15 · Updated: 2025-12-15

Data should serve the problem, not dictate it.

Why This Matters

The phrase “data-driven” sounds scientific, but it often leads to intellectual submission: whatever data happens to be available ends up steering strategy, model design, and business direction.

The author’s choice of “data-centric” restores the proper hierarchy:

  • Goals and problems define purpose.
  • Data, software, and models exist to serve that purpose.

This distinction prevents the common trap of mistaking data availability for strategic insight.


Core Idea

“Data-driven” is reactive. “Data-centric” is intentional.

| Aspect          | Data-Driven                               | Data-Centric                                     |
|-----------------|-------------------------------------------|--------------------------------------------------|
| Mentality       | “What can we do with this data?”          | “What data do we need to solve this problem?”    |
| Decision Driver | Existing datasets                         | Business or human goals                          |
| Outcome         | Misaligned insights, spurious correlations | Purposeful, goal-aligned analysis               |
| Model Focus     | Fit metrics (accuracy, AUC, RMSE)         | Fitness for purpose (actionability, reliability) |
| Risk            | Irrelevance, data delusion                | Contextual understanding                         |
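
The “Model Focus” contrast can be illustrated with a small sketch. The two confusion matrices and the error costs below are hypothetical numbers chosen for illustration, not results from any real system, and the helper names (`Confusion`, `accuracy`, `expected_cost`) are assumptions of this example. The point is only that the model with the better fit metric is not necessarily the one that serves the goal better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts for a binary classifier."""
    tp: int
    fp: int
    fn: int
    tn: int


def accuracy(c: Confusion) -> float:
    """Fit metric: share of all cases classified correctly."""
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total


def expected_cost(c: Confusion, cost_fp: float, cost_fn: float) -> float:
    """Fitness-for-purpose metric: total cost of the errors made."""
    return c.fp * cost_fp + c.fn * cost_fn


# Hypothetical results for two models on the same 1,000 cases.
model_a = Confusion(tp=20, fp=10, fn=30, tn=940)  # accuracy 0.96
model_b = Confusion(tp=45, fp=80, fn=5, tn=870)   # accuracy 0.915

# Assumed business costs: a missed case hurts far more than a false alarm.
COST_FP = 5.0
COST_FN = 200.0

if __name__ == "__main__":
    for name, model in (("A", model_a), ("B", model_b)):
        print(name, accuracy(model), expected_cost(model, COST_FP, COST_FN))
```

Under these assumed costs, model A wins on accuracy while model B costs far less (1,400 versus 6,050), so the choice depends on which question the goal actually asks.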

Deep Explanation

  • Data ≠ Direction: Having data isn’t having meaning. Data-driven cultures chase correlations; data-centric ones pursue causal understanding.
  • Goals as Priors: Goals act as reasoning constraints, limiting what’s worth analyzing.
  • Data as Infrastructure: Data is the concrete, not the architecture. The “why” defines the structure; the data merely supports it.
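
The first two points can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: every feature is pure random noise, and the sizes (30 rows, 200 features) and the function names are arbitrary choices for this sketch. Scanning many unrelated columns for “the strongest correlation” tends to find one that looks meaningful, which is why a goal that limits what is worth analyzing also limits how many false leads are found.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_spurious_correlation(n_rows=30, n_features=200, seed=0):
    """Largest |correlation| between a random target and random features.

    Target and features are independent by construction, so any
    correlation found here is spurious.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(feature, target)))
    return best


if __name__ == "__main__":
    # Many candidates: the best one usually looks like a real signal.
    print(strongest_spurious_correlation(n_features=200))
    # One pre-specified candidate: far less room for a lucky match.
    print(strongest_spurious_correlation(n_features=1))
```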

Modern MLOps Implications

  1. Data-Centric AI (Andrew Ng’s concept):
    Performance gains now often come from improving data quality rather than from endlessly tuning models, for example through:

    • Consistent labels
    • Representative sampling
    • Continuous validation
  2. Feedback Loops:
    Build pipelines that improve data quality automatically (label corrections, schema versioning).

  3. Feature Stores:
    Treat features as reusable knowledge assets: carefully documented, versioned, and aligned with business meaning.

  4. Monitoring:
    Define monitoring around goal metrics, not just technical ones. E.g., customer satisfaction or retention, not only F1-score.
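
Two of the items above, consistent labels and goal-based monitoring, can be sketched as simple checks. This is a minimal sketch and not a reference to any particular MLOps tool: the function names, the metric names (`f1`, `retention_rate`), the thresholds, and the sample records are all hypothetical.

```python
from collections import defaultdict


def find_conflicting_labels(examples):
    """Return inputs that appear with more than one distinct label.

    `examples` is an iterable of (text, label) pairs. Texts are compared
    after trimming whitespace and lowercasing.
    """
    labels_by_input = defaultdict(set)
    for text, label in examples:
        labels_by_input[text.strip().lower()].add(label)
    return {
        text: sorted(labels)
        for text, labels in labels_by_input.items()
        if len(labels) > 1
    }


def failing_metrics(metrics, minimums):
    """Return names of metrics that are missing or below their minimum."""
    return [
        name
        for name, minimum in minimums.items()
        if metrics.get(name, float("-inf")) < minimum
    ]


# Hypothetical labeled examples: the first two disagree on the same input.
examples = [
    ("Refund not received", "billing"),
    ("refund not received ", "complaint"),
    ("App crashes on login", "bug"),
]

# Hypothetical monitoring snapshot: the technical metric looks healthy,
# but the goal metric has slipped below its agreed minimum.
metrics = {"f1": 0.91, "retention_rate": 0.72}
minimums = {"f1": 0.85, "retention_rate": 0.80}

if __name__ == "__main__":
    print(find_conflicting_labels(examples))
    print(failing_metrics(metrics, minimums))
```

A check like `find_conflicting_labels` could feed a label-correction queue, while `failing_metrics` treats the goal metric as a first-class alert condition next to the technical one.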


The Correct Hierarchy of Focus

flowchart TD
    A[Goals & Problems] --> B[Decisions Needed]
    B --> C[Data Required]
    C --> D[Software / Models / Systems]
    D --> E[Business Impact & Feedback]
    E --> A

Practical Prompts for Implementation

Use these as reflection or project kick-off questions:

  • What problem or decision am I actually trying to support with this data?
  • If I had no data, how would I still reason about this problem?
  • Is the current dataset fit for purpose, or just convenient?
  • Are my metrics aligned with the goal or merely with the data?
  • Does my team measure success by impact or by model accuracy?