Whose Culture Trains the Machine?
Large language models are built on data scraped from the internet. This fact is both obvious and opaque — technically true, but rarely interrogated. What parts of the internet? In what languages? From what worldviews? With whose permission?
If artificial intelligence is trained on culture, then we have to ask: whose culture? And who gets left out?
The Internet Isn’t Neutral
There’s a persistent myth that the internet is a reflection of global knowledge — a sprawling archive of human diversity. But in practice, it reflects power, access, and legacy systems of visibility.
English dominates. Western norms dominate. Corporate platforms dominate. Whole histories, whole epistemologies, exist outside the searchable surface of the web. Many are oral. Many are behind paywalls. Many were never digitized. Many were excluded deliberately.
When AI systems are trained on this skewed archive, they don’t just learn language — they learn bias, tone, and prioritization. They internalize who is allowed to explain the world — and who is not.
Data as a Mirror of Power
Training data is often described in terms of volume and scale. “X billion tokens.” “X terabytes of text.” But data isn’t a neutral mass. It’s shaped by choices: what gets scraped, what gets filtered, what gets weighted.
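To make that concrete, here is a minimal sketch in Python. The source names, token counts, and sampling weights are invented for illustration; this does not describe any real model’s pipeline. The point is only that a few curation decisions, which sources to include and how heavily to sample them, fix the shape of the resulting mix:

```python
# Hypothetical sketch: how curation choices shape a training mix.
# Source names, token counts, and weights are invented for illustration;
# they do not describe any real training pipeline.

sources = {
    # name: (tokens available, in billions; sampling weight chosen by the builder)
    "english_web_crawl":        (1200, 1.0),
    "english_forums":           (300,  1.2),   # upweighted as "high-quality" discussion
    "code_repositories":        (250,  1.5),   # upweighted for coding ability
    "non_english_web":          (900,  0.3),   # downweighted by quality filters
    "digitized_oral_histories": (0.5,  1.0),   # barely present: rarely digitized at all
}

# Effective contribution of each source = available tokens * sampling weight.
effective = {name: tokens * weight for name, (tokens, weight) in sources.items()}
total = sum(effective.values())

for name, share in sorted(effective.items(), key=lambda kv: -kv[1]):
    print(f"{name:>26}: {100 * share / total:5.1f}% of the training mix")
```

Run it and the sources that were never digitized, or that quality filters quietly downweight, all but vanish from the final percentages.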
Most foundation models are trained on data collected from U.S.-centric sources: corporate forums, academic journals, public subreddits, Wikipedia, GitHub. That corpus reflects a very specific cultural sphere: educated, techno-optimistic, English-speaking, often male, and often white.
What happens when a model trained on this corpus is used in places with different values, histories, and priorities?
Cultural mismatch is more than tone-deafness. It’s erasure.
Whose Language Is “Correct”?
Language models are trained to mimic “correct” usage. But correctness is itself a social construct. What counts as proper grammar, or persuasive rhetoric, is deeply tied to colonial histories, class divisions, and systemic exclusions.
Accents become “errors.” Dialects are flagged as “non-standard.” AAVE, Spanglish, Patois — all risk being misunderstood or misrepresented by systems that weren’t trained on them with care.
The result isn’t just inaccuracy. It’s a slow normalization of one kind of expression as “smart,” and others as less-than.
Cultural Flattening at Scale
To scale globally, many AI companies prioritize linguistic homogenization — training models to handle the “most common” versions of language and culture. But this convenience comes at a cost: subtle forms of knowledge are lost in translation.
Sarcasm, metaphor, spiritual concepts, humor — these are all culturally embedded. They don’t map easily onto datasets. They require context that machines don’t have. And when they’re mistranslated, the culture that holds them is distorted.
Toward Pluralistic Training
What would it look like to train models with cultural humility?
It might mean:
Partnering with communities to curate their own corpora
Including oral histories and non-Western forms of storytelling
Weighting underrepresented languages and registers more heavily (see the sampling sketch after this list)
Creating opt-out mechanisms for scraped content
Building smaller, local models with intentional boundaries
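The reweighting idea is the easiest of these to sketch. The snippet below uses temperature-based sampling, one commonly used approach; the language names and token counts are invented for illustration. Raising each language’s raw share to a power below one flattens the distribution, so low-resource languages are sampled more often than their size alone would allow.

```python
# Hypothetical sketch of temperature-based sampling for low-resource languages.
# Token counts are invented; alpha < 1 flattens the distribution so that
# smaller languages are drawn more often than their raw share would dictate.

tokens_by_language = {   # billions of tokens, illustrative only
    "english": 1500,
    "spanish":  200,
    "swahili":    4,
    "yoruba":     1,
}

def sampling_probs(counts, alpha=0.3):
    """Return sampling probabilities proportional to (n_i / N) ** alpha."""
    total = sum(counts.values())
    scaled = {lang: (n / total) ** alpha for lang, n in counts.items()}
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}

raw_total = sum(tokens_by_language.values())
for lang, p in sampling_probs(tokens_by_language).items():
    raw = tokens_by_language[lang] / raw_total
    print(f"{lang:>8}: raw share {100 * raw:5.2f}%  ->  sampled at {100 * p:5.2f}%")
```

Reweighting only redistributes what was already collected. A language that was never digitized has nothing to upsample, which is why the community-curation and oral-history items above matter at least as much as the math.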
True intelligence isn’t just scale — it’s specificity.
Conclusion: Whose Intelligence Is It?
The question of culture in AI isn’t just about fairness. It’s about worldbuilding. These systems don’t just reflect reality — they generate it. They suggest answers, offer narratives, fill in blanks.
If their training data is skewed, their vision of the world will be too.
So we must ask — not just what we can do with AI, but who it was built to serve. And what it leaves behind in the process.
Because culture isn’t content. And training on it isn’t permission.