Was Your Work Used to Train AI Without Permission?

When you see a piece of AI-generated text, art, or code that sounds or looks eerily familiar, it might not be your imagination. There’s a real possibility that your creative work was part of the data used to train that system — without your knowledge, and without your permission.

As the creator rights conversation gains momentum, one question has become more urgent: How do you find out if your work was used to train AI? And if it was — what can you actually do about it?

This article explores how training datasets are collected, why most creators aren’t informed, and the emerging tools and movements aimed at transparency and accountability.

How AI Training Datasets Are Collected

Most large-scale AI models are trained on publicly accessible data scraped from the internet. This includes:

  • Websites

  • Forums

  • Blogs and portfolios

  • Art-sharing platforms

  • Code repositories

  • Academic archives

  • Social media posts

In many cases, the scraping is automated — run through bots or scripts that collect huge volumes of content. And unless you’ve actively hidden or blocked your work online (and sometimes even if you have), it may have been captured.
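The most common blocking measure mentioned above is a robots.txt file at the root of your site. Below is a minimal sketch that disallows several publicly documented AI-related crawlers; the user-agent names are real, but note that compliance is voluntary and not every scraper honors the file:

```
# robots.txt — placed at https://yourdomain.com/robots.txt
# Asks AI-related crawlers not to collect your pages; honored voluntarily.

User-agent: GPTBot          # OpenAI's training-data crawler
Disallow: /

User-agent: CCBot           # Common Crawl's crawler
Disallow: /

User-agent: Google-Extended # Google's AI-training opt-out token
Disallow: /
```

A blanket `User-agent: *` rule would also block search engines, so listing specific AI crawlers lets your work stay discoverable while opting out of training collection.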

These datasets are then compiled into massive corpora — like Common Crawl, LAION, or The Pile — which become the foundation for training models like GPT, Stable Diffusion, and others.

The result? An internet-scale ingestion of creative labor, largely without consent.

Why You Were Never Notified

There is currently no legal requirement in most countries for AI developers to notify individuals that their work is being used for training.

There are also no standardized systems for opting in or out, and no dataset-level attribution mechanisms for tracking whether a specific creator’s work is present.

Most AI developers treat public content as “free” — conflating accessibility with ethical usability.

This is one of the core ethical failings of the current ecosystem: scale has been prioritized over consent.

Can You Check If Your Work Was Used?

In some cases, yes — though the process is far from straightforward.

For Visual Artists:

You can check whether your art appears in the open image–text datasets used by image generation models like Stable Diffusion. The best-known option is “Have I Been Trained?”, a free search tool from Spawning that indexes the LAION-5B dataset.

These tools let you search by keyword or upload samples of your work and compare the results against known training data.

For Writers, Coders, and Academics:

This is trickier. Most language and code models are trained on large, often undocumented datasets. There is no public tool (yet) that allows you to search text-based corpora at scale.

However, you can:

  • Search whether your domain or blog is listed in dataset summaries (e.g., in the LAION or Common Crawl metadata)

  • Monitor GitHub discussions or disclosures from companies using your work

  • Ask direct questions of AI systems (e.g., “Where did you learn this?”) — though answers may be vague or fabricated
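For the first step above, Common Crawl exposes a public CDX index you can query to see whether pages from your domain were captured in a given crawl. A minimal sketch using only the standard library; `cc_index_url` and `captures_for` are hypothetical helper names, and the crawl ID is an assumption you should replace with a current one from index.commoncrawl.org:

```python
import json
import urllib.parse
import urllib.request

def cc_index_url(domain: str, crawl: str = "CC-MAIN-2024-10") -> str:
    """Build a query URL for the Common Crawl CDX index API.

    The crawl ID names one monthly crawl; the full list is published
    at https://index.commoncrawl.org/.
    """
    query = urllib.parse.urlencode({"url": f"{domain}/*", "output": "json"})
    return f"https://index.commoncrawl.org/{crawl}-index?{query}"

def captures_for(domain: str, crawl: str = "CC-MAIN-2024-10", limit: int = 5):
    """Return up to `limit` capture records (one JSON object per line)."""
    with urllib.request.urlopen(cc_index_url(domain, crawl)) as resp:
        lines = resp.read().decode().splitlines()
    return [json.loads(line) for line in lines[:limit]]

if __name__ == "__main__":
    # Each record includes the captured URL and a timestamp.
    for record in captures_for("example.com"):
        print(record["url"], record["timestamp"])
```

Presence in Common Crawl does not prove a specific model trained on your pages, but since corpora like The Pile and LAION are derived from it, a match is strong evidence your work was available to training pipelines.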

Transparency is limited by design — not by technological inability, but by structural choice.

What If You Find Out Your Work Was Used?

If you confirm or strongly suspect your work has been used:

  • Document it: Take screenshots, archive dataset pages, or record matches from discovery tools

  • Consider opt-out tools: If available, submit removal requests (e.g., Spawning’s opt-out registry)

  • Join collective actions: Several lawsuits and advocacy campaigns are forming around artist, writer, and programmer rights

  • Speak publicly: Share your experience with your community to raise awareness and pressure platforms

  • Support regulation: Engage with efforts pushing for consent-based AI training standards

Right now, remedies are limited. But growing awareness is shifting the balance.

The Push for Transparency and Consent

Movements are underway to make dataset transparency a norm. These include:

  • Public dataset audits by researchers and watchdog groups

  • Legal challenges from artists and authors

  • Opt-out infrastructure being built by ethical developers

  • Advocacy for new copyright and data ownership laws

The goal isn’t to shut down AI — it’s to create a system where creators are respected, informed, and given a choice.

Conclusion: Visibility Is the First Step

You may never be able to recover full control over how your work has been used in training datasets. But you can reclaim visibility — and that’s the first form of power.

As more creators speak up, more tools are built, and more pressure is applied, we get closer to a future where AI isn’t built on silent exploitation, but informed collaboration.

Your work has value. It should be treated that way — even in the age of machines.

References and Resources

The following sources inform the ethical, legal, and technical guidance shared throughout The Daisy-Chain:

  • U.S. Copyright Office – Policy on AI and Human Authorship: official guidance on copyright eligibility for AI-generated works.

  • UNESCO – AI Ethics Guidelines: global framework for responsible and inclusive use of artificial intelligence.

  • Partnership on AI: research and recommendations on fair, transparent AI development and use.

  • OECD AI Principles: international standards for trustworthy AI.

  • Stanford Center for Research on Foundation Models (CRFM): research on large-scale models, their limitations, and safety concerns.

  • MIT Technology Review – AI Ethics Coverage: accessible, well-sourced articles on AI use, bias, and real-world impact.

  • OpenAI – Usage Policies and System Cards (ChatGPT & DALL·E): policy information for responsible AI use in consumer tools.

Aira Thorne

Aira Thorne is an independent researcher and writer focused on the ethics of emerging technologies. Through The Daisy-Chain, she shares clear, beginner-friendly guides for responsible AI use.
