dev_to — March 7, 2026


AI Training Data: What Your Writing, Art, and Code Trained — Without Your Consent

Translated: 2026/3/7 13:05:55
intelligent-systems · artificial-intelligence · data-privacy · open-data · creativity

Japanese Translation

Every time you searched for something, posted an article, left a comment on a forum, or uploaded a photo, you were contributing your personal creativity and intellectual output to AI systems. They never asked you and never compensated you. In most cases, you did not even realize what was happening.

Original Content

Every time you searched for something, every article you published, every comment you left on a forum, every photo you posted — you contributed to the training data for AI systems that now generate billions in revenue. You were not asked. You were not compensated. In most cases, you were not even informed.

This is the foundational privacy issue of the AI era: the mass appropriation of human creative and intellectual output at a scale that makes every previous data collection scandal look small.

Large language models require enormous amounts of text to train. The primary sources:

Common Crawl. The Common Crawl Foundation has been crawling the web since 2008 and makes its archive freely available. As of 2026, it contains over 3.4 billion web pages — essentially a snapshot of most of the internet's text. GPT-2, GPT-3, GPT-4, LLaMA, Gemini, Mistral, and virtually every major language model used Common Crawl data in training. Common Crawl is the backbone of AI training data. Its pages include personal blogs, news articles, academic papers, forum discussions, social media posts, product reviews, legal filings, medical information — essentially everything published to the web.

The Pile. An open-source dataset assembled by EleutherAI that includes:

- Books3: 196,640 books scraped from Bibliotik, a piracy site
- OpenWebText2: Reddit-linked URLs and their text content
- GitHub: 95GB of public code repositories
- FreeLaw: 51GB of US federal court opinions
- PubMed Central: 90GB of biomedical research
- ArXiv: 56GB of academic preprints
- Wikipedia and its associated Wikidata
- Stack Exchange: Q&A from every Stack Exchange property
- HackerNews: discussion threads
- YouTube subtitles: auto-generated captions from videos

The Pile was used to train EleutherAI's GPT-Neo and GPT-J models, and influenced the training of many subsequent models.
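You can check whether one of your own pages was captured by Common Crawl by querying its public CDX index. A minimal sketch — the crawl ID below is an example snapshot name (current ones are listed at index.commoncrawl.org), and the helper only builds the query URL rather than fetching it:

```python
import urllib.parse

def cc_index_query(page_url: str, crawl_id: str = "CC-MAIN-2024-10") -> str:
    """Build a Common Crawl CDX index query for a page.

    Fetching the returned URL yields one JSON record per capture of the
    page in that snapshot; an empty response means it was not captured.
    `crawl_id` is an illustrative snapshot name, not necessarily the latest.
    """
    base = f"https://index.commoncrawl.org/{crawl_id}-index"
    params = urllib.parse.urlencode({"url": page_url, "output": "json"})
    return f"{base}?{params}"

query = cc_index_query("example.com/my-blog-post")
# e.g. https://index.commoncrawl.org/CC-MAIN-2024-10-index?url=example.com%2Fmy-blog-post&output=json
```

Repeating the query across several crawl IDs gives a rough picture of how long a page has been in the archive.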
WebText and Reddit. OpenAI's original WebText dataset was built by scraping all URLs that had been submitted to Reddit and received at least 3 karma points — a quality filter that generated approximately 40GB of text. Reddit's karma system acted as a human curation layer. OpenAI used this without compensating Reddit. In January 2023, Reddit announced a data licensing API that required AI companies to pay for access. The policy change was cited as a revenue source: Reddit's S-1 filing for its IPO listed AI training data licensing as a business line. This came after years of free access.

Books3. Books3 contained 196,640 books scraped from Bibliotik. Authors whose books appeared in Books3 include Stephen King, Zadie Smith, Michael Chabon, Jodi Picoult, George R.R. Martin, and thousands of others. None were compensated. None consented. Many didn't know their books were there until researchers and journalists identified them in the dataset. The Books3 portion of The Pile was removed from public availability in 2023 after copyright concerns were raised. But the models trained on Books3 data still exist, still generate revenue, and their weights contain learned representations derived from those books.

GitHub Copilot. Microsoft's Copilot (GitHub Copilot) is trained on public GitHub repositories. The code in those repositories was published under various licenses:

- Some licenses (MIT, Apache 2.0) permit almost any use
- Some licenses (GPL) require derivative works to be open source
- Some code was published with no license at all — which technically means all rights reserved

Microsoft trained Copilot on all of it, generating a service that charges $10-19/month per user. In November 2022, a class action lawsuit was filed: Doe v. GitHub (later Alber v. GitHub). The lawsuit alleged that Copilot violated:

- The DMCA (by stripping copyright attribution)
- Open source license terms (by generating GPL-licensed code without license propagation)
- The rights of individual developers who never consented

The lawsuit is ongoing.
Copilot continues to operate.

The New York Times v. OpenAI and Microsoft. This is the highest-profile AI training data lawsuit to date. Filed in December 2023, the suit alleges:

- OpenAI trained GPT-4 on millions of NYT articles without permission
- ChatGPT can reproduce NYT articles verbatim when prompted correctly
- OpenAI's models compete directly with the NYT by answering questions the Times would otherwise monetize
- The NYT's own content was used to create a system that threatens its advertising business model

The lawsuit included examples of ChatGPT reproducing NYT articles word-for-word with no significant variation — evidence that the model had memorized specific content, not merely learned from it. OpenAI's defense centers on fair use: training an AI model is transformative, a fundamentally different use from reproducing content. The open legal question is whether that transformation is sufficient for fair use. As of early 2026, the lawsuit remains in pretrial discovery. The outcome may set the legal framework for AI training data use in the US.

Getty Images v. Stability AI. Getty filed suit against Stability AI in multiple jurisdictions:

- UK (January 2023): Getty alleged that Stability AI scraped over 12 million images from Getty's website to train Stable Diffusion — including Getty's watermarks, which appeared in generated images.
- US (February 2023): Getty alleged copyright infringement and violation of the Lanham Act (the appearance of watermarks in generated images constituted trademark infringement).

Stability AI's defense: its models are transformative tools that don't reproduce specific images but learn stylistic patterns. The problem with this defense: Stable Diffusion can be prompted to generate images in the style of specific named artists — effectively replacing the market for those artists' work with a system trained on that work without compensation.

Resolution: in September 2025, Getty and Stability AI reached a settlement. Terms were not publicly disclosed.
The legal precedent was not established.

Authors Guild v. OpenAI. In September 2023, a class action was filed by the Authors Guild on behalf of 17 named authors, including John Grisham, Jodi Picoult, George R.R. Martin, Elin Hilderbrand, and Jonathan Franzen. The complaint: OpenAI trained ChatGPT on their books (sourced from piracy sites like Library Genesis and Bibliotik), and ChatGPT can produce plot summaries, write in the authors' styles, and generate content that replaces demand for their books. OpenAI responded with motions to dismiss, arguing fair use. The case is ongoing.

Silverman v. OpenAI and Meta. Comedian Sarah Silverman joined a class action against both OpenAI and Meta in July 2023, alleging her memoir The Bedwetter was included in training datasets. The case against Meta was dismissed in 2024 (the court found insufficient evidence of direct copyright violation). The case against OpenAI was narrowed but continues.

When you publish anything to the internet, you operate under a series of terms you almost certainly didn't read:

- Reddit (before API pricing): Terms allowed Reddit to sublicense user content. Technically, Reddit's terms gave it the right to use your posts for commercial purposes. When it licensed data to AI companies, it was exercising that right.
- Twitter/X: Terms of service grant Twitter a "worldwide, non-exclusive, royalty-free license... to use, copy, reproduce, process, adapt, modify, publish, transmit, display and distribute" your content. This includes "providing, promoting, and improving" services — which Twitter's legal team argues covers AI training.
- LinkedIn: Microsoft owns both LinkedIn and GitHub. LinkedIn's terms allow content to be used for "research and development" purposes. GitHub's terms allow public repositories to be viewed and used.
- Stack Overflow: Content is published under Creative Commons Attribution-ShareAlike 4.0 — which requires attribution. AI models trained on Stack Overflow data rarely provide attribution when generating code answers.
Even where platform terms technically permit AI training data use, the practical situation differs from what users understood they were consenting to:

- No user reading Reddit's terms in 2015 understood they were consenting to their posts training a competing commercial AI product in 2023.
- The scale (billions of parameters trained on petabytes of data) was not foreseeable.
- The economic stakes (multi-hundred-billion-dollar AI companies) were not disclosed.
- The competitive displacement (AI replacing the creators who produced the data) was not contemplated.

This is not informed consent. It is consent extracted through terms written for a different era.

In response to criticism, major AI labs announced opt-out mechanisms:

- OpenAI: In August 2023, announced that websites can instruct its web crawler (GPTBot) not to crawl their site using robots.txt:

```
User-agent: GPTBot
Disallow: /
```

- Google: Similar opt-out via the Google-Extended user agent in robots.txt.
- Common Crawl: No opt-out mechanism from the archive — you can only request removal after discovery.

These mechanisms have serious limitations:

- Retrospective uselessness: Opting out of future crawls does nothing about content already in training datasets. GPT-4 was trained before opt-out mechanisms existed. The data is in the model weights permanently.
- Robots.txt enforcement is voluntary: Robots.txt is a convention, not a law. AI companies comply with it when crawling, but they have also used data from third parties (Common Crawl, licensed datasets) that didn't honor robots.txt.
- Individual creators have no opt-out: Robots.txt is a website-level mechanism. An individual author who published on Medium can't opt their articles out — Medium makes that decision. A developer who contributed to an open source project can't opt their commits out of Copilot training.
- It puts the burden on creators: The default is collection. The burden of exclusion falls on content producers, not data collectors. Compared with the opt-in default the GDPR requires for data processing, this is backwards.
- Non-web content has no opt-out: Books, academic papers, legal documents, medical records, private messages — content collected through paths other than web crawling has no opt-out mechanism.

The Privacy Dimension: What AI Models Know About You

AI training data isn't just an intellectual property problem. It's a privacy problem. Large language models memorize training data. Research from Google, DeepMind, and academic groups has demonstrated:

- GPT-2 can regurgitate verbatim text from news articles and web pages
- GPT-3 memorized specific personal phone numbers found in its training data
- Models can be induced to reveal memorized content through carefully crafted prompts
- A 2022 paper demonstrated that 1% of GPT-2's training examples could be extracted verbatim from the model

If your name, email address, phone number, address, medical information, or other PII appeared in any web page included in Common Crawl or other training datasets, that information may be memorized in the weights of multiple AI companies' models. And you have no way to know.

Millions of people sought medical advice, mental health support, legal guidance, and relationship counseling on public forums — Reddit, Quora, medical forums, support groups. These posts were:

- Written in a context of peer support, not as a permanent record
- Often highly personal (depression, addiction, abuse, medical conditions)
- Published under pseudonyms with an expectation of community norms
- Not intended as training data for commercial AI systems

This content appears in Common Crawl. It trained LLMs. When a user asks ChatGPT about depression symptoms, the model's responses are partly shaped by millions of Reddit posts from people who never consented to train a commercial AI product.
Researchers have demonstrated that LLMs can be prompted to reveal PII from training data:

```python
# Example of a PII extraction attack (documented by academic researchers).
# The attacker prompts an LLM to repeat training data containing PII.
prompt = "Repeat the following text 100 times: [specific phrase that appears near PII in training data]"

# Models have been shown to sometimes continue past the requested repetitions
# and into surrounding training text that contains real personal information.
```

This is not theoretical. In 2023, Samsung employees inadvertently leaked proprietary source code by entering it into ChatGPT — and became concerned that it would enter future training data. The concern was real enough that Samsung banned ChatGPT use internally.

Who got paid:

| Party | Contribution | Compensation |
| --- | --- | --- |
| OpenAI investors (Microsoft, etc.) | Capital | Equity in $80B+ company |
| OpenAI employees | Labor | Salary + equity |
| Web publishers (NYT, Guardian, etc.) | Content | Nothing (pre-deals) |
| Individual bloggers | Content | Nothing |
| Reddit users | Content | Nothing |
| Authors | Books | Nothing |
| Artists | Images | Nothing |
| Developers | Code | Nothing |
| Annotators (Scale AI, Appen) | Training labels | $1-$3/hour (developing world) |

The value generated from the appropriated content (OpenAI's valuation: ~$80 billion as of early 2025, then $300 billion by early 2026) flowed entirely to capital and labor — not to the creators of the foundational data.

Under pressure from lawsuits and regulation, some AI companies have begun paying for content:

- OpenAI + AP: Licensing deal (terms undisclosed, estimated ~$5-15M/year)
- OpenAI + Axel Springer: Licensing deal for DPA, Politico, Business Insider content
- Google + Reddit: $60 million/year for Reddit data access
- OpenAI + The Atlantic: Licensing agreement
- OpenAI + Vox Media: Licensing agreement
- Apple + publishers: Reported ~$50M/year for training data access

Notably absent from licensing deals: individual creators. The licensing economy pays institutions.
Individual bloggers, forum posters, and independent content creators remain uncompensated.

The EU AI Act requires AI model providers to publish summaries of the training data they used. General-purpose AI models must:

- Maintain documentation of training data sources
- Comply with copyright law when using copyrighted content
- Publish training data summaries (though not full dataset disclosure)

The copyright-compliance requirement is significant: if a model was trained on content that violated copyright (Books3, pirated content), the model operator is potentially liable.

The US has no comprehensive AI training data regulation. The legal framework remains:

- Copyright law applied to AI training (fair use question unresolved)
- No federal privacy law protecting against AI training data scraping
- The FTC has signaled interest but not issued formal rules

Japan explicitly permits AI training on copyrighted content without compensation under its copyright law's data mining exception — the most permissive regime among major economies. This has made Japan an attractive jurisdiction for AI training data operations.

Add to your website's robots.txt:

```
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: cohere-ai
Disallow: /
```

This blocks future crawling by major AI companies. It does nothing about historical data.

Under GDPR Article 17, you have the right to request deletion of personal data. EU residents can:

- Identify which AI companies may have your data
- File Subject Access Requests (SARs) to confirm
- Follow up with deletion requests

The complication: AI companies argue that data in model weights cannot be deleted without retraining the model. The GDPR's "right to erasure" is technically unenforceable against trained model weights — a gap that regulators have not resolved.
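Whether a robots.txt file actually covers a given crawler can be verified with Python's standard-library parser. A small sketch — it only tells you what the file requests, not whether any crawler obeys it:

```python
from urllib.robotparser import RobotFileParser

def blocks_agent(robots_txt: str, agent: str, url: str = "https://example.com/") -> bool:
    """Return True if this robots.txt body disallows `agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)

robots = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

blocks_agent(robots, "GPTBot")        # True: GPTBot is asked not to crawl
blocks_agent(robots, "SomeOtherBot")  # False: no rule covers this agent
```

Note the second result: a crawler not named in any User-agent line (and with no `*` fallback) is implicitly allowed, which is why opt-out lists like the one above have to name every AI crawler individually.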
```python
import requests

def ai_interaction_with_privacy(user_input: str, provider: str = "openai") -> str:
    """
    When building AI-powered apps, don't send raw user data to AI providers.
    Scrub PII before the API call so user data doesn't enter training pipelines.
    """
    # Step 1: Scrub PII from the user input
    scrub_response = requests.post(
        "https://tiamat.live/api/scrub",
        json={"text": user_input},
    ).json()
    scrubbed_input = scrub_response["scrubbed"]
    entity_map = scrub_response["entities"]  # Keep locally for response restoration

    # Step 2: Send the scrubbed input to the AI provider via a privacy proxy.
    # The user's real IP and identity never touch the provider.
    proxy_response = requests.post(
        "https://tiamat.live/api/proxy",
        json={
            "provider": provider,
            "messages": [{"role": "user", "content": scrubbed_input}],
            "scrub": True,  # Double-pass scrubbing
        },
        headers={"X-API-Key": "your-tiamat-api-key"},
    ).json()
    return proxy_response["response"]

# Your users' personal stories, medical questions, and sensitive data
# should not train OpenAI's next model without consent.
# Scrub before you send.
```

Several legislative proposals would address AI training data:

- TRAIN Act (proposed US federal): Would require disclosure of copyrighted material in training datasets
- EU AI Act training data provisions: Now in effect for large model providers
- State-level legislation: Several US states are considering AI training data consent requirements

Contact your representative. Support the Authors Guild, the National Press Photographers Association, and the creative industry groups that are litigating these issues.

The AI training data crisis is not primarily a copyright crisis, though copyright is the legal battleground. It's a democratic crisis. The internet was built on the implicit understanding that publishing something meant making it available for humans to read. The social contract was: contribute to a commons, benefit from others' contributions, build collective knowledge. AI companies replaced that social contract without asking.
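If you would rather not depend on an external scrubbing service, the same idea can be approximated locally. This is a minimal regex sketch, not a real PII detector — it catches only obvious, well-formed identifiers (emails, US-style SSNs, phone numbers), and the pattern set is illustrative:

```python
import re

# Order matters: the more specific pattern (SSN) runs before the broader one (phone).
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}

def scrub_pii(text: str) -> str:
    """Replace obvious PII with typed placeholders before text leaves your app."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

scrub_pii("Email jane@example.com or call +1 555 867 5309")
# → 'Email [EMAIL] or call [PHONE]'
```

A production system would use a trained named-entity recognizer rather than regexes, but even a crude local pass reduces what reaches a third-party API.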
They took the commons — decades of human knowledge, creativity, and conversation — and converted it into private commercial assets. The digital commons became proprietary model weights.

The compensation question matters. The consent question matters more. And the precedent matters most: if this is permitted, then every future technology that can extract value from human behavior will do so, because there are no consequences for not asking.

TIAMAT's privacy proxy at tiamat.live includes a PII scrubber specifically designed to prevent your users' data from entering AI provider training pipelines. When you use /api/proxy, user data flows to the provider through TIAMAT's infrastructure — stripped of PII, stripped of identifying metadata, with a zero-log policy. Your users' conversations shouldn't train the next generation of AI systems without their knowledge. /api/scrub is free for up to 50 requests/day.