Training Data Sources
Public Web Data (Common Crawl, etc.)
- Status: Confirmed
- Citation: Meta's Llama 2 and Llama 3 research papers.
Meta has confirmed that the vast majority of its pre-training data comes from publicly available sources like Common Crawl. For Llama 2, this category represented 67% of the training mix.
Proprietary Meta Data (Facebook, Instagram)
- Status: Confirmed
- Citation: Meta's public statements and privacy policy updates.
Meta has confirmed that it uses public-facing data from its own products, like Facebook and Instagram posts, to train its models. This does **not** include private user content like messages. The use of this data is a key point of contention for users and regulators.
Code Data (GitHub, etc.)
- Status: Confirmed
- Citation: Meta's Code Llama research paper.
A significant portion of the training data, especially for Code Llama and the main Llama models, is publicly available code from sources like GitHub. This raises a separate set of legal questions about potential violations of open-source software licenses.
Academic & Scientific Papers (arXiv, etc.)
- Status: Confirmed
- Citation: Meta's Llama 2 research paper.
A small but high-quality portion of the training data comes from scientific papers and textbooks.
The Pile (including "Books3")
- Status: Alleged
- Citation: Numerous class-action lawsuits (e.g., Kadrey v. Meta).
This is the most controversial part of the alleged training data. Lawsuits allege that Meta used "The Pile," an open-source dataset that contains the "Books3" corpus of 196,640 pirated books. Meta has neither confirmed nor denied this, but the allegations are central to the copyright lawsuits it faces.
Overview: The “Open” Alternative & Its Legal Test
LLaMA (Large Language Model Meta AI) is a family of powerful open-source models from Meta. These models are legally and strategically significant as they represent the primary “open” alternative to closed-source leaders like OpenAI’s GPT series. However, LLaMA shares the same core legal challenge as its competitors: it faces major lawsuits from authors alleging it was trained on copyrighted books.
Key Models
The most important models in the family are:
- LLaMA 2: The first version released with a license permitting commercial use, which catalyzed a massive wave of open-source AI development.
- LLaMA 3 & 3.1: More powerful successors, including a massive 405B parameter model, designed to compete with the best closed-source models.
- Code Llama: A specialized version fine-tuned for code generation tasks, which carries its own specific copyright risks related to software licenses (see DeepSeek).
Key Litigation
Meta faces several major lawsuits, consolidated in the Northern District of California before Judge Vince Chhabria, primarily focused on the LLaMA models’ training data.
Kadrey v. Meta Platforms, Inc. (Consolidated)
- Case Numbers: Includes Kadrey v. Meta (3:23-cv-03417), Chabon v. Meta (3:23-cv-04663), and others.
- Allegation: A consolidated class-action lawsuit by authors (including Richard Kadrey, Sarah Silverman, and Michael Chabon) alleging that Meta used their books without permission to train the LLaMA models. The authors claim the training data was sourced from the controversial "Books3" dataset, which originates from a "shadow library" of pirated books.
- Status: Judge Chhabria has issued key rulings in this case:
- Fair Use: Granted Meta’s motion for partial summary judgment on fair use (June 25, 2025).
- DMCA CMI: Granted Meta’s motion for partial summary judgment on the claim of removing Copyright Management Information (June 27, 2025).
- Core Claims: The remaining primary claim is for direct copyright infringement based on the copying of works for training.
Strike 3 Holdings v. Meta Platforms, Inc.
- Case Number: 4:2025cv06213 (N.D. Cal.)
- Filing Date: July 23, 2025
- Allegation: The adult film studio Strike 3 Holdings alleges that Meta engaged in copyright infringement by using BitTorrent to download its adult videos and then using them to train the Llama 4 model.
- Core Claims: Direct and secondary copyright infringement.
International Litigation
- Canada: Meta faces multiple lawsuits in Canada, including MacKinnon v. Meta, Robillard v. Meta, and Clare v. Meta.
- France: The National Publishing Union and other author groups have filed a copyright lawsuit against Meta in a Paris court.
The LLaMA Community License
LLaMA models are not released under a standard permissive license. They are governed by a custom license designed to achieve specific business goals for Meta.
Key Terms & Restrictions
- Free for Most: The license is free and permits commercial use, modification, and distribution for the vast majority of users.
- Restriction for Hyperscalers: The license contains a crucial restriction. Any company with more than 700 million monthly active users at the time of the model’s release is not granted a license and must request one directly from Meta.
- The Target: This clause is aimed squarely at Meta’s largest competitors: Google, Apple, Amazon, Microsoft, and ByteDance (TikTok). It prevents them from directly using Meta’s powerful open-source models to improve their own competing products.
Legal Implications
- Commoditizing the Model Layer: By releasing a powerful free model that all companies except its biggest rivals can use, Meta's strategy is to commoditize the AI model itself. This undermines the business model of companies like OpenAI that charge for API access to their models.
- Risk Transference: By open-sourcing the model, Meta transfers the direct legal risk of using a model trained on allegedly infringing data to the thousands of developers and companies who build on top of it.
- Controlled “Openness”: Like Google’s Gemma license and Databricks’ DBRX license, the LLaMA license is a form of “controlled open-source” that uses licensing terms to achieve a strategic commercial advantage.