Training Data Sources
Licensed Content (News, Books, etc.)
Status: Confirmed
Citation: Apple's WWDC presentations and public statements.
Apple has confirmed it licenses content from various publishers. The specific partners are largely undisclosed, but licensing is a key part of its strategy to mitigate copyright risk. The licensed material likely includes news archives, books, and other structured data.
Web Data via AppleBot
Status: Confirmed
Citation: Apple's public documentation on the AppleBot web crawler.
Apple uses its web crawler, AppleBot, to gather publicly available information from the internet. This is the same method used by other AI labs and exposes Apple to similar copyright challenges from publishers who do not want their content scraped. Publishers can use `robots.txt` to opt out.
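The opt-out mechanism can be illustrated with Python's standard-library `robots.txt` parser. This is a minimal sketch, assuming a hypothetical publisher at example.com and assuming the crawler identifies itself with the user-agent token `Applebot`; Apple's crawler documentation is the authority on which tokens it actually honors.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example publisher (not a real site's file).
# The "Applebot" token is an assumption; confirm it against Apple's documentation.
ROBOTS_TXT = """\
User-agent: Applebot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given user agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical article URL used only for illustration.
ARTICLE_URL = "https://example.com/articles/1"

applebot_allowed = is_allowed(ROBOTS_TXT, "Applebot", ARTICLE_URL)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", ARTICLE_URL)

print("Applebot allowed:", applebot_allowed)
print("Other crawler allowed:", other_allowed)
```

Because the rule must be added by the publisher, content is crawlable by default, which is what makes this an opt-out rather than an opt-in system.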
On-Device Data (User Content)
Status: Denied
Citation: Apple's privacy policies and public statements.
Apple has explicitly and repeatedly stated that it does **not** use private user data (e.g., photos, emails, messages) to train its cloud-based foundation models. On-device models may learn from user context locally, but this data is not sent to Apple's servers.
Overview: A Hybrid System
Apple Intelligence is not a single AI model but a hybrid system integrated into Apple’s operating systems. It uses a combination of on-device models, cloud-based models running on “Private Cloud Compute,” and a partnership with OpenAI’s ChatGPT.
- On-Device Models: Handle most tasks to ensure user privacy. These are smaller, efficient models.
- Private Cloud Compute: For more complex requests, Apple uses larger models running on its own servers with Apple Silicon. The company emphasizes that data is not stored and is protected by end-to-end encryption.
- OpenAI Partnership: When a query requires broader world knowledge that Apple’s models are not trained on, the system can pass the request to ChatGPT. Users are prompted before this happens.
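The three tiers above can be sketched as a simple routing decision. This is an illustrative sketch only, not Apple's implementation: the `Request` fields, the `route` function, and the `NOT_SENT` outcome for a user who declines the ChatGPT prompt are all hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    ON_DEVICE = "on-device model"
    PRIVATE_CLOUD = "Private Cloud Compute"
    CHATGPT = "ChatGPT"
    NOT_SENT = "not sent"  # assumption: the request is not forwarded if the user declines


@dataclass
class Request:
    text: str
    is_complex: bool             # hypothetical flag: too demanding for the on-device model
    needs_world_knowledge: bool  # hypothetical flag: outside what Apple's models cover


def route(request: Request, user_consents_to_chatgpt: bool) -> Tier:
    """Pick a processing tier following the three-tier description in the text."""
    if request.needs_world_knowledge:
        # The text says users are prompted before a request is passed to ChatGPT.
        return Tier.CHATGPT if user_consents_to_chatgpt else Tier.NOT_SENT
    if request.is_complex:
        return Tier.PRIVATE_CLOUD
    return Tier.ON_DEVICE


if __name__ == "__main__":
    example = Request("Summarize this long report", is_complex=True, needs_world_knowledge=False)
    print(route(example, user_consents_to_chatgpt=False).value)
```

In a sketch like this, the returned tier is a record of which system handled the request, the kind of provenance information that bears on the liability questions discussed later in this document.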
Apple’s Foundation Models
Apple has developed its own family of models, though specific versioning is not always public.
- On-Device Foundation Model: A 3-billion-parameter model optimized to run directly on iPhones and Macs.
- Server-Based Models: Larger, more capable models running in Apple’s Private Cloud Compute.
- MM1 Family: A family of multimodal models (up to 30B parameters) capable of understanding and interpreting both text and images.
Training Data & Copyright Risk
Apple’s training data strategy is a critical area for legal scrutiny.
- Licensed Data: Apple has publicly stated it licenses content for training its foundation models, a direct attempt to mitigate copyright risk. However, the scope and nature of these licensing deals are not public.
- Web-Crawled Data (AppleBot): Apple also uses its web crawler, AppleBot, to collect public data from the internet. This practice exposes Apple to the same legal challenges as other AI developers who scrape web content. Publishers can use `robots.txt` to block AppleBot, but it’s an opt-out system.
- No Use of User Data: Apple explicitly states it does not use private user data (e.g., photos, emails) to train its commercial foundation models.
Key Legal & Liability Questions
The structure of Apple Intelligence raises several unique legal questions relevant to copyright disputes.
Division of Liability
- Apple vs. OpenAI: When a user receives an infringing output, determining whether it came from Apple’s proprietary model or was generated by ChatGPT is critical. The user is notified before the handoff, which may be an attempt by Apple to shift liability to OpenAI for those queries.
- “Black Box” Problem: It may be difficult for outside parties to determine which model produced a given output, complicating discovery in a legal dispute.
“Privacy-First” Defense
- Apple’s heavy emphasis on privacy and on-device processing may be used as a “good faith” argument in legal disputes, positioning the company as a more responsible actor than its competitors. However, this does not absolve it of potential infringement from its web-crawled data.
Use of Licensed Data
- The existence of licensing deals could be used to argue that Apple has made a good-faith effort to respect copyright. However, the effectiveness of this defense depends on the breadth of the licenses. If infringing content is generated from sources outside these deals (e.g., from AppleBot’s crawling), the company remains exposed.
Key Litigation
Hendrix v. Apple Inc.
- Case Number: 3:25-cv-075583 (N.D. Cal.)
- Filing Date: September 11, 2025
- Allegation: A proposed class action by book authors, including Grady Hendrix and Jennifer Roberson, alleging that Apple used their books without permission to train Apple Intelligence. The complaint claims that Apple sourced these works from “shadow libraries” (pirated book databases).
- Core Claim: Direct copyright infringement for the initial act of copying works for the training data library.