AI Litigation Support | S-Square Research

Training Data Sources

Licensed Content (News, Books, etc.)

Status: Confirmed

Citation: Apple's WWDC presentations and public statements.

Apple has confirmed that it licenses content from publishers. The specific partners are largely undisclosed, but licensing is a key part of its strategy to mitigate copyright risk. The licensed material likely includes news archives, books, and other structured data.

Web Data via AppleBot

Status: Confirmed

Citation: Apple's public documentation on the AppleBot web crawler.

Apple uses its web crawler, AppleBot, to gather publicly available information from the internet. Other AI labs use the same method, and it exposes Apple to similar copyright challenges from publishers who do not want their content scraped. Publishers can use `robots.txt` to opt out.

On-Device Data (User Content)

Status: Denied

Citation: Apple's privacy policies and public statements.

Apple has explicitly and repeatedly stated that it does **not** use private user data (e.g., photos, emails, messages) to train its cloud-based foundation models. On-device models may learn from user context locally, but this data is not sent to Apple's servers.

Overview: A Hybrid System

Apple Intelligence is not a single AI model but a hybrid system integrated into Apple’s operating systems. It uses a combination of on-device models, cloud-based models running on “Private Cloud Compute,” and a partnership with OpenAI’s ChatGPT.

On-Device Models: Handle most tasks to ensure user privacy. These are smaller, efficient models.
Private Cloud Compute: For more complex requests, Apple uses larger models running on its own servers with Apple Silicon. The company emphasizes that data is not stored and is protected by end-to-end encryption.
OpenAI Partnership: When a query requires broader world knowledge that Apple’s models are not trained on, the system can pass the request to ChatGPT. Users are prompted before this happens.
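
The three tiers above can be pictured as a simple routing decision. The sketch below is purely illustrative and is not Apple's implementation: the names `Tier`, `Request`, `RoutedResult`, and `route`, the two boolean flags, and the fallback when a user declines the handoff are all assumptions made for this example. It shows why a record of which tier answered a request matters for attributing an output.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    ON_DEVICE = "on-device model"
    PRIVATE_CLOUD = "Private Cloud Compute"
    CHATGPT = "ChatGPT (third party)"


@dataclass
class Request:
    text: str
    is_complex: bool = False             # hypothetical flag, not a documented criterion
    needs_world_knowledge: bool = False  # hypothetical flag, not a documented criterion


@dataclass
class RoutedResult:
    tier: Tier
    user_was_prompted: bool


def route(request: Request, user_consents: bool) -> RoutedResult:
    """Pick a tier for a request, following the three-tier description above."""
    if request.needs_world_knowledge:
        # The user is asked before any handoff to the third-party model.
        if user_consents:
            return RoutedResult(Tier.CHATGPT, user_was_prompted=True)
        # Assumption: a declined handoff falls back to an Apple-run tier.
        return RoutedResult(Tier.PRIVATE_CLOUD, user_was_prompted=True)
    if request.is_complex:
        return RoutedResult(Tier.PRIVATE_CLOUD, user_was_prompted=False)
    return RoutedResult(Tier.ON_DEVICE, user_was_prompted=False)
```

In this sketch, `route(Request("summarize this note"), user_consents=False)` stays on the device, while a request flagged `needs_world_knowledge=True` reaches ChatGPT only when the user consents. Whether any comparable per-request record exists in the real system is unknown.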

Apple’s Foundation Models

Apple has developed its own family of models, though specific versioning is not always public.

On-Device Foundation Model: A 3-billion-parameter model optimized to run directly on iPhones and Macs.
Server-Based Models: Larger, more capable models running in Apple’s Private Cloud Compute.
MM1 Family: A family of multimodal models (up to 30B parameters) capable of understanding and interpreting both text and images.

Training Data & Copyright Risk

Apple’s training data strategy is a critical area for legal scrutiny.

Licensed Data: Apple has publicly stated it licenses content for training its foundation models, a direct attempt to mitigate copyright risk. However, the scope and nature of these licensing deals are not public.
Web-Crawled Data (AppleBot): Apple also uses its web crawler, AppleBot, to collect public data from the internet. This practice exposes Apple to the same legal challenges as other AI developers who scrape web content. Publishers can use robots.txt to block AppleBot, but this is an opt-out system: content is crawled by default unless the publisher takes action.
No Use of User Data: Apple explicitly states it does not use private user data (e.g., photos, emails) to train its commercial foundation models.
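
The robots.txt opt-out mentioned above can be checked programmatically. The following is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the `is_allowed` helper are hypothetical. The `Applebot-Extended` token is the one Apple's Applebot documentation describes for opting out of model training, but exact token names should be verified against that documentation before relying on them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example publisher site.
# Assumption: "Applebot-Extended" is the token governing use for model training.
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


URL = "https://example.com/articles/1"
applebot_extended_allowed = is_allowed(ROBOTS_TXT, "Applebot-Extended", URL)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", URL)
```

A check like this only shows what a site's robots.txt requested at the time it was retrieved. It does not show whether a crawler honored the rule, or what was collected before the rule was added.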

Key Legal & Liability Questions

The structure of Apple Intelligence raises several unique legal questions relevant to copyright disputes.

Division of Liability

Apple vs. OpenAI: When a user receives an infringing output, determining whether it came from Apple’s proprietary model or was generated by ChatGPT is critical. The user is notified before the handoff, which may be an attempt by Apple to shift liability to OpenAI for those queries.
“Black Box” Problem: It may be difficult for outside parties to determine which model produced a given output, complicating discovery in a legal dispute.

“Privacy-First” Defense

Apple’s heavy emphasis on privacy and on-device processing may be used as a “good faith” argument in legal disputes, positioning the company as a more responsible actor than its competitors. However, this does not absolve it of potential infringement from its web-crawled data.

Use of Licensed Data

The existence of licensing deals could be used to argue that Apple has made a good-faith effort to respect copyright. However, the effectiveness of this defense depends on the breadth of the licenses. If infringing content is generated from sources outside these deals (e.g., from AppleBot’s crawling), the company remains exposed.

Key Litigation

Hendrix v. Apple Inc.

Case Number: 3:25-cv-075583 (N.D. Cal.)
Filing Date: September 11, 2025
Allegation: A proposed class action by book authors, including Grady Hendrix and Jennifer Roberson, alleging that Apple used their books without permission to train Apple Intelligence. The complaint claims that Apple sourced these works from “shadow libraries” (pirated book databases).
Core Claim: Direct copyright infringement for the initial act of copying works for the training data library.
