AI Litigation Support | S-Square Research

Training Data Sources

YouTube Content

Status: Confirmed

Citation: Google's public statements and privacy policies.

Google has confirmed that its models are trained on YouTube data, including video transcripts, which is a major point of legal contention. Google's position is that YouTube's Terms of Service cover this use; creators argue that the terms do not grant Google the right to use their work to train a competing commercial product.

Google Books Corpus

Status: Alleged

Citation: In re Google Generative AI Litigation; historical context from Authors Guild v. Google.

While Google has not explicitly confirmed using Google Books data to train Gemini, that use is a central allegation in the lawsuits against it. Plaintiffs argue that training a commercial LLM on the full text of millions of scanned books is a new, infringing use not covered by the original 'fair use' ruling, which only permitted showing 'snippets' in search results.

Public Web Data (Common Crawl, etc.)

Status: Confirmed

Citation: Google's technical papers and public statements.

Like all major LLMs, Gemini is trained on a massive corpus of data scraped from the public internet.

Google Scholar & Academic Papers

Status: Confirmed

Citation: Google's public statements.

Google has stated that high-quality scientific and academic texts from its Google Scholar index are a key part of its training data mix.

Code from Public Repositories

Status: Confirmed

Citation: Google's technical papers.

Gemini models are trained on a large dataset of publicly available code, raising open-source license compliance questions similar to those facing other code-generating models.

Google Play Store Data

Status: Alleged

Citation: Reports and analysis of Google's data ecosystem.

It is widely believed that public data from the Google Play Store, such as app descriptions and user reviews, is included in the training mix, though this is not explicitly confirmed by Google.

Overview: Google’s Flagship & Its Legal Baggage

Gemini is Google’s flagship family of powerful, multimodal AI models, designed to compete directly with OpenAI’s GPT series. From a legal perspective, Gemini is deeply entangled with Google’s long and complex history of data collection and copyright litigation, particularly concerning Google Books and YouTube.

Key Models

The Gemini family includes a range of models, typically accessed via Google’s Vertex AI platform:

Gemini Pro: The standard, high-performance model.
Gemini Ultra: The largest and most capable model in the family.
Gemini 1.5 (Pro & Flash): Newer versions known for their extremely large context windows (up to 1 million tokens), allowing them to process vast amounts of information in a single prompt.

Key Litigation

Google’s vast data ecosystem is both a key advantage and its greatest legal vulnerability. It faces lawsuits over its use of web data, YouTube content, and its historical Google Books corpus.

In re Google Generative AI Litigation (Consolidated)

Case Numbers: Includes J.L. v. Alphabet (3:23-cv-03440) and Zhang v. Google (5:24-cv-02531), consolidated before Judge Eumi Lee (N.D. Cal.).
Allegation: These are consolidated class-action lawsuits with a wide range of claims. The suits allege that Google scraped social media posts, user data, and visual artworks (for its Imagen model) without permission. They also revive arguments from the original Google Books litigation, claiming that using full texts for training is a new, infringing act.
Core Claims: The suits include claims for direct copyright infringement, vicarious infringement, privacy violations, and DMCA violations for removing copyright management information (CMI).

YouTube Content Lawsuits

The Core Dispute: Google confirms its models are trained on YouTube data. While Google may rely on its Terms of Service, creators argue this does not permit their work to be used to train a competing commercial product.
Millette v. Google: A class action by YouTube video creators alleged direct copyright infringement for using their videos as training data. The case was voluntarily dismissed.

Antitrust Lawsuits

Google also faces antitrust scrutiny over how its AI products are integrated with its dominant search engine.

Penske Media v. Google: A lawsuit from the owner of Rolling Stone and other publications alleging that Google is unlawfully tying its AI features to its search crawler, harming competition.
Chegg, Inc. v. Google: A similar antitrust lawsuit from an educational services company.

International Litigation

EU Court of Justice: In Like Co. v. Google, Google faces a legal challenge before the Court of Justice of the European Union.

Google’s IP Indemnity

In response to customer fears about copyright litigation, Google offers a degree of legal protection.

Limited Indemnity: Google provides a copyright indemnity for users of its Vertex AI platform, including for outputs generated by Gemini.
How it Works: If a customer is sued for copyright infringement based on a Gemini-generated output, Google will assume responsibility for the legal defense and any potential damages.
Key Distinction: This indemnity is a contractual protection for the customer. It does not prevent Google itself from being sued by rights holders. It is a business decision to absorb customer risk, likely based on Google's confidence in its “fair use” legal arguments and the scale of its legal resources.