Known Lawsuits: 1

Training Data Sources

  • Gemini training data

    Status: Alleged

    Citation: Google states Gemma is 'built from the same research and technology' as Gemini.

  • Web data

    Status: Reported

  • Licensed content

    Status: Reported

Overview: An “Open” Model with Proprietary Roots

Gemma is a family of open-weight language models from Google. From a legal perspective, Gemma’s most critical feature is its origin: Google states it is “built from the same research and technology used to create the Gemini models.” This direct lineage means that Gemma is an “open” asset derived from a proprietary, legally encumbered data supply chain.

Key Models

The Gemma family consists of several lightweight and efficient models:

  • Gemma 2B & 7B: Smaller models designed for on-device applications, research, and less demanding tasks.
  • Gemma 3: The most recent and most capable generation in the family at the time of writing.

Training Data: The Gemini Inheritance

The legal risk associated with Gemma is inherited directly from Gemini.

  • Shared Data Pipeline: Because Gemma is based on Gemini’s technology, it was very likely trained on a similar corpus of data, although Google has not published the full list of sources. That corpus is alleged to include data from the Google Books project, which is the subject of In re Google Generative AI Copyright Litigation (N.D. Cal., Judge Lee), as well as data from YouTube.
  • Risk Transference: By releasing Gemma as an open model, Google effectively shifts the legal risk of using a model trained on this data to the end user. Unlike with the proprietary Gemini models (where Google offers an IP indemnity), a developer who downloads and uses Gemma is the party directly deploying the model and would likely be the primary target of a copyright infringement lawsuit based on its output.
  • “Filtered” Data: Google claims the training data was filtered to remove sensitive information and “certain material.” However, this is not a guarantee that all copyrighted content was removed, and the filtering process is not transparent.

The Gemma Terms of Use

Gemma is not released under a standard permissive license like Apache 2.0. Instead, it is governed by a custom license with specific, legally significant restrictions.

Key Terms & Restrictions

  • Prohibited Use Policy: Use of the Gemma models is subject to a “Prohibited Use Policy.” This policy forbids using the models for any illegal activities or for generating harmful or abusive content.
  • Use-Based Restrictions: The license includes restrictions on how the model can be used. For example, it may forbid use in certain critical industries or for applications that could have a high-stakes impact.
  • No Indemnity: The license offers no warranty or indemnity to the user. The user assumes all risk and liability for the model’s outputs.
  • A One-Way Street: The Gemma license is a one-way transfer of risk from Google to the user. Google provides the powerful model but explicitly disclaims any responsibility for what it produces.
  • Controlled “Openness”: The custom license lets Google retain control over how its “open” models are used, so that they cannot be deployed in ways that could create legal or reputational problems for Google. It is a strategy of “controlled openness” rather than true open source.