Training Data Sources

  • Web data

    Status: Reported

    Citation: AI21 Labs documentation

Overview

AI21 Labs is a prominent Israeli AI company that develops and commercializes large language models under the Jurassic family. These are proprietary, closed-source models accessible primarily through a commercial API. From a legal standpoint, the company is notable for its limited public disclosure regarding its training data and methodologies.

Key Models

The flagship model series is Jurassic-2 (J2), which comes in several sizes to balance performance and cost:

  • J2-Ultra: The most powerful model in the series.
  • J2-Jumbo: A large model designed for high-performance tasks.
  • J2-Grande & J2-Mid: Smaller models for less intensive applications.

Legal Risk

The primary legal risk associated with AI21 Labs stems from the opacity of its training data.

  • Stated Sources: The company has stated its models are trained on vast amounts of public web data.
  • Lack of Transparency: Crucially, AI21 Labs has not disclosed the specific datasets used to train the Jurassic models, so third parties cannot audit the sources for copyrighted material.
  • Implied Risk: Although AI21 Labs has not faced major lawsuits, its use of broad, unaudited web data places it in a risk category similar to that of other AI developers that have been sued for copyright infringement. The core issue is the same: training a commercial product on potentially copyrighted internet content without attribution or license.
  • Proprietary & Closed-Source: Because the Jurassic models are closed-source and accessible only via an API, external researchers have a harder time probing them for issues such as copyright regurgitation than they would with open-source models.
  • “Black Box” Nature: The lack of data transparency creates a “black box” problem. In the event of a dispute, establishing a clear link between a model’s output and a specific copyrighted source would require significant legal discovery to uncover the training data composition.
  • Low Litigation Profile: To date, AI21 Labs has avoided the high-profile copyright litigation faced by competitors such as OpenAI, Google, and Stability AI. However, this could change as legal scrutiny of the AI industry intensifies. Its current low profile does not negate the underlying legal risks associated with its data practices.