AI Litigation Support | S-Square Research

Training Data Sources

Public Web Data & Books

Status: Confirmed

Citation: Databricks' DBRX blog post and technical documentation.

Databricks states that DBRX was trained on a 12-trillion-token dataset composed of a 'carefully curated' mix of public and proprietary data. The public portion includes well-known open datasets like Common Crawl and The Pile.

Proprietary Licensed Data

Status: Confirmed

Citation: Databricks' DBRX blog post.

A significant portion of the training data is proprietary data licensed by Databricks. The company has not disclosed its data partners. The suggestion that its enterprise focus gives it access to high-quality, specialized datasets is an inference, not a disclosed fact.

Code Data

Status: Confirmed

Citation: Databricks' technical documentation.

The training data includes a large corpus of publicly available code, used to enhance the model's reasoning and coding abilities.

The Pile (including "Books3")

Status: Alleged

Citation: O'Nan v. Databricks & MosaicML lawsuit.

Because the open-source dataset 'The Pile' is listed above as part of the training mix, Databricks is exposed to the same legal challenge as other labs that used it: The Pile included 'Books3', a corpus of books allegedly assembled from pirated copies. The lawsuit against Databricks and its subsidiary MosaicML makes this a central claim.

Overview: Enterprise-Grade Open Source

DBRX is an openly released large language model developed by Databricks, an enterprise data and AI company, and published in March 2024. It uses a mixture-of-experts (MoE) architecture with 132 billion total parameters, of which 36 billion are active for any given input.

From a legal standpoint, DBRX is significant because it represents a major enterprise player releasing a foundational model under a custom license rather than a standard open-source one. This strategy aims to balance community collaboration with commercial self-preservation.

Key Models

Databricks released two versions of the model:

DBRX Base: The foundational, pre-trained model.
DBRX Instruct: The version fine-tuned on a proprietary dataset to follow instructions, making it more suitable for direct application.

Training Data & Copyright Risk

Databricks has provided some, but not complete, transparency into its training data.

Stated Sources: The model was trained on a 12-trillion-token dataset composed of both publicly available and proprietary licensed data.
“Carefully Curated”: The company claims the dataset was “carefully curated,” but has not released a full list of the sources. This leaves a degree of “black box” risk, as the public portion of the data could contain copyrighted material scraped from the web.
Lower Risk for Enterprises?: Databricks’ core business is selling data management services to large enterprises, which gives it a strong business incentive to source training data more diligently than some other developers. However, this is an inference based on its business model, not a guarantee.

Key Litigation

Databricks and its subsidiary, MosaicML, are defendants in a consolidated class-action lawsuit from authors alleging copyright infringement in the training of their models.

O’Nan v. Databricks & MosaicML (Consolidated)

Case Numbers: Includes O'Nan v. Databricks (3:24-cv-01451) and Makkai v. Databricks (3:24-cv-02653), consolidated before Judge Charles Breyer (N.D. Cal.).
Allegation: A consolidated class-action lawsuit by authors (including Stewart O’Nan, Abdi Nazemian, and Rebecca Makkai) alleging that Databricks and its subsidiary MosaicML used their books without permission to train their large language models. The case targets both MosaicML for the initial training and Databricks for vicarious infringement after acquiring the company.
Core Claims: Direct and vicarious copyright infringement.

International Litigation

Canada: Databricks is also named as a defendant in MacKinnon v. Databricks, a lawsuit filed in the Supreme Court of British Columbia.

The Databricks Open Model License

The most important legal aspect of DBRX is its custom license, which is distinct from standard open-source licenses like MIT or Apache 2.0.

Key Terms & Restrictions

Permissive Use: The license generally allows for broad use, modification, and distribution of the model.
Use-Based Restriction: The license includes a crucial use-based restriction. It prohibits any entity from using the DBRX model (or any derivative of it) to provide a service that competes with Databricks.
Defining “Competition”: This clause is legally interesting because the definition of a “competing” service could be subject to interpretation, creating a potential area of future legal dispute. It is an attempt to prevent a rival company (e.g., another cloud provider) from offering DBRX as a commercial, managed service.

Legal Implications

Not Truly “Open” in the Traditional Sense: The use-based restriction means DBRX is not “open source” as defined by organizations like the Open Source Initiative (OSI), whose Open Source Definition does not permit discrimination against persons, groups, or fields of endeavor. The license restricts who can use the model and for what purpose.
Enforceability: The enforceability of such a clause in a copyright license is a subject of legal debate. It attempts to use copyright law to achieve a commercial, anti-competitive goal.
Risk for Competitors: For any company that could be seen as a competitor to Databricks, using DBRX carries a direct legal risk of violating the license terms.
