Training Data Sources

  • Web data: Reported (citation: Mistral AI disclosures and technical blog)
  • Common Crawl: Confirmed
  • GitHub: Confirmed

Overview

Mistral AI is a French AI company known for its high-performance open-source and commercial models. Its Mixtral models use a sparse Mixture-of-Experts (MoE) architecture, which activates only a subset of parameters for each token and so supports very large yet comparatively efficient models. From a legal perspective, its open-source releases and limited public disclosures about training data are the notable points of interest.
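Mistral has not published the implementation behind these models; the sketch below is only a generic illustration of sparse top-2 expert routing, the general pattern used in Mixtral. The layer sizes, feed-forward shape, and all names are assumptions for illustration, not details taken from Mistral's code.

```python
# Minimal sketch of sparse Mixture-of-Experts routing with top-2 gating.
# Illustrative only: dimensions, expert structure, and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.gate(x)                           # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # route each token to its top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                   # tokens routed to expert e at rank k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

# Only top_k of n_experts run per token, so per-token compute stays small even
# though total parameter count grows with the number of experts.
moe = SparseMoE()
print(moe(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```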

Key Models & Timeline

  • Mistral 7B (Sept 2023): An early, efficient Apache 2.0 licensed model. Its permissive license makes it a common base for other models.
  • Mixtral 8x7B (Dec 2023): A sparse MoE model, also released under the Apache 2.0 license. This model was the subject of a key study on reproduction of copyrighted content (the Patronus AI benchmark discussed below).
  • Mistral Large (Feb 2024): The company’s flagship commercial, closed-source model.
  • Mixtral 8x22B (Apr 2024): A larger, more powerful MoE model, likewise released under the Apache 2.0 license.

Training Data & Transparency

Mistral AI’s training data composition presents a significant area of legal inquiry.

  • Stated Sources: The company reports using a mix of public web data (Common Crawl) and code repositories (GitHub).
  • Lack of Transparency: Full, detailed breakdowns of training datasets are not provided, which creates ambiguity about the specific sources and volume of copyrighted material ingested by the models.

Copyright & Litigation Findings

This section focuses on specific findings and events relevant to copyright litigation.

Patronus AI Study (2024)

A study by Patronus AI benchmarked the Mixtral-8x7B-Instruct-v0.1 model’s rate of regurgitating copyrighted content; an illustrative sketch of this kind of verbatim-overlap check appears after the findings below.

  • Finding: The model produced verbatim copyrighted content in response to 22% of tested prompts.
  • Comparison: This rate was higher than that of Meta’s Llama 2 (10%) and Anthropic’s Claude 2.1 (8%).
  • Context: It was still significantly lower than OpenAI’s GPT-4 (44%), the highest rate in the study.
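Patronus AI has not released the exact scoring code behind these figures. Purely as an illustration of what a verbatim-regurgitation check can look like, the sketch below flags outputs that reproduce a long contiguous span of a reference text; the threshold and matching strategy are assumptions, not the study’s methodology.

```python
# Illustrative only: a simple verbatim-overlap check between a model output and
# a copyrighted reference text. This is NOT Patronus AI's published methodology;
# the 160-character threshold and the matching strategy are assumptions.
from difflib import SequenceMatcher

def longest_verbatim_span(output: str, reference: str) -> int:
    """Length in characters of the longest contiguous block shared verbatim."""
    matcher = SequenceMatcher(None, output, reference, autojunk=False)
    return matcher.find_longest_match(0, len(output), 0, len(reference)).size

def is_regurgitation(output: str, reference: str, min_chars: int = 160) -> bool:
    # Flag outputs reproducing at least `min_chars` consecutive characters
    # of the reference text.
    return longest_verbatim_span(output, reference) >= min_chars

# A per-model "regurgitation rate" is then the fraction of test prompts whose
# outputs are flagged by a check of this kind.
```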

Copyright Policy

Mistral AI has a public-facing copyright policy with several key assertions:

  • Opt-Out Compliance: Claims to respect web crawling opt-out standards such as robots.txt (see the sketch after this list).
  • No Circumvention: Asserts that it does not bypass technical measures designed to protect copyrighted works.
  • Takedown Process: Provides a formal mechanism for rights holders to submit infringement complaints.
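The policy as summarized here does not specify the crawler’s user-agent string or implementation. As a generic sketch of what “respecting robots.txt” means in practice, a compliant crawler consults a site’s robots.txt before fetching any URL, for example with Python’s standard urllib.robotparser; the user-agent and URL below are placeholders, not Mistral identifiers.

```python
# Generic sketch of robots.txt-respecting fetch logic. The user-agent string
# and URL are placeholders, not Mistral's actual crawler identifiers.
from urllib import robotparser
from urllib.parse import urlsplit

def allowed_to_fetch(url: str, user_agent: str = "ExampleCrawler") -> bool:
    parts = urlsplit(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # download and parse the site's robots.txt
    return rp.can_fetch(user_agent, url)

# A crawler that honors the opt-out standard simply skips any URL for which
# this check returns False.
if __name__ == "__main__":
    print(allowed_to_fetch("https://example.com/some/page"))
```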

Model Safeguards & Liability

  • Open-Source Risks: The initial Mistral 7B model was released without typical content moderation safeguards. This led to criticism that the model could be easily prompted to generate harmful or illegal content.
  • Design Philosophy: Mistral has historically favored releasing “raw” or less restricted models, prioritizing performance over built-in safety mechanisms. This philosophy could be a factor in arguments concerning foreseeable misuse and the developer’s responsibility for a model’s outputs.