AI Models

An in-depth look at major AI models, focusing on their architecture, capabilities, and the data they were trained on.

Under the EU AI Act, providers of general-purpose AI models must publish a summary of the content used to train them.

GPT

OpenAI • Multimodal (Text and Image) • 25 Lawsuits

Sources: Common Crawl, WebText2, ...

Last updated: 11/1/2025

LLaMA

Meta • Text Generation • 9 Lawsuits

Sources: Public Web Data (Common Crawl, etc.), Proprietary Meta Data (Facebook, Instagram), ...

Last updated: 11/1/2025

Gemini

Google • Multimodal (Text, Image, Video, Audio) • 6 Lawsuits

Sources: YouTube Content, Google Books Corpus, ...

Last updated: 11/1/2025

Claude

Anthropic • Text Generation • 4 Lawsuits

Sources: Public Web Data, Licensed Data (from Partners), ...

Last updated: 11/1/2025

Nemotron

NVIDIA • Text Generation • 4 Lawsuits

Sources: Proprietary Curated Dataset, Real-World Data (Public & Licensed), ...

Last updated: 11/1/2025

Stable Diffusion

Stability AI • Image Generation • 4 Lawsuits

Sources: LAION-5B (and its subsets, e.g., LAION-2B-en), Copyrighted Images from Stock Photo Sites & Art Communities, ...

Last updated: 11/1/2025

DBRX

Databricks • Text Generation • 3 Lawsuits

Sources: Public Web Data & Books, Proprietary Licensed Data, ...

Last updated: 11/1/2025

Phi

Microsoft • Text Generation • 2 Lawsuits

Sources: Filtered "textbook-quality" web data, Synthetic data

Last updated: 11/1/2025

Apple Intelligence

Apple • Multimodal • 1 Lawsuit

Sources: Licensed Content (News, Books, etc.), Web Data via AppleBot, ...

Last updated: 11/1/2025

Cohere

Cohere • Text Generation • 1 Lawsuit

Sources: Public Web Data, Proprietary Licensed Data, ...

Last updated: 11/1/2025

Gemma

Google • Text Generation • 1 Lawsuit

Sources: Gemini training data, Web data, ...

Last updated: 11/1/2025

DeepSeek

DeepSeek AI • Text Generation, Code Generation

Sources: Web and Code data

Last updated: 11/1/2025

Grok

xAI • Text Generation

Sources: Data from X (formerly Twitter), Web data

Last updated: 11/1/2025

Jurassic

AI21 Labs • Text Generation

Sources: Web data

Last updated: 11/1/2025

Mistral

Mistral AI • Text Generation

Sources: Web data, Common Crawl, ...

Last updated: 11/1/2025

Nova

Anonymous • Text Generation

Sources: Undisclosed web and proprietary data

Last updated: 11/1/2025

Qwen

Alibaba • Multimodal (Text and Image)

Sources: Web data (multilingual), Proprietary Alibaba data

Last updated: 11/1/2025

Titan

Amazon • Multimodal (Text and Image)

Sources: Publicly available data, Licensed third-party data

Last updated: 11/1/2025