AI Models
An in-depth look at major AI models, focusing on their architecture, capabilities, and the data they were trained on.
Under the EU AI Act, providers of general-purpose AI models must publish a summary of the content used to train them.
GPT
OpenAI • Multimodal (Text and Image) • 25 Lawsuits
Sources: Common Crawl, WebText2, ...
Last updated: 11/1/2025
LLaMA
Meta • Text Generation • 9 Lawsuits
Sources: Public Web Data (Common Crawl, etc.), Proprietary Meta Data (Facebook, Instagram), ...
Last updated: 11/1/2025
Gemini
Google • Multimodal (Text, Image, Video, Audio) • 6 Lawsuits
Sources: YouTube Content, Google Books Corpus, ...
Last updated: 11/1/2025
Claude
Anthropic • Text Generation • 4 Lawsuits
Sources: Public Web Data, Licensed Data (from Partners), ...
Last updated: 11/1/2025
Nemotron
NVIDIA • Text Generation • 4 Lawsuits
Sources: Proprietary Curated Dataset, Real-World Data (Public & Licensed), ...
Last updated: 11/1/2025
Stable Diffusion
Stability AI • Image Generation • 4 Lawsuits
Sources: LAION-5B (and its subsets, e.g., LAION-2B-en), Copyrighted Images from Stock Photo Sites & Art Communities, ...
Last updated: 11/1/2025
DBRX
Databricks • Text Generation • 3 Lawsuits
Sources: Public Web Data & Books, Proprietary Licensed Data, ...
Last updated: 11/1/2025
Phi
Microsoft • Text Generation • 2 Lawsuits
Sources: Filtered "textbook-quality" web data, Synthetic data
Last updated: 11/1/2025
Apple Intelligence
Apple • Multimodal • 1 Lawsuit
Sources: Licensed Content (News, Books, etc.), Web Data via AppleBot, ...
Last updated: 11/1/2025
Cohere
Cohere • Text Generation • 1 Lawsuit
Sources: Public Web Data, Proprietary Licensed Data, ...
Last updated: 11/1/2025
Gemma
Google • Text Generation • 1 Lawsuit
Sources: Gemini training data, Web data, ...
Last updated: 11/1/2025
DeepSeek
DeepSeek AI • Text Generation, Code Generation
Sources: Web and code data
Last updated: 11/1/2025
Grok
xAI • Text Generation
Sources: Data from X (formerly Twitter), Web data
Last updated: 11/1/2025
Jurassic
AI21 Labs • Text Generation
Sources: Web data
Last updated: 11/1/2025
Mistral
Mistral AI • Text Generation
Sources: Web data, Common Crawl, ...
Last updated: 11/1/2025
Nova
Anonymous • Text Generation
Sources: Undisclosed web and proprietary data
Last updated: 11/1/2025
Qwen
Alibaba • Multimodal (Text and Image)
Sources: Web data (multilingual), Proprietary Alibaba data
Last updated: 11/1/2025
Titan
Amazon • Multimodal (Text and Image)
Sources: Publicly available data, Licensed third-party data
Last updated: 11/1/2025