AI Training Datasets

An overview of prominent datasets used to train large-scale AI models, listing each dataset's contents, its builder where known, and a rating of the copyright and licensing concern it raises.

| Dataset | Concern | Description | Built by | Updated |
| --- | --- | --- | --- | --- |
| Books3 | 10/10 | Copyrighted books sourced from shadow libraries | | 12/1/2024 |
| LAION-5B | 10/10 | 5.85 billion image-text pairs scraped from the web | | 12/1/2024 |
| LibGen | 10/10 | Library Genesis, a shadow library of pirated books and articles | | 10/30/2024 |
| The Pile | 9/10 | An 825 GiB English text corpus composed of 22 smaller datasets | EleutherAI | 10/30/2024 |
| BookCorpus | 8/10 | Unpublished books scraped from Smashwords | University of Toronto / MIT researchers | 10/30/2024 |
| Spotify Podcast Dataset | 8/10 | Over 100,000 podcast episodes and their transcripts | Spotify | 10/30/2024 |
| YouTube | 8/10 | Billions of videos and their transcripts | Google | 10/30/2024 |
| Common Crawl | 7/10 | Billions of web pages crawled from the public internet | | 12/1/2024 |
| GitHub Code | 7/10 | Billions of lines of public code from GitHub | Microsoft | 10/30/2024 |
| OSCAR | 7/10 | Open Super-large Crawled Aggregated coRpus, a multilingual dataset derived from Common Crawl | | 10/30/2024 |
| ROOTS | 7/10 | The BigScience ROOTS Corpus, a large, documented, multilingual dataset | | 10/30/2024 |
| WebText / OpenWebText | 7/10 | High-quality text scraped from outbound Reddit links | OpenAI (WebText); open community recreation (OpenWebText) | 10/30/2024 |
| C4 | 6/10 | A colossal, cleaned version of Common Crawl | Google | 10/30/2024 |
| DataComp | 6/10 | 12.8 billion image-text pairs assembled for dataset research | | 10/30/2024 |
| FFHQ | 6/10 | Flickr-Faces-HQ, 70,000 high-quality face images | NVIDIA | 10/30/2024 |
| PiLiMi | 6/10 | Pirate Library Mirror, a major mirror of shadow libraries | | 10/30/2024 |
| RefinedWeb | 5/10 | Filtered web data used to train the Falcon models | TII | 10/30/2024 |
| ArXiv | 3/10 | Pre-print scientific papers | | 10/30/2024 |
| Wikipedia | 2/10 | A corpus of all Wikipedia articles | | 10/30/2024 |
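Several of the openly hosted corpora above can be inspected without downloading them in full. As a minimal sketch, assuming the Hugging Face `datasets` library and its `allenai/c4` mirror of C4, streaming a few records looks like this:

```python
from datasets import load_dataset

# Stream C4's English split from the Hugging Face Hub; streaming=True
# iterates over records lazily instead of downloading the full corpus
# (hundreds of GiB) to disk.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Print the source URL and the first 200 characters of a few documents.
for i, example in enumerate(c4):
    print(example["url"])
    print(example["text"][:200])
    if i >= 2:
        break
```

The same pattern applies to other corpora with public mirrors on the Hub (for example, OSCAR or Wikipedia dumps); the shadow-library sources (Books3, LibGen, PiLiMi) have no legitimate distribution channel.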