Torrenting
In the context of AI litigation, “torrenting” is a euphemism. It refers to the use of vast, illegal archives of copyrighted material—books, images, and personal data—that have been aggregated and shared using peer-to-peer file-sharing technology. While the technology itself is neutral, its primary use case in the world of AI training data is piracy on an industrial scale. For a company to use this data is not an accident; it’s a conscious decision to deal in stolen goods.
Analogy: The Black Market for Data
Think of data as a commodity.
- The Legitimate Marketplace: You can buy data from licensed vendors. You can purchase stock photos from Getty Images, license news articles from the Associated Press, or buy books from a bookstore. The transactions are legal, documented, and traceable. In technical terms, this is a client-server download: you fetch the file directly from the vendor’s server.
- The Black Market: Alternatively, you can go to a shady, underground market where someone is giving everything away for free out of the back of a truck. You get a mix of everything: some legitimate public domain works, but mostly stolen goods, such as pirated movies, bootleg albums, and scanned copies of seemingly every book ever written. This is torrenting.
When an AI company uses a dataset like “Books3,” which contains nearly 200,000 pirated books, it is not an innocent bystander. It has gone to the black market. The BitTorrent protocol itself is designed for mass distribution. When a company uses a torrent client, its computer doesn’t just download the illegal files; by default, it simultaneously uploads (or “seeds”) pieces of those files to other users. The company is not just a customer of the black market; it becomes a participant in the distribution network.
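The download-and-upload behavior described above can be illustrated with a toy model. This is a minimal sketch, not an implementation of the real BitTorrent wire protocol: the `Peer` class, its counters, and the peer names are all hypothetical and exist only to show that a participant can serve pieces to others before its own download is complete.

```python
class Peer:
    """Toy swarm participant (hypothetical model, not real BitTorrent)."""

    def __init__(self, name, pieces=()):
        self.name = name
        self.pieces = set(pieces)  # indices of the pieces this peer holds
        self.uploaded = 0          # pieces served to other peers
        self.downloaded = 0        # pieces received from other peers

    def fetch_from(self, other, piece):
        """Copy one piece from `other` if it has the piece and we lack it."""
        if piece in other.pieces and piece not in self.pieces:
            self.pieces.add(piece)
            self.downloaded += 1
            other.uploaded += 1
            return True
        return False


# A four-piece file: one peer holds all of it, two peers start with nothing.
seeder = Peer("seeder", pieces=range(4))
company = Peer("company")
stranger = Peer("stranger")

company.fetch_from(seeder, 0)
company.fetch_from(seeder, 1)
# The company's download is still incomplete, yet it already serves
# piece 0 to another participant in the swarm.
stranger.fetch_from(company, 0)
company.fetch_from(seeder, 2)

print(f"company downloaded={company.downloaded} uploaded={company.uploaded}")
```

In this sketch the company has uploaded a piece while still missing part of the file, which is the point the paragraph makes about participation in distribution.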
The Legal and Technical Flaws
- Willful Infringement: The single most important aspect of torrenting for a litigator is the evidence of willfulness. Datasets like Books3, which was included in The Pile, are infamous. Their origins in pirate sources, such as the Bibliotik tracker behind Books3 and “shadow libraries” like Library Genesis, are well documented. For a multibillion-dollar corporation to claim it was unaware it was using pirated material is not a credible defense. It demonstrates a willful blindness to the law that can lead to findings of willful infringement and, consequently, massively increased statutory damages (under U.S. law, up to $150,000 per work rather than the ordinary $30,000 maximum).
- It’s Not “Scraping”: Companies sometimes try to muddy the waters by conflating torrenting with web scraping. They are not the same. Scraping involves copying data from public websites, a practice with a (currently contested) fair use defense. Torrenting involves downloading curated, pre-packaged archives of illegal material. The legal argument against it is much simpler and more damning: it is the digital equivalent of receiving stolen property.
- The IP Address is the Evidence: Unlike a direct download from a server, a BitTorrent transfer by default exposes the IP address of every participant in the “swarm” to every other participant. This is the primary mechanism copyright holders use to track and sue infringers. A company that uses torrents to download datasets is broadcasting its IP address as a participant in an infringing network, creating a direct, discoverable trail of evidence.
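To make the IP exposure concrete, here is a minimal sketch of how a peer list can be read. It assumes the widely used "compact" tracker response format, in which each IPv4 peer is encoded as 6 bytes (a 4-byte address followed by a 2-byte big-endian port). The function name `decode_compact_peers` and the sample data are hypothetical; the addresses come from ranges reserved for documentation.

```python
import ipaddress
import struct


def decode_compact_peers(blob: bytes) -> list[tuple[str, int]]:
    """Decode a compact IPv4 peer list: 6 bytes per peer (IP, then port)."""
    if len(blob) % 6 != 0:
        raise ValueError("compact peer list length must be a multiple of 6")
    peers = []
    for offset in range(0, len(blob), 6):
        # "!4sH" = network byte order: 4 raw address bytes, unsigned 16-bit port
        ip_raw, port = struct.unpack_from("!4sH", blob, offset)
        peers.append((str(ipaddress.IPv4Address(ip_raw)), port))
    return peers


# Hypothetical two-peer swarm built from documentation-only addresses.
sample = (
    bytes([192, 0, 2, 10]) + (6881).to_bytes(2, "big")
    + bytes([198, 51, 100, 7]) + (51413).to_bytes(2, "big")
)

for ip, port in decode_compact_peers(sample):
    print(f"{ip}:{port}")
```

Any participant that asks for peers receives a list of this kind, so collecting the addresses active in a swarm requires nothing more than joining it and decoding the response.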
The use of torrented datasets is the most egregious and legally perilous method of data acquisition for AI training. It bypasses the complex legal questions of fair use and web scraping and enters the much clearer territory of mass-scale, willful copyright infringement.