RAG (Retrieval-Augmented Generation)
Retrieval-Augmented Generation (RAG) is an architecture that AI companies increasingly use to make their models seem more reliable and to sidestep some of the legal arguments about pre-training. Instead of relying solely on its internal, memorized “knowledge,” a RAG system is given a library of documents it can pull from in real time to answer questions. Far from solving the copyright problem, RAG often makes the infringement more direct and provable.
Analogy: The Plagiarist’s Open-Book Test
Imagine two students taking a history test.
- The Standard AI Model: This student has read a hundred history books but isn’t allowed to bring them to the test. They answer from memory. Some of their answers might be verbatim quotes from the books (regurgitation), and some might be completely made up (hallucination).
- The RAG Model: This student is allowed to bring the entire library of copyrighted history books to the test. When the test asks, “Describe the causes of the Peloponnesian War,” this student doesn’t answer from memory. They open a modern history of the war, find the relevant chapter, and then copy the key sentences, slightly rephrasing them to make it look like their own work.
This is what a RAG system does. It is not “searching” in the way Google does: a search engine hands you a link and sends you to the source. A RAG system retrieves the source material, ingests it, and generates a new, derivative answer based directly on the text it just read.
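The retrieve, ingest, generate sequence described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any vendor's implementation: the two-document `LIBRARY`, the function names, and the word-count similarity (a stand-in for a learned embedding model) are all hypothetical, and the final call to a language model is left out.

```python
import math
import re
from collections import Counter

# Hypothetical two-document "library"; a real system indexes far more text.
LIBRARY = {
    "history_book_ch3": "The war began when Athens and Sparta clashed over alliances and trade.",
    "cooking_guide_ch1": "Whisk the eggs and fold them gently into the flour.",
}

def bag_of_words(text: str) -> Counter:
    """Lowercased word counts, used here as a stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str) -> tuple[str, str]:
    """Step 1: find the library passage most similar to the question."""
    q = bag_of_words(question)
    best_id = max(LIBRARY, key=lambda doc_id: cosine(q, bag_of_words(LIBRARY[doc_id])))
    return best_id, LIBRARY[best_id]

def build_prompt(question: str) -> str:
    """Step 2: copy the retrieved passage into the text the model will read."""
    source_id, passage = retrieve(question)
    return f"Source ({source_id}): {passage}\nQuestion: {question}\nAnswer using the source above."

# Step 3 (not shown): the prompt is sent to a language model, which writes the answer.
prompt = build_prompt("Why did the war between Athens and Sparta begin?")
print(prompt)
```

Note that in this sketch the retrieved passage appears in the prompt word for word; that copy is made every time a question is asked.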
The Legal and Technical Flaws
RAG is not a shield against copyright claims; it is a different, and in many ways more blatant, form of potential infringement.
- On-Demand Infringement: The core legal issue with standard models is the copying that happens during pre-training. RAG adds a new infringing act: copying that happens at the moment of generation. When the user asks a question, the system finds a chunk of a copyrighted book or news article, makes a copy of that chunk, and processes it to create the answer. This is not “fair use” research; it is the direct use of a copyrighted work to provide a competing service.
- The Library Itself is an Infringing Database: To create a RAG system, a company must first build the “library” of documents. This library is almost always a vector database, which stores embeddings of the source texts, typically alongside the text chunks themselves. As we’ve established, these embeddings are arguably infringing copies themselves. So, the entire foundation of a RAG system is often a massive database of infringing copies, created so that further infringement can be performed on demand.
- The “Citation” Defense is a Smokescreen: Many RAG systems will provide a citation or a link to the source document alongside the answer. Companies argue this is just like a search engine. It’s not. The harm is not just in the final answer; it’s in the system’s unauthorized use of the content to generate that answer. Providing a link back to the article you just plagiarized from doesn’t excuse the plagiarism. Furthermore, the model’s answer competes directly with the source, potentially harming the market for the original work. Why click the link to the New York Times article when the AI has already given you the summary?
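To make the “library” from the second point concrete, here is a rough sketch of what building such an index can look like. Everything in it is an assumption for illustration: the `Record` fields, the chunk size, the placeholder article, and especially `toy_embed`, which hashes words into buckets as a stand-in for a learned embedding model. Real vector databases differ in detail, but many store the chunk text (or a pointer to it) next to each vector, as this one does.

```python
import hashlib
from dataclasses import dataclass

DIMENSIONS = 8  # Toy size; real embeddings are far larger.

@dataclass
class Record:
    source: str          # hypothetical identifier for the original document
    chunk: str           # the original text, stored verbatim
    vector: list[float]  # numeric representation used for similarity search

def toy_embed(text: str) -> list[float]:
    """Hash each word into one of DIMENSIONS buckets (stand-in for a learned model)."""
    vector = [0.0] * DIMENSIONS
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode()).hexdigest(), 16) % DIMENSIONS
        vector[bucket] += 1.0
    return vector

def index_document(source: str, text: str, chunk_words: int = 12) -> list[Record]:
    """Split a document into fixed-size word chunks and store each with its vector."""
    words = text.split()
    records = []
    for start in range(0, len(words), chunk_words):
        chunk = " ".join(words[start:start + chunk_words])
        records.append(Record(source, chunk, toy_embed(chunk)))
    return records

# Placeholder text standing in for a copyrighted article.
article = (
    "The council met at dawn to debate the treaty. Delegates from both cities "
    "argued for hours before agreeing to a truce that few expected to last."
)
records = index_document("example_article", article)

# Joining the stored chunks in order gives back the full article text.
rebuilt = " ".join(record.chunk for record in records)
print(len(records), rebuilt == article)
```

In this sketch the index holds the complete text of the article, just cut into pieces, in addition to the vectors derived from it.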
RAG systems provide a much clearer and more direct evidence trail for litigation. You can trace a specific, infringing output directly back to the source document that the model copied from at the moment of generation. Far from a defense, RAG is often a confession.
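One way to picture that evidence trail is a simple overlap check between an answer and the passages the system retrieved for it. The sketch below is an illustration only, and it is not a legal test for infringement: the `retrieval_log`, the document names, the sample texts, and the choice of four-word sequences are all hypothetical.

```python
import re

def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All runs of n consecutive lowercased words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(output: str, source: str, n: int = 4) -> float:
    """Fraction of the output's n-word sequences that also appear in the source."""
    out = word_ngrams(output, n)
    if not out:
        return 0.0
    return len(out & word_ngrams(source, n)) / len(out)

# Hypothetical log of the passages retrieved while answering one question.
retrieval_log = {
    "news_article_17": "The city council voted on Tuesday to close the old bridge for repairs.",
    "blog_post_4": "A recipe for lentil soup with cumin and lemon.",
}
answer = "Officials said the city council voted on Tuesday to close the old bridge."

scores = {doc_id: overlap_ratio(answer, text) for doc_id, text in retrieval_log.items()}
closest_source = max(scores, key=scores.get)
print(closest_source, scores)
```

Because a RAG system knows which passages it retrieved for each answer, a comparison like this can start from a short list of candidate sources rather than from the entire training corpus.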