Embedding
An “embedding” is a machine’s translation of content (a word, a sentence, an image) into a fixed-length list of numbers, known as a vector. AI companies have a vested interest in portraying these embeddings as abstract, magical “concepts.” They are not. An embedding is best understood as a highly detailed fingerprint of the source data, computed directly from that data. It is a new type of copy, one made for machines instead of humans.
Analogy: A High-Dimensional Barcode
Think of a simple barcode on a product at the grocery store.
- The Barcode: It’s a series of black and white lines. It is not human-readable. It does not “contain” the product name or price in a way you can see.
- The Representation: The barcode is a direct, mathematical representation of a product number. If you change the product, the barcode changes.
- The Reversibility: A scanner can read the barcode and instantly retrieve the original product information from a database. The barcode is a key that unlocks the original data.
Now, imagine a barcode that is not 12 digits but 1,536 numbers long, each a high-precision decimal value. Instead of representing a product number, it represents a paragraph from a book, or a photograph. This is an embedding: a high-dimensional numerical representation computed directly from the data it was created from. In practice, two different passages almost never produce the same embedding. You may not be able to perfectly “un-scan” it to get the original paragraph, but inversion techniques can often reconstruct short passages closely. And when the original is stored alongside its embedding, the embedding works as a key: a similarity search on it will typically return that exact original as the top match.
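The barcode analogy can be made concrete with a small sketch. Everything here is illustrative: `toy_embed` is a hypothetical stand-in (a hashed bag of words) and not a real embedding model, the 256-number length is chosen only to keep the example small, and the two passages are made up. What it shows is the lookup pattern: text goes in, a fixed-length list of numbers comes out, and that list can be used to pull the stored original back out of a database.

```python
import hashlib
import math

DIM = 256  # toy size; the text above mentions vectors of 1,536 numbers


def toy_embed(text, dim=DIM):
    """Hypothetical stand-in for an embedding model (hashed bag of words)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        word = word.strip(".,;:!?\"'")
        if not word:
            continue
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def similarity(a, b):
    """Cosine similarity; inputs are unit length, so a dot product suffices."""
    return sum(x * y for x, y in zip(a, b))


# Made-up passages standing in for paragraphs of a book.
paragraphs = [
    "The lighthouse keeper climbed the spiral stairs every evening.",
    "The baker kneaded dough before sunrise each morning.",
]

# The "database": each embedding is stored next to the text it came from.
database = [(toy_embed(p), p) for p in paragraphs]


def find_original(vector):
    """Return the stored text whose embedding is closest to `vector`."""
    return max(database, key=lambda row: similarity(row[0], vector))[1]


fingerprint = toy_embed(paragraphs[0])
print(len(fingerprint))            # number of values in the vector
print(find_original(fingerprint))  # the stored passage that best matches
```

A production system would swap `toy_embed` for a trained model and the list scan for a vector index, but the relationship between the vector and the stored original is the same.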
The Legal and Technical Flaws
The debate over embeddings is a debate over the definition of a “copy” in the digital age.
- The “It’s Just Math” Defense is a Red Herring: AI companies argue that an embedding is just a list of numbers and therefore can’t be a copy of a creative work. This is a specious argument. A digital photo is “just a list of numbers” (pixel values). An MP3 file is “just a list of numbers” encoding compressed audio. An embedding is no different. It is a mathematical format for storing information derived directly from a copyrighted work.
- Copyright Does Not Require Human Readability: A common defense is that embeddings are not infringing because a human can’t read them. This is legally irrelevant. A computer program’s source code is protected by copyright, and its compiled, machine-readable binary executable is also treated as a copy and protected. Embeddings are the machine-readable version of creative works.
- Embeddings as a Tool for Infringement: This is most obvious in Retrieval-Augmented Generation (RAG) systems. A company will take a copyrighted library of books, convert every paragraph into an embedding, and store the embeddings in a vector database alongside the original text. When a user asks a question, the system creates an embedding of the question, finds the most similar paragraph-embeddings in its database, and then passes the original text of those paragraphs to a language model to generate the answer. The embeddings are being used as a high-speed index to commit copyright infringement on demand.
The creation and storage of embeddings from copyrighted works without a license is a form of mass-scale infringement. It is the creation of a derivative, machine-readable database that allows a company to exploit the value of the original works without compensation. For a litigator, the key is to reject the “abstract idea” narrative and frame embeddings for what they are: functional, mathematical copies of protected works.