Training Data Sources
-
Public Web Data
Status: Reported
Citation: Cohere's technical documentation and public statements.
Like its competitors, Cohere trains its foundation models on a large, undisclosed corpus of data scraped from the public internet. Because that corpus may include copyrighted material, this creates a latent risk of copyright infringement.
-
Proprietary Licensed Data
Status: Reported
Citation: Cohere's public statements and partnership agreements.
Cohere reports using proprietary, licensed datasets to improve the quality of its models. As an enterprise-focused company, access to high-quality, specialized data is a key part of its value proposition. The sources are not public.
-
Customer Data for Fine-Tuning
Status: Confirmed
Citation: Cohere's private deployment and data security documentation.
A core part of Cohere's business model is allowing customers to fine-tune models on their own proprietary data within a secure, private environment (e.g., the customer's own cloud). Cohere contractually guarantees that this data is not used to train its generally available models.
-
Copyrighted News Articles
Status: Alleged
Citation: Advance Local Media LLC v. Cohere Inc. lawsuit.
A consortium of news publishers is suing Cohere, alleging that their copyrighted articles were used to train Cohere's enterprise-focused models without permission. This lawsuit directly challenges Cohere's 'black box' of training data.
Overview: Enterprise-Focused & Private Deployments
Cohere is a Canadian AI company that builds large language models with a clear focus on enterprise customers. Its strategy revolves around providing models that can be deployed in a private, secure environment (e.g., a customer’s own cloud or on-premises), which is a key differentiator for businesses concerned with data privacy. Its models are proprietary and are accessed via API or through these private deployments.
Key Models & Capabilities
Cohere offers a suite of models tailored for enterprise applications:
- Command Series (Command, Command R, Command R+): These are the flagship text generation models, designed for a wide range of tasks from copywriting to complex summarization.
- Rerank: This is a specialized model designed to improve the accuracy of enterprise search systems. Rather than generating new text, it re-orders a list of existing documents for relevance. From a copyright perspective, this model is likely lower risk as it does not create new content.
- Embed: A model for generating vector embeddings from text, used to power semantic search and other NLP applications.
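To make the Embed description concrete, the sketch below mimics embedding-based semantic search with a toy stand-in. It is not Cohere's model or API: `toy_embed` is a hypothetical bag-of-words function invented for illustration, whereas a real embedding model returns dense learned vectors. The part that carries over is the ranking step, which compares the query vector to each document vector by cosine similarity.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def semantic_search(query, documents):
    """Return (score, document) pairs, most similar to the query first."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

docs = [
    "quarterly revenue report for the sales team",
    "office holiday party schedule",
    "sales revenue forecast and quarterly targets",
]
results = semantic_search("quarterly sales revenue", docs)
for score, doc in results:
    print(f"{score:.3f}  {doc}")
```

Note that nothing in this flow generates text: the search only scores and orders documents that already exist.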
Training Data & Copyright Risk
Cohere’s approach to training data presents a familiar “black box” problem, but its business model offers a different way of managing the associated risks.
- Stated Sources: Like its competitors, Cohere reports using a mix of publicly available data and licensed data. Details about these licensed sources are not public.
- Lack of Transparency: The company has not disclosed the specific datasets used for its core model training. This creates a latent risk of copyright infringement, as the models may have been trained on copyrighted web content.
Legal & Liability Considerations
Cohere’s enterprise focus shapes its approach to liability and risk management.
Private Deployment as a Risk Mitigator
- Cohere allows customers to deploy models within their own virtual private cloud (VPC) or even on-premise. This means a customer’s proprietary data, used for fine-tuning, never leaves their secure environment and is not used to train Cohere’s general-purpose models.
- This “data privacy” guarantee is a core part of their sales pitch to enterprises. It contractually separates a customer’s sensitive data from Cohere’s foundational training data, reducing the risk of data leakage.
No Public IP Indemnity & Active Litigation
- Unlike Amazon, Cohere does not publicly advertise a broad IP indemnity for copyright claims. This is notable because Cohere is a defendant in a publisher lawsuit, Advance Local Media LLC v. Cohere Inc. (S.D.N.Y., Judge McMahon).
- The absence of a public indemnity suggests that liability for model outputs is handled case by case through individual enterprise sales contracts. A large enterprise customer may be able to negotiate an indemnity clause, but it is not a blanket offer.
The “Rerank” Model Exception
- The Rerank model is legally interesting because it operates on a customer’s own data. It simply re-orders a list of documents provided to it. This makes it less likely to be the source of a direct copyright infringement claim against Cohere, as it is not generating new, potentially infringing content.
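The property described above, that a reranker returns only a re-ordering of what it was given, can be illustrated with a minimal sketch. This is a toy under stated assumptions, not Cohere's Rerank model or SDK: `toy_relevance` and `rerank` are hypothetical names, and the word-overlap score stands in for a learned relevance model.

```python
def toy_relevance(query, document):
    """Hypothetical scorer: fraction of query words present in the document."""
    q_words = set(query.lower().split())
    d_words = set(document.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0

def rerank(query, documents, top_n=None):
    """Return (original_index, document) pairs, most relevant first.

    The output is built only from the input list: no text is created or altered.
    """
    order = sorted(
        range(len(documents)),
        key=lambda i: toy_relevance(query, documents[i]),
        reverse=True,
    )
    if top_n is not None:
        order = order[:top_n]
    return [(i, documents[i]) for i in order]

customer_docs = [
    "Vacation policy for contractors",
    "Expense policy for travel and meals",
    "Travel booking guide",
]
ranked = rerank("travel expense policy", customer_docs)
for index, text in ranked:
    print(index, text)
```

Every string in `ranked` is one of the customer-supplied documents, unchanged; only the order differs. The sketch shows that mechanical property only and does not settle the legal question discussed above.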
Key Litigation
While Cohere has maintained a strong enterprise focus, it has been named as a defendant in a lawsuit from a major news publisher consortium.
Advance Local Media LLC v. Cohere Inc.
- Case Number: 1:25-cv-01305 (S.D.N.Y.)
- Filing Date: February 13, 2025
- Allegation: A group of publishers, including Advance (owner of local news sites), Condé Nast, and The Atlantic Monthly, filed a lawsuit against Cohere. They allege that Cohere infringed their copyrights by using their articles to train the models behind its enterprise-focused AI platform.
- Core Claims: The suit includes claims for direct and secondary copyright infringement, as well as trademark infringement and false designation of origin under the Lanham Act. The case is before Judge Colleen McMahon.