Training Data Sources: Web and Code data
Status: Reported
Citation: DeepSeek AI documentation
Overview: A Chinese Open-Source Competitor
DeepSeek AI is a China-based technology company that develops and releases open-source large language models. The company and its models are notable for their strong focus on code generation and for operating within a different legal and regulatory environment than their Western counterparts.
Key Models
DeepSeek has released several models with a clear emphasis on programming and technical capabilities:
- DeepSeek-V2: A mixture-of-experts (MoE) foundation model.
- DeepSeek-Coder: A series of models pre-trained on a corpus that DeepSeek reports as 2 trillion tokens, roughly 87% code and 13% natural language, designed for code completion, debugging, and generation.
Training Data & Copyright Risk (Code)
The primary legal questions surrounding DeepSeek models involve the composition of their training data, especially the code.
- Massive, Largely Undisclosed Data: The models are trained on trillions of tokens of text and, crucially, code scraped from the internet. DeepSeek names public GitHub repositories as a code source for DeepSeek-Coder, but the individual repositories, and the licenses they carry, are not itemized.
- Risk of Copyleft “Contamination”: A significant, unresolved legal question is whether training a model on code released under “copyleft” licenses (such as the GNU General Public License, or GPL) obliges the developer to release the resulting model under the same terms.
- GPL Contamination of Output: If the model memorizes and reproduces snippets of GPL-licensed code, an application that incorporates that output could be argued to be a “derivative work” of the original code. If that argument succeeded, distributing the application could require releasing it in its entirety under the GPL. This is a major legal risk for companies building proprietary software with these models.
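The memorization risk above can be made concrete with a rough screening check. The following is a minimal sketch, assuming a team keeps reference copies of GPL-licensed files it wants to screen against; the function names (`tokenize`, `ngrams`, `overlap_ratio`) and the n-gram length of 8 are hypothetical choices made for illustration, not part of any DeepSeek tooling. It measures how much of a generated snippet repeats a reference file token for token. A high ratio is a signal for human review, not a legal finding.

```python
import re


def tokenize(code: str) -> list[str]:
    # Split into identifiers, digit runs, and single punctuation characters,
    # so that whitespace and indentation differences are ignored.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)


def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    # All contiguous runs of n tokens; empty if there are fewer than n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(generated: str, reference: str, n: int = 8) -> float:
    """Fraction of the generated snippet's token n-grams found in the reference."""
    gen = ngrams(tokenize(generated), n)
    if not gen:
        return 0.0
    ref = ngrams(tokenize(reference), n)
    return len(gen & ref) / len(gen)


if __name__ == "__main__":
    # Hypothetical reference file and model output, differing only in layout.
    gpl_reference = "int add(int a, int b) {\n    return a + b;\n}"
    model_output = "int add(int a,int b){ return a + b; }"
    print(overlap_ratio(model_output, gpl_reference))
```

Exact n-gram matching only catches near-verbatim reproduction. It misses copied code with renamed identifiers, and it will flag short idioms that many unrelated projects share, so the threshold for escalation is a policy decision rather than something the code can settle.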
The DeepSeek Model License
DeepSeek models are released under a custom license that is not a standard, permissive open-source license.
Key Terms & Restrictions
- Research and Commercial Use: The DeepSeek Model License, as published alongside DeepSeek-Coder and DeepSeek-V2, grants a royalty-free license to use, modify, and distribute the models, and DeepSeek states that both model families support commercial use. No separate paid license is required, so this is not a dual-licensing arrangement. Terms can differ between releases, so the license file shipped with a specific model is the authoritative source.
- Use-Based Restrictions: The license attaches a list of prohibited uses, and these restrictions must be passed on to downstream recipients of the model and its derivatives.
Legal Implications
- “Source-Available,” Not “Open Source”: Because of the use-based restrictions, the license does not qualify as a standard open-source license, and the models are more accurately described as “source-available” rather than “open source” in the traditional, permissive sense.
- Geopolitical Context: Because DeepSeek is a Chinese entity, both its ability to enforce its license abroad and the ability of Western rights holders to bring copyright claims against it depend on cross-border litigation. International copyright disputes can be more complex to litigate than domestic ones.