
DeepSeek AI Model Suspected of Training on Google Gemini Data

Chinese AI lab DeepSeek has released an updated reasoning model, R1-0528, which some researchers believe was trained in part on outputs from Google's Gemini models. The evidence includes word-choice similarities and "thought traces" that resemble Gemini's style. The episode renews concerns about distillation practices and data-sourcing ethics just as major labs tighten security around their models.

Published June 3, 2025 at 01:11 PM EDT in Artificial Intelligence (AI)

Last week, Chinese AI lab DeepSeek unveiled an updated version of its R1 reasoning model, dubbed R1-0528, which has demonstrated strong performance on various math and coding benchmarks. However, the company has not disclosed the sources of the training data used for this model, sparking speculation within the AI research community.

Some AI researchers suspect that DeepSeek’s latest model was trained, at least in part, on outputs generated by Google’s Gemini family of AI models. Melbourne-based developer Sam Paech, known for creating "emotional intelligence" evaluations for AI, published evidence suggesting that R1-0528 favors words and expressions similar to those used by Gemini 2.5 Pro.
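
One way to build intuition for this kind of evidence is to compare the word-choice distributions of two models' outputs. The sketch below is purely illustrative and is not Paech's actual methodology; the sample strings and the cosine-similarity scoring are assumptions made for the example.

```python
# Hypothetical sketch: comparing word-choice distributions between two models'
# outputs. Not Paech's actual methodology, just the general idea behind
# "this model favors the same words and expressions as that one".
from collections import Counter
import math

def word_profile(texts):
    """Lowercased word-frequency profile across a list of output samples."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Placeholder samples standing in for real model outputs.
gemini_samples = ["let us delve into the nuanced interplay of these factors"]
r1_samples = ["we should delve into the nuanced interplay between both factors"]

print(cosine_similarity(word_profile(gemini_samples), word_profile(r1_samples)))
```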

Another developer, who runs a "free speech eval" tool called SpeechMap, noted that the "thought traces" — the intermediate reasoning steps generated by DeepSeek’s model — closely resemble those from Gemini. While this is not definitive proof, it adds weight to the theory that DeepSeek may be leveraging Google’s AI outputs.

This is not the first time DeepSeek has faced accusations of training its models on competitors’ data. In late 2024, DeepSeek’s earlier V3 model was observed frequently identifying itself as ChatGPT, suggesting it may have been trained on ChatGPT conversation logs. OpenAI has also said it found evidence that DeepSeek used distillation, a technique for training smaller models on the outputs of larger ones, which OpenAI’s terms of service prohibit when the outputs are used to build competing models.
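
For readers unfamiliar with the technique, here is a minimal, generic sketch of knowledge distillation in PyTorch. It is not DeepSeek's pipeline and does not involve any real API; the toy networks, random data, and temperature value are assumptions chosen to keep the example self-contained.

```python
# Minimal knowledge-distillation sketch (illustrative only): a small "student"
# network is trained to match the softened output distribution of a larger
# "teacher", which is the core idea behind distillation.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

teacher = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 8))  # stand-in for the large model
student = nn.Sequential(nn.Linear(16, 8))                                # much smaller model
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
temperature = 2.0

for step in range(100):
    x = torch.randn(32, 16)            # stand-in for real prompts/features
    with torch.no_grad():
        teacher_logits = teacher(x)    # "querying" the larger model
    student_logits = student(x)
    # KL divergence between softened teacher and student distributions
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```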

Microsoft, a key OpenAI partner, detected unusual data exfiltration through OpenAI developer accounts believed to be linked to DeepSeek. While distillation is a common AI training technique, unauthorized use of proprietary model outputs raises ethical and legal concerns.

One challenge in verifying such claims is the widespread "contamination" of training datasets with AI-generated content. The open web is flooded with synthetic text from content farms and bots, making it difficult to distinguish original human writing from AI output. Partly as a result, many models end up converging on similar words and phrasings, regardless of whose outputs they were trained on.
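
A crude way to see why this is hard: filters that can realistically run over web-scale data are often simple heuristics like the one sketched below (the phrase list and sample corpus are invented for illustration), and synthetic text without obvious tells slips straight through.

```python
# Hypothetical sketch: a naive phrase-based filter for AI-generated text in a
# web-scraped corpus. Real contamination filtering is far harder than this,
# which is exactly why provenance claims are difficult to verify.
AI_TELLTALE_PHRASES = [
    "as an ai language model",
    "i cannot assist with that",
    "i hope this helps!",
]

def looks_ai_generated(document: str) -> bool:
    """Crude heuristic: flag documents containing common assistant boilerplate."""
    text = document.lower()
    return any(phrase in text for phrase in AI_TELLTALE_PHRASES)

corpus = [
    "As an AI language model, I cannot browse the internet.",
    "Local council approves new bike lanes after public hearing.",
]
clean_corpus = [doc for doc in corpus if not looks_ai_generated(doc)]
print(clean_corpus)
```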

Nathan Lambert, a researcher at the nonprofit AI research institute AI2, commented that if he were DeepSeek, he would generate large amounts of synthetic data from the best available API model, such as Gemini, to compensate for limited GPU resources. In effect, the strategy converts money spent on API calls into extra compute for training.
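
A minimal sketch of the strategy Lambert describes, generating supervised fine-tuning pairs by querying a stronger external model: `query_frontier_model` is a hypothetical placeholder, not a real SDK call, and the seed prompts and output file are assumptions for the example.

```python
# Hedged sketch: spend API calls on a strong external model to produce
# synthetic training pairs instead of spending scarce GPU hours.
import json

def query_frontier_model(prompt: str) -> str:
    """Hypothetical placeholder for a call to a frontier model's API."""
    return f"[model response to: {prompt}]"

seed_prompts = [
    "Prove that the sum of two even integers is even.",
    "Write a Python function that merges two sorted lists.",
]

with open("synthetic_train.jsonl", "w") as f:
    for prompt in seed_prompts:
        completion = query_frontier_model(prompt)
        # Each line becomes one supervised fine-tuning example for the smaller model.
        f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```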

In response to concerns about distillation and data misuse, AI companies have tightened security. OpenAI now requires organizations to complete an ID-verification process before accessing its most advanced models, and China is not among the supported countries. Google and Anthropic have begun summarizing the reasoning "traces" their models generate, making it harder for competitors to train rival models on those outputs.
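
The defensive idea behind summarized traces can be sketched as follows. This is an illustration of the concept, not Google's or Anthropic's actual serving logic; the toy summarizer simply truncates by sentence.

```python
# Illustrative sketch: return only a condensed summary of the model's reasoning
# trace, so API consumers never see the full chain of thought to distill from.
def summarize_trace(raw_trace: str, max_sentences: int = 2) -> str:
    """Toy summarizer: keep only the first few sentences of the raw trace."""
    sentences = [s.strip() for s in raw_trace.split(".") if s.strip()]
    return ". ".join(sentences[:max_sentences]) + "."

def build_api_response(answer: str, raw_trace: str) -> dict:
    # The full trace stays server-side; clients get the answer plus a brief summary.
    return {"answer": answer, "reasoning_summary": summarize_trace(raw_trace)}

raw = ("First I restate the problem. Then I consider the base case. "
       "Next I apply induction on n. Finally I verify the bound numerically.")
print(build_api_response("The bound holds for all n.", raw))
```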

The DeepSeek case highlights the growing complexities around AI training data provenance, intellectual property, and competitive ethics in the rapidly evolving AI landscape. As models grow more sophisticated, the lines between original data and synthetic training inputs blur, challenging industry norms and regulatory frameworks.
