Competition among large language models (LLMs) has intensified significantly over the past two years. Many assume that a model's core competitiveness lies in its algorithms, but this is no longer the case. The open-source ecosystem has made mainstream architectures increasingly transparent: model designs such as Llama, GPT, and Gemma can be publicly reproduced, so the competitive edge at the algorithmic level is rapidly eroding. The real competitive barrier lies at a more fundamental level: data.
Data is the sole source of knowledge for LLMs, and data quality determines a model's “emotional intelligence” and “intelligence quotient.” In other words, the progress of LLMs has depended largely on large-scale, high-quality training data. However, most mainstream training datasets and their processing pipelines remain undisclosed, and publicly available data resources are still limited in both scale and quality. This makes it difficult for the community to build and optimize training data for LLMs.