Inside Shanghai's "10,000-GPU clusters": how the city is powering its AI ambitions - Jiemian Global

The city aims to reach 200,000 PFLOPS of computing power by 2027.

Photo from Jiemian News

by ZHUANG Jian

Beyond the doors of the data hall, rows of tightly packed servers come into view, accompanied by the constant hum of cooling fans working to keep temperatures in check.

The scene at a computing center in Shanghai's Songjiang district offers a glimpse into how the city is scaling up infrastructure to support China's fast-growing artificial intelligence industry.

Supported by a state-backed operator, the facility is among the country's first to deploy a "10,000-GPU cluster", a system that links tens of thousands of graphics processing units (GPUs) into a unified engine used to train and run AI models.

Often described as the "power plants" of the AI era, such clusters are becoming critical as demand surges. The Songjiang center has been operating at full capacity since launch. Data from the National Data Administration show that daily token usage in China exceeded 140 trillion in March, more than 1,000 times higher than at the start of 2024.

Shanghai is now planning further expansion, with new computing capacity to be deployed across Pudong, Jinshan, Songjiang, Lingang and Qingpu. The city aims to reach 200,000 PFLOPS of computing power by 2027, up from about 120,000 PFLOPS — enough to support large-scale AI model training.

SUN Yue, general manager of the operating company, said the concentration of industry-specific AI applications in Shanghai makes it necessary to locate computing resources close to users, helping reduce latency and improve response times.

Building a "10,000-GPU cluster" is less about assembling hardware than managing a highly complex system. A single cluster can involve hundreds of thousands of components across more than 100 categories, all of which must operate in sync.

Engineers often compare the process to launching a satellite, where even a minor fault can disrupt the entire system.

One of the most unexpected risks is dust.

Optical modules — key components that enable high-speed data transmission — are extremely sensitive to contamination. Even microscopic particles can affect performance, and in some cases destabilize an entire cluster.

To reduce such risks, installation protocols limit the components' exposure to open air to just a few seconds. Sticky mats are also placed throughout the facility to keep dust from being carried in on shoes.

Keeping the system running continuously is another challenge. AI training tasks can be disrupted by equipment failures, but the Songjiang cluster is designed with redundancy at both system and architectural levels, allowing operations to continue even when faults occur. Engineers can locate and isolate problems within seconds or minutes.

As AI becomes more deeply embedded in the economy, computing power is increasingly being treated as a form of infrastructure — much like water and electricity.

"Our goal is to make computing power as reliable as water and electricity," said ZHAI Yujia, a senior platform executive at the center. "Users shouldn't even notice it's there."