
The race to create AI applications is creating demand for training data in China

SCOTT DETROW, HOST:

The race to create more powerful artificial intelligence applications has also created a huge demand for high-quality training data and competition over who gets to use that data, and a lot of that demand is in China, as NPR's Emily Feng and Aowen Cao report.

(SOUNDBITE OF MOUSE CLICKING)

EMILY FENG, BYLINE: In this brand-spanking-new office building in northeastern China, rows and rows of people sit silently clicking at their computer screens. This is the fuel that powers so much of generative AI - raw data - and this data processing center is the brainchild of this man.

HENRY CHEN: My name is Henry, Henry Chen.

FENG: He's the founder of Sapien AI. It hires people around the world to collect data and tag and organize it, so it can be used to train a variety of artificial intelligence applications. China is a big market.

CHEN: Especially after DeepSeek came out.

FENG: DeepSeek, the Chinese chatbot performing on par with American-trained chatbots but trained at a fraction of the cost - that demand for data is why Chen's company now has about 60 employees in China labeling maps of Chinese streets. This data today is being used to train an autonomous driving program.

AOWEN CAO, BYLINE: Looks very abstract.

FENG: That's NPR Producer Aowen Cao.

CAO: I see people working in front of computers, but on the computer screens, they are black backgrounds with squares.

FENG: Squares and green dots - it almost looks like, Aowen says laughing, the television show "Severance." The data may look abstract, but it's a valuable commodity, says Rogier Creemers. He's a professor at Leiden University in the Netherlands who studies China's digital technology policies.

ROGIER CREEMERS: They believe that data is an economic input, and in a way, they see it as akin, in that sense, to raw materials.

FENG: Chatbots today, like ChatGPT, need literally trillions of data points to get up to speed, and who owns that data has increasingly been a competition between companies and between countries like the U.S. and China. Each wants an edge over the other in AI, and that means hoarding data. Data is such a choke point that since last year, China's cyberspace regulators have to approve any bulk export of data out of the country, which is in part why Sapien AI, a Canadian company, is in China to begin with.

CHEN: For the AI models that are trained here, the data needs to be processed in the country and cannot leave the country.

FENG: The race to create and protect data is also because the data AI companies want is getting more complicated. Olga Megorskaya, the founder of an Amsterdam-registered data processing company called Toloka, now specializes in creating datasets for highly technical scientific and engineering fields. She uses an analogy that compares early AI models to human toddlers.

OLGA MEGORSKAYA: The person is like 2 years old. He or she is taught by kids books with very bright pictures.

FENG: And more advanced AI models are like university students.

MEGORSKAYA: When she goes to the university, there are dozens of textbooks that she needs to read.

FENG: For an AI model, that means gobbling up more and more advanced datasets. The data industry is crucial enough that local governments in China, once dependent on dying industries like steelmaking and coal mining, are actively recruiting AI data processing companies. Here's Creemers at Leiden University again.

CREEMERS: China wants to make a large amount of money through developing the industries of the future.

FENG: The rust belt city of Shenyang, where Sapien AI chose to locate one of its offices, is one of seven Chinese cities that says it wants to become an AI data hub. The city offers low interest rates on loans and flexible and affordable office space. Here's Chen again at Sapien AI. They benefited from this help.

CHEN: So they give us a lot of help as well, so we just find a really good environment to set up the office here.

FENG: Because data processing employs a lot of young people - China's economy never fully recovered from a global coronavirus pandemic, and youth unemployment has concerned policymakers enough that they briefly stopped publishing that statistic.

(SOUNDBITE OF MOUSE CLICKING)

FENG: One of the young people working at Sapien AI is Huang Rui, age 21. She's a data quality specialist.

HUANG RUI: (Non-English language spoken).

FENG: She says the work of data processing is suitable for people with obsessive-compulsive tendencies because it requires a high level of attention to detail. Data processing is admittedly not the most exciting work, says Chen, her boss.

CHEN: Just picture yourself sitting at a desk and try to draw bounding boxes around cars for 40 hours a week.

FENG: But sometimes innovation requires someone - actually, a whole lot of people - to do the boring work. Emily Feng, NPR News. Transcript provided by NPR, Copyright NPR.

NPR transcripts are created on a rush deadline by an NPR contractor. This text may not be in its final form and may be updated or revised in the future. Accuracy and availability may vary. The authoritative record of NPR’s programming is the audio record.

Emily Feng is NPR's Beijing correspondent.