The Data Reality Diet
Do we need a food safety system for information—to shed some of the false and biased data we're being fed by AI?
One of today’s most compelling metaphors compares synthetic media to food. If someone handed you a cheap hot dog made of who-knows-what, you’d want a label. You’d want to know what’s in it, who made it—and how. Should the same be true for the AI-assisted content we consume? A new data safety movement is emerging to take up the challenge.
NEW YORK—One of the deepfakes that rocked the world last year, and one we wrote about at the time, is still among the most memorable: the late Pope Francis decked out in a white Balenciaga puffer jacket, greeting parishioners on the streets of Rome.
While some found that fake endearing, another deepfake on last year’s AI-generated hit parade was far more troubling: Taylor Swift’s likeness woven onto an adult film performer’s body in pornographic footage that was viewed 47 million times.
And let’s not forget that last year a political operative doctored recordings of Joe Biden’s voice to piece together a fake robocall to New Hampshire Democrats, urging them not to vote at all.
Silly or outrageous as these and other deepfakes have been, a lot of people fall for them, and fall hard. According to a 2023 survey by the Pew Research Center, only 42% of Americans can recognize a deepfake image when they see it. And today, the fakes are getting even tougher to spot.
But what hits hardest now is not the spectacle of the technology but the realization that, as AI-driven fraud becomes increasingly common, we’ve started living in a world where almost everything might be a lie, a world in which nearly every photo, video and controversial utterance by Trump and other politicians demands a forensic audit to be believed. Just ask Ukrainian President Volodymyr Zelenskyy, who has recently become a popular AI target.
Deepfake videos are getting so good, in fact, that longtime email scammers are pivoting to impersonating people on live video calls. “Most troubling is that dictators and authoritarians thrive in an environment in which we trust no one and nothing, including ourselves and our own senses,” says deepfake expert Sam Gregory, the executive director of the human rights organization WITNESS (witness.org).
But let’s be clear: the idea that we can fact-check ourselves out of this crisis is a fantasy. Verification is slow, and it happens only after the lie is loose; we also need to make sure the data feeding AI systems is free of bias and error before it is distributed.
“(DIY fact-checking) is a con,” says Baratunde Thurston, a comedian, writer and cultural critic. Checking facts for yourself still matters, he says, but it is hard in today’s polarized environment, and much more is needed to guard the truth. “It’s a con designed to shift responsibility off the platforms, the politicians, the profiteers—and dump it on ordinary people—the underpaid, over-policed, under-connected communities who already get ignored when they tell the truth. Now they get to be ignored and accused of spreading lies unless they come with receipts, watermarks, and chain-of-custody documentation. It's surveillance culture with a trust tax.”
Is reality sustainable if it always has to be proven?
Cleaner data can help, AI researchers say, but as synthetic media become more sophisticated, we need more AI literacy—plus a more formal and global effort to verify the good data and flag the bad.
Thanks to OpenAI's DALL-E 2, Midjourney and Stable Diffusion, deepfake images are more realistic and more widely available than ever, and the technology to create videos out of whole cloth is rapidly improving. According to the Federal Trade Commission, job and employment scams on LinkedIn and other platforms nearly tripled from 2020 to 2025, with reported losses from those scams growing from some $90 million to roughly $500 million.
Daniel Schiff, a policy scientist and the co-director of the Governance and Responsible AI Lab at Purdue University, says people struggle to separate reliable, unbiased information from AI-generated content that can sound authoritative but be hit or miss. “AI shapes the information people find. It shapes what they learn. It now shapes what they think,” Schiff says.
Anton Dahbura, the co-director of the Johns Hopkins Institute for Assured Autonomy, also reminds us that AI can hallucinate. It can reference sources that don’t exist—“and leave out important opinions in an effort to offer a digestible consensus” about a consumer product or important medical research.
New Rule
But one of the newest ideas, and among the most promising ways to fight back, say data experts, is the data safety label, modeled on the labels used to rank food for health. A new group of data researchers and scientists is now working to develop this framework.
“We need a food safety system for information,” says Kasia Chmielinski, the project lead for the Data Nutrition Project, a research organization that is applying the nutrition-label metaphor to the data that goes into the training datasets for AI systems.
The Project has been lauded internationally for its research and for its push to create a standard labeling system for interrogating datasets. “It’s long overdue,” says Chmielinski. “It’s our belief that deeper transparency into dataset health and accuracy will help us move forward without mirroring social biases and disinformation in what AI delivers.” She says the scale of false data being used and created is rapidly expanding.
The Data Nutrition Project takes its inspiration from the nutrition labels on food. The safety labels it is creating highlight the key ingredients of a dataset, such as metadata, demographic representation, distribution and missing data, along with transparency around the dataset’s intended use and its potential risks or limitations.
“We need to make it easier for people using AI to assess the health and fitness of the datasets they intend to use to train AI algorithms,” Chmielinski says. “That has been a missing step in the AI development pipeline.” She says the Project is working on packaging the nutritional data as quality recipes and an easy-to-use Dataset Nutrition Label, which will soon become internationally accessible.
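To make the metaphor concrete, here is a minimal sketch of what a machine-readable dataset label might look like. The field names and example values are hypothetical illustrations drawn from the ingredients described above; they are not the Data Nutrition Project’s actual schema.

```python
from dataclasses import dataclass, field, asdict
import json

# A hypothetical sketch of a dataset "nutrition label."
# The fields below are illustrative, not the Data Nutrition
# Project's actual schema.

@dataclass
class DatasetNutritionLabel:
    name: str                         # dataset title
    source: str                       # who collected the data, and how
    intended_use: str                 # what the data was gathered for
    demographic_representation: dict  # e.g., share of records per group
    missing_data_pct: float           # fraction of empty or null fields
    known_risks: list = field(default_factory=list)  # documented limitations

def print_label(label: DatasetNutritionLabel) -> None:
    """Render the label as JSON, like printing an ingredients panel."""
    print(json.dumps(asdict(label), indent=2))

if __name__ == "__main__":
    # A made-up example dataset, for illustration only.
    label = DatasetNutritionLabel(
        name="example-loan-applications-2024",
        source="Scraped public filings; provenance partially unverified",
        intended_use="Credit-risk research; not for automated decisions",
        demographic_representation={"age_18_34": 0.61, "age_65_plus": 0.04},
        missing_data_pct=0.12,
        known_risks=[
            "Under-represents older applicants",
            "Income field is self-reported and unaudited",
        ],
    )
    print_label(label)
```

The value of a standard format like this lies less in the label itself than in what it enables: if every training dataset shipped with a machine-readable panel, tools could check the ingredients automatically before a model is ever trained, the way a shopper scans a label before buying.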
The appetite for data safety labels that could make deepfakes harder to produce and easier to spot is growing. Other groups using the food label metaphor and contributing research to the Data Nutrition Project include the Berkman Klein Center at Harvard University, the United Nations Humanitarian Data Exchange, the AI Transparency Project at the Harvard Kennedy School and the MIT Media Lab.
“We don’t need more panic over AI,” says Thurston. “We need recipes. We need labeling laws. We need a nutritional panel for information and we need to regulate the informational supply chain the same way we regulate what goes into our groceries. Because if we don’t, we’ll keep getting sick. We’ll keep ingesting narratives designed to confuse, distract, and divide. And the people feeding it to us? They’re banking on us being too burnt out to ask what’s in the sausage.”
For more information on the data safety labeling project, check out the Data Nutrition Project’s additional work and draft label here.
Got a comment on the use of AI and the need to make the data it gets trained on more accurate? Share your take here—and thank you, as always, for your readership.