Even when people know they may be listening to AI-generated speech, it is still difficult for both English and Mandarin speakers to reliably detect a deepfake voice. That means billions of people who understand the world’s most spoken languages are potentially at risk when exposed to deepfake scams or misinformation.
Kimberly Mai at University College London and her colleagues challenged more than 500 people to identify speech deepfakes among multiple audio clips. Some clips contained the authentic voice of a female speaker reading generic sentences in either English or Mandarin, while others were deepfakes created by generative AIs trained on female voices.
The study participants were randomly assigned to one of two experimental setups. One group listened to 20 voice samples in their native language and had to decide whether the clips were real or fake.
Listeners in this group correctly classified the deepfakes and the authentic voices about 70 per cent of the time for both the English and Mandarin voice samples. That suggests human detection of deepfakes in real life will probably be even worse, because most people would not necessarily know in advance that they might be hearing AI-generated speech.
A second group was given 20 randomly chosen pairs of audio clips. Each pair featured the same sentence spoken by a human and the deepfake, and participants were asked to flag the fake. This boosted detection accuracy to more than 85 per cent – although the team acknowledged that this scenario gave the listeners an unrealistic advantage.
“This setup is not completely representative of real-life scenarios,” says Mai. “Listeners would not be told beforehand whether what they are listening to is real, and factors like the speaker’s gender and age could affect detection performance.”
The study also did not challenge listeners to identify whether the deepfakes sound like the target person being mimicked, says Hany Farid at the University of California, Berkeley. Identifying the authentic voice of specific speakers is important in real-life scenarios: scammers have cloned the voices of business leaders to trick employees into transferring money, and misinformation campaigns have uploaded deepfakes of well-known politicians to social media networks.
Still, Farid says such research helps to evaluate how well AI-generated deepfakes are “moving through the uncanny valley”, mimicking the natural sound of human voices without retaining the subtle speech differences that may feel eerie to listeners. The study provides a useful baseline for automated deepfake detection systems, he says.
Additional attempts to train participants to improve their deepfake detection generally failed. That suggests it is important to develop AI-powered deepfake detectors, says Mai. She and her colleagues are looking to test whether large language models capable of processing speech data can do the job.