Publications

AI Got Your Tongue? Analysing the Sounds of Audio Deepfake Generation Methods

Author: Schäfer, Karla
Date: 2025
Type: Conference Paper
Abstract: In current research, audio deepfake detectors are trained to find differences between bona-fide and spoofed samples. A variety of generation methods exist, broadly divided into voice conversion (VC) and text-to-speech synthesis (TTS). We assume that these generation methods leave specific artefacts in the generated recording. To test this, we created a test set of spoofs sharing the same linguistic content and target speakers as their bona-fide counterparts, using four VC and four TTS models. We applied feature representation methods to compare 1) bona-fide vs. spoofed samples, 2) samples created using VC vs. TTS, and 3) the individual generation methods. We found differences between spoofed and bona-fide samples: spoofs show overall higher deflections in the waveform and overall smaller values in the spectral evaluation. In the spectral domain, several differences between VC and TTS were detected; XTTS and kNN-VC stood out in the spectral features, e.g. spectral contrast. The samples created using RVC appeared most similar to bona-fide. MFCC and LFCC were the most effective at identifying differences between bona-fide and spoofed audio, making them a suitable choice for detecting audio deepfakes.
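The abstract names MFCC and spectral contrast among the features compared across bona-fide and spoofed recordings. As a minimal sketch of how such per-file feature summaries might be computed, not the paper's actual pipeline, the following uses librosa; the file paths, the 16 kHz sample rate, and the 13-coefficient MFCC setting are assumptions. LFCC, also cited in the abstract, is not built into librosa and would need e.g. torchaudio.transforms.LFCC.

    import numpy as np
    import librosa

    # Hypothetical file paths; replace with your own recordings.
    BONAFIDE = "bonafide.wav"
    SPOOF = "spoof.wav"

    def summarize(path):
        # Load as mono at 16 kHz (assumed rate, common in speech anti-spoofing work).
        y, sr = librosa.load(path, sr=16000)
        # 13 MFCCs, averaged over time, as a compact per-file representation.
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
        # Spectral contrast: peak-to-valley level difference per frequency sub-band.
        contrast = librosa.feature.spectral_contrast(y=y, sr=sr).mean(axis=1)
        return mfcc, contrast

    mfcc_b, contrast_b = summarize(BONAFIDE)
    mfcc_s, contrast_s = summarize(SPOOF)
    print("MFCC delta:", np.abs(mfcc_b - mfcc_s))
    print("Spectral contrast delta:", np.abs(contrast_b - contrast_s))

Averaging over time frames is just one way to obtain a fixed-length summary per recording; a detector would more likely operate on the full frame-level feature matrices.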
Conference: International Conference on Multimedia Retrieval 2025
URL: https://publica.fraunhofer.de/handle/publica/490905