| Abstract |
|---|
| Audio deepfakes are artificially generated audio recordings used to create inauthentic imitations of a person's utterances. With the increasing utilisation of artificial intelligence, detecting such recordings is becoming ever more important. A major challenge in audio deepfake detection (ADD) is the generalization of the detectors. We analysed six feature representations for ADD: MFCC, LFCC, Wav2Vec 2.0, and three pre-trained encoders of automatic speech recognition (ASR) systems (Whisper, SpeechT5, Canary). To evaluate their generalizability, we trained the models on the ASVspoof 2019 LA train set and tested them on the In-the-Wild (ITW) test set. Furthermore, we evaluated the training time of the models, analysed the effect of additional (newer) training data, and performed a correlation analysis. Simple MFCC features outperformed the ASR encoders, achieving an EER of 17.96% on the ITW test set with the lowest training time (7 min. per epoch). The best result, an EER of 17.60%, was achieved with Wav2Vec 2.0 (+MFCC), but it required additional training data and a training time of 135 min. per epoch. Wav2Vec 2.0 also showed the highest correlations across varying training settings, indicating the best predictability of the model's performance. |