Abstract | With the rise of audio deepfakes, there is an increasing need for comprehensive studies of their generation methods, especially regarding the quality of the generated audio. Languages beyond English and Chinese, as well as comparisons between voice conversion (VC) and text-to-speech synthesis (TTS), remain underexplored. In our study, we generated samples in English and German using 10 recent VC and TTS methods, including two publicly accessible online tools. We compared these samples using various evaluation methods to gain insights into how their quality varies across different factors. Our analysis indicates that TTS performs slightly better than VC, with minor differences between English and German data. Interestingly, in VC, the gender of the source speaker has minimal influence on the generated samples. Instead, the cross-gender factor, that is, whether the source and target speakers differ in gender, appears to affect VC. For both VC and TTS, the target speaker samples used for generation seem to influence the quality of the generated samples. |
---|