| Abstract | Basic feature representations such as mel-frequency cepstral coefficients (MFCCs) and linear-frequency cepstral coefficients (LFCCs) allow resource-efficient and explainable feature extraction for tasks such as audio deepfake detection. Although these representations are used frequently, the effect of the number of coefficients and of delta and double delta values has not yet been examined comprehensively. We analysed MFCCs and LFCCs combined with four classifiers, using in-domain and out-of-domain test sets. MFCCs performed better on out-of-domain data, LFCCs on the in-domain test set. Using fewer coefficients with longer audio inputs, together with delta and double delta features, yielded results that generalised better. For instance, ResNet34 reached an equal error rate (EER) of 65.15% on the out-of-domain test set with 128 coefficients, compared with 29.71% with 20 coefficients. Furthermore, we identified specific patterns in the MFCCs when employed with the various classifiers. For all classifiers, the lower MFCCs (0, 1) contributed to a classification as bona fide, whereas higher MFCCs contributed to a classification as spoof. |
|---|
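
The EER values quoted in the abstract can be read more easily with a small worked example. The following is a minimal sketch, not the evaluation code used in this work: the function name `equal_error_rate`, the score convention (higher score means more likely bona fide) and the toy scores are all assumptions made for illustration. It sweeps a threshold over the observed scores and reports the point where the false acceptance rate and the false rejection rate are closest.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER from detection scores (illustrative helper).

    Assumed convention: a higher score means "more likely bona fide".
    labels: 1 for bona fide, 0 for spoof.
    """
    bona = [s for s, y in zip(scores, labels) if y == 1]
    spoof = [s for s, y in zip(scores, labels) if y == 0]
    if not bona or not spoof:
        raise ValueError("need at least one bona fide and one spoof trial")

    best_gap, eer = None, None
    # Candidate thresholds: every observed score, plus one above the maximum.
    thresholds = sorted(set(scores)) + [max(scores) + 1.0]
    for t in thresholds:
        far = sum(s >= t for s in spoof) / len(spoof)  # spoof accepted as bona fide
        frr = sum(s < t for s in bona) / len(bona)     # bona fide rejected as spoof
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy data only: two bona fide trials followed by two spoof trials.
labels = [1, 1, 0, 0]
separated = equal_error_rate([0.9, 0.8, 0.2, 0.1], labels)
overlapping = equal_error_rate([0.9, 0.4, 0.6, 0.1], labels)
inverted = equal_error_rate([0.1, 0.2, 0.8, 0.9], labels)
```

With these toy scores, a perfectly separating detector gives an EER of 0, overlapping scores give 0.5, and fully inverted scores give 1. Under this reading, an EER above 50%, such as the 65.15% reported for 128 coefficients, suggests that the detector's scores on that test set tend to rank spoofed audio above bona fide audio rather than merely failing to separate them.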