Publications

Machine Learning-Based Detection of AI-Generated Text via Stylistic and Statistical Feature Modeling

Author: Schäfer, Karla; Steinebach, Martin
Date: 2025
Type: Conference Paper
Abstract: Advances in large language models (LLMs) have made AI-generated text easy to create. However, these tools can also pose a threat, e.g. through the creation of disinformation. In this work, we analysed texts from the CUDRT dataset generated by three LLMs: GPT-3.5, LLaMA3, and Qwen. We extracted 220 stylistic and statistical features of human and AI-generated text using the LFTK library. First, we analysed the features using the Pearson correlation. Second, we trained five machine learning models and tested the classifiers on detecting completely AI-generated texts, polished texts, rewritten texts, and summaries created by AI. We obtained an F1-score of 90% or more for text generated entirely by AI, depending on the LLM used. We found that AI-generated texts, independent of the LLM, can be identified through a high Kuperman age, i.e. high word complexity, whereas human-written texts show higher lexical variation and richness. We provide an explanation for the classification results and a comparison with a fine-tuned RoBERTa model.
Conference: International Conference on Trust, Security and Privacy in Computing and Communications 2025
URL: https://publica.fraunhofer.de/handle/publica/506062
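
The feature-analysis step described in the abstract (correlating a stylistic feature with the human/AI label) can be illustrated with a minimal sketch. This is not the authors' pipeline: it does not use the LFTK library or the CUDRT dataset. The `type_token_ratio` and `pearson` helpers are hypothetical names introduced here, the type-token ratio stands in as one simple proxy for lexical variation, and the four sample strings are invented toy inputs, so the resulting number says nothing about the paper's findings.

```python
import math


def type_token_ratio(text: str) -> float:
    """Distinct tokens divided by total tokens (a crude lexical-variation proxy)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        return 0.0  # a constant sequence has no defined correlation; report 0
    return cov / (dev_x * dev_y)


# Invented toy samples (NOT from CUDRT): label 1 = AI-generated, 0 = human-written.
samples = [
    ("the cat sat on the mat and the cat sat on the mat", 1),
    ("the model is the model and the model is the model", 1),
    ("quick brown foxes jump over lazy dogs near quiet rivers", 0),
    ("she wandered through crooked alleys humming forgotten tunes", 0),
]

features = [type_token_ratio(text) for text, _ in samples]
labels = [float(label) for _, label in samples]
r = pearson(features, labels)
print(f"Pearson r between type-token ratio and AI label: {r:.3f}")
```

In a real study, each of the extracted features would be correlated with the label in this way over the full dataset, and the features would then be passed to the classifiers.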