DeTox: A Comprehensive Dataset for German Offensive Language and Conversation Analysis

Authors: Demus, Christoph; Pitz, Jonas; Schütz, Mina; Probol, Nadine; Siegel, Melanie; Labudde, Dirk
Type: Conference Paper
Abstract: In this work, we present a publicly available offensive language dataset (DeTox-dataset) containing 10,278 annotated German social media comments collected in the first half of 2021. With twelve annotation categories labeled by six annotators, it is far more comprehensive than other datasets and goes beyond hate speech detection alone. In particular, the labels also cover toxicity, criminal relevance, and types of discrimination in comments. Furthermore, about half of the comments come from coherent parts of conversations, which makes it possible to consider the comments' context and to analyze conversations in order to research the contagion of offensive language within them. The dataset is available in our GitHub repository:
Conference: Workshop on Online Abuse and Harms 2022