REGRETS: A New Corpus of Regrettable (Self-) Disclosures on Social Media

Author: Simo Fhom, Hervais-Clemence; Kreutzer, Michael
Type: Conference Paper
Abstract: In the past few years, researchers have shown a growing interest in techniques for the automated detection of regrettable disclosures (things people wish they had not shared) on social media. Most of these proposals formulate the task of automatically detecting potentially regrettable disclosures as a supervised classification problem. In such a setting, the underlying classification model is trained and validated on a dataset labeled accordingly. However, despite growing efforts, existing approaches remain limited, partly due to the lack of a high-quality corpus of regrettable messages and comments shared on social media. Previous work tends to confuse regrettable disclosure with related concepts such as hate speech, profanity, and offensive language, ignoring empirical findings on the reasons, the types of content, and the disclosure contexts that often lead to regret. Moreover, corpora used in prior work are typically limited in size, in their source domains (i.e., social media platforms), and in scope (i.e., the range of regret-related topical content used as labels). The goal of this paper is to help lower the barrier to developing effective systems for the automated detection of regret-related posts. We propose a novel methodology for large-scale data collection and semi-automated annotation. We introduce REGRETS, a new large-scale corpus of 4.7 million regrettable text-only posts and comments with high-quality annotations. Further, we propose regret-specific embedding models pre-trained on our corpus of user-generated social media texts, which were extracted from various popular social media ecosystems. Lastly, we report on analyses that demonstrate the feasibility of partly automating the annotation of social media texts, and the richness of the resulting corpus. We release our findings as resources to facilitate further interdisciplinary research:
Conference: Conference on Computer Communications 2022