Cross-domain authorship attribution based on compression: Notebook for PAN at CLEF 2018

Author: Halvani, Oren; Graner, Lukas
Type: Conference Paper, Electronic Publication
Abstract: Authorship attribution (AA) is a well-studied research subject and the most prominent subtask of authorship analysis. The goal of AA is to identify the most likely author of an anonymous document among a set of known candidate authors for whom sample documents exist. Even after more than a century of intensive research, AA is still far from being solved. One open question, for example, is whether the goal of AA can be achieved when the anonymous document and the known sample documents come from different domains, such as genre or topic. We present a lightweight authorship attribution approach named COBAA ("Compression-Based Authorship Attribution"), which is an attempt to answer this question. COBAA is based solely on a compression algorithm and a simple similarity measure, and does not involve a training procedure. Therefore, the method can be used out of the box, even in real-world scenarios where no training data is available. COBAA was evaluated in the PAN 2018 Author Identification shared task and ranked third among 11 participating approaches. The method achieved a mean macro-F1 of 0.629 on a corpus of attribution problems distributed across five languages (English, French, Italian, Polish and Spanish).
Conference: Conference and Labs of the Evaluation Forum (CLEF) <2018, Avignon>
Part: Capellato, Linda: CLEF 2018, Conference and Labs of the Evaluation Forum. Working Notes. Online resource: Avignon, France, September 10-14, 2018. Avignon, 2018. (CEUR Workshop Proceedings 2125), Paper 90, 11 pp.
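
The abstract describes attribution built only on a compression algorithm and a simple similarity measure, with no training step. The sketch below illustrates that general idea, not the COBAA implementation itself: the abstract names neither the compressor nor the measure, so zlib and the Normalized Compression Distance (NCD) are substituted here as stand-ins. The function names `compressed_size`, `ncd` and `attribute` are hypothetical.

```python
import zlib
from typing import Mapping, Sequence


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of `text`."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance (assumed stand-in measure).

    Lower values mean the two texts share more compressible structure.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def attribute(unknown: str, candidates: Mapping[str, Sequence[str]]) -> str:
    """Return the candidate author whose samples are closest to `unknown`.

    Each author's sample documents are concatenated into one profile;
    no model is trained, the distances are computed directly.
    """
    distances = {
        author: ncd(unknown, "\n".join(samples))
        for author, samples in candidates.items()
    }
    return min(distances, key=distances.get)
```

Because every distance is computed on demand from the raw texts, a new candidate author can be added by supplying sample documents only. Note that zlib uses a 32 KiB window, so this stand-in is only meaningful for short documents.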