Author clustering using compression-based dissimilarity scores: Notebook for PAN at CLEF 2017

Author: Halvani, O.; Graner, L.
Type: Conference Paper, Electronic Publication
Abstract: The PAN 2017 Author Clustering task examines two application scenarios: complete author clustering and authorship-link ranking. In the first scenario, one must identify the number (k) of different authors within a document collection and assign each document to exactly one of the k clusters, where each cluster corresponds to a different author. In the second scenario, one must establish authorship links between documents in a cluster and provide a list of document pairs, ranked according to a confidence score. We present a simple scheme to handle both scenarios. To group the documents by their authors, we use k-Medoids, where the optimal k is determined through the computation of silhouettes. To determine links between the documents in each cluster, we apply a predefined compressor as well as a dissimilarity measure. The resulting compression-based dissimilarity scores are then used to rank all document pairs. The proposed scheme does not require (text-)preprocessing, feature engineering or hyperparameter optimization, which are often necessary in author clustering and related fields. However, the achieved results indicate that there is room for improvement.
Conference: Conference and Labs of the Evaluation Forum (CLEF) <2017, Dublin>
Part: Cappellato, L.: CLEF 2017, Conference and Labs of the Evaluation Forum. Working Notes. Online resource: Dublin, Ireland, September 11-14, 2017. Dublin, 2017. (CEUR Workshop Proceedings 1866), Paper 59, 11 pp.
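
The scheme outlined in the abstract can be illustrated with a minimal sketch. The abstract names neither the compressor nor the dissimilarity measure, so this sketch assumes zlib and the normalized compression distance (NCD) as stand-ins. The k-Medoids routine is a naive alternating variant with a fixed initialisation, and every function name is hypothetical. It should not be read as the authors' implementation.

```python
import itertools
import zlib


def csize(data: bytes) -> int:
    """Compressed length in bytes (zlib is an assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance (an assumed dissimilarity measure)."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = csize(bx), csize(by)
    return (csize(bx + by) - min(cx, cy)) / max(cx, cy)


def distance_matrix(docs):
    """Symmetric pairwise dissimilarities with a zero diagonal."""
    n = len(docs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        # Average both orders, since C(x+y) and C(y+x) may differ slightly.
        d[i][j] = d[j][i] = (ncd(docs[i], docs[j]) + ncd(docs[j], docs[i])) / 2
    return d


def k_medoids(d, k, max_iter=100):
    """Naive alternating k-medoids; returns one cluster label per document."""
    n = len(d)
    medoids = list(range(k))  # deterministic start, a simplification
    for _ in range(max_iter):
        # Assign each document to its nearest medoid.
        labels = [min(range(k), key=lambda c: d[i][medoids[c]]) for i in range(n)]
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if members:
                # New medoid: the member with the smallest total distance.
                best = min(members, key=lambda m: sum(d[m][j] for j in members))
            else:
                best = medoids[c]
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels


def mean_silhouette(d, labels):
    """Mean silhouette width; singleton clusters contribute 0."""
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for i, c in enumerate(labels):
        own = clusters[c]
        if len(own) == 1:
            continue
        a = sum(d[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(sum(d[i][j] for j in m) / len(m)
                for other, m in clusters.items() if other != c)
        total += (b - a) / (max(a, b) or 1.0)
    return total / len(labels)


def cluster_and_rank(d):
    """Pick k by silhouette, then rank same-cluster pairs by 1 - dissimilarity."""
    n = len(d)  # assumes at least three documents
    labels = max((k_medoids(d, k) for k in range(2, n)),
                 key=lambda lab: mean_silhouette(d, lab))
    links = [(1.0 - d[i][j], i, j)
             for i, j in itertools.combinations(range(n), 2)
             if labels[i] == labels[j]]
    return labels, sorted(links, reverse=True)
```

A call such as `cluster_and_rank(distance_matrix(docs))` returns the cluster labels together with the within-cluster document pairs, highest confidence first. Because k is searched from 2 upward, this sketch cannot report a single-author collection, and using `1 - dissimilarity` as the confidence score is likewise an assumption.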