Constrained Hierarchical Clustering for News Events

Abstract

Knowledge discovery from web news events has received considerable attention in recent years. In practice, this knowledge is a digital representation (a virtual world) of phenomena that occur in the physical world. Hierarchical clustering algorithms organize related events into groups and subgroups according to a similarity measure. This organization is motivated by the hypothesis that a user interested in a specific event in a cluster may also be interested in other related events in the same cluster. However, existing event clustering methods do not make effective use of the different types of information about events, such as temporal information, geographical data, and the names of people and organizations. In this paper, we propose the COH-KMeans algorithm (Constrained Hierarchical K-Means), which obtains a hierarchical clustering structure that takes into account conditions imposed by users, for example, grouping events of similar content that occurred in nearby geographic locations or within a predefined time window. A statistical analysis of the experimental results reveals that incorporating constraints in COH-KMeans yields higher-quality clusters than a state-of-the-art unsupervised hierarchical clustering method. Moreover, we present our tool for exploratory analysis of events and discuss how event clustering can support decision making from the perspective of a Data Analytics System.

Publication
21st International Database Engineering & Applications Symposium