WIT Press


A Simple Mixture Model For Unsupervised Text Categorisation

Price

Free (open access)

Volume

33

Pages

10

Published

2004

Size

290 kb

Paper DOI

10.2495/DATA040021

Copyright

WIT Press

Author(s)

F. Clérot, F. Fessant, O. Collin, O. Cappé & E. Moulines

Abstract

Automatically segmenting text corpora into thematically related groups is a complex exploratory analysis problem. In this article, we outline our multi-stage exploratory analysis process and investigate the performance of a simple statistical model. After describing this model and its fitting procedure, we illustrate its performance on the segmentation of a corpus of CKM-related texts in English.

1 Introduction

Clustering is a key tool in exploratory data analysis: segmenting the data into homogeneous groups leads to a more synthetic understanding of the data, makes it possible to build powerful visualisations, and is often the first step towards more specific analyses such as supervised classification. Although less standard in the analysis of text data, clustering has recently received a lot of attention, with the goal of bringing the same benefits to text data analysis.

There are, however, significant differences between text data analysis and numerical data analysis. For numerical data, cluster "homogeneity" is judged from a metric in data space; for text data, "homogeneity" implicitly means "topical homogeneity", a notion which is more difficult to measure. In this article, the purpose of text clustering is to build a topical segmentation of a corpus. Because of the difficulty of defining a priori a topical homogeneity measure, the text clustering analysis must be considered as a part of

Keywords

text mining, exploratory analysis, clustering, mixture model.