WIT Press


Textual Document Pre-processing And Feature Extraction In OLEX

Price

Free (open access)

Volume

35

Pages

11

Published

2005

Size

367 kb

Paper DOI

10.2495/DATA050171

Copyright

WIT Press

Author(s)

R. Curia, M. Ettorre, L. Gallucci, S. Iiritano & P. Rullo

Abstract

Knowledge Discovery in Text (KDT) has emerged as a challenging application due to the large amount of textual documents available from heterogeneous sources. OLEX is a KDT system for text classification developed at Exeura. A critical step of a KDT process is the pre-processing phase, which consists of a number of complex tasks aimed at making documents "machine readable". This paper describes the OLEX Pre-processing Module (OPM), advanced software built on a general framework that supports the extraction of relevant linguistic, syntactic and structural features from texts. A main aspect of OPM is its capability to provide support for parallel text annotation.

1 Introduction

Managing the huge amount of textual documents available on the web and on intranets has become an important problem of Knowledge Management. Thus, techniques and tools for text categorization are needed [1]. Textual document collections can be seen as sources of unstructured data from which knowledge can be mined using Knowledge Discovery in Text (KDT) [2], an interactive and iterative process based on four phases:

• Document Acquisition
• Document Pre-Processing
• Text Mining
• Result Interpretation
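The four KDT phases named above can be read as a simple pipeline. The following Python sketch is only an illustration of that reading and is not taken from OLEX or OPM: every function name, the in-memory sample documents, and the use of term frequencies as a stand-in for text mining are assumptions made for the example.

```python
import re
from collections import Counter


def acquire_documents():
    """Phase 1, Document Acquisition: a hypothetical in-memory source."""
    return {
        "d1": "OLEX classifies textual documents.",
        "d2": "Pre-processing makes documents machine readable.",
    }


def preprocess(text):
    """Phase 2, Document Pre-Processing: lowercase the text and split it
    into word tokens, keeping hyphenated words together."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def mine(tokens_by_doc):
    """Phase 3, Text Mining: term frequencies per document, used here
    as a simple stand-in for a real mining step."""
    return {doc_id: Counter(tokens) for doc_id, tokens in tokens_by_doc.items()}


def interpret(features, top_n=3):
    """Phase 4, Result Interpretation: list up to top_n frequent terms
    for each document."""
    return {
        doc_id: [term for term, _ in counts.most_common(top_n)]
        for doc_id, counts in features.items()
    }


def run_kdt_pipeline():
    """Chain the four phases in order."""
    documents = acquire_documents()
    tokens_by_doc = {doc_id: preprocess(text) for doc_id, text in documents.items()}
    features = mine(tokens_by_doc)
    return interpret(features)


if __name__ == "__main__":
    for doc_id, terms in run_kdt_pipeline().items():
        print(doc_id, terms)
```

In this sketch each phase only consumes the output of the previous one. The process described above is interactive and iterative, so a real system would let the user loop back to earlier phases rather than run them once in sequence.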

Keywords