Textual Document Pre-processing And Feature Extraction In OLEX
Price
Free (open access)
Volume
35
Pages
11
Published
2005
Size
367 kb
Paper DOI
10.2495/DATA050171
Copyright
WIT Press
Author(s)
R. Curia, M. Ettorre, L. Gallucci, S. Iiritano & P. Rullo
Abstract
Knowledge Discovery in Text (KDT) has emerged as a challenging application due to the large amount of textual documents available from heterogeneous sources. OLEX is a KDT system for text classification developed at Exeura. A critical step of a KDT process is the pre-processing phase, which consists of a number of complex tasks aimed at making documents “machine readable”. This paper describes the OLEX Pre-processing Module (OPM), an advanced software module based on a general framework that supports the extraction of relevant linguistic, syntactic and structural features from texts. A main aspect of OPM is its capability to support parallel text annotation.
1 Introduction
Managing the huge amount of textual documents available on the web and on intranets has become an important problem in Knowledge Management. Techniques and tools for text categorization are therefore needed [1]. Textual document collections can be seen as sources of unstructured data from which knowledge can be mined by using Knowledge Discovery in Text (KDT) [2], an interactive and iterative process based on four phases:
• Document Acquisition
• Document Pre-Processing
• Text Mining
• Result Interpretation
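The four KDT phases listed above can be read as a simple pipeline in which each phase consumes the output of the previous one. The following is a minimal Python sketch of that flow only; it is not the OLEX or OPM implementation. All names (`Document`, `acquire`, `preprocess`, `mine`, `interpret`, `run_kdt`) are hypothetical, pre-processing is reduced to lowercase word tokenization, and term counting stands in for the text mining phase.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical container for one textual document."""
    doc_id: str
    text: str
    tokens: list = field(default_factory=list)


def acquire(raw):
    """Document Acquisition: wrap raw {id: text} pairs as Document objects."""
    return [Document(doc_id=k, text=v) for k, v in raw.items()]


def preprocess(docs):
    """Document Pre-Processing: assumed here to be lowercase word tokenization."""
    for doc in docs:
        doc.tokens = re.findall(r"[a-z]+", doc.text.lower())
    return docs


def mine(docs):
    """Text Mining: term counting used as a stand-in for a real mining step."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.tokens)
    return counts


def interpret(counts, top_n=3):
    """Result Interpretation: report the most frequent terms."""
    return [term for term, _ in counts.most_common(top_n)]


def run_kdt(raw, top_n=3):
    """Chain the four phases in order."""
    return interpret(mine(preprocess(acquire(raw))), top_n)


if __name__ == "__main__":
    sample = {"d1": "Text mining of text.", "d2": "Mining text documents"}
    print(run_kdt(sample, top_n=2))
```

Since the process is described as interactive and iterative, a real system would typically loop back from result interpretation to an earlier phase; the sketch shows a single pass only.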
Keywords