Text Preparation Through Extended Tokenization
Price
Free (open access)
Volume
37
Pages
9
Published
2006
Size
533 kb
Paper DOI
10.2495/DATA060021
Copyright
WIT Press
Author(s)
M. Hassler & G. Fliedl
Abstract
Tokenization is commonly understood as the first step of any kind of natural language text preparation. The major goal of this early (pre-linguistic) task is to convert a stream of characters into a stream of processing units called tokens. Beyond the text mining community, this job is taken for granted: it is commonly seen as an already solved problem comprising the identification of word borders and punctuation marks separated by spaces and line breaks. In our view, however, it should also manage language-related word dependencies, incorporate domain-specific knowledge, and handle morphosyntactically relevant linguistic specificities. We therefore propose rule-based extended tokenization that incorporates all sorts of linguistic knowledge (e.g., grammar rules, dictionaries). The core features of our implementation are the identification and disambiguation of all kinds of linguistic markers, the detection and expansion of abbreviations, the treatment of special formats, and the typing of tokens, including single- and multi-tokens. To improve the quality of text mining, we suggest linguistically based tokenization as a necessary step preceding further text processing tasks. In this paper, we focus on improving the quality of standard tagging.
Keywords
text preparation, natural language processing, tokenization, tagging improvement, tokenization prototype
1 Introduction
Nearly all researchers concerned with text mining presuppose tokenization as the first step of text preparation [1–5]. Good surveys of tokenization techniques are provided by Frakes and Baeza-Yates [6], Baeza-Yates and Ribeiro-Neto [7], and Manning and Schütze [8, pp. 124–136]. But, as far as we know, only very few treat tokenization as a task of multi-language text processing with far-reaching impact [9]. This involves language-related knowledge about linguistically …
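As a rough illustration of the kind of pipeline the abstract describes, the following is a minimal Python sketch of rule-based extended tokenization: dictionary-based abbreviation expansion, special-format treatment, and token typing. The rule tables (ABBREVIATIONS, SPECIAL_FORMATS) and the function extended_tokenize are hypothetical placeholders, not the authors' actual resources; multi-token detection (e.g., grouping a proper-name sequence into one token) and marker disambiguation would need lookahead and real linguistic resources, and are omitted here.

import re

# Hypothetical rule tables standing in for the dictionaries and grammar
# rules the paper describes; the authors' actual resources are not shown.
ABBREVIATIONS = {"Dr.": "Doctor", "e.g.": "for example"}
SPECIAL_FORMATS = [
    ("DATE", re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")),
    ("NUMBER", re.compile(r"\d+(?:[.,]\d+)?")),
    ("URL", re.compile(r"https?://\S+")),
]

def extended_tokenize(text):
    """Return typed (token, type) pairs instead of bare strings."""
    tokens = []
    for raw in text.split():
        # Abbreviation detection and expansion, checked before the
        # trailing period is split off as punctuation.
        if raw in ABBREVIATIONS:
            tokens.extend((w, "WORD") for w in ABBREVIATIONS[raw].split())
            continue
        # Separate trailing punctuation marks from the token core.
        core = raw.rstrip(".,;:!?")
        trail = raw[len(core):]
        if core:
            # Special-format treatment: dates, numbers, and URLs are
            # typed and kept intact as single tokens.
            for token_type, pattern in SPECIAL_FORMATS:
                if pattern.fullmatch(core):
                    tokens.append((core, token_type))
                    break
            else:
                tokens.append((core, "WORD"))
        tokens.extend((mark, "PUNCT") for mark in trail)
    return tokens

print(extended_tokenize("Dr. Smith arrived on 12/03/2006, e.g. by car."))
# [('Doctor', 'WORD'), ('Smith', 'WORD'), ('arrived', 'WORD'),
#  ('on', 'WORD'), ('12/03/2006', 'DATE'), (',', 'PUNCT'),
#  ('for', 'WORD'), ('example', 'WORD'), ('by', 'WORD'),
#  ('car', 'WORD'), ('.', 'PUNCT')]

Emitting typed tokens rather than bare strings is what lets a downstream tagger treat "12/03/2006" as a date rather than three numbers and two unknowns, which is the kind of tagging improvement the paper targets.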