Multilingual Text Mining
Price
Free (open access)
Volume
35
Pages
6
Published
2005
Size
488 kb
Paper DOI
10.2495/DATA050091
Copyright
WIT Press
Author(s)
F. Neri
Abstract
The availability of huge amounts of textual data from a bewildering variety of sources leads to a well-known paradox: an overload of information yields no usable knowledge. In fact, up to 80% of electronic data is textual. Moreover, the most valuable information is encoded in pages that are written in various native languages but are relevant even to non-native speakers. The process of accessing all these raw data, heterogeneous in language, and transforming them into information is therefore inextricably linked to the concepts of textual analysis and synthesis, and hinges greatly on the ability to master the problems of multilingualism. Through multilingual text mining, users can get an overview of large volumes of textual data in a highly readable grid, which helps them discover meaningful similarities among documents and find all related information. This paper describes the approach used by SYNTHEMA for multilingual text mining, showing classification results on around 600 breaking news items written in English, Italian and French.

1 Multilingual resources construction

Generally speaking, the manual construction and maintenance of multilingual language resources is expensive and requires considerable effort. Established in 1994 by computer scientists from the IBM Research Center, with the expertise and skills to provide effective software solutions and to carry out R&D in Natural Language Processing, SYNTHEMA has been involved in Machine Translation, Information Extraction and Text Mining activities since 1996, primarily in the field of Technology Watch. The growing availability of comparable and parallel corpora has pushed SYNTHEMA to develop specific methods for the semi-automatic updating of lexical resources, based on Natural Language Understanding and Machine Learning. These techniques detect multilingual lexicons from such corpora, by extracting all the
Keywords