A Fully Sensitive Correlation Measure For Data Mining
Price
Free (open access)
Volume
40
Pages
7
Page Range
35 - 41
Published
2008
Size
195 kb
Paper DOI
10.2495/DATA080041
Copyright
WIT Press
Author(s)
R. J. G. B. Campello & E. R. Hruschka
Abstract
This paper introduces a novel sequence correlation measure that is fully sensitive to both the ranks and the magnitudes of the sequences under evaluation. This measure can be more appropriate than existing ones in application scenarios where such full sensitivity is desired. The applicability of the new measure to data mining tasks is motivated.

Keywords: correlation indexes, clustering analysis.

1 Introduction

A problem that appears in different contexts of data analysis is that of comparing two sequences A = {a1, a2, ..., an} and B = {b1, b2, ..., bn} for which there is a total order relation (≤) on their elements. This problem can be addressed by means of correlation indexes, such as the well-known Pearson correlation coefficient [1, 2]. Aside from the wide applicability of such indexes in statistics [3, 4], there are also several possible scenarios for their application to data mining tasks. One example is the use of sequence correlation indexes for feature selection as a pre-processing step for data clustering or classification [5]. Another is the measurement of similarities in bioinformatics data sets [6]. For example, sequences A and B can refer to the responses of a given pair of genes along a set of experiments (e.g. microarray) [7]. Since the trend of such responses plays a fundamental role in describing the function and behavior of the corresponding genes, correlation indexes have been widely used as measures of similarity when dealing with this sort of data.
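As a rough illustration of what sensitivity to ranks versus magnitudes means when comparing two sequences, the sketch below computes two well-known existing indexes: the Pearson coefficient, which depends on magnitudes, and the Spearman rank coefficient, which depends only on ranks. It does not implement the measure proposed in the paper, whose definition is not given in this excerpt. The function names and the `gene_a` and `gene_b` values are hypothetical and chosen only for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("sequences must have the same length (at least 2)")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_a * var_b)


def ranks(seq):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(seq)), key=lambda i: seq[i])
    result = [0.0] * len(seq)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and seq[order[j + 1]] == seq[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(a, b):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(a), ranks(b))


# Hypothetical responses of two genes over five experiments (made-up values).
# Both sequences have the same ordering, but the last magnitude differs a lot.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [1.0, 2.0, 3.0, 4.0, 50.0]

print("Pearson :", pearson(gene_a, gene_b))
print("Spearman:", spearman(gene_a, gene_b))
```

Because the two sequences are ordered identically, the rank-based index treats them as perfectly correlated, while the magnitude-based index is pulled below 1 by the single large value. This is the kind of difference in sensitivity that motivates choosing a correlation index to suit the application.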
Keywords
correlation indexes, clustering analysis.