Towards Scaling Up Induction Of Second-order Decision Tables
Price
Free (open access)
Volume
28
Pages
Published
2002
Size
601 kb
Paper DOI
10.2495/DATA020381
Copyright
WIT Press
Author(s)
R Hewett & J Leuchner
Abstract
One of the fundamental challenges for data mining is to enable inductive learning algorithms to operate on very large databases. Ensemble learning techniques such as bagging have been applied successfully to improve the accuracy of classification models by generating multiple models from replicate training sets and aggregating them to form a composite model. In this paper, we adapt the bagging approach for scaling up, and we study the effects of data partitioning, sampling, and aggregation techniques for mining very large databases. Our recent work developed SORCER, a learning system that induces a near-minimal rule set from a data set represented as a second-order decision table (a database relation in which rows have sets of atomic values as components). Despite its simplicity, experiments show that SORCER is competitive with other state-of-the-art induction systems. Here we apply SORCER using two instance subset selection procedures (random partitioning and sampling with replacement) and two aggregation procedures (majority voting and selecting the model that performs best on a validation set). We experiment with the GIS data set from the UCI KDD Repository, which contains 581,012 instances of 30x30 meter cells with 54 attributes for classifying forest cover types. We report performance results, including results from mining the entire training data set using different compression algorithms in SORCER and published results from neural net and decision tree learners.
1 Introduction
The development of inductive learning algorithms that scale up to very large data sets is a fundamental problem in data mining applications. Scalability raises
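The two subset selection procedures and two aggregation procedures named in the abstract can be illustrated with a minimal Python sketch. This is not SORCER and not the authors' implementation: `train_stub` is a hypothetical stand-in learner, and every function name, the `(attributes, label)` instance layout, and the parameters `k` and `size` are assumptions made only for illustration.

```python
import random
from collections import Counter

def random_partition(data, k, rng):
    """Random partitioning: shuffle, then split into k disjoint subsets."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def bootstrap_samples(data, k, size, rng):
    """Sampling with replacement: draw k samples of `size` instances each."""
    return [[rng.choice(data) for _ in range(size)] for _ in range(k)]

def train_stub(subset):
    """Hypothetical stand-in learner (NOT SORCER).

    Predicts the majority class seen for each value of the first attribute
    and falls back to the subset's overall majority class for unseen values.
    """
    by_value = {}
    for x, y in subset:
        by_value.setdefault(x[0], Counter())[y] += 1
    default = Counter(y for _, y in subset).most_common(1)[0][0]
    table = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    return lambda x: table.get(x[0], default)

def accuracy(model, data):
    """Fraction of instances in `data` that `model` classifies correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

def majority_vote(models):
    """Aggregation 1: the composite predicts the most common vote."""
    def predict(x):
        return Counter(m(x) for m in models).most_common(1)[0][0]
    return predict

def best_on_validation(models, validation):
    """Aggregation 2: keep the single model with the best validation accuracy."""
    return max(models, key=lambda m: accuracy(m, validation))
```

Under these assumptions, one configuration would train a model per subset with `[train_stub(s) for s in random_partition(train, k, rng)]` and then pass the resulting list to either `majority_vote` or `best_on_validation`. Because each model sees only a fraction of the data, the per-model training cost drops, which is the sense in which the bagging idea is adapted for scaling up.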