English | 2015 | ISBN: 978-1-118-33258-0 | 720 Pages | PDF | 10 MB
Data Mining Algorithms is a practical, technically-oriented guide to data mining algorithms that covers the most important algorithms for building classification, regression, and clustering models, as well as techniques used for attribute selection and transformation, model quality evaluation, and creating model ensembles. The author presents many of the important topics and methodologies widely used in data mining, whilst demonstrating the internal operation and usage of data mining algorithms using examples in R.
Inductive learning is the source of many data mining algorithms as well as of their theoretical justifications. This is the area where the domains of machine learning and data mining intersect. But even data mining algorithms that do not originate from machine learning can usually be seen as explicit or implicit forms of inductive learning. The analyzed data plays the role of training information, and the models derived therefrom represent the induced knowledge. In particular, the three most widely studied and practically exercised data mining tasks, classification, regression, and clustering, can be considered inductive learning tasks. This chapter provides some basic background, terminology, and notation that is common to all of them.
Whenever discussing any form of learning, including inductive learning, the term “knowledge” is frequently used to refer to the expected result of the learning process. It is unfortunately difficult to provide a satisfactory definition of knowledge, consistent with the common understanding as well as technically useful, without an extensive discussion of psychological and philosophical theories of mind and reasoning, which are undoubtedly beyond the scope of our interest here. It therefore makes sense to adopt a simple indirect surrogate definition that does not try to explain what knowledge is, but explains what purpose it can serve. This purpose of knowledge is inference.
Inference can be considered the process of using some available knowledge to derive some new knowledge. Given that knowledge is both the input and the output of inference, the above idea of defining knowledge as something used for inference may appear rather useless, creating an infinite definition loop. It is not quite that bad since, in the context of inductive learning, different types of inference are employed when using training information to derive knowledge and when using this derived knowledge. These are inductive inference and deductive inference, respectively, and their role in inductive learning is schematically illustrated below.
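The two kinds of inference can be sketched in a few lines of Python. This is only an illustrative sketch, not an example from the book: the function names `induce_threshold` and `deduce_label`, the single-attribute threshold model, and the toy training data are all assumptions made here. Inductive inference derives a model from training information, and deductive inference applies that model to a new instance.

```python
from typing import List, Tuple

# Hypothetical toy setting: one numeric attribute, class labels 0 and 1.
Example = Tuple[float, int]


def induce_threshold(training: List[Example]) -> float:
    """Inductive inference: derive a decision threshold from labeled examples."""
    values = sorted(x for x, _ in training)
    # Candidate thresholds are midpoints between consecutive attribute values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold: float) -> int:
        # Count training examples misclassified by the rule "x > threshold => class 1".
        return sum((x > threshold) != bool(c) for x, c in training)

    return min(candidates, key=errors)


def deduce_label(threshold: float, x: float) -> int:
    """Deductive inference: apply the induced knowledge to a new instance."""
    return int(x > threshold)


training = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
threshold = induce_threshold(training)   # induced knowledge
prediction = deduce_label(threshold, 5.0)  # knowledge used for a new instance
```

Here the induced threshold plays the role of knowledge: it is the output of the inductive step and the input of the deductive step.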
Basic statistics as modeling
Techniques presented in this chapter can actually be reduced to mathematical formulae that calculate certain quantities based on a dataset. A statistic (in a narrow sense) is just that: the value of a formula calculated on a dataset. What such basic statistics produce as output are essentially single numbers. This is indeed very far from the knowledge representations delivered by modeling algorithms.
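As a minimal illustration of a statistic being "the value of a formula calculated on a dataset", the following Python sketch computes two such values. The function names and the small dataset are assumptions chosen here for illustration, and the variance shown is the plain mean squared deviation (divisor n), which is only one of the common variants.

```python
from typing import Sequence


def mean_statistic(data: Sequence[float]) -> float:
    """The arithmetic mean: the sum of the values divided by their count."""
    return sum(data) / len(data)


def variance_statistic(data: Sequence[float]) -> float:
    """The mean squared deviation from the mean (divisor n, not n - 1)."""
    m = mean_statistic(data)
    return sum((x - m) ** 2 for x in data) / len(data)


# Hypothetical dataset: values of a single numeric attribute.
dataset = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

dataset_mean = mean_statistic(dataset)          # a single number
dataset_variance = variance_statistic(dataset)  # another single number
```

Each function maps the whole dataset to one number, which is exactly the sense in which the output of a basic statistic is much simpler than a model.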
Despite that gap in complexity, basic statistics and models have something important in common. Both are created or calculated based on a limited dataset, but are intended to adequately represent the properties of the whole domain. For a model, this usually means that it would be capable of delivering good quality predictions for arbitrary, possibly previously unseen, instances.
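The relationship between a limited dataset and the whole domain can be imitated in a small Python sketch. Everything in it is an assumption made for illustration: the "domain" is an artificial list of 1000 numbers, the dataset is a random sample of 200 of them, and the fixed seed only serves reproducibility. In practice the domain is not available, which is exactly why the dataset-based value has to stand in for it.

```python
import random
from statistics import mean

# Hypothetical domain: in real applications it is never fully observed.
domain = list(range(1000))

# A limited dataset drawn from the domain (fixed seed for reproducibility).
rng = random.Random(0)
dataset = rng.sample(domain, 200)

domain_mean = mean(domain)    # the property we would like to know
dataset_mean = mean(dataset)  # the statistic we can actually calculate

# How far the dataset-based statistic is from the domain property.
estimation_error = abs(dataset_mean - domain_mean)
```

The statistic is calculated on the dataset only, yet it is read as a statement about the domain, just as a model is fitted on training data but used for other instances.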
Classification performance measures are calculated by comparing the predictions generated by the classifier on a dataset S with the true class labels of the instances from this dataset. The dataset S may be an arbitrary subset of the available labeled dataset, including the training set. As will be discussed more extensively later, though, it is usually separate from the training set and referred to as the validation set or test set. The distinction between these terms is mostly based on the purpose of the model evaluation process. When the evaluation is performed to make some decisions that may affect the final model (e.g., select a classification algorithm, adjust its parameters, select attributes, etc.), which may be called intermediate evaluation, it is a common convention to speak of a validation set. Whenever the performance of the ultimately created model is to be evaluated (final evaluation), one would rather speak of a test set. This terminological distinction is purely conventional and has no impact on performance measures. They can be applied to any dataset, including the training set used to create the model (which determines the model’s training performance), although they would not reliably estimate the model’s generalization properties in that case. We will follow the convention to designate an arbitrary dataset by S and a validation/test set separate from the training set by Q.
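The comparison of predictions with true class labels can be sketched as follows in Python. This is a hedged illustration rather than the book's own code: the misclassification error is used as one representative performance measure, and the threshold classifier, the function names, and the two toy datasets (a training set T and a separate set Q) are assumptions introduced here.

```python
from typing import Callable, Sequence, Tuple

# An instance is an attribute value paired with its true class label.
Instance = Tuple[float, int]


def misclassification_error(classifier: Callable[[float], int],
                            dataset: Sequence[Instance]) -> float:
    """Fraction of instances in the dataset whose predicted class differs from the true one."""
    wrong = sum(classifier(x) != c for x, c in dataset)
    return wrong / len(dataset)


def classifier(x: float) -> int:
    """Hypothetical model assumed to have been created on the training set T."""
    return int(x > 4.5)


# Hypothetical training set T and a separate validation/test set Q.
T = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
Q = [(2.5, 0), (4.0, 0), (4.8, 0), (5.5, 1)]

training_error = misclassification_error(classifier, T)  # training performance
error_on_q = misclassification_error(classifier, Q)      # estimate of generalization
```

The same function is applied to both datasets, which reflects the point that the measure itself does not depend on whether the data is called a training, validation, or test set. In this toy case the training error is lower than the error on Q, which is the kind of optimism that makes training performance an unreliable estimate of generalization.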