Predictive Data Mining

A Practical Guide


  • Sholom Weiss
  • Nitin Indurkhya

The potential business advantages of data mining are well documented in publications for executives and managers. However, developers implementing major data-mining systems need concrete information about the underlying technical principles, and their practical manifestations, in order to either integrate commercially available tools or write data-mining programs from scratch. This book is the first technical guide to provide a complete, generalized roadmap for developing data-mining applications, together with advice on performing these large-scale, open-ended analyses for real-world data warehouses.

Note: If you already own Predictive Data Mining: A Practical Guide, please see ISBN 1-55860-477-4 to order the accompanying software. To order the book/software package, please see ISBN 1-55860-478-2.


Book information

  • Published: August 1997
  • ISBN: 978-1-55860-403-2


"I enjoy reading PREDICTIVE DATA MINING. It presents an excellent perspective on the theory and practice of data mining. It can help educate statisticians to build alliances between statisticians and data miners." --Emanuel Parzen, Distinguished Professor of Statistics, Texas A&M University

Table of Contents

Preface
1 What is Data Mining?
  1.1 Big Data
    1.1.1 The Data Warehouse
    1.1.2 Timelines
  1.2 Types of Data-Mining Problems
  1.3 The Pedigree of Data Mining
    1.3.1 Databases
    1.3.2 Statistics
    1.3.3 Machine Learning
  1.4 Is Big Better?
    1.4.1 Strong Statistical Evaluation
    1.4.2 More Intensive Search
    1.4.3 More Controlled Experiments
    1.4.4 Is Big Necessary
  1.5 The Tasks of Predictive Data Mining
    1.5.1 Data Preparation
    1.5.2 Data Reduction
    1.5.3 Data Modeling and Prediction
    1.5.4 Case and Solution Analyses
  1.6 Data Mining: Art or Science
  1.7 An Overview of the Book
  1.8 Bibliographic and Historical Remarks
2 Statistical Evaluation for Big Data
  2.1 The Idealized Model
    2.1.1 Classical Statistical Comparison and Evaluation
  2.2 It's Big but Is It Biased
    2.2.1 Objective Versus Survey Data
    2.2.2 Significance and Predictive Value
      2.2.2.1 Too Many Comparisons?
  2.3 Classical Types of Statistical Prediction
    2.3.1 Predicting True-or-False: Classification
      2.3.1.1 Error Rates
    2.3.2 Forecasting Numbers: Regression
      2.3.2.1 Distance Measures
  2.4 Measuring Predictive Performance
    2.4.1 Independent Testing
      2.4.1.1 Random Training and Testing
      2.4.1.2 How Accurate Is the Error Estimate?
      2.4.1.3 Comparing Results for Error Measures
      2.4.1.4 Ideal or Real-World Sampling?
      2.4.1.5 Training and Testing from Different Time Periods
  2.5 Too Much Searching and Testing?
  2.6 Why Are Errors Made?
  2.7 Bibliographic and Historical Remarks
3 Preparing the Data
  3.1 A Standard Form
    3.1.1 Standard Measurements
    3.1.2 Goals
  3.2 Data Transformations
    3.2.1 Normalizations
    3.2.2 Data Smoothing
    3.2.3 Differences and Ratios
  3.3 Missing Data
  3.4 Time-Dependent Data
    3.4.1 Time Series
    3.4.2 Composing Features from Time Series
      3.4.2.1 Current Values
      3.4.2.2 Moving Averages
      3.4.2.3 Trends
      3.4.2.4 Seasonal Adjustments
  3.5 Hybrid Time-Dependent Applications
    3.5.1 Multivariate Time Series
    3.5.2 Classification and Time Series
    3.5.3 Standard Cases and Time-Series Attributes
  3.6 Text Mining
  3.7 Bibliographic and Historical Remarks
4 Data Reduction
  4.1 Selecting the Best Features
  4.2 Feature Selection from Means and Variances
    4.2.1 Independent Features
    4.2.2 Distance-Based Optimal Feature Selection
    4.2.3 Heuristic Feature Selection
  4.3 Principal Constraints
  4.4 Feature Selection by Decision Trees
  4.5 How Many Measured Values
    4.5.1 Reducing and Smoothing Values
      4.5.1.1 Rounding
      4.5.1.2 K-Means Clustering
      4.5.1.3 Class Entropy
  4.6 How Many Cases?
    4.6.1 A Single Sample
    4.6.2 Incremental Samples
    4.6.3 Average Samples
    4.6.4 Specialized Case-Reduction Techniques
      4.6.4.1 Sequential Sampling over Time
      4.6.4.2 Strategic Sampling of Key Events
      4.6.4.3 Adjusting Prevalence
  4.7 Bibliographic and Historical Remarks
5 Looking for Solutions
  5.1 Overview
  5.2 Math Solutions
    5.2.1 Linear Scoring
    5.2.2 Nonlinear Scoring: Neural Nets
    5.2.3 Advanced Statistical Methods
  5.3 Distance Solutions
  5.4 Logic Solutions
    5.4.1 Decision Trees
    5.4.2 Decision Rules
  5.5 What Do the Answers Mean?
    5.5.1 Is It Safe to Edit Solutions?
  5.6 Which Solution is Preferable?
  5.7 Combining Different Answers
    5.7.1 Multiple Prediction Methods
    5.7.2 Multiple Samples
  5.8 Bibliographic and Historical Remarks
6 What's Best for Data Reduction and Mining?
  6.1 Let's Analyze Some Real Data
  6.2 The Experimental Methods
  6.3 The Empirical Results
    6.3.1 Significance Testing
  6.4 So What Did We Learn?
    6.4.1 Feature Selection
    6.4.2 Value Reduction
    6.4.3 Subsampling or All Cases
  6.5 Graphical Trend Analysis
    6.5.1 Incremental Case Analysis
    6.5.2 Incremental Complexity Analysis
  6.6 Maximum Data Reduction
  6.7 Are There Winners and Losers in Performance?
  6.8 Getting the Best Results
  6.9 Bibliographic and Historical Remarks
7 Art or Science? Case Studies in Data Mining
  7.1 Why These Case Studies?
  7.2 A Summary of Tasks for Predictive Data Mining
    7.2.1 A Checklist for Data Preparation
    7.2.2 A Checklist for Data Reduction
    7.2.3 A Checklist for Data Modeling and Prediction
    7.2.4 A Checklist for Case and Solution Analyses
  7.3 The Case Studies
    7.3.1 Transaction Processing
    7.3.2 Text Mining
    7.3.3 Outcomes Analysis
    7.3.4 Process Control
    7.3.5 Marketing and User Profiling
    7.3.6 Exploratory Analysis
  7.4 Looking Ahead
  7.5 Bibliographic and Historical Remarks
Appendix: Data-Miner Software Kit