DATA PREPARATION FOR DATA MINING

By
Dorian Pyle, PTI, Leominster

Description


Data Preparation for Data Mining addresses an issue unfortunately ignored by most authorities on data mining: data preparation. Thanks largely to its perceived difficulty, data preparation has traditionally taken a backseat to the more alluring question of how best to extract meaningful knowledge. But without adequate preparation of your data, the return on the resources invested in mining is certain to be disappointing.

Dorian Pyle corrects this imbalance. A twenty-five-year veteran of what has become the data mining industry, Pyle shares his own successful data preparation methodology, offering both a conceptual overview for managers and complete technical details for IT professionals. Apply his techniques and watch your mining efforts pay off in the form of improved performance, reduced distortion, and more valuable results.

On the enclosed CD-ROM, you'll find a suite of programs as C source code and compiled into a command-line-driven toolkit. This code illustrates how the author's techniques can be applied to arrive at an automated preparation solution that works for you. Also included are demonstration versions of three commercial products that help with data preparation, along with sample data with which you can practice and experiment.

Contents
Preface
Introduction

Chapter 1 Data Exploration As a Process
1.1 The Data Exploration Process
1.1.1 Stage 1: Exploring the Problem Space
1.1.2 Stage 2: Exploring the Solution Space
1.1.3 Stage 3: Specifying the Implementation Method
1.1.4 Stage 4: Mining the Data
1.1.5 Exploration: Mining and Modeling
1.2 Data Mining, Modeling and Modeling Tools
1.2.1 Ten Golden Rules
1.2.2 Introducing Modeling Tools
1.2.3 Types of Models
1.2.4 Active and Passive Models
1.2.5 Explanatory and Predictive Models
1.2.6 Static and Continuously-learning Models
1.3 Summary
Supplemental Material

Chapter 2 The Nature of the World and Its Impact on Data Preparation
2.1 Measuring the World
2.1.1 Objects
2.1.2 Capturing Measurements
2.1.3 Errors of Measurement
2.1.4 Tying Measurements to the Real World
2.2 Types of Measurements
2.2.1 Scalar Measurements
2.2.2 Non-scalar Measurements
2.3 Continua of Attributes of Variables
2.3.1 The Qualitative - Quantitative Continuum
2.3.2 The Discrete - Continuous Continuum
2.4 Scale Measurement Example
2.5 Transformations and Difficulties - Variables, Data, and Information
2.6 Building Mineable Data Representations
2.6.1 Data Representation
2.6.2 Building Data - Dealing with Variables
2.6.3 Building Mineable Data Sets
2.7 Summary
Supplemental Material

Chapter 3 Data Preparation as a Process
3.1 Data Preparation: Inputs, Outputs, Models, and Decisions
3.1.1 Step 1: Prepare the Data
3.1.2 Step 2: Survey the Data
3.1.3 Step 3: Model the Data
3.1.4 Use the Model
3.2 Modeling Tools and Data Preparation
3.2.1 How Modeling Tools Drive Data Preparation
3.2.2 Decision Trees
3.2.3 Decision Lists
3.2.4 Neural Networks
3.2.5 Evolution Programs
3.2.6 Modeling Data with the Tools
3.2.7 Predictions and Rules
3.2.8 Choosing Techniques
3.3 Stages of Data Preparation
3.3.1 Stage 1: Data Access
3.3.2 Stage 2: Data Audit
3.3.3 Stage 3: Enhancing and Enriching the Data
3.3.4 Stage 4: Sampling Bias
3.3.5 Stage 5: Data Structure (Super, Macro and Micro)
3.3.6 Stage 6: Building the PIE
3.3.7 Stage 7: Surveying the Data
3.3.8 Stage 8: Modeling
3.4 And the result is . . .?

Chapter 4 Getting the Data: Basic Preparation
4.1 Data Discovery
4.1.1 Data Access Issues
4.2 Data Characterization
4.2.1 Detail / Aggregation Level (Granularity)
4.2.2 Consistency
4.2.3 Pollution
4.2.4 Objects
4.2.5 Relationship
4.2.6 Domain
4.2.7 Defaults
4.2.8 Integrity
4.2.9 Concurrency
4.2.10 Duplicate or Redundant Variables
4.3 Assembling the Data Set
4.3.1 Reverse Pivoting
4.3.2 Feature Extraction
4.3.3 Physical or Behavioral Data Sets
4.3.4 Explanatory Structure
4.3.5 Data Enhancement or Enrichment
4.3.6 Sampling Bias
4.4 Example 1: CREDIT
4.4.1 Looking at the Variables
4.4.2 Relationships between Variables
4.5 Example 2: SHOE
4.5.1 Looking At the Variables
4.5.2 Relationship between Variables
4.6 The Data Assay

Chapter 5 Sampling, Variability and Confidence
5.1 Sampling or First catch your hare!
5.1.1 How much Data?
5.1.2 Variability
5.1.3 Converging on a Representative Sample
5.1.4 Measuring Variability
5.1.5 Variability and Deviation
5.2 Confidence
5.3 Variability of Numeric Variables
5.3.1 Variability and Sampling
5.3.2 Variability and Convergence
5.4 Variability and Confidence in Alpha Variables
5.4.1 Ordering and Rate of Discovery
5.5 Measuring Confidence
5.5.1 Modeling and Confidence with the Whole Population
5.5.2 Testing for Confidence
5.5.3 Confidence Tests and Variability
5.6 Confidence in Capturing Variability
5.6.1 A brief introduction to the Normal Distribution
5.6.2 Normally Distributed Probabilities
5.6.3 Capturing Normally Distributed Probabilities; an Example
5.6.4 Capturing Confidence, Capturing Variance
5.7 Problems and Shortcomings of Taking Samples using Variability
5.7.1 Missing Values
5.7.2 Constants (variables with only one value)
5.7.3 Problems with Sampling
5.7.4 Monotonic Variable Detection
5.7.5 Interstitial Linearity
5.7.6 Rate of Discovery
5.8 Confidence and Instance Count
5.9 Summary
Supplemental Material

Chapter 6 Handling Non-Numerical Variables
6.1 Representing Alphas and Remapping
6.1.1 One-of-n remapping
6.1.2 M-of-n remapping
6.1.3 Remapping to eliminate Ordering
6.1.4 Remapping one-to-many patterns, or ill-formed problems
6.1.5 Remapping Circular Discontinuity
6.2 State Space
6.2.1 Unit State Space
6.2.2 Pythagoras in State Space
6.2.3 Position in State Space
6.2.4 Neighbors and Associates
6.2.5 Density and Sparsity
6.2.6 Nearby and Distant Nearest Neighbors
6.2.7 Normalizing Measured Point Separation
6.2.8 Contours, Peaks and Valleys
6.2.9 Mapping State Space
6.2.10 Objects in State Space
6.2.11 Phase Space
6.2.12 Mapping Alpha Values
6.2.13 Location, location, location!
6.2.14 Numerics, Alphas and the Montreal Canadiens
6.3 Joint Distribution Tables
6.3.1 Two-way Tables
6.3.2 More values, more variables and meaning of the numeration
6.3.3 Dealing with low-frequency alpha labels, and other problems
6.4 Dimensionality
6.4.1 Multi-Dimensional Scaling
6.4.2 Squashing a Triangle
6.4.3 Projecting Alpha Values
6.4.4 Scree Plots
6.5 Practical Consideration - Implementing Alpha Numeration in the Demonstration Code
6.5.1 Implementing Neighborhoods
6.5.2 Implementing Numeration in all Alpha Data Sets
6.5.3 Implementing Dimensionality reduction for Variables
6.6 Summary

Chapter 7 Normalizing and Redistributing Variables
7.1 Normalizing a Variable's Range
7.1.1 Review of Data Preparation and Modeling (Training, Testing and Execution)
7.1.2 The Nature and Scope of the Out-of-Range Values Problem
7.1.3 Discovering the Range of Values When Building the PIE
7.1.4 Out-of-Range Values When Training
7.1.5 Out-of-Range Values When Testing
7.1.6 Out-of-Range Values When Executing
7.1.7 Scaling Transformations
7.1.8 Softmax Scaling
7.1.9 Normalizing Ranges
7.2 Redistributing Variable Values
7.2.1 The Nature of Distributions
7.2.2 Distributive Difficulties
7.2.3 Adjusting Distributions
7.2.4 Modified Distributions
7.3 Summary
Supplemental Material

Chapter 8 Replacing Missing and Empty Values
8.1 Retaining Information about Missing Values
8.1.1 Missing Value Patterns
8.1.2 Capturing Patterns
8.2 Replacing Missing Values
8.2.1 Unbiased Estimators
8.2.2 Variability Relationships
8.2.3 Relationships between Variables
8.2.4 Preserving Between Variable Relationships
8.3 Summary
Supplemental Material

Chapter 9 Series Variables
9.1 Here there be Dragons!
9.2 Types of Series
9.3 Describing Series Data
9.3.1 Constructing a Series
9.3.2 Features of a Series
9.3.3 Describing a Series - Fourier
9.3.4 Describing a Series - Spectrum
9.3.5 Describing a Series - Trend, Season, Cycles, Noise
9.3.6 Describing a Series - Autocorrelation
9.4 Modeling Series Data
9.5 Repairing Series Data Problems
9.5.1 Missing values
9.5.2 Outliers
9.5.3 Non-Uniform Displacement
9.5.4 Trend
9.6 Tools
9.6.1 Filtering
9.6.2 Moving Averages
9.6.3 Smoothing 1 - PVM Smoothing
9.6.4 Smoothing 2 - Median Smoothing, Resmoothing and Hanning
9.6.5 Extraction
9.6.6 Differencing
9.7 Other Problems
9.7.1 Numerating Alpha Values
9.7.2 Distribution
9.8 Preparing Series Data
9.8.1 Looking at the Data
9.8.2 Signposts on the Rocky Road
9.9 Implementation Notes

Chapter 10 Preparing the Data Set
10.1 Using Sparsely Populated Variables
10.1.1 Increasing Information Density using Sparsely Populated Variables
10.1.2 Binning Sparse Numerical Values
10.1.3 Present Value Patterns (PVPs)
10.2 Problems with High Dimensionality Data Sets
10.2.1 Information Representation
10.2.2 Representing High Dimensionality Data in Less Dimensions
10.3 Introducing the Neural Network
10.3.1 Training a Neural Network
10.3.2 Neurons
10.3.3 Reshaping the Logistic Curve
10.3.4 Single Input Neurons
10.3.5 Multiple Input Neurons
10.3.6 Networking Neurons to Estimate a Function
10.3.7 Network Learning
10.3.8 Network Prediction - Hidden Layer
10.3.9 Network Prediction - Output Layer
10.3.10 Stochastic Network Performance
10.3.11 Network Architecture 1 - The Autoassociative Network
10.3.12 Network Architecture 2 - The Sparsely Connected Network
10.4 Compressing Variables
10.4.1 Using Compressed Dimensionality Data
10.5 Removing Variables
10.5.1 Estimating Variable Importance 1: What Doesn't Work
10.5.2 Estimating Variable Importance 2: Clues
10.5.3 Estimating Variable Importance 3: Configuring and Training the Network
10.6 How Much Data is Enough?
10.6.1 Joint Distribution
10.6.2 Capturing Joint Variability
10.6.3 Degrees of Freedom
10.7 Beyond Joint Distribution
10.7.1 Enhancing the Data Set
10.7.2 Data Sets in Perspective
10.8 Implementation Notes
10.8.1 Collapsing Extremely Sparsely Populated Variables
10.8.2 Reducing Excessive Dimensionality
10.8.3 Measuring Variable Importance
10.8.4 Feature Enhancement
10.9 Where Next?

Chapter 11 The Data Survey
11.1 Introduction to the Data Survey
11.2 Information and Communication
11.2.1 Measuring Information: Signals and Dictionaries
11.2.2 Measuring Information: Signals
11.2.3 Measuring Information: Bits of Information
11.2.4 Measuring Information: Surprise
11.2.5 Measuring Information: Entropy
11.2.6 Measuring Information: Dictionaries
11.3 Mapping using Entropy
11.3.1 Whole Data Set Entropy
11.3.2 Conditional Entropy between Inputs and Outputs
11.3.3 Mutual Information
11.3.4 Other Survey Uses for Entropy and Information
11.3.5 Looking for Information
11.4 Identifying Problems with a Data Survey
11.4.1 Confidence and Sufficient Data
11.4.2 Detecting Sparsity
11.4.3 Manifold Definition
11.5 Clusters
11.6 Sampling Bias
11.7 Making the Data Survey
11.8 Other Directions
Supplemental Material

Chapter 12 Using Prepared Data
12.1 Modeling Data
12.1.1 Assumptions
12.1.2 Models
12.1.3 Data Mining versus Exploratory Data Analysis
12.2 Modeling Data
12.2.1 Decision Trees
12.2.2 Clusters
12.2.3 Nearest Neighbor
12.2.4 Neural Networks and Regression
12.3 Prepared Data and Modeling Algorithms
12.3.1 Neural Networks and the CREDIT Data Set
12.3.2 Decision Trees and the CREDIT Data Set
12.4 Practically using Data Preparation and Prepared Data
12.5 Looking at Present Modeling Tools, and Future Directions
12.5.1 Near Future
12.5.2 Farther out

Appendix A Using the Demonstration Code on the CD
Appendix B Further Reading
Index

Bibliographic details
Paperback, 560 pages, publication date: MAR-1999
ISBN-13: 978-1-55860-529-9
ISBN-10: 1-55860-529-0
Imprint: MORGAN KAUFMANN

Price and Ordering
Price:
EUR 61.95
GBP 40.99
USD 71.95
Books and book related electronic products are priced in US dollars (USD), euro (EUR), and Great Britain Pounds (GBP). USD prices apply to the Americas and Asia Pacific. EUR prices apply in Europe and the Middle East. GBP prices apply to the UK and all other countries.

Last update: 27 Sep 2008
 Copyright © 2008 Elsevier B.V. All rights reserved.