By Dorian Pyle, PTI, Leominster
Description
Data Preparation for Data Mining addresses an issue unfortunately ignored by most authorities on data mining: data preparation.
Thanks largely to its perceived difficulty, data preparation has traditionally taken a backseat to the more alluring question of how
best to extract meaningful knowledge. But without adequate preparation of your data, the return on the resources invested in mining is
certain to be disappointing.
Dorian Pyle corrects this imbalance. A twenty-five-year veteran of what has become the data mining industry,
Pyle shares his own successful data preparation methodology, offering both a conceptual overview for managers and complete technical
details for IT professionals. Apply his techniques and watch your mining efforts pay off in the form of improved performance, reduced
distortion, and more valuable results.
On the enclosed CD-ROM, you'll find a suite of programs as C source code and compiled into
a command-line-driven toolkit. This code illustrates how the author's techniques can be applied to arrive at an automated preparation
solution that works for you. Also included are demonstration versions of three commercial products that help with data preparation, along
with sample data with which you can practice and experiment.
Contents
Preface
Introduction
Chapter 1 Data Exploration As a Process
1.1 The Data Exploration Process
1.1.1 Stage 1: Exploring the Problem Space
1.1.2 Stage 2: Exploring the Solution Space
1.1.3 Stage 3: Specifying the Implementation Method
1.1.4 Stage 4: Mining the Data
1.1.5 Exploration: Mining and Modeling
1.2 Data Mining, Modeling and Modeling Tools
1.2.1 Ten Golden Rules
1.2.2 Introducing Modeling Tools
1.2.3 Types of Models
1.2.4 Active and Passive Models
1.2.5 Explanatory and Predictive Models
1.2.6 Static and Continuously-learning Models
1.3 Summary
Supplemental Material
Chapter 2 The Nature of the World and Its Impact on Data Preparation
2.1 Measuring the World
2.1.1 Objects
2.1.2 Capturing Measurements
2.1.3 Errors of Measurement
2.1.4 Tying Measurements to the Real World
2.2 Types of Measurements
2.2.1 Scalar Measurements
2.2.2 Non-scalar Measurements
2.3 Continua of Attributes of Variables
2.3.1 The Qualitative - Quantitative Continuum
2.3.2 The Discrete - Continuous Continuum
2.4 Scale Measurement Example
2.5 Transformations and Difficulties - Variables, Data, and Information
2.6 Building Mineable Data Representations
2.6.1 Data Representation
2.6.2 Building Data - Dealing with Variables
2.6.3 Building Mineable Data Sets
2.7 Summary
Supplemental Material
Chapter 3 Data Preparation as a Process
3.1 Data Preparation: Inputs, Outputs, Models, and Decisions
3.1.1 Step 1: Prepare the Data
3.1.2 Step 2: Survey the Data
3.1.3 Step 3: Model the Data
3.1.4 Use the Model
3.2 Modeling Tools and Data Preparation
3.2.1 How Modeling Tools Drive Data Preparation
3.2.2 Decision Trees
3.2.3 Decision Lists
3.2.4 Neural Networks
3.2.5 Evolution Programs
3.2.6 Modeling Data with the Tools
3.2.7 Predictions and Rules
3.2.8 Choosing Techniques
3.3 Stages of Data Preparation
3.3.1 Stage 1: Data Access
3.3.2 Stage 2: Data Audit
3.3.3 Stage 3: Enhancing and Enriching the Data
3.3.4 Stage 4: Sampling Bias
3.3.5 Stage 5: Data Structure (Super, Macro and Micro)
3.3.6 Stage 6: Building the PIE
3.3.7 Stage 7: Surveying the Data
3.3.8 Stage 8: Modeling
3.4 And the result is . . .?
Chapter 4 Getting the Data: Basic Preparation
4.1 Data Discovery
4.1.1 Data Access Issues
4.2 Data Characterization
4.2.1 Detail / Aggregation Level (Granularity)
4.2.2 Consistency
4.2.3 Pollution
4.2.4 Objects
4.2.5 Relationship
4.2.6 Domain
4.2.7 Defaults
4.2.8 Integrity
4.2.9 Concurrency
4.2.10 Duplicate or Redundant Variables
4.3 Assembling the Data Set
4.3.1 Reverse Pivoting
4.3.2 Feature Extraction
4.3.3 Physical or Behavioral Data Sets
4.3.4 Explanatory Structure
4.3.5 Data Enhancement or Enrichment
4.3.6 Sampling Bias
4.4 Example 1: CREDIT
4.4.1 Looking at the Variables
4.4.2 Relationships between Variables
4.5 Example 2: SHOE
4.5.1 Looking at the Variables
4.5.2 Relationship between Variables
4.6 The Data Assay
Chapter 5 Sampling, Variability and Confidence
5.1 Sampling or First catch your hare!
5.1.1 How much Data?
5.1.2 Variability
5.1.3 Converging on a Representative Sample
5.1.4 Measuring Variability
5.1.5 Variability and Deviation
5.2 Confidence
5.3 Variability of Numeric Variables
5.3.1 Variability and Sampling
5.3.2 Variability and Convergence
5.4 Variability and Confidence in Alpha Variables
5.4.1 Ordering and Rate of Discovery
5.5 Measuring Confidence
5.5.1 Modeling and Confidence with the Whole Population
5.5.2 Testing for Confidence
5.5.3 Confidence Tests and Variability
5.6 Confidence in Capturing Variability
5.6.1 A brief introduction to the Normal Distribution
5.6.2 Normally Distributed Probabilities
5.6.3 Capturing Normally Distributed Probabilities; an Example
5.6.4 Capturing Confidence, Capturing Variance
5.7 Problems and Shortcomings of Taking Samples using Variability
5.7.1 Missing Values
5.7.2 Constants (variables with only one value)
5.7.3 Problems with Sampling
5.7.4 Monotonic Variable Detection
5.7.5 Interstitial Linearity
5.7.6 Rate of Discovery
5.8 Confidence and Instance Count
5.9 Summary
Supplemental Material
Chapter 6 Handling Non-Numerical Variables
6.1 Representing Alphas and Remapping
6.1.1 One-of-n remapping
6.1.2 M-of-n remapping
6.1.3 Remapping to eliminate Ordering
6.1.4 Remapping one-to-many patterns, or ill-formed problems
6.1.5 Remapping Circular Discontinuity
6.2 State Space
6.2.1 Unit State Space
6.2.2 Pythagoras in State Space
6.2.3 Position in State Space
6.2.4 Neighbors and Associates
6.2.5 Density and Sparsity
6.2.6 Nearby and Distant Nearest Neighbors
6.2.7 Normalizing Measured Point Separation
6.2.8 Contours, Peaks and Valleys
6.2.9 Mapping State Space
6.2.10 Objects in State Space
6.2.11 Phase Space
6.2.12 Mapping Alpha Values
6.2.13 Location, location, location!
6.2.14 Numerics, Alphas and the Montreal Canadiens
6.3 Joint Distribution Tables
6.3.1 Two-way Tables
6.3.2 More values, more variables and meaning of the numeration
6.3.3 Dealing with low-frequency alpha labels, and other problems
6.4 Dimensionality
6.4.1 Multi-Dimensional Scaling
6.4.2 Squashing a Triangle
6.4.3 Projecting Alpha Values
6.4.4 Scree Plots
6.5 Practical Consideration - Implementing Alpha Numeration in the Demonstration Code
6.5.1 Implementing Neighborhoods
6.5.2 Implementing Numeration in all Alpha Data Sets
6.5.3 Implementing Dimensionality reduction for Variables
6.6 Summary
Chapter 7 Normalizing and Redistributing Variables
7.1 Normalizing a Variable's Range
7.1.1 Review of Data Preparation and Modeling (Training, Testing and Execution)
7.1.2 The Nature and Scope of the Out-of-Range Values Problem
7.1.3 Discovering the Range of Values When Building the PIE
7.1.4 Out-of-Range Values When Training
7.1.5 Out-of-Range Values When Testing
7.1.6 Out-of-Range Values When Executing
7.1.7 Scaling Transformations
7.1.8 Softmax Scaling
7.1.9 Normalizing Ranges
7.2 Redistributing Variable Values
7.2.1 The Nature of Distributions
7.2.2 Distributive Difficulties
7.2.3 Adjusting Distributions
7.2.4 Modified Distributions
7.3 Summary
Supplemental Material
Chapter 8 Replacing Missing and Empty Values
8.1 Retaining Information about Missing Values
8.1.1 Missing Value Patterns
8.1.2 Capturing Patterns
8.2 Replacing Missing Values
8.2.1 Unbiased Estimators
8.2.2 Variability Relationships
8.2.3 Relationships between Variables
8.2.4 Preserving Between Variable Relationships
8.3 Summary
Supplemental Material
Chapter 9 Series Variables
9.1 Here there be Dragons!
9.2 Types of Series
9.3 Describing Series Data
9.3.1 Constructing a Series
9.3.2 Features of a Series
9.3.3 Describing a Series - Fourier
9.3.4 Describing a Series - Spectrum
9.3.5 Describing a Series - Trend, Season, Cycles, Noise
9.3.6 Describing a Series - Autocorrelation
9.4 Modeling Series Data
9.5 Repairing Series Data Problems
9.5.1 Missing values
9.5.2 Outliers
9.5.3 Non-Uniform Displacement
9.5.4 Trend
9.6 Tools
9.6.1 Filtering
9.6.2 Moving Averages
9.6.3 Smoothing 1 - PVM Smoothing
9.6.4 Smoothing 2 - Median Smoothing, Resmoothing and Hanning
9.6.5 Extraction
9.6.6 Differencing
9.7 Other Problems
9.7.1 Numerating Alpha Values
9.7.2 Distribution
9.8 Preparing Series Data
9.8.1 Looking at the Data
9.8.2 Signposts on the Rocky Road
9.9 Implementation Notes
Chapter 10 Preparing the Data Set
10.1 Using Sparsely Populated Variables
10.1.1 Increasing Information Density using Sparsely Populated Variables
10.1.2 Binning Sparse Numerical Values
10.1.3 Present Value Patterns (PVPs)
10.2 Problems with High Dimensionality Data Sets
10.2.1 Information Representation
10.2.2 Representing High Dimensionality Data in Less Dimensions
10.3 Introducing the Neural Network
10.3.1 Training a Neural Network
10.3.2 Neurons
10.3.3 Reshaping the Logistic Curve
10.3.4 Single Input Neurons
10.3.5 Multiple Input Neurons
10.3.6 Networking Neurons to Estimate a Function
10.3.7 Network Learning
10.3.8 Network Prediction - Hidden Layer
10.3.9 Network Prediction - Output Layer
10.3.10 Stochastic Network Performance
10.3.11 Network Architecture 1 - The Autoassociative Network
10.3.12 Network Architecture 2 - The Sparsely Connected Network
10.4 Compressing Variables
10.4.1 Using Compressed Dimensionality Data
10.5 Removing Variables
10.5.1 Estimating Variable Importance 1: What Doesn't Work
10.5.2 Estimating Variable Importance 2: Clues
10.5.3 Estimating Variable Importance 3: Configuring and Training the Network
10.6 How Much Data is Enough?
10.6.1 Joint Distribution
10.6.2 Capturing Joint Variability
10.6.3 Degrees of Freedom
10.7 Beyond Joint Distribution
10.7.1 Enhancing the Data Set
10.7.2 Data Sets in Perspective
10.8 Implementation Notes
10.8.1 Collapsing Extremely Sparsely Populated Variables
10.8.2 Reducing Excessive Dimensionality
10.8.3 Measuring Variable Importance
10.8.4 Feature Enhancement
10.9 Where Next?
Chapter 11 The Data Survey
11.1 Introduction to the Data Survey
11.2 Information and Communication
11.2.1 Measuring Information: Signals and Dictionaries
11.2.2 Measuring Information: Signals
11.2.3 Measuring Information: Bits of Information
11.2.4 Measuring Information: Surprise
11.2.5 Measuring Information: Entropy
11.2.6 Measuring Information: Dictionaries
11.3 Mapping using Entropy
11.3.1 Whole Data Set Entropy
11.3.2 Conditional Entropy between Inputs and Outputs
11.3.3 Mutual Information
11.3.4 Other Survey Uses for Entropy and Information
11.3.5 Looking for Information
11.4 Identifying Problems with a Data Survey
11.4.1 Confidence and Sufficient Data
11.4.2 Detecting Sparsity
11.4.3 Manifold Definition
11.5 Clusters
11.6 Sampling Bias
11.7 Making the Data Survey
11.8 Other Directions
Supplemental Material
Chapter 12 Using Prepared Data
12.1 Modeling Data
12.1.1 Assumptions
12.1.2 Models
12.1.3 Data Mining versus Exploratory Data Analysis
12.2 Modeling Data
12.2.1 Decision Trees
12.2.2 Clusters
12.2.3 Nearest Neighbor
12.2.4 Neural Networks and Regression
12.3 Prepared Data and Modeling Algorithms
12.3.1 Neural Networks and the CREDIT Data Set
12.3.2 Decision Trees and the CREDIT Data Set
12.4 Practically using Data Preparation and Prepared Data
12.5 Looking at Present Modeling Tools, and Future Directions
12.5.1 Near Future
12.5.2 Farther out
Appendix A Using the Demonstration Code on the CD
Appendix B Further Reading
Index