# Data Preparation for Data Mining Using SAS

**By**

- Mamdouh Refaat, Consultant

Are you a data mining analyst who spends up to 80% of your time assuring data quality and then preparing that data for developing and deploying predictive models? Do you find plenty of literature on data mining theory and concepts, but little "how to" information when it comes to practical advice on developing good mining views? And are you, like most analysts, preparing the data in SAS? This book is intended to fill this gap as your source of practical recipes. It introduces a framework for the process of data preparation for data mining and presents the detailed implementation of each step in SAS. In addition, business applications of data mining modeling require you to deal with a large number of variables, typically hundreds if not thousands. The book therefore devotes several chapters to the methods of data transformation and variable selection.

### Audience

Data mining professionals, business analysts, SAS programmers, and data management and statistics students who plan to work in data mining. Essentially the same audience as all of our data mining books.

### Book information

- Published: September 2006
- Imprint: MORGAN KAUFMANN
- ISBN: 978-0-12-373577-5

### Reviews

It is easy to write books that address broad topics and ideas leaving the reader with the question "Yes, but how?" By combining a comprehensive guide to data preparation for data mining along with specific examples in SAS, Mamdouh's book is a rare find: a blend of theory and the practical at the same time. As anyone who has mined data will confess, 80% of the problem is in data preparation; Mamdouh addresses this difficult subject with strong practical techniques and methods. If you are working on an SAS data mining project, this book is a must! If you are working on any data mining project, the techniques and methods will be a guiding light! --Frank Byrum, Cormine Intelligent Data, LLC

### Table of Contents

- 1 Introduction
  - 1.1 The Data Mining Process
  - 1.2 Methodologies of Data Mining
  - 1.3 The Mining View
  - 1.4 Scoring View
  - 1.5 Notes on Data Mining Software
- 2 Tasks and Data Flow
  - 2.1 Data Mining Tasks
  - 2.2 Data Mining Competencies
  - 2.3 The Data Flow
  - 2.4 Types of Variables
  - 2.5 The Mining View and the Scoring View
  - 2.6 Steps of Data Preparation
- 3 Review of Data Mining Modeling Techniques
  - 3.1 Introduction
  - 3.2 Regression Models
  - 3.3 Decision Trees
  - 3.4 Neural Networks
  - 3.5 Cluster Analysis
  - 3.6 Association Rules
  - 3.7 Time Series Analysis
  - 3.8 Support Vector Machines
- 4 SAS Macros: A Quick Start
  - 4.1 Introduction: Why Macros
  - 4.2 The Basics - The Macro and Its Variables
  - 4.3 Doing Calculations
  - 4.4 Programming Logic
  - 4.5 Working with Strings
  - 4.6 Macros that Call Other Macros
  - 4.7 Common Macro Patterns and Caveats
  - 4.8 Where to Go From Here
- 5 Data Acquisition and Integration
  - 5.1 Introduction
  - 5.2 Sources of Data
  - 5.3 Variable Types
  - 5.4 Data Roll Up
  - 5.5 Roll Up With Sums, Averages and Counts
  - 5.6 Calculation of the Mode
  - 5.7 Data Integration
- 6 Integrity Checks
  - 6.1 Introduction
  - 6.2 Comparing Datasets
  - 6.3 Dataset Schema Checks
    - 6.3.2 Variable Types
  - 6.4 Nominal Variables
  - 6.5 Continuous Variables
- 7 Exploratory Data Analysis
  - 7.1 Introduction
  - 7.2 Common EDA Procedures
  - 7.3 Univariate Statistics
  - 7.4 Variable Distribution
  - 7.5 Detection of Outliers
    - 7.5.4 Notes on Outliers
  - 7.6 Testing Normality
  - 7.7 Cross-tabulation
  - 7.8 Investigating Data Structures
- 8 Sampling and Partitioning
  - 8.1 Introduction
  - 8.2 Contents of Samples
  - 8.3 Random Sampling
  - 8.4 Balanced Sampling
  - 8.5 Minimum Sample Size
- 9 Data Transformations
  - 9.1 Raw and Analytical Variables
  - 9.2 Scope of Data Transformations
  - 9.3 Creation of New Variables
  - 9.4 Mapping of Nominal Variables
  - 9.5 Normalization of Continuous Variables
  - 9.6 Changing the Variable Distribution
- 10 Binning and Reduction of Cardinality
  - 10.1 Introduction
  - 10.2 Cardinality Reduction
    - 10.2.1 The Main Questions
    - 10.2.2 Structured Grouping Methods
    - 10.2.3 Splitting a Dataset
    - 10.2.4 The Main Algorithm
    - 10.2.5 Reduction of Cardinality Using Gini Measure
    - 10.2.6 Limitations and Modifications
  - 10.3 Binning of Continuous Variables
- 11 Treatment of Missing Values
  - 11.1 Introduction
  - 11.2 Simple Replacement
  - 11.3 Imputing Missing Values
    - 11.3.1 Basic Issues in Multiple Imputation
    - 11.3.2 Patterns of Missingness
  - 11.4 Imputation Methods and Strategy
  - 11.5 SAS Macros for Multiple Imputation Nominal Variables
  - 11.6 Predicting Missing Values
- 12 Predictive Power and Variable Reduction I
  - 12.1 Introduction
  - 12.2 Metrics of Predictive Power
  - 12.3 Methods of Variable Reduction
  - 12.4 Variable Reduction: Before or During Modeling
- 13 Analysis of Nominal and Ordinal Variables
  - 13.1 Introduction
  - 13.2 Contingency Tables
  - 13.3 Notation and Definitions
  - 13.4 Contingency Tables for Binary Variables
  - 13.5 Contingency Tables for Multi-Category Variables
  - 13.6 Analysis of Ordinal Variables
  - 13.7 Implementation Scenarios
- 14 Analysis of Continuous Variables
  - 14.1 Introduction
  - 14.2 When is Binning Necessary?
  - 14.3 Measures of Association
  - 14.4 Correlation Coefficients
- 15 Principal Component Analysis (PCA)
  - 15.1 Introduction
  - 15.2 Mathematical Formulations
  - 15.3 Implementing and Using PCA
  - 15.4 Comments on Using PCA
    - 15.4.1 Number of Principal Components
    - 15.4.2 Success of PCA
    - 15.4.3 Nominal Variables
    - 15.4.4 Dataset Size and Performance
- 16 Factor Analysis
  - 16.1 Introduction to Factor Analysis
  - 16.2 Relationship between PCA and FA
  - 16.3 Implementation of Factor Analysis
- 17 Predictive Power and Variable Reduction II
  - 17.1 Introduction
  - 17.2 Data with Binary Dependent Variables
  - 17.3 Nominal IV's
    - 17.3.2 Ordinal IV's
  - 17.4 Variable Reduction Strategies
- 18 Putting it All Together
  - 18.1 Introduction
  - 18.2 The Process of Data Preparation
  - 18.3 Case Study: The Bookstore
- A Listing of SAS Macros
  - A.1 Copyright and Software License
  - A.2 Dependencies between Macros
  - A.3 Data Acquisition and Integration
  - A.4 Integrity Checks
  - A.5 Exploratory Data Analysis
  - A.6 Sampling and Partitioning
  - A.7 Data Transformations
  - A.8 Binning and Reduction of Cardinality
  - A.9 Treatment of Missing Values
  - A.10 Analysis of Nominal and Ordinal Variables
  - A.11 Analysis of Continuous Variables
  - A.12 Principal Component Analysis