Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications

1st Edition - January 11, 2012
Authors: Gary D. Miner, John Elder, Andrew Fast, Thomas Hill, Robert Nisbet, Dursun Delen
Language: English
eBook ISBN: 978-0-12-387011-7

Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications brings together all the information, tools and methods a professional will need to efficiently use text mining applications and statistical analysis.

Winner of a 2012 PROSE Award in Computing and Information Sciences from the Association of American Publishers, this book presents a comprehensive how-to reference that shows the user how to conduct text mining and statistically analyze results. In addition to providing an in-depth examination of core text mining and link detection tools, methods and operations, the book examines advanced preprocessing techniques, knowledge representation considerations, and visualization approaches. Finally, the book explores current real-world, mission-critical applications of text mining and link detection using real-world example tutorials in such varied fields as corporate finance, business intelligence, genomics research, and counterterrorism activities.

The world contains an unimaginably vast amount of digital information which is getting ever vaster ever more rapidly. This makes it possible to do many things that previously could not be done: spot business trends, prevent diseases, combat crime and so on. Managed well, the textual data can be used to unlock new sources of economic value, provide fresh insights into science and hold governments to account. As the Internet expands and our natural capacity to process the unstructured text that it contains diminishes, the value of text mining for information retrieval and search will increase dramatically.

Dedication

Endorsements for Practical Text Mining & Statistical Analysis for Non-structured Text Data Applications

Foreword 1

Foreword 2

Foreword 3

Acknowledgments

Preface

About the Authors

Introduction

Building the Workshop Manual

Communication

The Structure of this Book

Part I: Basic Text Mining Principles

Part II: Tutorials

Part III: Advanced Topics

Tutorials

Why Did We Write This Book?

What Are the Benefits of Text Mining?

Blast Off!

References

List of Tutorials by Guest Authors

Part I: Basic Text Mining Principles

Chapter 1. The History of Text Mining

Preamble

The Roots of Text Mining: Information Retrieval, Extraction, and Summarization

Information Extraction and Modern Text Mining

Major Innovations in Text Mining since 2000

The Development of Enabling Technology in Text Mining

Emerging Applications in Text Mining

Sentiment Analysis and Opinion Mining

IBM’s Watson: An “Intelligent” Text Mining Machine?

What’s Next?

Postscript

References

Chapter 2. The Seven Practice Areas of Text Analytics

Preamble

What is Text Mining?

The Seven Practice Areas of Text Analytics

Five Questions for Finding the Right Practice Area

The Seven Practice Areas in Depth

Interactions between the Practice Areas

Scope of This Book

Summary

Postscript

References

Chapter 3. Conceptual Foundations of Text Mining and Preprocessing Steps

Preamble

Introduction

Syntax versus Semantics

The Generalized Vector-Space Model

Preprocessing Text

Creating Vectors from Processed Text

Summary

Postscript

Reference

Chapter 4. Applications and Use Cases for Text Mining

Preamble

Why Is Text Mining Useful?

Extracting “Meaning” from Unstructured Text

Summarizing Text

Common Approaches to Extracting Meaning

Extracting Information through Statistical Natural Language Processing

Statistical Analysis of Dimensions of Meaning

Beyond Statistical Analysis of Word Frequencies: Parsing and Analyzing Syntax

Review

Improving Accuracy in Predictive Modeling

Using Statistical Natural Language Processing to Improve Lift

Using Dictionaries to Improve Prediction

Identifying Similarity and Relevance by Searching

Part of Speech Tagging and Entity Extraction

Summary

Postscript

References

Chapter 5. Text Mining Methodology

Preamble

Text Mining Applications

Cross-Industry Standard Process for Data Mining (CRISP-DM)

Example 1: An Exploratory Literature Survey Using Text Mining

Postscript

References

Chapter 6. Three Common Text Mining Software Tools

Preamble

Introduction

IBM SPSS Modeler Premium

SAS Text Miner

About the Scenarios in This SAS Section

Tips for Text Mining

STATISTICA Text Miner

Summary: STATISTICA Text Miner

Postscript

Part II: Introduction to the Tutorial and Case Study Section of This Book

Introduction

Reference

Tutorial AA. Case Study: Using the Social Share of Voice to Predict Events That Are about to Happen

Analysis

Summary

Tutorial BB. Mining Twitter for Airline Consumer Sentiment

Introduction

What Is R?

Loading Data into R

The twitteR Package

Extracting Text from Tweets

The plyr Package

Estimating Sentiment

Loading the Opinion Lexicon

Implementing Our Sentiment Scoring Algorithm

Algorithm Sanity Check

data.frames Hold Tabular Data

Scoring the Tweets

Repeat for Each Airline

Compare the Score Distributions

Ignore the Middle

Compare with ACSI’s Customer Satisfaction Index

Scrape the ACSI Website

Compare Twitter Results with ACSI Scores

Graph the Results

Notes and Acknowledgments

References

Tutorial A. Using STATISTICA Text Miner to Monitor and Predict Success of Marketing Campaigns Based on Social Media Data

Introduction

The Key Issue

Step 1: Collecting Data

Step 2: Monitoring the Situation

Step 3: Creating Predictive Models

Step 4: Performing a “What-If” Analysis of the Marketing Campaigns

Step 5: Performing Sentiment Analysis

Summary

Tutorial B. Text Mining Improves Model Performance in Predicting Airplane Flight Accident Outcome

Introduction

The Data

Text Mining the Data

Text Mining Results

Data Preparation

Using Text Mining Results to Build Predictive Models

Tutorial C. Insurance Industry: Text Analytics Adds “Lift” to Predictive Models with STATISTICA Text and Data Miner

Introduction

Data Description

Part A: Comparing the Lift of Predictive Models with and without Text Mining

Boosted Trees (without Text Material)

Boosted Trees Adding the Text Mining Variables

How to Merge Graphs

Part B: Enterprise Deployment

Summary

Tutorial D. Analysis of Survey Data for Establishing the “Best Medical Survey Instrument” Using Text Mining

Introduction

The Analysis

Summary

Tutorial E. Analysis of Survey Data for Establishing “Best Medical Survey Instrument” Using Text Mining: Central Asian (Russian Language) Study Tutorial 2: Potential for Constructing Instruments That Have Increased Validity

Introduction

The Analysis

Summary

Tutorial F. Using eBay Text for Predicting ATLAS Instrumental Learning

Introduction

Examining the Data by Types

Summary

Reference

Tutorial G. Text Mining for Patterns in Children’s Sleep Disorders Using STATISTICA Text Miner

Setting Up the Analysis

Reviewing Results

Summary

Tutorial H. Extracting Knowledge from Published Literature Using RapidMiner

Introduction

Motivation

A Brief Introduction to RapidMiner

Text Analytics in RapidMiner

Starting a New Process

Summary

Reference

Tutorial I. Text Mining Speech Samples: Can the Speech of Individuals Diagnosed with Schizophrenia Differentiate Them from Unaffected Controls?

Introduction

Objectives

Case Study: The Steps Used to Prepare the Data

Results and Analysis

Summary

References

Tutorial J. Text Mining Using STM™, CART®, and TreeNet® from Salford Systems: Analysis of 16,000 iPod Auctions on eBay

Installing the Salford Text Miner

Comments on the Challenge

Tutorial K. Predicting Micro Lending Loan Defaults Using SAS® Text Miner

Introduction

About SAS® Text Miner

Project Overview

Preparing the Data and Setting Up the Diagram

Creating a New Project

Registering the Table

Creating a New Diagram

Text Filter Node

Text Topic Node

Creating the Text Mining Flow

Inserting the Data

Understanding Text Parsing

Synonyms and Multiterm Words

Defining Topics

Other Uses of the Interactive Topic Viewer

Making the Predictive Model

Final Results

Viewing the Reports

Text Only Decision Tree

All Variable Text and Relational

Conclusion

Tutorial L. Opera Lyrics: Text Analytics Compared by the Composer and the Century of Composition—Wagner versus Puccini

Tutorial M. Case Study: Sentiment-Based Text Analytics to Better Predict Customer Satisfaction and Net Promoter® Score Using IBM®SPSS® Modeler

Introduction

Business Objectives

Case Study

Creating New Categories and Adding Missing Descriptors

Results and Analysis

Summary

References

Tutorial N. Case Study: Detecting Deception in Text with Freely Available Text and Data Mining Tools

Introduction

General Architecture for Text Engineering

Linguistic Inquiry and Word Count

Working with General Architecture for Text Engineering and Linguistic Inquiry and Word Count Output

Summary

References

Tutorial O. Predicting Box Office Success of Motion Pictures with Text Mining

Introduction

Analysis

Summary

References

Tutorial P. A Hands-On Tutorial of Text Mining in PASW: Clustering and Sentiment Analysis Using Tweets from Twitter

Introduction

Objective

Case Study

Categorization

Cluster Analysis

Analyzing Text Links

Additional Settings

Summary

Tutorial Q. A Hands-On Tutorial on Text Mining in SAS®: Analysis of Customer Comments for Clustering and Predictive Modeling

Introduction

Objective

Case Study

Summary

References

Tutorial R. Scoring Retention and Success of Incoming College Freshmen Using Text Analytics

Introduction

Part I. Predictive Modeling Using Only the Numeric Variables

Part II. Text Mining and Text Variables’ Word Frequencies and Concepts

Tutorial S. Searching for Relationships in Product Recall Data from the Consumer Product Safety Commission with STATISTICA Text Miner

Specifying the Analysis

Reviewing the Results

Tutorial T. Potential Problems That Can Arise in Text Mining: Example Using NALL Aviation Data

Introduction

Spelling Errors

Example: Finding Spelling Errors in Text Miner

Combine Words

Misspellings as Synonyms

Unexpected Terms

Example: Finding Unexpected Terms

Different File Types

Summary

Tutorial U. Exploring the Unabomber Manifesto Using Text Miner

Introduction

Summarizing the Text

Searching for Trends with Pronouns

References

Tutorial V. Text Mining PubMed: Extracting Publications on Genes and Genetic Markers Associated with Migraine Headaches from PubMed Abstracts

Tutorial W. Case Study: The Problem with the Use of Medical Abbreviations by Physicians and Health Care Providers

The Present Problem in the Use of Medical Abbreviations by Physicians and Health Care Providers

TJC (JCAHO) “Do Not Use” Abbreviations

Additional Abbreviations, Acronyms, and Symbols

Using the “Text Mining Project” Format of STATISTICA Text Miner

Using TextMiner3.dbs

Conclusion

Intervention Training Needed

References

Tutorial X. Classifying Documents with Respect to “Earnings” and Then Making a Predictive Model for the Target Variable Using Decision Trees, MARSplines, Naïve Bayes Classifier, and K-Nearest Neighbors with STATISTICA Text Miner

Introduction: Automatic Text Classification

Data File with File References

Specifying the Analysis

Processing the Data Analysis

Saving the Extracted Word Frequencies to the Input File

Initial Feature Selection

General Classification and Regression Trees

K-Nearest Neighbors Modeling

Conclusion

Reference

Tutorial Y. Case Study: Predicting Exposure of Social Messages: The Bin Laden Live Tweeter

Introduction

Analysis

Summary

Tutorial Z. The InFLUence Model: Web Crawling, Text Mining, and Predictive Analysis with 2010–2011 Influenza Guidelines—CDC, IDSA, WHO, and FMC

Abstract

Web Crawling and Text Mining of CDC Documents on FLU

Feature Selection

MARSplines Interactive Module Modeling

Boosted Trees

Naïve Bayes Modeling

K-Nearest Neighbors

Part III: Advanced Topics

Chapter 7. Text Classification and Categorization

Preamble

Introduction

Defining a Classification Problem

Feature Creation

Text Classification Algorithms

Combining Evidence

Evaluating Text Classifiers

Hierarchical Text Classification

Text Classification Applications

Summary

Postscript

References

Chapter 8. Prediction in Text Mining: The Data Mining Algorithms of Predictive Analytics

Preamble

Introduction

The Power of Simple Descriptive Statistics, Graphics, and Visual Text Mining

Visual Data Mining

Predictive Modeling (Supervised Learning)

Statistical Models versus General Predictive Modeling

Clustering (Unsupervised Learning)

Singular Value Decomposition, Principal Components Analysis, and Dimension Reduction

Association and Link Analysis

Summary

Postscript

References

Chapter 9. Entity Extraction

Preamble

Introduction

Text Features for Entity Extraction

Strategies for Entity Extraction

Choosing an Entity Extraction Approach

Evaluating Entity Extraction

Summary

Postscript

References

Chapter 10. Feature Selection and Dimensionality Reduction

Preamble

Introduction

Feature Selection

Feature Selection Approaches

Dimensionality Reduction

Linear Dimensionality Reduction Approaches

Postscript

References

Chapter 11. Singular Value Decomposition in Text Mining

Preamble

Introduction

Redundancy in Text

Dimensions of Meaning: Latent Semantic Indexing

The Math of Singular Value Decomposition

Graphical Representations and Simple Examples

Singular Value Decomposition in Equation Form

Singular Value Decomposition and Principal Components Analysis Eigenvalues

Some Practical Considerations

Extracting Dimensions

Subjective Methods: Reviewing Graphs

Analytical Methods: Building Models for Dimensions

Useful Analyses Based on Singular Value Decomposition Scores

Cluster Analysis

Predictive Modeling

When SVD Is Not Useful

Summary

Postscript

References

Chapter 12. Web Analytics and Web Mining

Preamble

Web Analytics

The Value of Web Analytics

The Future of Web Analytics and Web Mining

Postscript

References

Chapter 13. Clustering Words and Documents

Preamble

Introduction

Clustering Algorithms

Clustering Documents

Clustering Words

Cluster Visualization

Summary

Postscript

References

Chapter 14. Leveraging Text Mining in Property and Casualty Insurance

Preamble

Introduction

Property and Casualty Insurance as a Business

Analytics Opportunities in the Insurance Life Cycle

Driving Business Value Using Text Mining

Summary

Postscript

References

Chapter 15. Focused Web Crawling

Preamble

Introduction

The Focused Crawling Process

The Opportunities and Challenges of Mining the Web

Topic Hierarchies for Focused Crawling

Training the Document Classifier

Capturing User Feedback

Summary

Postscript

References

Chapter 16. The Future of Text and Web Analytics

Text Analytics and Text Mining

The Pros and Cons of Commercial Software versus Open Source Software

The Future of Text Mining

The Future of Web Analytics

Multisession Pathing

Integration of Web Analytics with Standard BI Tools

Attribution across Multiple Sessions

The Future: What Does It Hold?

New Areas That May Use Text Analytics in the Future

IBM Watson

Summary

References

IBM-Watson References

Chapter 17. Summary

Why Are You Reading This Chapter?

Our Perspective for Applying Text Mining Technology

Part I: Background and Theory

What Is Text Mining?

What Tools Can I Use?

Part II: The Text Mining Laboratory—28 Tutorials

Part III: Advanced Topics

Outlines of Chapter 7–15

Glossary

Index

How to Use the Data Sets and the Text Mining Software on the DVD or on Links for Practical Text Mining

I Data Sets for the Tutorials in Practical Text Mining

II SAS Text Miner Software

III Salford Systems Software, Including a New Text Miner Module Made for this Book (30-Day Free Trial Available)

IV STATISTICA Text Miner Software (30-day free trial on the DVD that accompanies this book)

Gary D. Miner

Dr. Gary Miner received a B.S. from Hamline University, St. Paul, MN, with biology, chemistry, and education majors; an M.S. in zoology and population genetics from the University of Wyoming; and a Ph.D. in biochemical genetics from the University of Kansas as the recipient of a NASA pre-doctoral fellowship. He pursued additional National Institutes of Health postdoctoral studies at the University of Minnesota and the University of Iowa, eventually becoming immersed in the study of affective disorders and Alzheimer's disease. In 1985, he and his wife, Dr. Linda Winters-Miner, founded the Familial Alzheimer's Disease Research Foundation, which became a leading force in organizing both local and international scientific meetings, bringing together the leaders in the field of Alzheimer's genetics from several countries and resulting in the first major book on the genetics of Alzheimer’s disease. In the mid-1990s, Dr. Miner turned his data analysis interests to the business world, joining the team at StatSoft and deciding to specialize in data mining. He started developing what eventually became the Handbook of Statistical Analysis and Data Mining Applications (co-authored with Drs. Robert A. Nisbet and John Elder), which received the 2009 American Publishers Award for Professional and Scholarly Excellence (PROSE). Their follow-up collaboration, Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications, also received a PROSE award in February of 2013. Gary was also a co-author of Practical Predictive Analytics and Decisioning Systems for Medicine (Academic Press, 2015). Overall, Dr. Miner’s career has focused on medicine and health issues, and on the use of data analytics (statistics and predictive analytics) in analyzing medical data to distinguish fact from fiction.
Gary has also served as a Merit Reviewer for PCORI (Patient-Centered Outcomes Research Institute), which awards grants for predictive analytics research into the comparative effectiveness and heterogeneous treatment effects of medical interventions, including drugs, among different genetic groups of patients. He also teaches online classes in ‘Introduction to Predictive Analytics’, ‘Text Analytics’, ‘Risk Analytics’, and ‘Healthcare Predictive Analytics’ for the University of California-Irvine. Until his ‘official retirement’ 18 months ago, he spent most of his time in his primary role as Senior Analyst-Healthcare Applications Specialist for Dell | Information Management Group, Dell Software (through Dell’s acquisition of StatSoft (www.StatSoft.com) in April 2014). Currently Gary is working on two new short popular books on ‘Healthcare Solutions for the USA’ and ‘Patient-Doctor Genomics Stories’.

Affiliations and expertise

CEO, M&M Predictive Analytics LLC; UCI Adjunct Professor for Continuing Education, Predictive Analytics Program; Associate Editor, The Journal of Geriatric Psychiatry and Neurology; Private Consulting, Tulsa, OK, USA

John Elder

Dr. John Elder heads the United States’ leading data mining consulting team, with offices in Charlottesville, Virginia; Washington, D.C.; and Baltimore, Maryland (www.datamininglab.com). Founded in 1995, Elder Research, Inc. focuses on investment, commercial, and security applications of advanced analytics, including text mining, image recognition, process optimization, cross-selling, biometrics, drug efficacy, credit scoring, market sector timing, and fraud detection. John obtained a B.S. and an M.E.E. in electrical engineering from Rice University and a Ph.D. in systems engineering from the University of Virginia, where he’s an adjunct professor teaching Optimization or Data Mining. Prior to 16 years at ERI, he spent five years in aerospace defense consulting, four years heading research at an investment management firm, and two years in Rice's Computational & Applied Mathematics Department.

Affiliations and expertise

Elder Research, Inc. and the University of Virginia, Charlottesville, USA

Andrew Fast

Dr. Andrew Fast leads research in text mining and social network analysis at Elder Research. Dr. Fast graduated magna cum laude from Bethel University and earned an M.S. and a Ph.D. in computer science from the University of Massachusetts Amherst. There, his research focused on causal data mining and mining complex relational data such as social networks. At ERI, Andrew leads the development of new tools and algorithms for data and text mining for applications of capabilities assessment, fraud detection, and national security. Dr. Fast has published on an array of applications, including detecting securities fraud using the social network among brokers and understanding the structure of criminal and violent groups. Other publications cover modeling peer-to-peer music file sharing networks, understanding how collective classification works, and predicting playoff success of NFL head coaches (work featured on ESPN.com).

Thomas Hill

Dr. Thomas Hill is Senior Director for Advanced Analytics (Statistica products) in the TIBCO Analytics group. He previously held positions as Executive Director for Analytics at Statistica, within Quest's and at Dell's Information Management Group. He was a co-founder and Senior Vice President for Analytic Solutions for over 20 years at StatSoft Inc. until the acquisition by Dell in 2014. At StatSoft, he was responsible for building out Statistica into a leading analytics platform. Dr. Hill received his Vordiplom in psychology from Kiel University in Germany and earned an M.S. in industrial psychology and a Ph.D. in psychology from the University of Kansas. He was on the faculty of the University of Tulsa from 1984 to 2009, where he conducted research in cognitive science and taught data analysis and data mining courses. He has received numerous academic grants and awards from the National Science Foundation, the National Institutes of Health, the Center for Innovation Management, the Electric Power Research Institute, and other institutions. Over the past 20 years, his team has completed diverse consulting projects with companies from practically all industries in the United States and internationally on identifying and refining effective data mining and predictive modeling/analytics solutions for diverse applications. Dr. Hill has published widely on innovative applications for data mining and predictive analytics. He is the author (with Paul Lewicki, 2005) of Statistics: Methods and Applications, the Electronic Statistics Textbook (a popular online resource on statistics and data mining), a co-author of Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications (2012) and Practical Predictive Analytics and Decisioning Systems for Medicine (2014); he is also a contributing author to the popular Handbook of Statistical Analysis and Data Mining Applications (2009). Dr. Hill has also authored numerous patents related to data science, machine learning, and specialized applications of analytics to various domains.

Affiliations and expertise

StatSoft, Inc., Tulsa, OK, USA

Robert Nisbet

Bob Nisbet, PhD, is a Data Scientist, currently modeling precancerous colon polyp presence with clinical data at the UC-Irvine Medical Center. He has experience in predictive modeling in Telecommunications, Insurance, Credit, and Banking. His academic experience includes teaching in Ecology and in Data Science. His industrial experience includes predictive modeling at AT&T, NCR, and FICO. He has also worked in the Insurance, Credit, membership organization (e.g., AAA), Education, and Health Care industries. He retired as an Assistant Vice President of Santa Barbara Bank & Trust in charge of business intelligence reporting and customer relationship management (CRM) modeling.

Affiliations and expertise

Researcher-Medical Informatics, H.H. Chao Comprehensive Digestive Disease Center, University of California Irvine Medical Center, Private Consulting, Santa Barbara, CA, USA

Dursun Delen

Dr. Dursun Delen is the William S. Spears Chair in Business Administration and Associate Professor of Management Science and Information Systems in the Spears School of Business at Oklahoma State University (OSU). He received his Ph.D. in industrial engineering and management from OSU in 1997. Prior to his appointment as an assistant professor at OSU in 2001, he worked for a privately owned research and consultancy company, Knowledge Based Systems Inc., in College Station, Texas, as a research scientist for five years, during which he led a number of decision support and other information systems-related research projects funded by federal agencies, including DoD, NASA, NIST and DOE.