Description

How do you approach answering queries when your data is stored in multiple databases that were designed independently by different people? This is the first comprehensive book on data integration and is written by three of the most respected experts in the field.

This book provides an extensive introduction to the theory and concepts underlying today's data integration techniques, with detailed instruction for their application and concrete examples throughout to explain the concepts. Data integration is the problem of answering queries that span multiple data sources (e.g., databases, web pages). Data integration problems surface in multiple contexts, including enterprise information integration, query processing on the Web, coordination between government agencies, and collaboration between scientists. In some cases, data integration is the key bottleneck to making progress in a field.

The authors provide a working knowledge of data integration concepts and techniques, giving you the tools you need to develop a complete and concise package of algorithms and applications.

Key Features

  • Offers a range of data integration solutions enabling you to focus on what is most relevant to the problem at hand
  • Enables you to build your own algorithms and implement your own data integration applications
  • Includes a companion website with numerous project-based exercises, solutions, and slides; links to commercially available software that allow readers to build their own algorithms and implement their own data integration applications; and a Facebook page for reader input during and after publication

Readership

Database practitioners in industry, e.g., data warehouse engineers, database system designers, data architects/enterprise architects, database researchers, statisticians, data analysts, and other data professionals working at the R&D and implementation levels. Students in data analytics and knowledge discovery.

Table of Contents

Dedication

Preface

1. Introduction

1.1 What Is Data Integration?

1.2 Why Is It Hard?

1.3 Data Integration Architectures

1.4 Outline of the Book

Bibliographic Notes

Part I: Foundational Data Integration Techniques

2. Manipulating Query Expressions

2.1 Review of Database Concepts

2.2 Query Unfolding

2.3 Query Containment and Equivalence

2.4 Answering Queries Using Views

Bibliographic Notes

3. Describing Data Sources

3.1 Overview and Desiderata

3.2 Schema Mapping Languages

3.3 Access-Pattern Limitations

3.4 Integrity Constraints on the Mediated Schema

3.5 Answer Completeness

3.6 Data-Level Heterogeneity

Bibliographic Notes

4. String Matching

4.1 Problem Description

4.2 Similarity Measures

4.3 Scaling Up String Matching

Bibliographic Notes

5. Schema Matching and Mapping

5.1 Problem Definition

5.2 Challenges of Schema Matching and Mapping

5.3 Overview of Matching and Mapping Systems

5.4 Matchers

5.5 Combining Match Predictions

5.6 Enforcing Domain Integrity Constraints

5.7 Match Selector

5.8 Reusing Previous Matches

5.9 Many-to-Many Matches

5.10 From Matches to Mappings

Bibliographic Notes

6. General Schema Manipulation Operators

6.1 Model Management Operators

6.2 Merge

6.3 ModelGen

6.4 Invert

6.5 Toward Model Management Systems

Bibliographic Notes

7. Data Matching

7.1 Problem Definition

7.2 Rule-Based Matching

7.3 Learning-Based Matching

7.4 Matching by Clustering

7.5 Probabilistic Approaches to Data Matching

7.6 Collective Matching

7.7 Scaling Up Data Matching

Bibliographic Notes

8. Query Processing

8.1 Backgroun

Details

No. of pages:
520
Language:
English
Copyright:
© 2012
Published:
Imprint:
Morgan Kaufmann
eBook ISBN:
9780123914798
Print ISBN:
9780124160446

About the authors

AnHai Doan

AnHai Doan is an Associate Professor in Computer Science at the University of Wisconsin-Madison and has done consulting work with Microsoft AdCenter Lab and Yahoo Research Lab.

Affiliations and Expertise

Associate Professor in Computer Science at the University of Wisconsin-Madison. Consulting work with Microsoft AdCenter Lab and Yahoo Research Lab.

Alon Halevy

Head of the Structured Data Group, Google Research, Mountain View, California. He joined Google in 2005 with the acquisition of his company, Transformic.

Affiliations and Expertise

Head of the Structured Data Group, Google Research, Mountain View, California.

Zachary Ives

Associate Professor at the University of Pennsylvania and a Faculty Member of the Penn Center for Bioinformatics. He received his PhD from the University of Washington. His research interests include data integration, data sharing among autonomous and heterogeneous systems, heterogeneous sensor networks, and information provenance and authoritativeness.

Affiliations and Expertise

Associate Professor at the University of Pennsylvania, and a Faculty Member of the Penn Center for Bioinformatics.

Reviews

"Researchers looking for concise and clear descriptions of the state of the art in data integration will benefit from this noteworthy effort. Graduate students in particular will acquire an excellent blueprint of the field, supplemented by almost 600 up-to-date bibliographic references they can use to further their work." --ComputingReviews.com, October 2013

"Written by three of the field’s leading experts, this book manages to address a broad range of topics in its subject domain in a reasonably compact package…The intended audience is primarily academic, specifically graduate and advanced undergraduate students in a university setting. Researchers new to the field will find it to be a helpful introduction." --ComputingReviews.com, August 2013

"…a well-organized and thorough treatment of data integration topics is a welcome addition to the practicing software professional’s bookshelf. If that treatment is both academically rigorous and still readable, as is the case with this book, it becomes a valuable resource for researchers and, in particular, for doctoral students." --ComputingReviews.com, July 2013

"This is the definitive book on data integration technology, written by experts who invented much of the technology they write about. It’s comprehensive, with lots of technical detail very clearly explained. It’s a must-read for anyone involved in the development of data integration solutions." --Philip A. Bernstein, Distinguished Scientist, Microsoft Corporation

"Despite having been with us for decades, data integration remains a challenging, multi-faceted problem. This book does an excellent job of bringing together and explaining its many facets along with the technical solutions that have been developed to date. The authors are three of the field's leading contributors, with a mix of both academic and industrial experience, and their presentation in