Principles of Big Data

Preparing, Sharing, and Analyzing Complex Information

1st Edition - May 20, 2013
Author: Jules J. Berman
Language: English
Paperback ISBN: 978-0-12-404576-7
eBook ISBN: 978-0-12-404724-2

Principles of Big Data helps readers avoid the common mistakes that endanger all Big Data projects. By stressing simple, fundamental concepts, this book teaches readers how to organize large volumes of complex data, and how to achieve data permanence when the content of the data is constantly changing. General methods for data verification and validation, as specifically applied to Big Data resources, are stressed throughout the book. The book demonstrates how adept analysts can find relationships among data objects held in disparate Big Data resources, when the data objects are endowed with semantic support (i.e., organized in classes of uniquely identified data objects). Readers will learn how their data can be integrated with data from other resources, and how the data extracted from Big Data resources can be used for purposes beyond those imagined by the data creators.

Dedication

Acknowledgments

Author Biography

Preface

Introduction

Definition of Big Data

Big Data Versus Small Data

Whence Comest Big Data?

The Most Common Purpose of Big Data is to Produce Small Data

Opportunities

Big Data Moves to the Center of the Information Universe

Chapter 1. Providing Structure to Unstructured Data

Background

Machine Translation

Autocoding

Indexing

Term Extraction

References

Chapter 2. Identification, Deidentification, and Reidentification

Background

Features of an Identifier System

Registered Unique Object Identifiers

Really Bad Identifier Methods

Embedding Information in an Identifier: Not Recommended

One-Way Hashes

Use Case: Hospital Registration

Deidentification

Data Scrubbing

Reidentification

Lessons Learned

References

Chapter 3. Ontologies and Semantics

Background

Classifications, the Simplest of Ontologies

Ontologies, Classes with Multiple Parents

Choosing a Class Model

Introduction to Resource Description Framework Schema

Common Pitfalls in Ontology Development

References

Chapter 4. Introspection

Background

Knowledge of Self

eXtensible Markup Language

Introduction to Meaning

Namespaces and the Aggregation of Meaningful Assertions

Resource Description Framework Triples

Reflection

Use Case: Trusted Time Stamp

Summary

References

Chapter 5. Data Integration and Software Interoperability

Background

The Committee to Survey Standards

Standard Trajectory

Specifications and Standards

Versioning

Compliance Issues

Interfaces to Big Data Resources

References

Chapter 6. Immutability and Immortality

Background

Immutability and Identifiers

Data Objects

Legacy Data

Data Born from Data

Reconciling Identifiers across Institutions

Zero-Knowledge Reconciliation

The Curator’s Burden

References

Chapter 7. Measurement

Background

Counting

Gene Counting

Dealing with Negations

Understanding Your Control

Practical Significance of Measurements

Obsessive-Compulsive Disorder: The Mark of a Great Data Manager

References

Chapter 8. Simple but Powerful Big Data Techniques

Background

Look At the Data

Data Range

Denominator

Frequency Distributions

Mean and Standard Deviation

Estimation-Only Analyses

Use Case: Watching Data Trends with Google Ngrams

Use Case: Estimating Movie Preferences

References

Chapter 9. Analysis

Background

Analytic Tasks

Clustering, Classifying, Recommending, and Modeling

Data Reduction

Normalizing and Adjusting Data

Big Data Software: Speed and Scalability

Find Relationships, Not Similarities

References

Chapter 10. Special Considerations in Big Data Analysis

Background

Theory in Search of Data

Data in Search of a Theory

Overfitting

Bigness Bias

Too Much Data

Fixing Data

Data Subsets in Big Data: Neither Additive nor Transitive

Additional Big Data Pitfalls

References

Chapter 11. Stepwise Approach to Big Data Analysis

Background

Step 1. A Question Is Formulated

Step 2. Resource Evaluation

Step 3. A Question Is Reformulated

Step 4. Query Output Adequacy

Step 5. Data Description

Step 6. Data Reduction

Step 7. Algorithms Are Selected, If Absolutely Necessary

Step 8. Results Are Reviewed and Conclusions Are Asserted

Step 9. Conclusions Are Examined and Subjected to Validation

References

Chapter 12. Failure

Background

Failure Is Common

Failed Standards

Complexity

When Does Complexity Help?

When Redundancy Fails

Save Money; Don’t Protect Harmless Information

After Failure

Use Case: Cancer Biomedical Informatics Grid, a Bridge too Far

References

Chapter 13. Legalities

Background

Responsibility for the Accuracy and Legitimacy of Contained Data

Rights to Create, Use, and Share the Resource

Protections for Individuals

Consent

Unconsented Data

Good Policies Are a Good Policy

Use Case: The Havasupai Story

References

Chapter 14. Societal Issues

Background

How Big Data Is Perceived

The Necessity of Data Sharing, Even When It Seems Irrelevant

Reducing Costs and Increasing Productivity with Big Data

Public Mistrust

Saving Us from Ourselves

Hubris and Hyperbole

References

Chapter 15. The Future

Background

Last Words

References

Glossary

References

Index

Jules J. Berman