Principles of Big Data

Preparing, Sharing, and Analyzing Complex Information

1st Edition - May 20, 2013

  • Author: Jules Berman
  • eBook ISBN: 9780124047242
  • Paperback ISBN: 9780124045767

Description

Principles of Big Data helps readers avoid the common mistakes that endanger all Big Data projects. By stressing simple, fundamental concepts, this book teaches readers how to organize large volumes of complex data, and how to achieve data permanence when the content of the data is constantly changing. General methods for data verification and validation, as specifically applied to Big Data resources, are stressed throughout the book. The book demonstrates how adept analysts can find relationships among data objects held in disparate Big Data resources, when the data objects are endowed with semantic support (i.e., organized in classes of uniquely identified data objects). Readers will learn how their data can be integrated with data from other resources, and how the data extracted from Big Data resources can be used for purposes beyond those imagined by the data creators.

Key Features

  • Learn general methods for specifying Big Data in a way that is understandable to humans and to computers
  • Avoid the pitfalls in Big Data design and analysis
  • Understand how to create and use Big Data safely and responsibly, within the laws, regulations, and ethical standards that apply to the acquisition, distribution, and integration of Big Data resources

Readership

Data managers, data analysts, statisticians

Table of Contents

  • Dedication
  • Acknowledgments
  • Author Biography
  • Preface
  • Introduction
      • Definition of Big Data
      • Big Data Versus Small Data
      • Whence Comest Big Data?
      • The Most Common Purpose of Big Data is to Produce Small Data
      • Opportunities
      • Big Data Moves to the Center of the Information Universe
  • Chapter 1. Providing Structure to Unstructured Data
      • Background
      • Machine Translation
      • Autocoding
      • Indexing
      • Term Extraction
      • References
  • Chapter 2. Identification, Deidentification, and Reidentification
      • Background
      • Features of an Identifier System
      • Registered Unique Object Identifiers
      • Really Bad Identifier Methods
      • Embedding Information in an Identifier: Not Recommended
      • One-Way Hashes
      • Use Case: Hospital Registration
      • Deidentification
      • Data Scrubbing
      • Reidentification
      • Lessons Learned
      • References
  • Chapter 3. Ontologies and Semantics
      • Background
      • Classifications, the Simplest of Ontologies
      • Ontologies, Classes with Multiple Parents
      • Choosing a Class Model
      • Introduction to Resource Description Framework Schema
      • Common Pitfalls in Ontology Development
      • References
  • Chapter 4. Introspection
      • Background
      • Knowledge of Self
      • eXtensible Markup Language
      • Introduction to Meaning
      • Namespaces and the Aggregation of Meaningful Assertions
      • Resource Description Framework Triples
      • Reflection
      • Use Case: Trusted Time Stamp
      • Summary
      • References
  • Chapter 5. Data Integration and Software Interoperability
      • Background
      • The Committee to Survey Standards
      • Standard Trajectory
      • Specifications and Standards
      • Versioning
      • Compliance Issues
      • Interfaces to Big Data Resources
      • References
  • Chapter 6. Immutability and Immortality
      • Background
      • Immutability and Identifiers
      • Data Objects
      • Legacy Data
      • Data Born from Data
      • Reconciling Identifiers across Institutions
      • Zero-Knowledge Reconciliation
      • The Curator’s Burden
      • References
  • Chapter 7. Measurement
      • Background
      • Counting
      • Gene Counting
      • Dealing with Negations
      • Understanding Your Control
      • Practical Significance of Measurements
      • Obsessive-Compulsive Disorder: The Mark of a Great Data Manager
      • References
  • Chapter 8. Simple but Powerful Big Data Techniques
      • Background
      • Look At the Data
      • Data Range
      • Denominator
      • Frequency Distributions
      • Mean and Standard Deviation
      • Estimation-Only Analyses
      • Use Case: Watching Data Trends with Google Ngrams
      • Use Case: Estimating Movie Preferences
      • References
  • Chapter 9. Analysis
      • Background
      • Analytic Tasks
      • Clustering, Classifying, Recommending, and Modeling
      • Data Reduction
      • Normalizing and Adjusting Data
      • Big Data Software: Speed and Scalability
      • Find Relationships, Not Similarities
      • References
  • Chapter 10. Special Considerations in Big Data Analysis
      • Background
      • Theory in Search of Data
      • Data in Search of a Theory
      • Overfitting
      • Bigness Bias
      • Too Much Data
      • Fixing Data
      • Data Subsets in Big Data: Neither Additive nor Transitive
      • Additional Big Data Pitfalls
      • References
  • Chapter 11. Stepwise Approach to Big Data Analysis
      • Background
      • Step 1. A Question Is Formulated
      • Step 2. Resource Evaluation
      • Step 3. A Question Is Reformulated
      • Step 4. Query Output Adequacy
      • Step 5. Data Description
      • Step 6. Data Reduction
      • Step 7. Algorithms Are Selected, If Absolutely Necessary
      • Step 8. Results Are Reviewed and Conclusions Are Asserted
      • Step 9. Conclusions Are Examined and Subjected to Validation
      • References
  • Chapter 12. Failure
      • Background
      • Failure Is Common
      • Failed Standards
      • Complexity
      • When Does Complexity Help?
      • When Redundancy Fails
      • Save Money; Don’t Protect Harmless Information
      • After Failure
      • Use Case: Cancer Biomedical Informatics Grid, a Bridge too Far
      • References
  • Chapter 13. Legalities
      • Background
      • Responsibility for the Accuracy and Legitimacy of Contained Data
      • Rights to Create, Use, and Share the Resource
      • Copyright and Patent Infringements Incurred by Using Standards
      • Protections for Individuals
      • Consent
      • Unconsented Data
      • Good Policies Are a Good Policy
      • Use Case: The Havasupai Story
      • References
  • Chapter 14. Societal Issues
      • Background
      • How Big Data Is Perceived
      • The Necessity of Data Sharing, Even When It Seems Irrelevant
      • Reducing Costs and Increasing Productivity with Big Data
      • Public Mistrust
      • Saving Us from Ourselves
      • Hubris and Hyperbole
      • References
  • Chapter 15. The Future
      • Background
      • Last Words
      • References
  • Glossary
  • References
  • Index

Product details

  • No. of pages: 288
  • Language: English
  • Copyright: © Morgan Kaufmann 2013
  • Published: May 20, 2013
  • Imprint: Morgan Kaufmann
  • eBook ISBN: 9780124047242
  • Paperback ISBN: 9780124045767

About the Author

Jules Berman

Jules Berman holds two Bachelor of Science degrees from MIT (in Mathematics and in Earth and Planetary Sciences), a PhD from Temple University, and an MD from the University of Miami. He was a graduate researcher at the Fels Cancer Research Institute (Temple University) and at the American Health Foundation in Valhalla, New York. He completed his postdoctoral studies at the US National Institutes of Health, and his residency at the George Washington University Medical Center in Washington, DC. Dr. Berman served as Chief of anatomic pathology, surgical pathology, and cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland, where he held joint appointments at the University of Maryland Medical Center and at the Johns Hopkins Medical Institutions. In 1998, he transferred to the US National Institutes of Health as a Medical Officer and as the Program Director for Pathology Informatics in the Cancer Diagnosis Program at the National Cancer Institute. Dr. Berman is a past President of the Association for Pathology Informatics and is the 2011 recipient of the Association’s Lifetime Achievement Award. He is a listed author of more than 200 scientific publications and has written more than a dozen books in his three areas of expertise: informatics, computer programming, and pathology. Dr. Berman is currently a freelance writer.

Affiliations and Expertise

Freelance author with expertise in informatics, computer programming, and cancer biology
