Principles of Big Data

Preparing, Sharing, and Analyzing Complex Information

1st Edition - May 20, 2013

  • Author: Jules Berman
  • eBook ISBN: 9780124047242
  • Paperback ISBN: 9780124045767

Purchase options

DRM-free (PDF, Mobi, EPub)


Principles of Big Data helps readers avoid the common mistakes that endanger all Big Data projects. By stressing simple, fundamental concepts, this book teaches readers how to organize large volumes of complex data, and how to achieve data permanence when the content of the data is constantly changing. General methods for data verification and validation, as specifically applied to Big Data resources, are emphasized throughout the book. The book demonstrates how adept analysts can find relationships among data objects held in disparate Big Data resources, when the data objects are endowed with semantic support (i.e., organized in classes of uniquely identified data objects). Readers will learn how their data can be integrated with data from other resources, and how the data extracted from Big Data resources can be used for purposes beyond those imagined by the data creators.
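The "semantic support" idea in the description (data objects that carry permanent unique identifiers and class assignments, so that assertions can relate objects held in separate resources) can be loosely sketched as follows. This is an illustrative sketch only, not code from the book; the helper `make_object`, the sample records, and the predicate name are all invented for the example:

```python
import uuid

def make_object(class_name, **properties):
    """Create a data object with a globally unique identifier and a class."""
    return {"id": str(uuid.uuid4()), "class": class_name, **properties}

# Two independent "resources", each holding its own identified objects.
hospital_records = [make_object("Patient", name="A. Smith", zip="21201")]
lab_results = [make_object("Specimen", assay="CBC", patient_zip="21201")]

# A triple-style assertion relates objects across resources by identifier,
# without copying or merging the resources: (subject_id, predicate, object_id).
links = [
    (lab_results[0]["id"], "derived_from", hospital_records[0]["id"]),
]

# An analyst holding only the identifiers can follow the link later.
subject_id, predicate, object_id = links[0]
assert subject_id == lab_results[0]["id"]
assert object_id == hospital_records[0]["id"]
```

Because the identifiers are unique and permanent, the assertion remains meaningful even when the two resources evolve independently, which is the property the description attributes to semantically supported data objects.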

Key Features

  • Learn general methods for specifying Big Data in a way that is understandable to humans and to computers
  • Avoid the pitfalls in Big Data design and analysis
  • Understand how to create and use Big Data safely and responsibly, guided by the laws, regulations, and ethical standards that apply to the acquisition, distribution, and integration of Big Data resources


Readership

Data managers, data analysts, and statisticians

Table of Contents

  • Dedication
  • Author Biography
  • Introduction
        Definition of Big Data
        Big Data Versus Small Data
        Whence Comest Big Data?
        The Most Common Purpose of Big Data Is to Produce Small Data
        Big Data Moves to the Center of the Information Universe
  • Chapter 1. Providing Structure to Unstructured Data
        Machine Translation
        Term Extraction
  • Chapter 2. Identification, Deidentification, and Reidentification
        Features of an Identifier System
        Registered Unique Object Identifiers
        Really Bad Identifier Methods
        Embedding Information in an Identifier: Not Recommended
        One-Way Hashes
        Use Case: Hospital Registration
        Data Scrubbing
        Lessons Learned
  • Chapter 3. Ontologies and Semantics
        Classifications, the Simplest of Ontologies
        Ontologies, Classes with Multiple Parents
        Choosing a Class Model
        Introduction to Resource Description Framework Schema
        Common Pitfalls in Ontology Development
  • Chapter 4. Introspection
        Knowledge of Self
        eXtensible Markup Language
        Introduction to Meaning
        Namespaces and the Aggregation of Meaningful Assertions
        Resource Description Framework Triples
        Use Case: Trusted Time Stamp
  • Chapter 5. Data Integration and Software Interoperability
        The Committee to Survey Standards
        Standard Trajectory
        Specifications and Standards
        Compliance Issues
        Interfaces to Big Data Resources
  • Chapter 6. Immutability and Immortality
        Immutability and Identifiers
        Data Objects
        Legacy Data
        Data Born from Data
        Reconciling Identifiers across Institutions
        Zero-Knowledge Reconciliation
        The Curator’s Burden
  • Chapter 7. Measurement
        Gene Counting
        Dealing with Negations
        Understanding Your Control
        Practical Significance of Measurements
        Obsessive-Compulsive Disorder: The Mark of a Great Data Manager
  • Chapter 8. Simple but Powerful Big Data Techniques
        Look at the Data
        Data Range
        Frequency Distributions
        Mean and Standard Deviation
        Estimation-Only Analyses
        Use Case: Watching Data Trends with Google Ngrams
        Use Case: Estimating Movie Preferences
  • Chapter 9. Analysis
        Analytic Tasks
        Clustering, Classifying, Recommending, and Modeling
        Data Reduction
        Normalizing and Adjusting Data
        Big Data Software: Speed and Scalability
        Find Relationships, Not Similarities
  • Chapter 10. Special Considerations in Big Data Analysis
        Theory in Search of Data
        Data in Search of a Theory
        Bigness Bias
        Too Much Data
        Fixing Data
        Data Subsets in Big Data: Neither Additive nor Transitive
        Additional Big Data Pitfalls
  • Chapter 11. Stepwise Approach to Big Data Analysis
        Step 1. A Question Is Formulated
        Step 2. Resource Evaluation
        Step 3. A Question Is Reformulated
        Step 4. Query Output Adequacy
        Step 5. Data Description
        Step 6. Data Reduction
        Step 7. Algorithms Are Selected, If Absolutely Necessary
        Step 8. Results Are Reviewed and Conclusions Are Asserted
        Step 9. Conclusions Are Examined and Subjected to Validation
  • Chapter 12. Failure
        Failure Is Common
        Failed Standards
        When Does Complexity Help?
        When Redundancy Fails
        Save Money; Don’t Protect Harmless Information
        After Failure
        Use Case: Cancer Biomedical Informatics Grid, a Bridge Too Far
  • Chapter 13. Legalities
        Responsibility for the Accuracy and Legitimacy of Contained Data
        Rights to Create, Use, and Share the Resource
        Copyright and Patent Infringements Incurred by Using Standards
        Protections for Individuals
        Unconsented Data
        Good Policies Are a Good Policy
        Use Case: The Havasupai Story
  • Chapter 14. Societal Issues
        How Big Data Is Perceived
        The Necessity of Data Sharing, Even When It Seems Irrelevant
        Reducing Costs and Increasing Productivity with Big Data
        Public Mistrust
        Saving Us from Ourselves
        Hubris and Hyperbole
  • Chapter 15. The Future
        Last Words

Product details

  • No. of pages: 288
  • Language: English
  • Copyright: © Morgan Kaufmann 2013
  • Published: May 20, 2013
  • Imprint: Morgan Kaufmann
  • eBook ISBN: 9780124047242
  • Paperback ISBN: 9780124045767

About the Author

Jules Berman

Jules Berman holds two Bachelor of Science degrees from MIT (in Mathematics and in Earth and Planetary Sciences), a PhD from Temple University, and an MD from the University of Miami. He was a graduate researcher at the Fels Cancer Research Institute (Temple University) and at the American Health Foundation in Valhalla, New York. He completed his postdoctoral studies at the US National Institutes of Health, and his residency at the George Washington University Medical Center in Washington, DC. Dr. Berman served as Chief of anatomic pathology, surgical pathology, and cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland, where he held joint appointments at the University of Maryland Medical Center and at the Johns Hopkins Medical Institutions. In 1998, he transferred to the US National Institutes of Health as a Medical Officer and as the Program Director for Pathology Informatics in the Cancer Diagnosis Program at the National Cancer Institute. Dr. Berman is a past President of the Association for Pathology Informatics and is the 2011 recipient of the Association’s Lifetime Achievement Award. He is a listed author of more than 200 scientific publications and has written more than a dozen books in his three areas of expertise: informatics, computer programming, and pathology. Dr. Berman is currently a freelance writer.

Affiliations and Expertise

Freelance author with expertise in informatics, computer programming, and cancer biology
