Principles of Big Data
1st Edition
Preparing, Sharing, and Analyzing Complex Information
Description
Principles of Big Data helps readers avoid the common mistakes that endanger all Big Data projects. By stressing simple, fundamental concepts, this book teaches readers how to organize large volumes of complex data, and how to achieve data permanence when the content of the data is constantly changing. General methods for data verification and validation, as specifically applied to Big Data resources, are stressed throughout the book. The book demonstrates how adept analysts can find relationships among data objects held in disparate Big Data resources, when the data objects are endowed with semantic support (i.e., organized in classes of uniquely identified data objects). Readers will learn how their data can be integrated with data from other resources, and how the data extracted from Big Data resources can be used for purposes beyond those imagined by the data creators.
Key Features
- Learn general methods for specifying Big Data in a way that is understandable to humans and to computers
- Avoid the pitfalls in Big Data design and analysis
- Understand how to create and use Big Data safely and responsibly, within the laws, regulations, and ethical standards that apply to the acquisition, distribution, and integration of Big Data resources
Readership
Data managers, data analysts, statisticians
Table of Contents
Dedication
Acknowledgments
Author Biography
Preface
Introduction
Definition of Big Data
Big Data Versus Small Data
Whence Comest Big Data?
The Most Common Purpose of Big Data is to Produce Small Data
Opportunities
Big Data Moves to the Center of the Information Universe
Chapter 1. Providing Structure to Unstructured Data
Background
Machine Translation
Autocoding
Indexing
Term Extraction
References
Chapter 2. Identification, Deidentification, and Reidentification
Background
Features of an Identifier System
Registered Unique Object Identifiers
Really Bad Identifier Methods
Embedding Information in an Identifier: Not Recommended
One-Way Hashes
Use Case: Hospital Registration
Deidentification
Data Scrubbing
Reidentification
Lessons Learned
References
Chapter 3. Ontologies and Semantics
Background
Classifications, the Simplest of Ontologies
Ontologies, Classes with Multiple Parents
Choosing a Class Model
Introduction to Resource Description Framework Schema
Common Pitfalls in Ontology Development
References
Chapter 4. Introspection
Background
Knowledge of Self
eXtensible Markup Language
Introduction to Meaning
Namespaces and the Aggregation of Meaningful Assertions
Resource Description Framework Triples
Reflection
Use Case: Trusted Time Stamp
Summary
References
Chapter 5. Data Integration and Software Interoperability
Background
The Committee to Survey Standards
Standard Trajectory
Specifications and Standards
Versioning
Compliance Issues
Interfaces to Big Data Resources
References
Chapter 6. Immutability and Immortality
Background
Immutability and Identifiers
Data Objects
Legacy Data
Data Born from Data
Reconciling Identifiers across Institutions
Zero-Knowledge Reconciliation
The Curator’s Burden
References
Chapter 7. Measurement
Background
Counting
Gene Counting
Dealing with Negations
Understanding Your Control
Practical Significance of Measurements
Obsessive-Compulsive Disorder: The Mark of a Great Data Manager
References
Chapter 8. Simple but Powerful Big Data Techniques
Background
Look At the Data
Data Range
Denominator
Frequency Distributions
Mean and Standard Deviation
Estimation-Only Analyses
Use Case: Watching Data Trends with Google Ngrams
Use Case: Estimating Movie Preferences
References
Chapter 9. Analysis
Background
Analytic Tasks
Clustering, Classifying, Recommending, and Modeling
Data Reduction
Normalizing and Adjusting Data
Big Data Software: Speed and Scalability
Find Relationships, Not Similarities
References
Chapter 10. Special Considerations in Big Data Analysis
Background
Theory in Search of Data
Data in Search of a Theory
Overfitting
Bigness Bias
Too Much Data
Fixing Data
Data Subsets in Big Data: Neither Additive nor Transitive
Additional Big Data Pitfalls
References
Chapter 11. Stepwise Approach to Big Data Analysis
Background
Step 1. A Question Is Formulated
Step 2. Resource Evaluation
Step 3. A Question Is Reformulated
Step 4. Query Output Adequacy
Step 5. Data Description
Step 6. Data Reduction
Step 7. Algorithms Are Selected, If Absolutely Necessary
Step 8. Results Are Reviewed and Conclusions Are Asserted
Step 9. Conclusions Are Examined and Subjected to Validation
References
Chapter 12. Failure
Background
Failure Is Common
Failed Standards
Complexity
When Does Complexity Help?
When Redundancy Fails
Save Money; Don’t Protect Harmless Information
After Failure
Use Case: Cancer Biomedical Informatics Grid, a Bridge too Far
References
Chapter 13. Legalities
Background
Responsibility for the Accuracy and Legitimacy of Contained Data
Rights to Create, Use, and Share the Resource
Copyright and Patent Infringements Incurred by Using Standards
Protections for Individuals
Consent
Unconsented Data
Good Policies Are a Good Policy
Use Case: The Havasupai Story
References
Chapter 14. Societal Issues
Background
How Big Data Is Perceived
The Necessity of Data Sharing, Even When It Seems Irrelevant
Reducing Costs and Increasing Productivity with Big Data
Public Mistrust
Saving Us from Ourselves
Hubris and Hyperbole
References
Chapter 15. The Future
Background
Last Words
References
Glossary
References
Index
Details
- No. of pages: 288
- Language: English
- Copyright: © Morgan Kaufmann 2013
- Published: 30th May 2013
- Imprint: Morgan Kaufmann
- eBook ISBN: 9780124047242
- Paperback ISBN: 9780124045767
About the Author
Jules Berman
Jules Berman holds two bachelor of science degrees from MIT (Mathematics, and Earth and Planetary Sciences), a PhD from Temple University, and an MD from the University of Miami. He was a graduate researcher in the Fels Cancer Research Institute, at Temple University, and at the American Health Foundation in Valhalla, New York. His post-doctoral studies were completed at the U.S. National Institutes of Health, and his residency was completed at the George Washington University Medical Center in Washington, D.C. Dr. Berman served as Chief of Anatomic Pathology, Surgical Pathology and Cytopathology at the Veterans Administration Medical Center in Baltimore, Maryland, where he held joint appointments at the University of Maryland Medical Center and at the Johns Hopkins Medical Institutions. In 1998, he transferred to the U.S. National Institutes of Health, as a Medical Officer, and as the Program Director for Pathology Informatics in the Cancer Diagnosis Program at the National Cancer Institute. Dr. Berman is a past President of the Association for Pathology Informatics, and the 2011 recipient of the association's Lifetime Achievement Award. He is a listed author on over 200 scientific publications and has written more than a dozen books in his three areas of expertise: informatics, computer programming, and cancer biology. Dr. Berman is currently a freelance writer.
Affiliations and Expertise
Freelance author with expertise in informatics, computer programming, and cancer biology
Reviews
"By stressing simple, fundamental concepts, this book teaches readers how to organize large volumes of complex data, and how to achieve data permanence when the content of the data is constantly changing. General methods for data verification and validation, as specifically applied to Big Data resources, are stressed throughout the book." --ODBMS.org, March 2014
"The book is written in a colloquial style and is full of anecdotes, quotations from famous people, and personal opinions." --ComputingReviews.com, February 2014
"The author has produced a sober, serious treatment of this emerging phenomenon, avoiding hype and gee-whiz cases in favor of concepts and mature advice. For example, the author offers ten distinctions between big data and small data, including such factors as goals, location, data structure, preparation, and longevity. This characterization provides much greater insight into the phenomenon than the standard 3V treatment (volume, velocity, and variety)." --ComputingReviews.com, October 2013