Update: You can watch a video of the presentations at the end of this story.
In the era of big data, every scientific discipline must find a way to tackle challenges in storing, handling and interpreting large amounts of raw information. Earlier this month, experts examined how to address that issue in the physical sciences. The panel, titled Big Data and the Future of Physics, was part of the World Science Festival in New York City.
“As scholarly research is becoming increasingly digitized and data science is taking over many domains, the importance of managing and sharing data is being felt throughout the scientific community,” explained panelist Anita de Waard, Elsevier’s VP Research Data Collaborations, who develops cross-disciplinary frameworks to store, share and search experimental outputs in collaboration with academic and government groups. “As data, software and ideas become available to everyone, science can take advantage of the network effect to radically accelerate.”
The event brought together physical scientists from throughout the New York City area. Four specialists shared their views on the importance and implications of data collection and storage in physics. Their presentations were followed by a discussion led by science writer and former editor-in-chief of Scientific American John Rennie.
Zooming in over CERN
Prof. Michael Tuts, the current Chair of the Columbia University Physics Department and an experimental particle physicist, shared his experiences with the ATLAS Experiment at the Large Hadron Collider (LHC) accelerator at CERN in Geneva. At the LHC, two beams of protons, in bunches of roughly 100 billion particles each, are collided at high speeds in the hope of finding evidence of new physics. In 2012, the Higgs boson was discovered at the LHC, a finding that led to a Nobel Prize a year later.
One of the main challenges in running the LHC is handling the vast amounts of data it produces. Prof. Tuts compared the LHC to a 100-megapixel digital camera that takes 40 million electronic “pictures” of the colliding proton bunches per second. To keep the amount of data manageable, “empty” pictures (those containing nothing of interest) are thrown away immediately.
The challenge for researchers at the LHC is to keep the interesting “pictures” for further analysis while filtering out and discarding the empty ones. With 40 million pictures per second taken by the ATLAS Detector, the experiment produces 40TB of raw data per second, which is filtered down to 1GB per second.
To further complicate the challenge, only about one picture in 100 billion contains, for example, a Higgs boson – so researchers have to be very careful about what they throw away, because once a picture is gone, it is gone forever.
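The scale of this filtering can be illustrated with a rough back-of-envelope calculation using the figures quoted above (the script below is only an illustrative sketch, not anything from the actual trigger system):

```python
# Back-of-envelope arithmetic for the ATLAS data rates quoted above
# (illustrative only; real trigger systems work in multiple stages).

PICTURES_PER_SECOND = 40_000_000   # 40 million "pictures" per second
RAW_RATE_TB_S = 40                 # 40 TB of raw data per second
KEPT_RATE_GB_S = 1                 # filtered down to 1 GB per second

# Average size of a single raw "picture"
bytes_per_picture = RAW_RATE_TB_S * 1e12 / PICTURES_PER_SECOND
print(f"~{bytes_per_picture / 1e6:.0f} MB per picture")      # ~1 MB

# Overall reduction factor achieved by the filtering
reduction = RAW_RATE_TB_S * 1e12 / (KEPT_RATE_GB_S * 1e9)
print(f"reduction factor: ~{reduction:,.0f}x")               # ~40,000x
```

In other words, each “picture” averages about a megabyte, and the filters must discard roughly 40,000 bytes for every byte they keep.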
In a subsequent step, the raw data has to be converted into data that can be used for physics analysis, resulting in various separate datasets that need to be saved to disk and tape for posterity. All these data are stored on the Worldwide LHC Computing Grid (WLCG), which consists of 167 computing sites in 42 countries and holds over 200PB (200,000TB) in 1 billion files.
The big challenge Prof. Tuts and his collaborators are tackling is to develop ever smarter ways to analyze and mine these huge datasets, as the team expects the amount of data to grow by a factor of 10 over the coming decade.
The sky is the archive
Prof. Kirk Borne, astrophysicist and Chief Data Scientist at Booz Allen Hamilton’s Strategic Innovation Group, presented on Astroinformatics. Prof. Borne described astronomy (“the world’s second oldest profession”) as a forensic science: it tries to reconstruct events long past from the evidence they left behind. Radiation is the astronomer’s only source of information about the universe, and it is a remarkably rich and diverse source that must be analyzed by different instruments measuring different parts of the wavelength spectrum to get a complete picture. Various types of telescopes observing different parts of the spectrum allow for inter-comparisons of new objects and sources.
These various sources are producing massive amounts of data that need to be accessible to various research groups in order to compare and combine observations.
As an example of an exciting new astrophysics project that will collect huge amounts of data, Prof. Borne discussed the Large Synoptic Survey Telescope (LSST) being built in Chile. Starting in 2022, the LSST will capture images of the entire night sky every three days over a 10-year period, enabling researchers to analyze the changes in each quadrant of the sky over this time. The LSST will, for example, make an inventory of the solar system, including timely observation of near-Earth asteroids, determine the velocity and location of 20 billion stars in the Milky Way, and possibly even shed light on the nature of dark energy.
Every night, 10 million events will be recorded, producing 20TB of data, all publicly available. The data will be triaged and classified in real time, enabling fast event detection and response.
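The nightly figures above give a rough sense of the survey’s total scale (a sketch based only on the numbers quoted here, ignoring downtime, weather and compression):

```python
# Rough total data volume for the LSST survey, from the figures above
# (ignores downtime, weather and compression; illustrative only).

TB_PER_NIGHT = 20
NIGHTS_PER_YEAR = 365
SURVEY_YEARS = 10

total_tb = TB_PER_NIGHT * NIGHTS_PER_YEAR * SURVEY_YEARS
print(f"~{total_tb:,} TB (~{total_tb / 1000:.0f} PB) over the survey")
```

Even this naive upper bound, roughly 73PB, lands in the same regime as the LHC grid described above, which is why the data must be triaged automatically rather than inspected by hand.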
To analyze data sets of this size, physicists need to employ machine-generated, human-generated, and potentially crowdsourced techniques that extract the context of the data and curate these features for search, re-use, and machine-assisted triage of millions of sky events.
Knowledge preservation and … pizza?
Prof. Michael Hildreth, Professor of Physics at the University of Notre Dame, discussed knowledge preservation and reproducibility of research in general, and what other disciplines can learn from physics here. More and more funding agencies acknowledge that the huge investments in producing data for science are wasted if the data are lost or cannot be re-used after a grant ends.
Conservation of data is not only crucial for the reproducibility of scientific results but also for making them accessible to a general audience. This leads to important questions about who is responsible for these data, where to store them, and who will pay for storing them and making them available. Apart from that, preserving data is not enough; the data must also remain usable for future researchers. At a time when hardware and software can become outdated within a couple of months, this poses a huge challenge.
Prof. Hildreth compared preserving scientific data to preserving pizza. There are three ways to preserve pizza:
- Refrigeration — preserving for a short time before it goes moldy and hoping that a short time is “long enough”
- Frozen pizza — preserving until freezer burn takes over; freezing code, operating system and data so the analysis can be re-run later, assuming the same procedures used for remote computing still work
- Preserving the recipe — making sure you can repeat it, given the ingredients and instructions.
To support the second and third approaches, Prof. Hildreth pointed to the Data and Software Preservation for Open Science (DASPOS) project – a multidisciplinary effort to create a template for data preservation with the aim of producing “automatic pizza freezers and automatic recipe regenerators.”
Prof. Hildreth concluded his talk with the observation that knowledge preservation is complicated and technologically challenging. Many of the necessary tools are still missing. However, progress is being made on many fronts, for example with the CERN Analysis Portal. Prof. Hildreth said tools that bridge the gap between data scientists and data-generating researchers seem to be the key to further progress.
How the “network effect” is accelerating science
De Waard summarized the three presentations as describing shared data (on particles and stars), shared software (preserved and “dockerized,” or wrapped up in preservable containers) and shared ideas. As science thus becomes “deconstructed” into its component parts, it enables the “network effect”: many more connections become possible between nodes in a network than in the traditional linear stream, where a scientist creates his or her own data, software and ideas in relative isolation. This enables scientific progress to accelerate at an exponential rate: not only can data created by one team be used by the whole world, but new parties can contribute software and ideas. That is essential because the number of scientists in the world, and in the US in particular, is not increasing fast enough to keep up with the available data or the complexity of the questions.
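The contrast between a linear stream and a network can be made concrete with a little combinatorics (a minimal sketch of the scaling argument, not anything presented at the panel): a chain of n participants has n − 1 links, while a fully connected network of shared data, software and ideas allows up to n(n − 1)/2 pairwise connections.

```python
# "Network effect" in miniature: linear pipelines grow linearly in
# connections, while fully connected networks grow quadratically.

def linear_connections(n: int) -> int:
    """Links in a chain of n participants."""
    return n - 1

def network_connections(n: int) -> int:
    """Possible pairwise connections among n participants."""
    return n * (n - 1) // 2

for n in (10, 100, 1000):
    print(n, linear_connections(n), network_connections(n))
# 1,000 participants: 999 linear links vs 499,500 possible pairs
```

The quadratic growth of possible connections is one way to read de Waard’s point that science can accelerate faster than the head count of scientists.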
“Doing science can be indistinguishable from analyzing large data sets”
“The discussion was quite provocative,” said Dr. Jennifer Costley, Director of Physical Sciences, Sustainability and Engineering at the New York Academy of Sciences. “It highlighted for me that doing science can be indistinguishable from analyzing large data sets, and there should be more sharing of tools and methods between large physics programs and the broader data science community. It also put a new spin on the ‘reproducibility crisis’ in science as not just a matter of replication of results, but of the software and hardware used to produce those results.”
The event was organized in partnership with the Elsevier journal Annals of Physics, whose Editor-in-Chief Brian Greene, Professor of Physics at Columbia University, is also co-founder and chairman of the World Science Festival.
“We were proud to co-sponsor this event,” said Ann Gabriel, VP of Elsevier’s Global Strategic Networks team. “The proceedings reflect our commitment and investment in networked knowledge, quality content, and open data within physics and beyond.”