[caption align="alignleft" width="365"] “The Many Faces of Big Data” panel at the German Center for Research & Innovation featured Juliana Freire, PhD, Gerhard Weikum, PhD, Claudio T. Silva, PhD (the moderator), and Raghu Ramakrishnan, PhD. (Photo © Nathalie Schueller)[/caption]
Hardly a day goes by without a new article, study, book or blog post about “big data.” A Google search yields more than 1.5 billion results, representing an “explosion” of interest in the topic in the past couple of years, according to Dr. Juliana Freire, Professor of Computer Science and Engineering at NYU-Poly.
Dr. Freire was the first of three presenters on a panel that explored “The Many Faces of Big Data” at the German Center for Research & Innovation in New York City on June 27. She was joined by Dr. Gerhard Weikum, Scientific Director at the Max Planck Institute for Informatics in Saarbruecken, Germany, and Dr. Raghu Ramakrishnan, a Technical Fellow and Chief Technology Officer in the Server and Tools Business at Microsoft. The panel was moderated by Dr. Claudio T. Silva, Head of Disciplines of the Center for Urban Science and Progress (CUSP) and Professor of Computer Science and Engineering at NYU Polytechnic.[divider]
‘They see the potential of big data but don’t have the tools to explore it’
[caption align="alignleft" width="400"] Juliana Freire, PhD (Photo © Nathalie Schueller)[/caption]
Providing context for the event, Dr. Freire acknowledged that big data is not new, citing the financial industry, telecommunications industry and astronomy as just a few examples of areas in which large amounts of data have been routinely processed for a long time. “What is new is that any one of us can access big data now, and so today there are many data enthusiasts,” she said.
But easy access doesn’t mean easy analysis. “Data analysis can’t just be done with computation; you need to have the human in the loop,” she said. “Getting data from multiple sources, integrating and exploring the data to actually get insights from it — that’s hard.”
The path from data to knowledge requires the use of methodologies such as machine learning, visualization, statistics and algorithms before anyone can begin to make sense of the input. What’s missing are tools and techniques that automate as many of the tedious tasks as possible, enabling enthusiasts and experts alike to more rapidly draw useful conclusions.
[caption align="alignleft" width="420"] Taxi trips in an hour. Taxis are valuable sensors for city life. In NYC, there are on average 500,000 taxi trips each day. Information associated with taxi trips thus provides unprecedented insight into many different aspects of city life, from economic activity and human behavior to mobility patterns. This figure shows the taxi trips in Manhattan on May 1 from 8 a.m. to 9 a.m. The blue dots correspond to pickups and the orange ones correspond to drop-offs. Note the absence of taxis along 6th Avenue, indicating that traffic was blocked during this period. (Source: An upcoming article titled “Visual Exploration of Big Spatio-Temporal Urban Data: A Study of New York City Taxi Trips,” by Nivan Ferreira, Jorge Poco, Huy Vo, Juliana Freire, and Claudio T. Silva. IEEE Transactions on Visualization and Computer Graphics (TVCG), 2013)[/caption]
A small case study from her group’s work at NYU’s Polytechnic Institute illustrates the challenges people face when they have to analyze big data, Dr. Freire said. The group obtained data from the New York City Taxi and Limousine Commission for 2011 and 2012 and sought to answer such questions as: What is the average trip time from midtown to the airports during weekdays? What are the popular night spots in Manhattan? How did taxi movement patterns change during Hurricane Sandy?
Although they ultimately succeeded in moving from raw data to unearthing trends, the journey was not simple or straightforward. “The data we received were big, complex and dirty,” Dr. Freire said.
The input on the more than half a billion trips included a number of variables, such as pickup and drop-off times, location, fare amount, tip amount and distance traveled. The queries users posed involved comparing multiple data slices — for example, different regions and different time slices. To explain anomalies, such as a downturn in traffic, the team had to integrate data from other sources. For example, to explain a drop in taxi use that occurred in the beginning of April 2011, they did a Google news search. The results revealed a lot of rain during that time period as well as high gas prices; taken together, they could explain the drop.
“The sad reality is that the economists and traffic engineers we work with see the potential of big data, but they don’t have the tools to explore it on their own,” Dr. Freire said.
To aid in the process, her team developed a visual query model and a scalable system, TaxiVis, which allows users to interact with and analyze the data through visual operations. (Article in Press: “Visual Exploration of Big Spatio-Temporal Urban Data: A Study of New York City Taxi Trips,” by Nivan Ferreira, Jorge Poco, Huy Vo, Juliana Freire, and Claudio T. Silva, IEEE Transactions on Visualization and Computer Graphics, 2013)
They’ve also created an app, FindACab, which displays the spots where a smartphone user is most likely to find a cab at a specific time and location. (The app should be available for iPhones and/or Android phones in the fall or winter.)
Despite the progress, there is more that needs to be done before big data can be put to routine use. “Even so-called simple tasks still are very time consuming,” Dr. Freire said. “Data cleaning and integration are painful. Visualization is a powerful tool, but we need better integration with data management systems — plus, it’s often challenging to design appropriate visual representations. The bottom line is that data exploration is challenging for both small and big data — we need tools that are easy to use.”[divider]
‘A myth is that it’s all about volume …’
[caption align="alignright" width="400"] Gerhard Weikum, PhD (Photo © Nathalie Schueller)[/caption]
Dr. Weikum opened his talk with a slide depicting the iconic elephant and the six blind men. He explained:
Big data is like the elephant. The blind men touch it and then describe what they think is the key feature — but they all touch different parts. One person touches the body and describes it as a massive data mountain; another touches the trunk and says big data is like a dynamic hose that sucks in data and then sprays it all over the place; one touches the tusk, and says it’s like a drill for digging deep into the data. Finally, one touches the tail and describes it as a data broom, maybe for cleaning data.
In truth, these are all various facets of the big data story.
A myth about big data is that it’s all about volume and size, Dr. Weikum noted. Many applications of big data deal with terabytes and gigabytes or less, he pointed out, but agreed with Dr. Freire that even those relatively smaller volumes can be daunting to handle. Using an analogy from mathematics, Dr. Weikum explained that in standard geometry, the term “volume” encompasses three dimensions: length, width and depth.[pullquote align="right"]“It’s not about looking at some data and describing it. That would not be ambitious enough.” — Gerhard Weikum, PhD[/pullquote]
Length means one dataset usually isn’t enough. We need to collect multiple datasets to find answers and we need to make them comparable. This alone is often quite challenging. Width refers to the fact that there may be a seemingly unlimited number of relevant data sources on a particular topic, and it’s difficult to identify them all (or the truly relevant ones). Depth refers to the fact that ultimately, big data is about gaining insight and driving actionable measures over the long term. It’s not about looking at some data and describing it. That would not be ambitious enough.
Dr. Weikum gave what he called a “lightweight” example to demonstrate the complexity of this approach: the use of bicycle sharing as an alternative to automobiles in New York City and other parts of the world. On the “length” side, researchers would need to collect input on use from different cities, compare effects of bike sharing over time and look at bike sharing within the larger picture of commuter traffic in every city. But the data received aren’t directly comparable, he explained. Patterns of use may be different, infrastructures — the availability of other transportation options — may be different, and cultures are likely to be different. (Is there a culture of taking taxis in a particular city, for example?) These factors make comparisons difficult.
With respect to “width,” researchers would need to discover and integrate information on cyclists in those cities, answering such questions as: Do they wear helmets? Who is cycling — younger people, older people, business people, students? What do other people think of the cyclists? What do politicians say about them? That would involve tapping into news, social media and other data sources to get a clearer picture. Data from those sources may be “full of noise or junk,” Dr. Weikum pointed out, but could also contain some valuable “nuggets.”
To gain meaningful insights and long-term guidance, or “depth,” on topics such as end-to-end cycling commute time, energy cost per commute, safety issues, health impact of cycling, who’s for it and who’s against it, researchers would need to be able to analyze and understand input of varying quality in multiple forms from multiple sources. “That’s why big data analytics is a complex workflow,” Dr. Weikum said. “It’s not about simply gathering some data, putting it in a database, visualizing it and getting a big ‘ah-ha’ moment. That’s rarely the case.”[divider]
What’s in it for me?
[caption id="attachment_26599" align="alignright" width="400"] Raghu Ramakrishnan, PhD (Photo © Nathalie Schueller)[/caption]
Moving from the back end of big data management to current and potential applications, Dr. Ramakrishnan said big data “is not a specific technological innovation; it’s about being able to do things that would have been pipe dreams just a few years ago. Data is literally the new gold — data mining, the new Klondike.”
While he agreed with the other speakers that more user-friendly data-analysis tools are needed, he pointed to some areas in which such tools are in place or soon will be. For example, the availability of “cloud delivery” (providing software, services and storage capacity on the Internet) means that companies and individuals don’t have to set up their own infrastructures to deal with big data, he explained. “You can just go rent the space, and if your data has already been positioned there, you can start analyzing immediately. That ease of access not only to data but to tools to play with it is transformative.”[pullquote align="right"]“Data is the new gold — data mining, the new Klondike.” — Raghu Ramakrishnan, PhD[/pullquote]
Dr. Ramakrishnan went on to describe how algorithms developed from big data underpin virtually all Web search results, aggregating data so the user doesn’t just see specific web pages but also suggestions for refining a query (“semantic refinement”), maps if the query is about a specific place or entity, and relevant images, news and so forth. Those algorithms also underlie content optimization and recommendations based on what users are most interested in (“popularity”), users’ habits (“National Football League news up front over coffee in the morning, political news at night”) as well as users’ expressed preferences.
Big data also enables the analysis of behavior when people are engaged in applications such as Microsoft’s Office 365, which provides cloud-based delivery of email, calendars, teleconferencing and the like; by analyzing user activity patterns, developers can make common functions easier to access — one click away versus those that take several clicks — thereby optimizing the underlying process to make users more productive.
As another example, big data analysis of data from applications like Microsoft Kinect, which is mainly used for game playing, can help determine whether someone has a hip or knee problem that should be looked into by the way they move (gait analysis), Dr. Ramakrishnan said. And on a larger scale, big data will eventually enable responses to such questions as: What would we do if a new disease hits wheat? Will forests accelerate or slow climate change? How many species are there on Earth, and how can we predict them?
Those are some of the positive sides of big data. At the same time, a host of social, legal and regulatory issues need to be worked out, including privacy concerns and ethical considerations. “These will take longer to understand and resolve,” Dr. Ramakrishnan acknowledged. Moreover, as the other speakers noted, there is still a large gap between the availability of big data and people qualified to interpret it; nearly 200,000 people in the United States alone are needed to work with various aspects of big data to help maximize its potential, he said, citing a recent McKinsey & Co. report.
Ecology is an example of an area in which people knowledgeable about computer science are transforming the discipline. “They’re not doing field work or experiments; they’re doing ecology by generating and studying relevant data on a scale we could not have dreamed of before, and instead of qualitative insights, they’re making quantitative predictions about farming, fisheries, forests and other aspects of the earth’s support systems,” Dr. Ramakrishnan said. “In fact, if you don’t know how to use the right tools and do the right math, in five years you probably won’t be a working ecologist — and that’s the case in discipline after discipline. Maybe I’m making an overly strong statement, but that’s true enough to be scary,” he concluded.[divider]