5 reasons data is a key ingredient for AI applications

Reinforcement learning may change the way AI thinks, but data is still king


A lot has been written about artificial intelligence in recent months, but one element that is often underemphasized is the role data plays in allowing AI to function. Take self-driving cars, probably the most recognized application of AI. Building a self-driving car requires a fantastic amount of data, ranging from infrared sensor signals and digital camera images to high-resolution maps. NVIDIA estimates that one self-driving car generates 1 TB of raw data per hour. All that data is then used to develop the AI models that actually drive the car.

While we have seen recent advances in other AI techniques, such as reinforcement learning, that use less data (e.g., the success of DeepMind’s recent AlphaGo Zero in the game of Go), data is still critical for developing AI applications.

Here are 5 reasons why data is a key ingredient for AI applications.

1. Data is experience

Just like humans, AI applications improve with more experience. Data provides the examples necessary to train models that can perform predictions and classifications. A good example of this is image recognition. The availability of data through ImageNet transformed the pace of change in image understanding and led to computers reaching human-level performance. A general rule of thumb is that you need 10 times as many training examples as there are parameters (i.e., degrees of freedom) in the model being built. The more complicated the task, the more data is needed.
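To make the rule of thumb concrete, here is a minimal sketch that counts the parameters of a small fully connected network and applies the factor of 10. The layer sizes are hypothetical and chosen only for illustration, and the heuristic is a rough guide rather than a guarantee.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases in a fully connected network."""
    return sum(
        n_in * n_out + n_out  # weight matrix plus one bias per output unit
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def rule_of_thumb_examples(n_parameters, factor=10):
    """Apply the '10 times as many examples as parameters' heuristic."""
    return factor * n_parameters


# Hypothetical network: 784 inputs, one hidden layer of 32 units, 10 outputs.
layers = [784, 32, 10]
n_params = dense_parameter_count(layers)
print(n_params, rule_of_thumb_examples(n_params))  # prints: 25450 254500
```

Even this small network would call for roughly a quarter of a million examples under the heuristic, which shows how quickly data needs grow with model size.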


2. Personalization

For creating the perfect meal, it’s great to know the tastes of your diners. Similarly, data is essential to tailoring an AI model to the needs of specific users. For instance, in her recent conversation at the 2017 World Summit AI, Elisabeth Ling, SVP of Product Analytics at Elsevier, explained that we need to understand how researchers use their Mendeley libraries and search ScienceDirect in order to generate meaningful personalized recommendations.

By knowing what papers users read, download and collect, we can give them advice on potential papers of interest. Furthermore, techniques such as collaborative filtering, which make suggestions based on the similarity between users, improve with access to more data; the more user data one has, the more likely it is that the algorithm can find a similar user.
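The idea behind user-based collaborative filtering can be sketched in a few lines. This is an illustrative toy, not a description of any production recommender: the user names, paper IDs, and the choice of Jaccard similarity are all assumptions made for the example.

```python
def jaccard(a, b):
    """Similarity between two users: shared papers / all papers either holds."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, libraries):
    """Suggest papers held by the most similar other user but not by target."""
    mine = libraries[target]
    others = [user for user in libraries if user != target]
    if not others:
        return []
    nearest = max(others, key=lambda user: jaccard(mine, libraries[user]))
    return sorted(libraries[nearest] - mine)


# Hypothetical users and paper IDs.
libraries = {
    "alice": {"p1", "p2", "p3"},
    "bob": {"p2", "p3", "p4"},
    "carol": {"p7", "p8"},
}
print(recommend("alice", libraries))  # prints: ['p4']
```

With only three users, carol has no overlap with anyone, so any suggestion for her would be arbitrary. Adding more users raises the chance that someone shares at least part of her library, which is the sense in which this technique improves with more data.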

3. Fewer exceptions

A key problem in building AI models is overfitting, where the model focuses too narrowly on the examples it is given. For instance, if a model is trying to learn to recognize chairs and has only been shown standard dining chairs with four legs, it may learn that chairs are defined by having four legs. If the model is then shown a desk chair with just one pillar, it may fail to recognize it. Having more data helps combat this problem.
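The chair example can be turned into a deliberately crude sketch. The "model" below simply memorizes the leg counts it has seen on chairs, which is an assumption made to keep the illustration tiny; real models overfit in subtler ways, but the effect of more varied data is the same.

```python
def learn_chair_rule(examples):
    """Memorize the leg counts seen on chairs in the training data."""
    return {legs for legs, label in examples if label}


def is_chair(rule, legs):
    """Predict 'chair' only for leg counts the model has already seen."""
    return legs in rule


# Hypothetical training sets of (number_of_legs, is_chair) pairs.
dining_only = [(4, True), (4, True), (4, True)]
more_varied = dining_only + [(1, True), (3, True)]  # adds a desk chair and a stool

narrow_rule = learn_chair_rule(dining_only)
broad_rule = learn_chair_rule(more_varied)

print(is_chair(narrow_rule, 1))  # prints: False (desk chair is missed)
print(is_chair(broad_rule, 1))   # prints: True
```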

During training, the AI model can see more examples of different types of things. This is especially important in working with data about people, where there can be the potential for algorithmic bias against people from diverse backgrounds. This point was made by Prof. Dame Wendy Hall in her interview at the World Summit AI. Prof. Hall focused on the need to make sure that AI was trained on diverse datasets. A good example of combating this through data is the lengths that Apple went to in training their new Face ID recognition algorithm.

4. Easier testing

Even in cases where techniques can be used that require less training data, more data makes it easier to test AI systems. An example of this is A/B testing. This is where a developer takes a small amount of traffic to a site and tests to see whether a new recommendation engine or search algorithm performs better on that small set of traffic.

The more traffic (i.e., data), the easier it is to test multiple algorithms or variants. Netflix is one example: at the World Summit AI, they explained how they use A/B testing to select artwork that maximizes engagement with films and TV series on Netflix.
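Why more traffic makes testing easier can be seen with a standard two-proportion z-test. The sketch below is one common way to analyze an A/B test, not a description of how any company mentioned here does it, and the conversion counts are hypothetical. It assumes the pooled conversion rate is strictly between 0 and 1.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (z-test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail of the standard normal via the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))


# Same underlying lift (10% vs. 12% conversion) at two hypothetical traffic levels.
small_traffic = two_proportion_p_value(50, 500, 60, 500)
large_traffic = two_proportion_p_value(5_000, 50_000, 6_000, 50_000)

print(f"500 users per variant:    p = {small_traffic:.3f}")
print(f"50,000 users per variant: p = {large_traffic:.3g}")
```

With 500 users per variant, the same two-point lift cannot be distinguished from noise at the usual 0.05 threshold; with 50,000 users per variant it is unmistakable. More traffic also leaves room to split users across several variants at once.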

At Elsevier, we use A/B testing to improve our search experiences. In addition, we have developed systems that let us replay queries to maximize the amount of data we can use to improve and test our intelligent search algorithms.

5. More applications

Finally, it is increasingly the case that data can be reused for different applications. For example, a technique called transfer learning allows data developed for one domain (the recognition of general images) to be applied to another domain (the detection of diabetic retinopathy from images of the retina). Likewise, recent work has shown that background knowledge can help improve on tasks like object detection in images. Recent work from Google has shown how training using data intended for a different task like image recognition can help performance on another completely different task like language translation.

In summary, data is a central component of developing any AI system today.

At Elsevier, our data scientists and developers get to work with deep and rich datasets: 14 million scientific articles to start with, but also 11 billion facts on 50 million chemicals, traffic from the 13 million monthly users on ScienceDirect, 1.4 billion references, and 70,000 institutional profiles.

Even as we get better at using our data wisely, data will still provide the key ingredient for making smart systems that help users do their work.



Written by

Paul Groth, PhD


Dr. Paul Groth is Disruptive Technology Director at Elsevier Labs. He holds a PhD in Computer Science from the University of Southampton (2007) and has done research at the University of Southern California and the Vrije Universiteit Amsterdam. His research focuses on dealing with large amounts of diverse contextualized knowledge with a particular focus on web and science applications. This includes research in data provenance, data science, data integration and knowledge sharing. He covers these topics on his blog, Think Links.

Previously, he led architecture development for the Open PHACTS drug discovery data integration platform. Paul was co-chair of the W3C Provenance Working Group that created a standard for provenance interchange. At Elsevier, Paul continues his research line and helps the company understand new technologies and their applicability to building better infrastructure for scholarship.

