Building Robust AI Systems – Part II

In General — 02 August, 2017

Building a robust Artificial Intelligence (AI) system is a challenging task. The previous blog post covered many of the more technical aspects, from data handling and storage to data quality, model deployment and operational excellence. The technical setup to calculate billions of predictions that deliver optimal decisions needs to be fault tolerant and redundant – and must be able to recover easily if something does go wrong. As predictions typically have to be calculated within a narrow time window, the setup needs to reflect the customer's stringent requirements and fulfil Service Level Agreements.
However, a sophisticated technical setup is "only" half of what it takes to build robust AI systems. This part will focus more on the AI model itself and what it takes to build a robust prediction model.

Before exploring "robust AI systems" in more detail, we need to define what we mean by this term. Technically, the software used to implement the model should be well written and tested; this applies both to the AI framework (such as TensorFlow, Torch, CNTK or proprietary software) and to the model itself. The usual best practices of software development – clean code, regular or continuous unit and integration testing, etc. – should be applied.
Besides these more technical requirements, the AI model itself should always compute "sensible" predictions – no matter what. What "sensible" means, however, depends on the specific context the model is developed for. For example, an AI system used in the replenishment of a supermarket chain should never issue an order of thousands of crates of expensive Champagne if the usual turnover is a couple of bottles per week. If a bulk order is placed by a particularly valuable customer, this situation needs to be handled – but it is typically an exception that requires human intervention.
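A guardrail like the replenishment example above can be sketched as a simple plausibility check before an order is sent out. The function name and threshold below are illustrative assumptions, not part of any actual system:

```python
# Hypothetical sanity check: cap an order suggestion against historic turnover.
# The function name and the max_factor threshold are illustrative assumptions.
def sanity_check_order(suggested_qty, weekly_sales_history, max_factor=10):
    """Flag orders exceeding max_factor times the average weekly turnover."""
    avg_weekly = sum(weekly_sales_history) / len(weekly_sales_history)
    if suggested_qty > max_factor * avg_weekly:
        # Route to human review instead of ordering automatically.
        return {"approved": False, "reason": "exceeds plausibility threshold"}
    return {"approved": True, "reason": "within normal range"}

# A few bottles of Champagne per week: an order of thousands is rejected.
history = [3, 5, 2, 4, 6]                  # bottles sold per week
print(sanity_check_order(2000, history))   # flagged for human review
print(sanity_check_order(8, history))      # passes
```

Such a check does not make the model itself better, but it prevents a single confused prediction from turning into an absurd decision.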

How can a prediction system become so "confused" that it produces strange and abnormal predictions?

To start with, unsuitable algorithms or inadequately trained machine learning models may overfit, leading to a loss of generalizability. If the prediction model is too complex and has too many free parameters, it may simply memorize all data presented to it during training. While such a model can reproduce the training data perfectly, it does not learn the underlying statistically significant dependencies and will yield inferior predictions on new or unknown data. Best practices to avoid this include sophisticated regularization schemes and ensemble methods, where a variety of different models is combined to give a better and more generalizable result than any single algorithm can achieve on its own.
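The effect of too many free parameters can be illustrated with a small sketch: a degree-9 polynomial has enough parameters to memorize ten noisy training points almost exactly, while a lower-degree fit generalizes better. This is a toy illustration, not a production recipe:

```python
import warnings
import numpy as np

warnings.simplefilter("ignore")  # silence polyfit conditioning warnings

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=10)
x_test = np.linspace(0.05, 0.95, 10)         # unseen points in between
y_test = np.sin(2 * np.pi * x_test)          # the true underlying pattern

# A low-degree model generalizes; a degree-9 model memorizes all 10 points.
simple = np.polyfit(x_train, y_train, 3)
complex_ = np.polyfit(x_train, y_train, 9)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial fit on the given sample."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

print("degree 3: train", mse(simple, x_train, y_train),
      "test", mse(simple, x_test, y_test))
print("degree 9: train", mse(complex_, x_train, y_train),
      "test", mse(complex_, x_test, y_test))
```

The complex model's training error is near zero – it has reproduced the noise, not the signal – while its error on unseen points is much larger than its training error suggests.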

How to make prediction models more robust

Bagging (bootstrap aggregating), proposed by L. Breiman, uses multiple versions of the same model component, each trained on a randomly chosen subset of the data. Combining (bagging) them into a single model helps to find an optimal description of the data. In general, separate test and validation samples are used to monitor the behavior of the model and to evaluate whether it performs well on new and unknown data.

However, there is a twist to it: Many feature variables which are used as input to machine learning models also depend on a set of parameters. These are called "hyper-parameters" and are not related to the parameters of the machine learning model itself. For example, a machine learning model may use a weighted moving average of the previous sales of a specific item. The weight of this average needs to be determined as well – in a few cases, a natural value may be available, essentially defined by the properties of the system it describes. In most cases, however, the value of the weight has to be determined from the data. The question is – which data? If the training sample which is also used to optimize the machine learning model is used, the data is essentially used twice, and the two sets of parameters – the hyper-parameters of the feature variables and the parameters of the machine learning model – are no longer strictly independent of one another. If the hyper-parameters are instead determined from the independent test sample used to validate the machine learning model, some information from that test sample may leak implicitly into the training of the model.
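A minimal sketch of bagging, using a deliberately trivial "model" (the sample mean) so that the bootstrap-and-average structure stands out; real applications would train, for example, decision trees on each bootstrap sample:

```python
import random

random.seed(42)

def train_mean_model(sample):
    """A deliberately simple 'model': predict the mean of its training sample."""
    return sum(sample) / len(sample)

def bagged_predict(data, n_models=100):
    """Bootstrap aggregating: train each model on a resampled (with
    replacement) copy of the data, then average the predictions."""
    predictions = []
    for _ in range(n_models):
        bootstrap = [random.choice(data) for _ in data]  # sample with replacement
        predictions.append(train_mean_model(bootstrap))
    return sum(predictions) / len(predictions)

weekly_sales = [12, 15, 11, 14, 13, 16, 10, 15]
print(bagged_predict(weekly_sales))  # close to the plain mean, with reduced variance
```

Averaging over many resampled fits reduces the variance of the combined prediction, which is the mechanism behind methods such as random forests.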
To solve this conundrum, a third data sample would have to be used: one to determine the hyper-parameters of the feature variables, one to train the machine learning model and one to validate the full model – but even in the day and age of Big Data, sufficient data isn't always available for this. For example, three years of historic sales data may seem like a lot, considering that many supermarket chains stock tens of thousands of products sold at hundreds of store locations every single day – but splitting this into three samples would imply that "special" events such as Christmas or Easter, or seasonal trends, are present only once in each sample – hardly enough for a machine learning model to really learn from. Methods such as cross-validation aim to alleviate this but are difficult to use with auto-correlated data such as time series like historic sales patterns.
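For auto-correlated data such as sales time series, a common alternative to plain cross-validation is an expanding-window (rolling-origin) split, where each validation block lies strictly after its training window, so no future information leaks into training. A minimal sketch with illustrative fold sizes:

```python
def rolling_origin_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs that respect time order:
    the training window expands, and each test block lies strictly after
    its training window."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train = list(range(0, k * fold))
        test = list(range(k * fold, min((k + 1) * fold, n_samples)))
        yield train, test

# 12 months of data, 3 expanding-window splits:
for train, test in rolling_origin_splits(12, 3):
    print("train:", train, "-> test:", test)
```

Unlike shuffled cross-validation folds, every split here could actually have occurred in production: the model only ever sees the past.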

Despite all efforts to ensure the best data quality, faulty data are still possible. For example, some aspect of a new data delivery may contain an error that has not been encountered so far and slips through the automated tests. Even in these circumstances, the predictions calculated by the AI model have to be sensible and produce a meaningful result rather than crash the process or system or – worse – result in a wrong prediction. Although the administrators and Data Scientists should be notified automatically of any incident, the system should keep running reliably even if an error has occurred. It is important to deal with such an incident (e.g. fix the error in the data and implement a further automatic check), but when several billion predictions and decisions need to be calculated in a narrow time frame, fault tolerance is of utmost importance. Widely used methods such as linear regression or even more sophisticated maximum likelihood techniques are particularly prone to produce incorrect results when confronted with outliers or faulty data.
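The sensitivity of least squares to a single faulty record can be demonstrated with a small sketch comparing the ordinary least-squares slope to a simple robust alternative, the Theil-Sen estimator (the median of all pairwise slopes):

```python
from statistics import median

def ols_slope(xs, ys):
    """Ordinary least-squares slope: sensitive to a single bad data point."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def theil_sen_slope(xs, ys):
    """Median of all pairwise slopes: a simple robust alternative."""
    slopes = [(ys[j] - ys[i]) / (xs[j] - xs[i])
              for i in range(len(xs)) for j in range(i + 1, len(xs))]
    return median(slopes)

xs = list(range(10))
ys = [2.0 * x for x in xs]   # the true relationship has slope 2
ys[9] = 500.0                # one faulty record, e.g. a corrupted data delivery

print("OLS slope:      ", ols_slope(xs, ys))        # pulled far away from 2
print("Theil-Sen slope:", theil_sen_slope(xs, ys))  # stays at 2
```

A single corrupted value drags the least-squares estimate an order of magnitude off, while the median-based estimator simply ignores it – the kind of behavior one wants when a faulty delivery slips through the checks.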

Adversarial samples – an Achilles’ heel for AI systems?

A further way to "confuse" AI and machine learning algorithms are adversarial samples, introduced by I. Goodfellow et al. in 2014. Adversarial samples are specially crafted data points that cause an AI or machine learning model to make a prediction that is very far from the truth. Consider the following figure taken from Goodfellow's pre-print: The original picture shows a panda bear, which is correctly identified as such by an AI algorithm. After adding a very small adversarial component, the AI model misclassifies the panda bear as a gibbon, a primate, with high confidence – even though a human would not notice the difference between the two pictures.
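The crafting process can be sketched with the fast gradient sign method from Goodfellow et al.: perturb each input feature slightly in the direction that increases the model's loss. The toy logistic "classifier" below uses hand-picked weights and an exaggerated perturbation size so the effect is visible in four numbers; real attacks on image models use perturbations far too small to notice:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A toy, fixed logistic "classifier" (weights are illustrative, not learned).
w = [1.5, -2.0, 0.5, 1.0]
b = 0.1

def predict(x):
    """Probability that input x belongs to the positive class."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(x, y_true, eps):
    """Fast gradient sign method: step each feature by eps in the
    direction that increases the cross-entropy loss for the true label."""
    p = predict(x)
    grad_x = [(p - y_true) * wi for wi in w]  # d(loss)/dx for logistic loss
    return [xi + eps * math.copysign(1.0, g) for xi, g in zip(x, grad_x)]

x = [0.5, -0.3, 0.2, 0.4]              # confidently classified as positive
x_adv = fgsm(x, y_true=1.0, eps=0.6)   # eps exaggerated for this toy model
print(predict(x), "->", predict(x_adv))  # the predicted class flips
```

Because the gradient tells the attacker exactly which direction hurts the model most, a small, structured perturbation does far more damage than random noise of the same size.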

In another study, Nguyen et al. focus their research on unrecognizable images that an AI system classifies with high confidence, even though a human cannot discern any pattern in them. The code is available for exploration if you're interested.
These examples focus on complex deep learning AI systems, but Papernot et al. have shown that other, much simpler machine learning algorithms, such as decision trees, are also susceptible to adversarial samples. Such images could potentially have disastrous and lethal effects, as shown in research by J.H. Metzen et al., which focuses on the semantics inferred from images used in self-driving cars. In particular, they try to "trick" the AI systems into "un-seeing" pedestrians on a street – it doesn't take much to imagine what would happen if an autonomous car drove through a group of children because it doesn't "see" them. Adversarial samples are also found in other areas, such as natural language generation, as recent research by Rajeswar et al. shows.

Adversarial examples are hard to defend against – as Goodfellow and Papernot point out in their blog – because we currently don't have a good theoretical understanding of how adversarial examples are crafted, and because an AI model can be confronted with any number of inputs. Many approaches to defend against adversarial examples have been studied; He et al. show that even an ensemble of weak defenses does not lead to a strong defense.

Adversarial samples can also be used to create simulated data or new work that is (almost) indistinguishable from the data used to train the machine learning models. This specialized deep learning technique, called a Generative Adversarial Network (GAN), was proposed by I. Goodfellow et al. in 2014. A video recording of an introduction can be found here. The general idea is based on game theory – in short, two networks compete against each other: the generator network (G) creates a synthetic dataset, and the discriminator network (D) estimates the probability that its input comes from the original data rather than from the synthetic data produced by G. For example, GANGogh has been trained on 100,000 pictures from WikiArt and was used to generate new artwork. In particle physics, access to a large amount of high-quality simulated events is crucial to any analysis aiming to discover new physical phenomena. One of the most computationally expensive steps is the simulation of the behavior of particles traversing the calorimeter, which is used to measure a particle's energy. In a recent study, M. Paganini et al. showed that this step can be sped up significantly by using a Generative Adversarial Network.
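The game-theoretic core of the GAN paper can be illustrated numerically: for a fixed generator, the optimal discriminator is D*(x) = p_data(x) / (p_data(x) + p_g(x)), so once the generator's distribution matches the data distribution, the best the discriminator can do is guess 1/2 everywhere. A sketch with one-dimensional Gaussian distributions standing in for "real" and generated data:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def optimal_discriminator(x, p_data, p_g):
    """Optimal discriminator for a fixed generator (Goodfellow et al., 2014):
    D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    return p_data(x) / (p_data(x) + p_g(x))

p_data = lambda x: normal_pdf(x, 0.0, 1.0)   # "real" data distribution
p_g_bad = lambda x: normal_pdf(x, 2.0, 1.0)  # generator far off target
p_g_good = lambda x: normal_pdf(x, 0.0, 1.0) # generator matches the data

# While the generator is off target, D* confidently separates real from fake...
print(optimal_discriminator(-1.0, p_data, p_g_bad))
# ...but once the generator matches the data, D* is reduced to guessing 1/2.
print(optimal_discriminator(-1.0, p_data, p_g_good))
```

Training a real GAN alternates gradient steps on G and D; the sketch only shows the equilibrium the two networks are pulled towards.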

The first line of defense against potentially harmful predictions is to avoid inherently vulnerable techniques such as plain linear regression, and to employ regularization schemes, ensemble methods or other techniques that avoid overtraining. However, because of the potential vulnerability to adversarial samples, exploiting domain and context information is crucial in the next step of optimization. It's important to ask: Are the predictions sensible? Do the decisions derived from them make sense? What matters most to companies using AI-based decisions is that these decisions are always optimal with regard to a defined metric – and are delivered reliably at all times.

Dr. Ulrich Kerzel

earned his PhD under Professor Dr Feindt at the US Fermi National Laboratory and at that time made a considerable contribution to the core technology of NeuroBayes. After his PhD, he went to the University of Cambridge, where he was a Senior Research Fellow at Magdalene College. His research work focused on complex statistical analyses to understand the origin of matter and antimatter using data from the LHCb experiment at the Large Hadron Collider at CERN, the world's biggest research institute for particle physics. He continued this work as a Research Fellow at CERN before he came to Blue Yonder as a Principal Data Scientist.