We Need to Talk About Your Data - Data Quality in the Age of AI

In Machine Learning — 21 June 2017

A short while ago I attended an industrial conference about production machines and engineering. This specific community is about five years behind retailers in understanding the impact that Artificial Intelligence (AI) and machine learning can make when integrated into their processes.

One of the keynote speeches was given by a professor chairing a session on information management in mechanical engineering, focusing on machine learning and AI in Industry 4.0. As the professor described several industrial use cases, one of them piqued my interest. The professor described the situation: “The data were entered in a database by experts – and so we were sure that the information was correct, because, come on, they’re experts. Yet products manufactured based on these data were faulty.” It’s a non-verbatim quote, but the main theme, repeated several times throughout the presentation of this project, was the assumption that the data had to be correct – because they were entered by experts. I was fascinated by this description, because in the many years I’ve been handling data, I’ve never come across a system where all data are correct and can be fully trusted, even – or maybe especially – when many experts work with these data on a daily basis.

Data are the foundation on which all data-driven decisions, machine learning and AI models are built. Data quality plays a critical role: AI systems learn from data, implicitly assuming that the data are a fair representation of the circumstances the machine is supposed to learn about. While a certain amount of noise in the data is unavoidable, too much noise impairs machine learning, as it becomes harder and harder for the machine to distinguish genuine but small effects from mere statistical fluctuations. Systematic shifts and trends are even more dangerous, as the AI algorithm has no way of knowing that part of the data is systematically wrong.
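To see why systematic errors are so much more harmful than random noise, consider a toy simulation – the numbers and the “wrong unit” scenario are invented purely for illustration: the noisy estimate converges to the true value as more data arrive, while the systematically shifted one stays biased no matter how much data is collected.

```python
import numpy as np

rng = np.random.default_rng(42)
true_demand = 100.0          # the quantity the model is supposed to learn
n = 10_000

# Random noise: individual records are off, but the estimate converges to the truth.
noisy = true_demand + rng.normal(loc=0, scale=20, size=n)
print(f"estimate from noisy data:   {noisy.mean():.1f}")    # close to 100

# Systematic shift: e.g. a third of the records silently recorded in the wrong unit.
shifted = noisy.copy()
shifted[: n // 3] *= 0.5
print(f"estimate from shifted data: {shifted.mean():.1f}")   # biased - more data does not help
```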

Even with the best intentions, there are simply too many opportunities for the data to turn “wrong”. Data may be entered or converted incorrectly when being transferred from one system to another. This kind of error can be reduced significantly by improving tools and processes and by running automated validation checks as often as possible. In one project, we found many data points labelled “test”, “test2”, “do not use” or “obsolete” in the comment field, indicating that they were originally used to test the system but were never removed after the test concluded.
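A minimal sketch of such an automated check, assuming a tabular dataset with a comment field – the column names and the toy data below are hypothetical, not taken from the project described:

```python
import pandas as pd

# Hypothetical markers that indicate left-over test or obsolete records.
SUSPICIOUS_COMMENTS = {"test", "test2", "do not use", "obsolete"}

def flag_suspicious_records(df: pd.DataFrame, comment_col: str = "comment") -> pd.DataFrame:
    """Return all rows whose comment field matches a known 'do not use' marker."""
    comments = df[comment_col].fillna("").str.strip().str.lower()
    return df[comments.isin(SUSPICIOUS_COMMENTS)]

# Toy example: two of the three records should never enter a training set.
sales = pd.DataFrame({
    "product_id": [101, 102, 103],
    "units_sold": [5, 0, 12],
    "comment":    ["", "test2", "do not use"],
})
print(flag_suspicious_records(sales))
```

Run regularly as part of the loading process, such a check at least prevents known test artefacts from silently ending up in a training set.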

Other examples are more subtle:

  • Data fields (such as columns in a database) change meaning over time without any change to the description of the fields (i.e. a schema change or evolution). This means that the same field is used for different purposes, changing according to some convention. Unfortunately, once the common understanding of such a convention is lost, the data lose their meaning. This can easily happen when key personnel change or external partners are involved.
  • The data stored in a database do not reflect the “real world” they are meant to describe. In one project, I was asked whether it was possible to predict a financial target (such as the EBIT) for a large global enterprise. Apart from the principal difficulty that there are typically far too few data points available (e.g. financial targets recorded on a quarterly basis for the past 5 years amount to only 20 data points), it turned out that the available data could not be used to analyse past events. Although great care was taken to enter the correct information into the database, the structure of the database made it impossible to keep track of mergers, acquisitions, sales or shifts in priorities, production lines or responsibilities between the various business units and subsidiaries. Even if all data had been entered correctly, they were only valid for that moment in time, and any change to the organizational structure of the entire company would not be reflected in the data stored.
  • The data and/or the storage systems are used by different groups for different purposes – but neither group is aware of this. This can easily happen in a large organization with central data storage. In another project with a large international retailer, all data were stored in a central database. Our project was concerned with the sales of goods – and everything in the data, from the naming of the variables to the descriptions and even the database schema, supported the assumption that this database was meant for sales data of individual products at the retailer's various store locations. However, it turned out that the operational division was also using the same system to replenish all items needed to operate the stores, e.g. cleaning material, bags, etc., which are not meant for (or available to) customers. As the database did not discriminate between “internal” and “external” use, there was no way of distinguishing the two – unless one knew the convention by which the internal products were labelled.
  • Data are only valid for a certain time span. An obvious illustration is the lifetime of a given product: a new product is introduced and can be sold or used in the further manufacturing process; eventually, the product is discontinued and can no longer be ordered. In this case, an interval of validity specifying when the product is “active” is easy to define. Even years after the product has been discontinued, one can match sales records or usage patterns with the description of the actual items. However, in a number of cases the change can be more subtle: quite often, original equipment manufacturers (OEMs) deliver products or components that are produced according to a set of mutually agreed specifications. However, OEMs may change their product within those specifications. Although the description, identifier and all other details of the product remain the same, the product itself is different. As the product still fulfils the original specifications, OEMs often don’t communicate the change. However, as many parts are combined into a final product, the latter may be faulty even though – apparently – nothing has changed. Keeping track of any changes, however small, can at least indicate when and where issues appear (a minimal sketch of such interval-of-validity bookkeeping follows this list). In software development, these concepts are related to “unit testing” (i.e. each piece of software is tested individually to make sure it does what it’s supposed to do) and “integration testing”, where the interplay of all components is tested in an environment that matches the intended usage as closely as possible. It is important that both kinds of tests are executed frequently, maybe even continuously.
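The sketch below illustrates the interval-of-validity bookkeeping mentioned above, using a hypothetical data model rather than the one from the OEM example: each product revision carries a validity range, and any record can be matched against the revision that was active when the record was created.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class ProductVersion:
    """One revision of a product; valid_to is None while the revision is still active."""
    product_id: str
    revision: int
    valid_from: date
    valid_to: Optional[date]

def version_at(versions: List[ProductVersion], product_id: str, when: date) -> Optional[ProductVersion]:
    """Return the revision that was valid on the given date, or None if there was none."""
    for v in versions:
        if v.product_id != product_id:
            continue
        if v.valid_from <= when and (v.valid_to is None or when < v.valid_to):
            return v
    return None

# Example: a component is silently revised on 2016-05-01, within the agreed specifications.
history = [
    ProductVersion("C-42", revision=1, valid_from=date(2014, 1, 1), valid_to=date(2016, 5, 1)),
    ProductVersion("C-42", revision=2, valid_from=date(2016, 5, 1), valid_to=None),
]
print(version_at(history, "C-42", date(2015, 7, 3)).revision)  # -> 1
print(version_at(history, "C-42", date(2017, 2, 1)).revision)  # -> 2
```

With such a history in place, a faulty batch of final products can at least be traced back to the component revisions that went into it.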

As these examples and anecdotes indicate, data quality is of paramount importance to any AI project – but difficult to assure and maintain. Thomas Redman describes how data quality evolves over time in his Harvard Business Review article “Data doesn’t speak for itself”. Initially, the quality of the data is unknown and has to be measured according to some metric, which has to be defined in the specific context of a given project. Once the baseline is established and a target level is set, initial projects focus on improving the quality of the data. When these projects conclude, all data stored in the system should be of much higher quality. From then on, however, the quality of the data will degrade again if there is no continuous process for monitoring and improving it. New data are added, and each new data point comes with the possibility of being wrong in a unique way that hasn’t been encountered so far. Automated quality checks can only test what is currently known and need to be continuously amended. Indeed, in a later article the same author postulates that data quality should be everyone’s job in an organization or company, not just the priority of a small team that grooms the databases occasionally.
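In the simplest case, such a metric and its continuous monitoring could look like the sketch below – the rules, threshold and column names are invented for illustration: a score defined as the fraction of records passing all currently known validity rules, recomputed whenever new data arrive.

```python
import pandas as pd

def quality_score(df: pd.DataFrame) -> float:
    """Fraction of records passing all currently known validity rules (1.0 = perfect)."""
    checks = pd.DataFrame({
        # Hypothetical rules; real projects define them per field and per context.
        "has_product_id":     df["product_id"].notna(),
        "non_negative_units": df["units_sold"].fillna(-1) >= 0,
        "no_test_records":    ~df["comment"].fillna("").str.lower().isin(
                                   {"test", "test2", "do not use", "obsolete"}),
    })
    return float(checks.all(axis=1).mean())

TARGET = 0.98  # the agreed quality target for this (hypothetical) dataset

def monitor(df: pd.DataFrame) -> None:
    """Recompute the score on every data delivery and flag degradation."""
    score = quality_score(df)
    if score < TARGET:
        # In a real setup this would raise an alert in the monitoring system.
        print(f"Data quality {score:.1%} is below the target of {TARGET:.0%} - investigate.")
    else:
        print(f"Data quality {score:.1%} meets the target.")
```

The essential point is not the particular rules but that the score is recomputed continuously and that the rule set grows whenever a new failure mode is discovered.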

Coming back to the story of the professor at the beginning: the conclusion of the project was that data have a certain time span during which they can be considered “correct” – exactly the interval of validity discussed above. In this specific case, the issue came about because third-party vendors slightly changed the components needed in the manufacturing process over time. Each change was well within the specifications – but all changes together meant that the final product just didn’t work.

Ulrich Kerzel earned his PhD under Professor Dr Feindt at the US Fermi National Laboratory and at that time made a considerable contribution to the core technology of NeuroBayes. After his PhD, he went to the University of Cambridge, where he was a Senior Research Fellow at Magdalene College. His research work focused on complex statistical analyses to understand the origin of matter and antimatter using data from the LHCb experiment at the Large Hadron Collider at CERN, the world’s biggest research institute for particle physics. He continued this work as a Research Fellow at CERN before he came to Blue Yonder as a Principal Data Scientist.

 
