Have you checked the oil on your machine learning model?
October 1, 2018 • Reading time 3 minutes
Once upon a time, buying software was a one-off purchase, a bit like buying a music single: the certainty of a single cost outweighed the bugs and other issues that came with it. But in today’s world, as software becomes more sophisticated, many companies are moving towards subscription models supported by rapid improvements, developments and fixes – often guided by user requests. In the age of intelligent algorithms that can learn, this is no longer just a nice idea, but essential.
Machine learning (ML) is a form of statistics in which models keep learning as new data arrives. This means the statistical models become more accurate over time. It also means predictions stay robust as circumstances change. An ML model that assists with driving a car can adapt to new weather and road conditions, and even to different drivers, provided enough data is gathered.
This is a fantastic development akin to a car that gets more valuable with age, instead of depreciating. But there is also a downside: If these algorithms are left unsupervised, they can start to pick up strange and mysterious behaviours. Let’s discuss a few of these below:
1. Moving goal posts:
Machine learning should in theory get better over time. This is usually the case, but not always. If data stops being recorded correctly, or a field is changed, the performance of the model can fall. This may go unnoticed at first but could eventually become a serious issue. For instance, a sensor could shift from its original position, or a data entry employee could leave the team and their replacement might not record times correctly. Predictions could then drift away from their intended purpose while accuracy measures still look healthy.
There are methods to monitor performance, such as automated alerts and reports, which, together with continuous integration tools like Jenkins, allow a rapid response to issues. However, the output still needs interpreting, and it can be laborious to cover every possible scenario. At the end of the day, there is simply no alternative to a curious data scientist and continued support.
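To make this concrete, a basic drift alert can be surprisingly small. The Python sketch below is illustrative only: the function names, thresholds and numbers are made up rather than taken from any particular library, but it shows the idea of comparing recent accuracy against the level measured when the model was deployed and raising a message when the gap grows too large. A scheduler or CI job could run something like this daily and forward the message to the team.

```python
from statistics import mean

def check_for_drift(y_true, y_pred, baseline_accuracy, tolerance=0.05):
    """Flag a drop in accuracy relative to the level measured at deployment."""
    # Share of recent predictions that matched the true outcome.
    recent_accuracy = mean(1.0 if t == p else 0.0 for t, p in zip(y_true, y_pred))
    if recent_accuracy < baseline_accuracy - tolerance:
        return (f"ALERT: accuracy fell to {recent_accuracy:.0%} "
                f"(baseline {baseline_accuracy:.0%}); check the incoming data.")
    return None

# Run from a scheduler or a CI job (e.g. Jenkins) and forward any message
# to email or chat. The numbers here are made up for illustration.
message = check_for_drift(
    y_true=[1, 0, 1, 1, 0, 1, 0, 0],
    y_pred=[1, 0, 0, 1, 1, 1, 1, 0],
    baseline_accuracy=0.90,
)
if message:
    print(message)
```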
2. Cheating:
Some algorithms achieve remarkable results, but for the wrong reasons. The classic example is a system built to tell dogs from cats in photos. It was incredibly accurate, except when the cat was sitting on a lawn: the algorithm had latched onto the strong correlation between dogs and green grass rather than learning the features of dogs and cats themselves. The fix in that case was simple: convert the images to black and white before analysing them. In other settings the cheat may be less obvious, and if an algorithm finds one (the green grass in our example) and sticks to it, the accuracy of its predictions will eventually suffer.
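As a rough illustration of that kind of fix, the sketch below (the file name and image size are placeholders, not from a specific project) strips the colour information from each image before it reaches the classifier, so a cue like green grass can no longer act as a shortcut.

```python
from PIL import Image  # Pillow

def load_greyscale(path, size=(128, 128)):
    """Open an image, drop its colour channels and resize it so every
    training example has the same shape."""
    return Image.open(path).convert("L").resize(size)

# The greyscale image would then be passed to whatever classifier is in use,
# for example: pixels = list(load_greyscale("cat_on_lawn.jpg").getdata())
```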
3. Leaking bias:
Algorithms need to reflect good ethics. If they are biased against a certain part of the population, the consequences can be serious. Just as none of us would want a racist judge overseeing the legal system, the same standard should apply to algorithms. The problem is that it can be hard to tell whether an algorithm is behaving as it should or with an unethical bias – particularly if that bias creeps in slowly over time.
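One simple way to keep an eye on this is to routinely compare the model’s decisions across groups. The sketch below is a minimal example of such a check, assuming each prediction is logged alongside the group it concerns; the field names and data are illustrative only. A large, unexplained gap between groups would not prove bias on its own, but it is a prompt for a human to look closer.

```python
from collections import defaultdict

def positive_rate_by_group(records):
    """Share of positive decisions per group, to flag large gaps for review."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for record in records:
        counts[record["group"]][0] += record["prediction"]
        counts[record["group"]][1] += 1
    return {group: positives / total for group, (positives, total) in counts.items()}

# Made-up log of decisions; in practice these would come from production logs.
records = [
    {"group": "A", "prediction": 1},
    {"group": "A", "prediction": 0},
    {"group": "B", "prediction": 1},
    {"group": "B", "prediction": 1},
]
print(positive_rate_by_group(records))  # {'A': 0.5, 'B': 1.0}
```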
The consequence is that algorithms based on AI or machine learning need human supervision from someone who understands the underlying modelling as well as the outputs. A bit like a seasoned mechanic listening to your car, a data scientist is the person equipped to do that.
Have you checked the oil on your machine learning model? Leaving an Excel spreadsheet alone for weeks at worst produced outdated figures that no one understood; the same is often not true of modern algorithms. Ongoing oversight is needed if they are to deliver their promised results.
#ai #machinelearning #algorithms #statistics