Two major problems often arise when implementing AI in business: ‘Bad Data’ and ‘Bad Algorithms’. Last week’s blog described some of the issues that come up when dealing with bad data. This week, we’ll use football to illustrate why bad algorithms can cause trouble for machine learning projects.
Imagine you are the coach of the Dallas Cowboys football team. Your team has struggled to win games and you’re worried about losing your job. You’ve seen what analytics and AI have done for baseball, so you decide to give it a shot. You hire a data scientist and give her every play in your playbook along with the outcome of each play in past games. You ask her to create an algorithm that shows which plays you should run and in what order.
The data scientist takes the data and applies machine learning algorithms. Let's imagine 3 resulting scenarios:
In scenario 1, the model produces a full script of plays. A script like this may look as if it works in theory, but it would go horribly wrong in a real game. As different situations arise, the coach could not adapt the plays accordingly. The script would work on the data used to train the model (the plays and outcomes provided by the coach), but it would not respond well to uncertainty or to situations that arise in new games.
This is called overfitting the data. Overfitting happens when a model is complex enough to memorize the training data, noise and all, and leaves no room for drift or variation in its predictions. In a game like football there are millions of possible sequences and outcomes, many of which may not have been captured in the data the coach provided. Any new situation that was missing from the original dataset is simply not accounted for in the ‘perfect’ script of plays.
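To make the idea concrete, here is a minimal sketch of overfitting on made-up data, not football data. The sine-shaped "true" relationship, the noise level, and the names `true_signal`, `overfit_model`, and `mse` are all assumptions chosen for illustration. A degree-9 polynomial has enough freedom to pass through all ten training points, so its training error is essentially zero, while its error on fresh points from the same process is much larger.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_signal(x):
    # Hypothetical "real" relationship the model should learn.
    return np.sin(2 * np.pi * x)

# Ten noisy training points: the limited sample the coach handed over.
x_train = np.linspace(0.0, 1.0, 10)
y_train = true_signal(x_train) + rng.normal(0.0, 0.3, size=x_train.size)

# "New games": fresh points drawn from the same process.
x_test = rng.uniform(0.0, 1.0, size=200)
y_test = true_signal(x_test) + rng.normal(0.0, 0.3, size=x_test.size)

# A degree-9 polynomial has ten coefficients, so it can pass through
# every one of the ten training points, noise included.
overfit_model = Polynomial.fit(x_train, y_train, deg=9)

def mse(model, x, y):
    # Mean squared error of the model's predictions.
    return float(np.mean((model(x) - y) ** 2))

train_mse = mse(overfit_model, x_train, y_train)
test_mse = mse(overfit_model, x_test, y_test)
print(f"train MSE: {train_mse:.6f}")
print(f"test MSE:  {test_mse:.6f}")
```

The gap between the two numbers is the warning sign: the model has learned the training sample rather than the pattern behind it.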
In scenario 2, the data scientist has badly underfit the data. She has selected a model that cannot learn the underlying complexity of the data, and it has output the single play it decided works best. A football game is much more complex than the model allows, so the predictions are not accurate even on the training data. Think of drawing a straight line through a curved scatter plot. The line does not react to the ups and downs of the data, and it offers no insight into how changes in the variables affect the outcome.
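The straight-line picture can be sketched in a few lines. The data here is synthetic and the sine-shaped pattern is an assumption for illustration only. A straight line is compared with a slightly more flexible cubic on the same points; the line's error stays high even on the data it was trained on, which is the signature of underfitting.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

# Synthetic, clearly curved data (an assumed pattern, not real game data).
x = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=x.size)

# Degree 1 is a straight line; degree 3 can bend with the data.
line = Polynomial.fit(x, y, deg=1)
cubic = Polynomial.fit(x, y, deg=3)

# Error measured on the SAME points the models were fit to.
line_mse = float(np.mean((line(x) - y) ** 2))
cubic_mse = float(np.mean((cubic(x) - y) ** 2))

print(f"straight line, training MSE: {line_mse:.4f}")
print(f"cubic curve,   training MSE: {cubic_mse:.4f}")
```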
The best scenario is number 3. Here the coach scored an all-star data scientist who understands the concept of regularization. This fancy mathematical term means the data scientist has allowed enough complexity to properly capture the data, but has constrained the model itself, for example by penalizing extreme parameter values, so that it does not overfit. Regularization is as much an art as it is a science. It takes experience and understanding to limit a model properly and avoid both overfitting and underfitting.
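One common form of regularization is ridge regression, which adds a penalty on the size of the model's weights. The sketch below is an illustration on synthetic data, not the method any particular team uses; the helper names `features` and `ridge_fit`, the penalty strength `0.01`, and the data itself are assumptions. It fits the same flexible degree-9 model twice, once with no penalty and once with a small one.

```python
import numpy as np

rng = np.random.default_rng(2)

def features(x, degree=9):
    # Polynomial features: columns are 1, x, x^2, ..., x^degree.
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(X, y, lam):
    # Ridge regression: minimize ||X w - y||^2 + lam * ||w||^2.
    # Solved as an ordinary least-squares problem on augmented data.
    # For simplicity the intercept is penalized along with the other weights.
    n_features = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(n_features)])
    y_aug = np.concatenate([y, np.zeros(n_features)])
    w, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return w

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

# Small noisy training set and a larger set of unseen points.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = np.sin(np.pi * x_train) + rng.normal(0.0, 0.3, size=x_train.size)
x_test = rng.uniform(-1.0, 1.0, size=200)
y_test = np.sin(np.pi * x_test) + rng.normal(0.0, 0.3, size=x_test.size)

X_train, X_test = features(x_train), features(x_test)

w_free = ridge_fit(X_train, y_train, lam=0.0)    # no constraint
w_ridge = ridge_fit(X_train, y_train, lam=0.01)  # assumed penalty strength

for name, w in (("unregularized", w_free), ("ridge", w_ridge)):
    print(f"{name:>13}: weight norm {np.linalg.norm(w):10.2f}, "
          f"train MSE {mse(X_train, y_train, w):.4f}, "
          f"test MSE {mse(X_test, y_test, w):.4f}")
```

The penalized model gives up a little accuracy on the training data in exchange for smaller, tamer weights, and that trade typically improves performance on unseen data. Choosing the penalty strength is where the art comes in: too small and the model still overfits, too large and it underfits.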
Managers need to be aware of the issues that arise when the wrong algorithm is applied to a dataset. Good managers will look at the dataset and ask:
With upfront expectations, the manager can help identify models that overfit or underfit the data and avoid implementing models that produce incorrect results.
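One way to turn an upfront expectation into a quick check is to agree on an acceptable error level before modeling starts, then compare it with the model's error on the training data and on data held back from training. The function below is a hypothetical rule of thumb, not a standard test; the name `diagnose_fit` and the single `acceptable_error` threshold are assumptions for illustration.

```python
def diagnose_fit(train_error, holdout_error, acceptable_error):
    """Rough triage of a model against an error level agreed on up front.

    All three inputs are errors on the same scale (lower is better).
    """
    if train_error > acceptable_error:
        # The model cannot even describe the data it was trained on.
        return "underfit"
    if holdout_error > acceptable_error:
        # Fine on training data, poor on unseen data.
        return "overfit"
    return "acceptable"


# Illustrative numbers only.
print(diagnose_fit(train_error=0.30, holdout_error=0.32, acceptable_error=0.10))
print(diagnose_fit(train_error=0.01, holdout_error=0.40, acceptable_error=0.10))
print(diagnose_fit(train_error=0.05, holdout_error=0.07, acceptable_error=0.10))
```

The three calls correspond to the single-play model, the rigid script, and the regularized model from the scenarios above.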