To avoid losing substantial time and money on AI projects, managers must have a strong understanding of how data is processed. Data processing consists of the steps data scientists take to transform dirty, real-world data into clean, understandable data. Machine learning algorithms can only provide valid results and predictions if the data is free from errors and correctly formatted. As with many things in life, “Garbage in equals garbage out.”
Part of data processing is searching for outliers or impossible data points and inspecting them. For example, suppose a dataset includes clients’ dates of birth. Filtering this data to inspect clients who would be over 90 or 100 years old may reveal data points with birth years in the 1800s. The data scientist could then remove these data points to avoid confusing the model.
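The birth-date check described above can be sketched in a few lines of Python. This is only an illustration: the client records, the fixed reference date, and the 100-year threshold are all assumptions made up for the example, not values from a real dataset.

```python
from datetime import date

# Hypothetical client records, invented for illustration.
clients = [
    {"name": "Client A", "birth_date": date(1985, 4, 12)},
    {"name": "Client B", "birth_date": date(1899, 1, 1)},  # likely a data-entry error
    {"name": "Client C", "birth_date": date(1931, 7, 30)},
    {"name": "Client D", "birth_date": date(2001, 11, 5)},
]

REFERENCE_DATE = date(2024, 1, 1)  # fixed so the example is reproducible (assumption)
MAX_PLAUSIBLE_AGE = 100            # assumed threshold for "impossible" ages


def age_in_years(birth_date, today):
    """Return the number of whole years between birth_date and today."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)


# Step 1: flag implausible records so a person can inspect them.
suspicious = [
    c for c in clients
    if age_in_years(c["birth_date"], REFERENCE_DATE) > MAX_PLAUSIBLE_AGE
]

# Step 2: after inspection, keep only the records that were not flagged.
cleaned = [c for c in clients if c not in suspicious]

print([c["name"] for c in suspicious])
```

Flagging the records before removing them mirrors the order in the text: the data points are inspected first and only then dropped.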
Another important part of data processing is handling missing values. Datasets are often incomplete, and data scientists must decide whether to delete entire entries that contain missing elements or to insert a placeholder (perhaps the median or the most frequent value). If the data scientist determines that there are too many missing values in the dataset, the manager must decide whether it is cost-effective to collect more data, proceed with the current data, or kill the project. This decision should be made with a solid understanding of the costs of collecting or not collecting more data, as well as the likelihood that the new data will be more complete than the current set.
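The two options for missing values, deleting incomplete entries or inserting a placeholder, might look like the following sketch. The records and field names (`age`, `plan`) are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import median, mode

# Hypothetical records; None marks a missing value (assumption).
records = [
    {"age": 34, "plan": "basic"},
    {"age": None, "plan": "premium"},
    {"age": 51, "plan": None},
    {"age": 42, "plan": "basic"},
    {"age": 29, "plan": "basic"},
]

# Option 1: delete every entry that contains a missing element.
complete_only = [r for r in records if None not in r.values()]

# Option 2: insert a placeholder computed from the known values.
known_ages = [r["age"] for r in records if r["age"] is not None]
known_plans = [r["plan"] for r in records if r["plan"] is not None]
age_fill = median(known_ages)   # median for a numeric field
plan_fill = mode(known_plans)   # most frequent value for a categorical field

imputed = [
    {
        "age": r["age"] if r["age"] is not None else age_fill,
        "plan": r["plan"] if r["plan"] is not None else plan_fill,
    }
    for r in records
]

# A simple count of incomplete entries can inform the manager's decision.
missing_count = len(records) - len(complete_only)
print(f"{missing_count} of {len(records)} entries have missing values")
```

Deletion shrinks the dataset, while filling in placeholders keeps every entry at the cost of introducing estimated values, which is the trade-off the data scientist and manager need to weigh.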
Once the manager has communicated with the data scientist and determined that the data quality is satisfactory, training can begin! This is where machine learning terminology can cause confusion. The data scientist is not standing by a computer with a stopwatch and a whistle, shouting at the data to run faster. Instead, they write a few lines of code to split the dataset in two. One set is called the ‘training set’ and the other the ‘test set’. It is common to use 80% of the data in the training set and 20% in the test set.
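The "few lines of code" for an 80/20 split could resemble the sketch below. The helper name `split_dataset` and the placeholder dataset are hypothetical; in practice a data scientist would more likely call a library routine such as scikit-learn's `train_test_split`, but the idea is the same.

```python
import random


def split_dataset(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and split them into (training_set, test_set)."""
    shuffled = list(rows)                  # copy so the original order is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder dataset of 100 data points (assumption for illustration).
dataset = list(range(100))
training_set, test_set = split_dataset(dataset)

print(len(training_set), len(test_set))
```

Shuffling before splitting matters: if the data were sorted (for example by date), taking the last 20% without shuffling could give a test set that looks systematically different from the training set.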
Training the model occurs, as one might expect, on the training set. The algorithm is run on all of the data points in the set, and it outputs a formula or methodology that will be used to predict or classify future data points. Once the data scientist has fine-tuned the model and is satisfied with its outputs, it is time to put it to the test.
Since the data points in the test set are different from those in the training set, testing will determine how well the model generalizes to new data. By running the new model on the test set, the data scientist can compare the real output values (or labels) of the test set to the model’s predicted output values. When the model’s outputs are reasonably close to the actual outputs, there is low generalization error, and the model handles new data well.
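To make training and testing concrete, here is a minimal sketch that fits a straight line on a training set and then measures its error on a held-out test set. Everything in it is an assumption for illustration: the synthetic data (roughly y = 3x + 5 with noise), the choice of a simple least-squares line as the "model", and mean absolute error as the measure of how far predictions are from the real outputs.

```python
import random

rng = random.Random(0)

# Synthetic data, invented for illustration: y is roughly 3x + 5 plus noise.
data = [(x, 3 * x + 5 + rng.uniform(-1, 1)) for x in range(50)]
rng.shuffle(data)
training_set, test_set = data[:40], data[40:]  # 80% / 20%


def fit_line(points):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def mean_absolute_error(points, slope, intercept):
    """Average distance between the real outputs and the predicted outputs."""
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)


# Training: the algorithm sees only the training set and outputs a formula.
slope, intercept = fit_line(training_set)

# Testing: compare predictions with the real outputs of the unseen test set.
train_error = mean_absolute_error(training_set, slope, intercept)
test_error = mean_absolute_error(test_set, slope, intercept)

print(f"training error: {train_error:.2f}, test error: {test_error:.2f}")
```

A test error close to the training error suggests low generalization error. A test error far above the training error is the warning sign that the model has memorized the training data rather than learned a pattern that carries over.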
With high generalization error, the model has learned the training data well but will not be useful in real life. It does not provide useful outputs for data that represents new situations not seen in the training set. To reduce the risk of training a model with high generalization error, the data scientist could use a more sophisticated split. They could run an algorithm to identify groups of data points that are most like each other (called segments in the dataset) and make sure that each segment is represented in the same proportion in the test set and the training set. This helps ensure that the data in both sets represents the overall data population.
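A segment-aware split of this kind is commonly called a stratified split. The sketch below assumes the segments have already been identified and stored in a `segment` field. The helper name `stratified_split`, the field names, and the "retail" and "corporate" labels are all hypothetical. It splits each segment 80/20 separately so that both sets keep the same mix of segments as the full dataset.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, segment_of, test_fraction=0.2, seed=42):
    """Split each segment separately so both sets keep the same segment mix."""
    rng = random.Random(seed)

    # Group the rows by segment.
    by_segment = defaultdict(list)
    for row in rows:
        by_segment[segment_of(row)].append(row)

    # Shuffle and split each segment on its own, then combine the pieces.
    train, test = [], []
    for segment_rows in by_segment.values():
        rng.shuffle(segment_rows)
        n_test = round(len(segment_rows) * test_fraction)
        test.extend(segment_rows[:n_test])
        train.extend(segment_rows[n_test:])
    return train, test


# Hypothetical dataset: 80 "retail" clients and 20 "corporate" clients.
rows = [{"id": i, "segment": "retail" if i < 80 else "corporate"} for i in range(100)]

training_set, test_set = stratified_split(rows, lambda r: r["segment"])
train_counts = Counter(r["segment"] for r in training_set)
test_counts = Counter(r["segment"] for r in test_set)

print(train_counts, test_counts)
```

With a purely random split, a small segment such as the 20 corporate clients could end up almost entirely in one set by chance. Splitting segment by segment rules that out.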
Although much of the above is undertaken by the data scientist, managers can add incredible value to AI projects by understanding the high-level aspects of data processing and training. Ultimately, the manager is the decision maker, and a high-level understanding of how data affects model production can lead to better decisions.