As a data scientist, it is always tempting to focus on the newest technology, the latest release of your favorite deep learning network, or a fancy statistical test you recently heard of. While all of this is very important, and we here at STATWORX are proud to use the latest open-source machine learning tools, it is often more important to take a step back and have a closer look at the problem we want to solve.
In this article, I want to show you the importance of framing your business question in a different way – the data science way. Once the problem is clearly defined, we are more than happy to apply the newest fancy algorithm. But let’s start from the beginning!
The Initial Problem
Let’s assume for a moment that you are a data scientist here at STATWORX. On Monday morning at 10 o’clock, the telephone rings, and a manager of an international bank is on the line. After a bit of back and forth, the bank manager explains that they have a problem with defaulting loans and need a program that predicts which loans are going to default in the future. Unfortunately, he must end the call now, but he’ll catch up with you later. In the meantime, you start to make sense of the problem.
Data Scientist View
While it’s clear to the bank manager that he provided you with all necessary information, you grab another cup of coffee, lean back in your chair, and recap the problem:
- The bank lends money to customers today
- The customer promises the bank to pay back the loan bit by bit over the next couple of months/years
- Unfortunately, some of the customers are not able to do so and are going to default on the loan
So far everything is fine. The bank will give you data of the past and you are instructed to make a prediction. Fair enough, but what specifically was there to predict again? Do they need to know whether every single loan is going to default or not? Are they more concerned about the default trend throughout the whole bank?
Data Science Explanation
From a data science perspective, we differentiate between two sorts of problems: Classification and Regression tasks. The way we prepare the data and the models we apply differ fundamentally between the two. Classification problems, as the name suggests, assign data points to a specific category. For bank loans, one approach could be to construct two categories:
- The loan defaulted
- The loan is still performing
On the other hand, the output of a Regression problem is a continuous variable. In this case, this could be:
- The percentage of loans which are going to default in a given month
- The total amount of money the bank will lose in a given month
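To make the difference tangible, here is a minimal sketch of how the same loan data could be turned into either kind of target. The records, the field layout, and the simplification that a defaulted loan loses its full amount are all assumptions for illustration, not the bank's actual data model.

```python
# Hypothetical loan records: (loan_id, month, amount, defaulted).
loans = [
    ("A", "2024-01", 10_000, False),
    ("B", "2024-01", 5_000, True),
    ("C", "2024-02", 8_000, False),
    ("D", "2024-02", 12_000, True),
    ("E", "2024-02", 4_000, True),
]

# Classification target: one label per loan (1 = defaulted, 0 = performing).
classification_target = {
    loan_id: int(defaulted) for loan_id, _, _, defaulted in loans
}


def monthly_targets(records):
    """Regression targets: one continuous value per month."""
    totals = {}
    for _, month, amount, defaulted in records:
        stats = totals.setdefault(month, {"n": 0, "n_default": 0, "loss": 0})
        stats["n"] += 1
        stats["n_default"] += int(defaulted)
        # Assumption: the full loan amount is lost when a loan defaults.
        stats["loss"] += amount if defaulted else 0
    return {
        month: {"default_rate": s["n_default"] / s["n"], "loss": s["loss"]}
        for month, s in totals.items()
    }


regression_targets = monthly_targets(loans)
print(classification_target)
print(regression_targets)
```

The classification target has one row per loan, while the regression targets have one row per month. That difference in granularity is why the data preparation diverges so early between the two tasks.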
From now on, it’s paramount to work out with the client what problem they actually want to solve. While it’s a lot of fun to play around with the best tech stack, it is of the highest importance never to forget the business needs of the client. I’ll present two possible scenarios, one for the classification and one for the regression case.
Scenario Classification Problem
For the next day, you set up a phone conference with the manager and decision-makers of the bank to discuss the overall direction of the project. The management board of the bank decided that it is more important to focus on the default prediction of single loans, instead of the overall default trend. Now you know that you have to solve a classification problem. Further, you ask the board what exactly they expect from the model.
Manager A: I want to have the best performing model possible!
Manager B: As long as it predicts reality as accurate as possible, I’m happy 🙂
Manager C: As long as it catches every defaulted loan for sure…
Manager A: … but of course, it should not predict too many loans wrong!
Data Scientist View
You try to match every requirement from the bank. Understandably, the bank wants to have the perfect model, which makes little to no mistakes. Unfortunately, every model makes some errors, and you are still unsure which kind of error is worse for the bank. To properly continue your work, it is important to define with the client which problem exactly to solve and, therefore, which error to minimize. Some options could be:
- Catch every loan that will default
- Make sure the model does not classify a performing loan as a defaulted loan
- Some kind of weighted average between both of them
Have a look at the right chart above to see what this could look like.
Data Science Explanation
To generate predictions, you have to train a model on the given data. To tell the model how well it performed and to punish it for mistakes, it is necessary to define an error metric. The choice of the error metric always depends on the business case. From a technical point of view, it is possible to model nearly every business case. However, four metrics are used in most classification problems: accuracy, recall, precision, and the F-beta score.
Accuracy measures, as the name suggests, how accurately the model predicts the loan status: it is the share of all loans that are classified correctly. While this is the most basic metric one can think of, it’s also a dangerous one. Let’s say the bank tells us that roughly 5% of the loans on the balance sheet default. Now suppose that, for some reason, our model never predicts a default. In other words, the model classifies every loan as a non-defaulting loan. The accuracy is immediately 95/100 = 95%, even though the model does not catch a single default. For datasets where the classes are highly imbalanced, it is usually a good idea not to rely on accuracy.
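The 95% trap is easy to reproduce. The following sketch uses 100 made-up loans, 5 of which default, and a "model" that never predicts a default. The helper names are my own and only serve the illustration.

```python
# 100 hypothetical loans, 5 of which default (1 = default, 0 = performing).
y_true = [1] * 5 + [0] * 95
# A "model" that never predicts a default.
y_pred = [0] * 100


def accuracy(y_true, y_pred):
    """Share of loans whose status is predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def share_of_defaults_caught(y_true, y_pred):
    """Share of actual defaults that the model flags as a default."""
    caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual_defaults = sum(y_true)
    return caught / actual_defaults if actual_defaults else 0.0


acc = accuracy(y_true, y_pred)
caught = share_of_defaults_caught(y_true, y_pred)
print(f"accuracy: {acc:.0%}, defaults caught: {caught:.0%}")
```

The accuracy looks great on paper, yet the model is useless for the bank, because it misses every loan that matters.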
Recall is the share of loans that actually default which the model flags as a default. Optimizing the machine learning algorithm for recall ensures that it catches as many defaulted loans as possible. On the flip side, an algorithm that flags every defaulted loan often does so by predicting too many defaults overall: many loans that are not going to default are flagged as well.
Precision is the share of flagged loans that actually default. High precision ensures that most of the loans the algorithm flags as a default really do default. This comes at the expense of the total number of loans that are flagged. It might therefore not be possible to flag every loan that is going to default, but the loans that are flagged are very likely to default.
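A small sketch can show how recall and precision pull in opposite directions. I assume here that the model outputs a default score between 0 and 1 per loan and that loans at or above a chosen threshold are flagged. The scores, labels, and thresholds are invented for illustration.

```python
def precision_recall(y_true, scores, threshold):
    """Return (precision, recall) when loans with score >= threshold are flagged."""
    flagged = [s >= threshold for s in scores]
    tp = sum(f and t == 1 for f, t in zip(flagged, y_true))
    fp = sum(f and t == 0 for f, t in zip(flagged, y_true))
    fn = sum((not f) and t == 1 for f, t in zip(flagged, y_true))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical data: 3 defaults among 10 loans, with made-up model scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1, 0.05]

# A low threshold catches every default but flags performing loans too.
low = precision_recall(y_true, scores, threshold=0.25)
# A high threshold flags only the surest case and misses two defaults.
high = precision_recall(y_true, scores, threshold=0.8)
print("low threshold  (precision, recall):", low)
print("high threshold (precision, recall):", high)
```

With the low threshold, recall is perfect and precision drops to 60%. With the high threshold, precision is perfect and recall drops to one third. Which of the two is preferable is exactly the question to settle with the client.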
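Empirically speaking, an increase in recall is almost always associated with a decrease in precision and vice versa. Often, the goal is to balance precision and recall. This can be done with the F-beta score, a weighted harmonic mean of the two in which the parameter beta controls how much more weight recall receives than precision.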
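As a rough sketch, the F-beta score can be computed directly from precision and recall. The two example models below are hypothetical: model A catches every default at the cost of false alarms, model B only flags loans it is sure about.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 puts more weight on recall, beta < 1 on precision.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical models as (precision, recall) pairs.
model_a = (0.6, 1.0)    # catches every default, some false alarms
model_b = (1.0, 1 / 3)  # no false alarms, misses most defaults

for beta in (0.5, 1.0, 2.0):
    score_a = f_beta(*model_a, beta=beta)
    score_b = f_beta(*model_b, beta=beta)
    print(f"beta={beta}: model A {score_a:.3f}, model B {score_b:.3f}")
```

With beta = 2, model A scores higher because recall dominates. With beta = 0.5, model B wins because precision dominates. Choosing beta is therefore a business decision rather than a technical one.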
Scenario Regression Problem
During the phone conference (the same one as in the classification scenario), the decision-makers from the bank announce that they want to predict the overall default trend. While that’s already important information, you work out with the client what exactly their business need is. In the end, you have a list of requirements:
Manager A: It’s important to match the overall trend as close as possible.
Manager B: During normal times, I won’t pay too much attention to the model. However, it is absolutely necessary that the model performs well in extreme market situations.
Manager C: To make it as easy and convenient to use as possible and to be able to explain it to the regulating agency, it has to be as explainable as possible.
Data Science View
Similar to the last scenario, there is again a tradeoff. It is a business problem to define which error is worse. Is every deviation from the ground truth equally bad? Is a certain stability of the prediction error important? Does the client care about the volatility of the forecast? Does a baseline exist?
Have a look at the left chart above to see what this could look like.
Data Science Explanation
Once again, there are several metrics one can choose from. The best metric always depends on the business need. Here are the most common ones:
The Mean Absolute Error (MAE) calculates, as the name suggests, how far the predictions are off on average in absolute terms. While the number is easy to interpret, it treats every deviation in the same way. Over a 100-day interval, being off by 1 unit every day yields the same MAE as being exactly right on 99 days and off by 100 units on a single day.
The Mean Squared Error (MSE) is also based on the difference between the actual and the predicted output. This time, each deviation is squared before averaging, so large deviations are weighted more heavily. A few extreme errors are penalized more than many small ones.
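The 100-day example can be checked with a few lines of code. The series below are made up: the actual value is a constant 100, one forecast is off by 1 unit every day, the other is right on 99 days and off by 100 units once.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute deviations."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of the squared deviations."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical 100-day series.
actual = [100.0] * 100
always_off_by_one = [101.0] * 100
one_big_miss = [100.0] * 99 + [200.0]

mae_small = mean_absolute_error(actual, always_off_by_one)
mae_big = mean_absolute_error(actual, one_big_miss)
mse_small = mean_squared_error(actual, always_off_by_one)
mse_big = mean_squared_error(actual, one_big_miss)

print(f"MAE: {mae_small} vs {mae_big}")
print(f"MSE: {mse_small} vs {mse_big}")
```

Both forecasts have an MAE of 1, but the MSE of the forecast with the single large miss is 100 times higher. If Manager B cares most about extreme market situations, a metric that behaves like the MSE might be the better fit.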
The R² (coefficient of determination) compares the model under evaluation against a simple baseline model. The advantage is that the output is easy to interpret. A value of 1 describes the perfect model, while a value close to 0 (or even negative) describes a model with room for improvement. This metric is commonly used among economists and econometricians and is, therefore, a metric to consider in some industries. However, it is also relatively easy to get a high R², which makes models hard to compare.
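Assuming the metric described here is the coefficient of determination (R²), the baseline is a model that always predicts the mean of the actual values. The sketch below uses invented numbers to show the three cases mentioned above: a perfect model, the baseline itself, and a model that is worse than the baseline.

```python
def r_squared(actual, predicted):
    """1 minus the ratio of the model's squared error to the baseline's.

    The baseline always predicts the mean of the actual values.
    Not defined when all actual values are identical (division by zero).
    """
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical actual values with a mean of 5.
actual = [2.0, 4.0, 6.0, 8.0]

perfect = r_squared(actual, actual)                 # model hits every value
baseline = r_squared(actual, [5.0] * 4)             # model equals the baseline
worse = r_squared(actual, [8.0, 6.0, 4.0, 2.0])     # trend predicted backwards

print(perfect, baseline, worse)
```

The perfect model scores 1, the mean prediction scores 0, and the forecast that gets the trend backwards scores -3.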
The Mean Absolute Percentage Error (MAPE) measures the absolute deviation of the predicted values from the actual values. In contrast to the MAE, the MAPE expresses the deviations in relative terms, which makes it very easy to interpret and to compare. The MAPE has its own set of drawbacks and caveats. Fortunately, my colleague Jan already wrote an article about it. Check it out here if you want to learn more.
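Here is a minimal sketch of the MAPE, with made-up numbers. One caveat that follows directly from the formula is that it is undefined whenever an actual value is zero, so the sketch raises an error in that case.

```python
def mean_absolute_percentage_error(actual, predicted):
    """Average absolute deviation relative to the actual values, in percent."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100 * sum(ratios) / len(ratios)


# Hypothetical values on very different scales, each forecast 10% too high.
actual = [100.0, 10.0]
predicted = [110.0, 11.0]

mape = mean_absolute_percentage_error(actual, predicted)
print(f"MAPE: {mape:.1f}%")
```

In absolute terms the two errors are 10 and 1 unit, but relative to the actual values both forecasts are 10% off. That is what makes the MAPE comparable across series of different sizes.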
In either case, classification or regression, the “right” answer to the problem depends on how the problem is actually defined. Before applying the latest machine learning algorithm, it is crucial that the business question is well defined. A strong collaboration with the client team is necessary and is the key to achieving the best result for the client. There is no one-size-fits-all data science solution. Even though the underlying problem is the same for every stakeholder in the bank, it might be worth training a separate model for each department. It all boils down to the business needs!
We still haven’t covered several other problems, which might arise in subsequent steps. How is the default of a loan defined? What is the prediction horizon? Do we have enough data to cover all business cycles? Is the model just used internally or do we have to explain the model to a regulating agency? Should we optimize the model for some kind of internal resource constraints? To discuss this and more, feel free to reach out to me at firstname.lastname@example.org or send me a message via LinkedIn.