Loan origination, and the interest earned on it, makes up the lion’s share of business for most banks. For both banks and private lending institutions, the ability to forecast whether a loan applicant will fulfill their obligation is therefore immensely important. This is where delinquency prediction plays a major role in determining how successful a bank’s loan origination, and the subsequent repayment, will be.
Banks and private lending institutions have started using a number of new-age tools and techniques to improve their lending decisions while growing and retaining their customer base and expanding along with dynamic markets. Using third-party data sources in personal lending, and regression analysis as a fintech tool for predicting successful loan applications, are among the approaches they have successfully adopted. However, there is more to lending than improving the certainty of successful loan origination through regression analysis. Borrowers who seem to be perfect candidates at origination may show erratic payment and financial behavior once their loan is approved, which underwriters may not be able to predict at the time of origination. This behavior greatly increases the lending risk for banks and alternative-lending fintechs, as it jeopardizes the chances of full principal repayment along with interest. Delinquency prediction helps lenders see this risk by observing and studying a large set of consumers and their financial behaviors with statistical models, which remove biases and errors to produce a score that is as close to accurate as possible.
As mentioned above, loan-behavior prediction can be a great tool for deciding renewals of credit lines and loans, as well as for judging whether a customer can keep to their repayment schedule. The simplest way to do so is to use machine learning to assimilate all the available data, apply it to statistical models, and produce scores that estimate the probability of delinquency.
Since we are talking about machine learning, we need to be careful about the information, or variables, that we feed the tool. For each loan application, the bank receives plenty of information: existing loans, transaction records, and credit cards, for example. While these are the basic details required on any loan application, a few more variables are needed to study and estimate the success rate of the loan. Some of these are calculated by the lender from the information shared by the borrower, supported by relevant documentation:
• Credit Score/FICO Score
• Interest Rate
• Annual Income
• Debt-to-income ratio of the borrower
• Number of days with a credit line that the borrower has had
• Borrower’s revolving balance
• Utilization rate of the borrower’s revolving line
• Number of times the borrower had not paid in full or gone 30+ days past the due date on payment in the last two years.
Lending institutions also need to keep a record of their own loan history and performance data to feed into the delinquency-prediction model and produce the most accurate possible scores. Together with the variables gathered from borrowers, this historical data supplies the inputs the machine learning model needs to churn the numbers, apply them to the predetermined model, and give you the score.
The count for each variable should be about the same as the number of loans being studied. For example, if the bank is studying 280,000 loans, then the count for each variable should be within roughly 100 of that figure. A large discrepancy in the count for some variable indicates that it does not have enough data behind it, which can undermine the effectiveness of the predictive model. To ensure missing data does not disrupt the model’s performance, the missing values can be substituted with zero, bringing the count back up to the original 280,000. Alternatively, a binary yes/no indicator can be used where a specific value is not required. If a variable strongly affects the score but does not have enough data to support the study, it can be dropped altogether. Another option is mean imputation: average the observed values for a variable and substitute the missing entries with that average, which gives better results, less error, and unbiased data filling.
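The missing-data options above can be sketched with pandas on a small, hypothetical loan table (the column names and values here are illustrative assumptions, not a real lender's schema):

```python
import numpy as np
import pandas as pd

# Hypothetical loan records with gaps in some of the variables discussed above.
loans = pd.DataFrame({
    "fico_score":    [720, 680, np.nan, 750, 640],
    "annual_income": [55000, np.nan, 48000, 90000, np.nan],
    "delinq_2yrs":   [0, 1, 0, np.nan, 2],
})

# Option 1: substitute missing values with zero.
zero_filled = loans.fillna(0)

# Option 2: reduce a variable to a binary yes/no indicator of presence.
has_income = loans["annual_income"].notna().astype(int)

# Option 3: mean imputation -- replace each gap with the column's average.
mean_filled = loans.fillna(loans.mean())

# Option 4: drop any column that is too sparse (here, more than half missing).
dense = loans.dropna(axis=1, thresh=int(0.5 * len(loans)) + 1)
```

Which option to use depends on the variable: zero-filling suits counts like `delinq_2yrs`, while mean imputation is usually safer for amounts like income.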
Once the counts and values of the variables are verified, the next step is to look at the relationships between the variables. This is done using pairwise grids and heatmaps. If a very strong correlation is observed between two variables, one of them is dropped to avoid problems in interpreting each variable’s individual relationship to delinquency and to protect the performance of the models used for the prediction.
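The correlation matrix behind such a heatmap can be computed directly, and one member of any highly correlated pair dropped. The data below is synthetic, and the 0.9 cutoff is an illustrative assumption; revolving balance and utilization are constructed to move together:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic features: utilization is nearly a rescaling of the balance.
revol_bal  = rng.normal(8000, 2500, n)
revol_util = revol_bal / 20000 + rng.normal(0, 0.01, n)
int_rate   = rng.normal(0.12, 0.03, n)

X = pd.DataFrame({"revol_bal": revol_bal,
                  "revol_util": revol_util,
                  "int_rate": int_rate})

corr = X.corr()  # this matrix is what a heatmap visualizes

# Scan the upper triangle and drop one variable from each pair whose
# absolute correlation exceeds the threshold.
threshold = 0.9
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
X_reduced = X.drop(columns=to_drop)
```

Here `revol_util` is dropped because it carries essentially the same signal as `revol_bal`.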
Additionally, leakage should be avoided at all costs, as it can bias the model. Leakage occurs when information that would not be available at prediction time is inadvertently introduced into the training dataset.
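One common form of leakage is fitting a preprocessing step, such as a scaler, on the full dataset before splitting it, so that test-set statistics influence training. A minimal scikit-learn sketch (on synthetic data, an assumption for illustration) avoids this by keeping preprocessing inside a pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fitting the scaler inside the pipeline confines its statistics to the
# training fold -- test-set information never leaks into training.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
```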
Finally, we come to the model-selection part of delinquency prediction using machine learning. For this, we need to know what result we are looking for. In the case of delinquency prediction, the result is binary: yes or no. For this purpose, we can use one of the following machine learning classification models:
• Logistic regression
• Decision trees
• Neural networks
• Support vector machines
• Nearest neighbors
• A few others
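As a sketch of how such a binary classifier is fit, the snippet below trains a logistic regression and a decision tree on synthetic loan data. The delinquency rule generating the labels, and the applicant's FICO/DTI values, are invented assumptions for illustration only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 2000
fico = rng.normal(690, 50, n)
dti  = rng.uniform(0, 40, n)

# Synthetic ground truth: lower FICO and higher DTI raise the delinquency odds.
logit = -0.03 * (fico - 690) + 0.05 * (dti - 20)
delinquent = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([fico, dti])

clf = LogisticRegression(max_iter=1000).fit(X, delinquent)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, delinquent)

# Probability of delinquency for a hypothetical applicant: FICO 600, DTI 35.
p = clf.predict_proba([[600, 35]])[0, 1]
```

A weak applicant like this one lands well above a 0.5 probability of delinquency under the synthetic rule.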
If required, two or more models can be combined into a final ensemble model that reduces generalization error. However, the success of these ensemble models rests on the assumption that each component model captures a different aspect of the data and thereby finds part of the truth.
There are different methods of building ensemble models:
Blending: This method presents the average of the predictions of all the models.
Bagging: In this method, different samples of the dataset are fed to different models, and the final prediction is the majority vote of the models collectively.
Boosting: In this method, models are built sequentially, where each new model learns from and is built on the residuals of the previous models. The final output combines the outputs of all the models, each weighted by a learning rate, signified by λ.
Stacking: This method requires analysts to build base models, also known as base learners, apply the dataset to them, and then feed their outputs into a final model, which produces the final prediction.
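The ensemble methods above can be sketched with scikit-learn's built-in estimators. The dataset is synthetic and the base learners are illustrative choices, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              StackingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Blending (as soft voting): average the models' predicted probabilities.
blend = VotingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
                ("lr", LogisticRegression())],
    voting="soft")

# Bagging: each tree sees a different bootstrap sample; predictions are voted.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                        random_state=0)

# Boosting: trees are fit sequentially on the residuals of their
# predecessors, each weighted by a learning rate (the λ in the text).
boost = GradientBoostingClassifier(learning_rate=0.1, random_state=0)

# Stacking: base learners' outputs become inputs to a final meta-model.
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("lr", LogisticRegression())],
    final_estimator=LogisticRegression())

scores = {name: m.fit(X, y).score(X, y)
          for name, m in [("blending", blend), ("bagging", bag),
                          ("boosting", boost), ("stacking", stack)]}
```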
Once selected, the models need to be evaluated and tuned, because a single train/test split generally does not provide the best estimate of error on a particular test set. To rule this problem out, it is common to conduct K-fold cross-validation by training multiple instances of a model: the model is trained and evaluated K times, each time on a different split of the data.
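With scikit-learn, K-fold cross-validation is a one-liner; the snippet below runs it with K = 5 on synthetic data (an illustrative assumption):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=6, random_state=1)

# 5-fold cross-validation: the model is trained and scored 5 times,
# each time holding out a different fifth of the data for evaluation.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
mean_score = scores.mean()
```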
There are different tuning techniques that can be adopted as well, such as searching over hyperparameter values and keeping the combination that performs best under cross-validation.
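One such technique is a grid search, sketched below with scikit-learn on synthetic data; the parameter grid is an illustrative assumption:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=2)

# Grid search: try each hyperparameter combination under cross-validation
# and keep the one with the best mean validation score.
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"max_depth": [2, 4, 6],
                                "min_samples_leaf": [1, 5, 10]},
                    cv=3)
grid.fit(X, y)
best = grid.best_params_
```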
Overall, the delinquency prediction model presents a result that has been vetted, time and again, by handling missing values, selecting and tuning models, and validating them with machine learning. Delinquency prediction along these lines can help lenders substantially reduce their lending and refinancing risks.