1. Introduction
Today, credit cards are nearly ubiquitous. An individual must typically be 18 years of age before applying for credit, but parents can add their younger children as authorized users on their own credit cards. The real question asked by banks and credit card companies, therefore, is whether the primary account holder is at high or low risk of failing to repay the credit or loan; credit is granted based on risk level, with high risk earning a low credit limit and low risk earning a high credit limit. Institutions use various data mining algorithms to determine credit limits for individuals. The typical features used to help determine credit limits include age, education, employment, gender, income, and marital status (1). Even with these features, a question of predictive accuracy remains, that is, how accurately an institution can predict risk and grant credit accordingly. If an institution grants too low a credit limit/loan for an individual, the institution may lose business to competitors, but if it grants too high a credit limit/loan, it may lose money if that individual does not repay the credit/loan.
The novelty of this work is that it shows how to improve the accuracy of predicting credit limits/loan amounts using synthetic feature (SF) generation. By creating secondary groupings and including both the original bins and the synthetic bins, the classification accuracy and other statistical measures, such as precision and ROC area, improved substantially.
Four different datasets were used for this analysis. Three datasets (datasets 1, 3, and 4) were used for prediction of credit limits/loans, and one dataset (dataset 2) was used for predicting bank approvals of credit/loans. In this work, first, feature selection was performed using information gain. Then, the classification was performed using features with higher information gain and the synthetic features. For classification, three different tree-based classifiers (Random Forest, Random Tree, and REPTree) and one non-tree-based classifier (Naïve Bayes) were used.
The rest of this article is organized as follows. Section 2 presents the related works; section 3 presents the datasets and preprocessing; section 4 presents the classifiers used; section 5 presents the results and discussion; and section 6 presents the conclusions.
2. Related works
Zeng (2) studied effective binning in credit scoring, focusing on weight of evidence and regression modeling for the binning of continuous variables. The feature age, typically a variable in any financial dataset, was used as the example for improving the binning process.
Danenas and Garsva (3) presented their work on credit risk evaluation based on linear support vector machine classifiers. This was combined with external evaluation and testing sliding windows, with a focus on larger dataset applications. These authors concluded that, using real-world financial datasets, for example, from the SEC EDGAR database, their method produced results comparable to other classifiers such as logistic regression and thus could be used for the future development of real credit risk evaluation models.
Lessmann et al. (4) compared several classification algorithms to credit scoring. They examined the extent to which the assessment of alternative scorecards differs across established and novel indicators of predictive accuracy.
Ala’raj and Abbod (5) presented a new ensemble combination approach based on classifier consensus to combine multiple classifier systems of different classification algorithms. Specifically, five well known base classifiers were used: Neural Networks, Support Vector Machines, Random Forests, Decision Trees, and Naïve Bayes. Their experimental results demonstrated the ability of their proposed combinations to improve predictive performance against all base classifiers. Their model was validated over five real-world credit scoring datasets.
Musyoka (6) compared data mining algorithms on the credit card approval dataset. This research focused on masked attributes and compared the Bayesian Network, Decision Tree, and J48 classifiers. The results identified the Bayesian Network algorithm as the most accurate, returning an accuracy of 86.21%.
Tanikella (7) examined the key features considered for issuing credit cards to customers. This work used machine learning to find that the attributes prior default, years employed, credit score, and debt were the most useful features.
Zhao (8) analyzed the prediction accuracy of multiple regression models and classifiers based on predetermined performance criterion. The experimental models used were Logistic Regression, Linear Support Vector Classification (Linear SVC), and the Naïve Bayes Classifier. In this study, linear SVC performed the best.
Although quite a few works, as presented above, have addressed different aspects of credit analysis using machine learning, none has used the concept of synthetic feature generation, which is the uniqueness and novelty of this article.
3. Datasets and processing
Four datasets were selected for this research: German Credit Risk (9), Credit Screening (10), Credit (11), and Bank Churners (12). All datasets contained attributes or features relating to credit cards or credit limits. In the tables describing the respective datasets, the attributes that appear in all four datasets are identified with four asterisks (****), the attributes that appear in three datasets are identified with three asterisks (***), and the attributes that appear in two datasets are identified with two asterisks (**). Preprocessing played a major role in this work, hence preprocessing is explained in detail in this section.
3.1. Preprocessing using feature selection
Feature selection is the process of identifying and selecting features or attributes within the dataset that will aid in improving the accuracy of the returned results. The selection process can be manual or automatic, but the objective is the same – to achieve higher predictive accuracy. For this research, both manual and automatic feature selection were used on each dataset. Once the irrelevant or unusable attributes were removed, the datasets were imported into Weka, and Information Gain was run on each dataset using the Ranker search method. The output identified the information gain of each attribute. Information gain is an entropy-based measure that determines the features most relevant for the classification of a dataset.
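As a concrete illustration, the information gain of an attribute a with respect to the class T is IG(T, a) = H(T) − H(T | a), the reduction in class entropy achieved by splitting on a. The Python sketch below mirrors what Weka's InfoGainAttributeEval with the Ranker search produces; it is illustrative only, the column names are hypothetical, and the actual analysis in this work was run in Weka.

```python
import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    """Shannon entropy H(T) of a discrete label column."""
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def information_gain(df: pd.DataFrame, attribute: str, target: str) -> float:
    """IG(T, a) = H(T) - H(T | a): entropy reduction from splitting on a."""
    weights = df[attribute].value_counts(normalize=True)
    h_cond = sum(w * entropy(df.loc[df[attribute] == v, target])
                 for v, w in weights.items())
    return entropy(df[target]) - h_cond

# Tiny hypothetical example: score one attribute against a class column.
df = pd.DataFrame({"housing": ["own", "rent", "own", "free"],
                   "credit_class": ["high", "low", "high", "low"]})
print(information_gain(df, "housing", "credit_class"))
```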
3.2. Dataset 1: German credit risk
The German Credit Risk dataset, obtained from Kaggle, was provided by Hofmann (9). This dataset consisted of 1,000 instances and ten attributes. Attribute descriptions and sample values are presented in Table 1.
Table 1. Dataset 1: German credit risk (9).
3.2.1. Preprocessing the German credit risk dataset
3.2.1.1. Calculating information gain
For preprocessing, first the information gain was calculated using the original attributes. As shown in Figure 1, in the German Credit Risk dataset, the attribute with the highest information gain was credit amount, followed by housing and duration. The attributes that are not in Figure 1 have information gain values very close to zero.
3.2.1.2. Removing attributes
The attribute Purpose was omitted; based on its description and the data it contained, it was not deemed relevant for this study.
3.2.1.3. Binning and synthetic feature generation
For the German Credit Risk dataset (9), synthetic feature (SF) generation was utilized for the attributes that had lower information gain: Age, Checking Account, Duration, and Saving Account.
The attribute, Age, was binned in two ways, as shown in Table 2. For regular binning, Age was grouped into four buckets, and for synthetic feature generation, the groups were based on the accepted classifications of the age generations (13). Figures 2, 3 show the distributions of each of the binning criteria.
The attribute, Job, was binned in two ways, as shown in Table 3.
The attributes, Checking and Savings, were binned as per Table 4.
The Duration attribute was grouped into four buckets and was narrowed down to three buckets for synthetic feature generation, as shown in Table 5.
The Credit attribute was grouped into four buckets and was narrowed down to three buckets for synthetic feature generation, as shown in Table 6.
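To make the two-way binning concrete, the following pandas sketch bins an attribute such as Age both ways and keeps both columns, as done throughout this work. The actual cut points are those in Table 2; the boundaries and labels below are illustrative assumptions only.

```python
import pandas as pd

df = pd.DataFrame({"age": [19, 23, 35, 47, 61, 72]})

# Regular binning: four numeric buckets (boundaries assumed for illustration).
df["age_bin"] = pd.cut(df["age"], bins=[18, 30, 45, 60, 120],
                       labels=["19-30", "31-45", "46-60", "61+"])

# Synthetic feature: a second, generation-based grouping in the spirit of
# (13); the boundary ages here are approximate.
df["age_sf"] = pd.cut(df["age"], bins=[18, 26, 42, 58, 120],
                      labels=["Gen Z", "Millennial", "Gen X", "Boomer+"])
print(df)
```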
3.3. Dataset 2: Credit screening
The Credit Screening dataset was obtained from the UCI Machine Learning Repository and provided by Keogh et al. (10). The dataset contained 16 variables, but the variables were masked, hence the converted attributes were used as per Rane (14). This dataset consisted of 690 labeled instances. A description of the attributes is presented in Table 7.
Table 7. Dataset 2: credit screening (10).
3.3.1. Preprocessing the credit screening dataset
3.3.1.1. Calculating information gain
Information gain was calculated using the original set of attributes. As shown in Figure 4, in the Credit Screening dataset (10), the attribute with the highest information gain was Prior Default, followed by Credit Score, with Employed following closely behind. The attributes that are not in Figure 4 have information gain values very close to zero.
3.3.1.2. Removing attributes
Citizen, education level, ethnicity, and zip code were removed.
• Citizen had three values: g, p, and s. Since g accounted for 90.5% of the applications, the assumption was made that most of the applicants were citizens, and therefore the attribute was not used.
• The Education Level attribute had 14 unique values in alpha form, which were not easily interpretable, hence the attribute was removed.
• The Zip Code attribute had values of between one and four digits, hence was not used.
• Ethnicity had too many values, some of which were inconsistent, hence this attribute was removed.
3.3.1.3. Handling missing values
All missing values were labeled “unknown.”
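In pandas terms, this amounts to a one-line fill; a minimal sketch, assuming the Credit Screening data has been loaded into a DataFrame named df:

```python
# Replace every missing value with the literal category "unknown".
df = df.fillna("unknown")
```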
3.3.1.4. Binning and synthetic feature generation
The attributes Age, Credit Score, and Income were binned with synthetic feature generation. Age was grouped as per Dataset 1 (Table 2), hence is not shown here. The binning and synthetic feature generation of Credit Score and Income are shown in Tables 8, 9, respectively. Debt and Years Employed were also binned, as shown in Tables 10, 11, respectively.
3.4. Dataset 3: Credit
The third dataset, Credit, obtained from Kaggle.com, was provided by Iacob (11). This dataset contained 400 instances with 11 attributes, as shown in Table 12. This research focuses on the effects of the attributes on credit limit.
Table 12. Dataset 3: credit (11).
3.4.1. Preprocessing the credit dataset
3.4.1.1. Calculating information gain
Information Gain was calculated on the original attributes. From Figure 5, it can be noted that the attribute with the highest information gain was monthly balance, with credit rating being the second highest. There are far fewer attributes in Figure 5 than in Table 12; the attributes that are not in Figure 5 have information gain values very close to zero.
3.4.1.2. Removing attributes
Student, ethnicity, income, ID, and number of cards were removed.
• The Student attribute was removed because the other datasets did not contain a similar attribute and only 10% of the cardholders were students.
• The Ethnicity attribute was removed because it was not adequately identified in the other datasets.
• The Income data did not appear correct, hence the attribute was not used.
• The ID attribute, a record identifier, was removed.
• Number of Cards was not used because its information gain was close to zero and the other datasets did not include this attribute.
3.4.1.3. Binning and synthetic feature generation
For the Credit Limit dataset, synthetic feature creation was applied on two attributes: Age and Credit Limit. Age was grouped in the same buckets as the German Credit (Table 2). Credit Limit was grouped once in numerical buckets and the second grouping (synthetic feature) was High, Medium, and Low, as shown in Table 13. Other attributes that were binned were monthly balance (Table 14), credit rating (Table 15), and education (Table 16).
3.5. Dataset 4: Bank churners dataset
The Bank Churners dataset, obtained from Kaggle.com, was provided by Goyal (12). This dataset contained 10,127 instances with 27 attributes, as shown in Table 17. The Bank Churners dataset focuses on bank customers and the relationship between attrition and the other attributes in the dataset.
Table 17. Dataset 4: bank churners (12).
3.5.1. Preprocessing the bank churners dataset
3.5.1.1. Calculating information gain
Information gain was calculated using the original attributes. As shown in Figure 6, in the Bank Churners dataset, the attribute with the highest information gain was income, followed by gender and revolving balance. The attributes that are not in Figure 6 have information gain values very close to zero.
3.5.1.2. Removing attributes
Attrition, Card Category, Client Number, Contacts, Dependents, Months Inactive, Months Total, Open to Buy, Total Amount Change, Total Transaction Amount, Total Transaction Count, Difference Transaction Count Quarterly, and Utilization Ratio Average were removed since they were not considered relevant for this analysis.
3.5.1.3. Binning and synthetic feature generation
For Dataset 4, synthetic feature creation was done for Age (grouped as per the German Credit dataset, Table 2), Credit Limit (grouped as per the Credit dataset, Table 13), and Married. The Married attribute contained the following values: divorced, married, single, and unknown. Divorced and Single were grouped together into “no” (not married), leaving the grouped values for the Married attribute as Yes, No, and Unknown, as shown in Table 18 and sketched below. Income was binned as per Table 19, which also shows the corresponding synthetic feature. Balance Revolving was binned as per Table 20, and months with bank were binned as per Table 21.
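A minimal pandas sketch of the Married regrouping in Table 18, assuming the column carries the name Marital_Status as in the Kaggle source:

```python
import pandas as pd

df = pd.DataFrame({"Marital_Status": ["Married", "Single",
                                      "Divorced", "Unknown"]})
married_map = {"Divorced": "no", "Single": "no",
               "Married": "yes", "Unknown": "unknown"}
df["married_sf"] = df["Marital_Status"].map(married_map)
print(df)
```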
4. Classifiers used
For classification, three tree-based classifiers (Random Forest, Random Tree, and REPTree) and one non-tree-based classifier (Naïve Bayes) were used.
4.1. Random forest
Random Forest is a widely used machine learning classifier that constructs multiple decision trees randomly; the term “forest” stems from the imagery of the many trees being created. In Random Forest, each tree is grown independently without pruning, and each node is split using a random subset of the available features (15). There is also work on how a user can prune a Random Forest: Kulkarni and Sinha (16) showed a way of pruning by limiting the number of trees.
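This study ran Weka's RandomForest; for readers working in Python, a roughly equivalent scikit-learn sketch, with dummy data standing in for the preprocessed attributes, looks as follows:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Dummy data standing in for the binned/synthetic credit attributes.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Unpruned trees (max_depth=None), each split drawn from a random subset
# of features, as described above; 10-fold cross-validation as in Weka.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            max_depth=None, random_state=0)
print(cross_val_score(rf, X, y, cv=10).mean())
```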
4.2. Random tree
The Random Tree classifier is similar to the Random Forest classifier, but it constructs only one decision tree, based on a random set of attributes; every node is split using the best split among that random subset of variables (17). Essentially, Random Tree is the simpler classifier, while Random Forest tends to achieve better accuracy because constructing multiple trees decreases the variance.
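Weka's RandomTree has no exact scikit-learn counterpart; the closest analogue, offered here only as an approximation, is a single unpruned decision tree restricted to a random feature subset at each split:

```python
from sklearn.tree import DecisionTreeClassifier

# A single unpruned tree choosing the best split among a random subset of
# attributes at each node -- an approximation of Weka's RandomTree.
random_tree = DecisionTreeClassifier(max_features="sqrt", max_depth=None,
                                     random_state=0)
```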
4.3. The REPTree classifier
The reduced error pruning tree (REPTree) builds a decision tree based on information gain (18). The tree that is built may be a decision or regression tree, but here it is used for classification, and the algorithm creates multiple trees in different iterations (18). During pruning, the algorithm starts at the bottom of the tree and works its way to the top; at each node, it assesses whether replacing the subtree with the most frequent class would improve accuracy, and it prunes away subtrees that would cause a reduction in accuracy (19). REPTree sorts numeric attributes only once.
4.4. The Naïve Bayes classifier
Naïve Bayes was chosen as an additional classifier because it is not a tree classifier. The Naïve Bayes classifier assumes that all features are conditionally independent of one another given the class attribute (20).
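Because the binned and synthetic attributes are categorical, a categorical Naïve Bayes is a natural fit in Python; the sketch below uses toy data and assumed column names (the study itself used Weka's NaiveBayes):

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy binned data; real runs used the binned/synthetic credit attributes.
X = pd.DataFrame({"age_sf": ["Gen Z", "Gen X", "Boomer+", "Gen Z"],
                  "credit_sf": ["Low", "High", "Medium", "Low"]})
y = ["low_limit", "high_limit", "high_limit", "low_limit"]

# Encode bin labels as integer codes, then fit the categorical model.
nb = make_pipeline(OrdinalEncoder(), CategoricalNB())
nb.fit(X, y)
print(nb.predict(X.iloc[:1]))
```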
5. Results and discussion
The four different classifiers were run using Weka. For each dataset and each classifier, we looked at the following statistical measures: accuracy, true positive rate (TPR), false positive rate (FPR), precision, F-measure, and ROC area.
Accuracy is the ratio of a model’s correct predictions (TP + TN) to the total number of instances, calculated by the following equation:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

TPR, also called sensitivity or recall, measures the proportion of actual positive instances that were identified as positive, given by the following equation:

$$\mathrm{TPR} = \frac{TP}{TP + FN}$$

FPR measures the proportion of actual negative instances that were incorrectly identified as positive, given by the following equation:

$$\mathrm{FPR} = \frac{FP}{FP + TN}$$

Precision measures the proportion of positive identifications that were actually positive, given by the following equation:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

F-measure is the harmonic mean of precision and recall, calculated by the following equation:

$$F\text{-}\mathrm{measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

The ROC curve plots the TPR against the FPR (21).
Where:
• True Positives (TP) are instances correctly identified as positive.
• True Negatives (TN) are instances correctly identified as negative.
• False Positives (FP) are instances incorrectly identified as positive.
• False Negatives (FN) are instances incorrectly identified as negative.
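These definitions translate directly into code; a small Python sketch with hypothetical confusion-matrix counts:

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """The measures reported in the results tables, from a 2x2 confusion matrix."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn)                      # sensitivity / recall
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)
    return {"accuracy": accuracy, "TPR": tpr, "FPR": fpr,
            "precision": precision, "F-measure": f_measure}

print(classification_metrics(tp=70, tn=20, fp=5, fn=5))  # hypothetical counts
```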
For each of the classification runs, various combinations of original attributes (OR), original binned attributes (Bin), and synthetic features (SF) were used. The original attributes and original binned attributes were selected based on information gain that was performed on each respective dataset.
To select the best results, the run that achieved the highest statistical measures with the smallest set of attributes was reported for each classifier.
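This selection rule can be sketched in a few lines of Python; the attribute groupings below are hypothetical, and clf stands for any of the four classifiers (the actual runs were performed in Weka):

```python
from sklearn.model_selection import cross_val_score

# Hypothetical attribute groupings: "Bin" columns are the original binned
# attributes; "SF" columns are the synthetic bins.
feature_sets = {
    "OR":        ["age", "duration", "credit_amount"],
    "OR+Bin":    ["age_bin", "duration_bin", "credit_bin"],
    "OR+Bin+SF": ["age_bin", "age_sf", "duration_bin", "credit_sf"],
}

def best_run(df, y, clf):
    """Return the smallest attribute set achieving the top mean CV accuracy."""
    scores = {name: cross_val_score(clf, df[cols], y, cv=10).mean()
              for name, cols in feature_sets.items()}
    top = max(scores.values())
    # Among sets tied at the top score, prefer the one with fewest columns.
    return min((name for name, s in scores.items() if s == top),
               key=lambda n: len(feature_sets[n]))
```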
5.1. Classification results for the German credit dataset
For the German credit dataset (9), credit amount, a continuous attribute, was used as the class variable for classification. Tables 22–25 present the statistical results of the classifications.
5.1.1. Naïve Bayes results
Results of the Naïve Bayes classification, presented in Table 22, show that 10 attributes with three synthetic features had the best results, with a classification accuracy of 73.6%. The three synthetic features used in this run were for age, credit, and duration. The other statistical measures for this run were also high. Without any synthetic features, the classification results for this dataset were poor (accuracy of 46.1%).
5.1.2. Random forest results
Results of the Random Forest classification, presented in Table 23, show that five attributes with one synthetic feature had the best results in terms of classification accuracy (73.2%). In this run, only the synthetic feature for credit was used. The other statistical measures for this run were also high. Without any synthetic features, the classification results for this dataset were poor (accuracy of 38%).
5.1.3. Random tree results
Results of the Random Tree classification, presented in Table 24, show that five attributes with one synthetic feature had the best results in terms of classification accuracy (72.3%). Again, the only synthetic feature used was for credit. The other statistical measures for this run were also high. Without any synthetic features, the classification results for this dataset were poor (accuracy of 37.6%).
5.1.4. REPTree results
Results of the REPTree classification, presented in Table 25, show that eight attributes with one synthetic feature had a slightly higher classification accuracy (72.7%) than the other runs. The only synthetic feature used was for credit. Without any synthetic features, the classification results for this dataset were poor (accuracy of 45.7%).
5.1.5. Overall classifier comparison for the German credit dataset
From Tables 22–25, it can be noted that even using one synthetic attribute greatly improved the classification accuracy and other statistical measures.
A comparison of the classification accuracy of all the classifiers on the German Credit dataset (Table 26) shows that two of the four classifiers performed best with five attributes and only one synthetic feature, Naïve Bayes performed best with 10 attributes and three synthetic features, and REPTree performed best with eight attributes and one synthetic feature. The highest classification accuracy was achieved with the Naïve Bayes classifier, and the largest improvement in accuracy was achieved with the Random Forest classifier (35.2% improvement, as shown in Table 27).
5.2. Classification results for the credit screening dataset
For the Credit Screening dataset (10), the attribute approved, a binary attribute, was used as the class variable for classification. Tables 28–31 present the statistical results of the classifications. This dataset performed well even without the synthetic features. Using three synthetic features, for age, credit score, and income, only very slightly improved the classification accuracy for the Random Forest and Random Tree algorithms.
5.2.1. Overall classifier comparison for the credit screening dataset
A comparison of the classification accuracy of all the classifiers on the Credit Screening dataset (Table 32) shows that Random Forest and Random Tree had the highest accuracy using three synthetic features, but Naïve Bayes and REPTree did not perform better with synthetic features. An analysis of the accuracy improvement on this dataset (Table 33) shows very little improvement after adding synthetic features; in fact, accuracy decreased slightly for Naïve Bayes and REPTree.
5.3. Classification results for the credit dataset
For the Credit dataset (11), credit limit was used as the class variable for classification. Tables 34–37 present the statistical results of the classifications.
5.3.1. Naïve Bayes results
Results of the Naïve Bayes classification, presented in Table 34, show that in terms of classification accuracy, seven attributes with one synthetic feature, credit limit, had the best results, with a classification accuracy of 77.5%. Other statistical measures were also high for this run.
5.3.2. Random forest results
Results of the Random Forest classification, presented in Table 35, show that six attributes with two synthetic features, age and credit limit, had the best results in terms of classification accuracy (77.5%).
5.3.3. Random tree results
Results of the Random Tree, presented in Table 36, show that six attributes with two synthetic features, age and credit limit, had the best results in terms of classification accuracy (76.8%).
5.3.4. REPTree results
Results of the REPTree classification, presented in Table 37, show that five attributes with one synthetic feature, credit limit, had the best results in terms of classification accuracy (79.3%).
5.3.5. Overall classifier comparison for the credit dataset
For the Credit dataset, for all classifiers, there was a significant increase in classification accuracy and the other statistical measures after the synthetic features were added, as shown in Table 38. Comparing the classifiers, REPTree performed the best, at 79.3% classification accuracy with five attributes and one synthetic feature. Naïve Bayes also performed best with a single synthetic feature (credit limit), whereas Random Forest and Random Tree performed best with six attributes and two synthetic features (age and credit limit). Other statistical measures were also higher for this set of runs.
From Table 39, it can be observed that Random Tree had the highest improvement in accuracy (24.2%), closely followed by Random Forest at 24%. The other two classifiers, Naïve Bayes and REPTree, also had a significant improvement with the addition of synthetic attributes.
5.4. Classification results for the bank churners dataset
For the Bank Churners dataset (12), credit limit was used as the class variable for classification. Tables 40–43 present the statistical results of the classifications.
Results of the Naïve Bayes classification, presented in Table 40, show that in terms of classification accuracy, four attributes with one synthetic feature, credit limit, had the best results, with a classification accuracy of 71.1%. Results of the Random Forest, Random Tree, and REPTree, presented in Tables 41–43, respectively, also show that four attributes with one synthetic feature, credit limit, had the best results in terms of classification accuracy (72.7, 72.7, and 72.5%, respectively).
5.4.1. Overall classifier comparison for the bank churners dataset
For this dataset, adding one synthetic attribute improved the classification accuracy significantly, and all four classifiers performed best with four attributes and one synthetic attribute, as shown in Table 44. From Table 45, it can be noted that Random Tree had the highest improvement in accuracy, at 22.7%, followed by Random Forest at 20.7%. The other two algorithms also had significant improvements in accuracy with just one synthetic feature.
6. Conclusion
Three of the four datasets used in this research showed an improvement in accuracy and the other statistical measures when synthetic attributes were used. Overall, the tree-based classifiers, Random Forest, Random Tree, and REPTree, showed better performance, as well as larger performance improvements, than the non-tree-based classifier, Naïve Bayes.
Author contributions
SB conceptualized the article and was responsible for guiding the research and directing the formulation of the article. JW also helped to conceptualize the article, did most of the preprocessing and processing of the data, and wrote the initial draft of the article. Both authors contributed to the article and approved the submitted version.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
References
1. Wongchinsri P, Kuratach W. A survey - data mining frameworks in credit card processing. Proceedings of the 2016 13th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON). Chiang Mai: IEEE (2016). p. 1–6.
2. Zeng G. A necessary condition for a good binning algorithm in credit scoring. Appl Math Sci. (2014) 8:3229–42.
3. Danenas P, Garsva G. Selection of support vector machines-based classifiers for credit risk domain. Expert Syst Appl. (2015) 42:3194–204.
4. Lessmann S, Baesens B, Seow HV, Thomas LC. Benchmarking state-of-the-art classification algorithms for credit scoring: an update of research. Eur J Operat Res. (2015) 247:124–36.
5. Ala’raj M, Abbod MF. Classifiers consensus system approach for credit scoring. Knowl Based Syst. (2016) 104:89–105.
6. Musyoka WM. Comparison of data mining algorithms in credit card approval. Int J Comput Inform Technol. (2018) 7:2.
7. Tanikella U. Credit Card Approval Verification Model. PhD thesis. San Marcos, CA: California State University San Marcos (2020).
8. Zhao Y. Credit card approval predictions using logistic regression, linear svm and naïve bayes classifier. 2022 International Conference on Machine Learning and Knowledge Engineering (MLKE). Guilin: IEEE (2022). p. 207–11. doi: 10.1109/MLKE55170.2022.00047
10. Keogh E, Blake C, Merz CJ. UCI Repository of Machine Learning Databases. (1998). Available online at: https://archive.ics.uci.edu/ml/datasets/credit+approval
13. Ricaldi LC. Three Essays on Consumer Credit Card Behavior. PhD thesis. Lubbock, TX: Texas Tech University (2015).
14. Rane K. Credit Card Approval Analysis. (2018). Available online at: https://nycdatascience.com/blog/student-works/credit-card-approval-analysis/
15. Belgiu M, Drăguţ L. Random forest in remote sensing: a review of applications and future directions. ISPRS J Photogramm Remote Sens. (2016) 114:24–31. doi: 10.1016/j.isprsjprs.2016.01.011
16. Kulkarni VY, Sinha PK. Pruning of random forest classifiers: a survey and future directions. 2012 International Conference on Data Science & Engineering (ICDSE). Cochin: IEEE (2012). p. 64–8. doi: 10.1109/ICDSE.2012.6282329
17. Mishra AK, Ratha BK. Study of random tree and random forest data mining algorithms for microarray data analysis. Int J Adv Electr Comput Eng. (2016) 3.
18. Kalmegh SR. Analysis of WEKA data mining algorithm reptree, simple cart and randomtree for classification of Indian news. Int J Innov Sci Eng Technol. (2015) 2.
19. Rokach L, Maimon O. Data mining with decision trees - theory and applications. 2nd ed. Singapore: World Scientific Publishing (2015).
20. Saritas MM, Yaşar AB. Performance analysis of ANN and Naive Bayes classification algorithm for data classification. Int J Intell Syst Appl Eng. (2019) 7:88–91.
21. Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Patt Recogn. (1997) 30:1145–59.
23. Frank E, Trigg LE, Holmes G, Witten IH. Naive bayes for regression (technical note). Mach Learn. (2000) 41:5–25. doi: 10.1023/A:1007670802811
24. Han J, Kamber M, Pei J. Data Mining: Concepts and Techniques. 3rd ed. Burlington, MA: Elsevier/Morgan Kaufmann (2012).
25. Seker S. AutoML: End-to-end Introduction From Optiwisdom. (2019). Available online at: https://towardsdatascience.com/automl-end-to-end-introduction-from-optiwisdom-c17fe03a017f
26. Taub J, Elliot M. The synthetic data challenge. Conference of European Statisticians Joint UNECE/eurostat Work Session on Statistical Data Confidentiality. Hague: UNECE (2019).
27. WEKA. Weka 3: Machine Learning Software in Java, Weka 3 - Data Mining With Open Source Machine Learning Software in Java. (2022). Available online at: https://www.cs.waikato.ac.nz/ml/weka/ (accessed on August 22, 2022).