Synthetic feature generation to improve accuracy in prediction of credit limits

Sikha Bagui* and Jennifer Walker

*Correspondence:
Sikha Bagui,
bagui@uwf.edu

Received: 05 April 2023; Accepted: 18 May 2023; Published: 06 June 2023.

Financial institutions use various data mining algorithms to determine the credit limits for individuals using features like age, education, employment, gender, income, and marital status. However, the question of predictive accuracy remains, that is, how accurately can an institution predict risk when granting credit levels. If an institution grants too low a credit limit/loan for an individual, then the institution may lose business to competitors, but if the institution grants too high a credit limit/loan, then the institution may lose money if that individual does not repay the credit/loan. The novelty of this work is that it shows how to improve the accuracy of predicting credit limits/loan amounts using synthetic feature generation. By creating secondary groupings and including both the original bins and the synthetic bins, the classification accuracy and other statistical measures like precision and ROC area improved substantially. Without synthetic feature generation, the classification rates were low; the use of synthetic features greatly improved the classification accuracy and the other statistical measures.

Keywords: synthetic feature generation, random forest, random tree, REPTree, Naïve Bayes, credit amount, credit risk

1. Introduction

In today’s world, there is no question that almost everyone has a credit card. One may need to be 18 years of age before applying for credit, but parents can add even their young children as authorized users on their credit cards. So the real question asked by banks and credit card companies is whether the primary account holder is at high risk or low risk of not repaying the credit or loan, and credit is granted based on risk level: high risk equals a low credit limit, and low risk equals a high credit limit. Institutions use various data mining algorithms to determine the credit limits for individuals. The typical features used by institutions to help determine credit limits include age, education, employment, gender, income, and marital status (1). Even with these features, the question of predictive accuracy remains, that is, how accurately can an institution predict risk and grant credit levels accordingly. If an institution grants too low a credit limit/loan for an individual, then the institution may lose business to competitors, but if the institution grants too high a credit limit/loan, then the institution may lose money if that individual does not repay the credit/loan.

The novelty of this work is that it shows how to improve the accuracy of predicting credit limits/loan amounts using synthetic feature (SF) generation. By creating secondary groupings and including both the original bins and the synthetic bins, the classification accuracy and other statistical measures like precision and ROC area improved substantially.

Four different datasets were used for this analysis. Three datasets (datasets 1, 3, and 4) were used for prediction of credit limits/loans, and one dataset (dataset 2) was used for predicting bank approvals of credit/loans. In this work, first, feature selection was performed using information gain. Then, the classification was performed using features with higher information gain and the synthetic features. For classification, three different tree-based classifiers (Random Forest, Random Tree, and REPTree) and one non-tree-based classifier (Naïve Bayes) were used.

The rest of this article is organized as follows. Section 2 presents the related works; section 3 presents the datasets and preprocessing; section 4 presents the classifiers used; section 5 presents the results and discussion; and section 6 presents the conclusions.

2. Related works

Zeng (2) studied effective binning in credit scoring, focusing on weight of evidence and regression modeling for the binning of continuous variables. The feature age, typically a variable in any financial dataset, was used as the example for improving the binning process.

Danenas and Garsva (3) presented their work on credit risk evaluation based on linear support vector machine classifiers. This was combined with external evaluation and testing sliding windows, with a focus on larger dataset applications. These authors concluded that, using real-world financial datasets, for example, from the SEC EDGAR database, their method produced results comparable to other classifiers such as logistic regression and thus could be used for the future development of real credit risk evaluation models.

Lessmann et al. (4) compared several classification algorithms to credit scoring. They examined the extent to which the assessment of alternative scorecards differs across established and novel indicators of predictive accuracy.

Ala’raj and Abbod (5) presented a new ensemble combination approach based on classifier consensus to combine multiple classifier systems of different classification algorithms. Specifically, five well known base classifiers were used: Neural Networks, Support Vector Machines, Random Forests, Decision Trees, and Naïve Bayes. Their experimental results demonstrated the ability of their proposed combinations to improve predictive performance against all base classifiers. Their model was validated over five real-world credit scoring datasets.

Musyoka (6) compared data mining algorithms on the credit card approval dataset. This research focused on masked attributes and compared the Bayesian Network, Decision Tree, and J48 classifiers. The results identified the Bayesian Network algorithm as the most accurate, returning an accuracy of 86.21%.

Tanikella (7) examined the key features considered for issuing credit cards to customers. This work used machine learning to find that the attributes prior default, years employed, credit score, and debt were the most useful features.

Zhao (8) analyzed the prediction accuracy of multiple regression models and classifiers based on predetermined performance criterion. The experimental models used were Logistic Regression, Linear Support Vector Classification (Linear SVC), and the Naïve Bayes Classifier. In this study, linear SVC performed the best.

Though quite a few works, as presented above, address different aspects of credit analysis using machine learning, none of them have used the concept of synthetic feature generation for credit analysis, which is the uniqueness and novelty of this article.

3. Datasets and preprocessing

Four datasets were selected for this research: German Credit Risk (9), Credit Screening (10), Credit (11), and Bank Churners (12). All datasets contained attributes or features relating to credit cards or credit limits. In the tables describing the respective datasets, the attributes that appear in all four datasets are identified with four asterisks (****), the attributes that appear in three datasets are identified with three asterisks (***), and the attributes that appear in two datasets are identified with two asterisks (**). Preprocessing played a major role in this work, hence preprocessing is explained in detail in this section.

3.1. Preprocessing using feature selection

Feature selection is the process of identifying and selecting features or attributes within the dataset that will aid in improving the accuracy of the returned results. The selection process can be manual or automatic, but the objective is the same – to achieve higher predictive accuracy. For this research, both manual and automatic feature selection were used for each dataset. Once the irrelevant or unusable attributes were removed, the datasets were imported into Weka, and Information Gain was run on each dataset using the Ranker search method. The output identified the information gain of each attribute. Information gain is an entropy-based measure that determines the most relevant features for the classification of a dataset.
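The ranking itself is straightforward to reproduce. The following Python sketch computes the same entropy-based information gain that Weka's Information Gain/Ranker combination produces; the file and column names are hypothetical placeholders, not the actual dataset fields.

```python
import math
from collections import Counter

import pandas as pd

def entropy(labels):
    """Shannon entropy of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(df, feature, target):
    """H(target) minus the entropy of target conditioned on the feature."""
    h_target = entropy(df[target])
    h_conditional = sum(
        (len(group) / len(df)) * entropy(group[target])
        for _, group in df.groupby(feature)
    )
    return h_target - h_conditional

df = pd.read_csv("german_credit.csv")  # hypothetical file name
target = "credit_bin"                  # hypothetical class column
ranked = sorted(
    ((f, information_gain(df, f, target)) for f in df.columns if f != target),
    key=lambda pair: pair[1],
    reverse=True,
)
for feature, gain in ranked:
    print(f"{feature}: {gain:.4f}")
```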

3.2. Dataset 1: German credit risk

The German Credit Risk dataset, obtained from Kaggle, was provided by Hofmann (9). This dataset consisted of 1,000 instances and ten attributes. Attribute descriptions and sample values are presented in Table 1.

Table 1. Dataset 1: German credit risk (9).

3.2.1. Preprocessing the German credit risk dataset

3.2.1.1. Calculating information gain

For preprocessing, first the information gain was calculated using the original attributes. As shown in Figure 1, in the German Credit Risk dataset, the attribute with the highest information gain was credit amount, followed by housing and duration. The attributes that are not in Figure 1 have information gain values very close to zero.

Figure 1. Information gain for German credit risk dataset (dataset 1).

3.2.1.2. Removing attributes

The attribute, Purpose, was omitted; based on its description and the data it contained, it was not deemed relevant for this study.

3.2.1.3. Binning and synthetic feature generation

For the German Credit Risk dataset (9), synthetic feature (SF) generation was utilized for the attributes that had lower information gain: Age, Checking Account, Duration, and Saving Account.

The attribute, Age, was binned in two ways, as shown in Table 2. For regular binning, Age was grouped into four buckets, and for synthetic feature generation, the groups were based on the accepted classifications of the age generations (13). Figures 2 and 3 show the distributions of each binning criterion.
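As a concrete illustration, the two Age groupings can be produced with pandas as below. The exact cut points of Table 2 are not reproduced here, so the bucket edges and the generation boundaries are assumptions for illustration only.

```python
import pandas as pd

df = pd.read_csv("german_credit.csv")  # hypothetical file name

# Regular binning: four age buckets (edges assumed, not those of Table 2).
df["age_bin"] = pd.cut(
    df["Age"],
    bins=[18, 30, 45, 60, 100],
    labels=["18-30", "31-45", "46-60", "60+"],
    include_lowest=True,
)

# Synthetic feature: generation-based grouping (boundaries assumed).
df["age_sf"] = pd.cut(
    df["Age"],
    bins=[18, 25, 41, 57, 100],
    labels=["Gen Z", "Millennial", "Gen X", "Boomer and older"],
    include_lowest=True,
)
```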

Table 2. Age: binning and SF generation.

Figure 2. Age (binned).

Figure 3. Age (SF).

The attribute, Job, was binned in two ways, as shown in Table 3.

Table 3. Job: binning and SF generation.

The attributes, Checking and Savings, were binned as per Table 4.

Table 4. Checking and savings: binning and SF generation.

The Duration attribute was grouped into four buckets and was narrowed down to three buckets for synthetic feature generation, as shown in Table 5.

Table 5. Duration (months): binning and SF generation.

The Credit attribute was grouped into four buckets and was narrowed down to three buckets for synthetic feature generation, as shown in Table 6.

Table 6. Credit: binning and SF generation.

3.3. Dataset 2: Credit screening

The Credit Screening dataset was obtained from the UCI Machine Learning Repository and provided by Keogh et al. (10). The dataset contained 16 variables, but the variables were masked; hence, the converted attributes were used as per Rane (14). This dataset consisted of 690 labeled instances. A description of the attributes is presented in Table 7.

Table 7. Dataset 2: credit screening (10).

3.3.1. Preprocessing the credit screening dataset

3.3.1.1. Calculating information gain

Information gain was calculated using the original set of attributes. As shown in Figure 4, in the Credit Screening dataset (10), the attribute with the highest information gain was Prior Default, and the attribute with the second highest information gain was credit score, with Employed following closely behind. The attributes that are not in Figure 4 have information gain values very close to zero.

Figure 4. Information gain on credit approved dataset (dataset 2).

3.3.1.2. Removing attributes

Citizen, education level, ethnicity, and zip code were removed.

• Citizen had three values: g, p, and s. Since g accounted for 90.5% of the applications, the assumption was made that most of the applicants were citizens, and the attribute was therefore not used.

• The Education Level attribute had 14 unique values in alpha form, which were not easily interpretable, hence the attribute was removed (not used).

• The Zip Code attribute had values between 1 and 4 digits, hence was not used.

• Ethnicity had too many values, some of which were inconsistent; hence this attribute was removed (not used).

3.3.1.3. Handling missing values

All missing values were labeled “unknown.”
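In pandas, this step is a one-liner; a minimal sketch follows (the file name is hypothetical).

```python
import pandas as pd

df = pd.read_csv("credit_screening.csv")  # hypothetical file name
df = df.fillna("unknown")                 # label every missing value "unknown"
```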

3.3.1.4. Binning and synthetic feature generation

Three of the attributes were binned: Age, Credit Score, and Income. Age was grouped as per Dataset 1 (Table 2), hence is not shown here. The binning and synthetic feature generation of Credit Score and Income are shown in Tables 8 and 9, respectively. Debt and years employed are binned in Tables 10 and 11, respectively.

Table 8. Credit score: binning and SF generation.

Table 9. Income: binning and SF generation.

Table 10. Debt: binning.

Table 11. Years employed: binning.

3.4. Dataset 3: Credit

The third dataset, Credit, obtained from Kaggle.com, was provided by Iacob (11). This dataset contained 400 instances with 11 attributes, as shown in Table 12. This research focuses on the effects of the attributes on credit limit.

Table 12. Dataset 3: credit (11).

3.4.1. Preprocessing the credit dataset

3.4.1.1. Calculating information gain

Information Gain was calculated on the original attributes. From Figure 5, it can be noted that the attribute with the highest information gain was monthly balance, with credit rating being the second highest. There are far fewer attributes in Figure 5 than in Table 12; the attributes that are not in Figure 5 have information gain values very close to zero.

Figure 5. Information gain on credit limit dataset (dataset 3).

3.4.1.2. Removing attributes

Student, ethnicity, income, ID, and number of cards were removed.

• The student attribute was removed because the other datasets did not contain a similar attribute and only 10% of the cardholders were students.

• The Ethnicity attribute was also removed because it was not adequately identified in the other datasets.

• The income data did not appear correct, hence was not used.

• The ID attribute was removed because it is only a record identifier.

• Number of cards was not used because the information gain was close to zero, and other datasets did not include this attribute.

3.4.1.3. Binning and synthetic feature generation

For the Credit dataset, synthetic feature creation was applied to two attributes: Age and Credit Limit. Age was grouped into the same buckets as in the German Credit dataset (Table 2). Credit Limit was grouped once into numerical buckets, and the second grouping (the synthetic feature) was High, Medium, and Low, as shown in Table 13. Other attributes that were binned were monthly balance (Table 14), credit rating (Table 15), and education (Table 16).

Table 13. Credit limit: binning and SF generation.

Table 14. Balance: binning.

Table 15. Credit rating: binning.

Table 16. Education: binning.

3.5. Dataset 4: Bank churners

The Bank Churners dataset, obtained from Kaggle.com, was provided by Goyal (12). This dataset contained 10,127 instances with 27 attributes, as shown in Table 17. The Bank Churners dataset focused on bank customers and the relationship between attrition and the other attributes in the dataset.

Table 17. Dataset 4: bank churners (12).

3.5.1. Preprocessing the bank churners dataset

3.5.1.1. Calculating information gain

Information gain was calculated using the original attributes. As shown in Figure 6, in the Bank Churners dataset, the attribute with the highest information gain was income, followed by gender and revolving balance. The attributes that are not in Figure 6 have information gain values very close to zero.

Figure 6. Information gain on bank churners dataset (dataset 4).

3.5.1.2. Removing attributes

Attrition, Card Category, Client Number, Contacts, Dependents, Months Inactive, Months Total, Open to Buy, Total Amount Change, Total Transaction Amount, Total Transaction Count, Difference Transaction Count Quarterly, and Utilization Ratio Average were removed since they were not considered relevant for this analysis.

3.5.1.3. Binning and synthetic feature generation

For Dataset 4, synthetic feature creation was done for Age (grouped as per the German Credit dataset, Table 2), Credit Limit (grouped as per the Credit dataset, Table 13), and Married. The Married attribute contained the following values: divorced, married, single, and unknown. Divorced and Single were grouped together into “no” (not married), leaving the grouped values for the Married attribute as Yes, No, and Unknown, as shown in Table 18. Income was binned as per Table 19, which also shows the categories for the income synthetic feature. Balance Revolving was binned as per Table 20, and months with bank was binned as per Table 21.
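A minimal sketch of the Married regrouping described above, assuming a pandas DataFrame; the file and column names are assumptions.

```python
import pandas as pd

df = pd.read_csv("bank_churners.csv")  # hypothetical file name
# Collapse Divorced and Single into "No"; keep Married as "Yes".
df["married_sf"] = df["Marital_Status"].map(
    {"Married": "Yes", "Divorced": "No", "Single": "No", "Unknown": "Unknown"}
)
```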

Table 18. Married: binning and SF generation.

Table 19. Income: binning and SF generation.

Table 20. Balance revolving: binning.

Table 21. Months with bank: binning.

4. Classifiers used

For classification, three tree-based classifiers (Random Forest, Random Tree, and REPTree) and one non-tree-based classifier (Naïve Bayes) were used.
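The experiments themselves were run in Weka. For readers working in Python, the sketch below sets up a roughly comparable lineup in scikit-learn. Random Tree and REPTree have no exact scikit-learn equivalents, so single decision trees with a random feature subset and a depth limit, respectively, stand in for them; the file and class column names are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("credit_binned.csv")  # hypothetical preprocessed file
X = OrdinalEncoder().fit_transform(df.drop(columns=["class"]))  # hypothetical class column
y = df["class"]

classifiers = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=1),
    # Rough Random Tree analog: one tree, random feature subset per split.
    "Random Tree": DecisionTreeClassifier(max_features="sqrt", random_state=1),
    # Rough REPTree analog: a single pruned (depth-limited) tree.
    "REPTree (proxy)": DecisionTreeClassifier(max_depth=5, random_state=1),
    "Naive Bayes": CategoricalNB(),
}

for name, clf in classifiers.items():
    accuracy = cross_val_score(clf, X, y, cv=10).mean()
    print(f"{name}: mean accuracy = {accuracy:.3f}")
```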

4.1. Random forest

Random Forest is a widely used machine learning classifier that constructs multiple decision trees randomly, and the term “forest” stems from the imagery of the many trees being created. In Random Forest, each tree is independently produced without pruning, and the nodes are split based on the user’s selection of available features (15). There are also works on how a user can prune Random Forest. Kulkarni and Sinha (16) showed a way of pruning by limiting the number of trees.

4.2. Random tree

The Random Tree classifier is similar to the Random Forest classifier, but it constructs only one decision tree, based on a random set of attributes. Every node is split using the best split among a random subset of the variables (17). Essentially, Random Tree is the simpler classifier, while Random Forest tends to have better accuracy because constructing multiple trees decreases the variance.

4.3. The REPTree classifier

Reduced error pruning tree (REPTree) builds a decision tree based on information gain (18). The tree that is built may be a decision/regression tree, but it is used for classification, and multiple trees are created in different iterations (18). When the algorithm runs, it works from the bottom nodes of the tree up to the top and, at each node, assesses whether replacing the subtree with the most frequent class would improve accuracy, pruning away branches that would reduce accuracy (19). REPTree sorts numeric attributes only once.

4.4. The Naïve Bayes classifier

Naïve Bayes was chosen as an additional classifier because it is not a tree classifier. The Naïve Bayes classifier assumes that all features are conditionally independent of one another given the class attribute (20).

5. Results and discussion

The four different classifiers were run using Weka. For each dataset and each classifier, we looked at the following statistical measures: accuracy, true positive rate (TPR), false positive rate (FPR), precision, F-measure, and ROC area.

Accuracy is the ratio of a model’s correct predictions (TP + TN) to the total number of predictions, calculated by the following equation:

Accuracy = (TP + TN) / (TP + TN + FP + FN).

TPR, also called sensitivity or recall, measures the proportion of actual positive instances that were identified as positive, given by the following equation:

TPR = TP / (TP + FN).

FPR is the proportion of negative instances that were incorrectly identified as positive, given by the following equation:

FPR = FP / (FP + TN).

Precision measures the proportion of positive identifications that were actually positive, given by the following equation:

Precision = TP / (TP + FP).

F-measure is the harmonic mean of precision and recall, calculated by the following equation:

F-measure = 2 * ((Precision * Recall) / (Precision + Recall)).

The ROC curve plots the TPR against the FPR; a short computational illustration of these measures follows the definitions below.

Where:

• True Positives (TP) are instances correctly identified as positive.

• True Negatives (TN) are instances correctly identified as negative.

• False Positives (FP) are instances incorrectly identified as positive.

• False Negatives (FN) are instances incorrectly identified as negative.
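As an illustration, all of these measures can be computed directly from a binary confusion matrix. The following Python sketch uses scikit-learn with made-up labels and predictions.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # illustrative ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # illustrative predictions

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn)
tpr       = tp / (tp + fn)            # recall / sensitivity
fpr       = fp / (fp + tn)
precision = tp / (tp + fp)
f_measure = 2 * (precision * tpr) / (precision + tpr)

print(accuracy, tpr, fpr, precision, f_measure)
# The ROC area is computed from scores rather than hard labels, e.g.,
# sklearn.metrics.roc_auc_score(y_true, decision_scores).
```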

For each of the classification runs, various combinations of original attributes (OR), original binned attributes (Bin), and synthetic features (SF) were used. The original attributes and original binned attributes were selected based on information gain that was performed on each respective dataset.

To select the best results, the runs with the minimal set of attributes and the highest statistical measures were selected, as sketched below.
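This selection logic can be sketched as an exhaustive search over attribute subsets, keeping the smallest subset with the best cross-validated accuracy. The sketch below is a simplified Python illustration, not the exact Weka workflow; the file, class column, and feature names are hypothetical.

```python
from itertools import combinations

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OrdinalEncoder

df = pd.read_csv("german_credit_binned.csv")  # hypothetical preprocessed file
y = df["credit_bin"]                          # hypothetical class column
candidates = ["age_bin", "age_sf", "duration_bin",
              "duration_sf", "housing", "job_sf"]  # hypothetical feature names

best = None
for k in range(1, len(candidates) + 1):
    for subset in map(list, combinations(candidates, k)):
        X = OrdinalEncoder().fit_transform(df[subset])
        acc = cross_val_score(
            RandomForestClassifier(random_state=1), X, y, cv=10
        ).mean()
        # Smaller subsets are tried first, so a strict '>' keeps the
        # minimal attribute set among runs that tie on accuracy.
        if best is None or acc > best[0]:
            best = (acc, subset)

print(f"best accuracy {best[0]:.3f} using attributes {best[1]}")
```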

5.1. Classification results for the German credit dataset

For the German credit dataset (9), credit amount (binned as shown in Table 6) was used as the class variable for classification. Tables 22–25 present the statistical results of the classifications.

Table 22. Naïve Bayes classification using the German credit dataset.

Table 23. Random forest classification using the German credit dataset.

Table 24. Random tree classification using the German credit dataset.

Table 25. REPTree classification using the German credit dataset.

5.1.1. Naïve Bayes results

Results of the Naïve Bayes classification, presented in Table 22, show that 10 attributes with three synthetic features had the best results, with a classification accuracy of 73.6%. In this run, three synthetic features were used: age, credit, and duration. The other statistical measures for this run were also high. Without the use of any synthetic features, the classification results for this dataset were poor (accuracy 46.1%).

5.1.2. Random forest results

Results of the Random Forest classification, presented in Table 23, show that five attributes with one synthetic feature had the best results in terms of classification accuracy (73.2%). In this run, only the synthetic feature for credit was used. The other statistical measures for this run were also high. Without the use of any synthetic features, the classification results for this dataset were poor (accuracy 38%).

5.1.3. Random tree results

Results of the Random Tree classification, presented in Table 24, show that five attributes with one synthetic feature had the best results in terms of classification accuracy (72.3%). In this run, the only synthetic feature used was for credit. The other statistical measures for this run were also high. Again, without the use of any synthetic features, the classification results for this dataset were poor (accuracy 37.6%).

5.1.4. REPTree results

Results of the REPTree classification, presented in Table 25, show that eight attributes with one synthetic feature had slightly higher classification accuracy (72.7%) than the other runs. In this run, the only synthetic feature used was credit. Without the use of any synthetic features, the classification results for this dataset were poor (accuracy 45.7%).

5.1.5. Overall classifier comparison for the German credit dataset

From Tables 22–25, it can be noted that even using one synthetic attribute greatly improved the classification accuracy and other statistical measures.

A comparison of the classification accuracy of all the classifiers on the German Credit dataset (Table 26) shows that two of the four classifiers performed well with five attributes and only one synthetic feature, while Naïve Bayes performed best with 10 attributes and three synthetic features and REPTree performed best with eight attributes and one synthetic feature. The highest classification accuracy was achieved with the Naïve Bayes classifier, and the most improved accuracy was achieved using the Random Forest classifier (35.2% improvement, as shown in Table 27).

Table 26. Classifier accuracy comparison on the German credit dataset.

Table 27. German credit dataset – improvement in accuracy with synthetic features.

5.2. Classification results for the credit screening dataset

For the Credit Screening dataset (10), the attribute approved, a binary attribute, was used as the class variable for classification. Tables 28–31 present the statistical results of the classifications. This dataset also performed well without the synthetic features. Using three synthetic features, for age, credit score, and income, only very slightly improved the classification accuracy for the Random Forest and Random Tree algorithms.

Table 28. Naïve Bayes classification using the credit screening dataset.

Table 29. Random forest classification using the credit screening dataset.

Table 30. Random tree classification using the credit screening dataset.

Table 31. REPTree classification using the credit screening dataset.

5.2.1. Overall classifier comparison for the credit screening dataset

A comparison of the classification accuracy of all the classifiers on the Credit Screening dataset, from Table 32, shows that Random Forest and Random Tree had the highest accuracy using three synthetic features, but Naïve Bayes and REPTree did not perform better with synthetic features. An analysis of the accuracy improvement on this dataset, as shown in Table 33, shows very little improvement after adding synthetic features; in fact, accuracy decreased slightly for Naïve Bayes and REPTree.

Table 32. Classifier accuracy comparison on credit screening dataset.

Table 33. Credit screening dataset – improvement in accuracy with synthetic features.

5.3. Classification results for the credit dataset

For the Credit dataset (11), credit limit was used as the class variable for classification. Tables 34–37 present the statistical results of the classifications.

Table 34. Naïve Bayes classification using the credit dataset.

Table 35. Random forest classification using the credit dataset.

Table 36. Random tree classification using the credit dataset.

Table 37. REPTree classification using the credit dataset.

5.3.1. Naïve Bayes results

Results of the Naïve Bayes classification, presented in Table 34, show that seven attributes with one synthetic feature, credit limit, had the best results, with a classification accuracy of 77.5%. Other statistical measures were also high for this run.

5.3.2. Random forest results

Results of the Random Forest classification, presented in Table 35, show that six attributes with two synthetic features, age and credit limit, had the best results in terms of classification accuracy (77.5%).

5.3.3. Random tree results

Results of the Random Tree, presented in Table 36, show that six attributes with two synthetic features, age and credit limit, had the best results in terms of classification accuracy (76.8%).

5.3.4. REPTree results

Results of the REPTree classification, presented in Table 37, show that five attributes with one synthetic feature, credit limit, had the best results in terms of classification accuracy (79.3%).

5.3.5. Overall classifier comparison for the credit dataset

For the Credit dataset, all classifiers showed a significant increase in classification accuracy and other statistical measures after the synthetic features were added, as shown in Table 38. Comparing the classifiers, REPTree performed the best, with 79.3% classification accuracy. Naïve Bayes (seven attributes) and REPTree (five attributes) performed best with one synthetic feature, credit limit, while Random Forest and Random Tree performed best with six attributes and two synthetic features, age and credit limit. Other statistical measures were also higher for this set of runs.

Table 38. Classifier accuracy comparison on credit dataset.

From Table 39, it can be observed that Random Tree had the highest improvement in accuracy (24.2%), closely followed by Random Forest at 24%. The other two classifiers, Naïve Bayes and REPTree, also had a significant improvement with the addition of synthetic attributes.

Table 39. Credit dataset – improvement in accuracy with synthetic features.

5.4. Classification results for the bank churners dataset

For the Bank Churners dataset (12), credit limit was used as the class variable for classification. Tables 40–43 present the statistical results of the classifications.

Table 40. Naïve Bayes classification using the bank churners dataset.

Table 41. Random forest classification using the bank churners dataset.

Table 42. Random tree classification using the bank churners dataset.

Table 43. REPTree classification using the bank churners dataset.

Results of the Naïve Bayes classification, presented in Table 40, show that four attributes with one synthetic feature, credit limit, had the best results, with a classification accuracy of 71.1%. Results of the Random Forest, Random Tree, and REPTree classifiers, presented in Tables 41–43, respectively, also show that four attributes with one synthetic feature, credit limit, had the best results in terms of classification accuracy (72.7, 72.7, and 72.5%, respectively).

5.4.1. Overall classifier comparison for the bank churners dataset

For this dataset, adding one synthetic attribute improved the classification accuracy significantly, and all four classifiers performed best with four attributes and one synthetic attribute, as shown in Table 44. From Table 45, it can be noted that Random Tree had the highest improvement in accuracy, at 22.7%, followed by Random Forest at 20.7%. The other two algorithms also had significant improvements in accuracy with just one synthetic feature.

Table 44. Classifier accuracy comparisons on bank churners dataset.

Table 45. Bank churners – improvement in accuracy with synthetic features.

6. Conclusion

Three of the four datasets used in this research showed an improvement in accuracy and other statistical measures using synthetic attributes. Overall, the tree-based classifiers, Random Forest, Random Tree, and REPTree, had better performance, as well as larger performance improvements, than the non-tree-based classifier, Naïve Bayes.

Author contributions

SB conceptualized the article, guided the research, and directed the formulation of the article. JW also helped to conceptualize the article, did most of the preprocessing and processing of the data, and wrote the initial draft of the article. Both authors contributed to the article and approved the submitted version.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Wongchinsri P, Kuratach W. A survey - data mining frameworks in credit card processing. Proceedings of the 2016 13th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON). Chiang Mai: IEEE (2016). p. 1–6.

2. Zeng G. A necessary condition for a good binning algorithm in credit scoring. Appl Math Sci. (2014) 8:3229–42.

3. Danenas P, Garsva G. Selection of support vector machines-based classifiers for credit risk domain. Expert Syst Appl. (2015) 42:3194–204.

4. Lessmann S, Baesens B, Seow HV, Thomas LC. Benchmarking state-of-the-art classification algorithms for credit scoring: an update of research. Eur J Operat Res. (2015) 247:124–36.

5. Ala’raj M, Abbod MF. Classifiers consensus system approach for credit scoring. Knowl Based Syst. (2016) 104:89–105.

6. Musyoka WM. Comparison of data mining algorithms in credit card approval. Int J Comput Inform Technol. (2018) 7:2.

7. Tanikella U. Credit Card Approval Verification Model. PhD thesis. San Marcos, CA: California State University San Marcos (2020).

8. Zhao Y. Credit card approval predictions using logistic regression, linear SVM and naïve Bayes classifier. 2022 International Conference on Machine Learning and Knowledge Engineering (MLKE). Guilin: IEEE (2022). p. 207–11. doi: 10.1109/MLKE55170.2022.00047

9. Hofmann H. German Credit Risk (UCI Machine Learning). San Francisco, CA: Kaggle (2016).

10. Keogh E, Blake C, Merz CJ. UCI Repository of Machine Learning Databases. (1998). Available online at: https://archive.ics.uci.edu/ml/datasets/credit+approval

11. Iacob S. Predicting Credit Card Balance Using Regression. San Francisco, CA: Kaggle (2020).

12. Goyal S. Credit Card Customers: Predict Churning Customers. San Francisco, CA: Kaggle (2021).

13. Ricaldi LC. Three Essays on Consumer Credit Card Behavior. PhD thesis. Lubbock, TX: Texas Tech University (2015).

14. Rane K. Credit Card Approval Analysis. (2018). Available online at: https://nycdatascience.com/blog/student-works/credit-card-approval-analysis/

15. Belgiu M, Drăguţ L. Random forest in remote sensing: a review of applications and future directions. ISPRS J Photogr Remote Sens. (2016) 114:24–31. doi: 10.1016/j.isprsjprs.2016.01.011

16. Kulkarni VY, Sinha PK. Pruning of random forest classifiers: a survey and future directions. 2012 International Conference on Data Science & Engineering (ICDSE). Cochin: IEEE (2012). p. 64–8. doi: 10.1109/ICDSE.2012.6282329

17. Mishra AK, Ratha BK. Study of random tree and random forest data mining algorithms for microarray data analysis. Int J Adv Electr Comput Eng. (2016) 3.

18. Kalmegh SR. Analysis of WEKA data mining algorithm REPTree, Simple CART and RandomTree for classification of Indian news. Int J Innov Sci Eng Technol. (2015) 2.

19. Rokach L, Maimon O. Data Mining with Decision Trees - Theory and Applications. 2nd ed. Singapore: World Scientific Publishing (2015).

20. Saritas MM, Yaşar AB. Performance analysis of ANN and Naive Bayes classification algorithm for data classification. Int J Intell Syst Appl Eng. (2019) 7:88–91.

21. Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Patt Recogn. (1997) 30:1145–59.

22. Breiman L. Random forests. Mach Learn. (2001) 45:5–32. doi: 10.1023/A:1010933404324

23. Frank E, Trigg LE, Holmes G, Witten IH. Naive Bayes for regression (technical note). Mach Learn. (2000) 41:5–25. doi: 10.1023/A:1007670802811

24. Han J, Kamber M, Pei J. Data Mining: Concepts and Techniques. 3rd ed. Burlington, MA: Elsevier/Morgan Kaufmann (2012).

25. Seker S. AutoML: End-to-end Introduction From Optiwisdom. (2019). Available online at: https://towardsdatascience.com/automl-end-to-end-introduction-from-optiwisdom-c17fe03a017f

26. Taub J, Elliot M. The synthetic data challenge. Conference of European Statisticians Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality. The Hague: UNECE (2019).

27. WEKA. Weka 3: Machine Learning Software in Java. (2022). Available online at: https://www.cs.waikato.ac.nz/ml/weka/ (accessed August 22, 2022).

28. Zhao H, Liu H, Fu Y. Incomplete multi-modal visual data grouping. Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI). Palo Alto, CA: AAAI Press (2016). p. 2392–8.