<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v2.3 20070202//EN" "archivearticle.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="methods-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Bohr. Scit.</journal-id>
<journal-title>BOHR International Journal of Smart Computing and Information Technology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Bohr. Scit.</abbrev-journal-title>
<issn pub-type="epub">2583-2026</issn>
<publisher>
<publisher-name>BOHR</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.54646/bijscit.2023.34</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Methods</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Synthetic feature generation to improve accuracy in prediction of credit limits</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Bagui</surname> <given-names>Sikha</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Walker</surname> <given-names>Jennifer</given-names></name>
</contrib>
</contrib-group>
<aff><institution>Department of Computer Science, University of West Florida</institution>, <addr-line>Pensacola, FL</addr-line>, <country>United States</country></aff>
<author-notes>
<corresp id="c001">&#x002A;Correspondence: Sikha Bagui, <email>bagui@uwf.edu</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>06</day>
<month>06</month>
<year>2023</year>
</pub-date>
<volume>4</volume>
<issue>1</issue>
<fpage>24</fpage>
<lpage>38</lpage>
<history>
<date date-type="received">
<day>05</day>
<month>04</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>18</day>
<month>05</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2023 Bagui and Walker.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Bagui and Walker</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by-nc-nd/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Financial institutions use various data mining algorithms to determine credit limits for individuals using features like age, education, employment, gender, income, and marital status. However, the question of predictive accuracy remains, that is, how accurately can an institution predict risk and grant credit levels. If an institution grants too low a credit limit/loan to an individual, the institution may lose business to competitors, but if it grants too high a credit limit/loan, it may lose money if that individual does not repay the credit/loan. The novelty of this work is that it shows how to improve accuracy in predicting credit limits/loan amounts using synthetic feature generation. By creating secondary groupings and including both the original binning and the synthetic bins, the classification accuracy and other statistical measures like precision and ROC improved substantially. Hence, our research showed that without synthetic feature generation, classification rates were low, and the use of synthetic features greatly improved classification accuracy and other statistical measures.</p>
</abstract>
<kwd-group>
<kwd>synthetic feature generation</kwd>
<kwd>random forest</kwd>
<kwd>random tree</kwd>
<kwd>REPTree</kwd>
<kwd>Na&#x00EF;ve Bayes</kwd>
<kwd>credit amount</kwd>
<kwd>credit risk</kwd>
</kwd-group>
<counts>
<fig-count count="6"/>
<table-count count="45"/>
<equation-count count="5"/>
<ref-count count="28"/>
<page-count count="15"/>
<word-count count="7677"/>
</counts>
</article-meta>
</front>
<body>
<sec id="S1" sec-type="intro">
<title>1. Introduction</title>
<p>In today&#x2019;s world, there is no question that almost everyone has a credit card. One may need to be 18 years of age before applying for credit, but parents can actually add their young children as authorized users to their credit cards. So the real question asked by banks and credit card companies is whether the primary account holder is at high risk or low risk with respect to repaying the credit or loan, and credit is thereby granted based on risk level; high risk equals a low credit limit, and low risk equals a high credit limit. Institutions use various data mining algorithms to determine the credit limits for individuals. The typical features used by institutions to help determine credit limits include age, education, employment, gender, income, and marital status (<xref ref-type="bibr" rid="B1">1</xref>). Even with these features, the question of predictive accuracy remains, that is, how accurately can an institution predict risk and grant credit levels. If an institution grants too low a credit limit/loan to an individual, the institution may lose business to competitors, but if it grants too high a credit limit/loan, it may lose money if that individual does not repay the credit/loan.</p>
<p>The novelty of this work is that it shows how to improve the accuracy in predicting credit limits/loan amounts using synthetic feature (SF) generation. By creating secondary groupings and including both the original binning and the synthetic bins, the classification accuracy and other statistical measures like precision and ROC improved substantially.</p>
<p>Four different datasets were used for this analysis. Three datasets (datasets 1, 3, and 4) were used for prediction of credit limits/loans, and one dataset (dataset 2) was used for predicting bank approvals of credit/loans. In this work, first, feature selection was performed using information gain. Then, the classification was performed using features with higher information gain and the synthetic features. For classification, three different tree-based classifiers (Random Forest, Random Tree, and REPTree) and one non-tree-based classifier (Na&#x00EF;ve Bayes) were used.</p>
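<p>The workflow just described &#x2013; ranking features by information gain, then classifying with tree-based and Na&#x00EF;ve Bayes learners &#x2013; can be sketched as follows. This is a minimal illustration only: it uses scikit-learn stand-ins for the Weka classifiers actually used (REPTree has no direct scikit-learn equivalent, so a pruned decision tree is substituted) and randomly generated data in place of the credit datasets.</p>

```python
# Sketch of the paper's workflow: rank features by information gain,
# keep the top-ranked ones, then train tree-based and Naive Bayes classifiers.
# scikit-learn and random data stand in for Weka and the credit datasets.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(200, 6)).astype(float)  # 6 categorical-coded features
y = (X[:, 0] + X[:, 2] > 3).astype(int)              # label depends on features 0 and 2

# Rank features by information gain (mutual information with the class label)
gain = mutual_info_classif(X, y, discrete_features=True)
top = np.argsort(gain)[::-1][:3]                     # keep the 3 highest-gain features

# Compare classifiers on the selected features (a pruned DecisionTreeClassifier
# serves as a rough stand-in for Weka's Random Tree / REPTree)
for clf in (RandomForestClassifier(random_state=0),
            DecisionTreeClassifier(max_depth=5, random_state=0),
            GaussianNB()):
    acc = cross_val_score(clf, X[:, top], y, cv=5).mean()
    print(type(clf).__name__, round(acc, 3))
```

In the real experiments the same comparison was run in Weka; the sketch only conveys the selection-then-classification structure of the pipeline.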
<p>The rest of this article is organized as follows. Section 2 presents the related works; section 3 presents the datasets and preprocessing; section 4 presents the results and discussion; and section 5 presents the conclusions.</p>
</sec>
<sec id="S2">
<title>2. Related works</title>
<p>Zeng (<xref ref-type="bibr" rid="B2">2</xref>) studied effective binning for credit scoring. He focused on weight of evidence and regression modeling for the binning of continuous variables. The feature age (typically a variable in any financial dataset) was used as the example for improving the binning process.</p>
<p>Danenas and Garsva (<xref ref-type="bibr" rid="B3">3</xref>) presented their work on credit risk evaluation based on linear support vector machine classifiers. This was combined with external evaluation and sliding-window testing, with a focus on application to larger datasets. The authors concluded that, using real-world financial datasets, for example, from the SEC EDGAR database, their method produced results comparable to other classifiers such as logistic regression and thus could be used for the future development of real credit risk evaluation models.</p>
<p>Lessmann et al. (<xref ref-type="bibr" rid="B4">4</xref>) compared several classification algorithms for credit scoring. They examined the extent to which the assessment of alternative scorecards differs across established and novel indicators of predictive accuracy.</p>
<p>Ala&#x2019;raj and Abbod (<xref ref-type="bibr" rid="B5">5</xref>) presented a new ensemble combination approach based on classifier consensus to combine multiple classifier systems of different classification algorithms. Specifically, five well-known base classifiers were used: Neural Networks, Support Vector Machines, Random Forests, Decision Trees, and Na&#x00EF;ve Bayes. Their experimental results demonstrated the ability of the proposed combinations to improve predictive performance over all base classifiers. The model was validated on five real-world credit scoring datasets.</p>
<p>Musyoka (<xref ref-type="bibr" rid="B6">6</xref>) compared data mining algorithms on the credit card approval dataset. This research focused on masked attributes and compared the Bayesian Network, Decision Tree, and J48 classifiers. Musyoka&#x2019;s (<xref ref-type="bibr" rid="B6">6</xref>) results identified the Bayesian Network algorithm as the most accurate, returning an accuracy of 86.21%.</p>
<p>Tanikella (<xref ref-type="bibr" rid="B7">7</xref>) examined the key features considered for issuing credit cards to customers. This work used machine learning to find that the attributes prior default, years employed, credit score, and debt were the most useful features.</p>
<p>Zhao (<xref ref-type="bibr" rid="B8">8</xref>) analyzed the prediction accuracy of multiple regression models and classifiers based on predetermined performance criteria. The experimental models used were Logistic Regression, Linear Support Vector Classification (Linear SVC), and the Na&#x00EF;ve Bayes Classifier. In this study, Linear SVC performed the best.</p>
<p>Although, as presented above, quite a few studies have addressed different aspects of credit analysis using machine learning, none has used the concept of synthetic feature generation in machine learning for credit analysis, which is the uniqueness and novelty of this article.</p>
</sec>
<sec id="S3">
<title>3. Datasets and processing</title>
<p>Four datasets were selected for this research: German Credit Risk (<xref ref-type="bibr" rid="B9">9</xref>), Credit Screening (<xref ref-type="bibr" rid="B10">10</xref>), Credit (<xref ref-type="bibr" rid="B11">11</xref>), and Bank Churners (<xref ref-type="bibr" rid="B12">12</xref>). All datasets contained attributes or features relating to credit cards or credit limits. In the tables describing the respective datasets, the attributes that appear in all four datasets are identified with four asterisks (&#x002A;&#x002A;&#x002A;&#x002A;), the attributes that appear in three datasets are identified with three asterisks (&#x002A;&#x002A;&#x002A;), and the attributes that appear in two datasets are identified with two asterisks (&#x002A;&#x002A;). Preprocessing played a major role in this work, hence preprocessing is explained in detail in this section.</p>
<sec id="S3.SS1">
<title>3.1. Preprocessing using feature selection</title>
<p>Feature selection is the process of identifying and selecting features or attributes within the dataset that will aid in improving the accuracy of the returned results. The selection process can be manual or automatic, but the objective is essentially the same &#x2013; to achieve higher predictive accuracy. For this research, both manual and automatic feature selection were used on each dataset. Once the irrelevant or unusable attributes were removed, the datasets were imported into Weka, and Information Gain was run on each dataset using the Ranker search method. The output identified the information gain for each attribute. Information gain is an entropy-based measure that determines the features most relevant for the classification of a dataset.</p>
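<p>The information gain of an attribute is the class entropy minus the weighted class entropy within each attribute value. The following standard-library sketch mirrors what Weka&#x2019;s evaluator reports; the attribute and label values here are illustrative, not taken from the datasets.</p>

```python
# Information gain of an attribute = entropy of the class labels minus the
# weighted entropy of the labels within each attribute value (stdlib only).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    n = len(labels)
    # Partition the labels by attribute value
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, []).append(y)
    # Weight each partition's entropy by its relative size
    remainder = sum(len(part) / n * entropy(part) for part in by_value.values())
    return entropy(labels) - remainder

# Illustrative toy data: a housing-type attribute vs. a binary risk label
housing = ["own", "own", "rent", "rent", "free", "own"]
risk    = ["low", "low", "high", "high", "high", "low"]
print(information_gain(housing, risk))  # -> 1.0 (housing perfectly separates risk)
```

Weka&#x2019;s Ranker method simply sorts the attributes by this score, which is how the rankings in the figures below were produced.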
</sec>
<sec id="S3.SS2">
<title>3.2. Dataset 1: German credit risk</title>
<p>The German Credit Risk dataset, obtained from Kaggle, was provided by Hofmann (<xref ref-type="bibr" rid="B9">9</xref>). This dataset consisted of 1,000 instances and ten attributes. Attribute descriptions and sample values are presented in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
<table-wrap position="float" id="T1">
<label>TABLE 1</label>
<caption><p>Dataset 1: German credit risk (<xref ref-type="bibr" rid="B9">9</xref>).</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">No</td>
<td valign="top" align="center">Attribute</td>
<td valign="top" align="center">Description</td>
<td valign="top" align="center">Values</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">1.</td>
<td valign="top" align="center">Age&#x002A;&#x002A;&#x002A;&#x002A;</td>
<td valign="top" align="center">Age of customer</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">2.</td>
<td valign="top" align="center">Checking acct</td>
<td valign="top" align="center">Customer&#x2019;s checking account balance in Deutsch Marks</td>
<td valign="top" align="center">Little/moderate/rich</td>
</tr>
<tr>
<td valign="top" align="left">3.</td>
<td valign="top" align="center">Credit&#x002A;&#x002A;&#x002A;</td>
<td valign="top" align="center">Customer&#x2019;s credit amount in Deutsch Marks</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">4.</td>
<td valign="top" align="center">Duration</td>
<td valign="top" align="center">Length of customer&#x2019;s credit with bank</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">5.</td>
<td valign="top" align="center">Gender&#x002A;&#x002A;&#x002A;&#x002A;</td>
<td valign="top" align="center">Customer&#x2019;s gender</td>
<td valign="top" align="center">Male/female</td>
</tr>
<tr>
<td valign="top" align="left">6.</td>
<td valign="top" align="center">Housing</td>
<td valign="top" align="center">Customer&#x2019;s housing type</td>
<td valign="top" align="center">Own/rent/free</td>
</tr>
<tr>
<td valign="top" align="left">7.</td>
<td valign="top" align="center">Job</td>
<td valign="top" align="center">Customer&#x2019;s job type</td>
<td valign="top" align="center">0,1,2,3</td>
</tr>
<tr>
<td valign="top" align="left">8.</td>
<td valign="top" align="center">Purpose</td>
<td valign="top" align="center">Customer&#x2019;s reason for needing credit</td>
<td valign="top" align="center">Various</td>
</tr>
<tr>
<td valign="top" align="left">9.</td>
<td valign="top" align="center">Row number</td>
<td valign="top" align="center">Unique row number</td>
<td valign="top" align="center">Various</td>
</tr>
<tr>
<td valign="top" align="left">10.</td>
<td valign="top" align="center">Saving account</td>
<td valign="top" align="center">Customer&#x2019;s savings account level/balance</td>
<td valign="top" align="center">Little/moderate/rich/quite rich</td>
</tr>
</tbody>
</table></table-wrap>
<sec id="S3.SS2.SSS1">
<title>3.2.1. Preprocessing the German credit risk dataset</title>
<sec id="S3.SS2.SSS1.Px1">
<title>3.2.1.1. Calculating information gain</title>
<p>For preprocessing, first the information gain was calculated using the original attributes. As shown in <xref ref-type="fig" rid="F1">Figure 1</xref>, in the German Credit Risk dataset, the attribute with the highest information gain was credit amount, followed by housing and duration. The attributes that are not in <xref ref-type="fig" rid="F1">Figure 1</xref> have information gain values very close to zero.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p>Information gain for German credit risk dataset (dataset 1).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijscit-2023-34-g001.tif"/>
</fig>
</sec>
<sec id="S3.SS2.SSS1.Px2">
<title>3.2.1.2. Removing attributes</title>
<p>The attribute, Purpose, was omitted. Based on its description and the data it contained, it was not deemed relevant to this study.</p>
</sec>
<sec id="S3.SS2.SSS1.Px3">
<title>3.2.1.3. Binning and synthetic feature generation</title>
<p>For the German Credit Risk dataset (<xref ref-type="bibr" rid="B9">9</xref>), synthetic feature (SF) generation was utilized for the attributes that had lower information gain: Age, Checking Account, Duration, and Saving Account.</p>
<p>The attribute, Age, was binned in two ways, as shown in <xref ref-type="table" rid="T2">Table 2</xref>. For regular binning, Age was grouped into four buckets, and for synthetic feature generation, the groups were based on the accepted classifications of the age generations (<xref ref-type="bibr" rid="B13">13</xref>). <xref ref-type="fig" rid="F2">Figures 2</xref>, <xref ref-type="fig" rid="F3">3</xref> show the distributions of each of the binning criteria.</p>
<table-wrap position="float" id="T2">
<label>TABLE 2</label>
<caption><p>Age: binning and SF generation.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Age (binned)</td>
<td valign="top" align="center">Age (SF)</td>
<td/>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">&#x003C;25</td>
<td valign="top" align="center">Gen alpha</td>
<td valign="top" align="center">&#x003C;18</td>
</tr>
<tr>
<td valign="top" align="left">25&#x2013;35</td>
<td valign="top" align="center">Gen Z</td>
<td valign="top" align="center">18&#x2013;22</td>
</tr>
<tr>
<td valign="top" align="left">36&#x2013;45</td>
<td valign="top" align="center">Millennials</td>
<td valign="top" align="center">23&#x2013;38</td>
</tr>
<tr>
<td valign="top" align="left">&#x003E;45</td>
<td valign="top" align="center">Gen X</td>
<td valign="top" align="center">39&#x2013;54</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">Baby boomers</td>
<td valign="top" align="center">55&#x2013;73</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">Silent Generation</td>
<td valign="top" align="center">&#x003E;73</td>
</tr>
</tbody>
</table></table-wrap>
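<p>The dual grouping in <xref ref-type="table" rid="T2">Table 2</xref> amounts to two simple mappings applied to the same attribute, so each record carries both the original binned feature and the synthetic generation feature. A minimal sketch, assuming the cut-offs listed in the table:</p>

```python
# Derive both Age features of Table 2: the regular four-bucket bin and the
# synthetic "generation" bin, so each record keeps the original and SF columns.
def age_binned(age):
    if age < 25:
        return "<25"
    if age <= 35:
        return "25-35"
    if age <= 45:
        return "36-45"
    return ">45"

def age_sf(age):
    # Generation cut-offs as listed in Table 2
    if age < 18:
        return "Gen alpha"
    if age <= 22:
        return "Gen Z"
    if age <= 38:
        return "Millennials"
    if age <= 54:
        return "Gen X"
    if age <= 73:
        return "Baby boomers"
    return "Silent Generation"

# Both derived columns are retained alongside the raw age
ages = [22, 30, 41, 67]
rows = [(a, age_binned(a), age_sf(a)) for a in ages]
print(rows)
```

Because the two binnings cut the age axis at different points, retaining both gives the classifiers two overlapping but non-redundant views of the same attribute.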
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p>Age (binned).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijscit-2023-34-g002.tif"/>
</fig>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption><p>Age (SF).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijscit-2023-34-g003.tif"/>
</fig>
<p>The attribute, Job, was binned in two ways, as shown in <xref ref-type="table" rid="T3">Table 3</xref>.</p>
<table-wrap position="float" id="T3">
<label>TABLE 3</label>
<caption><p>Job: binning and SF generation.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Job (binned)</td>
<td valign="top" align="center">Job (SF)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Unskilled</td>
<td valign="top" align="center">Unskilled</td>
</tr>
<tr>
<td valign="top" align="left">Skilled</td>
<td valign="top" align="center">Skilled</td>
</tr>
<tr>
<td valign="top" align="left">Highly skilled</td>
<td/>
</tr>
</tbody>
</table></table-wrap>
<p>The attributes, Checking and Savings, were binned as per <xref ref-type="table" rid="T4">Table 4</xref>.</p>
<table-wrap position="float" id="T4">
<label>TABLE 4</label>
<caption><p>Checking and savings: binning and SF generation.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Checking (binned)</td>
<td valign="top" align="center">Savings (binned)</td>
<td valign="top" align="center">SF</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Little</td>
<td valign="top" align="center">Little</td>
<td valign="top" align="center">&#x003C;Moderate</td>
</tr>
<tr>
<td valign="top" align="left">Moderate</td>
<td valign="top" align="center">Moderate</td>
<td valign="top" align="center">Moderate</td>
</tr>
<tr>
<td valign="top" align="left">Rich</td>
<td valign="top" align="center">Rich</td>
<td valign="top" align="center">&#x003E;Moderate</td>
</tr>
<tr>
<td valign="top" align="left">NA</td>
<td valign="top" align="center">Quite rich</td>
<td/>
</tr>
<tr>
<td/>
<td valign="top" align="center">NA</td>
<td/>
</tr>
</tbody>
</table></table-wrap>
<p>The Duration attribute was grouped into four buckets and was narrowed down to three buckets for synthetic feature generation, as shown in <xref ref-type="table" rid="T5">Table 5</xref>.</p>
<table-wrap position="float" id="T5">
<label>TABLE 5</label>
<caption><p>Duration (months): binning and SF generation.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Duration (binned)</td>
<td valign="top" align="center">Duration (SF)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">&#x003C;12</td>
<td valign="top" align="center">&#x003C;12</td>
</tr>
<tr>
<td valign="top" align="left">12&#x2013;18</td>
<td valign="top" align="center">12&#x2013;24</td>
</tr>
<tr>
<td valign="top" align="left">19&#x2013;24</td>
<td valign="top" align="center">&#x003E;24</td>
</tr>
<tr>
<td valign="top" align="left">&#x003E;24</td>
<td/>
</tr>
</tbody>
</table></table-wrap>
<p>The Credit attribute was grouped into four buckets and was narrowed down to three buckets for synthetic feature generation, as shown in <xref ref-type="table" rid="T6">Table 6</xref>.</p>
<table-wrap position="float" id="T6">
<label>TABLE 6</label>
<caption><p>Credit: binning and SF generation.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Credit (binned)</td>
<td valign="top" align="center">Credit (SF)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">&#x003C;1500</td>
<td valign="top" align="center">Low</td>
</tr>
<tr>
<td valign="top" align="left">1500&#x2013;3000</td>
<td valign="top" align="center">Medium</td>
</tr>
<tr>
<td valign="top" align="left">3001&#x2013;5000</td>
<td valign="top" align="center">High</td>
</tr>
<tr>
<td valign="top" align="left">&#x003E;5000</td>
<td/>
</tr>
</tbody>
</table></table-wrap>
</sec>
</sec>
</sec>
<sec id="S3.SS3">
<title>3.3. Dataset 2: Credit screening</title>
<p>The Credit Screening dataset was obtained from the UCI Machine Learning Repository and provided by Keogh et al. (<xref ref-type="bibr" rid="B10">10</xref>). The dataset contained 16 variables, but the variables were masked; hence, the converted attributes were used as per Rane (<xref ref-type="bibr" rid="B14">14</xref>). This dataset consisted of 690 labeled instances. A description of the attributes is presented in <xref ref-type="table" rid="T7">Table 7</xref>.</p>
<table-wrap position="float" id="T7">
<label>TABLE 7</label>
<caption><p>Dataset 2: credit screening (<xref ref-type="bibr" rid="B10">10</xref>).</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">No</td>
<td valign="top" align="left">Attribute</td>
<td valign="top" align="left">Original value</td>
<td valign="top" align="left">Converted value</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">1.</td>
<td valign="top" align="left">Age&#x002A;&#x002A;&#x002A;&#x002A;</td>
<td valign="top" align="left">Continuous</td>
<td valign="top" align="left"/></tr>
<tr>
<td valign="top" align="left">2.</td>
<td valign="top" align="left">Bank customer</td>
<td valign="top" align="left">g, p, gg</td>
<td valign="top" align="left"/></tr>
<tr>
<td valign="top" align="left">3.</td>
<td valign="top" align="left">Citizen</td>
<td valign="top" align="left">g, p, s</td>
<td valign="top" align="left"/></tr>
<tr>
<td valign="top" align="left">4.</td>
<td valign="top" align="left">Credit approved</td>
<td valign="top" align="left">+, &#x2212;</td>
<td valign="top" align="left">Yes, no</td>
</tr>
<tr>
<td valign="top" align="left">5.</td>
<td valign="top" align="left">Credit score</td>
<td valign="top" align="left">Continuous</td>
<td valign="top" align="left"/></tr>
<tr>
<td valign="top" align="left">6.</td>
<td valign="top" align="left">Debt</td>
<td valign="top" align="left">Continuous</td>
<td valign="top" align="left"/></tr>
<tr>
<td valign="top" align="left">7.</td>
<td valign="top" align="left">Driver&#x2019;s license</td>
<td valign="top" align="left">t, f</td>
<td valign="top" align="left">True, false</td>
</tr>
<tr>
<td valign="top" align="left">8.</td>
<td valign="top" align="left">Education&#x002A;&#x002A;&#x002A;</td>
<td valign="top" align="left">c, d, cc, I, j, k, m, r, q, w, x, e, aa, ff</td>
<td valign="top" align="left"/></tr>
<tr>
<td valign="top" align="left">9.</td>
<td valign="top" align="left">Employed</td>
<td valign="top" align="left">t, f</td>
<td valign="top" align="left">True, false</td>
</tr>
<tr>
<td valign="top" align="left">10.</td>
<td valign="top" align="left">Ethnicity&#x002A;&#x002A;</td>
<td valign="top" align="left">v, h, bb, j, n, z, dd, ff, o</td>
<td valign="top" align="left"/></tr>
<tr>
<td valign="top" align="left">11.</td>
<td valign="top" align="left">Gender&#x002A;&#x002A;&#x002A;&#x002A;</td>
<td valign="top" align="left">a, b</td>
<td valign="top" align="left">a = Male; b = Female</td>
</tr>
<tr>
<td valign="top" align="left">12.</td>
<td valign="top" align="left">Income&#x002A;&#x002A;&#x002A;</td>
<td valign="top" align="left">Continuous</td>
<td valign="top" align="left"/></tr>
<tr>
<td valign="top" align="left">13.</td>
<td valign="top" align="left">Married&#x002A;&#x002A;&#x002A;</td>
<td valign="top" align="left">u, y, l</td>
<td valign="top" align="left">No, yes, unknown</td>
</tr>
<tr>
<td valign="top" align="left">14.</td>
<td valign="top" align="left">Prior default</td>
<td valign="top" align="left">t, f</td>
<td valign="top" align="left">True, false</td>
</tr>
<tr>
<td valign="top" align="left">15.</td>
<td valign="top" align="left">Years employed</td>
<td valign="top" align="left">Continuous</td>
<td valign="top" align="left"/></tr>
<tr>
<td valign="top" align="left">16.</td>
<td valign="top" align="left">Zip code</td>
<td valign="top" align="left">Continuous</td>
<td valign="top" align="left"/></tr>
</tbody>
</table></table-wrap>
<sec id="S3.SS3.SSS1">
<title>3.3.1. Preprocessing the credit screening dataset</title>
<sec id="S3.SS3.SSS1.Px1">
<title>3.3.1.1. Calculating information gain</title>
<p>Information gain was calculated using the original set of attributes. As shown in <xref ref-type="fig" rid="F4">Figure 4</xref>, in the Credit Screening dataset (<xref ref-type="bibr" rid="B10">10</xref>), the attribute with the highest information gain was Prior Default, and the attribute with the second-highest information gain was Credit Score, with Employed following closely behind. The attributes that are not in <xref ref-type="fig" rid="F4">Figure 4</xref> have information gain values very close to zero.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption><p>Information gain on credit approved dataset (dataset 2).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijscit-2023-34-g004.tif"/>
</fig>
</sec>
<sec id="S3.SS3.SSS1.Px2">
<title>3.3.1.2. Removing attributes</title>
<p>Citizen, education level, ethnicity, and zip code were removed.</p>
<list list-type="simple">
<list-item>
<label>&#x2022;</label>
<p>Citizen had three values: g, p, and s; g accounted for 90.5% of the applications, so the assumption was made that most of the applicants were citizens, and therefore the attribute was not used.</p>
</list-item>
<list-item>
<label>&#x2022;</label>
<p>The Education Level attribute had 14 unique values in alpha form, which were not easily interpretable, hence the attribute was removed (not used).</p>
</list-item>
<list-item>
<label>&#x2022;</label>
<p>The Zip Code attribute contained values of between one and four digits, hence was not used.</p>
</list-item>
<list-item>
<label>&#x2022;</label>
<p>Ethnicity had too many values, some of which were inconsistent; hence this attribute was removed (not used).</p>
</list-item>
</list>
</sec>
<sec id="S3.SS3.SSS1.Px3">
<title>3.3.1.3. Handling missing values</title>
<p>All missing values were labeled &#x201C;unknown.&#x201D;</p>
</sec>
<sec id="S3.SS3.SSS1.Px4">
<title>3.3.1.4. Binning and synthetic feature generation</title>
<p>Five of the attributes were binned: Age, Credit Score, Income, Debt, and Years Employed. Age was grouped as per Dataset 1 (<xref ref-type="table" rid="T2">Table 2</xref>), hence is not shown here. The binning and synthetic feature generation of Credit Score and Income are shown in <xref ref-type="table" rid="T8">Tables 8</xref>, <xref ref-type="table" rid="T9">9</xref>, respectively. Debt and Years Employed were binned without SF generation, as shown in <xref ref-type="table" rid="T10">Tables 10</xref>, <xref ref-type="table" rid="T11">11</xref>, respectively.</p>
<table-wrap position="float" id="T8">
<label>TABLE 8</label>
<caption><p>Credit score: binning and SF generation.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Credit score (binned)</td>
<td valign="top" align="center">Credit score (SF)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">&#x003C;1</td>
<td valign="top" align="center">&#x003C;1</td>
</tr>
<tr>
<td valign="top" align="left">1</td>
<td valign="top" align="center">&#x2265;1</td>
</tr>
<tr>
<td valign="top" align="left">&#x003E;1</td>
<td/>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T9">
<label>TABLE 9</label>
<caption><p>Income: binning and SF generation.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Income (binned)</td>
<td valign="top" align="center">Income (SF)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">&#x003C;1</td>
<td valign="top" align="center">&#x003C;1</td>
</tr>
<tr>
<td valign="top" align="left">1&#x2013;499</td>
<td valign="top" align="center">&#x2265;1</td>
</tr>
<tr>
<td valign="top" align="left">500&#x2013;2000</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">&#x003E;2000</td>
<td/>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T10">
<label>TABLE 10</label>
<caption><p>Debt: binning.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Debt (binned)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">&#x003C;3</td>
</tr>
<tr>
<td valign="top" align="left">3&#x2013;6</td>
</tr>
<tr>
<td valign="top" align="left">&#x003E;6</td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T11">
<label>TABLE 11</label>
<caption><p>Years employed: binning.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Years employed (binned)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">&#x003C;1</td>
</tr>
<tr>
<td valign="top" align="left">1&#x2013;2</td>
</tr>
<tr>
<td valign="top" align="left">&#x003E;2</td>
</tr>
</tbody>
</table></table-wrap>
</sec>
</sec>
</sec>
<sec id="S3.SS4">
<title>3.4. Dataset 3: Credit</title>
<p>The third dataset, Credit, obtained from <ext-link ext-link-type="uri" xlink:href="http://Kaggle.com">Kaggle.com</ext-link>, was provided by Iacob (<xref ref-type="bibr" rid="B11">11</xref>). This dataset contained 400 instances with 11 attributes, as shown in <xref ref-type="table" rid="T12">Table 12</xref>. This research focused on the effects of the attributes on credit limit.</p>
<table-wrap position="float" id="T12">
<label>TABLE 12</label>
<caption><p>Dataset 3: credit (<xref ref-type="bibr" rid="B11">11</xref>).</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">No</td>
<td valign="top" align="center">Attribute</td>
<td valign="top" align="center">Description</td>
<td valign="top" align="center">Values</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">1.</td>
<td valign="top" align="center">Age&#x002A;&#x002A;&#x002A;&#x002A;</td>
<td valign="top" align="center">Age of cardholder</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">2.</td>
<td valign="top" align="center">Balance&#x002A;&#x002A;</td>
<td valign="top" align="center">Cardholder&#x2019;s average credit card balance in dollars</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">3.</td>
<td valign="top" align="center">Credit limit&#x002A;&#x002A;&#x002A;</td>
<td valign="top" align="center">Credit limit assigned by card company</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">4.</td>
<td valign="top" align="center">Credit rating</td>
<td valign="top" align="center">Cardholder&#x2019;s credit rating</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">5.</td>
<td valign="top" align="center">Education&#x002A;&#x002A;&#x002A;</td>
<td valign="top" align="center">Number of educational years</td>
<td valign="top" align="center">5 through 20</td>
</tr>
<tr>
<td valign="top" align="left">6.</td>
<td valign="top" align="center">Ethnicity&#x002A;&#x002A;</td>
<td valign="top" align="center">Cardholder&#x2019;s ethnicity</td>
<td valign="top" align="center">African American, Asian, Caucasian</td>
</tr>
<tr>
<td valign="top" align="left">7.</td>
<td valign="top" align="center">Gender&#x002A;&#x002A;&#x002A;&#x002A;</td>
<td valign="top" align="center">Cardholder&#x2019;s gender</td>
<td valign="top" align="center">Female, male</td>
</tr>
<tr>
<td valign="top" align="left">8.</td>
<td valign="top" align="center">ID</td>
<td valign="top" align="center">Cardholder&#x2019;s identification</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">9.</td>
<td valign="top" align="center">Income&#x002A;&#x002A;&#x002A;</td>
<td valign="top" align="center">Cardholder&#x2019;s income in &#x0024;10,000 increments</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">10.</td>
<td valign="top" align="center">Married&#x002A;&#x002A;&#x002A;</td>
<td valign="top" align="center">Cardholder&#x2019;s marital status</td>
<td valign="top" align="center">No, yes</td>
</tr>
<tr>
<td valign="top" align="left">11.</td>
<td valign="top" align="center">Number of cards</td>
<td valign="top" align="center">Cardholder&#x2019;s credit card count</td>
<td valign="top" align="center">1 through 9</td>
</tr>
<tr>
<td valign="top" align="left">12.</td>
<td valign="top" align="center">Student</td>
<td valign="top" align="center">Cardholder&#x2019;s current student status</td>
<td valign="top" align="center">No, yes</td>
</tr>
</tbody>
</table></table-wrap>
<sec id="S3.SS4.SSS1">
<title>3.4.1. Preprocessing the credit dataset</title>
<sec id="S3.SS4.SSS1.Px1">
<title>3.4.1.1. Calculating information gain</title>
<p>Information gain was calculated on the original attributes. From <xref ref-type="fig" rid="F5">Figure 5</xref>, it can be noted that the attribute with the highest information gain was monthly balance, with credit rating being the second highest. There are far fewer attributes in <xref ref-type="fig" rid="F5">Figure 5</xref> than in <xref ref-type="table" rid="T12">Table 12</xref> because the attributes not shown in <xref ref-type="fig" rid="F5">Figure 5</xref> have information gain values very close to zero.</p>
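The information gain computation used throughout this preprocessing can be sketched as follows. This is a minimal illustration in plain Python; the `entropy` and `information_gain` helpers and the toy data are ours, not from the paper, and Weka's implementation differs in its handling of continuous attributes.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(attribute, labels):
    """Reduction in class entropy after partitioning by `attribute`."""
    n = len(labels)
    partitions = {}
    for a, y in zip(attribute, labels):
        partitions.setdefault(a, []).append(y)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

# Toy check: a perfectly predictive attribute recovers the full class
# entropy; an uninformative one yields zero gain.
labels = ["low", "low", "high", "high"]
print(information_gain(["a", "a", "b", "b"], labels))  # 1.0
print(information_gain(["a", "b", "a", "b"], labels))  # 0.0
```

Attributes whose gain comes out near zero are exactly the ones the paper drops from Figures 5 and 6.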
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption><p>Information gain on credit limit dataset (dataset 3).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijscit-2023-34-g005.tif"/>
</fig>
</sec>
<sec id="S3.SS4.SSS1.Px2">
<title>3.4.1.2. Removing attributes</title>
<p>Student, ethnicity, income, ID, and number of cards were removed.</p>
<list list-type="simple">
<list-item>
<label>&#x2022;</label>
<p>The student attribute was removed because the other datasets did not contain a similar attribute and only 10% of the cardholders were students.</p>
</list-item>
<list-item>
<label>&#x2022;</label>
<p>Ethnicity attribute was also removed because it was not adequately identified in the other datasets.</p>
</list-item>
<list-item>
<label>&#x2022;</label>
<p>The income data did not appear to be correct, hence it was not used.</p>
</list-item>
<list-item>
<label>&#x2022;</label>
<p>The ID attribute was removed since it is only a record identifier.</p>
</list-item>
<list-item>
<label>&#x2022;</label>
<p>Number of cards was not used because the information gain was close to zero, and other datasets did not include this attribute.</p>
</list-item>
</list>
</sec>
<sec id="S3.SS4.SSS1.Px3">
<title>3.4.1.3. Binning and synthetic feature generation</title>
<p>For the Credit Limit dataset, synthetic feature creation was applied to two attributes: Age and Credit Limit. Age was grouped into the same buckets as in the German Credit dataset (<xref ref-type="table" rid="T2">Table 2</xref>). Credit Limit was first grouped into numerical buckets, and a second grouping (the synthetic feature) used the values High, Medium, and Low, as shown in <xref ref-type="table" rid="T13">Table 13</xref>. Other attributes that were binned were monthly balance (<xref ref-type="table" rid="T14">Table 14</xref>), credit rating (<xref ref-type="table" rid="T15">Table 15</xref>), and education (<xref ref-type="table" rid="T16">Table 16</xref>).</p>
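The two-level grouping of Credit Limit can be sketched as below. The numeric cut points follow the binned column; however, since the paper lists the three High/Medium/Low values beside four numeric bins without explicitly pairing them, the thresholds in `credit_limit_sf` are our assumption.

```python
def bin_credit_limit(amount):
    """Numeric bucket using the binned cut points (<1500, 1500-3000, 3001-5000, >5000)."""
    if amount < 1500:
        return "<1500"
    if amount <= 3000:
        return "1500-3000"
    if amount <= 5000:
        return "3001-5000"
    return ">5000"

def credit_limit_sf(amount):
    """Coarser synthetic feature. The exact pairing of numeric bins onto
    Low/Medium/High is not stated in the paper; these cut points are assumed."""
    if amount < 1500:
        return "Low"
    if amount <= 5000:
        return "Medium"
    return "High"

print(bin_credit_limit(2500), credit_limit_sf(2500))  # 1500-3000 Medium
```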
<table-wrap position="float" id="T13">
<label>TABLE 13</label>
<caption><p>Credit limit: binning and SF generation.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Credit limit (binned)</td>
<td valign="top" align="center">Credit limit (SF)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">&#x003C;1500</td>
<td valign="top" align="center">Low</td>
</tr>
<tr>
<td valign="top" align="left">1500&#x2013;3000</td>
<td valign="top" align="center">Medium</td>
</tr>
<tr>
<td valign="top" align="left">3001&#x2013;5000</td>
<td valign="top" align="center">High</td>
</tr>
<tr>
<td valign="top" align="left">&#x003E;5000</td>
<td/>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T14">
<label>TABLE 14</label>
<caption><p>Balance: binning.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Balance (binned)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">&#x003C;350</td>
</tr>
<tr>
<td valign="top" align="left">350&#x2013;700</td>
</tr>
<tr>
<td valign="top" align="left">&#x003E;700</td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T15">
<label>TABLE 15</label>
<caption><p>Credit rating: binning.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Credit rating (binned)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">&#x003C;560</td>
</tr>
<tr>
<td valign="top" align="left">560&#x2013;659</td>
</tr>
<tr>
<td valign="top" align="left">660&#x2013;724</td>
</tr>
<tr>
<td valign="top" align="left">725&#x2013;759</td>
</tr>
<tr>
<td valign="top" align="left">&#x003E;759</td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T16">
<label>TABLE 16</label>
<caption><p>Education: binning.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Education (binned)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">&#x003C;13</td>
</tr>
<tr>
<td valign="top" align="left">13&#x2013;16</td>
</tr>
<tr>
<td valign="top" align="left">&#x003E;16</td>
</tr>
</tbody>
</table></table-wrap>
</sec>
</sec>
</sec>
<sec id="S3.SS5">
<title>3.5. Dataset 4: Bank churners dataset</title>
<p>The Bank Churners dataset, obtained from <ext-link ext-link-type="uri" xlink:href="http://Kaggle.com">Kaggle.com</ext-link>, was provided by Goyal (<xref ref-type="bibr" rid="B12">12</xref>). This dataset contained 10,127 instances with 27 attributes, as shown in <xref ref-type="table" rid="T17">Table 17</xref>. The Bank Churners dataset focused on bank customers and the relationship between attrition and the other attributes in the dataset.</p>
<table-wrap position="float" id="T17">
<label>TABLE 17</label>
<caption><p>Dataset 4: bank churners (<xref ref-type="bibr" rid="B12">12</xref>).</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">No</td>
<td valign="top" align="center">Attribute</td>
<td valign="top" align="center">Description</td>
<td valign="top" align="center">Values</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">1.</td>
<td valign="top" align="center">Age&#x002A;&#x002A;&#x002A;&#x002A;</td>
<td valign="top" align="center">Age of customer</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">2.</td>
<td valign="top" align="center">Attrition</td>
<td valign="top" align="center">Account closed or open</td>
<td valign="top" align="center">Existing/attrited</td>
</tr>
<tr>
<td valign="top" align="left">3.</td>
<td valign="top" align="center">Balance&#x002A;&#x002A;</td>
<td valign="top" align="center">Revolving balance</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">4.</td>
<td valign="top" align="center">Card category</td>
<td valign="top" align="center">Card category</td>
<td valign="top" align="center">Blue/gold/platinum/silver</td>
</tr>
<tr>
<td valign="top" align="left">5.</td>
<td valign="top" align="center">Client number</td>
<td valign="top" align="center">Unique identifier for each client</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">6.</td>
<td valign="top" align="center">Contacts</td>
<td valign="top" align="center">Number of contacts in last 12 months</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">7.</td>
<td valign="top" align="center">Credit limit&#x002A;&#x002A;&#x002A;</td>
<td valign="top" align="center">Card credit limit</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">8.</td>
<td valign="top" align="center">Dependents</td>
<td valign="top" align="center">Number of dependents</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">9.</td>
<td valign="top" align="center">Education&#x002A;&#x002A;&#x002A;</td>
<td valign="top" align="center">Education level</td>
<td valign="top" align="center">High school/college/graduate/doctorate/post-graduate/uneducated</td>
</tr>
<tr>
<td valign="top" align="left">10.</td>
<td valign="top" align="center">Gender&#x002A;&#x002A;&#x002A;&#x002A;</td>
<td valign="top" align="center">Gender</td>
<td valign="top" align="center">Male/female</td>
</tr>
<tr>
<td valign="top" align="left">11.</td>
<td valign="top" align="center">Income&#x002A;&#x002A;&#x002A;</td>
<td valign="top" align="center">Income</td>
<td valign="top" align="center">Categorized</td>
</tr>
<tr>
<td valign="top" align="left">12.</td>
<td valign="top" align="center">Married&#x002A;&#x002A;&#x002A;</td>
<td valign="top" align="center">Marital status</td>
<td valign="top" align="center">Married/single/divorced</td>
</tr>
<tr>
<td valign="top" align="left">13.</td>
<td valign="top" align="center">Months w/Bank</td>
<td valign="top" align="center">Number of active months with bank</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">14.</td>
<td valign="top" align="center">Months Inactive</td>
<td valign="top" align="center">Number of inactive months</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">15.</td>
<td valign="top" align="center">Months total</td>
<td valign="top" align="center">Number of months with bank</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">16.</td>
<td valign="top" align="center">Open to buy</td>
<td valign="top" align="center">Open to buy credit line</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">17.</td>
<td valign="top" align="center">Total amt change</td>
<td valign="top" align="center">Change in transaction amount (Q4 over Q1)</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">18.</td>
<td valign="top" align="center">Total trans amount</td>
<td valign="top" align="center">Total transaction amount in last 12 months</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">19.</td>
<td valign="top" align="center">Total trans count</td>
<td valign="top" align="center">Total transaction count in last 12 months</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">20.</td>
<td valign="top" align="center">Diff trans cnt qtrly</td>
<td valign="top" align="center">Difference in transaction count (Q4 over Q1)</td>
<td valign="top" align="center">Continuous</td>
</tr>
<tr>
<td valign="top" align="left">21.</td>
<td valign="top" align="center">Utilization ratio</td>
<td valign="top" align="center">Average card utilization ratio</td>
<td valign="top" align="center">Continuous</td>
</tr>
</tbody>
</table></table-wrap>
<sec id="S3.SS5.SSS1">
<title>3.5.1. Preprocessing the bank churners dataset</title>
<sec id="S3.SS5.SSS1.Px1">
<title>3.5.1.1. Calculating information gain</title>
<p>Information gain was calculated using the original attributes. As shown in <xref ref-type="fig" rid="F6">Figure 6</xref>, in the Bank Churners dataset, the attribute with the highest information gain was income, followed by gender and revolving balance. The attributes that are not in <xref ref-type="fig" rid="F6">Figure 6</xref> have information gain values very close to zero.</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption><p>Information gain on bank churners dataset (dataset 4).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijscit-2023-34-g006.tif"/>
</fig>
</sec>
<sec id="S3.SS5.SSS1.Px2">
<title>3.5.1.2. Removing attributes</title>
<p>Attrition, Card Category, Client Number, Contacts, Dependents, Months Inactive, Months Total, Open to Buy, Total Amount Change, Total Transaction Amount, Total Transaction Count, Difference Transaction Count Quarterly, and Utilization Ratio Average were removed since they were not considered relevant for this analysis.</p>
</sec>
<sec id="S3.SS5.SSS1.Px3">
<title>3.5.1.3. Binning and synthetic feature generation</title>
<p>For Dataset 4, synthetic feature creation was done for Age (grouped as per the German Credit dataset, <xref ref-type="table" rid="T2">Table 2</xref>), Credit Limit (grouped as per the Credit dataset, <xref ref-type="table" rid="T13">Table 13</xref>), and Married. The Married attribute contained the following values: divorced, married, single, and unknown. Divorced and Single were grouped together into &#x201C;no&#x201D; (not married), leaving Yes, No, and Unknown as the grouped values for the Married attribute, as shown in <xref ref-type="table" rid="T18">Table 18</xref>. Income was binned as per <xref ref-type="table" rid="T19">Table 19</xref>, and the synthetic feature for income was also categorized as shown in <xref ref-type="table" rid="T19">Table 19</xref>. Balance Revolving was binned as per <xref ref-type="table" rid="T20">Table 20</xref>, and Months with Bank was binned as per <xref ref-type="table" rid="T21">Table 21</xref>.</p>
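The regroupings described above amount to simple value mappings. A minimal sketch follows; the Married mapping is stated in the text, while the pairing of the six income brackets onto the four synthetic-feature brackets is inferred from the bracket boundaries (labels are simplified to ASCII here).

```python
# Collapse the four raw marital-status values into three SF values:
# Divorced and Single both become "No", per the text.
MARRIED_SF = {"Divorced": "No", "Single": "No", "Married": "Yes", "Unknown": "Unknown"}

# Coarsen the six original income brackets into four SF brackets.
# The exact pairing is our inference from the bracket boundaries.
INCOME_SF = {
    "<$40,000": "<$60,000",
    "$40,000-$60,000": "<$60,000",
    "$60,000-$80,000": "$60,000-$120,000",
    "$80,000-$120,000": "$60,000-$120,000",
    "$120,000+": "$120,000+",
    "Unknown": "Unknown",
}

print(MARRIED_SF["Divorced"], INCOME_SF["$60,000-$80,000"])  # No $60,000-$120,000
```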
<table-wrap position="float" id="T18">
<label>TABLE 18</label>
<caption><p>Married: binning and SF generation.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Married (binned)</td>
<td valign="top" align="center">Married (SF)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Divorced</td>
<td valign="top" align="center">No</td>
</tr>
<tr>
<td valign="top" align="left">Married</td>
<td valign="top" align="center">Yes</td>
</tr>
<tr>
<td valign="top" align="left">Single</td>
<td valign="top" align="center">Unknown</td>
</tr>
<tr>
<td valign="top" align="left">Unknown</td>
<td/>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T19">
<label>TABLE 19</label>
<caption><p>Income: binning and SF generation.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Income (original bin)</td>
<td valign="top" align="center">Income (SF)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">&#x003C;&#x0024;40,000</td>
<td valign="top" align="center">&#x003C;&#x0024;60,000</td>
</tr>
<tr>
<td valign="top" align="left">&#x0024;40,000&#x2013;&#x0024;60,000</td>
<td valign="top" align="center">&#x0024;60,000&#x2013;&#x0024;120,000</td>
</tr>
<tr>
<td valign="top" align="left">&#x0024;60,000&#x2013;&#x0024;80,000</td>
<td valign="top" align="center">&#x0024;120,000+</td>
</tr>
<tr>
<td valign="top" align="left">&#x0024;80,000&#x2013;&#x0024;120,000</td>
<td valign="top" align="center">Unknown</td>
</tr>
<tr>
<td valign="top" align="left">&#x0024;120,000+</td>
<td/>
</tr>
<tr>
<td valign="top" align="left">Unknown</td>
<td/>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T20">
<label>TABLE 20</label>
<caption><p>Balance revolving: binning.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Balance revolving (binned)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">&#x003C;1000</td>
</tr>
<tr>
<td valign="top" align="left">1000&#x2013;2000</td>
</tr>
<tr>
<td valign="top" align="left">&#x003E;2000</td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T21">
<label>TABLE 21</label>
<caption><p>Months with bank: binning.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Months with bank (binned)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">&#x003C;24</td>
</tr>
<tr>
<td valign="top" align="left">24&#x2013;35</td>
</tr>
<tr>
<td valign="top" align="left">36&#x2013;47</td>
</tr>
<tr>
<td valign="top" align="left">&#x003E;47</td>
</tr>
</tbody>
</table></table-wrap>
</sec>
</sec>
</sec>
</sec>
<sec id="S4">
<title>4. Classifiers used</title>
<p>For classification, three tree-based classifiers (Random Forest, Random Tree, and REPTree) and one non-tree-based classifier (Na&#x00EF;ve Bayes) were used.</p>
<sec id="S4.SS1">
<title>4.1. Random forest</title>
<p>Random Forest is a widely used machine learning classifier that constructs many decision trees at random; the term &#x201C;forest&#x201D; comes from this collection of trees. In Random Forest, each tree is produced independently without pruning, and nodes are split based on a user-specified number of randomly selected features (<xref ref-type="bibr" rid="B15">15</xref>). There is also work on how a user can prune a Random Forest: Kulkarni and Sinha (<xref ref-type="bibr" rid="B16">16</xref>) showed a way of pruning by limiting the number of trees.</p>
</sec>
<sec id="S4.SS2">
<title>4.2. Random tree</title>
<p>The Random Tree classifier is similar to the Random Forest classifier, but it constructs only one decision tree, built from a random subset of attributes. The classifier draws a set of data to build the tree, and every node is split using the best split among the candidate variables (<xref ref-type="bibr" rid="B17">17</xref>). Essentially, Random Tree is the simpler of the two classifiers, but Random Forest tends to achieve better accuracy because constructing multiple trees decreases the variance.</p>
</sec>
<sec id="S4.SS3">
<title>4.3. The REPTree classifier</title>
<p>Reduced error pruning tree (REPTree) builds its decision tree based on information gain (<xref ref-type="bibr" rid="B18">18</xref>). The tree built may be a decision or regression tree, but it is used here for classification, and multiple trees are created in different iterations (<xref ref-type="bibr" rid="B18">18</xref>). When the algorithm runs, it works from the bottom nodes up to the top and, at each node, assesses whether replacing the subtree with the most frequent class would improve accuracy, pruning away branches that would cause a reduction in accuracy (<xref ref-type="bibr" rid="B19">19</xref>). REPTree sorts numeric attributes only once.</p>
</sec>
<sec id="S4.SS4">
<title>4.4. The Na&#x00EF;ve Bayes classifier</title>
<p>Na&#x00EF;ve Bayes was chosen as an additional classifier because it is not a tree classifier. The Na&#x00EF;ve Bayes classifier assumes that all attributes are independent of one another given the class attribute (<xref ref-type="bibr" rid="B20">20</xref>).</p>
</sec>
</sec>
<sec id="S5" sec-type="results|discussion">
<title>5. Results and discussion</title>
<p>The four different classifiers were run using Weka. For each dataset and each classifier, we looked at the following statistical measures: accuracy, true positive rate (TPR), false positive rate (FPR), precision, F-measure, and ROC area.</p>
<p>Accuracy is the ratio of a model&#x2019;s correctly classified instances (TP + TN) to the total number of instances, calculated by the following equation:</p>
<disp-formula id="S5.Ex1">
<mml:math id="M1">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mtext>TP</mml:mtext>
<mml:mo>+</mml:mo>
<mml:mi>TN</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mtext>TP</mml:mtext>
<mml:mo>+</mml:mo>
<mml:mi>TN</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>FP</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>FN</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>TPR, also called sensitivity or recall, measures the proportion of actual positive instances that were identified as positive, given by the following equation:</p>
<disp-formula id="S5.Ex2">
<mml:math id="M2">
<mml:mrow>
<mml:mrow>
<mml:mtext>TP</mml:mtext>
<mml:mo>/</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mtext>TP</mml:mtext>
<mml:mo>+</mml:mo>
<mml:mtext>FN</mml:mtext>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>FPR is the proportion of negative instances that were incorrectly identified as positive, given by the following equation:</p>
<disp-formula id="S5.Ex3">
<mml:math id="M3">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Precision measures the proportion of instances identified as positive that were actually positive, given by the following equation:</p>
<disp-formula id="S5.Ex4">
<mml:math id="M4">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>P</mml:mi>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>F-measure is the harmonic mean of precision and recall, calculated by the following equation:</p>
<disp-formula id="S5.Ex5">
<mml:math id="M5">
<mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mtext>Precision</mml:mtext>
<mml:mo>&#x00D7;</mml:mo>
<mml:mtext>Recall</mml:mtext>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>/</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mtext>Precision</mml:mtext>
<mml:mo>+</mml:mo>
<mml:mtext>Recall</mml:mtext>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>The ROC curve plots the TPR against the FPR.</p>
<p>Where:</p>
<list list-type="simple">
<list-item>
<label>&#x2022;</label>
<p>True positives (TP) are instances that were correctly identified as positive.</p>
</list-item>
<list-item>
<label>&#x2022;</label>
<p>True negatives (TN) are instances that were correctly identified as negative.</p>
</list-item>
<list-item>
<label>&#x2022;</label>
<p>False positives (FP) are instances that were incorrectly identified as positive.</p>
</list-item>
<list-item>
<label>&#x2022;</label>
<p>False negatives (FN) are instances that were incorrectly identified as negative.</p>
</list-item>
</list>
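All of the statistical measures above follow directly from the four confusion-matrix counts; a minimal sketch (the counts in the usage line are illustrative, not results from the paper):

```python
def metrics(tp, tn, fp, fn):
    """Compute the measures used in this section from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn)            # sensitivity / recall
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)  # harmonic mean
    return accuracy, tpr, fpr, precision, f_measure

acc, tpr, fpr, prec, f1 = metrics(tp=40, tn=45, fp=5, fn=10)
print(acc, tpr, fpr, round(prec, 3), round(f1, 3))  # 0.85 0.8 0.1 0.889 0.842
```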
<p>For each of the classification runs, various combinations of original attributes (OR), original binned attributes (Bin), and synthetic features (SF) were used. The original attributes and original binned attributes were selected based on information gain that was performed on each respective dataset.</p>
<p>The best results were taken to be the runs that achieved the highest statistical measures with the minimal set of attributes.</p>
<sec id="S5.SS1">
<title>5.1. Classification results for the German credit dataset</title>
<p>For the German credit dataset (<xref ref-type="bibr" rid="B9">9</xref>), credit amount, a continuous attribute, was used as the class variable for classification. <xref ref-type="table" rid="T22">Tables 22</xref>&#x2013;<xref ref-type="table" rid="T25">25</xref> present the statistical results of the classifications.</p>
<table-wrap position="float" id="T22">
<label>TABLE 22</label>
<caption><p>Na&#x00EF;ve Bayes classification using the German credit dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">No. of attributes used</td>
<td valign="top" align="center">Accur.</td>
<td valign="top" align="center">TPR</td>
<td valign="top" align="center">FPR</td>
<td valign="top" align="center">Precision</td>
<td valign="top" align="center">F-measure</td>
<td valign="top" align="center">ROC</td>
<td valign="top" align="center">Attributes used</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">8 (4 OR; 4 Bin; 0 SF)</td>
<td valign="top" align="center">46.1%</td>
<td valign="top" align="center">46.1%</td>
<td valign="top" align="center">21.1%</td>
<td valign="top" align="center">43.2%</td>
<td valign="top" align="center">43.5%</td>
<td valign="top" align="center">71.2%</td>
<td valign="top" align="center">1, 2, 3, 4, 5, 6, 7, 9</td>
</tr>
<tr>
<td valign="top" align="left">8 (4 OR; 3 Bin; 1 SF)</td>
<td valign="top" align="center">72.8%</td>
<td valign="top" align="center">72.8%</td>
<td valign="top" align="center">39.9%</td>
<td valign="top" align="center">72.0%</td>
<td valign="top" align="center">72.3%</td>
<td valign="top" align="center">79.5%</td>
<td valign="top" align="center">1, 2, 3SF, 4, 5, 6, 7, 9</td>
</tr>
<tr>
<td valign="top" align="left">5 (1 OR; 3 Bin; 1 SF)</td>
<td valign="top" align="center">72.6%</td>
<td valign="top" align="center">72.6%</td>
<td valign="top" align="center">39.5%</td>
<td valign="top" align="center">72.0%</td>
<td valign="top" align="center">72.2%</td>
<td valign="top" align="center">79.1%</td>
<td valign="top" align="center">1, 3SF, 4, 6, 7</td>
</tr>
<tr>
<td valign="top" align="left">10 (4 OR; 2 Bin; 3 SF)</td>
<td valign="top" align="center">73.6%</td>
<td valign="top" align="center">73.6%</td>
<td valign="top" align="center">32.7%</td>
<td valign="top" align="center">74.5%</td>
<td valign="top" align="center">74.0%</td>
<td valign="top" align="center">79.5%</td>
<td valign="top" align="center">1, 1SF, 2, 3SF, 4, 4SF, 5, 6, 7, 9</td>
</tr>
<tr>
<td valign="top" align="left">11 (2 OR; 3 Bin; 6 SF)</td>
<td valign="top" align="center">72.3%</td>
<td valign="top" align="center">72.3%</td>
<td valign="top" align="center">36.0%</td>
<td valign="top" align="center">72.7%</td>
<td valign="top" align="center">72.5%</td>
<td valign="top" align="center">78.6%</td>
<td valign="top" align="center">1, 1SF, 2, 2SF, 3SF, 4, 4SF, 7, 7SF, 9, 9SF</td>
</tr>
<tr>
<td valign="top" align="left">12 (3 OR; 3 Bin; 6 SF)</td>
<td valign="top" align="center">72.5%</td>
<td valign="top" align="center">72.5%</td>
<td valign="top" align="center">35.3%</td>
<td valign="top" align="center">73.1%</td>
<td valign="top" align="center">72.7%</td>
<td valign="top" align="center">78.9%</td>
<td valign="top" align="center">1, 1SF, 2, 2SF, 3SF, 4, 4SF, 5, 7, 7SF, 9, 9SF</td>
</tr>
<tr>
<td valign="top" align="left">13 (4 OR; 3 Bin; 6 SF)</td>
<td valign="top" align="center">72.5%</td>
<td valign="top" align="center">72.5%</td>
<td valign="top" align="center">35.0%</td>
<td valign="top" align="center">73.2%</td>
<td valign="top" align="center">72.8%</td>
<td valign="top" align="center">78.9%</td>
<td valign="top" align="center">1, 1SF, 2, 2SF, 3SF, 4, 4SF, 5, 6, 7, 7SF, 9, 9SF</td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T23">
<label>TABLE 23</label>
<caption><p>Random forest classification using the German credit dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">No. of attributes used</td>
<td valign="top" align="center">Accur.</td>
<td valign="top" align="center">TPR</td>
<td valign="top" align="center">FPR</td>
<td valign="top" align="center">Precision</td>
<td valign="top" align="center">F-measure</td>
<td valign="top" align="center">ROC</td>
<td valign="top" align="center">Attributes used</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">8 (4 OR; 4 Bin; 0 SF)</td>
<td valign="top" align="center">38.0%</td>
<td valign="top" align="center">38.0%</td>
<td valign="top" align="center">23.0%</td>
<td valign="top" align="center">38.0%</td>
<td valign="top" align="center">37.4%</td>
<td valign="top" align="center">65.8%</td>
<td valign="top" align="center">1,2,3,4,5,6,7,9</td>
</tr>
<tr>
<td valign="top" align="left">8 (4 OR; 3 Bin; 1 SF)</td>
<td valign="top" align="center">71.6%</td>
<td valign="top" align="center">71.6%</td>
<td valign="top" align="center">40.5%</td>
<td valign="top" align="center">71.1%</td>
<td valign="top" align="center">71.3%</td>
<td valign="top" align="center">75.5%</td>
<td valign="top" align="center">1, 2, 3SF, 4, 5, 6, 7, 9</td>
</tr>
<tr>
<td valign="top" align="left">5 (1 OR; 3 Bin; 1 SF)</td>
<td valign="top" align="center">73.2%</td>
<td valign="top" align="center">73.2%</td>
<td valign="top" align="center">40.0%</td>
<td valign="top" align="center">72.3%</td>
<td valign="top" align="center">72.6%</td>
<td valign="top" align="center">76.7%</td>
<td valign="top" align="center">1, 3SF, 4, 6, 7</td>
</tr>
<tr>
<td valign="top" align="left">10 (4 OR; 2 Bin; 3 SF)</td>
<td valign="top" align="center">71.5%</td>
<td valign="top" align="center">71.5%</td>
<td valign="top" align="center">41.1%</td>
<td valign="top" align="center">70.8%</td>
<td valign="top" align="center">71.1%</td>
<td valign="top" align="center">75.3%</td>
<td valign="top" align="center">1, 1SF, 2, 3SF, 4, 4SF, 5, 6, 7, 9</td>
</tr>
<tr>
<td valign="top" align="left">11 (2 OR; 3 Bin; 6 SF)</td>
<td valign="top" align="center">71.5%</td>
<td valign="top" align="center">71.5%</td>
<td valign="top" align="center">43.1%</td>
<td valign="top" align="center">70.4%</td>
<td valign="top" align="center">70.7%</td>
<td valign="top" align="center">74.9%</td>
<td valign="top" align="center">1, 1SF, 2, 2SF, 3SF, 4, 4SF, 7, 7SF, 9, 9SF</td>
</tr>
<tr>
<td valign="top" align="left">12 (3 OR; 3 Bin; 6 SF)</td>
<td valign="top" align="center">70.7%</td>
<td valign="top" align="center">70.7%</td>
<td valign="top" align="center">42.9%</td>
<td valign="top" align="center">69.8%</td>
<td valign="top" align="center">70.2%</td>
<td valign="top" align="center">74.3%</td>
<td valign="top" align="center">1, 1SF, 2, 2SF, 3SF, 4, 4SF, 5, 7, 7SF, 9, 9SF</td>
</tr>
<tr>
<td valign="top" align="left">13 (4 OR; 3 Bin; 6 SF)</td>
<td valign="top" align="center">71.5%</td>
<td valign="top" align="center">71.5%</td>
<td valign="top" align="center">41.4%</td>
<td valign="top" align="center">70.7%</td>
<td valign="top" align="center">71.1%</td>
<td valign="top" align="center">75.0%</td>
<td valign="top" align="center">1, 1SF, 2, 2SF, 3SF, 4, 4SF, 5, 6, 7, 7SF, 9, 9SF</td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T24">
<label>TABLE 24</label>
<caption><p>Random tree classification using the German credit dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">No. of attributes used</td>
<td valign="top" align="center">Accur.</td>
<td valign="top" align="center">TPR</td>
<td valign="top" align="center">FPR</td>
<td valign="top" align="center">Precision</td>
<td valign="top" align="center">F-measure</td>
<td valign="top" align="center">ROC</td>
<td valign="top" align="center">Attributes used</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">8 (4 OR; 4 Bin; 0 SF)</td>
<td valign="top" align="center">37.6%</td>
<td valign="top" align="center">37.6%</td>
<td valign="top" align="center">23.4%</td>
<td valign="top" align="center">35.1%</td>
<td valign="top" align="center">35.8%</td>
<td valign="top" align="center">59.9%</td>
<td valign="top" align="center">1,2,3,4,5,6,7,9</td>
</tr>
<tr>
<td valign="top" align="left">8 (4 OR; 3 Bin; 1 SF)</td>
<td valign="top" align="center">68.7%</td>
<td valign="top" align="center">68.7%</td>
<td valign="top" align="center">40.3%</td>
<td valign="top" align="center">69.4%</td>
<td valign="top" align="center">69.0%</td>
<td valign="top" align="center">67.5%</td>
<td valign="top" align="center">1, 2, 3SF, 4, 5, 6, 7, 9</td>
</tr>
<tr>
<td valign="top" align="left">5 (1 OR; 3 Bin; 1 SF)</td>
<td valign="top" align="center">72.3%</td>
<td valign="top" align="center">72.3%</td>
<td valign="top" align="center">38.7%</td>
<td valign="top" align="center">72.0%</td>
<td valign="top" align="center">72.1%</td>
<td valign="top" align="center">75.4%</td>
<td valign="top" align="center">1, 3SF, 4, 6, 7</td>
</tr>
<tr>
<td valign="top" align="left">10 (4 OR; 2 Bin; 3 SF)</td>
<td valign="top" align="center">69.9%</td>
<td valign="top" align="center">69.9%</td>
<td valign="top" align="center">39.2%</td>
<td valign="top" align="center">70.4%</td>
<td valign="top" align="center">70.1%</td>
<td valign="top" align="center">67.9%</td>
<td valign="top" align="center">1, 1SF, 2, 3SF, 4, 4SF, 5, 6, 7, 9</td>
</tr>
<tr>
<td valign="top" align="left">11 (2 OR; 3 Bin; 6 SF)</td>
<td valign="top" align="center">69.2%</td>
<td valign="top" align="center">69.2%</td>
<td valign="top" align="center">41.4%</td>
<td valign="top" align="center">69.3%</td>
<td valign="top" align="center">69.3%</td>
<td valign="top" align="center">70.0%</td>
<td valign="top" align="center">1, 1SF, 2, 2SF, 3SF, 4, 4SF, 7, 7SF, 9, 9SF</td>
</tr>
<tr>
<td valign="top" align="left">12 (3 OR; 3 Bin; 6 SF)</td>
<td valign="top" align="center">68.0%</td>
<td valign="top" align="center">68.0%</td>
<td valign="top" align="center">41.7%</td>
<td valign="top" align="center">68.5%</td>
<td valign="top" align="center">68.2%</td>
<td valign="top" align="center">66.4%</td>
<td valign="top" align="center">1, 1SF, 2, 2SF, 3SF, 4, 4SF, 5, 7, 7SF, 9, 9SF</td>
</tr>
<tr>
<td valign="top" align="left">13 (4 OR; 3 Bin; 6 SF)</td>
<td valign="top" align="center">69.6%</td>
<td valign="top" align="center">69.6%</td>
<td valign="top" align="center">39.5%</td>
<td valign="top" align="center">70.1%</td>
<td valign="top" align="center">69.8%</td>
<td valign="top" align="center">67.3%</td>
<td valign="top" align="center">1, 1SF, 2, 2SF, 3SF, 4, 4SF, 5, 6, 7, 7SF, 9, 9SF</td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T25">
<label>TABLE 25</label>
<caption><p>REPTree classification using the German credit dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">No. of attributes used</td>
<td valign="top" align="center">Accur.</td>
<td valign="top" align="center">TPR</td>
<td valign="top" align="center">FPR</td>
<td valign="top" align="center">Precision</td>
<td valign="top" align="center">F-measure</td>
<td valign="top" align="center">ROC</td>
<td valign="top" align="center">Attributes used</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">8 (4 OR; 4 Bin; 0 SF)</td>
<td valign="top" align="center">45.7%</td>
<td valign="top" align="center">45.7%</td>
<td valign="top" align="center">21.6%</td>
<td valign="top" align="center">41.8%</td>
<td valign="top" align="center">40.3%</td>
<td valign="top" align="center">68.0%</td>
<td valign="top" align="center">1,2,3,4,5,6,7,9</td>
</tr>
<tr>
<td valign="top" align="left">8 (4 OR; 3 Bin; 1 SF)</td>
<td valign="top" align="center">72.7%</td>
<td valign="top" align="center">72.7%</td>
<td valign="top" align="center">40.0%</td>
<td valign="top" align="center">71.9%</td>
<td valign="top" align="center">72.2%</td>
<td valign="top" align="center">76.4%</td>
<td valign="top" align="center">1,2,3SF,4,5,6,7,9</td>
</tr>
<tr>
<td valign="top" align="left">5 (1 OR; 3 Bin; 1 SF)</td>
<td valign="top" align="center">72.5%</td>
<td valign="top" align="center">72.5%</td>
<td valign="top" align="center">44.5%</td>
<td valign="top" align="center">70.9%</td>
<td valign="top" align="center">71.2%</td>
<td valign="top" align="center">76.7%</td>
<td valign="top" align="center">1,3SF,4,6,7</td>
</tr>
<tr>
<td valign="top" align="left">10 (4 OR; 2 Bin; 3 SF)</td>
<td valign="top" align="center">71.8%</td>
<td valign="top" align="center">71.8%</td>
<td valign="top" align="center">42.6%</td>
<td valign="top" align="center">70.7%</td>
<td valign="top" align="center">71.1%</td>
<td valign="top" align="center">75.6%</td>
<td valign="top" align="center">1,1SF,2,3SF,4,4SF,5,6,7,9</td>
</tr>
<tr>
<td valign="top" align="left">11 (2 OR; 3 Bin; 6 SF)</td>
<td valign="top" align="center">72.2%</td>
<td valign="top" align="center">72.2%</td>
<td valign="top" align="center">43.0%</td>
<td valign="top" align="center">70.9%</td>
<td valign="top" align="center">71.3%</td>
<td valign="top" align="center">76.5%</td>
<td valign="top" align="center">1,1SF,2,2SF,3SF,4,4SF,7,7SF,9,9SF</td>
</tr>
<tr>
<td valign="top" align="left">12 (3 OR; 3 Bin; 6 SF)</td>
<td valign="top" align="center">71.8%</td>
<td valign="top" align="center">71.8%</td>
<td valign="top" align="center">42.9%</td>
<td valign="top" align="center">70.6%</td>
<td valign="top" align="center">71.0%</td>
<td valign="top" align="center">75.5%</td>
<td valign="top" align="center">1,1SF,2,2SF,3SF,4,4SF,5,7,7SF,9,9SF</td>
</tr>
<tr>
<td valign="top" align="left">13 (4 OR; 3 Bin; 6 SF)</td>
<td valign="top" align="center">71.8%</td>
<td valign="top" align="center">71.8%</td>
<td valign="top" align="center">42.6%</td>
<td valign="top" align="center">70.7%</td>
<td valign="top" align="center">71.1%</td>
<td valign="top" align="center">75.6%</td>
<td valign="top" align="center">1,1SF,2,2SF,3SF,4,4SF,5,6,7,7SF,9,9SF</td>
</tr>
</tbody>
</table></table-wrap>
<sec id="S5.SS1.SSS1">
<title>5.1.1. Na&#x00EF;ve Bayes results</title>
<p>Results of the Na&#x00EF;ve Bayes classification, presented in <xref ref-type="table" rid="T22">Table 22</xref>, show that the run with 10 attributes, including three synthetic features (for age, credit, and duration), achieved the best classification accuracy (73.6%). The other statistical measures for this run were also high. Without any synthetic features, classification on this dataset was markedly poorer (accuracy of 46.1%).</p>
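<p>The statistical measures reported in Tables 22&#x2013;25 (TPR, FPR, precision, and F-measure) follow the standard confusion-matrix definitions; in the multi-class tables they are weighted averages over the classes, which is why accuracy and TPR coincide. A minimal sketch of the per-class definitions (the counts below are illustrative, not taken from the paper):</p>

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard per-class metrics for a 2x2 confusion matrix."""
    tpr = tp / (tp + fn)                      # recall / true-positive rate
    fpr = fp / (fp + tn)                      # false-positive rate
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "TPR": tpr, "FPR": fpr,
            "precision": precision, "F-measure": f_measure}

# Illustrative counts only:
m = metrics(tp=40, fp=10, fn=10, tn=40)
```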
</sec>
<sec id="S5.SS1.SSS2">
<title>5.1.2. Random forest results</title>
<p>Results of the Random Forest classification, presented in <xref ref-type="table" rid="T23">Table 23</xref>, show that the run with five attributes and one synthetic feature (for credit) achieved the best classification accuracy (73.2%). The other statistical measures for this run were also high. Without any synthetic features, classification on this dataset was markedly poorer (accuracy of 38.0%).</p>
</sec>
<sec id="S5.SS1.SSS3">
<title>5.1.3. Random tree results</title>
<p>Results of the Random Tree classification, presented in <xref ref-type="table" rid="T24">Table 24</xref>, show that the run with five attributes and one synthetic feature (for credit) achieved the best classification accuracy (72.3%). The other statistical measures for this run were also high. Again, without any synthetic features, classification on this dataset was markedly poorer (accuracy of 37.6%).</p>
</sec>
<sec id="S5.SS1.SSS4">
<title>5.1.4. REPTree results</title>
<p>Results of the REPTree classification, presented in <xref ref-type="table" rid="T25">Table 25</xref>, show that the run with eight attributes and one synthetic feature (for credit) achieved a slightly higher classification accuracy (72.7%) than the other runs. Without any synthetic features, classification on this dataset was markedly poorer (accuracy of 45.7%).</p>
</sec>
<sec id="S5.SS1.SSS5">
<title>5.1.5. Overall classifier comparison for the German credit dataset</title>
<p>From <xref ref-type="table" rid="T22">Tables 22</xref>&#x2013;<xref ref-type="table" rid="T25">25</xref>, it can be seen that even a single synthetic attribute greatly improved the classification accuracy and the other statistical measures.</p>
<p>A comparison of the classification accuracy of all the classifiers on the German Credit dataset (<xref ref-type="table" rid="T26">Table 26</xref>) shows that two of the four classifiers performed best with five attributes and only one synthetic feature. Na&#x00EF;ve Bayes performed best with 10 attributes and three synthetic features, and REPTree performed best with eight attributes and one synthetic feature. The highest classification accuracy was achieved with the Na&#x00EF;ve Bayes classifier, and the largest gain in accuracy was achieved with the Random Forest classifier (a 35.2% improvement, as shown in <xref ref-type="table" rid="T27">Table 27</xref>).</p>
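<p>The &#x201C;accuracy improvement&#x201D; column in Table 27 is the absolute difference between a classifier's best accuracy with synthetic features and its accuracy without them. A short sketch reproducing those figures (accuracies transcribed from Tables 22&#x2013;25):</p>

```python
# Best accuracy with synthetic features (SF) vs. accuracy without, per
# classifier, on the German Credit dataset (percentages from Tables 22-25).
best = {
    "Naive Bayes":   {"with_sf": 73.6, "without_sf": 46.1},
    "Random forest": {"with_sf": 73.2, "without_sf": 38.0},
    "Random tree":   {"with_sf": 72.3, "without_sf": 37.6},
    "REPTree":       {"with_sf": 72.7, "without_sf": 45.7},
}

def improvement(with_sf: float, without_sf: float) -> float:
    """Absolute accuracy gain, rounded to one decimal place."""
    return round(with_sf - without_sf, 1)

gains = {clf: improvement(v["with_sf"], v["without_sf"])
         for clf, v in best.items()}
# Random Forest shows the largest gain (35.2), matching Table 27.
```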
<table-wrap position="float" id="T26">
<label>TABLE 26</label>
<caption><p>Classifier accuracy comparison on the German credit dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center" colspan="4">Classification accuracy<hr/></td>
<td/>
</tr>
<tr>
<td valign="top" align="left">No. of attributes</td>
<td valign="top" align="center">Na&#x00EF;ve Bayes</td>
<td valign="top" align="center">Random forest</td>
<td valign="top" align="center">Random tree</td>
<td valign="top" align="center">REPTree</td>
<td valign="top" align="center">Attributes used</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">8 (4 OR; 4 Bin; 0 SF)</td>
<td valign="top" align="center">46.1%</td>
<td valign="top" align="center">38.0%</td>
<td valign="top" align="center">37.6%</td>
<td valign="top" align="center">45.7%</td>
<td valign="top" align="center">1,2,3,4,5,6,7,9</td>
</tr>
<tr>
<td valign="top" align="left">8 (4 OR; 3 Bin; 1 SF)</td>
<td valign="top" align="center">72.8%</td>
<td valign="top" align="center">71.6%</td>
<td valign="top" align="center">68.7%</td>
<td valign="top" align="center"><bold>72.7%</bold></td>
<td valign="top" align="center">1,2,3SF,4,5,6,7,9</td>
</tr>
<tr>
<td valign="top" align="left">5 (1 OR; 3 Bin; 1 SF)</td>
<td valign="top" align="center">72.6%</td>
<td valign="top" align="center"><bold>73.2%</bold></td>
<td valign="top" align="center"><bold>72.3%</bold></td>
<td valign="top" align="center">72.5%</td>
<td valign="top" align="center">1,3SF,4,6,7</td>
</tr>
<tr>
<td valign="top" align="left">10 (4 OR; 2 Bin; 3 SF)</td>
<td valign="top" align="center"><bold>73.6%</bold></td>
<td valign="top" align="center">71.5%</td>
<td valign="top" align="center">69.9%</td>
<td valign="top" align="center">71.8%</td>
<td valign="top" align="center">1,1SF,2,3SF,4,4SF,5,6,7,9</td>
</tr>
<tr>
<td valign="top" align="left">11 (2 OR; 3 Bin; 6 SF)</td>
<td valign="top" align="center">72.3%</td>
<td valign="top" align="center">71.5%</td>
<td valign="top" align="center">69.2%</td>
<td valign="top" align="center">72.2%</td>
<td valign="top" align="center">1,1SF,2,2SF,3SF,4,4SF,7,7SF,9,9SF</td>
</tr>
<tr>
<td valign="top" align="left">12 (3 OR; 3 Bin; 6 SF)</td>
<td valign="top" align="center">72.5%</td>
<td valign="top" align="center">70.7%</td>
<td valign="top" align="center">68.0%</td>
<td valign="top" align="center">71.8%</td>
<td valign="top" align="center">1,1SF,2,2SF,3SF,4,4SF,5,7,7SF,9,9SF</td>
</tr>
<tr>
<td valign="top" align="left">13 (4 OR; 3 Bin; 6 SF)</td>
<td valign="top" align="center">72.5%</td>
<td valign="top" align="center">71.5%</td>
<td valign="top" align="center">69.6%</td>
<td valign="top" align="center">71.8%</td>
<td valign="top" align="center">1,1SF,2,2SF,3SF,4,4SF,5,6,7,7SF,9,9SF</td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T27">
<label>TABLE 27</label>
<caption><p>German credit dataset &#x2013; improvement in accuracy with synthetic features.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Classifier</td>
<td valign="top" align="center">Synthetic feature creation</td>
<td valign="top" align="center">Accuracy</td>
<td valign="top" align="center">FP rate</td>
<td valign="top" align="center">Accuracy improvement</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Na&#x00EF;ve Bayes</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="center">73.6%</td>
<td valign="top" align="center">32.7%</td>
<td valign="top" align="center">27.5%</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">No</td>
<td valign="top" align="center">46.1%</td>
<td valign="top" align="center">21.1%</td>
<td valign="top" align="center"/></tr>
<tr>
<td valign="top" align="left">Random forest</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="center">73.2%</td>
<td valign="top" align="center">40.0%</td>
<td valign="top" align="center">35.2%</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">No</td>
<td valign="top" align="center">38.0%</td>
<td valign="top" align="center">23.0%</td>
<td valign="top" align="center"/></tr>
<tr>
<td valign="top" align="left">Random tree</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="center">72.3%</td>
<td valign="top" align="center">38.7%</td>
<td valign="top" align="center">34.7%</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">No</td>
<td valign="top" align="center">37.6%</td>
<td valign="top" align="center">24.1%</td>
<td valign="top" align="center"/></tr>
<tr>
<td valign="top" align="left">REPTree</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="center">72.7%</td>
<td valign="top" align="center">40.0%</td>
<td valign="top" align="center">27.0%</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">No</td>
<td valign="top" align="center">45.7%</td>
<td valign="top" align="center">21.8%</td>
<td valign="top" align="center"/></tr>
</tbody>
</table></table-wrap>
</sec>
</sec>
<sec id="S5.SS2">
<title>5.2. Classification results for the credit screening dataset</title>
<p>For the Credit Screening dataset (<xref ref-type="bibr" rid="B10">10</xref>), the attribute approved, a binary attribute, was used as the class variable for classification. <xref ref-type="table" rid="T28">Tables 28</xref>&#x2013;<xref ref-type="table" rid="T31">31</xref> present the statistical results of the classifications. This dataset performed well even without synthetic features. Using three synthetic features (for age, credit score, and income) only slightly improved the classification accuracy, and only for the Random Forest and Random Tree algorithms.</p>
<table-wrap position="float" id="T28">
<label>TABLE 28</label>
<caption><p>Na&#x00EF;ve Bayes classification using the credit screening dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">No. of attributes used</td>
<td valign="top" align="center">Accur.</td>
<td valign="top" align="center">TPR</td>
<td valign="top" align="center">FPR</td>
<td valign="top" align="center">Precision</td>
<td valign="top" align="center">F-measure</td>
<td valign="top" align="center">ROC</td>
<td valign="top" align="center">Attributes used</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">10 (5 OR; 5 BIN)</td>
<td valign="top" align="center">84.6%</td>
<td valign="top" align="center">84.6%</td>
<td valign="top" align="center">16.4%</td>
<td valign="top" align="center">84.7%</td>
<td valign="top" align="center">84.6%</td>
<td valign="top" align="center">91.3%</td>
<td valign="top" align="center">1,2,3,5,6,9,12,13,14,15</td>
</tr>
<tr>
<td valign="top" align="left">12 (7 OR; 5 BIN)</td>
<td valign="top" align="center">84.8%</td>
<td valign="top" align="center">84.8%</td>
<td valign="top" align="center">16.2%</td>
<td valign="top" align="center">84.8%</td>
<td valign="top" align="center">84.7%</td>
<td valign="top" align="center">91.2%</td>
<td valign="top" align="center">1,2,3,5,6,7,9,11,12,13,14,15</td>
</tr>
<tr>
<td valign="top" align="left">13 (5 OR; 5 BIN; 3 SF)</td>
<td valign="top" align="center">81.6%</td>
<td valign="top" align="center">81.6%</td>
<td valign="top" align="center">19.7%</td>
<td valign="top" align="center">81.6%</td>
<td valign="top" align="center">81.5%</td>
<td valign="top" align="center">89.4%</td>
<td valign="top" align="center">1,1SF,2,3,5,5SF,6,9,12,12SF,13,14,15</td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T29">
<label>TABLE 29</label>
<caption><p>Random forest classification using the credit screening dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">No. of attributes used</td>
<td valign="top" align="center">Accur.</td>
<td valign="top" align="center">TPR</td>
<td valign="top" align="center">FPR</td>
<td valign="top" align="center">Precision</td>
<td valign="top" align="center">F-measure</td>
<td valign="top" align="center">ROC</td>
<td valign="top" align="center">Attributes used</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">10 (5 OR; 5 BIN)</td>
<td valign="top" align="center">83.3%</td>
<td valign="top" align="center">83.3%</td>
<td valign="top" align="center">17.0%</td>
<td valign="top" align="center">83.4%</td>
<td valign="top" align="center">83.3%</td>
<td valign="top" align="center">90.9%</td>
<td valign="top" align="center">1,2,3,5,6,9,12,13,14,15</td>
</tr>
<tr>
<td valign="top" align="left">12 (7 OR; 5 BIN)</td>
<td valign="top" align="center">84.5%</td>
<td valign="top" align="center">84.5%</td>
<td valign="top" align="center">15.9%</td>
<td valign="top" align="center">84.5%</td>
<td valign="top" align="center">84.5%</td>
<td valign="top" align="center">91.0%</td>
<td valign="top" align="center">1,2,3,5,6,7,9,11,12,13,14,15</td>
</tr>
<tr>
<td valign="top" align="left">13 (5 OR; 5 BIN; 3 SF)</td>
<td valign="top" align="center">85.4%</td>
<td valign="top" align="center">85.4%</td>
<td valign="top" align="center">15.0%</td>
<td valign="top" align="center">85.4%</td>
<td valign="top" align="center">85.4%</td>
<td valign="top" align="center">91.2%</td>
<td valign="top" align="center">1,1SF,2,3,5,5SF,6,9,12,12SF,13,14,15</td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T30">
<label>TABLE 30</label>
<caption><p>Random tree classification using the credit screening dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">No. of attributes used</td>
<td valign="top" align="center">Accur.</td>
<td valign="top" align="center">TPR</td>
<td valign="top" align="center">FPR</td>
<td valign="top" align="center">Precision</td>
<td valign="top" align="center">F-measure</td>
<td valign="top" align="center">ROC</td>
<td valign="top" align="center">Attributes used</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">10 (5 OR; 5 BIN)</td>
<td valign="top" align="center">82.2%</td>
<td valign="top" align="center">82.2%</td>
<td valign="top" align="center">17.6%</td>
<td valign="top" align="center">82.4%</td>
<td valign="top" align="center">82.2%</td>
<td valign="top" align="center">85.5%</td>
<td valign="top" align="center">1,2,3,5,6,9,12,13,14,15</td>
</tr>
<tr>
<td valign="top" align="left">12 (7 OR; 5 BIN)</td>
<td valign="top" align="center">82.2%</td>
<td valign="top" align="center">82.2%</td>
<td valign="top" align="center">18.2%</td>
<td valign="top" align="center">82.2%</td>
<td valign="top" align="center">82.2%</td>
<td valign="top" align="center">83.5%</td>
<td valign="top" align="center">1,2,3,5,6,7,9,11,12,13,14,15</td>
</tr>
<tr>
<td valign="top" align="left">13 (5 OR; 5 BIN; 3 SF)</td>
<td valign="top" align="center">83.8%</td>
<td valign="top" align="center">83.8%</td>
<td valign="top" align="center">16.1%</td>
<td valign="top" align="center">83.9%</td>
<td valign="top" align="center">83.8%</td>
<td valign="top" align="center">85.6%</td>
<td valign="top" align="center">1,1SF,2,3,5,5SF,6,9,12,12SF, 13,14,15</td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T31">
<label>TABLE 31</label>
<caption><p>REPTree classification using the credit screening dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">No. of attributes used</td>
<td valign="top" align="center">Accur.</td>
<td valign="top" align="center">TPR</td>
<td valign="top" align="center">FPR</td>
<td valign="top" align="center">Precision</td>
<td valign="top" align="center">F-measure</td>
<td valign="top" align="center">ROC</td>
<td valign="top" align="center">Attributes used</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">10 (5 OR; 5 BIN)</td>
<td valign="top" align="center">84.8%</td>
<td valign="top" align="center">84.8%</td>
<td valign="top" align="center">14.7%</td>
<td valign="top" align="center">85.1%</td>
<td valign="top" align="center">84.8%</td>
<td valign="top" align="center">89.4%</td>
<td valign="top" align="center">1,2,3,5,6,9,12,13,14,15</td>
</tr>
<tr>
<td valign="top" align="left">12 (7 OR; 5 BIN)</td>
<td valign="top" align="center">84.3%</td>
<td valign="top" align="center">84.3%</td>
<td valign="top" align="center">15.2%</td>
<td valign="top" align="center">84.7%</td>
<td valign="top" align="center">84.4%</td>
<td valign="top" align="center">89.5%</td>
<td valign="top" align="center">1,2,3,5,6,7,9,11,12,13,14,15</td>
</tr>
<tr>
<td valign="top" align="left">13 (5 OR; 5 BIN; 3 SF)</td>
<td valign="top" align="center">84.6%</td>
<td valign="top" align="center">84.6%</td>
<td valign="top" align="center">15.3%</td>
<td valign="top" align="center">84.8%</td>
<td valign="top" align="center">84.7%</td>
<td valign="top" align="center">88.7%</td>
<td valign="top" align="center">1,1SF,2,3,5,5SF,6,9,12,12SF,13,14,15</td>
</tr>
</tbody>
</table></table-wrap>
<sec id="S5.SS2.SSS1">
<title>5.2.1. Overall classifier comparison for the credit screening dataset</title>
<p>A comparison of the classification accuracy of all the classifiers on the Credit Screening dataset (<xref ref-type="table" rid="T32">Table 32</xref>) shows that Random Forest and Random Tree had their highest accuracy using three synthetic features, but Na&#x00EF;ve Bayes and REPTree did not perform better with synthetic features. An analysis of the accuracy improvement on this dataset, shown in <xref ref-type="table" rid="T33">Table 33</xref>, reveals very little gain from adding synthetic features; in fact, accuracy decreased for Na&#x00EF;ve Bayes and REPTree.</p>
<table-wrap position="float" id="T32">
<label>TABLE 32</label>
<caption><p>Classifier accuracy comparison on credit screening dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center" colspan="4">Accuracy<hr/></td>
<td/>
</tr>
<tr>
<td valign="top" align="left">No. of attributes</td>
<td valign="top" align="center">Na&#x00EF;ve Bayes</td>
<td valign="top" align="center">Random forest</td>
<td valign="top" align="center">Random tree</td>
<td valign="top" align="center">REPTree</td>
<td valign="top" align="center">Attributes used</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">10 (5 OR; 5 BIN)</td>
<td valign="top" align="center">84.6%</td>
<td valign="top" align="center">83.3%</td>
<td valign="top" align="center">82.2%</td>
<td valign="top" align="center"><bold>84.8%</bold></td>
<td valign="top" align="center">1,2,3,5,6,9,12,13,14,15</td>
</tr>
<tr>
<td valign="top" align="left">12 (7 OR; 5 BIN)</td>
<td valign="top" align="center"><bold>84.8%</bold></td>
<td valign="top" align="center">84.5%</td>
<td valign="top" align="center">82.2%</td>
<td valign="top" align="center">84.3%</td>
<td valign="top" align="center">1,2,3,5,6,7,9,11,12,13,14,15</td>
</tr>
<tr>
<td valign="top" align="left">13 (5 OR; 5 BIN; 3 SF)</td>
<td valign="top" align="center">81.6%</td>
<td valign="top" align="center"><bold>85.4%</bold></td>
<td valign="top" align="center"><bold>83.8%</bold></td>
<td valign="top" align="center">84.6%</td>
<td valign="top" align="center">1,1SF,2,3,5,5SF,6,9,12,12SF,13,14,15</td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T33">
<label>TABLE 33</label>
<caption><p>Credit screening dataset &#x2013; improvement in accuracy with synthetic features.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Classifier</td>
<td valign="top" align="center">Synthetic feature creation</td>
<td valign="top" align="center">Accuracy</td>
<td valign="top" align="center">FP rate</td>
<td valign="top" align="center">Accuracy improvement</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Na&#x00EF;ve Bayes</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="center">81.6%</td>
<td valign="top" align="center">16.0%</td>
<td valign="top" align="center">&#x2212;3.2%</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">No</td>
<td valign="top" align="center">84.8%</td>
<td valign="top" align="center">16.2%</td>
<td valign="top" align="center"/></tr>
<tr>
<td valign="top" align="left">Random forest</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="center">85.4%</td>
<td valign="top" align="center">15.1%</td>
<td valign="top" align="center">0.9%</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">No</td>
<td valign="top" align="center">84.5%</td>
<td valign="top" align="center">15.9%</td>
<td valign="top" align="center"/></tr>
<tr>
<td valign="top" align="left">Random tree</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="center">83.8%</td>
<td valign="top" align="center">18.8%</td>
<td valign="top" align="center">1.6%</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">No</td>
<td valign="top" align="center">82.2%</td>
<td valign="top" align="center">18.2%</td>
<td valign="top" align="center"/></tr>
<tr>
<td valign="top" align="left">REPTree</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="center">84.6%</td>
<td valign="top" align="center">15.2%</td>
<td valign="top" align="center">&#x2212;0.2%</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">No</td>
<td valign="top" align="center">84.8%</td>
<td valign="top" align="center">15.2%</td>
<td valign="top" align="center"/></tr>
</tbody>
</table></table-wrap>
</sec>
</sec>
<sec id="S5.SS3">
<title>5.3. Classification results for the credit dataset</title>
<p>For the Credit dataset (<xref ref-type="bibr" rid="B11">11</xref>), credit limit was used as the class variable for classification. <xref ref-type="table" rid="T34">Tables 34</xref>&#x2013;<xref ref-type="table" rid="T37">37</xref> present the statistical results of the classifications.</p>
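<p>The per-run statistics in these tables (accuracy, TPR, FPR, precision, F-measure, and ROC area) can be illustrated with a short sketch. The snippet below is not the authors&#x2019; pipeline: it uses randomly generated stand-in data and scikit-learn&#x2019;s Gaussian Na&#x00EF;ve Bayes in place of the Weka classifiers, purely to show how each reported measure is derived from the confusion matrix and the predicted class probabilities.</p>

```python
# Illustrative only: stand-in data and classifier, not the authors' setup.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (accuracy_score, precision_score, f1_score,
                             roc_auc_score, confusion_matrix)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                                # placeholder attributes
y = (X[:, 2] + 0.5 * rng.normal(size=500) > 0).astype(int)   # placeholder binned class

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = GaussianNB().fit(X_tr, y_tr)
pred = clf.predict(X_te)
prob = clf.predict_proba(X_te)[:, 1]

tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
print(f"Accuracy : {accuracy_score(y_te, pred):.3f}")
print(f"TPR      : {tp / (tp + fn):.3f}")   # true-positive rate (recall)
print(f"FPR      : {fp / (fp + tn):.3f}")   # false-positive rate
print(f"Precision: {precision_score(y_te, pred):.3f}")
print(f"F-measure: {f1_score(y_te, pred):.3f}")
print(f"ROC area : {roc_auc_score(y_te, prob):.3f}")
```

<p>In the article&#x2019;s experiments the class variable (binned credit limit) is multi-valued rather than binary; per-class measures of this kind are then typically summarized as weighted averages, which is how Weka presents them.</p>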
<table-wrap position="float" id="T34">
<label>TABLE 34</label>
<caption><p>Na&#x00EF;ve Bayes classification using the credit dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">No. of attributes used</td>
<td valign="top" align="center">Accur.</td>
<td valign="top" align="center">TPR</td>
<td valign="top" align="center">FPR</td>
<td valign="top" align="center">Precision</td>
<td valign="top" align="center">F-measure</td>
<td valign="top" align="center">ROC</td>
<td valign="top" align="center">Attributes</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">8 (3 OR; 5 BIN)</td>
<td valign="top" align="center">59.0%</td>
<td valign="top" align="center">59.0%</td>
<td valign="top" align="center">18.2%</td>
<td valign="top" align="center">59.2%</td>
<td valign="top" align="center">58.4%</td>
<td valign="top" align="center">81.0%</td>
<td valign="top" align="center">1,2,3,4,5,7,10,12</td>
</tr>
<tr>
<td valign="top" align="left">5 (0 OR; 4 BIN; 1 SF)</td>
<td valign="top" align="center">77.0%</td>
<td valign="top" align="center">77.0%</td>
<td valign="top" align="center">22.8%</td>
<td valign="top" align="center">81.8%</td>
<td valign="top" align="center">78.3%</td>
<td valign="top" align="center">84.5%</td>
<td valign="top" align="center">1,2,3SF,4,5</td>
</tr>
<tr>
<td valign="top" align="left">6 (0 OR; 4 BIN; 2 SF)</td>
<td valign="top" align="center">76.3%</td>
<td valign="top" align="center">76.3%</td>
<td valign="top" align="center">24.4%</td>
<td valign="top" align="center">80.9%</td>
<td valign="top" align="center">77.6%</td>
<td valign="top" align="center">82.9%</td>
<td valign="top" align="center">1,1SF,2,3SF,4,5</td>
</tr>
<tr>
<td valign="top" align="left">7 (2 OR; 4 BIN; 1 SF)</td>
<td valign="top" align="center">77.5%</td>
<td valign="top" align="center">77.5%</td>
<td valign="top" align="center">24.7%</td>
<td valign="top" align="center">81.3%</td>
<td valign="top" align="center">78.6%</td>
<td valign="top" align="center">84.6%</td>
<td valign="top" align="center">1,2,3SF,4,5,7,10</td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T35">
<label>TABLE 35</label>
<caption><p>Random forest classification using the credit dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">No. of attributes used</td>
<td valign="top" align="center">Accur.</td>
<td valign="top" align="center">TPR</td>
<td valign="top" align="center">FPR</td>
<td valign="top" align="center">Precision</td>
<td valign="top" align="center">F-measure</td>
<td valign="top" align="center">ROC</td>
<td valign="top" align="center">Attributes</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">8 (3 OR; 5 BIN)</td>
<td valign="top" align="center">54.0%</td>
<td valign="top" align="center">54.0%</td>
<td valign="top" align="center">20.0%</td>
<td valign="top" align="center">54.8%</td>
<td valign="top" align="center">54.2%</td>
<td valign="top" align="center">79.1%</td>
<td valign="top" align="center">1,2,3,4,5,7,10,12</td>
</tr>
<tr>
<td valign="top" align="left">5 (0 OR; 4 BIN; 1 SF)</td>
<td valign="top" align="center">75.8%</td>
<td valign="top" align="center">75.8%</td>
<td valign="top" align="center">35.8%</td>
<td valign="top" align="center">77.3%</td>
<td valign="top" align="center">76.4%</td>
<td valign="top" align="center">82.8%</td>
<td valign="top" align="center">1,2,3SF,4,5</td>
</tr>
<tr>
<td valign="top" align="left">6 (0 OR; 4 BIN; 2 SF)</td>
<td valign="top" align="center">77.5%</td>
<td valign="top" align="center">77.5%</td>
<td valign="top" align="center">40.1%</td>
<td valign="top" align="center">77.2%</td>
<td valign="top" align="center">77.3%</td>
<td valign="top" align="center">84.6%</td>
<td valign="top" align="center">1,1SF,2,3SF,4,5</td>
</tr>
<tr>
<td valign="top" align="left">7 (2 OR; 4 BIN; 1 SF)</td>
<td valign="top" align="center">76.3%</td>
<td valign="top" align="center">76.3%</td>
<td valign="top" align="center">39.1%</td>
<td valign="top" align="center">76.7%</td>
<td valign="top" align="center">76.4%</td>
<td valign="top" align="center">84.4%</td>
<td valign="top" align="center">1,2,3SF,4,5,7,10</td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T36">
<label>TABLE 36</label>
<caption><p>Random tree classification using the credit dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">No. of attributes used</td>
<td valign="top" align="center">Accur.</td>
<td valign="top" align="center">TPR</td>
<td valign="top" align="center">FPR</td>
<td valign="top" align="center">Precision</td>
<td valign="top" align="center">F-measure</td>
<td valign="top" align="center">ROC</td>
<td valign="top" align="center">Attributes</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">8 (3 OR; 5 BIN)</td>
<td valign="top" align="center">54.3%</td>
<td valign="top" align="center">54.3%</td>
<td valign="top" align="center">19.9%</td>
<td valign="top" align="center">56.8%</td>
<td valign="top" align="center">54.9%</td>
<td valign="top" align="center">73.4%</td>
<td valign="top" align="center">1,2,3,4,5,7,10,12</td>
</tr>
<tr>
<td valign="top" align="left">5 (0 OR; 4 BIN; 1 SF)</td>
<td valign="top" align="center">75.3%</td>
<td valign="top" align="center">75.3%</td>
<td valign="top" align="center">41.6%</td>
<td valign="top" align="center">75.5%</td>
<td valign="top" align="center">75.4%</td>
<td valign="top" align="center">82.5%</td>
<td valign="top" align="center">1,2,3SF,4,5</td>
</tr>
<tr>
<td valign="top" align="left">6 (0 OR; 4 BIN; 2 SF)</td>
<td valign="top" align="center">76.8%</td>
<td valign="top" align="center">76.8%</td>
<td valign="top" align="center">43.9%</td>
<td valign="top" align="center">75.9%</td>
<td valign="top" align="center">76.3%</td>
<td valign="top" align="center">83.4%</td>
<td valign="top" align="center">1,1SF,2,3SF,4,5</td>
</tr>
<tr>
<td valign="top" align="left">7 (2 OR; 4 BIN; 1 SF)</td>
<td valign="top" align="center">75.5%</td>
<td valign="top" align="center">75.5%</td>
<td valign="top" align="center">45.7%</td>
<td valign="top" align="center">74.7%</td>
<td valign="top" align="center">75.0%</td>
<td valign="top" align="center">81.3%</td>
<td valign="top" align="center">1,2,3SF,4,5,7,10</td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T37">
<label>TABLE 37</label>
<caption><p>REPTree classification using the credit dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">No. of attributes used</td>
<td valign="top" align="center">Accur.</td>
<td valign="top" align="center">TPR</td>
<td valign="top" align="center">FPR</td>
<td valign="top" align="center">Precision</td>
<td valign="top" align="center">F-measure</td>
<td valign="top" align="center">ROC</td>
<td valign="top" align="center">Attributes</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">8 (3 OR; 5 BIN)</td>
<td valign="top" align="center">59.5%</td>
<td valign="top" align="center">59.5%</td>
<td valign="top" align="center">16.9%</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">79.8%</td>
<td valign="top" align="center">1,2,3,4,5,7,10,12</td>
</tr>
<tr>
<td valign="top" align="left">5 (0 OR; 4 BIN; 1 SF)</td>
<td valign="top" align="center">79.3%</td>
<td valign="top" align="center">79.3%</td>
<td valign="top" align="center">15.8%</td>
<td valign="top" align="center">85.0%</td>
<td valign="top" align="center">80.5%</td>
<td valign="top" align="center">84.1%</td>
<td valign="top" align="center">1,2,3SF,4,5</td>
</tr>
<tr>
<td valign="top" align="left">6 (0 OR; 4 BIN; 2 SF)</td>
<td valign="top" align="center">78.3%</td>
<td valign="top" align="center">78.3%</td>
<td valign="top" align="center">16.1%</td>
<td valign="top" align="center">84.6%</td>
<td valign="top" align="center">79.6%</td>
<td valign="top" align="center">83.5%</td>
<td valign="top" align="center">1,1SF,2,3SF,4,5</td>
</tr>
<tr>
<td valign="top" align="left">7 (2 OR; 4 BIN; 1 SF)</td>
<td valign="top" align="center">76.3%</td>
<td valign="top" align="center">76.3%</td>
<td valign="top" align="center">31.4%</td>
<td valign="top" align="center">78.7%</td>
<td valign="top" align="center">77.1%</td>
<td valign="top" align="center">83.5%</td>
<td valign="top" align="center">1,2,3SF,4,5,7,10</td>
</tr>
</tbody>
</table></table-wrap>
<sec id="S5.SS3.SSS1">
<title>5.3.1. Na&#x00EF;ve Bayes results</title>
<p>Results of the Na&#x00EF;ve Bayes classification, presented in <xref ref-type="table" rid="T34">Table 34</xref>, show that in terms of classification accuracy, seven attributes with one synthetic feature, credit limit, had the best results, with a classification accuracy of 77.5%. Other statistical measures were also high for this run.</p>
</sec>
<sec id="S5.SS3.SSS2">
<title>5.3.2. Random forest results</title>
<p>Results of the Random Forest classification, presented in <xref ref-type="table" rid="T35">Table 35</xref>, show that six attributes with two synthetic features, age and credit limit, had the best results in terms of classification accuracy (77.5%).</p>
</sec>
<sec id="S5.SS3.SSS3">
<title>5.3.3. Random tree results</title>
<p>Results of the Random Tree classification, presented in <xref ref-type="table" rid="T36">Table 36</xref>, show that six attributes with two synthetic features, age and credit limit, had the best results in terms of classification accuracy (76.8%).</p>
</sec>
<sec id="S5.SS3.SSS4">
<title>5.3.4. REPTree results</title>
<p>Results of the REPTree classification, presented in <xref ref-type="table" rid="T37">Table 37</xref>, show that five attributes with one synthetic feature, credit limit, had the best results in terms of classification accuracy (79.3%).</p>
</sec>
<sec id="S5.SS3.SSS5">
<title>5.3.5. Overall classifier comparison for the credit dataset</title>
<p>For the Credit dataset, all classifiers showed a significant increase in classification accuracy and other statistical measures after the synthetic features were added, as shown in <xref ref-type="table" rid="T38">Table 38</xref>. Comparing the classifiers, REPTree performed the best, at 78.8% classification accuracy. Three of the four classifiers performed best with seven attributes and one synthetic feature; only Na&#x00EF;ve Bayes performed best with six attributes and two synthetic features. Other statistical measures were also higher for this set of runs.</p>
<table-wrap position="float" id="T38">
<label>TABLE 38</label>
<caption><p>Classifier accuracy comparison on credit dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center" colspan="4">Classification accuracy<hr/></td>
<td/>
</tr>
<tr>
<td valign="top" align="left">No. of attributes</td>
<td valign="top" align="center">Na&#x00EF;ve Bayes</td>
<td valign="top" align="center">Random forest</td>
<td valign="top" align="center">Random tree</td>
<td valign="top" align="center">REPTree</td>
<td valign="top" align="center">Attributes used</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">8 (3 OR; 5 BIN)</td>
<td valign="top" align="center">59.0%</td>
<td valign="top" align="center">54.0%</td>
<td valign="top" align="center">54.3%</td>
<td valign="top" align="center">59.5%</td>
<td valign="top" align="center">1,2,3,4,5,7,10,12</td>
</tr>
<tr>
<td valign="top" align="left">7 (2 OR; 4 BIN; 1 SF)</td>
<td valign="top" align="center">75.8%</td>
<td valign="top" align="center"><bold>78.0%</bold></td>
<td valign="top" align="center"><bold>78.5%</bold></td>
<td valign="top" align="center"><bold>78.8%</bold></td>
<td valign="top" align="center">1,2,3SF,4,5,7,10</td>
</tr>
<tr>
<td valign="top" align="left">5 (0 OR; 4 BIN; 1 SF)</td>
<td valign="top" align="center">76.3%</td>
<td valign="top" align="center">77.5%</td>
<td valign="top" align="center">76.8%</td>
<td valign="top" align="center">78.3%</td>
<td valign="top" align="center">1,2,3SF,4,5</td>
</tr>
<tr>
<td valign="top" align="left">6 (0 OR; 4 BIN; 2 SF)</td>
<td valign="top" align="center"><bold>77.5%</bold></td>
<td valign="top" align="center">76.3%</td>
<td valign="top" align="center">75.5%</td>
<td valign="top" align="center">76.3%</td>
<td valign="top" align="center">1,1SF,2,3SF,4,5</td>
</tr>
</tbody>
</table></table-wrap>
<p>From <xref ref-type="table" rid="T39">Table 39</xref>, it can be observed that Random Tree had the highest improvement in accuracy (24.2%), closely followed by Random Forest at 24.0%. The other two classifiers, Na&#x00EF;ve Bayes and REPTree, also improved significantly with the addition of synthetic features.</p>
<table-wrap position="float" id="T39">
<label>TABLE 39</label>
<caption><p>Credit dataset &#x2013; improvement in accuracy with synthetic features.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Classifier</td>
<td valign="top" align="center">Synthetic feature creation</td>
<td valign="top" align="center">Accuracy</td>
<td valign="top" align="center">FP rate</td>
<td valign="top" align="center">Accuracy improvement</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Na&#x00EF;ve Bayes</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="center">77.5%</td>
<td valign="top" align="center">26.1%</td>
<td valign="top" align="center">18.5%</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">No</td>
<td valign="top" align="center">59.0%</td>
<td valign="top" align="center">18.2%</td>
<td valign="top" align="center"/></tr>
<tr>
<td valign="top" align="left">Random forest</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="center">78.0%</td>
<td valign="top" align="center">42.8%</td>
<td valign="top" align="center">24.0%</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">No</td>
<td valign="top" align="center">54.0%</td>
<td valign="top" align="center">20.0%</td>
<td valign="top" align="center"/></tr>
<tr>
<td valign="top" align="left">Random tree</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="center">78.5%</td>
<td valign="top" align="center">44.7%</td>
<td valign="top" align="center">24.2%</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">No</td>
<td valign="top" align="center">54.3%</td>
<td valign="top" align="center">19.9%</td>
<td valign="top" align="center"/></tr>
<tr>
<td valign="top" align="left">REPTree</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="center">78.8%</td>
<td valign="top" align="center">15.8%</td>
<td valign="top" align="center">19.3%</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">No</td>
<td valign="top" align="center">59.5%</td>
<td valign="top" align="center">17.6%</td>
<td valign="top" align="center"/></tr>
</tbody>
</table></table-wrap>
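<p>As a minimal worked example, the &#x201C;accuracy improvement&#x201D; column in <xref ref-type="table" rid="T39">Table 39</xref> is simply the accuracy obtained with synthetic features minus the baseline accuracy without them, per classifier. The values below are copied from Table 39:</p>

```python
# Accuracy improvement = accuracy with synthetic features - accuracy without.
# Percentages are taken from Table 39 (credit dataset).
with_sf = {"Naive Bayes": 77.5, "Random forest": 78.0,
           "Random tree": 78.5, "REPTree": 78.8}
baseline = {"Naive Bayes": 59.0, "Random forest": 54.0,
            "Random tree": 54.3, "REPTree": 59.5}

improvement = {clf: round(with_sf[clf] - baseline[clf], 1) for clf in with_sf}
for clf, delta in improvement.items():
    print(f"{clf}: +{delta} percentage points")
# Random tree has the largest gain: 78.5 - 54.3 = 24.2 percentage points.
```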
</sec>
</sec>
<sec id="S5.SS4">
<title>5.4. Classification results for the bank churners dataset</title>
<p>For the Bank Churners dataset (<xref ref-type="bibr" rid="B12">12</xref>), credit limit was used as the class variable for classification. <xref ref-type="table" rid="T40">Tables 40</xref>&#x2013;<xref ref-type="table" rid="T43">43</xref> present the statistical results of the classifications.</p>
<table-wrap position="float" id="T40">
<label>TABLE 40</label>
<caption><p>Na&#x00EF;ve Bayes classification using the bank churners dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">No. of attributes used</td>
<td valign="top" align="center">Accur.</td>
<td valign="top" align="center">TPR</td>
<td valign="top" align="center">FPR</td>
<td valign="top" align="center">Precision</td>
<td valign="top" align="center">F-measure</td>
<td valign="top" align="center">ROC</td>
<td valign="top" align="center">Attributes</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">8 (4 OR; 4 BIN)</td>
<td valign="top" align="center">56.9%</td>
<td valign="top" align="center">56.9%</td>
<td valign="top" align="center">25.7%</td>
<td valign="top" align="center">52.3%</td>
<td valign="top" align="center">51.1%</td>
<td valign="top" align="center">73.8%</td>
<td valign="top" align="center">1,3,7,9,10,11,12,13</td>
</tr>
<tr>
<td valign="top" align="left">4 (2 OR; 2 BIN)</td>
<td valign="top" align="center">57.2%</td>
<td valign="top" align="center">57.2%</td>
<td valign="top" align="center">26.3%</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">73.4%</td>
<td valign="top" align="center">3,7,10,11</td>
</tr>
<tr>
<td valign="top" align="left">4 (2 OR; 1 BIN; 1 SF)</td>
<td valign="top" align="center">71.1%</td>
<td valign="top" align="center">71.1%</td>
<td valign="top" align="center">30.7%</td>
<td valign="top" align="center">72.8%</td>
<td valign="top" align="center">71.6%</td>
<td valign="top" align="center">75.5%</td>
<td valign="top" align="center">3,7SF,10,11</td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T41">
<label>TABLE 41</label>
<caption><p>Random forest classification using the bank churners dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">No. of attributes used</td>
<td valign="top" align="center">Accur.</td>
<td valign="top" align="center">TPR</td>
<td valign="top" align="center">FPR</td>
<td valign="top" align="center">Precision</td>
<td valign="top" align="center">F-measure</td>
<td valign="top" align="center">ROC</td>
<td valign="top" align="center">Attributes</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">8 (4 OR; 4 BIN)</td>
<td valign="top" align="center">52.0%</td>
<td valign="top" align="center">52.0%</td>
<td valign="top" align="center">26.5%</td>
<td valign="top" align="center">48.3%</td>
<td valign="top" align="center">49.5%</td>
<td valign="top" align="center">68.2%</td>
<td valign="top" align="center">1,3,7,9,10,11,12,13</td>
</tr>
<tr>
<td valign="top" align="left">4 (2 OR; 2 BIN)</td>
<td valign="top" align="center">57.7%</td>
<td valign="top" align="center">57.7%</td>
<td valign="top" align="center">25.5%</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">73.8%</td>
<td valign="top" align="center">3,7,10,11</td>
</tr>
<tr>
<td valign="top" align="left">4 (2 OR; 1 BIN; 1 SF)</td>
<td valign="top" align="center">72.7%</td>
<td valign="top" align="center">72.7%</td>
<td valign="top" align="center">35.2%</td>
<td valign="top" align="center">72.3%</td>
<td valign="top" align="center">72.4%</td>
<td valign="top" align="center">75.9%</td>
<td valign="top" align="center">3,7SF,10,11</td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T42">
<label>TABLE 42</label>
<caption><p>Random tree classification using the bank churners dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">No. of attributes used</td>
<td valign="top" align="center">Accur.</td>
<td valign="top" align="center">TPR</td>
<td valign="top" align="center">FPR</td>
<td valign="top" align="center">Precision</td>
<td valign="top" align="center">F-measure</td>
<td valign="top" align="center">ROC</td>
<td valign="top" align="center">Attributes</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">8 (4 OR; 4 BIN)</td>
<td valign="top" align="center">50.0%</td>
<td valign="top" align="center">50.0%</td>
<td valign="top" align="center">26.3%</td>
<td valign="top" align="center">47.8%</td>
<td valign="top" align="center">48.7%</td>
<td valign="top" align="center">65.3%</td>
<td valign="top" align="center">1,3,7,9,10,11,12,13</td>
</tr>
<tr>
<td valign="top" align="left">4 (2 OR; 2 BIN)</td>
<td valign="top" align="center">57.7%</td>
<td valign="top" align="center">57.7%</td>
<td valign="top" align="center">25.5%</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">73.8%</td>
<td valign="top" align="center">3,7,10,11</td>
</tr>
<tr>
<td valign="top" align="left">4 (2 OR; 1 BIN; 1 SF)</td>
<td valign="top" align="center">72.7%</td>
<td valign="top" align="center">72.7%</td>
<td valign="top" align="center">35.2%</td>
<td valign="top" align="center">72.3%</td>
<td valign="top" align="center">72.4%</td>
<td valign="top" align="center">75.8%</td>
<td valign="top" align="center">3,7SF,10,11</td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T43">
<label>TABLE 43</label>
<caption><p>REPTree classification using the bank churners dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">No. of attributes used</td>
<td valign="top" align="center">Accur.</td>
<td valign="top" align="center">TPR</td>
<td valign="top" align="center">FPR</td>
<td valign="top" align="center">Precision</td>
<td valign="top" align="center">F-measure</td>
<td valign="top" align="center">ROC</td>
<td valign="top" align="center">Attributes</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">8 (4 OR; 4 BIN)</td>
<td valign="top" align="center">56.0%</td>
<td valign="top" align="center">56.0%</td>
<td valign="top" align="center">25.9%</td>
<td valign="top" align="center">51.2%</td>
<td valign="top" align="center">52.2%</td>
<td valign="top" align="center">72.1%</td>
<td valign="top" align="center">1,3,7,9,10,11,12,13</td>
</tr>
<tr>
<td valign="top" align="left">4 (2 OR; 2 BIN)</td>
<td valign="top" align="center">57.8%</td>
<td valign="top" align="center">57.8%</td>
<td valign="top" align="center">25.9%</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">73.5%</td>
<td valign="top" align="center">3,7,10,11</td>
</tr>
<tr>
<td valign="top" align="left">4 (2 OR; 1 BIN; 1 SF)</td>
<td valign="top" align="center">72.5%</td>
<td valign="top" align="center">72.5%</td>
<td valign="top" align="center">36.8%</td>
<td valign="top" align="center">71.9%</td>
<td valign="top" align="center">72.1%</td>
<td valign="top" align="center">75.2%</td>
<td valign="top" align="center">3,7SF,10,11</td>
</tr>
</tbody>
</table></table-wrap>
<p>Results of the Na&#x00EF;ve Bayes classification, presented in <xref ref-type="table" rid="T40">Table 40</xref>, show that, in terms of classification accuracy, four attributes with one synthetic feature, credit limit, had the best results, with a classification accuracy of 71.1%. Results of the Random Forest, Random Tree, and REPTree classifications, presented in <xref ref-type="table" rid="T41">Tables 41</xref>&#x2013;<xref ref-type="table" rid="T43">43</xref>, respectively, also show that four attributes with one synthetic feature, credit limit, had the best results in terms of classification accuracy (72.7, 72.7, and 72.5%, respectively).</p>
<sec id="S5.SS4.SSS1">
<title>5.4.1. Overall classifier comparison for the bank churners dataset</title>
<p>For this set of classifiers, adding one synthetic attribute improved the classification accuracy significantly, and all four classifiers performed best with four attributes, one of them synthetic, as shown in <xref ref-type="table" rid="T44">Table 44</xref>. From <xref ref-type="table" rid="T45">Table 45</xref>, it can be noted that Random Tree had the highest improvement in accuracy, at 22.7%, followed by Random Forest at 20.7%. The other two classifiers also improved significantly in accuracy with just one synthetic feature.</p>
<table-wrap position="float" id="T44">
<label>TABLE 44</label>
<caption><p>Classifier accuracy comparisons on bank churners dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center" colspan="4">Classification accuracy<hr/></td>
<td/>
</tr>
<tr>
<td valign="top" align="left">No. of attributes</td>
<td valign="top" align="center">Na&#x00EF;ve Bayes</td>
<td valign="top" align="center">Random forest</td>
<td valign="top" align="center">Random tree</td>
<td valign="top" align="center">REPTree</td>
<td valign="top" align="center">Attributes used</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">4 (2 OR; 2 BIN)</td>
<td valign="top" align="center">57.2%</td>
<td valign="top" align="center">57.7%</td>
<td valign="top" align="center">57.7%</td>
<td valign="top" align="center">57.8%</td>
<td valign="top" align="center">3,7,10,11</td>
</tr>
<tr>
<td valign="top" align="left">8 (4 OR; 4 BIN)</td>
<td valign="top" align="center">56.9%</td>
<td valign="top" align="center">52.0%</td>
<td valign="top" align="center">50.0%</td>
<td valign="top" align="center">56.0%</td>
<td valign="top" align="center">1,3,7,9,10,11,12,13</td>
</tr>
<tr>
<td valign="top" align="left">4 (2 OR; 1 BIN; 1 SF)</td>
<td valign="top" align="center">71.1%</td>
<td valign="top" align="center">72.7%</td>
<td valign="top" align="center">72.7%</td>
<td valign="top" align="center">72.5%</td>
<td valign="top" align="center">3,7SF,10,11</td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T45">
<label>TABLE 45</label>
<caption><p>Bank churners dataset &#x2013; improvement in accuracy with synthetic features.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Classifier</td>
<td valign="top" align="center">Synthetic feature creation</td>
<td valign="top" align="center">Accuracy</td>
<td valign="top" align="center">FP rate</td>
<td valign="top" align="center">Accuracy improvement</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Na&#x00EF;ve Bayes</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="center">71.2%</td>
<td valign="top" align="center">30.9%</td>
<td valign="top" align="center">14.3%</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">No</td>
<td valign="top" align="center">56.9%</td>
<td valign="top" align="center">25.7%</td>
<td valign="top" align="center"/></tr>
<tr>
<td valign="top" align="left">Random forest</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="center">72.7%</td>
<td valign="top" align="center">35.2%</td>
<td valign="top" align="center">20.7%</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">No</td>
<td valign="top" align="center">52.0%</td>
<td valign="top" align="center">26.5%</td>
<td valign="top" align="center"/></tr>
<tr>
<td valign="top" align="left">Random tree</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="center">72.7%</td>
<td valign="top" align="center">35.2%</td>
<td valign="top" align="center">22.7%</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">No</td>
<td valign="top" align="center">50.0%</td>
<td valign="top" align="center">26.3%</td>
<td valign="top" align="center"/></tr>
<tr>
<td valign="top" align="left">REPTree</td>
<td valign="top" align="center">Yes</td>
<td valign="top" align="center">72.5%</td>
<td valign="top" align="center">36.8%</td>
<td valign="top" align="center">16.5%</td>
</tr>
<tr>
<td valign="top" align="left"/>
<td valign="top" align="center">No</td>
<td valign="top" align="center">56.0%</td>
<td valign="top" align="center">25.9%</td>
<td valign="top" align="center"/></tr>
</tbody>
</table></table-wrap>
</sec>
</sec>
</sec>
<sec id="S6" sec-type="conclusion">
<title>6. Conclusion</title>
<p>Three of the four datasets used in this research showed an improvement in accuracy and other statistical measures with the use of synthetic attributes. Overall, the tree-based classifiers, Random Forest, Random Tree, and REPTree, showed both better performance and larger performance improvements than the non-tree-based classifier, Na&#x00EF;ve Bayes.</p>
</sec>
<sec id="S7" sec-type="author-contributions">
<title>Author contributions</title>
<p>SB conceptualized the article, was responsible for guiding the research, and directed the formulation of the article. JW also helped conceptualize the article, performed most of the pre-processing and processing of the data, and wrote the initial draft of the article. Both authors contributed to the article and approved the submitted version.</p>
</sec>
</body>
<back>
<sec id="S8" sec-type="COI-statement">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1"><label>1.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wongchinsri</surname> <given-names>P</given-names></name> <name><surname>Kuratach</surname> <given-names>W</given-names></name></person-group>. <article-title>A survey - data mining frameworks in credit card processing.</article-title> <source><italic>Proceedings of the 2016 13th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON).</italic></source> <publisher-loc>Chiang Mai</publisher-loc>: <publisher-name>IEEE</publisher-name> (<year>2016</year>). p. <fpage>1</fpage>&#x2013;<lpage>6</lpage>.</citation></ref>
<ref id="B2"><label>2.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zeng</surname> <given-names>G</given-names></name></person-group>. <article-title>A necessary condition for a good binning algorithm in credit scoring.</article-title> <source><italic>Appl Math Sci.</italic></source> (<year>2014</year>) <volume>8</volume>:<fpage>3229</fpage>&#x2013;<lpage>42</lpage>.</citation></ref>
<ref id="B3"><label>3.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Danenas</surname> <given-names>P</given-names></name> <name><surname>Garsva</surname> <given-names>G</given-names></name></person-group>. <article-title>Selection of support vector machines-based classifiers for credit risk domain.</article-title> <source><italic>Expert Syst Appl.</italic></source> (<year>2015</year>) <volume>42</volume>:<fpage>3194</fpage>&#x2013;<lpage>204</lpage>.</citation></ref>
<ref id="B4"><label>4.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lessmann</surname> <given-names>S</given-names></name> <name><surname>Baesens</surname> <given-names>B</given-names></name> <name><surname>Seow</surname> <given-names>HV</given-names></name> <name><surname>Thomas</surname> <given-names>LC</given-names></name></person-group>. <article-title>Benchmarking state-of-the-art classification algorithms for credit scoring: an update of research.</article-title> <source><italic>Eur J Operat Res.</italic></source> (<year>2015</year>) <volume>247</volume>:<fpage>124</fpage>&#x2013;<lpage>36</lpage>.</citation></ref>
<ref id="B5"><label>5.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ala&#x2019;raj</surname> <given-names>M</given-names></name> <name><surname>Abbod</surname> <given-names>MF</given-names></name></person-group>. <article-title>Classifiers consensus system approach for credit scoring.</article-title> <source><italic>Knowl Based Syst.</italic></source> (<year>2016</year>) <volume>104</volume>:<fpage>89</fpage>&#x2013;<lpage>105</lpage>.</citation></ref>
<ref id="B6"><label>6.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Musyoka</surname> <given-names>WM</given-names></name></person-group>. <article-title>Comparison of data mining algorithms in credit card approval.</article-title> <source><italic>Int J Comput Inform Technol.</italic></source> (<year>2018</year>) <volume>7</volume>:<issue>2</issue>.</citation></ref>
<ref id="B7"><label>7.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tanikella</surname> <given-names>U.</given-names></name></person-group> <source><italic>Credit Card Approval Verification Model.</italic></source> <comment>PhD thesis</comment>. <publisher-loc>San Marcos, CA</publisher-loc>: <publisher-name>California State University San Marcos</publisher-name> (<year>2020</year>).</citation></ref>
<ref id="B8"><label>8.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>Y</given-names></name></person-group>. <article-title>Credit card approval predictions using logistic regression, linear SVM and na&#x00EF;ve Bayes classifier.</article-title> <source><italic>2022 International Conference on Machine Learning and Knowledge Engineering (MLKE).</italic></source> <publisher-loc>Guilin</publisher-loc>: <publisher-name>IEEE</publisher-name> (<year>2022</year>). p. <fpage>207</fpage>&#x2013;<lpage>11</lpage>. <pub-id pub-id-type="doi">10.1109/MLKE55170.2022.00047</pub-id></citation></ref>
<ref id="B9"><label>9.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hofmann</surname> <given-names>H.</given-names></name></person-group> <source><italic>German Credit Risk UCI Machine Learning.</italic></source> <publisher-loc>San Francisco, CA</publisher-loc>: <publisher-name>Kaggle</publisher-name> (<year>2016</year>).</citation></ref>
<ref id="B10"><label>10.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Keogh</surname> <given-names>E</given-names></name> <name><surname>Blake</surname> <given-names>C</given-names></name> <name><surname>Merz</surname> <given-names>CJ.</given-names></name></person-group> <source><italic>UCI Repository of Machine Learning Databases.</italic></source> (<year>1998</year>). Available online at: <ext-link ext-link-type="uri" xlink:href="https://archive.ics.uci.edu/ml/datasets/credit+approval">https://archive.ics.uci.edu/ml/datasets/credit+approval</ext-link></citation></ref>
<ref id="B11"><label>11.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Iacob</surname> <given-names>S.</given-names></name></person-group> <source><italic>Predicting Credit Card Balance Using Regression.</italic></source> <publisher-loc>San Francisco, CA</publisher-loc>: <publisher-name>Kaggle</publisher-name> (<year>2020</year>).</citation></ref>
<ref id="B12"><label>12.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Goyal</surname> <given-names>S.</given-names></name></person-group> <source><italic>Credit Card Customers Predict Churning Customers.</italic></source> <publisher-loc>San Francisco, CA</publisher-loc>: <publisher-name>Kaggle</publisher-name> (<year>2021</year>).</citation></ref>
<ref id="B13"><label>13.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ricaldi</surname> <given-names>LC.</given-names></name></person-group> <source><italic>Three Essays on Consumer Credit Card Behavior.</italic></source> <comment>PhD thesis</comment>. <publisher-loc>Lubbock, TX</publisher-loc>: <publisher-name>Texas Tech University</publisher-name> (<year>2015</year>).</citation></ref>
<ref id="B14"><label>14.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rane</surname> <given-names>K.</given-names></name></person-group> <source><italic>Credit Card Approval Analysis.</italic></source> (<year>2018</year>). Available online at: <ext-link ext-link-type="uri" xlink:href="https://nycdatascience.com/blog/student-works/credit-card-approval-analysis/">https://nycdatascience.com/blog/student-works/credit-card-approval-analysis/</ext-link></citation></ref>
<ref id="B15"><label>15.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Belgiu</surname> <given-names>M</given-names></name> <name><surname>Dr&#x0103;gu&#x0163;</surname> <given-names>L</given-names></name></person-group>. <article-title>Random forest in remote sensing: a review of applications and future directions.</article-title> <source><italic>ISPRS J Photogramm Remote Sens.</italic></source> (<year>2016</year>) <volume>114</volume>:<fpage>24</fpage>&#x2013;<lpage>31</lpage>. <pub-id pub-id-type="doi">10.1016/j.isprsjprs.2016.01.011</pub-id></citation></ref>
<ref id="B16"><label>16.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kulkarni</surname> <given-names>VY</given-names></name> <name><surname>Sinha</surname> <given-names>PK</given-names></name></person-group>. <article-title>Pruning of random forest classifiers: a survey and future directions.</article-title> <source><italic>2012 International Conference on Data Science &#x0026; Engineering (ICDSE).</italic></source> <publisher-loc>Cochin</publisher-loc>: <publisher-name>IEEE</publisher-name> (<year>2012</year>). p. <fpage>64</fpage>&#x2013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1109/ICDSE.2012.6282329</pub-id></citation></ref>
<ref id="B17"><label>17.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mishra</surname> <given-names>AK</given-names></name> <name><surname>Ratha</surname> <given-names>BK</given-names></name></person-group>. <article-title>Study of random tree and random forest data mining algorithms for microarray data analysis.</article-title> <source><italic>Int J Adv Electr Comput Eng.</italic></source> (<year>2016</year>) <volume>3</volume>.</citation></ref>
<ref id="B18"><label>18.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kalmegh</surname> <given-names>SR</given-names></name></person-group>. <article-title>Analysis of WEKA data mining algorithm reptree, simple cart and randomtree for classification of Indian news.</article-title> <source><italic>Int J Innov Sci Eng Technol.</italic></source> (<year>2015</year>) <volume>2</volume>.</citation></ref>
<ref id="B19"><label>19.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rokach</surname> <given-names>L</given-names></name> <name><surname>Maimon</surname> <given-names>O.</given-names></name></person-group> <source><italic>Data mining with decision trees - theory and applications.</italic></source> <edition>2nd ed</edition>. <publisher-loc>Singapore</publisher-loc>: <publisher-name>World Scientific Publishing</publisher-name> (<year>2015</year>).</citation></ref>
<ref id="B20"><label>20.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Saritas</surname> <given-names>MM</given-names></name> <name><surname>Ya&#x015F;ar</surname> <given-names>AB</given-names></name></person-group>. <article-title>Performance analysis of ANN and Naive Bayes classification algorithm for data classification.</article-title> <source><italic>Int J Intell Syst Appl Eng.</italic></source> (<year>2019</year>) <volume>7</volume>:<fpage>88</fpage>&#x2013;<lpage>91</lpage>.</citation></ref>
<ref id="B21"><label>21.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bradley</surname> <given-names>AP</given-names></name></person-group>. <article-title>The use of the area under the ROC curve in the evaluation of machine learning algorithms.</article-title> <source><italic>Patt Recogn.</italic></source> (<year>1997</year>) <volume>30</volume>:<fpage>1145</fpage>&#x2013;<lpage>59</lpage>.</citation></ref>
<ref id="B22"><label>22.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Breiman</surname> <given-names>L</given-names></name></person-group>. <article-title>Random forests.</article-title> <source><italic>Mach Learn.</italic></source> (<year>2001</year>) <volume>45</volume>:<fpage>5</fpage>&#x2013;<lpage>32</lpage>. <pub-id pub-id-type="doi">10.1023/A:1010933404324</pub-id></citation></ref>
<ref id="B23"><label>23.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Frank</surname> <given-names>E</given-names></name> <name><surname>Trigg</surname> <given-names>LE</given-names></name> <name><surname>Holmes</surname> <given-names>G</given-names></name> <name><surname>Witten</surname> <given-names>IH</given-names></name></person-group>. <article-title>Naive Bayes for regression (technical note).</article-title> <source><italic>Mach Learn.</italic></source> (<year>2000</year>) <volume>41</volume>:<fpage>5</fpage>&#x2013;<lpage>25</lpage>. <pub-id pub-id-type="doi">10.1023/A:1007670802811</pub-id></citation></ref>
<ref id="B24"><label>24.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Han</surname> <given-names>J</given-names></name> <name><surname>Kamber</surname> <given-names>M</given-names></name> <name><surname>Pei</surname> <given-names>J.</given-names></name></person-group> <source><italic>Data Mining: Concepts and Techniques.</italic></source> <edition>3rd ed</edition>. <publisher-loc>Burlington, MA</publisher-loc>: <publisher-name>Elsevier/Morgan Kaufmann</publisher-name> (<year>2012</year>).</citation></ref>
<ref id="B25"><label>25.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Seker</surname> <given-names>S.</given-names></name></person-group> <source><italic>AutoML: End-to-end Introduction From Optiwisdom.</italic></source> (<year>2019</year>). Available online at: <ext-link ext-link-type="uri" xlink:href="https://towardsdatascience.com/automl-end-to-end-introduction-from-optiwisdom-c17fe03a017f">https://towardsdatascience.com/automl-end-to-end-introduction-from-optiwisdom-c17fe03a017f</ext-link></citation></ref>
<ref id="B26"><label>26.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Taub</surname> <given-names>J</given-names></name> <name><surname>Elliot</surname> <given-names>M</given-names></name></person-group>. <article-title>The synthetic data challenge.</article-title> <source><italic>Conference of European Statisticians Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality.</italic></source> <publisher-loc>The Hague</publisher-loc>: <publisher-name>UNECE</publisher-name> (<year>2019</year>).</citation></ref>
<ref id="B27"><label>27.</label><citation citation-type="journal"><collab>WEKA.</collab> <source><italic>Weka 3: Machine Learning Software in Java, Weka 3 - Data Mining With Open Source Machine Learning Software in Java.</italic></source> (<year>2022</year>). Available online at: <ext-link ext-link-type="uri" xlink:href="https://www.cs.waikato.ac.nz/ml/weka/">https://www.cs.waikato.ac.nz/ml/weka/</ext-link> (<comment>accessed on August 22, 2022</comment>).</citation></ref>
<ref id="B28"><label>28.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>H</given-names></name> <name><surname>Liu</surname> <given-names>H</given-names></name> <name><surname>Fu</surname> <given-names>Y</given-names></name></person-group>. <article-title>Incomplete multi-modal visual data grouping.</article-title> <source><italic>Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI).</italic></source> <publisher-loc>Palo Alto, CA</publisher-loc>: <publisher-name>AAAI Press</publisher-name> (<year>2016</year>). p. <fpage>2392</fpage>&#x2013;<lpage>8</lpage>.</citation></ref>
</ref-list>
</back>
</article>
