<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v2.3 20070202//EN" "archivearticle.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="methods-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Bohr. Iam.</journal-id>
<journal-title>BOHR International Journal of Internet of things, Artificial Intelligence and Machine Learning</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Bohr. Iam.</abbrev-journal-title>
<issn pub-type="epub">2583-5521</issn>
<publisher>
<publisher-name>BOHR</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.54646/bijiam.2023.12</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Methods</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Discrete clusters formulation through the exploitation of optimized k-modes algorithm for hypotheses validation in social work research: the case of Greek social workers working with refugees</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Lazanas</surname> <given-names>Alexis</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Siachos</surname> <given-names>Ilias</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Teloni</surname> <given-names>Dimitra-Dora</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Dedotsi</surname> <given-names>Sofia</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Telonis</surname> <given-names>Aristeidis G.</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Mechanical Engineering and Aeronautics, University of Patras</institution>, <addr-line>Rion-Patras</addr-line>, <country>Greece</country></aff>
<aff id="aff2"><sup>2</sup><institution>Department of Social Work, University of West Attica</institution>, <addr-line>Athens</addr-line>, <country>Greece</country></aff>
<aff id="aff3"><sup>3</sup><institution>Department of Human Genetics, Miller School of Medicine, University of Miami</institution>, <addr-line>Coral Gables, FL</addr-line>, <country>United States</country></aff>
<author-notes>
<corresp id="c001">&#x002A;Correspondence: Alexis Lazanas, <email>alexlas@upatras.gr</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>18</day>
<month>02</month>
<year>2023</year>
</pub-date>
<volume>2</volume>
<issue>1</issue>
<fpage>11</fpage>
<lpage>18</lpage>
<history>
<date date-type="received">
<day>21</day>
<month>01</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>07</day>
<month>02</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2023 Lazanas, Siachos, Teloni, Dedotsi and Telonis.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Lazanas, Siachos, Teloni, Dedotsi and Telonis</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by-nc-nd/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>This article focuses on the results of self-funded quantitative research on social workers working in the &#x201C;refugee&#x201D; crisis and social services in Greece (<xref ref-type="bibr" rid="B1">1</xref>). The research, among other findings, argues that front-line professionals share specific characteristics in their working profile. The statistical methods in that research relied on significance tests to validate the initial hypotheses concerning the correlation between dataset variables. In contrast to this approach, in this work we present an alternative way of validating initial hypotheses through the exploitation of clustering algorithms. Toward that goal, we evaluated several frequently used clustering algorithms with respect to their efficiency in feature selection processes, and we finally propose a modified k-Modes algorithm for efficient feature subset selection.</p>
</abstract>
<kwd-group>
<kwd>clustering</kwd>
<kwd>k-Modes algorithm</kwd>
<kwd>social work</kwd>
<kwd>machine learning</kwd>
<kwd>refugee crisis</kwd>
</kwd-group>
<counts>
<fig-count count="8"/>
<table-count count="1"/>
<equation-count count="10"/>
<ref-count count="51"/>
<page-count count="8"/>
<word-count count="5531"/>
</counts>
</article-meta>
</front>
<body>
<sec id="S1" sec-type="intro">
<title>1. Introduction</title>
<p>Social workers have been on the &#x201C;front lines&#x201D; (<xref ref-type="bibr" rid="B2">2</xref>) of the so-called refugee &#x201C;crisis,&#x201D; facing a series of difficulties in effectively helping their users. Although Greece is one of the &#x201C;entrance&#x201D; countries in Europe, there has been no current research in social work practice with refugees. The study was a self-funded, quantitative research project whose main research questions concerned, among others, the exploration of front-line professionals&#x2019; profiles. Detailed information about the research concerning: (i) aims and hypotheses, (ii) sampling strategies and research ethics, (iii) statistical methods and analysis, and (iv) the research&#x2019;s results can be retrieved from (<xref ref-type="bibr" rid="B1">1</xref>, <xref ref-type="bibr" rid="B3">3</xref>).</p>
<p>In this article, we present a model-based approach that provides an alternative validation to the globally acknowledged hypothesis-testing method. More specifically, hypothesis testing is commonly performed through the application of either the chi-square (<xref ref-type="bibr" rid="B4">4</xref>) or the hypergeometric test (<xref ref-type="bibr" rid="B5">5</xref>) in order to determine statistical significance. This procedure is typical for discovering &#x201C;correlation&#x201D; between independent variables in datasets with categorical values. Augmenting this typical approach, we propose a new, formally structured model that uses clustering techniques to formulate data clusters capable of validating the initially composed statistical hypotheses.</p>
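To make the conventional procedure concrete, a chi-square test of independence on a small contingency table can be sketched as follows. The counts and category labels are invented for illustration and are not the survey data; 5.991 is the standard 0.05 critical value for two degrees of freedom.

```python
# Illustrative chi-square test of independence between two categorical
# variables (e.g., age group vs. experience bracket). The counts are
# invented for demonstration and are not the survey data.
def chi_square_statistic(table):
    """Pearson chi-square statistic for a 2-D contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (observed - expected) ** 2 / expected
    return chi2

table = [[40, 25, 6],   # ages 22-30, by experience bracket
         [18, 22, 12]]  # ages 31-39, by experience bracket
chi2 = chi_square_statistic(table)
dof = (len(table) - 1) * (len(table[0]) - 1)
# The 0.05 critical value for 2 degrees of freedom is about 5.991.
print(f"chi2 = {chi2:.3f}, dof = {dof}, significant = {chi2 > 5.991}")
```

A small statistic would mean the observed counts are close to those expected under independence, so no association would be declared.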
<p>To accomplish the above goal, we adopted a rather &#x201C;typical&#x201D; technique for selecting features from our model&#x2019;s data. A candidate feature subset is created each time in order to categorize data points into intuitively similar but not predefined user groups. The clustering phase is then implemented through the evaluation of the following clustering algorithms: (1) a modified k-Modes algorithm (<xref ref-type="bibr" rid="B6">6</xref>), (2) agglomerative clustering (<xref ref-type="bibr" rid="B7">7</xref>), and (3) a standard k-Modes algorithm (<xref ref-type="bibr" rid="B8">8</xref>), in order to adopt the one most efficient for feature selection.</p>
<p>The remainder of the article is organized as follows: Section &#x201C;2 Related work&#x201D; discusses related work in the context of the literature; Section &#x201C;3 Social workers&#x2019; profile: Validating our initial hypothesis&#x201D; refers to the statistical hypothesis testing prior to our approach, which is described in Section &#x201C;4 Our approach&#x201D; by demonstrating its applicability through analytical steps; finally, concluding remarks and future work directions are outlined in Section &#x201C;5 Conclusion.&#x201D;</p>
</sec>
<sec id="S2">
<title>2. Related work</title>
<p>Clustering and hypothesis testing are two powerful techniques used in data analysis (<xref ref-type="bibr" rid="B9">9</xref>). Clustering is the process of grouping similar objects together based on their characteristics, while hypothesis testing is a statistical technique used to test the validity of a hypothesis. In recent years, researchers have combined these two techniques to gain deeper insights into their data. In this review, we will discuss the combination of clustering and hypothesis testing and its applications in different fields (<xref ref-type="bibr" rid="B10">10</xref>&#x2013;<xref ref-type="bibr" rid="B13">13</xref>).</p>
<p>One of the primary applications of clustering and hypothesis testing is in biology, where interesting reviews can be found in Jacques and Preda (<xref ref-type="bibr" rid="B14">14</xref>); Wang et al. (<xref ref-type="bibr" rid="B15">15</xref>, <xref ref-type="bibr" rid="B16">16</xref>, <xref ref-type="bibr" rid="B17">17</xref>), as well as in medical imaging data (<xref ref-type="bibr" rid="B18">18</xref>, <xref ref-type="bibr" rid="B19">19</xref>). Researchers in this field use these techniques to identify groups of genes that are related to a particular disease (<xref ref-type="bibr" rid="B20">20</xref>) and chemometrics (<xref ref-type="bibr" rid="B21">21</xref>, <xref ref-type="bibr" rid="B22">22</xref>). Clustering is used to group genes that have similar characteristics (<xref ref-type="bibr" rid="B23">23</xref>), such as gene expression levels (<xref ref-type="bibr" rid="B24">24</xref>), while hypothesis testing is used to test whether the genes in each cluster are associated with the disease (<xref ref-type="bibr" rid="B25">25</xref>). By combining these techniques (<xref ref-type="bibr" rid="B26">26</xref>), researchers can identify clusters of genes that are statistically significant and associated with the disease (<xref ref-type="bibr" rid="B27">27</xref>).</p>
<p>Another area where the combination of clustering and hypothesis testing is used is in finance. Researchers in this field use clustering to group stocks that have similar characteristics (<xref ref-type="bibr" rid="B28">28</xref>, <xref ref-type="bibr" rid="B29">29</xref>), such as market capitalization, price-to-earnings ratio, and dividend yield. Hypothesis testing is used to test whether the stocks in each cluster have significantly different returns. This can help investors identify stocks that are undervalued or overvalued, and make more informed investment decisions.</p>
<p>In the field of marketing, clustering methods (<xref ref-type="bibr" rid="B30">30</xref>, <xref ref-type="bibr" rid="B31">31</xref>) and hypothesis testing are used to segment customers into different groups based on their characteristics (<xref ref-type="bibr" rid="B32">32</xref>) and test whether these groups have different purchasing behaviors. For example, a company may use clustering to group customers based on their age, income, and purchasing history (<xref ref-type="bibr" rid="B10">10</xref>). Hypothesis testing can then be used to test whether these groups have different purchasing behaviors, such as buying more frequently or spending more money. This can help companies develop more targeted marketing strategies and improve their overall sales.</p>
<p>In the field of image processing, clustering and hypothesis testing are used to segment images into different regions based on their characteristics and test whether these regions have different properties. For example, a researcher may use clustering to segment an image into regions based on color or texture (<xref ref-type="bibr" rid="B33">33</xref>, <xref ref-type="bibr" rid="B34">34</xref>). Hypothesis testing can then be used to test whether these regions have different properties, such as brightness or contrast. This can help researchers better understand the properties of the image and develop more advanced image processing algorithms.</p>
<p>One of the primary advantages of the combination of clustering and hypothesis testing is that it allows researchers to identify statistically significant groups of data that may not be apparent using either technique alone. Clustering can help identify groups of data that are similar (<xref ref-type="bibr" rid="B35">35</xref>), while hypothesis testing can help determine whether these groups are statistically significant. By combining these techniques, researchers can gain a deeper understanding of their data and develop more accurate models (<xref ref-type="bibr" rid="B36">36</xref>).</p>
<p>In conclusion, the combination of clustering (<xref ref-type="bibr" rid="B37">37</xref>) and hypothesis testing is a powerful technique that has numerous applications in different fields (<xref ref-type="bibr" rid="B14">14</xref>, <xref ref-type="bibr" rid="B38">38</xref>, <xref ref-type="bibr" rid="B39">39</xref>). By using clustering to group similar data and hypothesis testing to test whether these groups are statistically significant, researchers can gain deeper insights into their data and develop more accurate models (<xref ref-type="bibr" rid="B40">40</xref>, <xref ref-type="bibr" rid="B41">41</xref>). This technique has been used successfully in biology, finance, marketing, and image processing, among other fields (<xref ref-type="bibr" rid="B42">42</xref>, <xref ref-type="bibr" rid="B43">43</xref>), and is likely to continue to be an important tool in data analysis in the future.</p>
</sec>
<sec id="S3">
<title>3. Social workers&#x2019; profile: validating our initial hypothesis</title>
<p>A self-completed, anonymous electronic questionnaire, designed by the researchers and containing 52 questions (<xref ref-type="bibr" rid="B1">1</xref>, <xref ref-type="bibr" rid="B3">3</xref>), was available online from June until September 2018. Out of a total of 158 responses, 21 were incomplete and were thus excluded. We analyzed the 137 complete responses; statistical significance was evaluated in R version 3.4.3, and the results&#x2019; graphs were created using Excel 365 Pro Plus. We employed the hypergeometric or chi-squared test, and the statistical threshold was set at a <italic>P-</italic>value of 0.05. In this section, the findings of our research concerning the profile of social workers in the field of social services in Greece are presented in order to gradually construct our main hypothesis. To briefly describe the above concept, we note that the discussion in our previous work (<xref ref-type="bibr" rid="B1">1</xref>) &#x201C;provided important insights on the challenges and difficulties that social work professionals face in helping effectively their users.&#x201D; One of the main findings of this research was that social workers working in the refugee &#x201C;crisis&#x201D; are young graduates with limited work experience. We define our main hypothesis (H<sub>0</sub>) as follows:</p>
<p><italic>H<sub>0</sub> = &#x201C;social workers are young graduates with limited work experience.&#x201D;</italic></p>
<p>As shown in <xref ref-type="fig" rid="F1">Figure 1</xref>, regarding the social workers&#x2019; profile, the vast majority (80%) are women and 18.3% are men. In total, 52% are between 22 and 30 years old, while 39% are between 31 and 39 years old. It is evident that a large share of the professionals working on the &#x201C;front lines&#x201D; with refugees are under middle age, and consequently, the first part of H<sub>0</sub> is considered validated.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p>Respondents&#x2019; age and gender.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijiam-2023-12-g001.tif"/>
</fig>
<p>Apart from the fact that the profession of social work is considered &#x201C;female dominated,&#x201D; as presented in <xref ref-type="fig" rid="F1">Figure 1</xref>, another critical point arising from the research concerns social workers&#x2019; overall work experience. More specifically, as presented in <xref ref-type="fig" rid="F2">Figure 2</xref>, a striking proportion (almost 90%) of the professionals working on the &#x201C;front line&#x201D; have limited work experience, in the interval between 0 and 4 years, at a statistically significant level (<italic>P</italic>-value &#x003C; 0.05; hypergeometric test).</p>
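For reference, the one-sided hypergeometric test used for such significance checks can be sketched with the standard library alone. The population, sample, and success counts below are invented for illustration and are not the study's figures.

```python
# Sketch of a one-sided hypergeometric test: the probability of seeing
# at least k "successes" in a draw of N items from a population of M
# that contains n successes. All numbers below are illustrative.
from math import comb

def hypergeom_tail(k, M, n, N):
    """P(X >= k) for the hypergeometric distribution."""
    return sum(
        comb(n, i) * comb(M - n, N - i) / comb(M, N)
        for i in range(k, min(n, N) + 1)
    )

# e.g., 8 of 20 sampled respondents show a trait carried by only 10 of
# a 50-person population: is the sample enriched for that trait?
p = hypergeom_tail(k=8, M=50, n=10, N=20)
print(f"P-value = {p:.4f}, significant = {p < 0.05}")
```

A P-value below the 0.05 threshold indicates that the observed over-representation is unlikely under random sampling.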
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p>Social workers&#x2019; experience in current working position.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijiam-2023-12-g002.tif"/>
</fig>
<p>Moreover, the data analysis allowed us to capture the distribution of social workers&#x2019; past (total) experience and compare it with their experience at the time of the research (hereafter, current experience). As depicted in <xref ref-type="fig" rid="F3">Figure 3</xref>, responses concerning total and current experience were relatively similar at a statistically significant level (<italic>P</italic>-value &#x003C; 0.05; hypergeometric test). To further describe this issue, we note that the polynomial trendlines and R<sup>2</sup> values for total and current experience follow an identical prediction pattern. The &#x201C;current experience&#x201D; trendline seems to gain value after 3&#x2013;4 years of experience. If we set &#x201C;3&#x2013;4 years&#x201D; as the milestone where the trendlines&#x2019; curves alternate, then we can substantially argue that social workers tend to remain in the same working position for less than 4&#x2013;5 years. This observation can be further explained but falls outside the scope of this article.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption><p>Comparing social workers&#x2019; total with current experience.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijiam-2023-12-g003.tif"/>
</fig>
<p>To summarize the aforementioned findings, our initial hypothesis (H<sub>0</sub>) has been validated through the comparison of statistically significant observations, as thoroughly described within this section. Having validated H<sub>0</sub>, in Section &#x201C;4 Our approach&#x201D; we present our alternative approach to testing H<sub>0</sub> through a modified k-Modes algorithm.</p>
</sec>
<sec id="S4">
<title>4. Our approach</title>
<p>As stated in Section &#x201C;3 Social workers&#x2019; profile: Validating our initial hypothesis,&#x201D; our main hypothesis is that social workers working in the front line of refugee and immigration services are, in the majority, young graduates with limited work experience (<xref ref-type="bibr" rid="B1">1</xref>). In order to validate this hypothesis, in this section we propose a clustering algorithm that categorizes data points into intuitively similar&#x2014;but not predefined&#x2014;user groups. This process is particularly useful for summarizing the data and understanding the basic features that differentiate one user group (<italic>C</italic><sub><italic>j</italic></sub>) from another (<xref ref-type="bibr" rid="B44">44</xref>). In other words, our basic goal is to distinguish a set of users who possess the main characteristics of the attributes mentioned in Section &#x201C;3 Social workers&#x2019; profile: Validating our initial hypothesis.&#x201D;</p>
<sec id="S4.SS1">
<title>4.1. Selecting input features for the clustering algorithm</title>
<p>Prior to clustering, our first task is to determine the input features. This step is important because many research questions (features) present highly skewed answer distributions (e.g., 90% of the social workers who participated in the survey hold a master&#x2019;s degree). These features can create a common problem known as overfitting (<xref ref-type="bibr" rid="B45">45</xref>) and can misleadingly be considered the main features defining the differences in users&#x2019; categorization. Plenty of methods can be utilized to define the optimal subset of features that have to be excluded from the clustering algorithm. Additionally, one of our major concerns during the feature selection process was the lack of an obvious way to evaluate its efficiency without specific domain knowledge capable of guiding the process.</p>
<p>An indirect method of performing feature selection is to implement several scenarios with subsets of the available features and validate the efficiency of the clustering process for each scenario. This method may delay the extraction of the desired features but guarantees high efficiency and reliability. Toward this, we implemented an exhaustive search algorithm that creates all possible subsets, and we evaluate the clustering&#x2019;s efficiency for each subset.</p>
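The subset-generation loop described above can be sketched as follows. The feature names are placeholders, and the scoring function stands in for running the clustering and computing its evaluation score on each subset; neither is taken from the study itself.

```python
# Sketch of the subset search described above. Feature names are
# placeholders; score_subset stands in for the clustering + evaluation
# step performed for each scenario.
from itertools import combinations

features = ["age", "gender", "degree", "experience"]

def score_subset(subset):
    return len(subset)  # dummy score; the real one is the intra/inter ratio

candidates = [subset
              for size in range(1, len(features) + 1)
              for subset in combinations(features, size)]
best = min(candidates, key=score_subset)
print(f"{len(candidates)} subsets evaluated; best: {best}")
```

With four features there are 2<sup>4</sup> &#x2212; 1 = 15 non-empty subsets, which is why this strategy delays the extraction but leaves no candidate unexamined.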
<p>As shown in <xref ref-type="fig" rid="F4">Figure 4</xref>, every time a subset is generated, it undergoes the clustering process and then gets evaluated. We define the subset with the minimum evaluation score as the optimal result of the &#x201C;feature selection&#x201D; phase. For calculating the evaluation score, a variety of metrics have been proposed in the literature, with the &#x201C;intracluster to intercluster distance ratio&#x201D; (<xref ref-type="bibr" rid="B46">46</xref>) suggested as the most reliable. The idea behind this metric is that the members of the same cluster should be &#x201C;closer&#x201D; to each other than to the members of other clusters. Consequently, the following metric is adopted, as defined in Aggarwal (<xref ref-type="bibr" rid="B44">44</xref>)</p>
<disp-formula id="S4.E1">
<label>(1)</label>
<mml:math id="M1">
<mml:mrow>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mpadded width="+3.3pt">
<mml:mi>a</mml:mi>
</mml:mpadded>
</mml:mrow>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mrow>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>P</mml:mi>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:munder>
<mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>I</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>I</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>P</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>I</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>I</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="S4.E2">
<label>(2)</label>
<mml:math id="M2">
<mml:mrow>
<mml:mrow>
<mml:mi>I</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mpadded width="+3.3pt">
<mml:mi>r</mml:mi>
</mml:mpadded>
</mml:mrow>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mrow>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mi>Q</mml:mi>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:munder>
<mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>I</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>I</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>Q</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>I</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>I</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="S4.E3">
<label>(3)</label>
<mml:math id="M3">
<mml:mrow>
<mml:mi>dist</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>I</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>I</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:mi>Sim</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>I</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>I</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption><p>Selecting the set of features for the clustering algorithm.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijiam-2023-12-g004.tif"/>
</fig>
<p>where <italic>Sim</italic>(<italic>I</italic><sub><italic>i</italic></sub>,<italic>I</italic><sub><italic>j</italic></sub>) represents the overall similarity between two users. A well-performing clustering algorithm should produce results with a relatively small intra/inter ratio. Having calculated the ratio for every possible feature subset, we select the subset with the minimum value.</p>
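A minimal sketch of this evaluation for categorical records follows. Simple attribute matching is assumed as the instantiation of Sim (an assumption, not the study's stated choice), and the records themselves are invented.

```python
# Sketch of the intra/inter distance ratio for a clustering of
# categorical records. Sim is assumed to be the fraction of matching
# attributes; the records are invented for illustration.
from itertools import combinations

def sim(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def dist(a, b):                 # dist = 1 / (1 + Sim)
    return 1.0 / (1.0 + sim(a, b))

def intra_inter_ratio(clusters):
    same, cross = [], []        # same-cluster pairs P, cross-cluster pairs Q
    for c in clusters:
        same += [dist(a, b) for a, b in combinations(c, 2)]
    for c1, c2 in combinations(clusters, 2):
        cross += [dist(a, b) for a in c1 for b in c2]
    intra = sum(same) / len(same)    # mean within-cluster distance
    inter = sum(cross) / len(cross)  # mean between-cluster distance
    return intra / inter

clusters = [
    [("young", "bsc"), ("young", "bsc"), ("young", "msc")],
    [("older", "msc"), ("older", "phd")],
]
print(f"intra/inter ratio = {intra_inter_ratio(clusters):.3f}")
```

A ratio well below 1 means cluster members are closer to each other than to members of other clusters, which is the behavior the selection loop rewards.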
</sec>
<sec id="S4.SS2">
<title>4.2. Clustering with modified k-modes algorithm</title>
<p>Alternative clustering algorithms may suit different data. In our case, we implement a modified k-Modes method (<xref ref-type="bibr" rid="B6">6</xref>, <xref ref-type="bibr" rid="B47">47</xref>), taking into consideration that most of our data contain categorical values. The k-Modes algorithm consists of the following steps:</p>
<list list-type="simple">
<list-item>
<label>(1)</label>
<p>First, the number of clusters <italic>N</italic><sub><italic>c</italic></sub> to be created is chosen, and one representative for each group is randomly picked from the initial database instances. The set of the representatives for all <italic>N</italic><sub><italic>c</italic></sub> clusters will be referred to as <inline-formula><mml:math id="M5"><mml:mrow><mml:msub><mml:mi>T</mml:mi><mml:mi>R</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>.</p>
</list-item>
<list-item>
<label>(2)</label>
<p>In the second step, the similarity of each user to every representative <italic>R</italic><sub><italic>j</italic></sub> is calculated and inserted into a matrix <inline-formula><mml:math id="M6"><mml:mrow><mml:mi>S</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>D</mml:mi></mml:msub><mml:mtext>&#x2009;</mml:mtext><mml:mo>&#x00D7;</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:msub><mml:mi>N</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula>, where <italic>N</italic><sub><italic>D</italic></sub> is the size of our database (i.e., the number of social workers who participated in the research).</p>
</list-item>
<list-item>
<label>(3)</label>
<p>Then, each user instance <italic>I</italic><sub><italic>i</italic></sub> is assigned to one of the <italic>N</italic><sub><italic>c</italic></sub> clusters by determining the maximum of the similarity metrics for the row <italic>s</italic><sub><italic>i</italic></sub> of <italic>S</italic>, where <inline-formula><mml:math id="M7"><mml:mrow><mml:msub><mml:mi>s</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>c</mml:mi></mml:msub></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mtext>&#x2009;</mml:mtext><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>N</mml:mi><mml:mi>D</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula>. The cluster participation vector <inline-formula><mml:math id="M8"><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>&#x211D;</mml:mi><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>D</mml:mi></mml:msub></mml:mrow></mml:msup></mml:mrow></mml:math></inline-formula> (i.e., to which cluster every user belongs) is then obtained by the following formula:
<disp-formula id="S4.E4">
<label>(4)</label>
<mml:math id="M9"><mml:mrow><mml:mpadded width="+3.3pt"><mml:msub><mml:mi>c</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mpadded><mml:mo rspace="5.8pt">=</mml:mo><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:msub><mml:mi>x</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:munder><mml:mo movablelimits="false">max</mml:mo><mml:mi>j</mml:mi></mml:munder><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>f</mml:mi><mml:mi>o</mml:mi><mml:mpadded width="+5pt"><mml:mi>r</mml:mi></mml:mpadded><mml:mpadded width="+3.3pt"><mml:mi>j</mml:mi></mml:mpadded><mml:mo rspace="5.8pt">=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>N</mml:mi><mml:mi>c</mml:mi></mml:msub><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:math></disp-formula>
<disp-formula id="S4.Ex1">
<mml:math id="M10"><mml:mrow><mml:mrow><mml:mo rspace="7.5pt">&#x2200;</mml:mo><mml:mpadded width="+3.3pt"><mml:mi>i</mml:mi></mml:mpadded></mml:mrow><mml:mo rspace="5.8pt">=</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>N</mml:mi><mml:mi>D</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:math></disp-formula></p></list-item>
<list-item>
<label>(4)</label>
<p>Having assigned each user <italic>I</italic><sub><italic>i</italic></sub> to a <italic>N</italic><sub><italic>c</italic></sub> cluster, the new representatives <italic>R</italic><sub><italic>j</italic></sub> must be calculated. The <italic>k-Modes</italic> algorithm suggests that, for every cluster, a &#x201C;virtual&#x201D; user is created who does not belong to the original database. Therefore, we observe the cluster members for every feature <italic>F</italic><sub><italic>r</italic></sub>, and we calculate the frequency <inline-formula><mml:math id="M11"><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mi>r</mml:mi><mml:mi>l</mml:mi></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> of the <inline-formula><mml:math id="M12"><mml:msubsup><mml:mi>x</mml:mi><mml:mi>r</mml:mi><mml:mi>l</mml:mi></mml:msubsup></mml:math></inline-formula> values. The value that is most common among the members is assigned to the new representative <italic>R</italic><sub><italic>j</italic></sub>. This method is referred to as taking the &#x201C;mode&#x201D; of the cluster members (<xref ref-type="bibr" rid="B6">6</xref>). If more than one of the <inline-formula><mml:math id="M13"><mml:msubsup><mml:mi>x</mml:mi><mml:mi>r</mml:mi><mml:mi>l</mml:mi></mml:msubsup></mml:math></inline-formula> values are the most frequent for a feature <italic>F</italic><sub><italic>r</italic></sub>, the mode randomly selects one of them.</p>
</list-item>
<list-item>
<label>(5)</label>
<p>Finally, we iterate over steps 2 to 4 until either the set of the representatives (<inline-formula><mml:math id="M14"><mml:mrow><mml:mpadded width="+3.3pt"><mml:msubsup><mml:mi>T</mml:mi><mml:mi>R</mml:mi><mml:mrow><mml:mi>o</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>l</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>d</mml:mi></mml:mrow></mml:msubsup></mml:mpadded><mml:mo rspace="5.8pt">=</mml:mo><mml:msubsup><mml:mi>T</mml:mi><mml:mi>R</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>e</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>w</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula>) or the cluster participation vector (<italic>C<sup>old</sup></italic>=<italic>C<sup>new</sup></italic>) remains the same.</p>
</list-item>
</list>
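<p>For illustration, the representative update of step (4) can be sketched in Python. The feature values below are hypothetical, and ties are broken by first occurrence rather than at random, which is a simplification of this sketch:</p>

```python
from collections import Counter

def update_representative(cluster_members):
    """Compute a cluster's new representative ("mode"): a virtual user
    built from the most frequent value of every feature.

    cluster_members: list of users, each a list of categorical values.
    Ties are broken by first occurrence (the paper breaks them randomly).
    """
    n_features = len(cluster_members[0])
    representative = []
    for r in range(n_features):
        counts = Counter(user[r] for user in cluster_members)
        # most_common(1) returns [(value, frequency)]
        representative.append(counts.most_common(1)[0][0])
    return representative

members = [["22-25", "yes", "0-1"],
           ["22-25", "no",  "0-1"],
           ["26-30", "yes", "2-3"]]
print(update_representative(members))  # ['22-25', 'yes', '0-1']
```

Note that the resulting representative (`['22-25', 'yes', '0-1']`) need not coincide with any actual user in the database.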
<p>At this point, we should note that the <italic>k-Modes</italic> parameters are tuned appropriately to achieve optimal accuracy and efficiency in the final clusters. Generally speaking, the parameters that need to be determined are (a) the number of clusters <italic>N<sub>c</sub></italic>, (b) the similarity metric used to categorize the social workers, and (c) the features used for the clustering process (also referred to as feature selection). Considering our main hypothesis that most of the social workers are young graduates and/or of limited experience, the value <italic>N</italic><sub><italic>c</italic></sub>=2 is chosen. To explain this choice further, if two distinct groups of users are obtained, with the larger one containing mostly social workers who meet the standards of our hypothesis, then our hypothesis is supported.</p>
<p>Regarding the metric used to calculate the similarity between users, we consider the following users: <inline-formula><mml:math id="M15"><mml:mrow><mml:mpadded width="+3.3pt"><mml:msub><mml:mi>I</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mpadded><mml:mo rspace="5.8pt">=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mn>1</mml:mn><mml:mi>i</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mn>2</mml:mn><mml:mi>i</mml:mi></mml:msubsup><mml:mo rspace="4.2pt">,</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:msub><mml:mi>N</mml:mi><mml:mi>F</mml:mi></mml:msub><mml:mi>i</mml:mi></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M16"><mml:mrow><mml:mpadded width="+3.3pt"><mml:msub><mml:mi>I</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mpadded><mml:mo rspace="5.8pt">=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mn>1</mml:mn><mml:mi>j</mml:mi></mml:msubsup><mml:mo rspace="7.5pt">,</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mn>2</mml:mn><mml:mi>j</mml:mi></mml:msubsup><mml:mo rspace="4.2pt">,</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:msub><mml:mi>N</mml:mi><mml:mi>F</mml:mi></mml:msub><mml:mi>j</mml:mi></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>, where <inline-formula><mml:math id="M17"><mml:msubsup><mml:mi>x</mml:mi><mml:mi>r</mml:mi><mml:mi>i</mml:mi></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math id="M18"><mml:msubsup><mml:mi>x</mml:mi><mml:mi>r</mml:mi><mml:mi>j</mml:mi></mml:msubsup></mml:math></inline-formula> are the values of the <italic>r<sup>th</sup></italic> feature for users <italic>I</italic><sub><italic>i</italic></sub> and <italic>I</italic><sub><italic>j</italic></sub>, respectively, while <italic>N</italic><sub><italic>F</italic></sub> represents the total number of features used to describe a user. 
The overall similarity between two users is then defined as follows:</p>
<disp-formula id="S4.E5"><label>(5)</label><mml:math id="M19"><mml:mrow><mml:mrow><mml:mi>S</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>m</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mpadded width="+5pt"><mml:msub><mml:mi>I</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mpadded><mml:mo>&#x2062;</mml:mo><mml:msub><mml:mi>I</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow><mml:mo rspace="5.8pt">)</mml:mo></mml:mrow></mml:mrow><mml:mo rspace="5.8pt">=</mml:mo><mml:mrow><mml:munderover><mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo><mml:mrow><mml:mpadded><mml:mi>r</mml:mi></mml:mpadded><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>F</mml:mi></mml:msub></mml:munderover><mml:mrow><mml:mi>S</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mi>r</mml:mi><mml:mi>i</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mi>r</mml:mi><mml:mi>j</mml:mi></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M20"><mml:mrow><mml:mi>S</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mi>r</mml:mi><mml:mi>i</mml:mi></mml:msubsup><mml:mo rspace="7.5pt">,</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mi>r</mml:mi><mml:mi>j</mml:mi></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> denotes the similarity measure for the individual attribute value <italic>x</italic><sub><italic>r</italic></sub> of the two users. The most straightforward technique is to set <inline-formula><mml:math id="M21"><mml:mrow><mml:mi>S</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mi>r</mml:mi><mml:mi>i</mml:mi></mml:msubsup><mml:mo rspace="7.5pt">,</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mi>r</mml:mi><mml:mi>j</mml:mi></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> = 1 for every feature value that <inline-formula><mml:math id="M22"><mml:mrow><mml:msubsup><mml:mi>x</mml:mi><mml:mi>r</mml:mi><mml:mi>i</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mi>r</mml:mi><mml:mi>j</mml:mi></mml:msubsup></mml:mrow></mml:math></inline-formula> share. However, such a metric does not take the overall data distribution into account: the same similarity value is assigned regardless of whether a value appears in most of the data or is rare. To overcome this drawback, we use what the literature refers to as the &#x201C;inverse occurrence frequency&#x201D; metric (<xref ref-type="bibr" rid="B48">48</xref>). This metric calculates the similarity between matching attributes of two users by leveraging the weight of &#x201C;rare&#x201D; attribute values as follows:</p>
<disp-formula id="S4.E6"><label>(6)</label><mml:math id="M23"><mml:mrow><mml:mrow><mml:mi>S</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mpadded width="+5pt"><mml:msubsup><mml:mi>x</mml:mi><mml:mi>r</mml:mi><mml:mi>i</mml:mi></mml:msubsup></mml:mpadded><mml:mo>&#x2062;</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mi>r</mml:mi><mml:mi>j</mml:mi></mml:msubsup></mml:mrow><mml:mo rspace="5.8pt">)</mml:mo></mml:mrow></mml:mrow><mml:mo rspace="5.8pt">=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mtable displaystyle="true" rowspacing="0pt"><mml:mtr><mml:mtd columnalign="center"><mml:mrow><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mi>r</mml:mi><mml:mi>i</mml:mi></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mpadded width="+5pt"><mml:mi>f</mml:mi></mml:mpadded><mml:mo>&#x2062;</mml:mo><mml:mpadded width="+3.3pt"><mml:msubsup><mml:mi>x</mml:mi><mml:mi>r</mml:mi><mml:mi>i</mml:mi></mml:msubsup></mml:mpadded></mml:mrow></mml:mrow><mml:mo rspace="5.8pt">=</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mi>r</mml:mi><mml:mi>j</mml:mi></mml:msubsup></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd columnalign="center"><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mrow><mml:mi>o</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>h</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>e</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>r</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>w</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>s</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mi>e</mml:mi></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable><mml:mi/></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M24"><mml:mrow><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo rspace="5.8pt">)</mml:mo></mml:mrow></mml:mrow><mml:mo rspace="5.8pt">=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>D</mml:mi></mml:msub></mml:mfrac></mml:mrow></mml:math></inline-formula>, <italic>N</italic><sub><italic>r</italic></sub>(<italic>x</italic>) stands for the number of users that possess the value <italic>x</italic> for the <italic>r<sup>th</sup></italic> feature, and <italic>N</italic><sub><italic>D</italic></sub> is the total number of users. One problem we encountered with this similarity metric concerned two features of our dataset, namely &#x201C;<italic>Current Organization Type</italic>&#x201D; and &#x201C;<italic>Past Organization Type</italic>&#x201D;, for which most of the users had selected multiple values in a single answer. A workaround for this problem was to perform a modified one-hot encoding of these features. One-hot encoding (<xref ref-type="bibr" rid="B49">49</xref>) is the procedure of producing binary features labeled with all the possible values an initial feature can take. However, this can lead to a common problem in data science known as the &#x201C;curse of dimensionality&#x201D; (<xref ref-type="bibr" rid="B50">50</xref>). To overcome this problem, we applied the following technique:</p>
<p>The term <inline-formula><mml:math id="M25"><mml:msubsup><mml:mi>X</mml:mi><mml:mi>r</mml:mi><mml:mi>i</mml:mi></mml:msubsup></mml:math></inline-formula> refers to the set of values that the <italic>r<sup>th</sup></italic> feature of user <italic>I</italic><sub><italic>i</italic></sub> takes. Accordingly, the term <inline-formula><mml:math id="M26"><mml:msubsup><mml:mi>X</mml:mi><mml:mi>r</mml:mi><mml:mi>j</mml:mi></mml:msubsup></mml:math></inline-formula> refers to the set of values that the <italic>r<sup>th</sup></italic> feature of user <italic>I</italic><sub><italic>j</italic></sub> takes. <inline-formula><mml:math id="M27"><mml:msubsup><mml:mi>X</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2229;</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is the set produced by the intersection of <inline-formula><mml:math id="M28"><mml:msubsup><mml:mi>X</mml:mi><mml:mi>r</mml:mi><mml:mi>i</mml:mi></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math id="M29"><mml:msubsup><mml:mi>X</mml:mi><mml:mi>r</mml:mi><mml:mi>j</mml:mi></mml:msubsup></mml:math></inline-formula>:</p>
<disp-formula id="S4.E7"><label>(7)</label><mml:math id="M30"><mml:mrow><mml:mpadded width="+3.3pt"><mml:msubsup><mml:mi>X</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2229;</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:mpadded><mml:mo rspace="5.8pt">=</mml:mo><mml:mrow><mml:msubsup><mml:mi>X</mml:mi><mml:mi>r</mml:mi><mml:mi>i</mml:mi></mml:msubsup><mml:mo>&#x2229;</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mi>r</mml:mi><mml:mi>j</mml:mi></mml:msubsup></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>The similarity <inline-formula><mml:math id="M31"><mml:mrow><mml:mi>S</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mi>r</mml:mi><mml:mi>i</mml:mi></mml:msubsup><mml:mo rspace="7.5pt">,</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mi>r</mml:mi><mml:mi>j</mml:mi></mml:msubsup><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> of the users for the <italic>r<sup>th</sup></italic> attribute can be defined as the sum of the similarity scores for the individual values that belong to the <inline-formula><mml:math id="M32"><mml:msubsup><mml:mi>X</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2229;</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> set. The above concept is shown in Equation 8.</p>
<disp-formula id="S4.E8"><label>(8)</label><mml:math id="M33"><mml:mrow><mml:mrow><mml:mi>S</mml:mi><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>X</mml:mi><mml:mi>r</mml:mi><mml:mi>i</mml:mi></mml:msubsup><mml:mo>&#x2062;</mml:mo><mml:msubsup><mml:mi>X</mml:mi><mml:mi>r</mml:mi><mml:mi>j</mml:mi></mml:msubsup></mml:mrow><mml:mo rspace="5.8pt">)</mml:mo></mml:mrow></mml:mrow><mml:mo rspace="5.8pt">=</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mover><mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo><mml:mo stretchy="false">|</mml:mo></mml:mover><mml:msubsup><mml:mi>X</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2229;</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo fence="true" stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mpadded width="+3.3pt"><mml:mi>l</mml:mi></mml:mpadded><mml:mo rspace="5.8pt">=</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2062;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mi>l</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2229;</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mfrac></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M34"><mml:msubsup><mml:mi>x</mml:mi><mml:mi>l</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2229;</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is the <italic>l<sup>th</sup></italic> value in the <inline-formula><mml:math id="M35"><mml:msubsup><mml:mi>X</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2229;</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> set and <inline-formula><mml:math id="M36"><mml:mrow><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo rspace="5.8pt">)</mml:mo></mml:mrow></mml:mrow><mml:mo rspace="5.8pt">=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo>&#x2062;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>D</mml:mi></mml:msub></mml:mfrac></mml:mrow></mml:math></inline-formula>, where <italic>N</italic><sub><italic>r</italic></sub>(<italic>x</italic>) stands for the number of users that possess the value <italic>x</italic> for the <italic>r<sup>th</sup></italic> feature.</p>
<p>In general, this technique bears a great resemblance to one-hot encoding, with the difference that the algorithm does not have to deal with the &#x201C;curse of dimensionality,&#x201D; as the creation of new virtual features is avoided through the <italic>ad-hoc</italic> computation of the intersection set <inline-formula><mml:math id="M37"><mml:msubsup><mml:mi>X</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2229;</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and the &#x201C;inverse occurrence frequency&#x201D; measure for the set&#x2019;s elements.</p>
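<p>A minimal sketch of this similarity computation, combining the &#x201C;inverse occurrence frequency&#x201D; measure of Equation 6 with the intersection-set handling of Equations 7 and 8, could look as follows; the feature names, value codes, and frequency table below are hypothetical illustrations, not values from our dataset:</p>

```python
def iof_similarity(user_i, user_j, freqs):
    """Overall similarity of two users (Equations 5-8).

    user_i, user_j: dicts mapping feature name -> value; a multi-valued
    feature (e.g. organization type) maps to a set of values.
    freqs: dict mapping feature name -> {value: p_r(x)}, the fraction of
    users in the database holding each value.
    Each matching value contributes 1 / p_r(x)^2, so rare shared values
    weigh more; non-matching values contribute 0.
    """
    total = 0.0
    for r, v_i in user_i.items():
        v_j = user_j[r]
        if isinstance(v_i, set):          # multi-valued feature: Eqs. (7)-(8)
            for x in v_i & v_j:           # intersection set
                total += 1.0 / freqs[r][x] ** 2
        elif v_i == v_j:                  # single-valued feature: Eq. (6)
            total += 1.0 / freqs[r][v_i] ** 2
    return total

# Hypothetical frequencies and users for illustration only.
freqs = {"age": {"22-30": 0.5, "35+": 0.5},
         "org": {"NGO": 0.5, "public": 0.25}}
a = {"age": "22-30", "org": {"NGO", "public"}}
b = {"age": "22-30", "org": {"public"}}
# age matches: 1/0.5^2 = 4; org intersection {"public"}: 1/0.25^2 = 16
print(iof_similarity(a, b, freqs))  # 20.0
```

Note how the rarer shared value (&#x201C;public&#x201D;, held by a quarter of the users) contributes four times more than the common age match.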
<p>The overall concept of the aforementioned technique is presented as pseudo-code as follows:</p>
<table-wrap position="float" id="A1">
<label>Algorithm 1</label>
<caption><p><italic>Modified k-Modes</italic> (Database <italic>D</italic>, Number of Clusters <italic>N<sub>c</sub></italic>).</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<tbody>
<tr>
<td valign="top" align="left"><inline-graphic mimetype="image" mime-subtype="tiff" xlink:href="bijiam-2023-12-i001.jpg"/></td>
</tr>
</tbody>
</table>
</table-wrap>
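<p>For readers who prefer executable form, the loop of Algorithm 1 might be sketched as follows; this is a simplified illustration under assumptions (a generic similarity callable, deterministic tie-breaking), not the authors' exact implementation:</p>

```python
import random
from collections import Counter

def modified_k_modes(database, n_clusters, similarity, max_iter=100, seed=0):
    """Sketch of the modified k-Modes loop (steps 1-5 above).

    database: list of users, each a list of categorical values.
    similarity: callable (user, representative) -> float; here this
    would be the set-aware inverse occurrence frequency measure.
    Returns (representatives, assignment).
    """
    rng = random.Random(seed)
    # Step 1: pick N_c distinct users as initial representatives.
    representatives = [list(u) for u in rng.sample(database, n_clusters)]
    assignment = None
    for _ in range(max_iter):
        # Steps 2-3: assign every user to its most similar representative.
        new_assignment = [
            max(range(n_clusters),
                key=lambda j: similarity(user, representatives[j]))
            for user in database
        ]
        # Step 5: stop when cluster participation no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: replace each representative by the mode of its cluster.
        for j in range(n_clusters):
            members = [u for u, c in zip(database, assignment) if c == j]
            if members:
                representatives[j] = [
                    Counter(m[r] for m in members).most_common(1)[0][0]
                    for r in range(len(members[0]))
                ]
    return representatives, assignment

# Toy run with a plain matching similarity (illustrative data).
db = [["young", "low"], ["young", "low"], ["old", "high"], ["old", "high"]]
match = lambda u, v: sum(x == y for x, y in zip(u, v))
reps, labels = modified_k_modes(db, 2, match)
```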
</sec>
<sec id="S4.SS3">
<title>4.3. Evaluation of the clustering algorithm performance</title>
<p>Our dataset consists of 136 entries and 12 features. Two of the features (&#x201C;<italic>Current Organization Type</italic>&#x201D; and &#x201C;<italic>Past Organization Type</italic>&#x201D;) can contain multiple values. To evaluate our results, we compare our clustering method with (a) the normal k-Modes algorithm (using one-hot encoding for the features that contain multiple values) and (b) the agglomerative clustering method (<xref ref-type="bibr" rid="B51">51</xref>). In <xref ref-type="table" rid="T1">Table 1</xref>, we present the best three subsets of features for each technique as generated by the feature selection phase, as well as the <italic>intra</italic>/<italic>inter</italic> score for each subset. The overfitting caused by one-hot encoding does not lead to high-quality clustering results, as it can add too much &#x201C;noise&#x201D; to the features. As a result, the modified k-Modes algorithm we propose achieves lower <italic>intra</italic>/<italic>inter</italic> ratio values than the other clustering techniques.</p>
<table-wrap position="float" id="T1">
<label>TABLE 1</label>
<caption><p>Algorithms&#x2019; performance evaluation results.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Algorithm</td>
<td valign="top" align="left">Feature subset</td>
<td valign="top" align="center">Ratio</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><bold>Modified k-Modes</bold></td>
<td valign="top" align="left"><italic>Age, hasMSc, Total experience, Current organization type, and Total time in the current organization</italic></td>
<td valign="top" align="center"><bold>0.3653</bold></td>
</tr>
<tr>
<td valign="top" align="left"/><td valign="top" align="left"><italic>Age, Total experience, Current organization type, Position at current organization, and Total time in the past organization</italic></td>
<td valign="top" align="center"><bold>0.3719</bold></td>
</tr>
<tr>
<td valign="top" align="left"/><td valign="top" align="left"><italic>Age, Total experience, Total time in the current organization, Position at current organization, and Total time in the past organization</italic></td>
<td valign="top" align="center"><bold>0.3733</bold></td>
</tr>
<tr>
<td valign="top" align="left"><bold>Agglomerative clustering</bold></td>
<td valign="top" align="left"><italic>Sex, Education, and hasMSc</italic></td>
<td valign="top" align="center"><bold>0.5656</bold></td>
</tr>
<tr>
<td valign="top" align="left"/><td valign="top" align="left"><italic>Education, hasMSc, and Position at current organization</italic></td>
<td valign="top" align="center"><bold>0.5837</bold></td>
</tr>
<tr>
<td valign="top" align="left"/><td valign="top" align="left"><italic>Education, hasMSc, and Location of residence</italic></td>
<td valign="top" align="center"><bold>0.6156</bold></td>
</tr>
<tr>
<td valign="top" align="left"><bold>Normal k-Modes</bold></td>
<td valign="top" align="left"><italic>Age, Total experience, Total time in the current organization, Position at current organization, and Total time in the past organization</italic></td>
<td valign="top" align="center"><bold>0.3733</bold></td>
</tr>
<tr>
<td valign="top" align="left"/><td valign="top" align="left"><italic>Age, Location of residence, Total experience, Total time in the current organization, Position at current organization, and Total time in the past organization</italic></td>
<td valign="top" align="center"><bold>0.3852</bold></td>
</tr>
<tr>
<td valign="top" align="left"/><td valign="top" align="left"><italic>Total experience, Total time in the current organization, Position at current organization, and Total time in the past organization</italic></td>
<td valign="top" align="center"><bold>0.4027</bold></td>
</tr>
</tbody>
</table></table-wrap>
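<p>For readers wishing to reproduce such an evaluation, one common way to define an <italic>intra</italic>/<italic>inter</italic> score is sketched below; the exact definition used for Table 1 may differ, and the simple matching dissimilarity is an assumption of this sketch:</p>

```python
def intra_inter_ratio(database, assignment, representatives, dissimilarity):
    """One common form of the intra/inter score (lower is better).

    intra: mean dissimilarity of each user to its own cluster
    representative; inter: mean dissimilarity between the cluster
    representatives themselves.
    """
    intra = sum(dissimilarity(u, representatives[c])
                for u, c in zip(database, assignment)) / len(database)
    pairs = [(a, b) for i, a in enumerate(representatives)
             for b in representatives[i + 1:]]
    inter = sum(dissimilarity(a, b) for a, b in pairs) / len(pairs)
    return intra / inter

def matching_dissimilarity(u, v):
    """Fraction of features on which two users differ."""
    return sum(x != y for x, y in zip(u, v)) / len(u)

# Toy example: two perfectly separated clusters give a ratio of 0.
db = [["a", "x"], ["a", "x"], ["b", "y"], ["b", "y"]]
reps = [["a", "x"], ["b", "y"]]
assign = [0, 0, 1, 1]
print(intra_inter_ratio(db, assign, reps, matching_dissimilarity))  # 0.0
```

Tight clusters drive the numerator down while well-separated representatives drive the denominator up, which is why lower ratios in Table 1 indicate better clusterings.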
<p>To further examine the behavior of the three algorithms, we plotted the curve of the mean <italic>intra</italic>/<italic>inter</italic> ratio against the number of features used (<xref ref-type="fig" rid="F5">Figure 5</xref>). This allows us to observe what to expect from these techniques with regard to how many features we are willing to discard. The modified k-Modes clearly outperforms the other two clustering methods. In addition, while the mean <italic>intra</italic>/<italic>inter</italic> ratio rises for hierarchical clustering and normal k-Modes as the number of features increases, this is not the case for modified k-Modes: the mean ratio not only decreases but also does not change drastically as more features are added. This analysis indicates that the ideal number of features is 5, as this value offers the advantage of fewer candidate features while maintaining appropriate levels of clustering efficiency. This choice is further supported by the fact that the optimal subset produced through the feature selection method for the modified k-Modes algorithm indeed had five features.</p>
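<p>The feature selection phase referenced above can be approximated by an exhaustive scan over candidate subsets; the scoring callable below is hypothetical and stands in for running the clustering on the restricted data and computing its intra/inter ratio:</p>

```python
from itertools import combinations

def best_feature_subsets(feature_names, score_fn, subset_size, top=3):
    """Score every feature subset of a given size and return the `top`
    lowest-scoring ones (lower intra/inter ratio = better clustering).

    score_fn(subset) -> float is assumed to cluster the data restricted
    to `subset` and return its intra/inter ratio; a greedy or heuristic
    search could replace this exhaustive scan for larger feature sets.
    """
    scored = [(score_fn(s), s) for s in combinations(feature_names, subset_size)]
    scored.sort(key=lambda t: t[0])
    return scored[:top]

# Hypothetical toy scorer: subsets containing "Age" cluster better.
features = ["Age", "hasMSc", "Sex"]
toy_score = lambda s: 0.3 if "Age" in s else 0.6
print(best_feature_subsets(features, toy_score, 2, top=1))
# [(0.3, ('Age', 'hasMSc'))]
```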
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption><p>Mean ratio to number of features for all algorithms.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijiam-2023-12-g005.tif"/>
</fig>
<p>Taking the subset of features that yields the best evaluation score for the modified k-Modes algorithm, <italic>T</italic><sub><italic>F</italic></sub>={<italic>Age, hasMSc, Total Experience, Current Organization Type, Total Time in the Current Organization</italic>}, leads to the creation of two clusters <italic>C</italic><sub><italic>1</italic></sub> and <italic>C</italic><sub><italic>2</italic></sub>. The first cluster contains 86 members, and the second contains 51 users. A scatterplot of &#x201C;age&#x201D; against &#x201C;total experience&#x201D; was plotted to compare the distributions of these two clusters. The centroids (representatives) of each cluster are also presented for informational purposes in <xref ref-type="fig" rid="F6">Figure 6</xref>.</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption><p>Creating clusters with age/total experience scatterplot.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijiam-2023-12-g006.tif"/>
</fig>
<p>Due to the categorical nature of these features, the following problem arose: if two users had the same values for the features age and total experience, one point would completely cover the other. This was solved by adding Gaussian noise to every point on the plot so that coincident points are slightly offset around their initial spot. It is evident that the members of cluster <italic>C</italic><sub><italic>1</italic></sub> are, in their majority, young social workers (22&#x2013;30 years old) with very limited working experience, as only 8 of them have worked for more than 5 years. On the contrary, the users categorized in <italic>C</italic><sub><italic>2</italic></sub> have more than 10 years of experience, and their age distribution is concentrated at 35 years and above.</p>
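<p>The jitter workaround described above can be sketched as follows, assuming the categorical codes have already been mapped to numbers; the noise scale and seed are illustrative choices:</p>

```python
import random

def jitter(points, sigma=0.08, seed=0):
    """Offset each (x, y) point by small Gaussian noise so that users
    with identical categorical codes no longer coincide on the plot.

    points: list of (age_code, experience_code) pairs as numbers.
    sigma controls the spread around the original spot.
    """
    rng = random.Random(seed)
    return [(x + rng.gauss(0, sigma), y + rng.gauss(0, sigma))
            for x, y in points]

pts = [(1, 2), (1, 2), (3, 4)]   # the first two users overlap exactly
jittered = jitter(pts)
```

After jittering, the two coincident users receive distinct nearby positions, so both remain visible while the overall cluster shape is preserved.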
<p>To further analyze our results, we created two graphs that compare the &#x201C;age&#x201D; and &#x201C;total experience&#x201D; distributions between the two clusters. As shown in <xref ref-type="fig" rid="F7">Figure 7</xref>, in cluster <italic>C</italic><sub><italic>1</italic></sub>, 47 social workers have 2&#x2013;3 years of experience, while 16 of them have 0&#x2013;1 year of experience. This indicates that most of the social workers in this cluster have limited working experience.</p>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption><p>Total experience distribution in clusters C1 and C2.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijiam-2023-12-g007.tif"/>
</fig>
<p>Finally, observing the age distribution for the same cluster in <xref ref-type="fig" rid="F8">Figure 8</xref>, it is evident that these social workers are also young graduates, as most of them are under 30 years of age.</p>
<fig id="F8" position="float">
<label>FIGURE 8</label>
<caption><p>Age distribution in clusters C<sub>1</sub> and C<sub>2</sub>.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijiam-2023-12-g008.tif"/>
</fig>
</sec>
</sec>
<sec id="S5" sec-type="conclusion">
<title>5. Conclusion</title>
<p>The social sciences are crossing into the artificial intelligence era, aided by the rapidly changing field of machine learning with its enhanced data analysis tools and improved algorithms and techniques. Statistical hypotheses, tested against a chosen significance level, have been the norm for a very long period, demanding a &#x201C;statistical&#x201D; point of view on various problems and datasets.</p>
<p>In contrast to this established practice, we proposed an approach based on clustering algorithms that aspires to become common ground for social science researchers in the upcoming years. Our approach exploits a modified k-Modes algorithm, replacing statistical hypothesis testing with cluster construction. Moreover, we addressed the problem of selecting a subset of important features from the whole dataset, so that the &#x201C;important&#x201D; features are known before clustering is performed. Consequently, the clustering process becomes more efficient, focused, and strict, as only the important features are used. Our approach can therefore be classified as a two-step method: we first rank the features and then select the important subset.</p>
<p>Finally, as a future work direction, the outcomes obtained through our approach can be further evaluated with a number of alternative algorithms; in this regard, researchers are free to apply their own selection or amendment of techniques and to reapply them if necessary. Clustering allows us to identify the sorting and allocation of observations: we start with an initial number of clusters, allocate the observations to the corresponding clusters, and subsequently evaluate how representative each variable was in creating them. The result of one method can therefore serve as input to the other, making this a &#x201C;cyclical approach&#x201D; or, as we define it, &#x201C;<bold><italic>a recursive feedback method</italic></bold>&#x201D;.</p>
</sec>
<sec id="S6">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="B1"><label>1.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Teloni</surname> <given-names>DD</given-names></name> <name><surname>Dedotsi</surname> <given-names>S</given-names></name> <name><surname>Telonis</surname> <given-names>AG.</given-names></name></person-group> <article-title>Refugee &#x2018;crisis&#x2019; and social services in Greece: social workers&#x2019; profile and working conditions.</article-title> <source><italic>Eur J Soc Work.</italic></source> (<year>2020</year>) <volume>23</volume>:<fpage>1005</fpage>&#x2013;<lpage>18</lpage>.</citation></ref>
<ref id="B2"><label>2.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jones</surname> <given-names>C</given-names></name></person-group>. <article-title>Voices from the front line: state social workers and New Labour.</article-title> <source><italic>Br J Soc Work.</italic></source> (<year>2001</year>) <volume>31</volume>:<fpage>547</fpage>&#x2013;<lpage>62</lpage>.</citation></ref>
<ref id="B3"><label>3.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Teloni</surname> <given-names>DD</given-names></name> <name><surname>Dedotsi</surname> <given-names>S</given-names></name> <name><surname>Lazanas</surname> <given-names>A</given-names></name> <name><surname>Telonis</surname> <given-names>A.</given-names></name></person-group> <article-title>Social work with refugees: examining social workers&#x2019; role and practice in times of crisis in Greece.</article-title> <source><italic>Int Soc Work.</italic></source> (<year>2021</year>) <volume>1</volume>:<fpage>1</fpage>&#x2013;<lpage>18</lpage>.</citation></ref>
<ref id="B4"><label>4.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lancaster</surname> <given-names>HO</given-names></name> <name><surname>Seneta</surname> <given-names>E</given-names></name></person-group>. <article-title>Chi-square distribution.</article-title> <source><italic>Encycl Biostat.</italic></source> (<year>2005</year>) <volume>2</volume>.</citation></ref>
<ref id="B5"><label>5.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Adams</surname> <given-names>WT</given-names></name> <name><surname>Skopek</surname> <given-names>TR</given-names></name></person-group>. <article-title>Statistical test for the comparison of samples from mutational spectra.</article-title> <source><italic>J Mol Biol.</italic></source> (<year>1987</year>) <volume>194</volume>:<fpage>391</fpage>&#x2013;<lpage>96</lpage>.</citation></ref>
<ref id="B6"><label>6.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bai</surname> <given-names>L</given-names></name> <name><surname>Liang</surname> <given-names>J</given-names></name></person-group>. <article-title>The k-modes type clustering plus between-cluster information for categorical data.</article-title> <source><italic>Neurocomputing.</italic></source> (<year>2014</year>) <volume>133</volume>:<fpage>111</fpage>&#x2013;<lpage>21</lpage>.</citation></ref>
<ref id="B7"><label>7.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kurita</surname> <given-names>T</given-names></name></person-group>. <article-title>An efficient agglomerative clustering algorithm using a heap.</article-title> <source><italic>Pattern Recogn.</italic></source> (<year>1991</year>) <volume>24</volume>:<fpage>205</fpage>&#x2013;<lpage>09</lpage>.</citation></ref>
<ref id="B8"><label>8.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chaturvedi</surname> <given-names>A</given-names></name> <name><surname>Green</surname> <given-names>PE</given-names></name> <name><surname>Caroll</surname> <given-names>JD</given-names></name></person-group>. <article-title>K-modes clustering.</article-title> <source><italic>J Class.</italic></source> (<year>2001</year>) <volume>18</volume>:<fpage>35</fpage>&#x2013;<lpage>55</lpage>.</citation></ref>
<ref id="B9"><label>9.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zambom</surname> <given-names>AZ</given-names></name> <name><surname>Collazos</surname> <given-names>JA</given-names></name> <name><surname>Dias</surname> <given-names>R</given-names></name></person-group>. <article-title>Functional data clustering via hypothesis testing k-means.</article-title> <source><italic>Comput Stat.</italic></source> (<year>2019</year>) <volume>34</volume>:<fpage>527</fpage>&#x2013;<lpage>49</lpage>.</citation></ref>
<ref id="B10"><label>10.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ferraty</surname> <given-names>F</given-names></name> <name><surname>Vieu</surname> <given-names>P.</given-names></name></person-group> <source><italic>Nonparametric functional data analysis: theory and practice.</italic></source> <publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer Science &#x0026; Business Media</publisher-name> (<year>2006</year>)</citation></ref>
<ref id="B11"><label>11.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Horv&#x00E1;th</surname> <given-names>L</given-names></name> <name><surname>Kokoszka</surname> <given-names>P.</given-names></name></person-group> <source><italic>Inference for functional data with applications</italic></source>, <volume>Vol. 200</volume>. <publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer Science &#x0026; Business Media</publisher-name> (<year>2012</year>).</citation></ref>
<ref id="B12"><label>12.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hsing</surname> <given-names>T</given-names></name> <name><surname>Eubank</surname> <given-names>R.</given-names></name></person-group> <source><italic>Theoretical foundations of functional data analysis, with an introduction to linear operators</italic></source>, <volume>Vol. 997</volume>. <publisher-loc>Hoboken, NJ</publisher-loc>: <publisher-name>John Wiley &#x0026; Sons</publisher-name> (<year>2015</year>).</citation></ref>
<ref id="B13"><label>13.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kokoszka</surname> <given-names>P</given-names></name> <name><surname>Reimherr</surname> <given-names>M.</given-names></name></person-group> <source><italic>Introduction to functional data analysis.</italic></source> <publisher-loc>Boca Raton, FL</publisher-loc>: <publisher-name>Chapman and Hall/CRC</publisher-name> (<year>2017</year>).</citation></ref>
<ref id="B14"><label>14.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jacques</surname> <given-names>J</given-names></name> <name><surname>Preda</surname> <given-names>C</given-names></name></person-group>. <article-title>Functional data clustering: a survey.</article-title> <source><italic>Adv Data Anal Class.</italic></source> (<year>2014</year>) <volume>8</volume>:<fpage>231</fpage>&#x2013;<lpage>55</lpage>.</citation></ref>
<ref id="B15"><label>15.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>JL</given-names></name> <name><surname>Chiou</surname> <given-names>JM</given-names></name> <name><surname>M&#x00FC;ller</surname> <given-names>HG</given-names></name></person-group>. <article-title>Functional data analysis.</article-title> <source><italic>Annu Rev Stat Appl.</italic></source> (<year>2016</year>) <volume>3</volume>:<fpage>257</fpage>&#x2013;<lpage>95</lpage>.</citation></ref>
<ref id="B16"><label>16.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Reimherr</surname> <given-names>M</given-names></name> <name><surname>Nicolae</surname> <given-names>D</given-names></name></person-group>. <article-title>A functional data analysis approach for genetic association studies.</article-title> <source><italic>Ann Appl Stat.</italic></source> (<year>2014</year>) <volume>8</volume>:<fpage>406</fpage>&#x2013;<lpage>29</lpage>.</citation></ref>
<ref id="B17"><label>17.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Young</surname> <given-names>DL</given-names></name> <name><surname>Fields</surname> <given-names>S</given-names></name></person-group>. <article-title>The role of functional data in interpreting the effects of genetic variation.</article-title> <source><italic>Mol Biol Cell.</italic></source> (<year>2015</year>) <volume>26</volume>:<fpage>3904</fpage>&#x2013;<lpage>08</lpage>.</citation></ref>
<ref id="B18"><label>18.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bowman</surname> <given-names>FD</given-names></name> <name><surname>Guo</surname> <given-names>Y</given-names></name> <name><surname>Derado</surname> <given-names>G</given-names></name></person-group>. <article-title>Statistical approaches to functional neuroimaging data.</article-title> <source><italic>Neuroimaging Clin N Am.</italic></source> (<year>2007</year>) <volume>17</volume>:<fpage>441</fpage>&#x2013;<lpage>58</lpage>.</citation></ref>
<ref id="B19"><label>19.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hasenstab</surname> <given-names>K</given-names></name> <name><surname>Scheffler</surname> <given-names>A</given-names></name> <name><surname>Telesca</surname> <given-names>D</given-names></name> <name><surname>Sugar</surname> <given-names>CA</given-names></name> <name><surname>Jeste</surname> <given-names>S</given-names></name> <name><surname>DiStefano</surname> <given-names>C.</given-names></name><etal/></person-group> <article-title>A multi-dimensional functional principal components analysis of EEG data.</article-title> <source><italic>Biometrics.</italic></source> (<year>2017</year>) <volume>73</volume>:<fpage>999</fpage>&#x2013;<lpage>1009</lpage>.</citation></ref>
<ref id="B20"><label>20.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Di Salvo</surname> <given-names>F</given-names></name> <name><surname>Ruggieri</surname> <given-names>M</given-names></name> <name><surname>Plaia</surname> <given-names>A</given-names></name></person-group>. <article-title>Functional principal component analysis for multivariate multidimensional environmental data.</article-title> <source><italic>Environ Ecol Stat.</italic></source> (<year>2015</year>) <volume>22</volume>:<fpage>739</fpage>&#x2013;<lpage>57</lpage>.</citation></ref>
<ref id="B21"><label>21.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Saeys</surname> <given-names>W</given-names></name> <name><surname>De Ketelaere</surname> <given-names>B</given-names></name> <name><surname>Darius</surname> <given-names>P</given-names></name></person-group>. <article-title>Potential applications of functional data analysis in chemometrics.</article-title> <source><italic>J Chemometr.</italic></source> (<year>2008</year>) <volume>22</volume>:<fpage>335</fpage>&#x2013;<lpage>44</lpage>.</citation></ref>
<ref id="B22"><label>22.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Aguilera</surname> <given-names>AM</given-names></name> <name><surname>Escabias</surname> <given-names>M</given-names></name> <name><surname>Valderrama</surname> <given-names>MJ</given-names></name> <name><surname>Aguilera-Morillo</surname> <given-names>MC</given-names></name></person-group>. <article-title>Functional analysis of chemometric data.</article-title> <source><italic>Open J Stat.</italic></source> (<year>2013</year>) <volume>3</volume>:<fpage>334</fpage>.</citation></ref>
<ref id="B23"><label>23.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Eisen</surname> <given-names>MB.</given-names></name> <name><surname>Spellman</surname> <given-names>PT</given-names></name> <name><surname>Brown</surname> <given-names>PO</given-names></name> <name><surname>Botstein</surname> <given-names>D</given-names></name></person-group>. <article-title>Cluster analysis and display of genome-wide expression patterns.</article-title> <source><italic>Proc Natl Acad Sci USA.</italic></source> (<year>1998</year>) <volume>95</volume>:<fpage>14863</fpage>&#x2013;<lpage>8</lpage>.</citation></ref>
<ref id="B24"><label>24.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dahl</surname> <given-names>DB</given-names></name> <name><surname>Newton</surname> <given-names>MA</given-names></name></person-group>. <article-title>Multiple hypothesis testing by clustering treatment effects.</article-title> <source><italic>J Am Stat Assoc.</italic></source> (<year>2007</year>) <volume>102</volume>:<fpage>517</fpage>&#x2013;<lpage>26</lpage>.</citation></ref>
<ref id="B25"><label>25.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dudoit</surname> <given-names>S</given-names></name> <name><surname>Shaffer</surname> <given-names>JP</given-names></name> <name><surname>Boldrick</surname> <given-names>JC</given-names></name></person-group>. <article-title>Multiple hypothesis testing in microarray experiments.</article-title> <source><italic>Stat Sci.</italic></source> (<year>2003</year>) <volume>18</volume>:<fpage>71</fpage>&#x2013;<lpage>103</lpage>.</citation></ref>
<ref id="B26"><label>26.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Storey</surname> <given-names>JD</given-names></name> <name><surname>Taylor</surname> <given-names>JE</given-names></name> <name><surname>Siegmund</surname> <given-names>D</given-names></name></person-group>. <article-title>Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach.</article-title> <source><italic>J R Stat Soc Ser B.</italic></source> (<year>2004</year>) <volume>66</volume>:<fpage>187</fpage>&#x2013;<lpage>205</lpage>.</citation></ref>
<ref id="B27"><label>27.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Benjamini</surname> <given-names>Y</given-names></name> <name><surname>Yekutieli</surname> <given-names>D</given-names></name></person-group>. <article-title>The control of the false discovery rate in multiple testing under dependency.</article-title> <source><italic>Ann Stat.</italic></source> (<year>2001</year>) <volume>29</volume>:<fpage>1165</fpage>&#x2013;<lpage>88</lpage>.</citation></ref>
<ref id="B28"><label>28.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ward</surname> <given-names>JH</given-names> <suffix>Jr.</suffix> </name></person-group> <article-title>Hierarchical grouping to optimize an objective function.</article-title> <source><italic>J Am Stat Assoc.</italic></source> (<year>1963</year>) <volume>58</volume>:<fpage>236</fpage>&#x2013;<lpage>44</lpage>.</citation></ref>
<ref id="B29"><label>29.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hartigan</surname> <given-names>JA.</given-names></name></person-group> <source><italic>Clustering algorithms.</italic></source> <publisher-loc>Hoboken, NJ</publisher-loc>: <publisher-name>John Wiley &#x0026; Sons, Inc</publisher-name> (<year>1975</year>).</citation></ref>
<ref id="B30"><label>30.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fraley</surname> <given-names>C</given-names></name> <name><surname>Raftery</surname> <given-names>AE</given-names></name></person-group>. <article-title>Model-based clustering, discriminant analysis, and density estimation.</article-title> <source><italic>J Am Stat Assoc.</italic></source> (<year>2002</year>) <volume>97</volume>:<fpage>611</fpage>&#x2013;<lpage>31</lpage>.</citation></ref>
<ref id="B31"><label>31.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Medvedovic</surname> <given-names>M</given-names></name> <name><surname>Sivaganesan</surname> <given-names>S</given-names></name></person-group>. <article-title>Bayesian infinite mixture model based clustering of gene expression profiles.</article-title> <source><italic>Bioinformatics.</italic></source> (<year>2002</year>) <volume>18</volume>:<fpage>1194</fpage>&#x2013;<lpage>206</lpage>.</citation></ref>
<ref id="B32"><label>32.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Finch</surname> <given-names>H</given-names></name></person-group>. <article-title>Comparison of distance measures in cluster analysis with dichotomous data.</article-title> <source><italic>J Data Sci.</italic></source> (<year>2005</year>) <volume>3</volume>:<fpage>85</fpage>&#x2013;<lpage>100</lpage>.</citation></ref>
<ref id="B33"><label>33.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tarpey</surname> <given-names>T</given-names></name> <name><surname>Kinateder</surname> <given-names>KK</given-names></name></person-group>. <article-title>Clustering functional data.</article-title> <source><italic>J Class.</italic></source> (<year>2003</year>) <volume>20</volume>.</citation></ref>
<ref id="B34"><label>34.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yamamoto</surname> <given-names>M</given-names></name></person-group>. <article-title>Clustering of functional data in a low-dimensional subspace.</article-title> <source><italic>Adv Data Anal Class.</italic></source> (<year>2012</year>) <volume>6</volume>:<fpage>219</fpage>&#x2013;<lpage>47</lpage>.</citation></ref>
<ref id="B35"><label>35.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vrieze</surname> <given-names>SI</given-names></name></person-group>. <article-title>Model selection and psychological theory: a discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).</article-title> <source><italic>Psychol Methods.</italic></source> (<year>2012</year>) <volume>17</volume>:<fpage>228</fpage>.</citation></ref>
<ref id="B36"><label>36.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bouveyron</surname> <given-names>C</given-names></name> <name><surname>Brunet-Saumard</surname> <given-names>C</given-names></name></person-group>. <article-title>Model-based clustering of high-dimensional data: a review.</article-title> <source><italic>Comput Stat Data Anal.</italic></source> (<year>2014</year>) <volume>71</volume>:<fpage>52</fpage>&#x2013;<lpage>78</lpage>.</citation></ref>
<ref id="B37"><label>37.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sakamoto</surname> <given-names>Y</given-names></name> <name><surname>Ishiguro</surname> <given-names>M</given-names></name> <name><surname>Kitagawa</surname> <given-names>G.</given-names></name></person-group> <source><italic>Akaike information criterion statistics.</italic></source> <publisher-loc>Dordrecht</publisher-loc>: <publisher-name>Kluwer Academic Publishers Group</publisher-name> (<year>1986</year>).</citation></ref>
<ref id="B38"><label>38.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bouveyron</surname> <given-names>C</given-names></name> <name><surname>Jacques</surname> <given-names>J</given-names></name></person-group>. <article-title>Model-based clustering of time series in group-specific functional subspaces.</article-title> <source><italic>Adv Data Anal Class.</italic></source> (<year>2011</year>) <volume>5</volume>:<fpage>281</fpage>&#x2013;<lpage>300</lpage>.</citation></ref>
<ref id="B39"><label>39.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chiou</surname> <given-names>JM</given-names></name> <name><surname>Li</surname> <given-names>PL</given-names></name></person-group>. <article-title>Functional clustering and identifying substructures of longitudinal data.</article-title> <source><italic>J R Stat Soc Ser B.</italic></source> (<year>2007</year>) <volume>69</volume>:<fpage>679</fpage>&#x2013;<lpage>99</lpage>.</citation></ref>
<ref id="B40"><label>40.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sugar</surname> <given-names>CA</given-names></name> <name><surname>James</surname> <given-names>GM</given-names></name></person-group>. <article-title>Finding the number of clusters in a dataset: an information-theoretic approach.</article-title> <source><italic>J Am Stat Assoc.</italic></source> (<year>2003</year>) <volume>98</volume>:<fpage>750</fpage>&#x2013;<lpage>63</lpage>.</citation></ref>
<ref id="B41"><label>41.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Giacofci</surname> <given-names>M</given-names></name> <name><surname>Lambert-Lacroix</surname> <given-names>S</given-names></name> <name><surname>Marot</surname> <given-names>G</given-names></name> <name><surname>Picard</surname> <given-names>F</given-names></name></person-group>. <article-title>Wavelet-based clustering for mixed-effects functional models in high dimension.</article-title> <source><italic>Biometrics.</italic></source> (<year>2013</year>) <volume>69</volume>:<fpage>31</fpage>&#x2013;<lpage>40</lpage>.</citation></ref>
<ref id="B42"><label>42.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ciollaro</surname> <given-names>M</given-names></name> <name><surname>Genovese</surname> <given-names>CR</given-names></name> <name><surname>Wang</surname> <given-names>D</given-names></name></person-group>. <article-title>Nonparametric clustering of functional data using pseudo-densities.</article-title> <source><italic>Electron J Stat.</italic></source> (<year>2016</year>) <volume>10</volume>:<fpage>2922</fpage>&#x2013;<lpage>72</lpage>.</citation></ref>
<ref id="B43"><label>43.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bongiorno</surname> <given-names>EG</given-names></name> <name><surname>Goia</surname> <given-names>A</given-names></name></person-group>. <article-title>Classification methods for Hilbert data based on surrogate density.</article-title> <source><italic>Comput Stat Data Anal.</italic></source> (<year>2016</year>) <volume>99</volume>:<fpage>204</fpage>&#x2013;<lpage>22</lpage>.</citation></ref>
<ref id="B44"><label>44.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Aggarwal</surname> <given-names>CC.</given-names></name></person-group> <source><italic>Data mining: the textbook.</italic></source> <publisher-loc>Berlin</publisher-loc>: <publisher-name>Springer</publisher-name> (<year>2015</year>).</citation></ref>
<ref id="B45"><label>45.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dietterich</surname> <given-names>T</given-names></name></person-group>. <article-title>Overfitting and undercomputing in machine learning.</article-title> <source><italic>ACM Comput Surv.</italic></source> (<year>1995</year>) <volume>27</volume>:<fpage>326</fpage>&#x2013;<lpage>7</lpage>.</citation></ref>
<ref id="B46"><label>46.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ray</surname> <given-names>S.</given-names></name> <name><surname>Turi</surname> <given-names>RH.</given-names></name></person-group> &#x201C;<article-title>Determination of number of clusters in k-means clustering and application in colour image segmentation</article-title>,&#x201D; in <source><italic>Proceedings of the 4th international conference on advances in pattern recognition and digital techniques</italic></source>, <publisher-loc>New Delhi</publisher-loc>: <publisher-name>Narosa Publishing House</publisher-name> (<year>1999</year>), <fpage>137</fpage>&#x2013;<lpage>43</lpage>.</citation></ref>
<ref id="B47"><label>47.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>He</surname> <given-names>Z</given-names></name> <name><surname>Xu</surname> <given-names>X</given-names></name> <name><surname>Deng</surname> <given-names>S</given-names></name></person-group>. <article-title>Attribute value weighting in k-modes clustering.</article-title> <source><italic>Expert Syst Appl.</italic></source> (<year>2011</year>) <volume>38</volume>:<fpage>15365</fpage>&#x2013;<lpage>9</lpage>.</citation></ref>
<ref id="B48"><label>48.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Havrlant</surname> <given-names>L</given-names></name> <name><surname>Kreinovich</surname> <given-names>V</given-names></name></person-group>. <article-title>A simple probabilistic explanation of term frequency-inverse document frequency (tf-idf) heuristic (and variations motivated by this explanation).</article-title> <source><italic>Int J Gen Syst.</italic></source> (<year>2017</year>) <volume>46</volume>:<fpage>27</fpage>&#x2013;<lpage>36</lpage>.</citation></ref>
<ref id="B49"><label>49.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dahouda</surname> <given-names>MK</given-names></name> <name><surname>Joe</surname> <given-names>I</given-names></name></person-group>. <article-title>A Deep-learned embedding technique for categorical features encoding.</article-title> <source><italic>IEEE Access.</italic></source> (<year>2021</year>) <volume>9</volume>:<fpage>114381</fpage>&#x2013;<lpage>91</lpage>.</citation></ref>
<ref id="B50"><label>50.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rust</surname> <given-names>J</given-names></name></person-group>. <article-title>Using randomization to break the curse of dimensionality.</article-title> <source><italic>Econometrica.</italic></source> (<year>1997</year>) <volume>65</volume>:<fpage>487</fpage>&#x2013;<lpage>516</lpage>.</citation></ref>
<ref id="B51"><label>51.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>M&#x00FC;llner</surname> <given-names>D</given-names></name></person-group>. <article-title>fastcluster: fast hierarchical, agglomerative clustering routines for R and Python.</article-title> <source><italic>J Stat Softw.</italic></source> (<year>2013</year>) <volume>53</volume>:<fpage>1</fpage>&#x2013;<lpage>18</lpage>.</citation></ref>
</ref-list>
</back>
</article>
