<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v2.3 20070202//EN" "archivearticle.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="methods-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Bohr. Iam.</journal-id>
<journal-title>BOHR International Journal of Internet of things, Artificial Intelligence and Machine Learning</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Bohr. Iam.</abbrev-journal-title>
<issn pub-type="epub">2583-5521</issn>
<publisher>
<publisher-name>BOHR</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.54646/bijiam.2023.11</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Methods</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Emotion recognition based on speech signals by combining empirical mode decomposition and deep neural network</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Pan</surname> <given-names>Shing-Tai</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Chen</surname> <given-names>Ching-Fa</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Hong</surname> <given-names>Chuan-Cheng</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Computer Science and Information Engineering, National University of Kaohsiung</institution>, <addr-line>Kaohsiung, Taiwan</addr-line>, <country>R.O.C</country></aff>
<aff id="aff2"><sup>2</sup><institution>Department of Electronic Engineering, Kao Yuan University</institution>, <addr-line>Kaohsiung, Taiwan</addr-line>, <country>R.O.C</country></aff>
<aff id="aff3"><sup>3</sup><institution>Department of Computer Science and Information Engineering, National University of Kaohsiung</institution>, <addr-line>Kaohsiung, Taiwan</addr-line>, <country>R.O.C</country></aff>
<author-notes>
<corresp id="c001">&#x002A;Correspondence: Shing-Tai Pan, <email>stpan@nu.edu.tw</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>30</day>
<month>01</month>
<year>2023</year>
</pub-date>
<volume>2</volume>
<issue>1</issue>
<fpage>1</fpage>
<lpage>10</lpage>
<history>
<date date-type="received">
<day>05</day>
<month>01</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>17</day>
<month>01</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2022 Pan, Chen and Hong.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Pan, Chen and Hong</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by-nc-nd/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>This paper proposes a novel method for speech emotion recognition. Empirical mode decomposition (EMD) is applied in this paper for the extraction of emotional features from speeches, and a deep neural network (DNN) is used to classify speech emotions. This paper enhances the emotional components in speech signals by using EMD with acoustic feature Mel-Scale Frequency Cepstral Coefficients (MFCCs) to improve the recognition rates of emotions from speeches using the classifier DNN. In this paper, EMD is first used to decompose the speech signals, which contain emotional components into multiple intrinsic mode functions (IMFs), and then emotional features are derived from the IMFs and are calculated using MFCC. Then, the emotional features are used to train the DNN model. Finally, a trained model that could recognize the emotional signals is then used to identify emotions in speeches. Experimental results reveal that the proposed method is effective.</p>
</abstract>
<kwd-group>
<kwd>speech emotion recognition</kwd>
<kwd>empirical mode decomposition</kwd>
<kwd>deep neural network</kwd>
<kwd>mel-scale frequency cepstral coefficients</kwd>
<kwd>hidden markov model</kwd>
</kwd-group>
<counts>
<fig-count count="5"/>
<table-count count="7"/>
<equation-count count="21"/>
<ref-count count="23"/>
<page-count count="10"/>
<word-count count="5124"/>
</counts>
</article-meta>
</front>
<body>
<sec id="S1" sec-type="intro">
<title>1. Introduction</title>
<p>People can, most of the time, precisely sense the emotion of a speaker during communication. For example, people can detect anger in a loud, harsh voice and happiness in a voice full of laughter. This means that people can easily gather information about a person&#x2019;s mood simply by listening to them. In fact, emotion is vital information that speech signals carry in addition to the verbal content (<xref ref-type="bibr" rid="B1">1</xref>). A human-computer interface (HCI) that automatically detects emotions from speech is therefore reasonable and promising. Recently, studies on automatic emotion recognition from speech have attracted considerable attention. These studies span many fields, for example, psychology, sociology, biomedical science, and education, and they focus on the impact of emotion on health and on how to recognize a person&#x2019;s state of mind from their speech. Speech is the most important medium for these studies for the following reasons: (a) the availability of fast computing systems, (b) the effectiveness of various signal processing algorithms, and (c) the acoustic differences that are naturally embedded in speech signals under various emotional situations (<xref ref-type="bibr" rid="B2">2</xref>).</p>
<p>Computing power has improved enormously over the past decade. Hence, it has become possible to develop systems based on machine learning or deep learning methods that automatically recognize people&#x2019;s emotions from their speech. There is some literature on this topic; for example, refer to the papers (<xref ref-type="bibr" rid="B1">1</xref>&#x2013;<xref ref-type="bibr" rid="B6">6</xref>). In the paper (<xref ref-type="bibr" rid="B1">1</xref>), a fuzzy rank-based ensemble of transfer learning models is used for speech emotion recognition. In the paper (<xref ref-type="bibr" rid="B2">2</xref>), empirical mode decomposition (EMD) is applied to decompose speech signals and obtain non-linear features for emotion recognition. The paper (<xref ref-type="bibr" rid="B3">3</xref>) uses the Hidden Markov Model (HMM) for speech emotion recognition. A hybrid system that uses signals from faces and voices to recognize people&#x2019;s emotions is proposed in the paper (<xref ref-type="bibr" rid="B4">4</xref>). An exploration of various models and speech features for speech emotion recognition is presented in the papers (<xref ref-type="bibr" rid="B5">5</xref>, <xref ref-type="bibr" rid="B6">6</xref>). According to the studies (<xref ref-type="bibr" rid="B7">7</xref>, <xref ref-type="bibr" rid="B8">8</xref>), the main topics in automatic emotion recognition from speech are the selection of a database, feature extraction, and the development of recognition algorithms (<xref ref-type="bibr" rid="B8">8</xref>). The choice of recognition algorithm is an important issue in the emotion recognition problem. Several algorithms have been used for this application, for example, HMM (<xref ref-type="bibr" rid="B9">9</xref>), the Support Vector Machine (SVM) (<xref ref-type="bibr" rid="B10">10</xref>, <xref ref-type="bibr" rid="B11">11</xref>), the Gaussian Mixture Model (GMM) (<xref ref-type="bibr" rid="B12">12</xref>), K-Nearest Neighbors (KNN) (<xref ref-type="bibr" rid="B13">13</xref>), and the Artificial Neural Network (ANN) (<xref ref-type="bibr" rid="B14">14</xref>). A method combining speech and images for emotion recognition is explored in (<xref ref-type="bibr" rid="B15">15</xref>). However, that method takes more computation time and requires more hardware resources. Hence, this paper focuses on emotion recognition based on speech signals alone. Since an ANN mimics the architecture of neurons in an organism to process signals, it has some advantages over the other methods: excellent fault tolerance, good learning ability, and suitability for nonlinear regression problems. Hence, this paper adopts an ANN as the emotion recognition algorithm. A supervised ANN, the Deep Neural Network (DNN), is used in this paper to train the emotion model of speech and then recognize emotions from speech.</p>
<p>The objective of this paper is to improve emotion recognition rates based on speech signals by applying deep learning methods, motivated by the massive progress in their capability in recent years. A DNN is adopted for this purpose. First, this paper applies the EMD method to improve emotional feature extraction. The IMFs decomposed by EMD are combined in a weighted sum to obtain the emotional features of speech, with the weights designed by a genetic algorithm. MFCC features are then extracted from the weighted sum of IMFs and used to train classifiers for emotion recognition. As for the classifier, since HMM has been used for decades and successfully applied in many speech recognition applications, this paper uses the emotion recognition results from HMM for comparison. Moreover, to save computation time and hardware resources, the DNN architecture is designed to be as simple as possible while still achieving better emotion recognition rates than those obtained with HMM.</p>
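<p>The weighted combination of IMFs described above can be sketched as a simple linear operation. This is a minimal illustration only: the actual weights are tuned by a genetic algorithm in the paper, so the fixed values below are placeholders, and the function name is ours.</p>

```python
import numpy as np

def combine_imfs(imfs, weights):
    """Weighted sum of the IMFs produced by EMD.

    imfs:    array of shape (n_imfs, signal_length).
    weights: one weight per IMF.  In the paper these are designed by a
             genetic algorithm; fixed values here are only illustrative.
    """
    imfs = np.asarray(imfs, dtype=float)
    weights = np.asarray(weights, dtype=float).reshape(-1, 1)
    # broadcast each weight across its IMF, then sum over the IMF axis
    return np.sum(weights * imfs, axis=0)

# Toy example with two 2-sample "IMFs" and illustrative weights:
imfs = np.array([[1.0, 2.0], [3.0, 4.0]])
enhanced = combine_imfs(imfs, [0.5, 1.0])
```

The resulting signal `enhanced` would then feed the MFCC extraction stage in place of the raw speech.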
<p>The organization of this paper is as follows. For readability, Section 2 briefly introduces the preprocessing and feature extraction methods for speech used in this paper. The EMD method used for extracting emotional features is introduced in Section 3. The classifiers HMM and DNN are then introduced in Section 4 and Section 5, respectively. The experimental results of emotion recognition using the proposed methods are presented in Section 6. Finally, Section 7 concludes the paper.</p>
</sec>
<sec id="S2">
<title>2. Preprocessing and feature extraction for speech</title>
<sec id="S2.SS1">
<title>2.1. Framing speech</title>
<p>Speech signals are non-stationary and vary with time. For speech signal processing, it is necessary to divide a speech signal into several short blocks to obtain more stationary segments. Hence, frames are extracted from the speech signal as the first step. The extracted frames always overlap so that each frame contains some information from the previous one. Different overlap rates yield different speech features, so some experiments are required to choose a suitable overlapping rate. A frame of 256 points is used in this paper; that is, for a speech signal with an 8 kHz sampling rate and a length of 1 second, about 32 frames are obtained from framing the speech. In this paper, uniform sampling of speech signals is used since it is more robust and less biased than non-uniform sampling (<xref ref-type="bibr" rid="B16">16</xref>). However, since emotional speech recordings differ in duration, different speeches yield different numbers of frames.</p>
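<p>The framing step above can be sketched as follows. The 256-point frame length matches the paper; the 50% overlap (hop of 128 samples) is an illustrative choice only, since the paper notes the overlap rate must be chosen experimentally.</p>

```python
import numpy as np

def frame_signal(signal, frame_len=256, hop=128):
    """Split a 1-D speech signal into overlapping frames.

    frame_len: points per frame (256, as in the paper).
    hop: step between frame starts; hop=128 gives 50% overlap
         (an illustrative value, tuned experimentally in practice).
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

# A 1-second utterance at an 8 kHz sampling rate:
speech = np.random.randn(8000)
frames = frame_signal(speech)
```

With a 128-sample hop, a 1-second, 8 kHz utterance yields 61 overlapping frames rather than the roughly 32 non-overlapping ones.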
</sec>
<sec id="S2.SS2">
<title>2.2. Speech preemphasis</title>
<p>As speech propagates through the air, its high-frequency components are attenuated more than its low-frequency components. Hence, a high-pass finite-impulse-response (FIR) filter is applied to the speech signal to enhance the high-frequency components. The high-pass filter can be described as follows (<xref ref-type="bibr" rid="B8">8</xref>):</p>
<disp-formula id="S2.E1">
<label>(1)</label>
<mml:math id="M1">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mrow>
<mml:mi>o</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mn>0.97</mml:mn>
<mml:mo>&#x00D7;</mml:mo>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mrow>
<mml:mi>o</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>in which <italic>S</italic><sub><italic>pe</italic></sub>(<italic>n</italic>) is the output of the FIR filter; <italic>S</italic><sub><italic>of</italic></sub>(<italic>n</italic>) is the original speech signal; and <italic>N</italic> is the number of points in a frame.</p>
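<p>Equation (1) amounts to a one-line array operation. A minimal sketch (function name ours; the first sample is passed through unchanged, a boundary convention the paper does not specify):</p>

```python
import numpy as np

def preemphasis(s, alpha=0.97):
    """High-pass FIR pre-emphasis filter of Eq. (1):
    s_pe[n] = s[n] - alpha * s[n-1].
    The first sample is passed through unchanged (boundary assumption).
    """
    out = np.copy(s).astype(float)
    out[1:] = s[1:] - alpha * s[:-1]
    return out

# A constant (purely low-frequency) frame is almost entirely suppressed:
frame = np.array([1.0, 1.0, 1.0, 1.0])
emphasized = preemphasis(frame)
```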
</sec>
<sec id="S2.SS3">
<title>2.3. Applying hamming window</title>
<p>The Fourier transform is commonly used to calculate speech features. However, due to the discontinuities at the start and end of a frame, high-frequency noise may appear when the Fourier transform is applied to the frame. To mitigate this, a Hamming window is applied to each frame to reduce the effects of these discontinuities. The Hamming window is described by the following equation (<xref ref-type="bibr" rid="B9">9</xref>):</p>
<disp-formula id="S2.E2">
<label>(2)</label>
<mml:math id="M2">
<mml:mrow>
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mtable displaystyle="true" rowspacing="0pt">
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mn>0.54</mml:mn>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:mn>0.46</mml:mn>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mi>cos</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi mathvariant="normal">&#x03C0;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mrow>
<mml:mo>;</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mi>o</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>h</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>r</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>w</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>e</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mi/>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <italic>N</italic> is the number of points in a frame. And,</p>
<disp-formula id="S2.E3">
<label>(3)</label>
<mml:math id="M3">
<mml:mrow>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>&#x00D7;</mml:mo>
<mml:mi>S</mml:mi>
</mml:mrow>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>in which <italic>S</italic>(<italic>n</italic>) is the <italic>n</italic>th point in a frame, and <italic>F</italic>(<italic>n</italic>) is the resulting signal after applying the Hamming window to the frame.</p>
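<p>Equations (2) and (3) can be sketched directly; the window below follows the 0.54/0.46 coefficients of Eq. (2) (function names are ours):</p>

```python
import numpy as np

def hamming_window(N):
    """Hamming window of Eq. (2): w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (N - 1))

def apply_window(frame):
    """Eq. (3): multiply each sample of the frame by the window."""
    frame = np.asarray(frame, dtype=float)
    return frame * hamming_window(len(frame))

w = hamming_window(256)
```

The endpoints of the window are 0.08 rather than 0, which tapers the frame edges and suppresses the spectral leakage the text describes.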
</sec>
<sec id="S2.SS4">
<title>2.4. Fast fourier transform</title>
<p>To calculate the Mel-Frequency Cepstral Coefficients (MFCCs) of a frame, the speech signal must be represented in the frequency domain. Since speech signals are initially in the time domain, the fast Fourier transform (FFT) is applied to transform each frame into a frequency-domain representation. The FFT can be described as follows (<xref ref-type="bibr" rid="B10">10</xref>):</p>
<disp-formula id="S2.E4">
<label>(4)</label>
<mml:math id="M4">
<mml:mrow>
<mml:mrow>
<mml:mpadded width="+3.3pt">
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mpadded>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mrow>
<mml:munderover>
<mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo>
<mml:mrow>
<mml:mpadded width="+1.3pt">
<mml:mi>n</mml:mi>
</mml:mpadded>
<mml:mo rspace="2.8pt">=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>S</mml:mi>
<mml:mi>w</mml:mi>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo rspace="5.8pt">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo rspace="5.8pt">&#x00D7;</mml:mo>
<mml:msubsup>
<mml:mi>W</mml:mi>
<mml:mi>N</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mpadded width="+3.3pt">
<mml:mn>&#x2005;0</mml:mn>
</mml:mpadded>
<mml:mo rspace="5.8pt">&#x2264;</mml:mo>
<mml:mpadded width="+3.3pt">
<mml:mi>k</mml:mi>
</mml:mpadded>
<mml:mo rspace="5.8pt">&#x2264;</mml:mo>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="S2.E5">
<label>(5)</label>
<mml:math id="M5">
<mml:mrow>
<mml:msub>
<mml:mi>W</mml:mi>
<mml:mi>N</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mi>e</mml:mi>
<mml:mfrac>
<mml:mrow>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>&#x2062;</mml:mo>
<mml:mi mathvariant="normal">&#x03C0;</mml:mi>
</mml:mrow>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:mfrac>
</mml:msup>
</mml:mrow>
</mml:math>
</disp-formula>
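<p>As a concrete check of Eqs. (4)-(5), the transform can be evaluated directly from its definition and compared against a library FFT. This direct form is shown only for clarity; it costs O(N&#x00B2;), whereas the FFT computes the same result in O(N log N):</p>

```python
import numpy as np

def dft(frame):
    """Direct evaluation of Eqs. (4)-(5):
    X_k = sum_n s_w[n] * W_N**(k*n), with W_N = exp(-j*2*pi/N).
    Shown for clarity; np.fft.fft gives the same result much faster.
    """
    frame = np.asarray(frame, dtype=float)
    N = len(frame)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    # the (k, n) matrix of twiddle factors W_N^(k*n)
    W = np.exp(-2j * np.pi * k * n / N)
    return W @ frame

spectrum = dft([1.0, 2.0, 3.0, 4.0])
```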
</sec>
<sec id="S2.SS5">
<title>2.5. Mel-frequency cepstral coefficients</title>
<p>The Mel-frequency cepstrum simulates the reception properties of the human ear. The MFCCs are calculated for each frame. To calculate the MFCCs, the FFT is applied to the speech frame first. Then, the Mel triangular band-pass filters are applied to the result of the FFT, <italic>X</italic>(<italic>k</italic>). The Mel triangular band-pass filter is described by the following equation:</p>
<disp-formula id="S2.E6">
<label>(6)</label>
<mml:math id="M6">
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>B</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mtable displaystyle="true" rowspacing="5pt">
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
<mml:mo>&lt;</mml:mo>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>-</mml:mo>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>-</mml:mo>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>-</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>-</mml:mo>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>&lt;</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
<mml:mi/>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <italic>M</italic> denotes the number of filters and 1 &#x2264; <italic>m</italic> &#x2264; <italic>M</italic>. The logarithm is then taken of the sum of the products of the frequency representation <italic>X</italic>(<italic>k</italic>) and the Mel triangular band-pass filter <italic>B</italic><sub><italic>m</italic></sub>(<italic>k</italic>), as follows:</p>
<disp-formula id="S2.E7">
<label>(7)</label>
<mml:math id="M7">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi>Y</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mi>log</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:munderover>
<mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:munderover>
<mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>|</mml:mo>
</mml:mrow>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>B</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Then, the discrete cosine transform is applied to <italic>Y</italic>(<italic>m</italic>) as follows:</p>
<disp-formula id="S2.E8">
<label>(8)</label>
<mml:math id="M8">
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>x</mml:mi>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mi>M</mml:mi>
</mml:mfrac>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:munderover>
<mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>M</mml:mi>
</mml:munderover>
<mml:mrow>
<mml:mrow>
<mml:mi>Y</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>&#x00D7;</mml:mo>
<mml:mrow>
<mml:mi>cos</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="normal">&#x03C0;</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>-</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mn>2</mml:mn>
</mml:mfrac>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mi>M</mml:mi>
</mml:mfrac>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>in which <italic>c</italic><sub><italic>x</italic></sub>(<italic>n</italic>) are the MFCCs. In this paper, the first 13 coefficients of <italic>c</italic><sub><italic>x</italic></sub>(<italic>n</italic>) are calculated and formed into a feature vector. The MFCCs of the frames of the training speeches are used to train the DNN, and those of the testing speeches are tested with the trained DNN.</p>
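<p>Given a magnitude spectrum and a bank of triangular Mel filters, Eqs. (7) and (8) reduce to a log of filterbank energies followed by a cosine transform. The sketch below assumes the filterbank matrix from Eq. (6) is already built (a random matrix stands in for it here, purely for illustration), and the small epsilon inside the logarithm is our addition to guard against log(0), not part of the paper&#x2019;s formulation:</p>

```python
import numpy as np

def mfcc_from_filterbank(magnitude_spectrum, filterbank, n_coeffs=13):
    """Eqs. (7)-(8): log Mel filterbank energies followed by a DCT.

    magnitude_spectrum: |X_k| for one frame.
    filterbank: M x n_bins matrix of triangular filters B_m(k), Eq. (6).
    Returns the first n_coeffs cepstral coefficients c_x(n).
    """
    M = filterbank.shape[0]
    # Eq. (7): Y(m) = log( sum_k |X_k| * B_m(k) ); epsilon avoids log(0)
    Y = np.log(filterbank @ magnitude_spectrum + 1e-12)
    # Eq. (8): c_x(n) = (1/M) * sum_m Y(m) * cos(pi*n*(m - 1/2)/M)
    m = np.arange(1, M + 1)
    return np.array([(1.0 / M) * np.sum(Y * np.cos(np.pi * n * (m - 0.5) / M))
                     for n in range(n_coeffs)])

# Illustrative stand-in: 26 random "filters" over a 129-bin spectrum.
rng = np.random.default_rng(0)
fb = rng.uniform(0.0, 1.0, size=(26, 129))
c = mfcc_from_filterbank(rng.uniform(0.1, 1.0, size=129), fb)
```

The 13-element vector `c` corresponds to the per-frame feature vector used to train and test the DNN.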
</sec>
</sec>
<sec id="S3">
<title>3. Empirical mode decomposition</title>
<p>In this paper, EMD is used to decompose emotional speech signals into various emotional components, which are defined as intrinsic mode functions (IMFs). An IMF must satisfy the following two conditions (<xref ref-type="bibr" rid="B17">17</xref>):</p>
<list list-type="simple">
<list-item>
<label>(1)</label>
<p>The number of local extrema and the number of zero-crossings differ at most by one.</p>
</list-item>
<list-item>
<label>(2)</label>
<p>Upper and lower envelopes of the function are symmetric.</p>
</list-item>
</list>
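<p>Condition (1) above is easy to test numerically. A minimal sketch (function name ours; condition (2), symmetric envelopes, is judged from the cubic-spline envelope mean in practice and is omitted here):</p>

```python
import numpy as np

def is_imf(x):
    """Check condition (1): the number of local extrema and the number
    of zero-crossings of x differ at most by one.  Condition (2) on
    envelope symmetry is not checked in this sketch."""
    x = np.asarray(x, dtype=float)
    d = np.diff(x)
    # local extrema appear as sign changes of the first difference
    n_ext = int(np.sum(np.sign(d[1:]) != np.sign(d[:-1])))
    # zero-crossings appear as sign changes of the signal itself
    n_zc = int(np.sum(np.sign(x[1:]) != np.sign(x[:-1])))
    return abs(n_ext - n_zc) in (0, 1)

t = np.linspace(0.0, 1.0, 1000)
mono_component = np.sin(2.0 * np.pi * 5.0 * t + 0.3)  # satisfies (1)
shifted = mono_component + 2.0  # no zero-crossings: violates (1)
```

A pure tone passes the check, while the same tone with a DC offset fails it, which is why the sifting process repeatedly subtracts the envelope mean.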
<p>The steps of EMD are described as follows. Note that, in this paper, a cubic spline (<xref ref-type="bibr" rid="B17">17</xref>) is adopted to construct the upper and lower envelopes of the signal in the process of deriving the IMFs. Let the original signal be <italic>X</italic>(<italic>t</italic>) and <italic>Temp</italic>(<italic>t</italic>) = <italic>X</italic>(<italic>t</italic>).</p>
<p><bold>Step 1</bold>:</p>
<p>Find the upper envelope <italic>U</italic>(<italic>t</italic>) and the lower envelope <italic>L</italic>(<italic>t</italic>) of the signal <italic>Temp</italic>(<italic>t</italic>). Calculate the mean of the two envelopes, <italic>m</italic>(<italic>t</italic>) = [<italic>U</italic>(<italic>t</italic>) + <italic>L</italic>(<italic>t</italic>)]/2. The intermediate signal <italic>h</italic>(<italic>t</italic>) is calculated as follows:</p>
<disp-formula id="S3.E9">
<label>(9)</label>
<mml:math id="M9">
<mml:mrow>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo rspace="5.8pt" stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p><bold>Step 2</bold>:</p>
<p>Check whether the intermediate signal <italic>h</italic>(<italic>t</italic>) satisfies the conditions of an IMF. If it does, the first IMF is obtained as <italic>imf</italic><sub>1</sub>(<italic>t</italic>) = <italic>h</italic>(<italic>t</italic>), and we proceed to the next step; otherwise, assign the intermediate signal <italic>h</italic>(<italic>t</italic>) to <italic>Temp</italic>(<italic>t</italic>) and return to Step 1.</p>
<p><bold>Step 3</bold>:</p>
<p>Calculate the residue <italic>r</italic><sub>1</sub>(<italic>t</italic>) as follows:</p>
<disp-formula id="S3.E10">
<label>(10)</label>
<mml:math id="M10">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Assign the signal <italic>r</italic><sub>1</sub>(<italic>t</italic>) as the new <italic>Temp</italic>(<italic>t</italic>) and repeat Step 1 and Step 2 to find <italic>imf</italic><sub>2</sub>(<italic>t</italic>).</p>
<p><bold>Step 4</bold>:</p>
<p>Repeat Step 1 to Step 3 to find the subsequent IMFs as follows:</p>
<disp-formula id="S3.E11">
<label>(11)</label>
<mml:math id="M11">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mpadded width="+3.3pt">
<mml:mi>n</mml:mi>
</mml:mpadded>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>4</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">&#x2026;</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>If the signal <italic>r</italic><sub><italic>n</italic></sub>(<italic>t</italic>) is constant or a monotone function, the EMD procedure is completed, and the following decomposition of <italic>X</italic>(<italic>t</italic>) is obtained:</p>
<disp-formula id="S3.E12">
<label>(12)</label>
<mml:math id="M12">
<mml:mrow>
<mml:mrow>
<mml:mi>X</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:munderover>
<mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:munderover>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>The flowchart of EMD is depicted in <xref ref-type="fig" rid="F1">Figure 1</xref>. In this paper, a weighted sum of <italic>imf</italic>s is proposed to recover the emotional components, as written in the following equation (13). The values of the weights <italic>w</italic><sub><italic>i</italic></sub> are set according to the results in (<xref ref-type="bibr" rid="B7">7</xref>).</p>
<disp-formula id="S3.E13">
<label>(13)</label>
<mml:math id="M13">
<mml:mrow>
<mml:mi>X</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo rspace="9.1pt">)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:munderover>
<mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo>
<mml:mrow>
<mml:mpadded width="+6.6pt">
<mml:mi>i</mml:mi>
</mml:mpadded>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:munderover>
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x22C5;</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>f</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>t</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
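<p>Steps 1&#x2013;4 and the weighted recombination of equation (13) can be sketched as follows. This is a minimal Python sketch, not the authors' MATLAB implementation: the cubic-spline envelopes, the fixed sifting count standing in for the full IMF test, and all helper names are our simplifications.</p>

```python
import numpy as np
from scipy.interpolate import CubicSpline

def sift_once(x):
    """One sifting pass: subtract the mean of the upper and lower
    cubic-spline envelopes of x (Step 1)."""
    t = np.arange(len(x))
    # interior local maxima / minima
    maxima = np.where((x[1:-1] > x[:-2]) & (x[1:-1] > x[2:]))[0] + 1
    minima = np.where((x[1:-1] < x[:-2]) & (x[1:-1] < x[2:]))[0] + 1
    if len(maxima) < 4 or len(minima) < 4:
        return None                      # too few extrema: treat x as the residue
    upper = CubicSpline(maxima, x[maxima])(t)
    lower = CubicSpline(minima, x[minima])(t)
    return x - (upper + lower) / 2.0

def emd(x, max_imfs=8, n_sifts=10):
    """Decompose x into IMFs plus a residue, following Eqs. (10)-(12)."""
    imfs, residue = [], np.asarray(x, dtype=float)
    for _ in range(max_imfs):
        if sift_once(residue) is None:   # monotone residue: stop (Step 4)
            break
        h = residue.copy()
        for _ in range(n_sifts):         # fixed sift count stands in for the IMF test
            nxt = sift_once(h)
            if nxt is None:
                break
            h = nxt
        imfs.append(h)
        residue = residue - h            # r_n(t) = r_{n-1}(t) - imf_n(t), Eq. (11)
    return imfs, residue

def weighted_recombine(imfs, weights):
    """Weighted sum of IMFs, the recombination of Eq. (13)."""
    return sum(w * imf for w, imf in zip(weights, imfs))
```

<p>By construction, summing the IMFs and the residue recovers the original signal, which is exactly the decomposition of equation (12).</p>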
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p>Flowchart of empirical mode decomposition (EMD).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijiam-2023-11-g001.tif"/>
</fig>
</sec>
<sec id="S4">
<title>4. Hidden Markov model</title>
<p>In this paper, a discrete HMM is used as a baseline for comparison with the proposed DNN method. The MFCC features extracted from the speech signals after EMD processing are used to train the HMM and then for testing. The MFCC features of each speech signal are arranged as a time series according to the order of the frames obtained by framing the signal. The time series of MFCCs is treated as the observation sequence of the HMM, and the hidden states of the model are estimated using the Viterbi algorithm (<xref ref-type="bibr" rid="B18">18</xref>). <xref ref-type="fig" rid="F2">Figure 2</xref> shows the mechanism of the HMM with the features, observations, and hidden states. The parameters of the HMM &#x03BB; = (<italic>A</italic>, <italic>B</italic>, &#x03C0;) are explained as follows (<xref ref-type="bibr" rid="B18">18</xref>&#x2013;<xref ref-type="bibr" rid="B20">20</xref>):</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p>Mechanism of Hidden Markov Model (HMM).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijiam-2023-11-g002.tif"/>
</fig>
<p><italic>A</italic> = [<italic>a</italic><sub><italic>ij</italic></sub>], <italic>a</italic><sub><italic>ij</italic></sub> = <italic>P</italic>(<italic>q</italic><sub><italic>t</italic></sub> = <italic>x</italic><sub><italic>j</italic></sub>|<italic>q</italic><sub><italic>t</italic>&#x2212;1</sub> = <italic>x</italic><sub><italic>i</italic></sub>), the probability of a transition from hidden state <italic>x</italic><sub><italic>i</italic></sub> to hidden state <italic>x</italic><sub><italic>j</italic></sub>,</p>
<p><italic>B</italic> = [<italic>b</italic><sub><italic>j</italic></sub>(<italic>k</italic>)], <italic>b</italic><sub><italic>j</italic></sub>(<italic>k</italic>) = <italic>P</italic>(<italic>o</italic><sub><italic>t</italic></sub> = <italic>v</italic><sub><italic>k</italic></sub>|<italic>q</italic><sub><italic>t</italic></sub> = <italic>x</italic><sub><italic>j</italic></sub>), the probability that the <italic>k</italic>th observation <italic>v</italic><sub><italic>k</italic></sub> occurs at the <italic>j</italic>th hidden state <italic>x</italic><sub><italic>j</italic></sub>,</p>
<p>&#x03C0; = [&#x03C0;<sub><italic>i</italic></sub>], &#x03C0;<sub><italic>i</italic></sub> = <italic>P</italic>(<italic>q</italic><sub>1</sub> = <italic>x</italic><sub><italic>i</italic></sub>), the probability that hidden state <italic>x</italic><sub><italic>i</italic></sub> occurs at the beginning of the time series,</p>
<p><italic>X</italic> = (<italic>x</italic><sub>1</sub>, <italic>x</italic><sub>2</sub>, &#x22EF;, <italic>x</italic><sub><italic>N</italic></sub>) is the set of hidden states of the HMM.</p>
<p>The training process for the HMM is depicted in <xref ref-type="fig" rid="F3">Figure 3</xref>. The initial values of the matrices A, B, and &#x03C0; are set randomly. A trained codebook is then used to quantize the MFCC features, and the matrices A, B, and &#x03C0; are updated using the Viterbi algorithm (<xref ref-type="bibr" rid="B18">18</xref>). This process repeats until the parameters in A, B, and &#x03C0; converge, which completes the training of the HMM.</p>
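<p>The Viterbi state estimation used above can be sketched as follows. This is a minimal Python sketch of Viterbi decoding for a discrete HMM &#x03BB; = (A, B, &#x03C0;); the log-domain formulation and variable names are our choices.</p>

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Most likely hidden-state path for a discrete HMM lambda = (A, B, pi),
    computed in the log domain for numerical stability."""
    A, B, pi = np.asarray(A), np.asarray(B), np.asarray(pi)
    N, T = A.shape[0], len(obs)
    log_A, log_B, log_pi = np.log(A), np.log(B), np.log(pi)
    delta = np.zeros((T, N))             # best log-probability ending in each state
    psi = np.zeros((T, N), dtype=int)    # back-pointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: state i -> state j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):       # backtrack through the pointers
        path[t] = psi[t + 1, path[t + 1]]
    return path
```

<p>In the experiments, <code>obs</code> would be the codebook-quantized MFCC sequence of one utterance.</p>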
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption><p>Flowchart for the training of HMM.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijiam-2023-11-g003.tif"/>
</fig>
</sec>
<sec id="S5">
<title>5. Deep neural network</title>
<p>The architecture of an ANN is constructed from connections of multiple-layer perceptrons, where each layer comprises several neurons. The transmission of signals between neurons resembles that between neurons in a living organism. In this paper, a deep neural network, i.e., a network with several hidden layers, is proposed and compared with the HMM. This structure allows the neural network to be trained deeply. The structure of a three-layer ANN is shown in <xref ref-type="fig" rid="F4">Figure 4</xref> (<xref ref-type="bibr" rid="B21">21</xref>).</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption><p>The structure of the artificial neural network (ANN).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijiam-2023-11-g004.tif"/>
</fig>
<p>In <xref ref-type="fig" rid="F4">Figure 4</xref>, <italic>P</italic><sub><italic>n</italic></sub> is the <italic>n</italic>th input; <inline-formula><mml:math id="INEQ28"><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msubsup></mml:math></inline-formula> is the weight connecting the <italic>i</italic>th neuron in the (<italic>k</italic>&#x2212;1)th layer to the <italic>j</italic>th neuron in the <italic>k</italic>th layer; <inline-formula><mml:math id="INEQ29"><mml:msubsup><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>k</mml:mi></mml:msubsup></mml:math></inline-formula> is the <italic>n</italic>th output in the <italic>k</italic>th layer; and <inline-formula><mml:math id="INEQ30"><mml:msubsup><mml:mi>b</mml:mi><mml:mi>n</mml:mi><mml:mi>k</mml:mi></mml:msubsup></mml:math></inline-formula> is the bias of the <italic>n</italic>th neuron in the <italic>k</italic>th layer.</p>
<p>After computing the MFCCs of all emotional speech signals, we train the ANN using the MFCCs obtained from the training speech signals. Each MFCC vector with its emotion label is fed to the input of the ANN, and the output of the ANN is used to compute the error with respect to the label of the input MFCC. The output function of the ANN is described in equations (14) and (15).</p>
<disp-formula id="S5.E14">
<label>(14)</label>
<mml:math id="M14">
<mml:mrow>
<mml:msubsup>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>k</mml:mi>
</mml:msubsup>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:munder>
<mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo>
<mml:mrow>
<mml:mo>&#x2200;</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mrow>
<mml:msubsup>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mi>k</mml:mi>
</mml:msubsup>
<mml:mo>&#x2062;</mml:mo>
<mml:msubsup>
<mml:mi>o</mml:mi>
<mml:mi>i</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>-</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:mrow>
<mml:mo>-</mml:mo>
<mml:msubsup>
<mml:mi>b</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>k</mml:mi>
</mml:msubsup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="S5.E15">
<label>(15)</label>
<mml:math id="M15">
<mml:mrow>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mi>tanh</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>The back-propagation algorithm with the steepest descent method (SDM) is used to train the ANN: the weights are updated according to the error function between the output and the target, in order to find the optimal parameters of the ANN. The back-propagation algorithm is described in equations (16)&#x2013;(21).</p>
<disp-formula id="S5.E16">
<label>(16)</label>
<mml:math id="M16">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B4;</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>-</mml:mo>
<mml:msubsup>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mn>2</mml:mn>
</mml:msubsup>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x00D7;</mml:mo>
<mml:msup>
<mml:mi>f</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msup>
</mml:mrow>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:munder>
<mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo>
<mml:mi>i</mml:mi>
</mml:munder>
<mml:mrow>
<mml:msubsup>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msubsup>
<mml:mo>&#x2062;</mml:mo>
<mml:msubsup>
<mml:mi>o</mml:mi>
<mml:mi>i</mml:mi>
<mml:mn>1</mml:mn>
</mml:msubsup>
</mml:mrow>
</mml:mrow>
<mml:mo>-</mml:mo>
<mml:msubsup>
<mml:mi>b</mml:mi>
<mml:mi>n</mml:mi>
<mml:mn>2</mml:mn>
</mml:msubsup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="S5.E17">
<label>(17)</label>
<mml:math id="M17">
<mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">&#x0394;</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:msubsup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msubsup>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mi mathvariant="normal">&#x03B7;</mml:mi>
<mml:mo>&#x00D7;</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B4;</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>&#x00D7;</mml:mo>
<mml:msubsup>
<mml:mi>o</mml:mi>
<mml:mi>i</mml:mi>
<mml:mn>1</mml:mn>
</mml:msubsup>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="S5.E18">
<label>(18)</label>
<mml:math id="M18">
<mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">&#x0394;</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:msubsup>
<mml:mi>b</mml:mi>
<mml:mi>n</mml:mi>
<mml:mn>2</mml:mn>
</mml:msubsup>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:mi mathvariant="normal">&#x03B7;</mml:mi>
<mml:mo>&#x00D7;</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B4;</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="S5.E19">
<label>(19)</label>
<mml:math id="M19">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B4;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:munder>
<mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo>
<mml:mi>n</mml:mi>
</mml:munder>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B4;</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:msubsup>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msubsup>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x00D7;</mml:mo>
<mml:msup>
<mml:mi>f</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msup>
</mml:mrow>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo largeop="true" movablelimits="false" symmetric="true">&#x2211;</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mn>1</mml:mn>
</mml:msubsup>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>r</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mrow>
<mml:mo>-</mml:mo>
<mml:msubsup>
<mml:mi>b</mml:mi>
<mml:mi>i</mml:mi>
<mml:mn>1</mml:mn>
</mml:msubsup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="S5.E20">
<label>(20)</label>
<mml:math id="M20">
<mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">&#x0394;</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:msubsup>
<mml:mi>W</mml:mi>
<mml:mrow>
<mml:mi>r</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mn>1</mml:mn>
</mml:msubsup>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mi mathvariant="normal">&#x03B7;</mml:mi>
<mml:mo>&#x00D7;</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B4;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x00D7;</mml:mo>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>r</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="S5.E21">
<label>(21)</label>
<mml:math id="M21">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">&#x0394;</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:msubsup>
<mml:mi>b</mml:mi>
<mml:mi>i</mml:mi>
<mml:mn>1</mml:mn>
</mml:msubsup>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>-</mml:mo>
<mml:mrow>
<mml:mi mathvariant="normal">&#x03B7;</mml:mi>
<mml:mo>&#x00D7;</mml:mo>
<mml:msub>
<mml:mi mathvariant="normal">&#x03B4;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mrow>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <italic>T</italic><sub><italic>n</italic></sub> is the target output of the ANN and &#x03B7; is the learning rate.</p>
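<p>Equations (14)&#x2013;(21) can be sketched as a single training step. This is a minimal Python sketch of the SDM back-propagation update for one hidden layer; the array shapes and initialization are our assumptions, and biases are subtracted as in equation (14).</p>

```python
import numpy as np

def f(x):
    return np.tanh(x)                    # activation function, Eq. (15)

def f_prime(x):
    return 1.0 - np.tanh(x) ** 2         # derivative of tanh

def train_step(P, T, W1, b1, W2, b2, eta=0.1):
    """One steepest-descent back-propagation update, Eqs. (14)-(21).
    Note the paper's convention o = f(sum(w * o) - b): biases are subtracted."""
    net1 = W1 @ P - b1                   # layer-1 net input, Eq. (14)
    o1 = f(net1)
    net2 = W2 @ o1 - b2                  # layer-2 (output) net input
    o2 = f(net2)
    d2 = (T - o2) * f_prime(net2)        # output delta, Eq. (16)
    d1 = (W2.T @ d2) * f_prime(net1)     # hidden delta, Eq. (19)
    W2 += eta * np.outer(d2, o1)         # Eq. (17)
    b2 -= eta * d2                       # Eq. (18)
    W1 += eta * np.outer(d1, P)          # Eq. (20)
    b1 -= eta * d1                       # Eq. (21)
    return W1, b1, W2, b2
```

<p>Repeating this step drives the squared error between the output and the target downward, which is the descent behavior the SDM update is designed to achieve.</p>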
</sec>
<sec id="S6">
<title>6. Experimental results</title>
<p>The experiments conducted in this paper are performed on a personal computer (PC), and the algorithms are implemented in MATLAB. The emotional speech database used for the experiments is the Berlin emotional database (<xref ref-type="bibr" rid="B22">22</xref>), recorded in German by 10 professional actors (5 male and 5 female). All speeches in this database are sampled at 16 kHz with 16-bit resolution in .wav format. The details of the Berlin emotional database are described in <xref ref-type="table" rid="T1">Table 1</xref>. A 10-fold cross-validation method is adopted for the experiments.</p>
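<p>The 10-fold cross-validation used here can be sketched as follows. This is a minimal Python sketch; the shuffling seed and the index handling are our choices.</p>

```python
import numpy as np

def ten_fold_splits(n_samples, seed=0):
    """Yield (train_idx, test_idx) index pairs for 10-fold cross-validation:
    each sample is used for testing exactly once across the 10 folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)     # shuffle once, then partition
    folds = np.array_split(idx, 10)
    for k in range(10):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(10) if j != k])
        yield train, test
```

<p>For the 535 utterances of the Berlin database, each fold holds 53 or 54 test samples, with the remainder used for training.</p>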
<table-wrap position="float" id="T1">
<label>TABLE 1</label>
<caption><p>Description of Berlin emotional database.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left" colspan="2">Name of database</td>
<td valign="top" align="left" colspan="3">Berlin emotional database</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" colspan="2">Language</td>
<td valign="top" align="left" colspan="3">German</td>
</tr>
<tr>
<td valign="top" align="left" colspan="2">Speaker</td>
<td valign="top" align="left">Gender</td>
<td valign="top" align="center">Number</td>
<td valign="top" align="center">Total</td>
</tr>
<tr>
<td valign="top" colspan="2"/>
<td valign="top" align="left">Male</td>
<td valign="top" align="center">5</td>
<td valign="top" align="center">10</td>
</tr>
<tr>
<td valign="top" colspan="2"/>
<td valign="top" align="left">Female</td>
<td valign="top" align="center">5</td>
<td/>
</tr>
<tr>
<td valign="top" align="left" colspan="2">Sentence</td>
<td valign="top" align="left">Type</td>
<td valign="top" align="center">Number</td>
<td valign="top" align="center">Total</td>
</tr>
<tr>
<td valign="top" colspan="2"/>
<td valign="top" align="left">Long sentence</td>
<td valign="top" align="center">5</td>
<td valign="top" align="center">10</td>
</tr>
<tr>
<td valign="top" colspan="2"/>
<td valign="top" align="left">Short sentence</td>
<td valign="top" align="center">5</td>
<td/>
</tr>
<tr>
<td valign="top" align="left" colspan="2">Emotion</td>
<td valign="top" align="left" colspan="2">Type</td>
<td valign="top" align="center">Number</td>
</tr>
<tr>
<td valign="top" colspan="2"/>
<td valign="top" align="left" colspan="2">anger</td>
<td valign="top" align="center">7</td>
</tr>
<tr>
<td valign="top" colspan="2"/>
<td valign="top" align="left" colspan="2">joy</td>
<td/>
</tr>
<tr>
<td valign="top" colspan="2"/>
<td valign="top" align="left" colspan="2">sadness</td>
<td/>
</tr>
<tr>
<td valign="top" colspan="2"/>
<td valign="top" align="left" colspan="2">fear</td>
<td/>
</tr>
<tr>
<td valign="top" colspan="2"/>
<td valign="top" align="left" colspan="2">disgust</td>
<td/>
</tr>
<tr>
<td valign="top" colspan="2"/>
<td valign="top" align="left" colspan="2">boredom</td>
<td/>
</tr>
<tr>
<td valign="top" colspan="2"/>
<td valign="top" align="left" colspan="2">neutral</td>
<td/>
</tr>
<tr>
<td valign="top" align="left" colspan="2">Facility for recording</td>
<td valign="top" align="left" colspan="3">Sennheiser MKH 40-P48 microphone</td>
</tr>
<tr>
<td valign="top" colspan="2"/>
<td valign="top" align="left" colspan="3">Tascam DA-P1 portable DAT recorder</td>
</tr>
<tr>
<td valign="top" align="left">File</td>
<td valign="top" align="left">Data acquisition</td>
<td valign="top" align="left" colspan="3">Sampling rate: 16 kHz</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left" colspan="3">Resolution: 16 bits</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left" colspan="3">Channel: Mono</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left" colspan="3">Format: wav</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Number</td>
<td valign="top" align="left" colspan="3">535</td>
</tr>
</tbody>
</table></table-wrap>
<p>The experiments conducted in this paper are performed by the following steps:</p>
<list list-type="simple">
<list-item>
<label>1.</label>
<p>Divide the dataset from the Berlin emotional database into a training dataset and a testing dataset.</p>
</list-item>
<list-item>
<label>2.</label>
<p>Decompose all the speeches in the training and testing datasets into IMFs and recombine the IMFs with the weights given in ref. (<xref ref-type="bibr" rid="B9">9</xref>).</p>
</list-item>
<list-item>
<label>3.</label>
<p>Calculate MFCC for the results in Step 2.</p>
</list-item>
<list-item>
<label>4.</label>
<p>Train the DNN model and the HMM model using the feature MFCC obtained in Step 3.</p>
</list-item>
<list-item>
<label>5.</label>
<p>Repeat Step 1 to Step 4 until the recognition rate meets the goal set in this experiment.</p>
</list-item>
<list-item>
<label>6.</label>
<p>Use the model trained in Step 5 for testing.</p>
</list-item>
</list>
<p>The steps of performing the experiments in this paper are described in <xref ref-type="fig" rid="F5">Figure 5</xref>.</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption><p>The steps of the experiments.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijiam-2023-11-g005.tif"/>
</fig>
<p>In this experiment, the structure of the DNN, the activation function, and the experimental settings are described in <xref ref-type="table" rid="T2">Table 2</xref>. The number of hidden layers is set to 5 to realize a deep learning architecture, and the commonly used hyperbolic tangent activation function is adopted.</p>
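<p>The forward pass of the network described in Table 2 can be sketched as follows. This is a minimal Python sketch: the 13 inputs (matching the MFCC dimension), the output size of 7 emotion classes, and the random initialization are our assumptions.</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Layer sizes per Table 2: 13 inputs, five hidden layers of 13 neurons,
# and one output per emotion class (7 classes assumed here).
sizes = [13] + [13] * 5 + [7]
layers = [(rng.normal(0.0, 0.3, (m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x):
    """Propagate one MFCC frame through the tanh network of Table 2."""
    for W, b in layers:
        x = np.tanh(W @ x - b)   # biases subtracted, per the paper's Eq. (14)
    return x
```

<p>Because every layer applies tanh, each output component lies strictly within (&#x2212;1, 1); a trained network would pick the emotion as the index of the largest output.</p>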
<table-wrap position="float" id="T2">
<label>TABLE 2</label>
<caption><p>Parameters and activation function for deep neural network (DNN).</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Model</td>
<td valign="top" align="left">DNN</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">No. of hidden layers</td>
<td valign="top" align="left">5</td>
</tr>
<tr>
<td valign="top" align="left">No. of neurons per hidden layer</td>
<td valign="top" align="left">13</td>
</tr>
<tr>
<td valign="top" align="left">Activation function</td>
<td valign="top" align="left">Hyperbolic tangent function</td>
</tr>
<tr>
<td valign="top" align="left">Iterations</td>
<td valign="top" align="left">100000</td>
</tr>
<tr>
<td valign="top" align="left">Goal of error</td>
<td valign="top" align="left">10<sup>&#x2212;15</sup></td>
</tr>
<tr>
<td valign="top" align="left">Learning rate</td>
<td valign="top" align="left">0.1</td>
</tr>
</tbody>
</table></table-wrap>
<p>The results of the experiments with and without EMD are compared to verify the advantage of applying EMD. First, the experimental results without EMD for the 7 emotions in the Berlin database with 10-fold validation are shown in <xref ref-type="table" rid="T3">Table 3</xref>. The average recognition rates of the 7 emotions range from 50.36% to 97.00%. The recognition rates for some emotions are high, especially for &#x201C;disgust&#x201D; and &#x201C;sadness&#x201D;, because the features of these two emotional speeches are distinctive. In contrast, the emotions &#x201C;fear&#x201D;, &#x201C;boredom&#x201D;, and &#x201C;neutral&#x201D; have features similar to one another; hence, their recognition rates are relatively low. Then, EMD is applied to extract the emotional content of the speeches. The experimental results for the 7 emotions in the Berlin database with 10-fold validation are shown in <xref ref-type="table" rid="T4">Table 4</xref>.</p>
<table-wrap position="float" id="T3">
<label>TABLE 3</label>
<caption><p>Recognition rates using DNN without empirical mode decomposition (EMD) for 7 emotions based on 10-fold validation.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Emotions Experiments</td>
<td valign="top" align="center">Anger (%)</td>
<td valign="top" align="center">Joy (%)</td>
<td valign="top" align="center">Sadness (%)</td>
<td valign="top" align="center">Fear (%)</td>
<td valign="top" align="center">Disgust (%)</td>
<td valign="top" align="center">Boredom (%)</td>
<td valign="top" align="center">Neutral (%)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">1</td>
<td valign="top" align="center">65.00</td>
<td valign="top" align="center">70.00</td>
<td valign="top" align="center">85.00</td>
<td valign="top" align="center">80.00</td>
<td valign="top" align="center">95.00</td>
<td valign="top" align="center">40.00</td>
<td valign="top" align="center">45.00</td>
</tr>
<tr>
<td valign="top" align="left">2</td>
<td valign="top" align="center">90.00</td>
<td valign="top" align="center">55.00</td>
<td valign="top" align="center">80.00</td>
<td valign="top" align="center">65.00</td>
<td valign="top" align="center">90.00</td>
<td valign="top" align="center">75.00</td>
<td valign="top" align="center">65.00</td>
</tr>
<tr>
<td valign="top" align="left">3</td>
<td valign="top" align="center">75.00</td>
<td valign="top" align="center">75.00</td>
<td valign="top" align="center">90.00</td>
<td valign="top" align="center">60.00</td>
<td valign="top" align="center">95.00</td>
<td valign="top" align="center">45.00</td>
<td valign="top" align="center">40.00</td>
</tr>
<tr>
<td valign="top" align="left">4</td>
<td valign="top" align="center">60.00</td>
<td valign="top" align="center">50.00</td>
<td valign="top" align="center">80.00</td>
<td valign="top" align="center">10.00</td>
<td valign="top" align="center">100.00</td>
<td valign="top" align="center">15.00</td>
<td valign="top" align="center">60.00</td>
</tr>
<tr>
<td valign="top" align="left">5</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">90.00</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">85.00</td>
<td valign="top" align="center">100.00</td>
<td valign="top" align="center">100.00</td>
<td valign="top" align="center">55.00</td>
</tr>
<tr>
<td valign="top" align="left">6</td>
<td valign="top" align="center">75.00</td>
<td valign="top" align="center">70.00</td>
<td valign="top" align="center">90.00</td>
<td valign="top" align="center">35.00</td>
<td valign="top" align="center">95.00</td>
<td valign="top" align="center">20.00</td>
<td valign="top" align="center">28.57</td>
</tr>
<tr>
<td valign="top" align="left">7</td>
<td valign="top" align="center">65.00</td>
<td valign="top" align="center">35.00</td>
<td valign="top" align="center">70.00</td>
<td valign="top" align="center">10.00</td>
<td valign="top" align="center">100.00</td>
<td valign="top" align="center">25.00</td>
<td valign="top" align="center">25.00</td>
</tr>
<tr>
<td valign="top" align="left">8</td>
<td valign="top" align="center">70.00</td>
<td valign="top" align="center">70.00</td>
<td valign="top" align="center">100.00</td>
<td valign="top" align="center">75.00</td>
<td valign="top" align="center">100.00</td>
<td valign="top" align="center">60.00</td>
<td valign="top" align="center">65.00</td>
</tr>
<tr>
<td valign="top" align="left">9</td>
<td valign="top" align="center">75.00</td>
<td valign="top" align="center">70.00</td>
<td valign="top" align="center">90.00</td>
<td valign="top" align="center">70.00</td>
<td valign="top" align="center">95.00</td>
<td valign="top" align="center">75.00</td>
<td valign="top" align="center">60.00</td>
</tr>
<tr>
<td valign="top" align="left">10</td>
<td valign="top" align="center">80.00</td>
<td valign="top" align="center">75.00</td>
<td valign="top" align="center">85.00</td>
<td valign="top" align="center">75.00</td>
<td valign="top" align="center">100.00</td>
<td valign="top" align="center">55.00</td>
<td valign="top" align="center">60.00</td>
</tr>
<tr>
<td valign="top" align="left">Avg.</td>
<td valign="top" align="center">65.50</td>
<td valign="top" align="center">66.00</td>
<td valign="top" align="center">77.00</td>
<td valign="top" align="center">56.50</td>
<td valign="top" align="center">97.00</td>
<td valign="top" align="center">51.00</td>
<td valign="top" align="center">50.36</td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T4">
<label>TABLE 4</label>
<caption><p>Recognition rates using DNN <italic>with</italic> EMD for 7 emotions based on 10-fold validation.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Emotions Experiments</td>
<td valign="top" align="center">Anger (%)</td>
<td valign="top" align="center">Joy (%)</td>
<td valign="top" align="center">Sadness (%)</td>
<td valign="top" align="center">Fear (%)</td>
<td valign="top" align="center">Disgust (%)</td>
<td valign="top" align="center">Boredom (%)</td>
<td valign="top" align="center">Neutral (%)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">1</td>
<td valign="top" align="center">55.00</td>
<td valign="top" align="center">80.00</td>
<td valign="top" align="center">80.00</td>
<td valign="top" align="center">80.00</td>
<td valign="top" align="center">95.00</td>
<td valign="top" align="center">60.00</td>
<td valign="top" align="center">40.00</td>
</tr>
<tr>
<td valign="top" align="left">2</td>
<td valign="top" align="center">85.00</td>
<td valign="top" align="center">80.00</td>
<td valign="top" align="center">80.00</td>
<td valign="top" align="center">60.00</td>
<td valign="top" align="center">90.00</td>
<td valign="top" align="center">75.00</td>
<td valign="top" align="center">70.00</td>
</tr>
<tr>
<td valign="top" align="left">3</td>
<td valign="top" align="center">90.00</td>
<td valign="top" align="center">70.00</td>
<td valign="top" align="center">80.00</td>
<td valign="top" align="center">60.00</td>
<td valign="top" align="center">95.00</td>
<td valign="top" align="center">50.00</td>
<td valign="top" align="center">45.00</td>
</tr>
<tr>
<td valign="top" align="left">4</td>
<td valign="top" align="center">80.00</td>
<td valign="top" align="center">70.00</td>
<td valign="top" align="center">90.00</td>
<td valign="top" align="center">20.00</td>
<td valign="top" align="center">100.00</td>
<td valign="top" align="center">20.00</td>
<td valign="top" align="center">60.00</td>
</tr>
<tr>
<td valign="top" align="left">5</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">75.00</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">85.00</td>
<td valign="top" align="center">100.00</td>
<td valign="top" align="center">100.00</td>
<td valign="top" align="center">60.00</td>
</tr>
<tr>
<td valign="top" align="left">6</td>
<td valign="top" align="center">65.00</td>
<td valign="top" align="center">70.00</td>
<td valign="top" align="center">85.00</td>
<td valign="top" align="center">25.00</td>
<td valign="top" align="center">95.00</td>
<td valign="top" align="center">25.00</td>
<td valign="top" align="center">28.57</td>
</tr>
<tr>
<td valign="top" align="left">7</td>
<td valign="top" align="center">65.00</td>
<td valign="top" align="center">50.00</td>
<td valign="top" align="center">65.00</td>
<td valign="top" align="center">10.00</td>
<td valign="top" align="center">100.00</td>
<td valign="top" align="center">15.00</td>
<td valign="top" align="center">45.00</td>
</tr>
<tr>
<td valign="top" align="left">8</td>
<td valign="top" align="center">75.00</td>
<td valign="top" align="center">75.00</td>
<td valign="top" align="center">100.00</td>
<td valign="top" align="center">70.00</td>
<td valign="top" align="center">95.00</td>
<td valign="top" align="center">70.00</td>
<td valign="top" align="center">70.00</td>
</tr>
<tr>
<td valign="top" align="left">9</td>
<td valign="top" align="center">85.00</td>
<td valign="top" align="center">75.00</td>
<td valign="top" align="center">90.00</td>
<td valign="top" align="center">70.00</td>
<td valign="top" align="center">95.00</td>
<td valign="top" align="center">75.00</td>
<td valign="top" align="center">55.00</td>
</tr>
<tr>
<td valign="top" align="left">10</td>
<td valign="top" align="center">95.00</td>
<td valign="top" align="center">85.00</td>
<td valign="top" align="center">95.00</td>
<td valign="top" align="center">85.00</td>
<td valign="top" align="center">100.00</td>
<td valign="top" align="center">60.00</td>
<td valign="top" align="center">65.00</td>
</tr>
<tr>
<td valign="top" align="left">Avg.</td>
<td valign="top" align="center">69.50</td>
<td valign="top" align="center">73.00</td>
<td valign="top" align="center">76.50</td>
<td valign="top" align="center">56.50</td>
<td valign="top" align="center">96.50</td>
<td valign="top" align="center">55.00</td>
<td valign="top" align="center">53.86</td>
</tr>
</tbody>
</table></table-wrap>
<p>The comparison of the recognition rates between the DNN with and without EMD for the 10-fold experiments is shown in <xref ref-type="table" rid="T5">Table 5</xref>. It can be seen from the table that the recognition rates of most runs, as well as the average recognition rate, are better when EMD is applied. Moreover, according to <xref ref-type="table" rid="T6">Table 6</xref>, when EMD is applied for the extraction of emotional components, better recognition rates are obtained for the emotions &#x201C;anger&#x201D;, &#x201C;joy&#x201D;, &#x201C;boredom&#x201D;, and &#x201C;neutral&#x201D;. The recognition rate for &#x201C;fear&#x201D; remains the same, while the rates for &#x201C;sadness&#x201D; and &#x201C;disgust&#x201D; are slightly lower. The average recognition rate is higher than that obtained without EMD. These experimental results verify the advantage of using EMD to extract emotional components.</p>
<table-wrap position="float" id="T5">
<label>TABLE 5</label>
<caption><p>Comparison of recognition rates for 10-fold experiments using DNN with and without EMD.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center">Set 1 (%)</td>
<td valign="top" align="center">Set 2 (%)</td>
<td valign="top" align="center">Set 3 (%)</td>
<td valign="top" align="center">Set 4 (%)</td>
<td valign="top" align="center">Set 5 (%)</td>
<td valign="top" align="center">Set 6 (%)</td>
<td valign="top" align="center">Set 7 (%)</td>
<td valign="top" align="center">Set 8 (%)</td>
<td valign="top" align="center">Set 9 (%)</td>
<td valign="top" align="center">Set 10 (%)</td>
<td valign="top" align="center">Avg. (%)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Without EMD</td>
<td valign="top" align="center">68.571</td>
<td valign="top" align="center">74.286</td>
<td valign="top" align="center">68.571</td>
<td valign="top" align="center">53.571</td>
<td valign="top" align="center">61.429</td>
<td valign="top" align="center">58.865</td>
<td valign="top" align="center">47.143</td>
<td valign="top" align="center">77.143</td>
<td valign="top" align="center">76.429</td>
<td valign="top" align="center">75.714</td>
<td valign="top" align="center">66.172</td>
</tr>
<tr>
<td valign="top" align="left">With EMD</td>
<td valign="top" align="center">70.0</td>
<td valign="top" align="center">77.143</td>
<td valign="top" align="center">70.0</td>
<td valign="top" align="center">62.857</td>
<td valign="top" align="center">60.0</td>
<td valign="top" align="center">56.028</td>
<td valign="top" align="center">50.0</td>
<td valign="top" align="center">79.286</td>
<td valign="top" align="center">77.857</td>
<td valign="top" align="center">83.571</td>
<td valign="top" align="center">68.674</td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T6">
<label>TABLE 6</label>
<caption><p>Comparison of recognition rates for 7 emotions using DNN with and without EMD.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center">Anger (%)</td>
<td valign="top" align="center">Joy (%)</td>
<td valign="top" align="center">Sadness (%)</td>
<td valign="top" align="center">Fear (%)</td>
<td valign="top" align="center">Disgust (%)</td>
<td valign="top" align="center">Boredom (%)</td>
<td valign="top" align="center">Neutral (%)</td>
<td valign="top" align="center">Avg. (%)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Without EMD</td>
<td valign="top" align="center">65.50</td>
<td valign="top" align="center">66.00</td>
<td valign="top" align="center">77.00</td>
<td valign="top" align="center">56.50</td>
<td valign="top" align="center">97.00</td>
<td valign="top" align="center">51.00</td>
<td valign="top" align="center">50.36</td>
<td valign="top" align="center">66.19</td>
</tr>
<tr>
<td valign="top" align="left">With EMD</td>
<td valign="top" align="center">69.50</td>
<td valign="top" align="center">73.00</td>
<td valign="top" align="center">76.50</td>
<td valign="top" align="center">56.50</td>
<td valign="top" align="center">96.50</td>
<td valign="top" align="center">55.00</td>
<td valign="top" align="center">53.86</td>
<td valign="top" align="center">68.69</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn><p>Moreover, the results obtained by the proposed method are compared with those in ref. (<xref ref-type="bibr" rid="B9">9</xref>), in which HMM was used. <xref ref-type="table" rid="T7">Table 7</xref> shows the results of the comparison. The proposed method achieves higher recognition rates in most folds of the experiments, as well as a higher average recognition rate over the 10-fold experiments, both with and without EMD. This verifies the performance of the proposed method.</p></fn>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="S7" sec-type="conclusion">
<title>7. Conclusion</title>
<p>In this paper, EMD is applied to extract emotional features from speech signals. The experimental results reveal that the emotion recognition rates of both classifiers, i.e., HMM and DNN, improve after applying EMD for emotional feature extraction. However, according to <xref ref-type="table" rid="T6">Table 6</xref>, EMD does not work well for two emotions, i.e., &#x201C;sadness&#x201D; and &#x201C;disgust&#x201D;. It is likely that the features of these two emotions are similar and EMD cannot effectively distinguish them. In future work, advanced variants of EMD, such as ensemble EMD (EEMD), may be used to obtain a better extraction of emotional features from emotional speech and hence better recognition rates. Furthermore, the simple DNN designed in this paper achieves better recognition rates than those obtained with HMM, whether or not EMD is applied. According to <xref ref-type="table" rid="T7">Table 7</xref>, the improvements in the average recognition rate are about 10% without EMD and about 2% with EMD. However, in our experiments, only a few minutes are needed to train the HMM, while the DNN used in this paper takes more than 40 minutes to train. Consequently, reducing the training time of the DNN remains an open problem.</p>
<table-wrap position="float" id="T7">
<label>TABLE 7</label>
<caption><p>Comparison of the results by the proposed method and those by the Hidden Markov Model (HMM) (<xref ref-type="bibr" rid="B9">9</xref>).</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center">HMM (%)</td>
<td valign="top" align="center">HMM+EMD (%)</td>
<td valign="top" align="center">DNN (%)</td>
<td valign="top" align="center">DNN+EMD (%)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">1</td>
<td valign="top" align="center">57.14</td>
<td valign="top" align="center">62.86</td>
<td valign="top" align="center">68.571</td>
<td valign="top" align="center">70.0</td>
</tr>
<tr>
<td valign="top" align="left">2</td>
<td valign="top" align="center">56.43</td>
<td valign="top" align="center">67.14</td>
<td valign="top" align="center">74.286</td>
<td valign="top" align="center">77.143</td>
</tr>
<tr>
<td valign="top" align="left">3</td>
<td valign="top" align="center">55.71</td>
<td valign="top" align="center">67.86</td>
<td valign="top" align="center">68.571</td>
<td valign="top" align="center">70.0</td>
</tr>
<tr>
<td valign="top" align="left">4</td>
<td valign="top" align="center">47.86</td>
<td valign="top" align="center">54.29</td>
<td valign="top" align="center">53.571</td>
<td valign="top" align="center">62.857</td>
</tr>
<tr>
<td valign="top" align="left">5</td>
<td valign="top" align="center">66.43</td>
<td valign="top" align="center">76.43</td>
<td valign="top" align="center">61.429</td>
<td valign="top" align="center">60.0</td>
</tr>
<tr>
<td valign="top" align="left">6</td>
<td valign="top" align="center">47.14</td>
<td valign="top" align="center">58.57</td>
<td valign="top" align="center">58.865</td>
<td valign="top" align="center">56.028</td>
</tr>
<tr>
<td valign="top" align="left">7</td>
<td valign="top" align="center">39.29</td>
<td valign="top" align="center">52.86</td>
<td valign="top" align="center">47.143</td>
<td valign="top" align="center">50.0</td>
</tr>
<tr>
<td valign="top" align="left">8</td>
<td valign="top" align="center">64.29</td>
<td valign="top" align="center">74.29</td>
<td valign="top" align="center">77.143</td>
<td valign="top" align="center">79.286</td>
</tr>
<tr>
<td valign="top" align="left">9</td>
<td valign="top" align="center">69.29</td>
<td valign="top" align="center">73.57</td>
<td valign="top" align="center">76.429</td>
<td valign="top" align="center">77.857</td>
</tr>
<tr>
<td valign="top" align="left">10</td>
<td valign="top" align="center">65.71</td>
<td valign="top" align="center">77.14</td>
<td valign="top" align="center">75.714</td>
<td valign="top" align="center">83.571</td>
</tr>
<tr>
<td valign="top" align="left">Avg.</td>
<td valign="top" align="center">56.93</td>
<td valign="top" align="center">66.50</td>
<td valign="top" align="center">66.172</td>
<td valign="top" align="center">68.674</td>
</tr>
</tbody>
</table></table-wrap>
</sec>
<sec id="S8" sec-type="author-contributions">
<title>Author contributions</title>
<p>S-TP: conceptualization, methodology, investigation, and writing&#x2014;original draft preparation. C-FC: validation, formal analysis, and writing&#x2014;review and editing. C-CH: software and resources.</p>
</sec>
</body>
<back>
<sec id="S9" sec-type="funding-information">
<title>Funding</title>
<p>This research was funded by the Ministry of Science and Technology of the Republic of China under grant numbers MOST 109-2221-E-390-014-MY2 and MOST 108-2221-E-390-018.</p>
</sec>
<sec id="S10" sec-type="COI-statement">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1"><label>1.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sahoo</surname> <given-names>KK</given-names></name> <name><surname>Dutta</surname> <given-names>I</given-names></name> <name><surname>Ijaz</surname> <given-names>MF</given-names></name> <name><surname>Wozniak</surname> <given-names>M</given-names></name> <name><surname>Singh</surname> <given-names>PK</given-names></name></person-group>. <article-title>TLEFuzzyNet: fuzzy rank-based ensemble of transfer learning models for emotion recognition from human speeches.</article-title> <source><italic>IEEE Access.</italic></source> (<year>2021</year>) <volume>9</volume>:<fpage>166518</fpage>&#x2013;<lpage>30</lpage>.</citation></ref>
<ref id="B2"><label>2.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Krishnan</surname> <given-names>PT</given-names></name> <name><surname>Joseph Raj</surname> <given-names>AN</given-names></name> <name><surname>Rajangam</surname> <given-names>V</given-names></name></person-group>. <article-title>Emotion classification from speech signal based on empirical mode decomposition and non-linear features.</article-title> <source><italic>Complex Intelligent Syst.</italic></source> (<year>2021</year>) <volume>7</volume>:<fpage>1919</fpage>&#x2013;<lpage>34</lpage>.</citation></ref>
<ref id="B3"><label>3.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schuller</surname> <given-names>B</given-names></name> <name><surname>Rigoll</surname> <given-names>G</given-names></name> <name><surname>Lang</surname> <given-names>M</given-names></name></person-group>. <article-title>Hidden Markov model-based speech emotion recognition.</article-title> In: <source><italic>Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003).</italic></source> <publisher-loc>Baltimore, MD</publisher-loc>: <publisher-name>IEEE</publisher-name> (<year>2003</year>). p. <fpage>1</fpage>&#x2013;<lpage>4</lpage>.</citation></ref>
<ref id="B4"><label>4.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fragopanagos</surname> <given-names>N</given-names></name> <name><surname>Taylor</surname> <given-names>JG</given-names></name></person-group>. <article-title>Emotion recognition in human-computer interaction.</article-title> <source><italic>Neural Networks.</italic></source> (<year>2005</year>). <volume>18</volume>:<fpage>389</fpage>&#x2013;<lpage>405</lpage>.</citation></ref>
<ref id="B5"><label>5.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cen</surname> <given-names>L</given-names></name> <name><surname>Ser</surname> <given-names>W</given-names></name> <name><surname>Yu</surname> <given-names>ZL</given-names></name> <name><surname>Cen</surname> <given-names>W</given-names></name></person-group>. <article-title>Automatic recognition of emotional states from human speeches.</article-title> In: <person-group person-group-type="editor"><name><surname>Herout</surname> <given-names>A</given-names></name></person-group> <role>editor</role>. <source><italic>Pattern Recognition Recent Advances.</italic></source> <publisher-loc>London</publisher-loc>: <publisher-name>IntechOpen</publisher-name> (<year>2011</year>).</citation></ref>
<ref id="B6"><label>6.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>S</given-names></name> <name><surname>Falk</surname> <given-names>TH</given-names></name> <name><surname>Chan</surname> <given-names>WY</given-names></name></person-group>. <article-title>Automatic speech emotion recognition using modulation spectral features.</article-title> <source><italic>Speech Commun.</italic></source> (<year>2011</year>) <volume>53</volume>:<fpage>768</fpage>&#x2013;<lpage>85</lpage>.</citation></ref>
<ref id="B7"><label>7.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Picard</surname> <given-names>RW.</given-names></name></person-group> <source><italic>Affective Computing.</italic></source> <publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Press</publisher-name> (<year>1997</year>).</citation></ref>
<ref id="B8"><label>8.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Basu</surname> <given-names>S.</given-names></name> <name><surname>Chakraborty</surname> <given-names>J.</given-names></name> <name><surname>Bag</surname> <given-names>A.</given-names></name> <name><surname>Aftabuddin</surname> <given-names>M</given-names></name></person-group>. <article-title>A review on emotion recognition using speech.</article-title> In: <source><italic>Proceedings of the International Conference on Inventive Communication and Computational Technologies (ICICCT).</italic></source> <publisher-loc>Coimbatore</publisher-loc> (<year>2017</year>). <fpage>p. 109</fpage>&#x2013;<lpage>14</lpage>.</citation></ref>
<ref id="B9"><label>9.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lee</surname> <given-names>YW</given-names></name> <name><surname>Pan</surname> <given-names>ST.</given-names></name></person-group> <source><italic>Applications of Empirical Mode Decomposition on the Computation of Emotional Speech Features.</italic></source> <publisher-loc>Taiwan</publisher-loc>: <publisher-name>National University of Kaohsiung</publisher-name> (<year>2012</year>).</citation></ref>
<ref id="B10"><label>10.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhu</surname> <given-names>L</given-names></name> <name><surname>Chen</surname> <given-names>L</given-names></name> <name><surname>Zhao</surname> <given-names>D</given-names></name> <name><surname>Zhou</surname> <given-names>J</given-names></name> <name><surname>Zhang</surname> <given-names>W</given-names></name></person-group>. <article-title>Emotion recognition from Chinese speech for smart affective services using a combination of SVM and DBN.</article-title> <source><italic>Sensors.</italic></source> (<year>2017</year>) <volume>17</volume>:<issue>1694</issue>.</citation></ref>
<ref id="B11"><label>11.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Trabelsi</surname> <given-names>I</given-names></name> <name><surname>Bouhlel</surname> <given-names>MS</given-names></name></person-group>. <article-title>Feature selection for GUMI kernel-based SVM in speech emotion recognition.</article-title> In: <collab>Information Reso Management Association</collab>, <role>editor.</role> <source><italic>Artificial Intelligence: Concepts, Methodologies, Tools, and Applications.</italic></source> <publisher-loc>Pennsylvania</publisher-loc>: <publisher-name>IGI Global</publisher-name> (<year>2017</year>). <fpage>p. 941</fpage>&#x2013;<lpage>53</lpage>.</citation></ref>
<ref id="B12"><label>12.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Patel</surname> <given-names>P</given-names></name> <name><surname>Chaudhari</surname> <given-names>A</given-names></name> <name><surname>Kale</surname> <given-names>R</given-names></name> <name><surname>Pund</surname> <given-names>M</given-names></name></person-group>. <article-title>Emotion recognition from speech with Gaussian mixture models &#x0026; via boosted GMM.</article-title> <source><italic>Int J Res Sci Eng</italic>.</source> (<year>2017</year>) <volume>3</volume>:<fpage>47</fpage>&#x2013;<lpage>53</lpage>.</citation></ref>
<ref id="B13"><label>13.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jo</surname> <given-names>Y</given-names></name> <name><surname>Lee</surname> <given-names>H</given-names></name> <name><surname>Cho</surname> <given-names>A</given-names></name> <name><surname>Whang</surname> <given-names>M</given-names></name></person-group>. <article-title>Emotion recognition through cardiovascular response in daily life using KNN classifier.</article-title> In: <person-group person-group-type="editor"><name><surname>Park</surname> <given-names>J. J.</given-names></name></person-group> <role>editor</role>. <source><italic>Advances in Computer Science and Ubiquitous Computing.</italic></source> <publisher-loc>Singapore</publisher-loc>: <publisher-name>Springer</publisher-name> (<year>2017</year>). <fpage>p. 1451</fpage>&#x2013;<lpage>6</lpage>.</citation></ref>
<ref id="B14"><label>14.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Alhagry</surname> <given-names>S</given-names></name> <name><surname>Fahmy</surname> <given-names>AA</given-names></name> <name><surname>El-Khoribi</surname> <given-names>RA</given-names></name></person-group>. <article-title>Emotion Recognition based on EEG using LSTM Recurrent Neural Network.</article-title> <source><italic>Emotion.</italic></source> (<year>2017</year>) <volume>8</volume>:<fpage>355</fpage>&#x2013;<lpage>8</lpage>.</citation></ref>
<ref id="B15"><label>15.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ghaleb</surname> <given-names>E</given-names></name> <name><surname>Popa</surname> <given-names>M</given-names></name> <name><surname>Asteriadis</surname> <given-names>S</given-names></name></person-group>. <article-title>Metric learning-based multimodal audio-visual emotion recognition.</article-title> <source><italic>IEEE Multimedia.</italic></source> (<year>2020</year>) <volume>27</volume>:<fpage>1</fpage>&#x2013;<lpage>8</lpage>.</citation></ref>
<ref id="B16"><label>16.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shang</surname> <given-names>Y</given-names></name></person-group>. <article-title>Subgraph robustness of complex networks under attacks.</article-title> <source><italic>IEEE Trans Syst Man Cybernet.</italic></source> (<year>2019</year>) <volume>49</volume>:<fpage>821</fpage>&#x2013;<lpage>32</lpage>.</citation></ref>
<ref id="B17"><label>17.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>NE</given-names></name></person-group>. <article-title>The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis.</article-title> <source><italic>Proc. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci</italic>.</source> (<year>1996</year>) <volume>454</volume>:<fpage>903</fpage>&#x2013;<lpage>95</lpage>.</citation></ref>
<ref id="B18"><label>18.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Blunsom</surname> <given-names>P.</given-names></name></person-group> <source><italic>Hidden Markov Model.</italic></source> <publisher-loc>Parkville, VIC</publisher-loc>: <publisher-name>The University of Melbourne</publisher-name> (<year>2004</year>)</citation></ref>
<ref id="B19"><label>19.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pan</surname> <given-names>ST</given-names></name> <name><surname>Hong</surname> <given-names>TP</given-names></name></person-group>. <article-title>Robust speech recognition by DHMM with a codebook trained by genetic algorithm.</article-title> <source><italic>J Informat Hiding Multimedia Signal Processing</italic>.</source> (<year>2012</year>) <volume>3</volume>:<fpage>306</fpage>&#x2013;<lpage>19</lpage>.</citation></ref>
<ref id="B20"><label>20.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pan</surname> <given-names>ST</given-names></name> <name><surname>Li</surname> <given-names>WC</given-names></name></person-group>. <article-title>Fuzzy-HMM modeling for emotion detection using electrocardiogram signals.</article-title> <source><italic>Asian J Control.</italic></source> (<year>2020</year>) <volume>22</volume>:<fpage>2206</fpage>&#x2013;<lpage>16</lpage>.</citation></ref>
<ref id="B21"><label>21.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pan</surname> <given-names>ST</given-names></name> <name><surname>Lan</surname> <given-names>ML</given-names></name></person-group>. <article-title>An efficient hybrid learning algorithm for neural network&#x2013;based speech recognition systems on FPGA chip.</article-title> <source><italic>Neural Comput Appl.</italic></source> (<year>2014</year>) <volume>24</volume>:<fpage>1879</fpage>&#x2013;<lpage>85</lpage>.</citation></ref>
<ref id="B22"><label>22.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Burkhardt</surname> <given-names>F</given-names></name> <name><surname>Paeschke</surname> <given-names>A</given-names></name> <name><surname>Rolfes</surname> <given-names>M</given-names></name> <name><surname>Sendlmeier</surname> <given-names>WF</given-names></name> <name><surname>Weiss</surname> <given-names>B</given-names></name></person-group>. <article-title>A database of German emotional speech.</article-title> <source><italic>Interspeech.</italic></source> (<year>2005</year>) <volume>5</volume>:<fpage>1517</fpage>&#x2013;<lpage>20</lpage>.</citation></ref>
<ref id="B23"><label>23.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mathews</surname> <given-names>JH</given-names></name> <name><surname>Fink</surname> <given-names>KD.</given-names></name></person-group> <source><italic>Numerical Methods Using MATLAB</italic></source>. <edition>4th ed</edition>. <publisher-loc>Hoboken, NJ</publisher-loc>: <publisher-name>Prentice-Hall</publisher-name> (<year>2004</year>)</citation></ref>
</ref-list>
</back>
</article>
