<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Bohr. Scit.</journal-id>
<journal-title>BOHR International Journal of Smart Computing and Information Technology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Bohr. Scit.</abbrev-journal-title>
<issn pub-type="epub">2583-2026</issn>
<publisher>
<publisher-name>BOHR</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.54646/bijscit.2022.28</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Detection of abnormal human behavior using deep learning</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Ghosh</surname> <given-names>Partha</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Bose</surname> <given-names>Sombit</given-names></name>
</contrib>
<contrib contrib-type="author">
<name><surname>Roy</surname> <given-names>Sayantan</given-names></name>
</contrib>
<contrib contrib-type="author">
<name><surname>Mondal</surname> <given-names>Avisek</given-names></name>
</contrib>
</contrib-group>
<aff><institution>Department of Computer Science and Engineering, Government College of Engineering and Ceramic Technology</institution>, <addr-line>Kolkata</addr-line>, <country>India</country></aff>
<author-notes>
<corresp id="c001">&#x002A;Correspondence: Partha Ghosh, <email>parth_ghos@rediffmail.com</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>02</day>
<month>11</month>
<year>2022</year>
</pub-date>
<volume>3</volume>
<issue>1</issue>
<fpage>59</fpage>
<lpage>68</lpage>
<history>
<date date-type="received">
<day>18</day>
<month>09</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>12</day>
<month>10</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2022 Ghosh, Bose, Roy and Mondal.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Ghosh, Bose, Roy and Mondal</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by-nc-nd/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Human action involves the complete human body or the postures of its various limbs. Abnormal Human Activity Recognition (Abnormal HAR) has attracted considerable attention in recent studies. However, it remains a difficult task because of complications such as sensor movement and positioning, as well as variation in how individuals carry out their activities. Identifying particular activities benefits human-centric applications such as postoperative trauma recovery, gesture detection, exercise, fitness, and home care help. HAR systems have the ability to automate or simplify many of people&#x2019;s everyday chores. HAR systems often use supervised or unsupervised learning as their foundation. Unsupervised systems operate according to a set of rules, whereas supervised systems need to be trained beforehand on specific datasets. This study conducts a detailed literature review of the activity identification techniques currently in use. Three methods for identifying abnormal actions are examined in this inquiry: wearable device-based, pose-based, and smartphone sensor-based. Wearable devices collect data through body-mounted sensors, whereas smartphones provide input from their gyroscopes and accelerometers. Pose estimation uses a neural network to categorize activities. The Anomalous Action Detection Dataset (Ano-AAD) is created and improved using several methods. The study examines fresh datasets, including UCF-Crime, and innovative models. A new pattern in anomalous HAR systems has emerged, linking anomalous HAR tasks to computer vision applications including security, video surveillance, and home monitoring. The survey also examines the issues faced by vision-based HAR and their potential solutions.</p>
</abstract>
<kwd-group>
<kwd>HAR</kwd>
<kwd>LRCN</kwd>
<kwd>LSTM</kwd>
<kwd>GRU</kwd>
<kwd>abnormal human behavior</kwd>
</kwd-group>
<counts>
<fig-count count="16"/>
<table-count count="6"/>
<equation-count count="4"/>
<ref-count count="13"/>
<page-count count="10"/>
<word-count count="4467"/>
</counts>
</article-meta>
</front>
<body>
<sec id="S1" sec-type="intro">
<title>1. Introduction</title>
<p>Human action involves the complete human body or the postures of its various limbs. Abnormal Human Activity Recognition (Abnormal HAR) has attracted considerable attention in recent studies. However, it remains a difficult task because of complications such as sensor movement and positioning, as well as variation in how individuals carry out their activities. Applications that focus on people, such as gesture recognition, exercise, fitness, and home care support, benefit from recognizing particular activities. HAR systems have the ability to automate or simplify many of people&#x2019;s everyday chores. They often use supervised or unsupervised learning as their foundation. Unsupervised systems operate according to a set of rules, whereas supervised systems need to be trained beforehand on specific datasets. This literature review thoroughly examines recent developments in activity identification algorithms. Three methods are examined in this investigation: pose-based, smartphone sensor-based, and wearable device-based. Smartphone sensors obtain data from gyroscopes and accelerometers, while wearable devices gather data via body-mounted sensors. The pose-based method employs a neural network to estimate body key points and classifies activities from the estimated posture.</p>
<sec id="S1.SS1">
<title>1.1. Project overview/specifications</title>
<p>In this project, we have created a new dataset, the Anomalous Action Detection Dataset (Ano-AAD), to study anomalous behavior using deep learning models such as convolutional LSTM-GRU and Long Recurrent Convolutional Network (LRCN) (<xref ref-type="bibr" rid="B1">1</xref>). Our dataset is divided into two parts: 1. Anomaly videos and 2. Normal videos. The dataset contains 392 videos in total: 351 anomaly videos and 41 normal videos. The anomaly section comprises nine classes:</p>
<p>Burglary (49 videos, 78 min), Fighting (50 videos, 85 min), Explosion (49 videos, 72 min), Fire raising (52 videos, 96 min), Ill treatment (32 videos, 68 min), Traffic Irregularities (5 videos, 3 min), Violence (26 videos, 37 min), Arrest (50 videos, 93 min), and Attack (38 videos, 71 min). LRCN has achieved 87% testing accuracy, and convolutional LSTM-GRU has achieved 94% testing accuracy.</p>
</sec>
<sec id="S1.SS2">
<title>1.2. Hardware and software specification</title>
<p>GPU: 1x Tesla K80 (compute capability 3.7, 2496 CUDA cores, 12 GB GDDR5 VRAM)</p>
<p>CPU: 1x single-core hyper-threaded Xeon processor at 2.3 GHz (1 core, 2 threads)</p>
<p>RAM: &#x223C;12.6 GB Available</p>
<p>Disk: &#x223C;33 GB Available</p>
<p>Google Colab is the software platform we used to work on our dataset.</p>
</sec>
</sec>
<sec id="S2">
<title>2. Literature survey</title>
<p>There are three approaches to HAR:</p>
<list list-type="simple">
<list-item>
<label>(1)</label>
<p>Pose-based approach (vision-based approach): This approach uses the body&#x2019;s key points, represented as pixel coordinates, for feature extraction and activity identification.</p>
</list-item>
<list-item>
<label>(2)</label>
<p>Smartphone sensor-based approach: Here, the sensors built into smartphones are used.</p>
</list-item>
<list-item>
<label>(3)</label>
<p>Wearable sensor-based approach: Here, sensors mounted on the human body collect data from it.</p>
</list-item>
</list>
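As a concrete illustration of the pose-based approach described above, the sketch below flattens 2D body key points (pixel coordinates) into a normalized feature vector that a classifier could consume. The function name, joint layout, and frame size are hypothetical, not taken from any specific system.

```python
import numpy as np

def keypoints_to_features(keypoints, frame_w, frame_h):
    """Flatten 2D body key points (pixel coordinates) into a
    normalized feature vector in [0, 1], as pose-based HAR does."""
    pts = np.asarray(keypoints, dtype=np.float32)  # shape (num_joints, 2)
    pts[:, 0] /= frame_w   # normalize x by frame width
    pts[:, 1] /= frame_h   # normalize y by frame height
    return pts.flatten()

# Hypothetical 3-joint skeleton detected in a 640x480 frame
feat = keypoints_to_features([(320, 240), (160, 120), (480, 360)], 640, 480)
```

Normalizing by the frame dimensions makes the features independent of camera resolution, which is one reason pixel-coordinate key points transfer across videos.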
<p>In line with our research topic, we focus on the vision-based approach.</p>
<sec id="S2.SS1">
<title>2.1. Existing methods</title>
<p>There are mainly three deep learning methodologies:</p>
<list list-type="simple">
<list-item>
<label>1.</label>
<p>Generative methods (unsupervised) [e.g., autoencoders (<xref ref-type="bibr" rid="B2">2</xref>, <xref ref-type="bibr" rid="B3">3</xref>), GANs]</p>
</list-item>
<list-item>
<label>2.</label>
<p>Discriminative methods (supervised) [e.g., DNN, CNN, RNN, RNN+LSTM (<xref ref-type="bibr" rid="B1">1</xref>)]</p>
</list-item>
<list-item>
<label>3.</label>
<p>Hybrid methods (integrate both)</p>
</list-item>
</list>
<p>These methods are applied to a variety of popular deep learning datasets.</p>
</sec>
<sec id="S2.SS2">
<title>2.2 Related works</title>
<p>Recognition and comprehension of human behavior have received a great deal of attention lately (<xref ref-type="bibr" rid="B4">4</xref>&#x2013;<xref ref-type="bibr" rid="B6">6</xref>, <xref ref-type="bibr" rid="B7">7</xref>, <xref ref-type="bibr" rid="B8">8</xref>). Many strategies have been utilized to understand behavioral and activity patterns in a scene. In this effort, we have mostly examined articles from 2018 to 2022. Related works (mostly related or close to HAR anomalies) cover motion detection, face detection, shoplifting (<xref ref-type="bibr" rid="B5">5</xref>), tracking, loitering detection, abandoned-luggage detection, crowd behavior, and snatch detection algorithms. Convolutional neural networks (CNNs) have demonstrated impressive performance in computer vision in recent years (<xref ref-type="bibr" rid="B9">9</xref>, <xref ref-type="bibr" rid="B10">10</xref>).</p>
<p>Researchers have used pretrained models such as AlexNet, VGGNet, ResNet, and Inception (<xref ref-type="bibr" rid="B9">9</xref>) to increase accuracy. In particular, 3D-CNNs focus on extracting spatial and temporal details from videos. Researchers have also used autoencoders, RNNs, LSTMs (<xref ref-type="bibr" rid="B6">6</xref>), and GAN-like systems combined with newer learning methods such as transfer learning (<xref ref-type="bibr" rid="B9">9</xref>) and meta learning, as well as combined-architecture models, to improve accuracy.</p>
<p>In the following sections, we list the features, accuracy, and number of parameters of some pretrained network models. We also list the datasets considered in our survey and the work done by other researchers on them.</p>
<sec id="S2.SS2.SSS1">
<title>2.2.1. Features of pretrained network models</title>
<p>Features of pretrained network models (<xref ref-type="bibr" rid="B9">9</xref>, <xref ref-type="bibr" rid="B10">10</xref>) are depicted in <xref ref-type="table" rid="T1">Table 1</xref> below.</p>
<table-wrap position="float" id="T1">
<label>TABLE 1</label>
<caption><p>Features of pre-trained network models.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Network architecture</td>
<td valign="top" align="left">Features</td>
<td valign="top" align="center">Accuracy</td>
<td valign="top" align="center">Parameters</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">AlexNet</td>
<td valign="top" align="left">Deeper</td>
<td valign="top" align="center">84.70%</td>
<td valign="top" align="center">62 million</td>
</tr>
<tr>
<td valign="top" align="left">VGGNet</td>
<td valign="top" align="left">Fixed size kernel</td>
<td valign="top" align="center">92.30%</td>
<td valign="top" align="center">138 million</td>
</tr>
<tr>
<td valign="top" align="left">ResNet</td>
<td valign="top" align="left">Skip (shortcut) connections</td>
<td valign="top" align="center">95.51%</td>
<td valign="top" align="center">60.3 million</td>
</tr>
<tr>
<td valign="top" align="left">Inception</td>
<td valign="top" align="left">Parallel wider kernels</td>
<td valign="top" align="center">93.30%</td>
<td valign="top" align="center">6.4 million</td>
</tr>
</tbody>
</table></table-wrap>
</sec>
<sec id="S2.SS2.SSS2">
<title>2.2.2. Considered datasets in our survey</title>
<p>We have done an extensive survey on how researchers worked on anomaly-related datasets like UCF-Crime and its subsets like HR-Crime, XD-Violence, UCSD Anomaly Detection Dataset (crowd anomaly), Shanghai-Tech, LAD (large anomaly dataset), Avenue, CAVIAR, PETS-2016, and so on, and we have listed their methods and achieved accuracy.</p>
<p>Sultani et al. (<xref ref-type="bibr" rid="B4">4</xref>) from the University of Central Florida created the UCF-Crime dataset. It has a total of 1,900 videos in 13 classes (e.g., Abuse, Arson) with a total video length of 128 h.</p>
</sec>
<sec id="S2.SS2.SSS3">
<title>2.2.3. Mascorro&#x2019;s concept to analyze pre-crime scenes</title>
<p>Mascorro et al. (<xref ref-type="bibr" rid="B5">5</xref>) used 3D-CNN to detect abnormal behavior on shoplifting cases. They introduced a new concept to analyze precrime scenes:</p>
<list list-type="simple">
<list-item>
<label>(1)</label>
<p>Strict crime moment (SCM): The exact portion of the video clip in which the shoplifting crime is depicted.</p>
</list-item>
<list-item>
<label>(2)</label>
<p>Comprehensive crime moment (CCM): This is the exact second that a regular person may recognize the suspect&#x2019;s actions. This stage also includes noting failed efforts to rearrange items.</p>
</list-item>
<list-item>
<label>(3)</label>
<p>Crime lapse (CL): The span of the video clip over which the crime unfolds. If this lapse is removed, it is no longer feasible to prove that the video contains a criminal act.</p>
</list-item>
<list-item>
<label>(4)</label>
<p>Precrime behavior (PCB): The PCB describes what occurs before the suspect is identified and before CCM really starts.</p>
</list-item>
</list>
</sec>
<sec id="S2.SS2.SSS4">
<title>2.2.4. Some popular works done by other researchers</title>
<p>Sultani et al. (<xref ref-type="bibr" rid="B4">4</xref>) used deep neural networks with multiple instance learning to classify real-world anomalies including accidents, explosions, conflicts, abuse, and arson. Their method achieved an AUC of 75.41%. Using the C3D and TCNN architectures, they obtained accuracies of 23.0% and 28.4%, respectively.</p>
<p>Sabokrou et al. (<xref ref-type="bibr" rid="B2">2</xref>) used CNNs with 3D deep autoencoders to detect irregularities in videos.</p>
<p>Ullah et al. (<xref ref-type="bibr" rid="B6">6</xref>) utilized an approach in which 15 consecutive video frames are used to construct a feature vector, which is then fed into a multilayer bidirectional LSTM to detect anomalous occurrences, achieving an accuracy of 85.53%. On UCF-Crime, VGG-19 with a multilayer BD-LSTM achieved an accuracy of 82%, and Inception V3 with a multilayer BD-LSTM achieved 80%.</p>
<p>Hasan et al. (<xref ref-type="bibr" rid="B3">3</xref>) created a convolutional autoencoder (Conv-AE) framework for scene reconstruction and then estimated reconstruction costs for abnormality detection.</p>
<p>Dubey et al. (<xref ref-type="bibr" rid="B7">7</xref>) suggested the 3D deep Multiple Instance Learning with ResNet (MILR) approach as well as a novel proposed ranking loss function. With that new ranking loss function, they obtained an AUC of 76.67%.</p>
<p>In their suggested technique, Nasaruddin et al. (<xref ref-type="bibr" rid="B11">11</xref>) used strong background subtraction to extract motion and identify the locations of attention regions. The resulting regions are then fed to a 3D CNN. They made full use of C3D (3-dimensional convolution), developing a deep convolutional network to discern between typical and anomalous occurrences. Their locality learning model achieved an accuracy of 99.25%.</p>
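The background-subtraction step mentioned above can be illustrated with a minimal numpy sketch. This uses a simple per-pixel median background model as an illustrative stand-in; it is not the exact subtraction method of Nasaruddin et al., and the threshold value is an assumption.

```python
import numpy as np

def motion_mask(frames, threshold=25):
    """Median-background subtraction: model the static background as the
    per-pixel median over time, then flag pixels that deviate strongly."""
    frames = np.asarray(frames, dtype=np.float32)   # (T, H, W) grayscale stack
    background = np.median(frames, axis=0)          # static background estimate
    diff = np.abs(frames - background)              # per-frame deviation
    return diff > threshold                         # boolean motion mask per frame

# Toy example: 5 static frames with one bright moving blob in the last frame
frames = np.zeros((5, 4, 4))
frames[-1, 1, 1] = 255
mask = motion_mask(frames)
```

The regions where the mask is true are the "attention areas" that would then be cropped and passed to a 3D CNN.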
</sec>
<sec id="S2.SS2.SSS5">
<title>2.2.5. Datasets and results of various datasets</title>
<p>AUC results based on publicly available codes (<xref ref-type="bibr" rid="B8">8</xref>, <xref ref-type="bibr" rid="B9">9</xref>) are shown in <xref ref-type="table" rid="T2">Table 2</xref> below.</p>
<table-wrap position="float" id="T2">
<label>TABLE 2</label>
<caption><p>AUC results based on publicly available codes.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Methods</td>
<td valign="top" align="left">Learning type</td>
<td valign="top" align="center">UCSD Ped2</td>
<td valign="top" align="center">Shanghai-tech</td>
<td valign="top" align="center">UCF-crime</td>
<td valign="top" align="center">Avenue</td>
<td valign="top" align="center">LAD</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Sparse</td>
<td valign="top" align="left">Unsupervised</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">65.51</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">50.31</td>
</tr>
<tr>
<td valign="top" align="left">ConvAE</td>
<td valign="top" align="left">Unsupervised</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">50.60</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">53.24</td>
</tr>
<tr>
<td valign="top" align="left">GMM</td>
<td valign="top" align="left">Unsupervised</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">56.43</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">41.02</td>
</tr>
<tr>
<td valign="top" align="left">Stacked RNN</td>
<td valign="top" align="left">Unsupervised</td>
<td valign="top" align="center">52.58</td>
<td valign="top" align="center">67.66</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">70.09</td>
<td valign="top" align="center">49.42</td>
</tr>
<tr>
<td valign="top" align="left">U-Net</td>
<td valign="top" align="left">Unsupervised</td>
<td valign="top" align="center">71.26</td>
<td valign="top" align="center">56.69</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">55.26</td>
<td valign="top" align="center">53.96</td>
</tr>
<tr>
<td valign="top" align="left">MNAD</td>
<td valign="top" align="left">Unsupervised</td>
<td valign="top" align="center">46.72</td>
<td valign="top" align="center">51.13</td>
<td valign="top" align="center">56.20</td>
<td valign="top" align="center">73.58</td>
<td valign="top" align="center">45.84</td>
</tr>
<tr>
<td valign="top" align="left">OGNet</td>
<td valign="top" align="left">Unsupervised</td>
<td valign="top" align="center">69.08</td>
<td valign="top" align="center">69.26</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">63.23</td>
<td valign="top" align="center">55.07</td>
</tr>
<tr>
<td valign="top" align="left">DeepMIL</td>
<td valign="top" align="left">Weakly supervised</td>
<td valign="top" align="center">90.09</td>
<td valign="top" align="center">86.30</td>
<td valign="top" align="center">75.41</td>
<td valign="top" align="center">87.53</td>
<td valign="top" align="center">70.18</td>
</tr>
<tr>
<td valign="top" align="left">MLEP</td>
<td valign="top" align="left">Weakly supervised</td>
<td valign="top" align="center">&#x2013;</td>
<td valign="top" align="center">73.40</td>
<td valign="top" align="center">50.01</td>
<td valign="top" align="center">89.20</td>
<td valign="top" align="center">50.57</td>
</tr>
<tr>
<td valign="top" align="left">AR-Net</td>
<td valign="top" align="left">Weakly supervised</td>
<td valign="top" align="center">93.64</td>
<td valign="top" align="center">91.24</td>
<td valign="top" align="center">74.36</td>
<td valign="top" align="center">89.31</td>
<td valign="top" align="center">79.84</td>
</tr>
</tbody>
</table></table-wrap>
<p>Results are taken from &#x201C;Anomaly detection in video sequences: A benchmark and computational model&#x201D; by Boyang et al. (<xref ref-type="bibr" rid="B8">8</xref>, <xref ref-type="bibr" rid="B9">9</xref>).</p>
</sec>
</sec>
</sec>
<sec id="S3">
<title>3. Dataset and preprocessing</title>
<sec id="S3.SS1">
<title>3.1. Our own dataset</title>
<p>We have created a new dataset to detect anomalous behavior and named it the Anomalous Action Detection Dataset (Ano-AAD). Our dataset is divided into two parts: 1. Anomaly videos and 2. Normal videos (41 videos, 62 min). The dataset contains 392 videos in total: 351 anomaly videos and 41 normal videos. The anomaly section comprises nine classes: 1. Burglary (49 videos, 78 min), 2. Fighting (50 videos, 85 min), 3. Explosion (49 videos, 72 min), 4. Fire raising (52 videos, 96 min), 5. Ill treatment (32 videos, 68 min), 6. Traffic irregularities (5 videos, 3 min), 7. Violence (26 videos, 37 min), 8. Arrest (50 videos, 93 min), and 9. Attack (38 videos, 71 min).</p>
<sec id="S3.SS1.SSS1">
<title>3.1.1. Ano-AAD dataset</title>
<p>The total video length of the anomaly part is 10 h 03 min. The total video length of the normal part is 62 min. The total video length of the entire Ano-AAD dataset is 11 h 05 min. The average length of a video is 1 min 42 s. Snapshots of instances of different categories of action from our dataset are shown in <xref ref-type="fig" rid="F1">Figure 1</xref> below.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p>Snapshots of instances of different categories of action from our dataset.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijscit-2022-28-g001.tif"/>
</fig>
</sec>
</sec>
<sec id="S3.SS2">
<title>3.2. Our own model and used libraries</title>
<sec id="S3.SS2.SSS1">
<title>3.2.1. Model</title>
<p>We have built our own convolutional Long Short-Term Memory Gated Recurrent Unit (Conv-LSTM-GRU) model and Long Recurrent Convolutional Network (LRCN) model for prediction.</p>
<p>Python libraries used:</p>
<list list-type="simple">
<list-item><p>1. os, 2. cv2, 3. math, 4. random, 5. numpy, 6. tensorflow, 7. collections, 8. matplotlib, 9. moviepy, 10. sklearn, and 11. BeautifulSoup.</p>
</list-item>
</list>
</sec>
</sec>
<sec id="S3.SS3">
<title>3.3. Dataset preprocessing</title>
<p>We preprocess the dataset mainly to reduce the number of computations and make training our deep learning models easier. The following steps are performed:</p>
<list list-type="simple">
<list-item>
<label>1.</label>
<p>Reading the video files from the dataset and resizing the frames to a fixed width and height.</p>
</list-item>
<list-item>
<label>2.</label>
<p>Normalizing the data to the range [0, 1] by dividing by 255.</p>
</list-item>
</list>
<p>Here, the frame size is 64&#x00D7;64 (<italic>height</italic> &#x00D7; <italic>width</italic>).</p>
<p>The sequence length is 20.</p>
<p>We introduce frames_extraction(), which generates a list of resized and normalized frames from a video whose path is supplied as an argument. The function reads the video frame by frame, but not every frame is added to the list, since we only require a fixed number of evenly spaced frames per sequence. The training set accounts for 75% of the dataset, and the test set makes up the remaining 25%.</p>
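The sampling and normalization logic of frames_extraction() can be sketched without OpenCV by operating on an already-decoded frame array. The skip-window computation and the division by 255 follow the preprocessing described above; the helper names and the synthetic video are illustrative assumptions.

```python
import numpy as np

SEQUENCE_LENGTH = 20            # frames kept per video, as in our preprocessing
IMAGE_HEIGHT, IMAGE_WIDTH = 64, 64

def sample_frame_indices(total_frames, seq_len=SEQUENCE_LENGTH):
    """Pick seq_len evenly spaced frame indices from a video, mirroring
    the skip-window logic used inside frames_extraction()."""
    skip = max(total_frames // seq_len, 1)
    return [i * skip for i in range(seq_len)]

def normalize_frames(frames):
    """Scale pixel values from [0, 255] into [0, 1] by dividing by 255."""
    return np.asarray(frames, dtype=np.float32) / 255.0

# Synthetic 100-frame grayscale video standing in for a decoded clip
video = np.full((100, IMAGE_HEIGHT, IMAGE_WIDTH), 128, dtype=np.uint8)
idx = sample_frame_indices(len(video))
clip = normalize_frames(video[idx])   # shape (20, 64, 64), values in [0, 1]
```

In the real pipeline each sampled frame would come from cv2.VideoCapture and be resized to 64&#x00D7;64 before normalization.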
</sec>
<sec id="S3.SS4">
<title>3.4. Our models: convolutional-LSTM-GRU and LRCN</title>
<sec id="S3.SS4.SSS1">
<title>3.4.1. Conv-LSTM-GRU</title>
<p>A time series is a collection of data gathered over time. In such cases, a model based on LSTM, a recurrent neural network architecture, is an attractive solution. In this design, the previous hidden state is passed to the next step in the sequence. As a result, the network stores information based on past data and uses it to make decisions; in other words, data order is crucial.</p>
<p>When working with images, a CNN architecture is the best option. Convolutional layers are used to extract essential features from the image. After a series of convolutional layers, the output is passed to a fully connected dense network. In the case of sequences of images, Conv-LSTM layers can be used. A Conv-LSTM is a recurrent layer like the LSTM, except that the internal matrix multiplications are replaced with convolution operations. As a result, data passing through the Conv-LSTM cells keeps its original spatial dimensions.</p>
<p>GRUs are quite similar to LSTMs. A GRU uses gates to regulate the information flow, just as an LSTM does. GRUs are relatively new compared to LSTMs; they have a simpler design and therefore provide certain improvements over LSTMs. To construct a new model that makes predictions over a video, treated as time series data consisting of a sequence of frames, we integrate the properties of Conv-LSTM and GRU. <xref ref-type="fig" rid="F2">Figures 2</xref>, <xref ref-type="fig" rid="F3">3</xref>, respectively, depict Conv-LSTM (<xref ref-type="bibr" rid="B12">12</xref>) and GRU (<xref ref-type="bibr" rid="B13">13</xref>).</p>
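The combination of Conv-LSTM and GRU layers described above can be sketched in Keras as follows. This is a minimal sketch only: the filter counts, pooling sizes, and unit counts are illustrative assumptions, not the exact published architecture (which is given in Figure 6).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, H, W, C, NUM_CLASSES = 20, 64, 64, 3, 9

def build_conv_lstm_gru(num_classes=NUM_CLASSES):
    """Sketch of a Conv-LSTM + GRU video classifier; sizes are illustrative."""
    model = models.Sequential([
        # ConvLSTM replaces LSTM's matrix multiplications with convolutions,
        # so spatial structure is preserved across time steps.
        layers.ConvLSTM2D(16, kernel_size=(3, 3), return_sequences=True,
                          input_shape=(SEQ_LEN, H, W, C)),
        layers.TimeDistributed(layers.MaxPooling2D((4, 4))),
        layers.TimeDistributed(layers.Flatten()),
        # A GRU summarizes the frame sequence with fewer gates than an LSTM.
        layers.GRU(32),
        layers.Dense(num_classes, activation="softmax"),
    ])
    return model

model = build_conv_lstm_gru()
```

The ConvLSTM2D stage keeps per-frame spatial maps, while the GRU compresses the 20-step sequence into a single vector before classification over the nine classes.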
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p>Convolutional Long Short Term Memory.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijscit-2022-28-g002.tif"/>
</fig>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption><p>Gated recurrent unit.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijscit-2022-28-g003.tif"/>
</fig>
</sec>
<sec id="S3.SS4.SSS2">
<title>3.4.2. Long recurrent convolutional network (<xref ref-type="bibr" rid="B1">1</xref>)</title>
<p>Long-term recurrent convolutional networks (LRCNs) are architectures that use CNNs for visual recognition and extend them to time-varying inputs and outputs. They pass visual inputs (of potentially variable length) through CNNs into recurrent sequence models (LSTMs), producing variable-length predictions. The CNN and LSTM weights are shared across time, allowing scaling to arbitrary sequence lengths. The architecture of LRCN is depicted in <xref ref-type="fig" rid="F4">Figure 4</xref> (<xref ref-type="bibr" rid="B1">1</xref>).</p>
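An LRCN of the kind described above can be sketched in Keras with a TimeDistributed CNN feeding an LSTM. Layer sizes here are illustrative assumptions; the exact model is shown in Figure 8.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, H, W, C, NUM_CLASSES = 20, 64, 64, 3, 9

def build_lrcn(num_classes=NUM_CLASSES):
    """Sketch of an LRCN: a TimeDistributed CNN extracts per-frame features,
    and an LSTM models the sequence. Sizes are illustrative."""
    model = models.Sequential([
        # The same CNN weights are applied to every frame (weight sharing),
        # which is what lets LRCN scale to arbitrary sequence lengths.
        layers.TimeDistributed(layers.Conv2D(16, (3, 3), activation="relu"),
                               input_shape=(SEQ_LEN, H, W, C)),
        layers.TimeDistributed(layers.MaxPooling2D((4, 4))),
        layers.TimeDistributed(layers.Flatten()),
        layers.LSTM(32),
        layers.Dense(num_classes, activation="softmax"),
    ])
    return model

model = build_lrcn()
```

Unlike the Conv-LSTM sketch, the recurrence here operates on flattened per-frame feature vectors rather than on spatial maps.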
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption><p>Architecture of LRCN.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijscit-2022-28-g004.tif"/>
</fig>
</sec>
</sec>
<sec id="S3.SS5">
<title>3.5. Model description</title>
<p>The models we have used in our experiments are discussed in the following sections.</p>
<sec id="S3.SS5.SSS1">
<title>3.5.1. Model description: CONV-LSTM-GRU</title>
<p>The number of parameters in CONV-LSTM-GRU is given in <xref ref-type="fig" rid="F5">Figure 5</xref>.</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption><p>Number of parameters in CONV-LSTM-GRU.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijscit-2022-28-g005.tif"/>
</fig>
<p>The design of the CONV-LSTM-GRU model is depicted in <xref ref-type="fig" rid="F6">Figure 6</xref> below.</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption><p>Design of the CONV-LSTM-GRU model.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijscit-2022-28-g006.tif"/>
</fig>
</sec>
<sec id="S3.SS5.SSS2">
<title>3.5.2. Model description: our model LRCN</title>
<p>The number of parameters used in LRCN is shown in <xref ref-type="fig" rid="F7">Figure 7</xref> below.</p>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption><p>Number of parameters in LRCN.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijscit-2022-28-g007.tif"/>
</fig>
<p>The design of LRCN model is presented in <xref ref-type="fig" rid="F8">Figure 8</xref> below.</p>
<fig id="F8" position="float">
<label>FIGURE 8</label>
<caption><p>Design of LRCN model.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijscit-2022-28-g008.tif"/>
</fig>
</sec>
</sec>
<sec id="S3.SS6">
<title>3.6. Training parameters</title>
<p>The LRCN model was trained with the Adam optimizer and categorical cross-entropy as the loss function, with batch size = 4 and epochs = 80. The CONV-LSTM-GRU model was trained with the Adam optimizer and categorical cross-entropy as the loss function, with batch size = 4 and epochs = 35.</p>
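The training configuration above (Adam, categorical cross-entropy, batch size 4) translates to the following Keras calls. A tiny stand-in model and synthetic data are used here purely so the snippet is self-contained; only the compile/fit settings reflect the paper.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal stand-in model; the real models are described in Section 3.5.
model = models.Sequential([
    layers.Dense(9, activation="softmax", input_shape=(8,)),
])

# Training configuration as in the paper: Adam optimizer,
# categorical cross-entropy loss, batch size 4.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Tiny synthetic batch just to show the fit() call shape (epochs shortened).
x = np.random.rand(8, 8).astype("float32")
y = tf.keras.utils.to_categorical(np.random.randint(0, 9, size=8), 9)
history = model.fit(x, y, batch_size=4, epochs=2, verbose=0)
```

For the real runs, epochs would be set to 80 (LRCN) or 35 (CONV-LSTM-GRU).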
</sec>
</sec>
<sec id="S4">
<title>4. Experimental results of our work</title>
<p>We discuss the experimental findings in the sections that follow.</p>
<sec id="S4.SS1">
<title>4.1. Results on our dataset</title>
<p>LRCN has achieved 87% accuracy and Conv-LSTM-GRU has achieved 94% accuracy on our dataset. Accuracies based on our methods, Conv LSTM-GRU, and LRCN are mentioned in <xref ref-type="table" rid="T3">Table 3</xref>.</p>
<table-wrap position="float" id="T3">
<label>TABLE 3</label>
<caption><p>Accuracies based on our methods, Conv LSTM-GRU and LRCN.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Methods</td>
<td valign="top" align="center">Accuracy</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Conv LSTM-GRU</td>
<td valign="top" align="center">94%</td>
</tr>
<tr>
<td valign="top" align="left">LRCN</td>
<td valign="top" align="center">87%</td>
</tr>
</tbody>
</table></table-wrap>
<p>In the following subsections, we briefly explain the results on our dataset.</p>
<p>We performed the following analyses:</p>
<list list-type="simple">
<list-item>
<label>1.</label>
<p>Total loss versus validation loss graph using Conv LSTM-GRU and LRCN</p>
</list-item>
<list-item>
<label>2.</label>
<p>Total accuracy versus total validation accuracy graph using Conv LSTM-GRU and LRCN</p>
</list-item>
<list-item>
<label>3.</label>
<p>Confusion matrix, precision, recall, and F1-score on our dataset using Conv LSTM-GRU and LRCN</p>
</list-item>
</list>
<sec id="S4.SS1.SSS1">
<title>4.1.1. Conv-LSTM-GRU: total loss versus validation loss graph</title>
<p>From <xref ref-type="fig" rid="F9">Figure 9</xref>, we can clearly see that the loss decreases as the number of epochs increases; hence, we can conclude that the model has converged to a good minimum. The validation loss also decreases along with the training loss; hence, the model does not suffer from overfitting.</p>
<fig id="F9" position="float">
<label>FIGURE 9</label>
<caption><p>Conv-LSTM-GRU: Total loss vs validation loss graph.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijscit-2022-28-g009.tif"/>
</fig>
</sec>
<sec id="S4.SS1.SSS2">
<title>4.1.2. Conv-LSTM-GRU: total accuracy versus total validation accuracy graph</title>
<p>From <xref ref-type="fig" rid="F10">Figure 10</xref>, we can clearly see that the accuracy increases as the number of epochs increases; hence, we can conclude that the model fits the data well and does not suffer from overfitting.</p>
<fig id="F10" position="float">
<label>FIGURE 10</label>
<caption><p>Conv-LSTM-GRU: Total accuracy vs total validation accuracy graph.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijscit-2022-28-g010.tif"/>
</fig>
</sec>
<sec id="S4.SS1.SSS3">
<title>4.2.3. LRCN: total loss versus validation loss graph</title>
<p>From <xref ref-type="fig" rid="F11">Figure 11</xref>, we can see that the loss decreases as the number of epochs increases; hence, we can conclude that the model has converged toward a minimum. The validation loss decreases along with the training loss, so the model does not appear to suffer from overfitting.</p>
<fig id="F11" position="float">
<label>FIGURE 11</label>
<caption><p>LRCN: Total loss vs validation loss graph.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijscit-2022-28-g011.tif"/>
</fig>
</sec>
<sec id="S4.SS1.SSS4">
<title>4.2.4. LRCN: total accuracy versus total validation accuracy graph</title>
<p>From <xref ref-type="fig" rid="F12">Figure 12</xref>, we can see that the accuracy increases as the number of epochs increases, and that training and validation accuracy track each other closely; hence, the model fits well and does not appear to suffer from overfitting.</p>
<fig id="F12" position="float">
<label>FIGURE 12</label>
<caption><p>LRCN: Total accuracy vs total validation accuracy graph.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijscit-2022-28-g012.tif"/>
</fig>
</sec>
</sec>
</sec>
<sec id="S5">
<title>5. Evaluation of models</title>
<p>We now describe how our proposed models were evaluated.</p>
<sec id="S5.SS1">
<title>5.1. AUC and ROC curves</title>
<sec id="S5.SS1.SSS1">
<title>5.1.1. Conv-LSTM-GRU: AUC and ROC curve</title>
<p>From <xref ref-type="fig" rid="F13">Figure 13</xref>, we can see that the ROC (receiver operating characteristic) curves and the areas under them (AUC) for the 10 classes are close to 1, which indicates that our classifier can distinguish between the positive and negative class points for nearly all samples.</p>
<fig id="F13" position="float">
<label>FIGURE 13</label>
<caption><p>Conv-LSTM-GRU: AUC and ROC plot.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijscit-2022-28-g013.tif"/>
</fig>
</sec>
<sec id="S5.SS1.SSS2">
<title>5.1.2. LRCN: AUC and ROC curve</title>
<p>From <xref ref-type="fig" rid="F14">Figure 14</xref>, we can see that the ROC curves and AUC values for the 10 classes are close to 1, which indicates that our classifier can distinguish between the positive and negative class points for nearly all samples.</p>
<fig id="F14" position="float">
<label>FIGURE 14</label>
<caption><p>LRCN: AUC and ROC plot.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijscit-2022-28-g014.tif"/>
</fig>
</sec>
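<p>For a single class in the one-vs-rest setting, the AUC equals the Mann–Whitney rank statistic: the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. The pure-Python sketch below is illustrative (it is not our actual evaluation code, and the labels and scores are hypothetical):</p>

```python
def auc_score(labels, scores):
    """AUC via the Mann-Whitney statistic: the probability that a
    randomly chosen positive sample is scored above a random negative
    (ties count as half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative samples")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# A perfect ranking (all positives above all negatives) gives AUC = 1.0
print(auc_score([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # 1.0
```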
</sec>
<sec id="S5.SS2">
<title>5.2. Confusion matrix, precision, recall, and F1-score</title>
<p>Confusion matrix: The confusion matrix is a widely used measure for evaluating classification problems, applicable to both binary and multiclass settings, as shown in <xref ref-type="table" rid="T4">Table 4</xref>. In this case, a one-versus-all approach was used.</p>
<table-wrap position="float" id="T4">
<label>TABLE 4</label>
<caption><p>Confusion Matrix for Binary Classification.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center" colspan="3">Predicted<hr/></td>
</tr>
<tr>
<td valign="top" align="left">Actual</td>
<td/>
<td valign="top" align="left">Negative</td>
<td valign="top" align="left">Positive</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="left">Negative</td>
<td valign="top" align="left">True negative (TN)</td>
<td valign="top" align="left">False positive (FP)</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Positive</td>
<td valign="top" align="left">False negative (FN)</td>
<td valign="top" align="left">True positive (TP)</td>
</tr>
</tbody>
</table></table-wrap>
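<p>As a minimal sketch of how the four cells of the confusion matrix are counted from binary predictions (pure Python; the labels here are illustrative, not drawn from our experiments):</p>

```python
def binary_confusion(y_true, y_pred):
    """Count the four cells of a binary confusion matrix.
    Returns (TN, FP, FN, TP), matching the layout of Table 4."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 0 and p == 0:
            tn += 1  # actual negative, predicted negative
        elif t == 0 and p == 1:
            fp += 1  # actual negative, predicted positive
        elif t == 1 and p == 0:
            fn += 1  # actual positive, predicted negative
        else:
            tp += 1  # actual positive, predicted positive
    return tn, fp, fn, tp

print(binary_confusion([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))  # (1, 1, 1, 2)
```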
<p>Accuracy: Accuracy measures overall system performance as the fraction of all predictions that are correct, computed with the following equation:</p>
<disp-formula id="S5.Ex1">
<mml:math id="M1">
<mml:mrow>
<mml:mtext>Accuracy</mml:mtext>
<mml:mo stretchy="false">(</mml:mo>
<mml:mtext>all correct</mml:mtext>
<mml:mo>/</mml:mo>
<mml:mtext>all</mml:mtext>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow><mml:mtext>TP</mml:mtext><mml:mo>+</mml:mo><mml:mtext>TN</mml:mtext></mml:mrow>
<mml:mrow><mml:mtext>TP</mml:mtext><mml:mo>+</mml:mo><mml:mtext>TN</mml:mtext><mml:mo>+</mml:mo><mml:mtext>FP</mml:mtext><mml:mo>+</mml:mo><mml:mtext>FN</mml:mtext></mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Recall: Recall is the fraction of positive inputs that are successfully detected. It is the true-positive rate, measured by the following equation:</p>
<disp-formula id="S5.Ex2">
<mml:math id="M2">
<mml:mrow>
<mml:mtext>Recall</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mtext>TP</mml:mtext>
<mml:mrow><mml:mtext>TP</mml:mtext><mml:mo>+</mml:mo><mml:mtext>FN</mml:mtext></mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Precision: Precision refers to how accurately the classifier has predicted positive cases, i.e., the fraction of predicted positives that are truly positive. It is measured by the following equation:</p>
<disp-formula id="S5.Ex3">
<mml:math id="M3">
<mml:mrow>
<mml:mtext>Precision</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mtext>TP</mml:mtext>
<mml:mrow><mml:mtext>TP</mml:mtext><mml:mo>+</mml:mo><mml:mtext>FP</mml:mtext></mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<p>F1 Score: Another indicator of test accuracy is the F1 score (or F-measure), the harmonic mean of precision and recall. Its worst value is 0 and its best value is 1.</p>
<disp-formula id="S5.Ex4">
<mml:math id="M4">
<mml:mrow>
<mml:mtext>F1 Score</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x002A;</mml:mo><mml:mtext>Precision</mml:mtext><mml:mo>&#x002A;</mml:mo><mml:mtext>Recall</mml:mtext></mml:mrow>
<mml:mrow><mml:mtext>Precision</mml:mtext><mml:mo>+</mml:mo><mml:mtext>Recall</mml:mtext></mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
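<p>The four metrics above can be computed together from the confusion-matrix counts. The sketch below is illustrative only (the counts are hypothetical, not taken from our experiments):</p>

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 score from the four
    confusion-matrix counts, following the equations above."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for one class in a one-versus-all evaluation
acc, prec, rec, f1 = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(round(acc, 2), round(prec, 2), round(rec, 2), round(f1, 3))
```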
<sec id="S5.SS2.SSS1">
<title>5.2.1. Conv-LSTM-GRU: confusion matrix, precision, recall, and F1-score</title>
<p>The confusion matrix of the Conv-LSTM-GRU method on our dataset is shown in <xref ref-type="fig" rid="F15">Figure 15</xref>.</p>
<fig id="F15" position="float">
<label>FIGURE 15</label>
<caption><p>Confusion matrix of Conv-LSTM GRU method on our dataset.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijscit-2022-28-g015.tif"/>
</fig>
<sec id="S5.SS2.SSS1.Px1">
<title>5.2.1.1. Confusion matrix</title>
</sec>
<sec id="S5.SS2.SSS1.Px2">
<title>5.2.1.2. Precision, recall, and F1-score of conv-LSTM-GRU method</title>
<p>The precision, recall, and F1-score of the Conv-LSTM-GRU method on our own dataset are depicted in <xref ref-type="table" rid="T5">Table 5</xref>.</p>
<table-wrap position="float" id="T5">
<label>TABLE 5</label>
<caption><p>Precision, Recall, and F1-Score of Conv-LSTM Method on Our Own Dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center">Precision</td>
<td valign="top" align="center">Recall</td>
<td valign="top" align="center">F1 Score</td>
<td valign="top" align="center">Support</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Arrest</td>
<td valign="top" align="center">1.00</td>
<td valign="top" align="center">0.96</td>
<td valign="top" align="center">0.98</td>
<td valign="top" align="center">49</td>
</tr>
<tr>
<td valign="top" align="left">Burglary</td>
<td valign="top" align="center">1.00</td>
<td valign="top" align="center">0.98</td>
<td valign="top" align="center">0.99</td>
<td valign="top" align="center">48</td>
</tr>
<tr>
<td valign="top" align="left">Fighting</td>
<td valign="top" align="center">1.00</td>
<td valign="top" align="center">0.94</td>
<td valign="top" align="center">0.97</td>
<td valign="top" align="center">50</td>
</tr>
<tr>
<td valign="top" align="left">Ill-Treatment</td>
<td valign="top" align="center">0.94</td>
<td valign="top" align="center">0.97</td>
<td valign="top" align="center">0.95</td>
<td valign="top" align="center">32</td>
</tr>
<tr>
<td valign="top" align="left">Violence</td>
<td valign="top" align="center">0.92</td>
<td valign="top" align="center">0.92</td>
<td valign="top" align="center">0.92</td>
<td valign="top" align="center">26</td>
</tr>
<tr>
<td valign="top" align="left">Attack</td>
<td valign="top" align="center">1.00</td>
<td valign="top" align="center">1.00</td>
<td valign="top" align="center">1.00</td>
<td valign="top" align="center">38</td>
</tr>
<tr>
<td valign="top" align="left">Explosion</td>
<td valign="top" align="center">0.86</td>
<td valign="top" align="center">1.00</td>
<td valign="top" align="center">0.92</td>
<td valign="top" align="center">49</td>
</tr>
<tr>
<td valign="top" align="left">Normal Videos</td>
<td valign="top" align="center">1.00</td>
<td valign="top" align="center">0.98</td>
<td valign="top" align="center">0.99</td>
<td valign="top" align="center">41</td>
</tr>
<tr>
<td valign="top" align="left">Fire Raising</td>
<td valign="top" align="center">0.98</td>
<td valign="top" align="center">0.96</td>
<td valign="top" align="center">0.97</td>
<td valign="top" align="center">49</td>
</tr>
<tr>
<td valign="top" align="left">Traffic Irregularities</td>
<td valign="top" align="center">1.00</td>
<td valign="top" align="center">0.92</td>
<td valign="top" align="center">0.96</td>
<td valign="top" align="center">13</td>
</tr>
<tr>
<td valign="top" align="left">Accuracy</td>
<td/>
<td/>
<td valign="top" align="center">0.94</td>
<td valign="top" align="center">395</td>
</tr>
<tr>
<td valign="top" align="left">Micro avg</td>
<td valign="top" align="center">0.97</td>
<td valign="top" align="center">0.96</td>
<td valign="top" align="center">0.95</td>
<td valign="top" align="center">395</td>
</tr>
<tr>
<td valign="top" align="left">Weighted avg</td>
<td valign="top" align="center">0.97</td>
<td valign="top" align="center">0.97</td>
<td valign="top" align="center">0.94</td>
<td valign="top" align="center">395</td>
</tr>
</tbody>
</table></table-wrap>
</sec>
</sec>
<sec id="S5.SS2.SSS2">
<title>5.2.2. LRCN: Confusion matrix, precision, recall, and F1-score</title>
<p>The confusion matrix of the LRCN method on our dataset is shown in <xref ref-type="fig" rid="F16">Figure 16</xref>.</p>
<fig id="F16" position="float">
<label>FIGURE 16</label>
<caption><p>Confusion matrix of LRCN method on our dataset.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijscit-2022-28-g016.tif"/>
</fig>
<sec id="S5.SS2.SSS2.Px1">
<title>5.2.2.1. Confusion matrix</title>
</sec>
<sec id="S5.SS2.SSS2.Px2">
<title>5.2.2.2. Precision, recall, F1-score of LRCN method</title>
<p>The precision, recall, and F1-score of the LRCN method on our dataset are given in <xref ref-type="table" rid="T6">Table 6</xref>.</p>
<table-wrap position="float" id="T6">
<label>TABLE 6</label>
<caption><p>Precision, Recall, and F1 Score of LRCN Method of Our Dataset.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center">Precision</td>
<td valign="top" align="center">Recall</td>
<td valign="top" align="center">F1 Score</td>
<td valign="top" align="center">Support</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Arrest</td>
<td valign="top" align="center">1.00</td>
<td valign="top" align="center">0.69</td>
<td valign="top" align="center">0.82</td>
<td valign="top" align="center">49</td>
</tr>
<tr>
<td valign="top" align="left">Burglary</td>
<td valign="top" align="center">0.98</td>
<td valign="top" align="center">0.83</td>
<td valign="top" align="center">0.90</td>
<td valign="top" align="center">48</td>
</tr>
<tr>
<td valign="top" align="left">Fighting</td>
<td valign="top" align="center">0.69</td>
<td valign="top" align="center">0.96</td>
<td valign="top" align="center">0.80</td>
<td valign="top" align="center">50</td>
</tr>
<tr>
<td valign="top" align="left">Ill-Treatment</td>
<td valign="top" align="center">0.91</td>
<td valign="top" align="center">0.94</td>
<td valign="top" align="center">0.92</td>
<td valign="top" align="center">32</td>
</tr>
<tr>
<td valign="top" align="left">Violence</td>
<td valign="top" align="center">0.82</td>
<td valign="top" align="center">0.69</td>
<td valign="top" align="center">0.75</td>
<td valign="top" align="center">26</td>
</tr>
<tr>
<td valign="top" align="left">Attack</td>
<td valign="top" align="center">0.91</td>
<td valign="top" align="center">0.82</td>
<td valign="top" align="center">0.86</td>
<td valign="top" align="center">38</td>
</tr>
<tr>
<td valign="top" align="left">Explosion</td>
<td valign="top" align="center">0.83</td>
<td valign="top" align="center">0.90</td>
<td valign="top" align="center">0.86</td>
<td valign="top" align="center">49</td>
</tr>
<tr>
<td valign="top" align="left">Normal Videos</td>
<td valign="top" align="center">0.95</td>
<td valign="top" align="center">0.98</td>
<td valign="top" align="center">0.96</td>
<td valign="top" align="center">41</td>
</tr>
<tr>
<td valign="top" align="left">Fire Raising</td>
<td valign="top" align="center">0.90</td>
<td valign="top" align="center">0.96</td>
<td valign="top" align="center">0.93</td>
<td valign="top" align="center">49</td>
</tr>
<tr>
<td valign="top" align="left">Traffic Irregularities</td>
<td valign="top" align="center">0.93</td>
<td valign="top" align="center">1.00</td>
<td valign="top" align="center">0.96</td>
<td valign="top" align="center">13</td>
</tr>
<tr>
<td valign="top" align="left">Accuracy</td>
<td/>
<td/>
<td valign="top" align="center">0.87</td>
<td valign="top" align="center">395</td>
</tr>
<tr>
<td valign="top" align="left">Micro avg</td>
<td valign="top" align="center">0.89</td>
<td valign="top" align="center">0.88</td>
<td valign="top" align="center">0.88</td>
<td valign="top" align="center">395</td>
</tr>
<tr>
<td valign="top" align="left">Weighted avg</td>
<td valign="top" align="center">0.89</td>
<td valign="top" align="center">0.87</td>
<td valign="top" align="center">0.87</td>
<td valign="top" align="center">395</td>
</tr>
</tbody>
</table></table-wrap>
</sec>
</sec>
</sec>
</sec>
<sec id="S6">
<title>6. Results and discussion</title>
<p>LRCN achieved 87% and Conv-LSTM-GRU achieved 94% testing accuracy on the dataset. The training and validation losses decrease as the number of epochs increases, indicating that the models converge toward a minimum. The training and validation accuracy curves likewise increase with the number of epochs. The ROC curves indicate that our classifiers are close to ideal, as the AUC values approach 1.</p>
</sec>
<sec id="S7">
<title>7. Challenges in HAR</title>
<p>Modeling and analyzing human&#x2013;human and human&#x2013;object interactions is a challenging issue. HAR systems are not yet capable of detecting and recognizing numerous gestures under varying background conditions, and they are not tolerant to gesture scaling and growth. Some activities are difficult to represent due to their complicated structure and the wide variety of ways in which they are performed.</p>
<p>There are limitations on scene and human movement in 3D space. Additionally, identifying and extracting persons from visual sequences demands knowledge and skill, and a real-time HAR system can offer better results only when massive volumes of data are processed simultaneously. Privacy is a further concern: a person may feel uneasy about, or object to, being constantly watched.</p>
</sec>
<sec id="S8">
<title>8. Conclusion and future scope</title>
<sec id="S8.SS1">
<title>8.1. Conclusion</title>
<p>A literature review of research articles published between 2018 and 2021 on HAR technologies, including smartphone sensors, wearable sensors, and vision-based techniques, was carried out. Wearable technology provides greater assistance; however, poorly recognized activities call for further research on accuracy and system scalability. Long training times are a key disadvantage of CNN-based methods, since the training dataset is made up of a variety of human actions drawn from videos, requiring intensive processing for proper identification.</p>
<p>Due to the limited availability of computational power, we had to train our models with fewer epochs, so the accuracy obtained is lower than it could otherwise be.</p>
</sec>
<sec id="S8.SS2">
<title>8.2. Future scope</title>
<p>The previously discussed challenges to HAR have to be overcome, and a deep learning model of comparable accuracy must be selected for detecting abnormal behavior with the human activity recognition system. Future models will use transfer learning, meta-learning, new pretrained CNN models, and combined deep learning models to increase accuracy.</p>
</sec>
</sec>
<sec id="S9" sec-type="author-contributions">
<title>Author contributions</title>
<p>All authors agree to be accountable for the content of the work.</p>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="B1"><label>1.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Donahue</surname> <given-names>J</given-names></name> <name><surname>Hendricks</surname> <given-names>LA</given-names></name> <name><surname>Rohrbach</surname> <given-names>M</given-names></name> <name><surname>Venugopalan</surname> <given-names>S</given-names></name> <name><surname>Guadarrama</surname> <given-names>S</given-names></name> <name><surname>Saenko</surname> <given-names>K</given-names></name><etal/></person-group> <article-title>Long-term recurrent convolutional networks for visual recognition and description.</article-title> <source><italic>Proc IEEE Conf Comput Vis Pattern Recognit.</italic></source> (<year>2015</year>) <volume>39</volume>:<fpage>677</fpage>&#x2013;<lpage>691</lpage>.</citation></ref>
<ref id="B2"><label>2.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sabokrou</surname> <given-names>M</given-names></name> <name><surname>Fayyaz</surname> <given-names>M</given-names></name> <name><surname>Fathy</surname> <given-names>M</given-names></name> <name><surname>Klette</surname> <given-names>R</given-names></name></person-group>. <article-title>Deep-cascade: Cascading 3d deep neural networks for fast anomaly detection and localization in crowded scenes.</article-title> <source><italic>IEEE Trans Image Process.</italic></source> (<year>2017</year>) <volume>26</volume>:<fpage>1992</fpage>&#x2013;<lpage>2004</lpage>.</citation></ref>
<ref id="B3"><label>3.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hasan</surname> <given-names>M</given-names></name> <name><surname>Choi</surname> <given-names>J</given-names></name> <name><surname>Neumann</surname> <given-names>J</given-names></name> <name><surname>Roy-Chowdhury</surname> <given-names>AK</given-names></name> <name><surname>Davis</surname> <given-names>LS</given-names></name></person-group>. <article-title>Learning temporal regularity in video sequences.</article-title> <source><italic>Proceedings of the IEEE conference on computer vision and pattern recognition.</italic></source> <publisher-loc>Las Vegas, NV</publisher-loc>: (<year>2016</year>).</citation></ref>
<ref id="B4"><label>4.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sultani</surname> <given-names>W</given-names></name> <name><surname>Chen</surname> <given-names>C</given-names></name> <name><surname>Shah</surname> <given-names>M</given-names></name></person-group>. <article-title>Real-world anomaly detection in surveillance videos.</article-title> <source><italic>Proceedings of the IEEE conference on computer vision and pattern recognition.</italic></source> <publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name> (<year>2018</year>).</citation></ref>
<ref id="B5"><label>5.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mart&#x00ED;nez-Mascorro</surname> <given-names>GA</given-names></name> <name><surname>Abreu-Pederzini</surname> <given-names>JR</given-names></name> <name><surname>Ort&#x00ED;z-Bayliss</surname> <given-names>JC</given-names></name> <name><surname>Terashima-Mar&#x2019;in</surname> <given-names>H</given-names></name></person-group>. <article-title>Suspicious behavior detection on shoplifting cases for crime prevention by using 3D convolutional neural networks.</article-title> <source><italic>arXiv</italic></source> <comment>[preprint]</comment>. (<year>2020</year>): <comment>arXiv:2005.02142</comment></citation></ref>
<ref id="B6"><label>6.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ullah</surname> <given-names>W</given-names></name> <name><surname>Ullah</surname> <given-names>A</given-names></name> <name><surname>Haq</surname> <given-names>IU</given-names></name> <name><surname>Muhammad</surname> <given-names>K</given-names></name> <name><surname>Sajjad</surname> <given-names>M</given-names></name> <name><surname>Baik</surname> <given-names>SW</given-names></name></person-group>. <article-title>CNN features with bi-directional LSTM for real-time anomaly detection in surveillance networks.</article-title> <source><italic>Multimedia Tools Appl.</italic></source> (<year>2021</year>) <volume>80</volume>:<fpage>16979</fpage>&#x2013;<lpage>16995</lpage>.</citation></ref>
<ref id="B7"><label>7.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dubey</surname> <given-names>S</given-names></name> <name><surname>Boragule</surname> <given-names>A</given-names></name> <name><surname>Jeon</surname> <given-names>M</given-names></name></person-group>. <article-title>3d resnet with ranking loss function for abnormal activity detection in videos.</article-title> <source><italic>Proceedings of the international conference on control, automation and information sciences (ICCAIS).</italic></source> <publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE</publisher-name> (<year>2019</year>).</citation></ref>
<ref id="B8"><label>8.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wan</surname> <given-names>B</given-names></name> <name><surname>Jiang</surname> <given-names>W</given-names></name> <name><surname>Fang</surname> <given-names>Y</given-names></name> <name><surname>Luo</surname> <given-names>Z</given-names></name> <name><surname>Ding</surname> <given-names>G</given-names></name></person-group>. <article-title>Anomaly detection in video sequences: A benchmark and computational model.</article-title> <source><italic>IET Image Process.</italic></source> (<year>2021</year>) <volume>15</volume>:<fpage>3454</fpage>&#x2013;<lpage>3465</lpage>.</citation></ref>
<ref id="B9"><label>9.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Khan</surname> <given-names>A</given-names></name> <name><surname>Sohail</surname> <given-names>A</given-names></name> <name><surname>Zahoora</surname> <given-names>U</given-names></name> <name><surname>Qureshi</surname> <given-names>AS</given-names></name></person-group>. <article-title>A survey of the recent architectures of deep convolutional neural networks.</article-title> <source><italic>Artif Intell Rev.</italic></source> (<year>2020</year>) <volume>53</volume>:<fpage>5455</fpage>&#x2013;<lpage>5516</lpage>.</citation></ref>
<ref id="B10"><label>10.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Begampure</surname> <given-names>S</given-names></name> <name><surname>Jadhav</surname> <given-names>P</given-names></name></person-group>. <article-title>Intelligent video analytics for human action detection: a deep learning approach with transfer learning.</article-title> <source><italic>Int J Comput Digital Syst.</italic></source> (<year>2021</year>) <volume>11</volume>:<fpage>63</fpage>&#x2013;<lpage>72</lpage>.</citation></ref>
<ref id="B11"><label>11.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Nasaruddin</surname> <given-names>N</given-names></name> <name><surname>Muchtar</surname> <given-names>K</given-names></name> <name><surname>Afdhal</surname> <given-names>A</given-names></name> <name><surname>Dwiyantoro</surname> <given-names>AP</given-names></name></person-group>. <article-title>Deep anomaly detection through visual attention in surveillance videos.</article-title> <source><italic>J Big Data.</italic></source> (<year>2020</year>) <volume>7</volume>:<fpage>1</fpage>&#x2013;<lpage>17</lpage>.</citation></ref>
<ref id="B12"><label>12.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Alexandre</surname> <given-names>X.</given-names></name></person-group> <source><italic>An introduction to ConvLSTM.</italic></source> (<year>2019</year>). Available online at: <ext-link ext-link-type="uri" xlink:href="https://medium.com/neuronio/an-introduction-to-convlstm-55c9025563a7">https://medium.com/neuronio/an-introduction-to-convlstm-55c9025563a7</ext-link></citation></ref>
<ref id="B13"><label>13.</label><citation citation-type="journal"><collab><italic>Gated recurrent unit (GRU).</italic></collab> Available online at: <ext-link ext-link-type="uri" xlink:href="https://primo.ai/index.php?title=Gated_Recurrent_Unit_(GRU">https://primo.ai/index.php?title=Gated_Recurrent_Unit_(GRU</ext-link>)</citation></ref>
</ref-list>
</back>
</article>