<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v2.3 20070202//EN" "archivearticle.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="methods-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Bohr. Cs.</journal-id>
<journal-title>BOHR International Journal of Computer Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Bohr. Cs.</abbrev-journal-title>
<issn pub-type="epub">2583-455X</issn>
<publisher>
<publisher-name>BOHR</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.54646/bijcs.2022.05</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Methods</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Hindi/Bengali sentiment analysis using transfer learning and joint dual input learning with self-attention</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Khan</surname> <given-names>Shahrukh</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Shahid</surname> <given-names>Mahnoor</given-names></name>
</contrib>
</contrib-group>
<aff/>
<author-notes>
<corresp id="c001">&#x002A;Correspondence: Shahrukh Khan, <email>shkh00001@stud.uni-saarland.de</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>20</day>
<month>04</month>
<year>2022</year>
</pub-date>
<volume>1</volume>
<issue>1</issue>
<fpage>32</fpage>
<lpage>37</lpage>
<history>
<date date-type="received">
<day>25</day>
<month>02</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>05</day>
<month>04</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2023 Khan and Shahid.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Khan and Shahid</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by-nc-nd/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Sentiment analysis typically refers to using natural language processing, text analysis, and computational linguistics to extract affect- and emotion-based information from text data. Our work explores how deep neural networks can be used effectively in transfer learning and joint dual input learning settings to classify sentiments and detect hate speech in Hindi and Bengali data. We start by training Word2Vec word embeddings for the Hindi HASOC data set and a Bengali hate speech data set (<xref ref-type="bibr" rid="B1">1</xref>), train long short-term memory (LSTM) classifiers on each, and then employ parameter sharing-based transfer learning for the Bengali sentiment classifier by reusing and fine-tuning the trained weights of the Hindi classifier, with both classifiers serving as baselines in our study. Finally, we use a BiLSTM with self-attention in a joint dual input learning setting, where a single neural network is trained on the Hindi and Bengali data sets simultaneously using their respective embeddings.</p>
</abstract>
<kwd-group>
<kwd>deep learning</kwd>
<kwd>transfer learning</kwd>
<kwd>LSTMs</kwd>
<kwd>sentiment analysis</kwd>
<kwd>opinion mining</kwd>
</kwd-group>
<counts>
<fig-count count="7"/>
<table-count count="2"/>
<equation-count count="8"/>
<ref-count count="7"/>
<page-count count="6"/>
<word-count count="3646"/>
</counts>
</article-meta>
</front>
<body>
<sec id="S1" sec-type="intro">
<title>Introduction</title>
<p>There have been huge breakthroughs in the natural language processing paradigm with the advent of the attention mechanism; its use in transformer sequence-to-sequence models (<xref ref-type="bibr" rid="B2">2</xref>), coupled with different transfer learning techniques, has quickly become state-of-the-art in multiple pervasive natural language processing tasks such as entity recognition and classification (<xref ref-type="bibr" rid="B3">3</xref>, <xref ref-type="bibr" rid="B4">4</xref>). In our work, we propose a novel self-attention-based architecture for sentiment analysis and classification on the <bold>Hindi HASOC data set</bold> (<xref ref-type="bibr" rid="B1">1</xref>). We use sub-task A, which deals with whether or not a given tweet contains hate speech. This task also serves as the source domain in the subsequent task-specific transfer learning experiment, in which we transfer the knowledge learned in the Hindi sentiment analysis domain to a similar binary Bengali sentiment analysis task.</p>
<p>Given the similar nature of the Bengali and Hindi sentiment analysis tasks (i.e., both are binary classification), we conceptualized the problem as a joint dual input learning setting. Lin et al. (<xref ref-type="bibr" rid="B5">5</xref>) showed how self-attention can be integrated with BiLSTMs to give a hidden representation containing different aspects for each sequence, resulting in sentence embeddings, while performing sentiment analysis and, more broadly, text classification.</p>
<p>One significant beneficial side effect of this approach is that the attention attributions can easily be visualized, showing which portions of the sequence the attention mechanism has weighted most heavily via its generated summation weights; this visualization technique played a pivotal role in selecting the number of attention hops <italic>r</italic>, i.e., how many attention vectors of summation weights are produced for each sequence, in our study. Further, we employed this approach in a joint dual input learning setting where a single neural network is trained on Hindi and Bengali data simultaneously.</p>
<p>Our proposed approach, a variant of multitask learning, offers an alternative framework for training text classification-based neural networks on low-resource corpora. Since we train a single joint network on multiple tasks simultaneously, it allows for better generalization and task-specific weight sharing, which can be a reasonable alternative to pretraining-based transfer learning approaches.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p>Train loss convergence for different values of window size with fixed embedded size = 300 Hindi data set.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijcs-2022-05-g001.tif"/>
</fig>
</sec>
<sec id="S2">
<title>Experimental setting</title>
<sec id="S2.SS1">
<title>Word embeddings</title>
<p>Starting with the Hindi data set, we prepared the training data set employing a sub-sampling technique, first computing the probability of keeping each word using the following formula:</p>
<disp-formula id="S2.Ex1">
<mml:math id="M1">
<mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>e</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msqrt>
<mml:mfrac>
<mml:mrow>
<mml:mi>z</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>0.000001</mml:mn>
</mml:mfrac>
</mml:msqrt>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x00D7;</mml:mo>
<mml:mfrac>
<mml:mn>0.000001</mml:mn>
<mml:mrow>
<mml:mi>z</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where z(w<sub><italic>i</italic></sub>) is the relative frequency of the word in the corpus. We used P<sub><italic>keep</italic></sub>(w<sub><italic>i</italic></sub>) for each context word while sampling context words for a given word, randomly dropping frequent context words by comparing their keep probability against a random threshold sampled each time from a uniform distribution. Had we kept all the frequent words in our training contexts, we might not have captured the rich semantic relationships between domain-specific words, since frequent words like &#x201C;the,&#x201D; &#x201C;me,&#x201D; etc. do not necessarily carry much semantic meaning in a given sequence. Hence, randomly dropping them made more sense than keeping or dropping all of them. Another important design decision was to curate the train set for Word2Vec only once before training the model, as opposed to creating a different one for each epoch of random sub-sampling, because the former approach gives a faster training time while the model still converged to a relatively low train loss. Furthermore, for choosing hyperparameters, we performed the analysis shown in Figure 1 and Table 1.</p>
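<p>The sub-sampling step above can be sketched as follows. This is a minimal stdlib-only illustration, assuming word counts are available as a <monospace>Counter</monospace>; the function names are ours, not from any library.</p>

```python
import random
from collections import Counter

def keep_probability(word, counts, total, t=0.000001):
    # z(w_i): relative frequency of the word in the corpus
    z = counts[word] / total
    # P_keep(w_i) = (sqrt(z(w_i) / t) + 1) * (t / z(w_i))
    return ((z / t) ** 0.5 + 1) * (t / z)

def subsample_context(context, counts, total, t=0.000001):
    # Randomly drop frequent context words by comparing P_keep
    # against a fresh uniform sample for each word.
    return [w for w in context
            if random.random() < keep_probability(w, counts, total, t)]

# Toy corpus statistics: rare words get a much higher keep
# probability than very frequent ones.
counts = Counter({"the": 9000, "film": 900, "luminous": 100})
total = sum(counts.values())
```

<p>Because the uniform threshold is redrawn per occurrence, individual occurrences of a frequent word are dropped at random rather than the word being removed from the vocabulary outright.</p>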
<p>It is apparent from this visualization that Word2Vec models with smaller context windows converged faster and reached a better train loss by the end of training. However, to retain some context-based information, we selected a window size of 2, which still captures contextual information while achieving a good train loss.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p>Hindi and Bengali joint dual input learning architecture.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijcs-2022-05-g002.tif"/>
</fig>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption><p>Attention vectors for a relatively longer Hindi sequence.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijcs-2022-05-g003.tif"/>
</fig>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption><p>Attention vectors for a relatively medium Hindi sequence.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijcs-2022-05-g004.tif"/>
</fig>
<p>After testing different combinations of hyperparameter values, we observed that the model performed best with <bold>epochs = 500, window size = 2, embedded size = 300, and learning rate = 0.05</bold> in the case of our study.</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption><p>Attention vectors for a short Hindi sequence.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijcs-2022-05-g005.tif"/>
</fig>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption><p>Attention vectors for a short Bengali sequence.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijcs-2022-05-g006.tif"/>
</fig>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption><p>Attention vectors for a short Bengali sequence.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijcs-2022-05-g007.tif"/>
</fig>
<p>Also, we set <bold>cross-entropy loss</bold> as the criterion for adjusting the weights during the training phase. After softmax converts logits into probabilities, cross-entropy takes those output probabilities (p) and measures their distance from the truth values to estimate the loss. The cross-entropy loss we used inherently combines <bold>log softmax and negative log-likelihood loss</bold>, so we did not apply log softmax to the output of our Word2Vec model.</p>
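<p>To make the combined-loss remark concrete, the following minimal stdlib-only sketch (with our own helper names) shows that applying log softmax and then taking the negative log-likelihood at the target index is exactly the cross-entropy of the raw logits, which is why no explicit log softmax layer is needed on the model output:</p>

```python
import math

def log_softmax(logits):
    # Numerically stable log softmax over a list of raw logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cross_entropy(logits, target):
    # Cross-entropy = negative log-likelihood of the log softmax
    # at the target index, computed in one step.
    return -log_softmax(logits)[target]
```
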
<p>For optimization, we selected Adam (Adaptive Moment Estimation) (<xref ref-type="bibr" rid="B6">6</xref>), an optimization technique that is widely recommended due to its computational efficiency, low memory requirements, invariance to diagonal rescaling of the gradients, and strong results on problems that are large in terms of data/parameters or that have sparse gradients. Adam combines the best properties of AdaGrad and RMSProp and is often used as an alternative to SGD with Nesterov momentum (<xref ref-type="bibr" rid="B6">6</xref>).</p>
</sec>
<sec id="S2.SS2">
<title>Baseline models</title>
<p>For the baseline, we reproduced the work of Wang et al. (<xref ref-type="bibr" rid="B7">7</xref>), which primarily focuses on performing sentiment classification on short social media texts using long short-term memory (LSTM) neural networks with distributed representations from the Word2Vec skip-gram approach. We chose this approach because the authors used Word2Vec skip-gram-based distributed representations of words and because our data sets were also sourced from social media. The LSTM is an upgraded variant of the recurrent neural network (RNN) that remedies, to some extent, the difficulty RNNs have in learning long-term temporal dependencies due to vanishing gradients; the LSTM uses a gate mechanism and memory cell to control the memory process.</p>
</sec>
<sec id="S2.SS3">
<title>Hindi neural sentiment classifier baseline</title>
<p>We then implemented the LSTM classifier architecture using the pretrained 300-dimensional word embeddings obtained as described in section &#x201C;Word Embeddings.&#x201D; We used the Adam optimizer with an initial learning rate of 10<sup>&#x2013;4</sup>, which helped the train and validation loss converge at a relatively fast rate; the optimizer did not update the weights of the embedding layer via gradient optimization since they were pretrained. Moreover, we chose the binary cross-entropy loss function as we are dealing with binary classification. For the model architecture, we used eight layers of LSTMs, each with a hidden dimension of 64, followed by a dropout layer with a dropout probability of 0.5 to counterbalance overfitting, and then a fully connected output layer wrapped with a sigmoid activation function, since our target is binary and sigmoid is the ideal choice for binary classification given its mathematical properties. We kept a batch size of 32 and trained the model for 30 epochs while monitoring its accuracy and loss on the validation set. The hyperparameters were chosen after trying different combinations and monitoring validation set accuracy.</p>
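<p>A minimal PyTorch sketch of a classifier with the shape described above (frozen pretrained embeddings, stacked LSTMs with hidden dimension 64, dropout 0.5, sigmoid output); the class name is ours, and the pretrained embedding matrix is assumed to be available as a tensor.</p>

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    # Sketch of the baseline: frozen pretrained embeddings, stacked
    # LSTMs (hidden dim 64), dropout 0.5, and a sigmoid output layer.
    def __init__(self, embedding_weights, num_layers=8, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(embedding_weights,
                                                      freeze=True)
        self.lstm = nn.LSTM(embedding_weights.size(1), hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, n, embed_dim)
        _, (h_n, _) = self.lstm(embedded)      # final hidden states
        out = self.dropout(h_n[-1])            # top layer's final state
        return torch.sigmoid(self.fc(out)).squeeze(-1)
```

<p>Freezing via <monospace>from_pretrained(..., freeze=True)</monospace> keeps the embedding table out of the optimizer's gradient updates, matching the design above.</p>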
</sec>
<sec id="S2.SS4">
<title>Bengali neural transfer learning-based sentiment classifier baseline</title>
<p>As with the Hindi sentiment classification pipeline, we first obtained word embeddings for the Bengali data using the Word2Vec skip-gram approach. The same set of hyperparameters we chose for the Hindi data set worked well here, so we did not retune them; the model&#x2019;s train loss also converged to a value similar to that for the Hindi data set. We then created the same LSTM-based classifier architecture explained in section &#x201C;Hindi Neural Sentiment Classifier Baseline.&#x201D; Our goal was to perform transfer learning by reusing and fine-tuning the learned weights of the Hindi classifier. We replaced the Hindi embedding layer with a Bengali 300-dimensional embedding layer and did not optimize its weights during training. Next, we loaded the Hindi classifier&#x2019;s weights for the LSTM layers and fully connected layers, applying a parameter sharing-based, task-specific transfer learning technique. We trained the Bengali classifier for 30 epochs with a batch size of 32, using the Adam optimizer with an initial learning rate of 10<sup>&#x2013;4</sup> and the binary cross-entropy function for computing loss on the training and validation sets. The batch size was chosen after trying different values and monitoring validation set accuracy. After training the classifier with the pretrained weights from the Hindi classifier, we obtained better performance than the Hindi baseline, implying that task-based transfer learning actually boosted the performance of the Bengali classifier.</p>
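<p>The weight-sharing step can be sketched in PyTorch as follows; <monospace>TinyClassifier</monospace> and <monospace>transfer_weights</monospace> are hypothetical stand-ins with toy dimensions, not our actual training code.</p>

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    # Hypothetical stand-in for the LSTM classifier described above.
    def __init__(self, vocab, dim=8, hidden=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.lstm(self.embedding(x))
        return torch.sigmoid(self.fc(out[:, -1])).squeeze(-1)

def transfer_weights(source, target, new_embeddings):
    # Parameter sharing-based transfer: reuse the LSTM and output-layer
    # weights, but swap in the target language's (frozen) embeddings.
    shared = {k: v for k, v in source.state_dict().items()
              if not k.startswith("embedding")}
    target.load_state_dict(shared, strict=False)
    target.embedding = nn.Embedding.from_pretrained(new_embeddings,
                                                    freeze=True)
    return target

hindi = TinyClassifier(vocab=100)
bengali = transfer_weights(hindi, TinyClassifier(vocab=120),
                           torch.randn(120, 8))
```

<p>The transferred layers can then be fine-tuned on the Bengali data, since only the embedding table is excluded from gradient updates.</p>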
</sec>
</sec>
<sec id="S3">
<title>Proposed method</title>
<p>The LSTM-based classifier, coupled with transfer learning in the Bengali domain, does a fairly good job of providing the baselines in our study. However, one prominent shortcoming of RNN-based architectures is that they struggle to capture dependencies between words that are far apart in a sequence. The forget gate of the LSTM enables it to retain information about earlier words in the sequence, but it does not completely resolve the vanishing gradients problem of RNN-based networks. We wanted to investigate whether using self-attention with LSTMs would improve our model&#x2019;s performance. We also propose a joint dual input learning setting in which both the Hindi and Bengali classification tasks can benefit from each other, rather than a transfer learning setting in which only the target task takes advantage of pretraining.</p>
</sec>
<sec id="S4">
<title>Hindi and Bengali self-attention-based joint dual input learning BiLSTM classifier</title>
<p>Instead of training two separate neural networks for Hindi and Bengali, we simultaneously trained a single joint neural network with the same architecture on the Hindi and Bengali data in parallel, optimizing its weights using the combined binary cross-entropy loss over the Hindi and Bengali data sets. We also added the attention loss of the Hindi and Bengali batches to the joint loss in order to avoid overfitting, which we present in detail in the subsequent sections. We switched between the embedding layers based on the language of each batch. The block architecture we propose is shown in Figure 2.</p>
<p>One major benefit of this technique is that it improves the model&#x2019;s ability to generalize, since the size of the training data set roughly doubles when both languages have an equal number of training examples. Consequently, it reduces the risk of overfitting.</p>
<p>We propose an extension of the work of Lin et al. (<xref ref-type="bibr" rid="B5">5</xref>), who introduced <bold>a structured self-attentive sentence embedding</bold>, applied here to the Hindi data set. The key idea is to produce document-level embeddings by attaching the self-attention mechanism directly after a BiLSTM, which leverages information of both past and future in the sequence, as opposed to a unidirectional LSTM, which relies only on past information. The self-attention mechanism produces a matrix of attention vectors, which are then used to produce sentence embeddings; each vector has length equal to the sequence length, and the number of vectors depends on the value of <italic>r</italic>, the output dimension of the self-attention mechanism. Each vector represents how the attention mechanism places relative weight on different tokens in the sequence. The key steps in producing self-attentive document embeddings are as follows:</p>
<p>We start with an input text <italic>T</italic> of dimension (<italic>n</italic>, <italic>d</italic>), where <italic>n</italic> is the number of tokens, each token is represented by its embedding <italic>e</italic> in the sequence, and <italic>d</italic> is the embedding dimension.</p>
<disp-formula id="S4.Ex2">
<mml:math id="M2">
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>d</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:msub>
<mml:mi>e</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>e</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>e</mml:mi>
<mml:mn>3</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">&#x2026;</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>e</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="S4.Ex3">
<mml:math id="M3">
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>g</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:msub>
<mml:mi>e</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>e</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>e</mml:mi>
<mml:mn>3</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">&#x2026;</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>e</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Based on the source language of the input text, the corresponding embedding lookup is performed.</p>
<p>The token embeddings are then fed into the BiLSTM, which processes each token from left to right and from right to left, with each BiLSTM cell/layer producing two vectors of hidden states, each equivalent to the length of the sequence.</p>
<disp-formula id="S4.Ex4">
<mml:math id="M4">
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mover accent="true">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>&#x2192;</mml:mo>
</mml:mover>
<mml:mo rspace="4.2pt">,</mml:mo>
<mml:mover accent="true">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>&#x2192;</mml:mo>
</mml:mover>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">&#x2026;</mml:mi>
<mml:mo rspace="4.2pt">,</mml:mo>
<mml:mover accent="true">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>&#x2192;</mml:mo>
</mml:mover>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:mo rspace="5.8pt">,</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mover accent="true">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>&#x2190;</mml:mo>
</mml:mover>
<mml:mo>,</mml:mo>
<mml:mover accent="true">
<mml:msub>
<mml:mpadded lspace="2.8pt" width="+2.8pt">
<mml:mi>h</mml:mi>
</mml:mpadded>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>&#x2190;</mml:mo>
</mml:mover>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">&#x2026;</mml:mi>
<mml:mo rspace="4.2pt">,</mml:mo>
<mml:mover accent="true">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>&#x2190;</mml:mo>
</mml:mover>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="S4.Ex5">
<mml:math id="M5">
<mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mi>B</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>L</mml:mi>
<mml:mi>S</mml:mi>
<mml:mi>T</mml:mi>
<mml:mi>M</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:msub>
<mml:mi>e</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>e</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">&#x2026;</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>e</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:mo>;</mml:mo>
<mml:mi mathvariant="normal">&#x03B8;</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p><italic><bold>H</bold></italic> is the concatenated form of the bidirectional hidden states. If there are <italic>l</italic> LSTM layers/cells, then the dimension of <italic><bold>H</bold></italic> is (<italic>n</italic>, 2<italic>l</italic>).</p>
<disp-formula id="S4.Ex6">
<mml:math id="M6">
<mml:mrow>
<mml:mpadded width="+2.8pt">
<mml:mi>H</mml:mi>
</mml:mpadded>
<mml:mo rspace="5.3pt">=</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mover accent="true">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>&#x2192;</mml:mo>
</mml:mover>
<mml:mo rspace="4.2pt">,</mml:mo>
<mml:mover accent="true">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>&#x2192;</mml:mo>
</mml:mover>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">&#x2026;</mml:mi>
<mml:mo rspace="4.2pt">,</mml:mo>
<mml:mover accent="true">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>&#x2192;</mml:mo>
</mml:mover>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:mo rspace="5.8pt">,</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mover accent="true">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>&#x2190;</mml:mo>
</mml:mover>
<mml:mo>,</mml:mo>
<mml:mover accent="true">
<mml:msub>
<mml:mpadded lspace="2.8pt" width="+2.8pt">
<mml:mi>h</mml:mi>
</mml:mpadded>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>&#x2190;</mml:mo>
</mml:mover>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">&#x2026;</mml:mi>
<mml:mo rspace="4.2pt">,</mml:mo>
<mml:mover accent="true">
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>&#x2190;</mml:mo>
</mml:mover>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<table-wrap position="float" id="T1">
<label>TABLE 1</label>
<caption><p>Different combinations of hyperparameters from Hindi data set.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Embedded size</td>
<td valign="top" align="center">Learning rate</td>
<td valign="top" align="center">Window size</td>
<td valign="top" align="center">Min. loss score</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">300</td>
<td valign="top" align="center">0.05</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">0.841</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">2</td>
<td valign="top" align="center">1.559</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">3</td>
<td valign="top" align="center">1.942</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">4</td>
<td valign="top" align="center">2.151</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">5</td>
<td valign="top" align="center">2.321</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">10</td>
<td valign="top" align="center">2.792</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">0.01</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">1.298</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">2</td>
<td valign="top" align="center">3.295</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">10</td>
<td valign="top" align="center">2.747</td>
</tr>
<tr>
<td/>
<td valign="top" align="center">0.1</td>
<td valign="top" align="center">1</td>
<td valign="top" align="center">1.311</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">2</td>
<td valign="top" align="center">1.557</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="center">10</td>
<td valign="top" align="center">3.551</td>
</tr>
</tbody>
</table></table-wrap>
<p>For self-attention, Lin et al. (<xref ref-type="bibr" rid="B5">5</xref>) proposed having two weight matrices: <italic>Ws</italic><sub>1</sub> with dimension (<italic>d</italic><sub><italic>a</italic></sub>, 2<italic>l</italic>) and <italic>Ws</italic><sub>2</sub> with dimension (<italic>r</italic>, <italic>d</italic><sub><italic>a</italic></sub>), where <italic>d</italic><sub><italic>a</italic></sub> is the hidden dimension of the self-attention mechanism and <italic>r</italic> is the number of attention vectors for a given text input. We then apply the following set of operations to produce the attention matrix for input text <italic>T</italic>.</p>
<disp-formula id="S4.Ex7">
<mml:math id="M7">
<mml:mrow>
<mml:msub>
<mml:mi>H</mml:mi>
<mml:mi>a</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mi>tanh</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:msup>
<mml:mi>H</mml:mi>
<mml:mi>T</mml:mi>
</mml:msup>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Here <italic>H</italic><sub><italic>a</italic></sub> has dimensions (<italic>d</italic><sub><italic>a</italic></sub>, <italic>n</italic>).</p>
<disp-formula id="S4.Ex8">
<mml:math id="M8">
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mi>softmax</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>W</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>&#x2062;</mml:mo>
<mml:msub>
<mml:mi>H</mml:mi>
<mml:mi>a</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>Finally, we compute sentence/document-level embeddings</p>
<disp-formula id="S4.Ex9">
<mml:math id="M9">
<mml:mrow>
<mml:mpadded width="+3.3pt">
<mml:mi>M</mml:mi>
</mml:mpadded>
<mml:mo rspace="5.8pt">=</mml:mo>
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mo>&#x2062;</mml:mo>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p><italic>A</italic> has dimensions (<italic>r</italic>, <italic>n</italic>) and <italic>M</italic> has dimensions (<italic>r</italic>, 2<italic>l</italic>); the softmax applied along the second dimension of <italic>A</italic> normalizes the attention weights so that they sum to 1 for each attention vector of length <italic>n</italic>.</p>
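<p>The three operations above can be checked numerically with a short PyTorch sketch (toy dimensions and our own function name):</p>

```python
import torch
import torch.nn.functional as F

def structured_self_attention(H, Ws1, Ws2):
    # H: (n, 2l) BiLSTM hidden states; Ws1: (d_a, 2l); Ws2: (r, d_a)
    Ha = torch.tanh(Ws1 @ H.T)        # (d_a, n)
    A = F.softmax(Ws2 @ Ha, dim=1)    # (r, n): each row sums to 1
    M = A @ H                         # (r, 2l) sentence embedding
    return A, M

H = torch.randn(7, 10)                # n = 7 tokens, 2l = 10
A, M = structured_self_attention(H, torch.randn(5, 10), torch.randn(3, 5))
```
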
<p>Lin et al. (<xref ref-type="bibr" rid="B5">5</xref>) also proposed a penalization term, in place of regularization, to counterbalance redundancy in the embedding matrix <italic>M</italic> when the attention mechanism produces the same summation weights for all <italic>r</italic> hops. We initially set this penalization term to 0.0; however, since self-attention generally works well at capturing long-term dependencies, the neural network started to overfit after a few epochs of training. We began with the same hyperparameter settings for the self-attention block as described by Lin et al. (<xref ref-type="bibr" rid="B5">5</xref>), setting <italic>r</italic> = 30 and no penalization, and then searched for the best values while monitoring validation set accuracy. The resulting configuration uses a hidden dimension of 300 for self-attention and eight BiLSTM layers with a hidden dimension of 32; the output of the self-attention mechanism (the sentence embeddings <italic>M</italic>) goes into a fully connected layer with a hidden dimension of 2000. Finally, we feed the fully connected layer&#x2019;s output to an output layer with sigmoid activation. The choice of loss function, learning rate, and optimizer remains unchanged from the baseline, and the number of epochs is 20.
With these hyperparameters, the model started to overfit on the training data after a few epochs, reaching almost 99% train accuracy and an average epoch train loss below 0.5. To diagnose this, we visually inspected the attention matrices of several test-set examples predicted with confidence &#x003E;0.90. The attention mechanism worked as expected for longer sequences; however, as sequence length decreased, it produced roughly equal summation weights across all <italic>r</italic> hops, which intuitively makes sense for short sequences, where all tokens carry more semantic information. This results in redundancy in the attention matrix <italic>A</italic> and in the embedding matrix <italic>M</italic>. Below we present some examples from the Hindi test set. Since showing all the vectors would be redundant, we present only five vectors per sequence, even though we had set <italic>r</italic> to 30 and thus had 30 vectors for each sequence.</p>
<p>We performed the same analysis for the Bengali data; the following are examples of Bengali sequences.</p>
<p>To counterbalance the redundancy in the attention matrix, we increased the penalization coefficient of the attention mechanism and found that a value of 0.6 produced the best validation set accuracy. Next, we reduced the number of attention hops, i.e., the hyperparameter <italic>r</italic>, and observed that the network with <italic>r</italic> = 20 and an attention hidden size of 150 performed better on the validation set than <italic>r</italic> = 30 with hidden size 200, as suggested in the original work. Also, to avoid overfitting in the BiLSTM block, we used dropout in the BiLSTM layers with <italic>p</italic> = 0.5.</p>
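<p>The penalization term of Lin et al. (<xref ref-type="bibr" rid="B5">5</xref>) is the squared Frobenius norm ||<italic>AA</italic><sup><italic>T</italic></sup> &#x2212; <italic>I</italic>||<sup>2</sup>, added to the loss scaled by the penalization coefficient (0.6 in our final setting). A minimal NumPy sketch follows; the example matrices are hypothetical row-stochastic attention matrices chosen to show that redundant hops are penalized more than diverse ones.</p>

```python
import numpy as np

def attention_penalty(A):
    """Frobenius-norm penalization: P = ||A A^T - I||_F^2.

    A is an (r, n) attention matrix whose rows each sum to 1.
    Driving A A^T toward the identity pushes the r hops to attend
    to different tokens, reducing redundancy in M = A H.
    """
    r = A.shape[0]
    diff = A @ A.T - np.eye(r)
    return np.sum(diff ** 2)

# Redundant hops: every row spreads weight equally over 5 tokens.
uniform = np.full((3, 5), 0.2)
# Diverse hops: each row attends to a single distinct token.
diverse = np.eye(3, 5)

assert attention_penalty(diverse) < attention_penalty(uniform)
```

During training, this term would be added to the classification loss as <monospace>loss = bce + 0.6 * attention_penalty(A)</monospace>, so that minimizing the loss also discourages identical summation weights across hops.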
</sec>
<sec id="S5" sec-type="results">
<title>Results</title>
<p><italic>LSTM Bengali + Pret</italic> in <xref ref-type="table" rid="T2">Table 2</xref> refers to the model which shares task-specific pretrained weights from the LSTM Hindi classifier. <italic>SA + JDIL</italic> is our method, which uses self-attention with joint dual input learning to train a joint neural network. Results in <xref ref-type="table" rid="T2">Table 2</xref> empirically show that joint learning can benefit both the pretraining task and the downstream task. Moreover, since the pretraining model <italic>LSTM Hindi</italic> must train from randomly initialized weights, the pretraining network cannot benefit from the downstream task in that setting; our proposed approach makes this possible, which results in a meaningful performance gain for the pretraining task on the performance metrics.</p>
<table-wrap position="float" id="T2">
<label>TABLE 2</label>
<caption><p>Results of evaluating the binary neural Hindi and Bengali sentiment classifiers on their respective test sets.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Model</td>
<td valign="top" align="center">Accuracy</td>
<td valign="top" align="center">Precision</td>
<td valign="top" align="center">Recall</td>
<td valign="top" align="center">F-1 score</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">LSTM-Hindi</td>
<td valign="top" align="center">0.74</td>
<td valign="top" align="center">0.74</td>
<td valign="top" align="center">0.74</td>
<td valign="top" align="center">0.74</td>
</tr>
<tr>
<td valign="top" align="left">LSTM-Bengali + Pret</td>
<td valign="top" align="center">0.77</td>
<td valign="top" align="center">0.77</td>
<td valign="top" align="center">0.77</td>
<td valign="top" align="center">0.77</td>
</tr>
<tr>
<td valign="top" align="left">SA + JDIL (Hindi)</td>
<td valign="top" align="center"><bold>0.76</bold></td>
<td valign="top" align="center"><bold>0.76</bold></td>
<td valign="top" align="center"><bold>0.76</bold></td>
<td valign="top" align="center"><bold>0.76</bold></td>
</tr>
<tr>
<td valign="top" align="left">SA + JDIL (Bengali)</td>
<td valign="top" align="center"><bold>0.78</bold></td>
<td valign="top" align="center"><bold>0.78</bold></td>
<td valign="top" align="center"><bold>0.78</bold></td>
<td valign="top" align="center"><bold>0.78</bold></td>
</tr>
</tbody>
</table></table-wrap>
</sec>
<sec id="S6" sec-type="conclusion">
<title>Conclusion</title>
<p>In our study, we investigated whether self-attention can significantly enhance performance over a unidirectional LSTM in the binary classification setting. We also investigated how to perform transfer learning and joint dual input learning when the tasks are the same, namely binary classification in Hindi and Bengali. First, we found that if the sequences are not long, an LSTM can perform comparably to self-attention, since in most cases there are no very distant dependencies in the sequences. Second, we observed that transfer learning between similar or identical tasks can be a beneficial way of increasing the performance of the target task, which in our case was Bengali binary classification. Furthermore, by introducing the joint learning setting, in which we trained a single network for both tasks, the Hindi classification task, which was the source task in the transfer learning setting, also benefited with improved performance. Such an architecture additionally provides an implicit mechanism to avoid overfitting, as it roughly doubled the data set size when we trained a single network.</p>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="B1"><label>1.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mandla</surname> <given-names>T</given-names></name> <name><surname>Modha</surname> <given-names>S</given-names></name> <name><surname>Shahi</surname> <given-names>GK</given-names></name> <name><surname>Jaiswal</surname> <given-names>AK</given-names></name> <name><surname>Nandini</surname> <given-names>D</given-names></name> <name><surname>Patel</surname> <given-names>D</given-names></name><etal/></person-group> <article-title>Overview of the hasoc track at fire 2020: hate speech and offensive content identification in Indo-European languages.</article-title> <source><italic>arXiv</italic></source> [<comment>Preprint</comment>]. (<year>2021</year>). Available online at: <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2108.05927">https://doi.org/10.48550/arXiv.2108.05927</ext-link></citation></ref>
<ref id="B2"><label>2.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vaswani</surname> <given-names>A</given-names></name> <name><surname>Shazeer</surname> <given-names>N</given-names></name> <name><surname>Parmar</surname> <given-names>N</given-names></name> <name><surname>Uszkoreit</surname> <given-names>J</given-names></name> <name><surname>Jones</surname> <given-names>L</given-names></name> <name><surname>Gomez</surname> <given-names>AN</given-names></name><etal/></person-group> <article-title>Attention is all you need.</article-title> <source><italic>arXiv</italic></source> [<comment>Preprint</comment>]. (<year>2017</year>). Available online at: <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.1706.03762">https://doi.org/10.48550/arXiv.1706.03762</ext-link></citation></ref>
<ref id="B3"><label>3.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kalyan</surname> <given-names>KS</given-names></name> <name><surname>Rajasekharan</surname> <given-names>A</given-names></name> <name><surname>Sangeetha</surname> <given-names>S</given-names></name></person-group>. <article-title>Ammus: a survey of transformer-based pretrained models in natural language processing.</article-title> <source><italic>arXiv</italic></source> [<comment>Preprint</comment>]. (<year>2021</year>). Available online at: <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2108.05542">https://doi.org/10.48550/arXiv.2108.05542</ext-link></citation></ref>
<ref id="B4"><label>4.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tay</surname> <given-names>Y</given-names></name> <name><surname>Dehghani</surname> <given-names>M</given-names></name> <name><surname>Bahri</surname> <given-names>D</given-names></name> <name><surname>Metzler</surname> <given-names>D</given-names></name></person-group>. <article-title>Efficient transformers: a survey.</article-title> <source><italic>arXiv</italic></source> [<comment>Preprint</comment>]. (<year>2020</year>). Available online at: <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2009.06732">https://doi.org/10.48550/arXiv.2009.06732</ext-link></citation></ref>
<ref id="B5"><label>5.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>Z</given-names></name> <name><surname>Feng</surname> <given-names>M</given-names></name> <name><surname>Nogueira dos Santos</surname> <given-names>C</given-names></name> <name><surname>Yu</surname> <given-names>M</given-names></name> <name><surname>Xiang</surname> <given-names>B</given-names></name> <name><surname>Zhou</surname> <given-names>B</given-names></name><etal/></person-group> <article-title>A structured self-attentive sentence embedding.</article-title> <source><italic>arXiv</italic></source> [<comment>Preprint</comment>]. (<year>2017</year>). Available online at: <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.1703.03130">https://doi.org/10.48550/arXiv.1703.03130</ext-link></citation></ref>
<ref id="B6"><label>6.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kingma</surname> <given-names>DP</given-names></name> <name><surname>Ba</surname> <given-names>J</given-names></name></person-group>. <article-title>Adam: a method for stochastic optimization.</article-title> <source><italic>arXiv</italic></source> [<comment>Preprint</comment>]. (<year>2017</year>). Available online at: <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.1412.6980">https://doi.org/10.48550/arXiv.1412.6980</ext-link></citation></ref>
<ref id="B7"><label>7.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>J-H</given-names></name> <name><surname>Liu</surname> <given-names>T</given-names></name> <name><surname>Luo</surname> <given-names>X</given-names></name> <name><surname>Wang</surname> <given-names>L</given-names></name></person-group>. <article-title>An LSTM approach to short text sentiment classification with word embeddings.</article-title> <source><italic>Proceedings of the 30th conference on computational linguistics and speech processing (ROCLING 2018).</italic></source> <publisher-loc>Hsinchu</publisher-loc>: <publisher-name>The Association for Computational Linguistics and Chinese Language Processing</publisher-name> (<year>2018</year>). p. <fpage>214</fpage>&#x2013;<lpage>223</lpage>.</citation></ref>
</ref-list>
</back>
</article>
