<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v2.3 20070202//EN" "archivearticle.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="methods-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Bohr. Cs.</journal-id>
<journal-title>BOHR International Journal of Computer Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Bohr. Cs.</abbrev-journal-title>
<issn pub-type="epub">2583-455X</issn>
<publisher>
<publisher-name>BOHR</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.54646/bijcs.2022.04</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Methods</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>White-box attacks on hate-speech BERT classifiers in German with explicit and implicit character-level defense</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Khan</surname> <given-names>Shahrukh</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Shahid</surname> <given-names>Mahnoor</given-names></name>
</contrib>
<contrib contrib-type="author">
<name><surname>Singh</surname> <given-names>Navdeeppal</given-names></name>
</contrib>
</contrib-group>
<author-notes>
<corresp id="c001">&#x002A;Correspondence: Shahrukh Khan, <email>shkh00001@stud.uni-saarland.de</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>15</day>
<month>03</month>
<year>2022</year>
</pub-date>
<volume>1</volume>
<issue>1</issue>
<fpage>26</fpage>
<lpage>31</lpage>
<history>
<date date-type="received">
<day>24</day>
<month>02</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>02</day>
<month>03</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2022 Khan, Shahid and Singh.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Khan, Shahid and Singh</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by-nc-nd/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Attention-based transformer models have achieved state-of-the-art results in natural language processing (NLP). However, recent work shows that the underlying attention mechanism can be exploited by adversaries to craft malicious inputs designed to induce spurious outputs, thereby harming model performance and trustworthiness. Unlike in the vision domain, the literature examining neural networks under adversarial conditions in the NLP domain is limited, and most of it focuses on the English language. In this article, we first analyze the adversarial robustness of Bidirectional Encoder Representations from Transformers (BERT) models for German data sets. Second, we introduce two novel NLP attacks: a character-level and a word-level attack, both of which utilize attention scores to calculate where to inject character-level and word-level noise, respectively. Finally, we present two defense strategies against the attacks above. The first, an implicit character-level defense, is a variant of adversarial training that trains a new classifier capable of abstaining from/rejecting certain (ideally adversarial) inputs. The second, an explicit character-level defense, learns a latent representation of the complete training data vocabulary and then maps all tokens of an input example to the same latent space, enabling the replacement of all out-of-vocabulary tokens with the most similar in-vocabulary tokens based on the cosine similarity metric.</p>
</abstract>
<counts>
<fig-count count="7"/>
<table-count count="4"/>
<equation-count count="0"/>
<ref-count count="23"/>
<page-count count="6"/>
<word-count count="2807"/>
</counts>
</article-meta>
</front>
<body>
<sec id="S1" sec-type="intro">
<title>Introduction</title>
<p>Natural language processing (NLP) has achieved tremendous progress, surpassing human-level baselines on a plethora of language tasks with the help of attention-based neural architectures (<xref ref-type="bibr" rid="B1">1</xref>). However, recent studies (<xref ref-type="bibr" rid="B2">2</xref>&#x2013;<xref ref-type="bibr" rid="B4">4</xref>) show that such neural models trained via transfer learning are susceptible to adversarial noise. This presents a new challenge: adversaries pose a realistic threat to a machine learning system&#x2019;s utility, because attention attributions can potentially be exploited to craft attacks that succeed against the victim neural network with the least perturbation budget and compute. Moreover, to the best of our knowledge, most existing work concentrates on English language corpora.</p>
<p>Adversarial attacks on machine learning models can be defended against while minimizing the risk of degrading the model&#x2019;s utility and performance. We propose two novel defense strategies: an implicit and an explicit character-level defense. The implicit character-level defense is a variant of adversarial training in which adversarial text sequences are generated via the white-box character-level attack and mapped to a new abstain class, after which the model is retrained. The explicit character-level defense instead pre-processes each text sequence prior to inference to eliminate adversarial signals, thereby transforming adversarial inputs into benign ones.</p>
</sec>
<sec id="S2">
<title>Literature survey</title>
<p>Hsieh et al. (<xref ref-type="bibr" rid="B2">2</xref>) proposed using self-attention scores to compute token importances and thereby rank potential candidate tokens for perturbation. One shortcoming of their approach is that they replace the candidate token with a random token from the vocabulary, which may change the semantic meaning of the perturbed sample. Garg et al. (<xref ref-type="bibr" rid="B3">3</xref>) proposed BERT-based Adversarial Examples for Text Classification, which employs Masked Language Modeling (MLM) to generate potential word replacements in a black-box setting. Finally, Pruthi et al. (<xref ref-type="bibr" rid="B4">4</xref>) showed the susceptibility of BERT (<xref ref-type="bibr" rid="B5">5</xref>)-based models to character-level misspellings, also in a black-box setting. In our study, we employ both character-level and word-level attacks in a white-box setting.</p>
</sec>
<sec id="S3">
<title>Problem statement</title>
<p>We use the attention mechanism in a transfer-learning setting to craft word-level and character-level adversarial attacks on neural networks, and we evaluate and compare the robustness of two novel character-level adversarial defenses.</p>
</sec>
<sec id="S4">
<title>Experimental setting</title>
<sec id="S4.SS1">
<title>Undefended models</title>
<sec id="S4.SS1.SSS1">
<title>Data sets</title>
<p>We present our work on the HASOC 2019 (German language) (<xref ref-type="bibr" rid="B6">6</xref>) and GermEval 2021 (<xref ref-type="bibr" rid="B7">7</xref>) data sets. Both sub-tasks are binary classification tasks in which positive labels correspond to hate-speech examples and negative labels to non-hate-speech examples (<xref ref-type="table" rid="T1">Table 1</xref>).</p>
<table-wrap position="float" id="T1">
<label>TABLE 1</label>
<caption><p>Data set statistics.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Data set</td>
<td valign="top" align="center">Train</td>
<td valign="top" align="center">Validation</td>
<td valign="top" align="center">Test</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">HASOC 2019</td>
<td valign="top" align="center">3054</td>
<td valign="top" align="center">765</td>
<td valign="top" align="center">850</td>
</tr>
<tr>
<td valign="top" align="left">GermEval 2021</td>
<td valign="top" align="center">2594</td>
<td valign="top" align="center">650</td>
<td valign="top" align="center">944</td>
</tr>
</tbody>
</table></table-wrap>
</sec>
<sec id="S4.SS1.SSS2">
<title>Training</title>
<p>To train the undefended models, we fine-tune the GBERT (<xref ref-type="bibr" rid="B8">8</xref>) language model for German, which employs the training strategies Whole Word Masking (WWM) and evaluation-driven training and currently achieves state-of-the-art (SoTA) performance on document classification for German. We obtain the following accuracy scores for each data set (<xref ref-type="table" rid="T2">Table 2</xref>).</p>
<table-wrap position="float" id="T2">
<label>TABLE 2</label>
<caption><p>Undefended models.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Data set</td>
<td valign="top" align="center">Accuracy (%)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">HASOC 2019</td>
<td valign="top" align="center">84</td>
</tr>
<tr>
<td valign="top" align="left">GermEval 2021</td>
<td valign="top" align="center">69</td>
</tr>
</tbody>
</table></table-wrap>
</sec>
</sec>
<sec id="S4.SS2">
<title>Attacks</title>
<sec id="S4.SS2.SSS1">
<title>Baseline word-level white-box attack</title>
<p>The baseline word-level attack enhances the token-candidate scheme proposed by Hsieh et al. (<xref ref-type="bibr" rid="B2">2</xref>), which replaces tokens, sorted by their attention scores, with random tokens from the vocabulary and may therefore leave the perturbed sequence semantically dissimilar to the source sequence. We address this shortcoming by using a masked language model (MLM) to generate a potential candidate for each token, ranked in order of attention scores. Furthermore, instead of only performing the replacement operation, we adopt the perturbation scheme proposed by Garg et al. (<xref ref-type="bibr" rid="B3">3</xref>) and insert MLM-generated candidate tokens to the left/right of the target token.</p>
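<p>The candidate-ranking and insertion step above can be sketched as follows. This is a minimal illustration in which <monospace>propose</monospace> is a hypothetical callback standing in for the MLM; it is not the exact implementation.</p>

```python
def rank_by_attention(tokens, attention_scores):
    """Indices of tokens, sorted from highest to lowest attention score."""
    return sorted(range(len(tokens)), key=lambda i: attention_scores[i], reverse=True)

def perturb(tokens, attention_scores, propose, budget=2):
    """Insert an MLM-proposed candidate next to the most-attended tokens.

    `propose(tokens, i)` is a hypothetical stand-in for the masked language
    model that returns a candidate token for position i.
    """
    targets = rank_by_attention(tokens, attention_scores)[:budget]
    result = list(tokens)
    # Insert at the highest positions first so earlier indices remain valid.
    for i in sorted(targets, reverse=True):
        result.insert(i + 1, propose(tokens, i))
    return result
```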
</sec>
<sec id="S4.SS2.SSS2">
<title>Word-level white-box attack</title>
<p>The main motivation behind this attack is that relying on a language model alone to ensure semantic correctness of the adversarial sequences is not enough, since it depends heavily on the vocabulary of the pretrained language model. We improve the baseline attack to preserve more of the semantic and syntactic correctness of the source sequence by imposing further constraints on the generated sequence. First, we compute document-level embeddings for both the perturbed and source sequences and accept a candidate only if their cosine similarity exceeds a minimum threshold of 0.9363, as originally suggested by Jin et al. (<xref ref-type="bibr" rid="B9">9</xref>) and also adopted by Garg et al. (<xref ref-type="bibr" rid="B3">3</xref>). Second, we add the constraint that the part-of-speech (POS) tags of the candidate and target tokens must be the same.</p>
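<p>The acceptance check for a candidate can be sketched as below; the embedding and POS-tagging models are abstracted away, and only the thresholding logic described in this section is shown.</p>

```python
import math

SIM_THRESHOLD = 0.9363  # minimum cosine similarity, following Jin et al. (9)

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def accept_candidate(candidate_pos, target_pos, emb_perturbed, emb_source):
    """Accept a perturbation only if the POS tag of the target token is
    preserved and the document-level embeddings of the perturbed and source
    sequences stay above the similarity threshold."""
    if candidate_pos != target_pos:
        return False
    return cosine(emb_perturbed, emb_source) >= SIM_THRESHOLD
```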
</sec>
<sec id="S4.SS2.SSS3">
<title>Character-level white-box attack</title>
<p>In this white-box character-level attack, attention scores are again used to obtain word importances, as in the word-level attacks above. Ordering words from higher to lower importance, we employ the character perturbation scheme of Pruthi et al. (<xref ref-type="bibr" rid="B4">4</xref>), who evaluated it only in the black-box setting. We perturb the target token with single-character modifications (e.g., swap, insert, and delete), choosing the modification that maximizes the change in the model&#x2019;s original prediction confidence with a limited number of edits. These modifications prove to be significantly effective, as outlined in section &#x201C;Results.&#x201D;</p>
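<p>The three character-level edit operations, and the greedy choice among them, can be sketched as follows. The victim model&#x2019;s confidence is abstracted into a callback, so this is an illustration of the scheme rather than the exact attack code.</p>

```python
def swap(word, i):
    """Swap the characters at positions i and i+1."""
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def insert_char(word, i, ch):
    """Insert character ch at position i."""
    return word[:i] + ch + word[i:]

def delete(word, i):
    """Delete the character at position i."""
    return word[:i] + word[i + 1:]

def best_perturbation(word, confidence):
    """Return the single-edit variant that minimizes the model's confidence
    in its original prediction. `confidence` stands in for a query to the
    victim model."""
    variants = [swap(word, i) for i in range(len(word) - 1)]
    variants += [delete(word, i) for i in range(len(word))]
    return min(variants, key=confidence)
```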
</sec>
</sec>
<sec id="S4.SS3">
<title>Defenses</title>
<sec id="S4.SS3.SSS1">
<title>Abstain-based training</title>
<p>In several past evaluations and benchmarks of defenses against adversarial examples (<xref ref-type="bibr" rid="B10">10</xref>&#x2013;<xref ref-type="bibr" rid="B15">15</xref>), adversarial training (<xref ref-type="bibr" rid="B16">16</xref>) has been found to be one of the best ways of conferring robustness. However, it is computationally expensive due to the need to create adversarial examples during training. Thus, we chose to employ a detection-based defense, which we call abstain-based training. Although detection-based defenses are known to be less effective than adversarial training (<xref ref-type="bibr" rid="B11">11</xref>, <xref ref-type="bibr" rid="B15">15</xref>), we believe our method still delivers insights into the capability of BERT models to recognize adversarial examples, since it operates similarly to adversarial training. In contrast to other detection-based defenses in the literature (<xref ref-type="bibr" rid="B17">17</xref>&#x2013;<xref ref-type="bibr" rid="B21">21</xref>), the approach is much simpler. It works as follows.</p>
<p>Let C be the trained undefended classifier. We create a new (untrained) classifier <italic>C</italic>&#x2032; from C by extending the number of classes it can predict by one. The new class is labeled &#x201C;ABSTAIN,&#x201D; representing that the classifier abstains from making a prediction. Using C, we create the adversarial examples. We mix these, assigned the abstain label, with the normal examples from the data set of C to create a new data set, on which we then train <italic>C</italic>&#x2032;. We applied this defense strategy to the models from section &#x201C;Training&#x201D; and present the results in <xref ref-type="table" rid="T3">Table 3</xref>. We also show the classification attributions in <xref ref-type="fig" rid="F1">Figure 1</xref> to help interpret the models&#x2019; behavior.</p>
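<p>The construction of the abstain-augmented training set can be sketched as follows; loading the clean data set and generating the adversarial texts with the character-level attack are assumed to have happened already, and the function names are illustrative.</p>

```python
import random

ABSTAIN = "ABSTAIN"  # the extra class added to the extended classifier

def build_abstain_dataset(clean_examples, adversarial_texts, seed=0):
    """Mix clean (text, label) pairs with adversarial texts relabeled to the
    ABSTAIN class; the classifier extended by one output class is then
    retrained on the result."""
    data = list(clean_examples)
    data += [(text, ABSTAIN) for text in adversarial_texts]
    random.Random(seed).shuffle(data)  # deterministic shuffle for the sketch
    return data
```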
<table-wrap position="float" id="T3">
<label>TABLE 3</label>
<caption><p>Character-level attack on defended models.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Data set</td>
<td valign="top" align="center">Defense</td>
<td valign="top" align="center">Attack success rate (%)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">HASOC 2019</td>
<td valign="top" align="center">Explicit character level</td>
<td valign="top" align="center">9.5</td>
</tr>
<tr>
<td valign="top" align="left">GermEval 2021</td>
<td/>
<td valign="top" align="center"><bold>5.3</bold></td>
</tr>
<tr>
<td valign="top" align="left">HASOC 2019</td>
<td valign="top" align="center">Implicit abstain based</td>
<td valign="top" align="center"><bold>1</bold></td>
</tr>
<tr>
<td valign="top" align="left">GermEval 2021</td>
<td/>
<td valign="top" align="center">11.1</td>
</tr>
</tbody>
</table></table-wrap>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p>Visualization of the classification attributions of the abstain-based trained models, which correctly classify the examples. The perturbed examples shown above fool the normally trained models. We observe that the attributions are much more spread out when a model encounters a perturbed example. (Words were split by the tokenizer, so a single word can have several sub-attributions.)</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijcs-2022-04-g001.tif"/>
</fig>
</sec>
<sec id="S4.SS3.SSS2">
<title>Explicit character-level defense</title>
<p>The abstain-based training defense achieves high success in defending against adversarial character-level perturbed inputs. However, it degrades system utility, since the model makes no useful prediction when the input is perturbed at the character level. To overcome this drawback, we propose the explicit character-level defense, an unsupervised approach that assumes</p>
<p>&#x2200; t &#x2208; <italic>T<sub><italic>input</italic></sub></italic> : t &#x2208; <italic>V<sub><italic>train.</italic></sub></italic></p>
<p>Here, <italic>V</italic><sub><italic>train</italic></sub> is the set of all tokens present in the training set; replacing it with the set of all words in the given language (e.g., all German words) would yield better results. <italic>T</italic><sub><italic>input</italic></sub> is the set of tokens in the input sequence, and we assume the worst case, i.e., that <italic>T</italic><sub><italic>input</italic></sub> is perturbed with character-level noise.</p>
<p>In this defense method, we re-purpose the Sentence-BERT (<xref ref-type="bibr" rid="B22">22</xref>) architecture, which was originally trained on sentence pairs to compute semantic vector representations and achieved SoTA results on multiple information-retrieval data sets. We change the input to the character level by feeding word pairs to the network. Concretely, we use the Birkbeck spelling error corpus (<xref ref-type="bibr" rid="B23">23</xref>), which contains word pairs of one correct and one misspelled word, and we label each pair based on the Levenshtein distance between its members. The schematics of our neural approach are given in <xref ref-type="fig" rid="F2">Figure 2</xref>.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p>Sentence-BERT for character level similarity.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijcs-2022-04-g002.tif"/>
</fig>
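<p>Labeling the word pairs by edit distance can be sketched as follows. The normalization into a [0, 1] similarity score for the pairwise training objective is an assumption for illustration; only the Levenshtein distance itself is taken from this section.</p>

```python
def levenshtein(a, b):
    """Edit distance between strings a and b via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[len(b)]

def pair_label(correct, misspelled, max_distance=4):
    """Map a (correct, misspelled) pair to a similarity label in [0, 1];
    `max_distance` is a hypothetical normalization constant."""
    return max(0.0, 1.0 - levenshtein(correct, misspelled) / max_distance)
```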
<p>The main idea behind the neural approach is to project similarly spelled words close to each other in the vector space. Algorithm 1 outlines the main idea of our explicit character-level defense.</p>
<table-wrap position="float" id="A1">
<label>Algorithm 1</label>
<caption><p>Explicit character-level defense.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<tbody>
<tr>
<td valign="top" align="left"><monospace><bold>begin:</bold></monospace><break/> <monospace><italic>V</italic><sub><italic>train</italic></sub> &#x2190; t<sub>1</sub>&#x2026;t<sub><italic>m</italic></sub> &#x003E; Set of tokens in vocabulary</monospace><break/> <monospace><italic>E</italic><sub><italic>v</italic></sub> &#x2190; e<sub>1</sub>&#x2026;e<sub><italic>m</italic></sub> &#x003E; Embeddings of vocabulary</monospace><break/> <monospace><italic>T</italic><sub><italic>input</italic></sub> &#x2190; t<sub>1</sub>&#x2026;t<sub><italic>j</italic></sub> &#x003E; Set of tokens in input</monospace><break/> <monospace><bold>for</bold> k &#x2190; 1 to <italic>j</italic> <bold>do</bold></monospace><break/> <monospace>&#x00A0;&#x00A0;e<sub><italic>k</italic></sub> &#x2190; v<sub>1</sub>&#x2026;v<sub><italic>n</italic></sub> &#x003E; Get embedding of input token k</monospace><break/> <monospace>&#x00A0;&#x00A0;scores &#x2190; cos(<italic>E</italic><sub><italic>v</italic></sub>, e<sub><italic>k</italic></sub>) &#x003E; Cosine similarity with vocabulary embeddings</monospace><break/> <monospace>&#x00A0;&#x00A0;<bold>if</bold> max scores &#x003E; 0.7 <bold>and</bold> max scores &#x003C; 1.0 <bold>then</bold></monospace><break/> <monospace>&#x00A0;&#x00A0;&#x00A0;&#x00A0;vocab<sub><italic>index</italic></sub> &#x2190; arg max scores;</monospace><break/> <monospace>&#x00A0;&#x00A0;&#x00A0;&#x00A0;<italic>T</italic><sub><italic>input</italic></sub>[k] &#x2190; <italic>V</italic><sub><italic>train</italic></sub>[vocab<sub><italic>index</italic></sub>]</monospace><break/> <monospace><bold>end for</bold></monospace></td>
</tr>
</tbody>
</table>
</table-wrap>
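<p>A minimal Python sketch of Algorithm 1 is given below, assuming the vocabulary embeddings and an <monospace>embed</monospace> function (the trained character-level Sentence-BERT encoder, abstracted here) are available.</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def defend(input_tokens, embed, vocab, vocab_embeddings, lower=0.7):
    """Replace each token whose nearest vocabulary neighbour has cosine
    similarity strictly between `lower` and 1.0 with that neighbour;
    exact matches (similarity 1.0) and distant tokens are left unchanged."""
    defended = []
    for token in input_tokens:
        e = embed(token)
        scores = [cosine(ev, e) for ev in vocab_embeddings]
        best = max(range(len(vocab)), key=lambda i: scores[i])
        if scores[best] > lower and 1.0 > scores[best]:
            defended.append(vocab[best])
        else:
            defended.append(token)
    return defended
```

The (0.7, 1.0) band mirrors the condition in Algorithm 1: a similarity of exactly 1.0 means the token is already in-vocabulary, while anything below the lower bound is too dissimilar to map safely.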
</sec>
</sec>
</sec>
<sec id="S5" sec-type="results">
<title>Results</title>
<sec id="S5.SS1">
<title>Attack results</title>
<p><xref ref-type="table" rid="T4">Table 4</xref> shows the character-level attacks to be most effective on both models.</p>
<table-wrap position="float" id="T4">
<label>TABLE 4</label>
<caption><p>Attacks result on undefended models.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left">Data set</td>
<td valign="top" align="center">Attack</td>
<td valign="top" align="center">Success rate (%)</td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">HASOC 2019</td>
<td valign="top" align="center">Baseline</td>
<td valign="top" align="center">8.49</td>
</tr>
<tr>
<td valign="top" align="left">GermEval 2021</td>
<td/>
<td valign="top" align="center">60.3</td>
</tr>
<tr>
<td valign="top" align="left">HASOC 2019</td>
<td valign="top" align="center">Word-level</td>
<td valign="top" align="center">4.03</td>
</tr>
<tr>
<td valign="top" align="left">GermEval 2021</td>
<td/>
<td valign="top" align="center">49.8</td>
</tr>
<tr>
<td valign="top" align="left">HASOC 2019</td>
<td valign="top" align="center">Character-level</td>
<td valign="top" align="center"><bold>73.1</bold></td>
</tr>
<tr>
<td valign="top" align="left">GermEval 2021</td>
<td/>
<td valign="top" align="center"><bold>93.5</bold></td>
</tr>
</tbody>
</table></table-wrap>
<p><xref ref-type="fig" rid="F3">Figure 3</xref> illustrates how the number of queries required per sample for a successful attack depends on the data set and attack type. <xref ref-type="fig" rid="F4">Figure 4</xref> further shows that both word-level attacks require more queries for longer sequences, whereas the character-level attack is largely agnostic to sequence length. <xref ref-type="fig" rid="F5">Figure 5</xref> shows that the character-level attack requires a minimal amount of perturbation, since its changes are at the character level; moreover, from <xref ref-type="fig" rid="F6">Figure 6</xref> it can be concluded that the character-level attack also causes the largest change in model prediction confidence in the case of a successful attack.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption><p>Average number of queries per successful attack.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijcs-2022-04-g003.tif"/>
</fig>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption><p>Pearson correlation between original text length and number of queries for attack success.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijcs-2022-04-g004.tif"/>
</fig>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption><p>Levenshtein distance-based similarity between original and perturbed sequences.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijcs-2022-04-g005.tif"/>
</fig>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption><p>Confidence Delta between original and perturbed sequences caused by each attack.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijcs-2022-04-g006.tif"/>
</fig>
</sec>
<sec id="S5.SS2">
<title>Defense results</title>
<p><xref ref-type="table" rid="T3">Table 3</xref> reports the success rate of the character-level attack against the defended models.</p>
</sec>
</sec>
<sec id="S6" sec-type="conclusion">
<title>Conclusion</title>
<p>We show that self-attentive models are more susceptible to character-level adversarial attacks than to word-level attacks on the text classification NLP task. We provide two potential ways to defend against character-level attacks. Future work could enhance the explicit character-level defense with supervised sequence-to-sequence neural approaches. As can be seen in <xref ref-type="fig" rid="F7">Figure 7</xref>, for GermEval 2021 the current approach raises the Jaccard similarity of defended sequences with the original sequences compared to the Jaccard similarity between the original and perturbed sequences. For the HASOC 2019 data set, however, the abundance of out-of-vocabulary tokens in the unseen test set means the defense degrades the quality of defended sequences. Even then, the defense proves to be quite robust against character-level adversarial examples, as can be seen in <xref ref-type="table" rid="T3">Table 3</xref>.</p>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption><p>Jaccard similarity between original and perturbed text vs. the original and defended text.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="bijcs-2022-04-g007.tif"/>
</fig>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="B1"><label>1.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vaswani</surname> <given-names>A</given-names></name> <name><surname>Shazeer</surname> <given-names>N</given-names></name> <name><surname>Parmar</surname> <given-names>N</given-names></name> <name><surname>Uszkoreit</surname> <given-names>J</given-names></name> <name><surname>Jones</surname> <given-names>L</given-names></name> <name><surname>Gomez</surname> <given-names>A</given-names></name><etal/></person-group> <article-title>Attention is all you need.</article-title> <source><italic>ArXiv [Preprint]</italic></source> (<year>2017</year>): <pub-id pub-id-type="doi">10.48550/arXiv.1706.03762</pub-id></citation></ref>
<ref id="B2"><label>2.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hsieh</surname> <given-names>Y</given-names></name> <name><surname>Cheng</surname> <given-names>M</given-names></name> <name><surname>Juan</surname> <given-names>D</given-names></name> <name><surname>Wei</surname> <given-names>W</given-names></name> <name><surname>Hsu</surname> <given-names>W</given-names></name> <name><surname>Hsieh</surname> <given-names>C</given-names></name></person-group>. <article-title>On the robustness of self-attentive models.</article-title> <source><italic>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.</italic></source> <publisher-loc>Florence</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name> (<year>2019</year>). p. <fpage>1520</fpage>&#x2013;<lpage>9</lpage>.</citation></ref>
<ref id="B3"><label>3.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Garg</surname> <given-names>S</given-names></name> <name><surname>Ramakrishnan</surname> <given-names>G</given-names></name></person-group>. <article-title>BAE: BERT-based adversarial examples for text classification.</article-title> <source><italic>ArXiv [Preprint]</italic></source> (<year>2020</year>): <pub-id pub-id-type="doi">10.48550/arXiv.2004.01970</pub-id></citation></ref>
<ref id="B4"><label>4.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pruthi</surname> <given-names>D</given-names></name> <name><surname>Dhingra</surname> <given-names>B</given-names></name> <name><surname>Lipton</surname> <given-names>Z</given-names></name></person-group>. <article-title>Combating adversarial misspellings with robust word recognition.</article-title> <source><italic>ArXiv [Preprint]</italic></source> (<year>2019</year>): <pub-id pub-id-type="doi">10.48550/arXiv.1905.11268</pub-id></citation></ref>
<ref id="B5"><label>5.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Devlin</surname> <given-names>J</given-names></name> <name><surname>Chang</surname> <given-names>M</given-names></name> <name><surname>Lee</surname> <given-names>K</given-names></name> <name><surname>Toutanova</surname> <given-names>K</given-names></name></person-group>. <article-title>BERT: pre-training of deep bidirectional transformers for language understanding.</article-title> <source><italic>ArXiv [Preprint]</italic></source> (<year>2018</year>): <pub-id pub-id-type="doi">10.48550/arXiv.1810.04805</pub-id></citation></ref>
<ref id="B6"><label>6.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mandl</surname> <given-names>T</given-names></name> <name><surname>Modha</surname> <given-names>S</given-names></name> <name><surname>Majumder</surname> <given-names>P</given-names></name> <name><surname>Patel</surname> <given-names>D</given-names></name> <name><surname>Dave</surname> <given-names>M</given-names></name> <name><surname>Mandlia</surname> <given-names>C</given-names></name><etal/></person-group> <article-title>Overview of the hasoc track at fire 2019: hate speech and offensive content identification in indo-european languages.</article-title> <source><italic>Proceedings of the 11th Forum for Information Retrieval Evaluation, FIRE &#x2019;19.</italic></source> <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Association for Computing Machinery</publisher-name> (<year>2019</year>). p. <fpage>14</fpage>&#x2013;<lpage>7</lpage>.</citation></ref>
<ref id="B7"><label>7.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Risch</surname> <given-names>J</given-names></name> <name><surname>Stoll</surname> <given-names>A</given-names></name> <name><surname>Wilms</surname> <given-names>L</given-names></name> <name><surname>Wiegand</surname> <given-names>M</given-names></name></person-group>. <article-title>Overview of the GermEval 2021 shared task on the identification of toxic, engaging, and fact-claiming comments.</article-title> <source><italic>Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments.</italic></source> <publisher-loc>Duesseldorf</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name> (<year>2021</year>). p. <fpage>1</fpage>&#x2013;<lpage>12</lpage>.</citation></ref>
<ref id="B8"><label>8.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chan</surname> <given-names>B</given-names></name> <name><surname>Schweter</surname> <given-names>S</given-names></name> <name><surname>Meiller</surname> <given-names>T</given-names></name></person-group>. <article-title>German&#x2019;s next language model.</article-title> <source><italic>ArXiv [Preprint]</italic></source> (<year>2020</year>): <pub-id pub-id-type="doi">10.48550/arXiv.2010.10906</pub-id></citation></ref>
<ref id="B9"><label>9.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jin</surname> <given-names>D</given-names></name> <name><surname>Jin</surname> <given-names>Z</given-names></name> <name><surname>Zhou</surname> <given-names>J</given-names></name> <name><surname>Szolovits</surname> <given-names>P</given-names></name></person-group>. <article-title>Is BERT really robust? Natural language attack on text classification and entailment.</article-title> <source><italic>ArXiv [Preprint]</italic></source> (<year>2019</year>): <pub-id pub-id-type="doi">10.48550/arXiv.1907.11932</pub-id></citation></ref>
<ref id="B10"><label>10.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Athalye</surname> <given-names>A</given-names></name> <name><surname>Carlini</surname> <given-names>N</given-names></name> <name><surname>Wagner</surname> <given-names>D</given-names></name></person-group>. <article-title>Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples.</article-title> <source><italic>Proceedings of the International conference on machine learning.</italic></source> <publisher-name>PMLR</publisher-name> (<year>2018</year>). p. <fpage>274</fpage>&#x2013;<lpage>83</lpage>.</citation></ref>
<ref id="B11"><label>11.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Carlini</surname> <given-names>N</given-names></name> <name><surname>Wagner</surname> <given-names>D</given-names></name></person-group>. <article-title>Adversarial examples are not easily detected: bypassing ten detection methods.</article-title> <source><italic>Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security.</italic></source> <publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name> (<year>2017</year>).</citation></ref>
<ref id="B12"><label>12.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Carlini</surname> <given-names>N</given-names></name> <name><surname>Wagner</surname> <given-names>D</given-names></name></person-group>. <article-title>Towards evaluating the robustness of neural networks.</article-title> <source><italic>Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP).</italic></source> <publisher-loc>San Jose, CA</publisher-loc>: <publisher-name>IEEE</publisher-name> (<year>2017</year>). p. <fpage>39</fpage>&#x2013;<lpage>57</lpage>.</citation></ref>
<ref id="B13"><label>13.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Croce</surname> <given-names>F</given-names></name> <name><surname>Hein</surname> <given-names>M</given-names></name></person-group>. <article-title>Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks.</article-title> <source><italic>Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research.</italic></source> <publisher-name>PMLR</publisher-name> (<year>2020</year>). p. <fpage>2206</fpage>&#x2013;<lpage>16</lpage>.</citation></ref>
<ref id="B14"><label>14.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Croce</surname> <given-names>F</given-names></name> <name><surname>Andriushchenko</surname> <given-names>M</given-names></name> <name><surname>Sehwag</surname> <given-names>V</given-names></name> <name><surname>Flammarion</surname> <given-names>N</given-names></name> <name><surname>Chiang</surname> <given-names>M</given-names></name> <name><surname>Mittal</surname> <given-names>P</given-names></name><etal/></person-group>. <article-title>RobustBench: a standardized adversarial robustness benchmark.</article-title> <source><italic>ArXiv [Preprint]</italic></source> (<year>2020</year>): <pub-id pub-id-type="doi">10.48550/arXiv.2010.09670</pub-id></citation></ref>
<ref id="B15"><label>15.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bryniarski</surname> <given-names>O</given-names></name> <name><surname>Hingun</surname> <given-names>N</given-names></name> <name><surname>Pachuca</surname> <given-names>P</given-names></name> <name><surname>Wang</surname> <given-names>V</given-names></name> <name><surname>Carlini</surname> <given-names>N</given-names></name></person-group>. <article-title>Evading adversarial example detection defenses with orthogonal projected gradient descent.</article-title> <source><italic>ArXiv [Preprint]</italic></source> (<year>2021</year>): <pub-id pub-id-type="doi">10.48550/arXiv.2106.15023</pub-id></citation></ref>
<ref id="B16"><label>16.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Madry</surname> <given-names>A</given-names></name> <name><surname>Makelov</surname> <given-names>A</given-names></name> <name><surname>Schmidt</surname> <given-names>L</given-names></name> <name><surname>Tsipras</surname> <given-names>D</given-names></name> <name><surname>Vladu</surname> <given-names>A</given-names></name></person-group>. <article-title>Towards deep learning models resistant to adversarial attacks.</article-title> <source><italic>ArXiv [Preprint]</italic></source> (<year>2018</year>): <pub-id pub-id-type="doi">10.48550/arXiv.1706.06083</pub-id></citation></ref>
<ref id="B17"><label>17.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Grosse</surname> <given-names>K</given-names></name> <name><surname>Manoharan</surname> <given-names>P</given-names></name> <name><surname>Papernot</surname> <given-names>N</given-names></name> <name><surname>Backes</surname> <given-names>M</given-names></name> <name><surname>McDaniel</surname> <given-names>P</given-names></name></person-group>. <article-title>On the (statistical) detection of adversarial examples.</article-title> <source><italic>ArXiv [Preprint]</italic></source> (<year>2017</year>): <pub-id pub-id-type="doi">10.48550/arXiv.1702.06280</pub-id></citation></ref>
<ref id="B18"><label>18.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gong</surname> <given-names>Z</given-names></name> <name><surname>Wang</surname> <given-names>W</given-names></name> <name><surname>Ku</surname> <given-names>W</given-names></name></person-group>. <article-title>Adversarial and clean data are not twins.</article-title> <source><italic>ArXiv [Preprint]</italic></source> (<year>2017</year>): <pub-id pub-id-type="doi">10.48550/arXiv.1704.04960</pub-id></citation></ref>
<ref id="B19"><label>19.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bendale</surname> <given-names>A</given-names></name> <name><surname>Boult</surname> <given-names>T</given-names></name></person-group>. <article-title>Towards open set deep networks.</article-title> <source><italic>Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016.</italic></source> <publisher-loc>Las Vegas, NV</publisher-loc>: <publisher-name>IEEE Computer Society</publisher-name> (<year>2016</year>). p. <fpage>1563</fpage>&#x2013;<lpage>72</lpage>.</citation></ref>
<ref id="B20"><label>20.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sotgiu</surname> <given-names>A</given-names></name> <name><surname>Demontis</surname> <given-names>A</given-names></name> <name><surname>Melis</surname> <given-names>M</given-names></name> <name><surname>Biggio</surname> <given-names>B</given-names></name> <name><surname>Fumera</surname> <given-names>G</given-names></name> <name><surname>Feng</surname> <given-names>X</given-names></name><etal/></person-group>. <article-title>Deep neural rejection against adversarial examples.</article-title> <source><italic>EURASIP J Inform Secur.</italic></source> (<year>2020</year>) <volume>2020</volume>:<fpage>1</fpage>&#x2013;<lpage>10</lpage>.</citation></ref>
<ref id="B21"><label>21.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Metzen</surname> <given-names>J</given-names></name> <name><surname>Genewein</surname> <given-names>T</given-names></name> <name><surname>Fischer</surname> <given-names>V</given-names></name> <name><surname>Bischoff</surname> <given-names>B</given-names></name></person-group>. <article-title>On detecting adversarial perturbations.</article-title> <source><italic>Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.</italic></source> <publisher-loc>Toulon</publisher-loc>: (<year>2017</year>).</citation></ref>
<ref id="B22"><label>22.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Reimers</surname> <given-names>N</given-names></name> <name><surname>Gurevych</surname> <given-names>I</given-names></name></person-group>. <article-title>Sentence-BERT: sentence embeddings using Siamese BERT-networks.</article-title> <source><italic>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.</italic></source> <publisher-loc>Stroudsburg, PA</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name> (<year>2019</year>). <fpage>11</fpage> p.</citation></ref>
<ref id="B23"><label>23.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mitton</surname> <given-names>R</given-names></name></person-group>. <source><italic>Birkbeck Spelling Error Corpus.</italic></source> <publisher-loc>Oxford</publisher-loc>: <publisher-name>Oxford Text Archive, University of Oxford</publisher-name> (<year>1980</year>).</citation></ref>
</ref-list>
</back>
</article>
