Authors

Mahnoor Shahid
Shahrukh Khan
Navdeeppal Singh

Abstract

Attention-based transformer models have achieved state-of-the-art results in natural language processing (NLP). However, recent work shows that the underlying attention mechanism can be exploited by adversaries to craft malicious inputs designed to induce spurious outputs, thereby harming model performance and trustworthiness. Unlike in the vision domain, the literature examining neural networks under adversarial conditions in NLP is limited and largely focuses on the English language. In this article, we first analyze the adversarial robustness of Bidirectional Encoder Representations from Transformers (BERT) models on German data sets. Second, we introduce two novel NLP attacks: a character-level attack and a word-level attack, both of which use attention scores to decide where to inject character-level and word-level noise, respectively. Finally, we present two defense strategies against these attacks. The first, an implicit character-level defense, is a variant of adversarial training that trains a new classifier capable of abstaining on/rejecting certain (ideally adversarial) inputs. The second, an explicit character-level defense, learns a latent representation of the complete training-data vocabulary and maps all tokens of an input example into the same latent space, enabling the replacement of every out-of-vocabulary token with its most similar in-vocabulary token under the cosine similarity metric.
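
To illustrate the attention-guided attack idea described in the abstract, the sketch below selects the input token that receives the most attention, averaged over all layers and heads of a BERT model, and injects character-level noise (an adjacent-character swap) into it. This is a minimal sketch, not the authors' implementation; it assumes the Hugging Face transformers library and the "bert-base-german-cased" checkpoint, neither of which is specified here.

```python
# Minimal sketch of an attention-guided character-level perturbation.
# Assumptions (not from the article): Hugging Face `transformers`,
# the "bert-base-german-cased" checkpoint, and an adjacent-character
# swap as the concrete form of character-level noise.
import random

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-german-cased"  # assumed German BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_attentions=True)
model.eval()


def attention_guided_char_attack(sentence: str) -> str:
    """Swap two adjacent characters inside the most-attended token."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # outputs.attentions: one (batch, heads, seq_len, seq_len) tensor per layer.
    # Average over layers and heads, then sum the attention each token *receives*.
    attn = torch.stack(outputs.attentions).mean(dim=(0, 2))  # (batch, seq, seq)
    received = attn[0].sum(dim=0)                             # (seq_len,)

    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    for i, tok in enumerate(tokens):          # never perturb [CLS]/[SEP] etc.
        if tok in tokenizer.all_special_tokens:
            received[i] = -1.0

    target = int(received.argmax())
    prefix = "##" if tokens[target].startswith("##") else ""
    text = tokens[target][len(prefix):]
    if len(text) > 1:                         # inject noise: swap two neighbours
        j = random.randrange(len(text) - 1)
        text = text[:j] + text[j + 1] + text[j] + text[j + 2:]
    tokens[target] = prefix + text

    # Rebuild a (rough) surface string from the perturbed subword tokens.
    return tokenizer.convert_tokens_to_string(
        [t for t in tokens if t not in tokenizer.all_special_tokens]
    )


print(attention_guided_char_attack("Der Film war absolut großartig."))
```

The same attention scores could be used analogously for the word-level variant, for example by replacing the selected token outright rather than reordering its characters.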

Article Details

Section
Methods