Back-Tracking of Sentiment Drivers in NLP

This blog post is a noteworthy contribution to the QuriousWriter Blog Contest.

Sentiment analysis in natural language processing (NLP) helps to determine whether the input data is positive, negative or neutral. It is generally performed on textual data to help businesses track customer feedback through sentiment, understand customer needs and monitor brand perception and product reputation.

Digital media has given users a platform to voice their opinions, which makes it important for businesses to capture the opinions, needs and intent that users share on social media. A sentiment refers to the underlying emotion, thought and overall context of a given piece of text. Customers express these attributes in text form, and many organisations analyse that text to obtain meaningful insights.

The drivers are the sections of a text that are mainly responsible for its associated sentiment. In this blog, we cover back-tracking of the drivers that lead to the associated sentiment of a paragraph. This is helpful in understanding and extracting sentiment from a given text.

For instance, consider the tweet “So many tests todayyy I don`t feel confident about anyy.” The corresponding sentiment is “Negative”, and the sentiment driver for this tweet is “I don`t feel confident”, as this section is mainly responsible for the negative sentiment.

Problem statement

We consider the example of tweet sentiment-driver extraction, in which the tweets and their corresponding sentiments are given. The task is to extract the sentiment-driver from each tweet.

Punctuation and special characters are also part of the hidden sentiment in a tweet, so they are conserved during extraction. For example, for the tweet “Happy b-day! Just woke up on this side of Earth, so wishes are bit late”, the corresponding sentiment is “Positive” and the extracted sentiment-driver is “Happy b-day!”. The exclamation mark (!) is part of the positive sentiment here and hence it is also conserved.

Data-set and descriptions

In this case, we use a Kaggle dataset for our modelling. The training data has the tweets, the corresponding sentiment and the sentiment-driver, and the test set has only the tweets and the corresponding sentiment.

Here is the snippet for the training set:

index | id         | text                                           | selected_text                                   | sentiment
27476 | 4eac33d1c0 | wish we could come see u on Denver husband l…  | d lost                                          | negative
27477 | 4f4c4fc327 | I`ve wondered about rake to. The client has …  | , don`t force                                   | negative
27478 | f67aae2310 | Yay good for both of you. Enjoy the break – y… | Yay good for both of you.                       | positive
27479 | ed167662a5 | But it was worth it ****.                      | But it was worth it ****.                       | positive
27480 | 6f7127d9d7 | All this flirting going on – The ATG smiles…   | All this flirting going on – The ATG smiles. Y… | neutral

text: Tweet, selected_text: sentiment-driver

Here is the snippet for the test set:

text: Tweet

Used Libraries:
import os
import pandas as pd
import numpy as np
from tqdm import tqdm
import nltk
from nltk.tokenize import TweetTokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional, Input, RepeatVector, add, BatchNormalization
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras import regularizers
# np_utils.to_categorical is used below; newer Keras releases expose it as keras.utils.to_categorical
from keras.utils import np_utils

Data pre-processing

Drop the rows that contain null values.

  1. Split the tweet into three parts: the beginning, the driver (the selected text) and the ending. Since the driver is embedded in the tweet, the “beginning part” runs from index 0 up to the first character of the selected text, and the “ending part” runs from just after the last character of the selected text to the end of the tweet. Here is the code snippet for the same:
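beg = []
out = []
for i in tqdm(range(len(t_data))):
  spr = t_data.iloc[i]['text']
  sub = t_data.iloc[i]['selected_text']
  idx = spr.index(sub)              # position of the driver inside the tweet
  beg.append(spr[:idx])             # text before the driver
  out.append(spr[idx + len(sub):])  # text after the driver
100%|██████████| 27480/27480 [00:06<00:00, 4573.03it/s]
t_data['beg'] = beg
t_data['out'] = out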

beg: beginning part, out: ending part
The result after running the snippet is given below:

  2. The next step is to tokenize the texts (tweets). Since the special characters and punctuation are important, we use the TweetTokenizer from NLTK to tokenize the texts. For labelling, we set “O” for the tokens that are outside the selected text and “Y” for those within the selected text in the tweet. Here is the snippet:
tknzr = TweetTokenizer()
xt = []
yt = []
for j in tqdm(range(len(t_data))):
  bg = tknzr.tokenize(t_data.iloc[j]['beg'])
  em = tknzr.tokenize(t_data.iloc[j]['selected_text'])
  ot = tknzr.tokenize(t_data.iloc[j]['out'])
  arr = bg+em+ot
  tarr = ["O"] * len(bg) + ["Y"] * len(em) + ["O"] * len(ot)
  xt.append(arr)   # tokens of the tweet
  yt.append(tarr)  # one tag per token
100%|██████████| 27480/27480 [00:14<00:00, 1934.98it/s]
  3. Machine learning models only deal with numerical data, so we convert these tokens to numerical values. For this purpose, we list the unique words used in the whole corpus. Here is the snippet:
words = []
for k in range(len(xt)):
  words = words + xt[k]
  4. We append the word “PADDING” to this set of words, as it will be used for padding in the next part of modelling.
_words = list(set(words))
_words = _words + ['PADDING']
N_words = len(_words)  # vocabulary size; 'PADDING' gets the last index, N_words-1
  5. Now we need the label tags for the training. As the tweet is split into three parts, “beginning”, “selected-text” and “ending”, we have “Y” for the selected text and “O” for the other two. The key point here is that the padded positions also need a label, as we will be padding the sentences to get the same length. In the snippets shown here, the tag set contains only “O” and “Y”, and padded positions are labelled “O”. We then create the dictionaries for unique words and unique labels. Here is the code snippet:
tags = ["O", "Y"]
word2idx = {w: i for i, w in enumerate(_words)}
tag2idx = {t: i for i, t in enumerate(tags)}
  6. Next, we pad the sentences to the same length, as required by the ML model. We need the maximum sentence length over the whole corpus. The text is padded with the index of “PADDING”, which is the last index in the dictionary of unique words (N_words-1), and the labels are padded with the index of “O”. Here is the snippet for the maximum length and padding:
X = [[word2idx[w] for w in s] for s in xt]
Y = [[tag2idx[w] for w in s] for s in yt]
mx_len = 0
for ar in tqdm(xt):
  if(mx_len < len(ar)):
    mx_len = len(ar)
100%|██████████| 27480/27480 [00:00<00:00, 1420256.23it/s]
X = pad_sequences(maxlen=mx_len, sequences=X, padding="post", value = N_words-1)
Y = pad_sequences(maxlen=mx_len, sequences=Y, padding="post", value=tag2idx["O"])
  7. This completes the pre-processing of the text and selected text (driver). We now move to the sentiment part, which goes into the training as an input. The sentiment column has three unique values: positive, negative and neutral. Here is the snippet for converting these to numerical format:
senti = ['neutral', 'negative', 'positive']
sent2idx = {sen: i for i, sen in enumerate(senti)}
X1 = t_data['sentiment'].values
X1 = [sent2idx[w] for w in X1]
X1 = np_utils.to_categorical(X1, num_classes = 3)
  8. We have converted the text and labels into an input format compatible with ML models. As the sentiment has only three unique values and the label (tag) data only two, we convert them into one-hot encoded vectors. Here is the snippet and the shapes for input and labels:
y = np.array([np_utils.to_categorical(i, num_classes=2) for i in Y])
X.shape, X1.shape, y.shape
((27480, 66), (27480, 3), (27480, 66, 2))
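
The pre-processing steps above can be traced end to end on a single tweet. The following is a minimal sketch, not the original pipeline: it uses a plain whitespace split in place of TweetTokenizer and a hand-written helper in place of pad_sequences, and the names split_tweet, pad and MAX_LEN are hypothetical.

```python
def split_tweet(text, driver):
    """Split a tweet into (beginning, driver, ending) around the driver."""
    idx = text.index(driver)
    return text[:idx], driver, text[idx + len(driver):]


def pad(seq, max_len, value):
    """Post-pad a list with `value` up to `max_len` items."""
    return seq + [value] * (max_len - len(seq))


text = "So many tests todayyy I don`t feel confident about anyy."
driver = "I don`t feel confident"

# Steps 1-2: split, tokenize (whitespace split as a stand-in) and label.
beg, mid, end = (part.split() for part in split_tweet(text, driver))
tokens = beg + mid + end
labels = ["O"] * len(beg) + ["Y"] * len(mid) + ["O"] * len(end)

# Steps 3-5: vocabulary with a trailing PADDING entry, plus tag dictionary.
vocab = sorted(set(tokens)) + ["PADDING"]
word2idx = {w: i for i, w in enumerate(vocab)}
tag2idx = {"O": 0, "Y": 1}

# Step 6: convert to indices and pad to a fixed length.
MAX_LEN = 12  # toy value; the article uses the corpus maximum (66)
X = pad([word2idx[w] for w in tokens], MAX_LEN, word2idx["PADDING"])
Y = pad([tag2idx[t] for t in labels], MAX_LEN, tag2idx["O"])

# Step 8: one-hot encode the tags, giving MAX_LEN rows of 2 values.
y = [[1 if k == tag else 0 for k in range(len(tag2idx))] for tag in Y]

print(list(zip(tokens, labels)))
print(X)
print(Y)
```

The padded positions end up with the index of “PADDING” on the input side and the tag “O” on the label side, which mirrors the pad_sequences calls shown above.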

Deep learning modelling

We now have two inputs, X for the tweet (text) and X1 for the sentiment, and the corresponding labels for the selected text (driver). Each token in the text has a label: “Y” if the token is part of the selected text, otherwise “O”. In the snippets shown here, padded positions also carry the label “O”.

Modelling Logic 

This is a seq2seq problem in which every input token receives an output label, and sequential RNN models are a natural fit for it. We have one input dense vector (shape=(3,)) for the sentiment, which does not depend on the time step, and another input (shape=(66,)) for the tweet, which is a sequence over time. The key point in this kind of modelling is to distribute the dense vector over the given time steps. Here, we use the RepeatVector layer to repeat the dense vector across all time steps. A custom word embedding has been used to get a continuous vector for each word and the correlation between words.
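
To make the role of RepeatVector concrete, here is a minimal plain-Python sketch of the shape manipulation it performs: one feature vector is copied once per time step so that it can be added element-wise to per-token features. The sizes and values are illustrative only, and repeat_vector and add_sequences are hypothetical helpers, not the Keras layers themselves.

```python
def repeat_vector(vec, n_steps):
    """Copy a 1-D feature vector n_steps times: (d,) -> (n_steps, d)."""
    return [list(vec) for _ in range(n_steps)]


def add_sequences(a, b):
    """Element-wise sum of two (n_steps, d) nested lists."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Stands in for the Dense output computed from the sentiment input.
sentiment_features = [0.5, -1.0, 2.0]

# Stands in for the Bi-LSTM output: one row of features per time step.
token_features = [
    [1.0, 1.0, 1.0],
    [2.0, 2.0, 2.0],
    [3.0, 3.0, 3.0],
    [4.0, 4.0, 4.0],
]

repeated = repeat_vector(sentiment_features, n_steps=len(token_features))
combined = add_sequences(token_features, repeated)
print(combined)
```

For the addition to work, both branches must have the same feature size per time step, which is why the sentiment branch and the tweet branch need matching output widths.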

Word Embedding

Word2Vec word embedding can be used to convert each token index to a corresponding vector. Each word is mapped to a vector that represents its meaning and context, so that words with similar meaning have a small cosine distance from each other. Here is the view:
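
The cosine-distance property can be checked on toy vectors. The sketch below uses hand-made three-dimensional vectors purely for illustration (real embeddings are learned and much larger), and cosine_distance is a hypothetical helper.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hand-made vectors for illustration; they are not trained embeddings.
toy_vectors = {
    "happy": [0.9, 0.8, 0.1],
    "glad": [0.8, 0.9, 0.2],
    "lost": [-0.7, 0.1, 0.9],
}

near = cosine_distance(toy_vectors["happy"], toy_vectors["glad"])
far = cosine_distance(toy_vectors["happy"], toy_vectors["lost"])
print(near, far)
```

With these made-up values, the two words chosen to be similar end up much closer to each other than to the third word.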

Conditional Random Field

A CRF (Conditional Random Field) layer is used on top of the bidirectional LSTM layers to get a better understanding of context in the sequence.
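
One way to see what a CRF layer adds over independent per-token predictions is its decoding step: tag-to-tag transition scores are combined with per-token scores, and the best whole sequence is found with the Viterbi algorithm. The sketch below is a generic illustration with made-up scores, not the layer used in this model; viterbi, emissions and transitions are hypothetical names.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring sequence of tag indices.

    emissions[t][k]   : score of tag k at position t
    transitions[j][k] : score of moving from tag j to tag k
    """
    n_tags = len(emissions[0])
    score = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for k in range(n_tags):
            best_prev = max(range(n_tags), key=lambda j: score[j] + transitions[j][k])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][k] + emissions[t][k])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final tag.
    best = max(range(n_tags), key=lambda k: score[k])
    path = [best]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["O", "Y"]
# Made-up scores: position 2 weakly prefers "O" when judged on its own.
emissions = [[2.0, 0.0], [0.0, 2.0], [0.6, 0.4], [0.0, 2.0], [2.0, 0.0]]
# Staying in the same tag is rewarded, which favours contiguous drivers.
transitions = [[0.5, -0.5], [-0.5, 0.5]]

independent = [tags[max(range(len(tags)), key=lambda k: e[k])] for e in emissions]
decoded = [tags[k] for k in viterbi(emissions, transitions)]
print(independent)
print(decoded)
```

In this toy case the per-token argmax breaks the driver into two pieces, while the sequence-level decoding keeps it as one contiguous span, which is the behaviour one would want for a driver that is a single section of the tweet.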

There are two inputs that are fed into the model.

  1. Input1: one hot encoded vector for sentiment value
  2. Input2: sequence of maximum length (here 66)

Embedding layers are only applicable to sequences of words. Hence, input2 is mapped through an embedding layer before going to the Bidirectional-LSTM layers. The notable point here is to keep mask_zero = False, as we have already padded the sequence with the word “PADDING” and index 0 belongs to a regular word. Here is the architecture for the solution:

input1 = Input(shape=(3,))
fe1 = Dense(128, activation='relu')(input1)
fe2 = RepeatVector(mx_len)(fe1)
fe3 = BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True)(fe2)

input2 = Input(shape=(mx_len,))
se1 = Embedding(N_words, 1000, mask_zero=False, input_length=mx_len)(input2)  # padding uses index N_words-1, so index 0 must not be masked
se2 = Bidirectional(LSTM(128, return_sequences = True, recurrent_dropout=0.1))(se1)
se3 = Dropout(0.1)(se2)
se4 = Bidirectional(LSTM(64, return_sequences = True, recurrent_dropout=0.1))(se3)
se5 = BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True)(se4)

decoder1 = add([se5, fe3])
dec = LSTM(128, return_sequences = True, recurrent_dropout=0.1)(decoder1)
dec2 = TimeDistributed(Dense(64, activation='relu'))(dec)
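output = TimeDistributed(Dense(len(tags), activation='softmax'))(dec2)
model = Model(inputs=[input1, input2], outputs=output)  # sentiment input first, then the tweet
model.summary()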

The summary of the parameters is produced below:

We have used the “Adam” optimiser for loss convergence. With a CRF output layer, the CRF loss is used as the loss function; the snippet below, whose output layer is a softmax, is compiled with binary cross-entropy instead. For the training, we have used a batch size of 128, and the validation set is 10% of the total training data. Here is the snippet:

model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
# root_path is the output directory and is assumed to be defined beforehand
filepath = root_path+"checkpoint_s"
# save_freq='epoch' so that val_loss is available when the checkpoint is evaluated
callback = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode = 'auto', save_freq='epoch')
early = EarlyStopping(monitor='val_loss', mode='auto')
hist = model.fit(x=[X1, X], y= y, batch_size = 128, epochs=15, callbacks=[callback, early],  validation_split=0.1,  validation_steps = 42)

After the training starts, we can track the loss and accuracy. Here is the snippet after 5 epochs of training:

We can save the model and go through the visualisation for the trained model.

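model.save(root_path + '100_model.h5')
import matplotlib.pyplot as plt
plt.plot(hist.history['loss'])      # training loss per epoch
plt.plot(hist.history['val_loss'])  # validation loss per epoch
plt.title('Model loss')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()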

After the training is completed, we can check the prediction on the validation set. One of the examples for a prediction is given below:

# d is the index of the example to inspect and is assumed to be set beforehand
XX = np.reshape(X[d], (1,66))
X11 = np.reshape(X1[d], (1,3))
pred = model.predict([X11, XX])
pred1 = np.argmax(pred, axis=2)
idx2tag = {i:w for i, w in enumerate(tags)}
pred11 = [idx2tag[idx] for idx in pred1[0]]

We can see here the prediction and the ground truth for this example. The prediction is close to the ground truth:
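
To turn predicted tags back into a driver string, the tokens labelled “Y” can be collected and joined. This is a minimal sketch that assumes the tokens of the tweet are still available alongside the predictions; extract_driver is a hypothetical helper, the tokens and tags below are hand-written, and joining with spaces only approximates the original spacing around punctuation.

```python
def extract_driver(tokens, predicted_tags):
    """Join the tokens whose predicted tag is "Y".

    `predicted_tags` may be longer than `tokens` because of padding;
    zip() stops at the shorter sequence, so padded positions are ignored.
    """
    return " ".join(tok for tok, tag in zip(tokens, predicted_tags) if tag == "Y")


tokens = ["Happy", "b-day", "!", "Just", "woke", "up"]
predicted_tags = ["Y", "Y", "Y", "O", "O", "O", "O", "O"]  # last two are padding
driver = extract_driver(tokens, predicted_tags)
print(driver)
```

Recovering the exact original substring, including spacing, would require keeping the character offsets of each token, which this sketch does not do.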

The model can be made more robust by further enhancement. The following steps will enhance the model:

  1. We can use BERT layers before the LSTM layers to gain a better understanding of the context and meaning of the sentence.
  2. Hyperparameters like the number of layers, the number of neurons in each layer and the optimizer can be tuned to get better accuracy.
  3. We can also use some regularisation techniques to prevent the model from overfitting, as the training accuracy is 98.89% and the validation accuracy is 95.91%. We can add a kernel weight-decay rate, a combination of L1 and L2 regularizers, and Dropout layers. Once we apply these regularizers, we may need to increase the number of LSTM layers so that the model does not under-fit.

The analysis of data from social media messages, tweets, blogs and conversations enables businesses to gather insights from NLP.

Written by Ajay Kumar Gond
