This blog post is a noteworthy contribution to the QuriousWriter Blog Contest.
Sentiment analysis in natural language processing (NLP) helps to determine whether the input data is positive, negative or neutral. It is generally performed on textual data to help businesses track customer feedback through sentiment, understand customer needs and monitor brand perception and product reputation.
Digital media has given users a platform to voice their opinions, which makes it important for businesses to capture the opinions, needs and intent that users share on social media. A sentiment refers to the underlying emotion, thought and overall context of a given piece of text, typically provided by customers in written form. Many organisations analyse this text to obtain meaningful insights.
The drivers in a text are the sections that are mainly responsible for its associated sentiment. In this blog, we cover how to trace back from a given sentiment to the driver that produces it in a paragraph, which helps in understanding and extracting sentiment from a given text.
For instance, consider the tweet “So many tests todayyy I don`t feel confident about anyy.” Its sentiment is “Negative”, and the sentiment driver is “I don’t feel confident”, as this section is mainly responsible for the negative sentiment.
Here we consider the task of tweet sentiment-driver extraction, in which tweets and their corresponding sentiments are given and the goal is to extract the sentiment driver from each tweet.
Punctuation and special characters are also part of the hidden sentiment in a tweet, so they are preserved during extraction. For example, for the tweet “Happy b-day! Just woke up on this side of Earth, so wishes are bit late” the corresponding sentiment is “Positive”, and the extracted sentiment driver is “Happy b-day!”. The exclamation mark (!) contributes to the positive sentiment and is therefore preserved as well.
For our modelling we use the Kaggle dataset. The training data has the tweets, the corresponding sentiment and the sentiment driver (selected_text), while the test set has only the tweets and the corresponding sentiment.
Here is the snippet for the training set:
| | textID | text | selected_text | sentiment |
|---|---|---|---|---|
| 27476 | 4eac33d1c0 | wish we could come see u on Denver husband l… | d lost | negative |
| 27477 | 4f4c4fc327 | I`ve wondered about rake to. The client has … | , don`t force | negative |
| 27478 | f67aae2310 | Yay good for both of you. Enjoy the break – y… | Yay good for both of you. | positive |
| 27479 | ed167662a5 | But it was worth it ****. | But it was worth it ****. | positive |
| 27480 | 6f7127d9d7 | All this flirting going on – The ATG smiles… | All this flirting going on – The ATG smiles. Y… | neutral |
Here, text is the tweet and selected_text is the sentiment driver.
Here is the snippet for the test set:
text: tweet (the test set contains only the tweet and its sentiment)
import os
import pandas as pd
import numpy as np
from tqdm import tqdm
import nltk
from nltk.tokenize import TweetTokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional, Input, RepeatVector, add, BatchNormalization
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras import regularizers
from keras.utils import np_utils
Drop the rows containing null values from the training data, as shown below.
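A minimal sketch of loading the data and dropping the nulls, assuming the Kaggle files are saved locally as train.csv and test.csv under root_path (the file names and root_path are assumptions here):
root_path = './'                                  # adjust to your local data directory
t_data = pd.read_csv(os.path.join(root_path, 'train.csv'))
test_data = pd.read_csv(os.path.join(root_path, 'test.csv'))
t_data = t_data.dropna().reset_index(drop=True)   # remove rows with missing text/selected_text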
# For each tweet, split the text into the part before the driver (beg)
# and the part after it (out).
beg = []
out = []
for i in tqdm(range(len(t_data))):
    spr = t_data.iloc[i]['text']
    sub = t_data.iloc[i]['selected_text']
    idx = spr.index(sub)                  # start position of the driver in the tweet
    beg.append(spr[:idx])                 # text before the driver
    out.append(spr[idx + len(sub):])      # text after the driver
100%|██████████| 27480/27480 [00:06<00:00, 4573.03it/s]
t_data['beg'] = beg
t_data['out'] = out
Here, beg is the part of the tweet before the driver and out is the part after it.
Running the snippet adds these two new columns to the training dataframe.
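For a quick sanity check of the split (an illustrative call, not part of the original snippet), the new columns can be viewed with:
t_data[['text', 'beg', 'selected_text', 'out']].head()
Next, each tweet is tokenised with NLTK's TweetTokenizer, and every token is labelled “Y” if it falls inside the driver and “O” otherwise: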
# TweetTokenizer keeps punctuation and emoticons as separate tokens.
tknzr = TweetTokenizer()

xt = []   # token sequences
yt = []   # tag sequences ("Y" = driver token, "O" = other)
for j in tqdm(range(len(t_data))):
    bg = tknzr.tokenize(t_data.iloc[j]['beg'])
    em = tknzr.tokenize(t_data.iloc[j]['selected_text'])
    ot = tknzr.tokenize(t_data.iloc[j]['out'])
    arr = bg + em + ot
    tarr = ["O"] * len(bg) + ["Y"] * len(em) + ["O"] * len(ot)
    xt.append(arr)
    yt.append(tarr)
100%|██████████| 27480/27480 [00:14<00:00, 1934.98it/s]
# Build the vocabulary from all tokens and append a dedicated padding token.
words = []
for k in range(len(xt)):
    words = words + xt[k]
_words = list(set(words))
_words = _words + ['PADDING']
N_words = len(_words)     # vocabulary size, used later by the Embedding layer
_words[-1]
'PADDING'
tags = ["O", "Y"]
# Map each word and tag to an integer index.
word2idx = {w: i for i, w in enumerate(_words)}
tag2idx = {t: i for i, t in enumerate(tags)}
X = [[word2idx[w] for w in s] for s in xt]   # tweets as index sequences
Y = [[tag2idx[w] for w in s] for s in yt]    # tags as index sequences
# Length of the longest tokenised tweet; all sequences are padded to this length.
mx_len = 0
for ar in tqdm(xt):
    if mx_len < len(ar):
        mx_len = len(ar)
100%|██████████| 27480/27480 [00:00<00:00, 1420256.23it/s]
# Pad every sequence to mx_len; padded tokens use the 'PADDING' index and the "O" tag.
X = pad_sequences(maxlen=mx_len, sequences=X, padding="post", value=N_words - 1)
Y = pad_sequences(maxlen=mx_len, sequences=Y, padding="post", value=tag2idx["O"])
# One-hot encode the sentiment (second model input) and the per-token tags (labels).
senti = ['neutral', 'negative', 'positive']
sent2idx = {sen: i for i, sen in enumerate(senti)}
X1 = t_data['sentiment'].values
X1 = [sent2idx[w] for w in X1]
X1 = np_utils.to_categorical(X1, num_classes=3)
y = np.array([np_utils.to_categorical(i, num_classes=2) for i in Y])
X.shape, X1.shape, y.shape
((27480, 66), (27480, 3), (27480, 66, 2))
We now have two inputs, X and X1, for the tweet (text) and the sentiment, along with the corresponding label sequence for the selected text (driver). Each token in the tweet has a label: “Y” if the word is part of the selected text, otherwise “O”. Padded positions use the extra 'PADDING' token and carry the “O” label.
This is a seq2seq-style sequence-labelling problem, and recurrent (RNN) models work well here. We have one dense input vector (shape=(3,)) for the sentiment, which does not vary with the time step, and another input (shape=(66,)) for the tweet, which is a sequence over time. The key point in this kind of modelling is to distribute the dense vector across the time steps; here, we use RepeatVector to convert the dense sentiment vector into a time-distributed one. A custom word embedding is used to obtain a continuous vector for each word and capture the correlations between words.
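As a minimal standalone illustration (not part of the pipeline above, and using layers already imported earlier), RepeatVector simply copies a dense vector across every time step:
demo_in = Input(shape=(3,))
demo_x = Dense(128, activation='relu')(demo_in)
demo_x = RepeatVector(66)(demo_x)     # repeat the 128-d vector 66 times
demo_model = Model(demo_in, demo_x)
print(demo_model.output_shape)        # (None, 66, 128)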
A Word2Vec word embedding can be used to convert each token index into a vector. Each word is mapped to a vector representation of its semantics and context, so that semantically similar words have a small cosine distance from each other.
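As a sketch of that idea (an assumption on my part, not the exact approach used below), a Word2Vec model could be trained on the tokenised tweets with gensim and used to initialise the Embedding layer:
# Assumes gensim >= 4.0; the vector size of 100 is an arbitrary choice.
from gensim.models import Word2Vec

w2v = Word2Vec(sentences=xt, vector_size=100, window=5, min_count=1, workers=4)
embedding_matrix = np.zeros((N_words, 100))
for w, i in word2idx.items():
    if w in w2v.wv:
        embedding_matrix[i] = w2v.wv[w]
# The matrix could then be passed to the Embedding layer, e.g.
# Embedding(N_words, 100, weights=[embedding_matrix], input_length=mx_len)
# (newer Keras versions may require embeddings_initializer instead of weights).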
A CRF (Conditional Random Fields) layer can be placed on top of the Bi-directional LSTM layers to better capture dependencies between neighbouring labels in the sequence; a sketch of that variant is given after the architecture code below. The architecture used here keeps a simpler time-distributed softmax output.
There are two inputs that are fed into the model.
The embedding layer applies only to the sequence of words, so input2 is mapped through the embedding before going into the Bidirectional LSTM layers. A notable point is to keep mask_zero=False, because the sequences are padded with an explicit 'PADDING' word (whose index is not zero) and the padded positions already carry the “O” label. Here is the architecture for the solution:
# Input 1: sentiment (one-hot, shape (3,)), expanded across time with RepeatVector.
input1 = Input(shape=(3,))
fe1 = Dense(128, activation='relu')(input1)
fe2 = RepeatVector(mx_len)(fe1)
fe3 = BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True)(fe2)

# Input 2: tweet as a sequence of word indices, embedded and passed through BiLSTMs.
input2 = Input(shape=(mx_len,))
se1 = Embedding(N_words, 1000, mask_zero=False, input_length=mx_len)(input2)
se2 = Bidirectional(LSTM(128, return_sequences=True, recurrent_dropout=0.1))(se1)
se3 = Dropout(0.1)(se2)
se4 = Bidirectional(LSTM(64, return_sequences=True, recurrent_dropout=0.1))(se3)
se5 = BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True)(se4)

# Merge the two branches and decode a tag ("O"/"Y") for every time step.
decoder1 = add([se5, fe3])
dec = LSTM(128, return_sequences=True, recurrent_dropout=0.1)(decoder1)
dec2 = TimeDistributed(Dense(64, activation='relu'))(dec)
output = TimeDistributed(Dense(len(tags), activation='softmax'))(dec2)

model = Model(inputs=[input1, input2], outputs=output)
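If you want to try the CRF output mentioned above, here is a rough sketch using the separate keras-contrib package (this is an assumption on my part; the package must be installed and its API can differ across versions):
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
from keras_contrib.metrics import crf_viterbi_accuracy

crf = CRF(len(tags))                    # one CRF state per tag ("O", "Y")
crf_output = crf(dec2)                  # replaces the softmax output layer
crf_model = Model(inputs=[input1, input2], outputs=crf_output)
crf_model.compile(optimizer='adam', loss=crf_loss, metrics=[crf_viterbi_accuracy])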
The summary of the parameters can be produced as shown below:
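The standard Keras call below prints the layer shapes and parameter counts:
model.summary()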
We have used the “Adam” optimiser for loss convergence. Since the output layer is a time-distributed softmax over the two tags, binary cross-entropy is used as the loss function. For training, we use a batch size of 128 and hold out 10% of the training data as a validation set. Here is the snippet:
model.compile(loss='binary_crossentropy', optimizer='adam')

# Save the best model (lowest validation loss) after each epoch and stop early
# if the validation loss stops improving.
filepath = root_path + "checkpoint_s"
callback = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='auto', save_freq='epoch')
early = EarlyStopping(monitor='val_loss', mode='auto')

hist = model.fit(x=[X1, X], y=y, batch_size=128, epochs=15, callbacks=[callback, early], validation_split=0.1)
Once training starts, we can track the loss and accuracy. Here is the training log after 5 epochs:
We can save the model and visualise the training curves.
model.save(root_path + '100_model.h5')
import matplotlib.pyplot as plt
plt.plot(hist.history['loss'])
plt.plot(hist.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()
After the training is completed, we can check the predictions on the validation set. One example prediction is given below:
# d is the index of a validation example to inspect.
XX = np.reshape(X[d], (1, mx_len))
X11 = np.reshape(X1[d], (1, 3))
pred = model.predict([X11, XX])
pred1 = np.argmax(pred, axis=2)           # most likely tag index per time step
idx2tag = {i: w for i, w in enumerate(tags)}
pred11 = [idx2tag[idx] for idx in pred1[0]]
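To turn the predicted tags back into text, one simple (illustrative, assumed) approach is to join the tokens labelled “Y”; note that joining with spaces will not exactly reproduce the original spacing around punctuation:
idx2word = {i: w for w, i in word2idx.items()}
tokens = [idx2word[idx] for idx in X[d]]
driver_tokens = [tok for tok, tag in zip(tokens, pred11) if tag == "Y" and tok != 'PADDING']
print(" ".join(driver_tokens))            # predicted sentiment driver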
We can compare the prediction with the ground truth for this example; the predicted driver is close to the true value:
Analysing data from social media messages, tweets, blogs and conversations enables businesses to gather insights through NLP. Read this case study to learn how we built a customised model for a customer to transcribe audio and perform sentiment analysis.
Get in touch with us to integrate insights from NLP into your business.