State Wise Sentiment Analysis Of CAA, NRC & NPR Tweets
“The Citizenship Amendment Act, the National Population Register, and the National Register of Citizens took the entire nation by storm. Be it due to the heated newsroom debates, or the protests and riots on the streets, the entire nation was talking about them. As the row over these acts brought thousands of Indians to the streets and caused nationwide controversies and protests, their state-wise analysis helped us better understand their impact across the nation.”
Authors: Harshita Pandey(MT19012), Shivani Mittal(MT19128), Rupali(MT19095) (IIIT, Delhi)
Introduction
Social networks are a major source of information about people’s opinions and sentiments on different topics. Twitter, one such popular platform, allows people to talk to strangers, share information and random thoughts, and engage with people from all walks of life. It thus contains an abundance of data that can be processed to obtain meaningful results, such as sentiment scores and product reviews, and can also be used for predictive analysis.
Our project aims to use Twitter data to study the nation-wide opinion and views on CAA, NRC, and NPR using sentiment analysis. Also termed as opinion mining, sentiment analysis is primarily used for analyzing the opinions and conversations of the public and using this data to classify the sentiment as positive, negative, or neutral.
The figure below shows the process flow of our entire project.
Data Information
For this project, we extracted Twitter data for the month of February. The tweets were collected by hashtag using Tweepy, a Python client library for the Twitter API. The collected data contained 5000 unique tweets along with other information, including the date, location, and retweet count.
Pre-processing Of Data
The preprocessing of data included removing duplicate tweets, removing references and hashtags, language translation, expanding abbreviations, spelling corrections, data normalization, and storing the tweets corresponding to their states as the data analysis is performed state-wise.
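A few of these steps can be sketched in plain Python. The snippet below is only an illustration, not the code used in the project: the function names and sample tweets are made up, duplicates are detected after cleaning (an assumption), and translation, abbreviation expansion, and spelling correction are left out.

```python
import re

def clean_tweet(text):
    """Drop links, @references, and #hashtags, then lowercase and collapse spaces."""
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)        # remove references and hashtags
    return " ".join(text.lower().split())       # normalize case and whitespace

def deduplicate(tweets):
    """Clean every tweet and keep only the first occurrence of each cleaned text."""
    seen = set()
    unique = []
    for tweet in tweets:
        cleaned = clean_tweet(tweet)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique

# Made-up sample tweets, for illustration only.
raw_tweets = [
    "@user I support this! #CAA https://t.co/abc",
    "I support this!   #NRC",
    "Stop the violence",
]
unique_tweets = deduplicate(raw_tweets)
```

Here the first two tweets collapse to the same cleaned text, so only one of them is kept.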
Methodology
1. Manual Annotations
The tweets were manually annotated and assigned a sentiment score of +1 if they were in favor of CAA, NRC, and NPR, and a sentiment score of -1 if they were against them. We also created a separate category, with a sentiment score of 0, for tweets that were mainly against the violence across the country and criticized the media and the government.
2. Encoding the Data
Text data cannot be fed directly into machine learning models, so before training, the text was converted into numerical form using the techniques described below.
A. Text to sequence: The text was first tokenized using the tokenizer class, and each word of a tweet was then replaced with its corresponding integer value from the “word index” dictionary. Before being fed into the machine learning model, the input sequences were padded with 0 using the “pad sequences” method so that all sequences have the same length.
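To make the idea concrete, here is a small pure-Python sketch of the same three steps: building a word index, replacing words with integers, and padding with 0. It only imitates what a tokenizer class does; the function names and toy tweets are hypothetical. Padding at the front is an assumption (it matches the default of Keras's `pad_sequences`, but the text above does not say which side was padded).

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent first; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known word with its integer from the word index."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def pad_to_same_length(sequences, maxlen=None):
    """Left-pad with 0 (and keep the last maxlen items) so all rows match in length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:] if maxlen > 0 else []
        padded.append([0] * (maxlen - len(kept)) + kept)
    return padded

# Made-up toy tweets, for illustration only.
tweets = ["i support caa", "i oppose nrc and caa"]
word_index = build_word_index(tweets)
sequences = texts_to_sequences(tweets, word_index)
padded = pad_to_same_length(sequences)
```

After this runs, both rows of `padded` have length five, with zeros filling the front of the shorter tweet.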
B. Tf-idf vectorizer: The tf-idf vectorizer converts a collection of raw documents into a matrix of tf-idf scores, where each score is the product of the tf value and the idf value of a term.
TF (Term Frequency): The tf value of a term is the frequency of the term in the document.
IDF (Inverse Document Frequency): The idf value of a term is the logarithm of the total number of documents divided by the number of documents that contain the term, so terms that appear in fewer documents receive a higher weight.
C. Count Tokenizer: The count tokenizer converts a collection of text data into vectors of token counts, representing each token by its term frequency in the document.
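The count and tf-idf encodings are closely related, which a short sketch can show. The code below is a simplified illustration with made-up documents and hypothetical function names. It uses raw counts as tf and idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter

def count_vectorize(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

def tf_idf(matrix):
    """Weight each raw count by idf(t) = log(N / df(t))."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    # df[j]: number of documents in which term j appears at least once
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
            for row in matrix]

# Made-up toy documents, for illustration only.
docs = ["caa protest protest", "caa support"]
vocabulary, counts = count_vectorize(docs)
weights = tf_idf(counts)
```

A term that occurs in every document, like "caa" here, gets an idf of log(1) = 0, so it contributes nothing to the tf-idf vector even though its count is non-zero.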
Machine Learning Models
We used various machine learning models to predict the sentiments of the tweets after being converted into numerical form. This prediction will help us to analyze the views of the public on the implementation of CAA, NRC, and NPR.
1. Random Forests
The Random Forest model consists of a large number of decision trees that work together as an ensemble. It is based on the idea that relatively uncorrelated models, when combined, can produce more accurate predictions than any individual model. Each tree is trained on a random sample of the data and considers a random subset of attributes at each split, which keeps the trees from being too similar. The steps for building each tree and combining the results are:
- Each tree begins with its sample of the dataset at the root node.
- For each candidate attribute, it computes the entropy of the resulting split and selects the attribute with the lowest entropy, which is equivalently the highest information gain.
- The selected attribute acts as the split node, and the data is partitioned according to the split condition.
- The process repeats on each partition and stops when the tree reaches the maximum depth or the number of remaining data points falls below a threshold.
- For classification, the final prediction is the class predicted by the majority of the trees; for regression, it is the average of the trees' predictions.
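The two core calculations in these steps, scoring a split and combining the trees, can be sketched in a few lines of Python. This is a toy illustration with made-up labels and hypothetical function names, not the project's training code. It assumes a classification setting, where the trees' outputs are usually combined by a majority vote.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

def majority_vote(tree_predictions):
    """Combine the class predictions of several trees into one prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Made-up labels: +1 for "in favor", -1 for "against".
labels = [1, 1, -1, -1]
perfect_split = information_gain(labels, [1, 1], [-1, -1])    # children are pure
useless_split = information_gain(labels, [1, -1], [1, -1])    # children as mixed as parent
forest_prediction = majority_vote([1, -1, 1])
```

The split that separates the classes completely has the highest gain, so it is the one a tree would choose.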
2. Support Vector Machines
Support Vector Machines is a supervised algorithm that finds a hyperplane separating the data points of two classes in an n-dimensional space, choosing the hyperplane with the largest margin between the classes.
For data that is not linearly separable, it uses the kernel trick: a kernel function implicitly maps the data from the lower-dimensional input space into a higher-dimensional space, where the classes may become linearly separable.
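A small numerical check shows what the kernel trick buys. The sketch below uses a degree-2 polynomial kernel as an example (the kernel actually used in the project is not stated, so this choice is an assumption). It shows that evaluating the kernel in the 2-D input space gives the same number as an inner product in an explicit 3-D feature space, without ever constructing that space.

```python
import math

def poly_kernel(x, y):
    """Degree-2 polynomial kernel (x . y)^2, computed in the 2-D input space."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def feature_map(x):
    """Explicit 3-D feature map whose inner product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

# Arbitrary example points.
x, y = (1.0, 2.0), (3.0, 4.0)
implicit = poly_kernel(x, y)                     # works in 2-D only
explicit = dot(feature_map(x), feature_map(y))   # works in the 3-D feature space
```

Both values agree up to floating-point rounding, which is why an SVM can work with the kernel alone.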
3. Linear Regression
Linear Regression is a supervised machine learning algorithm that tries to predict the value of the dependent variable given a set of independent variables. Regression models differ in the type of relationship they assume between the variables and in the number of independent variables. Linear regression looks for a linear relationship between the input and output variables, hence the name.
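As a minimal illustration of fitting such a linear relationship, the sketch below computes the ordinary least-squares line for a single input variable. The numbers are made up and the function name is hypothetical. How regression was applied to the encoded tweets is not described here, so this shows only the general technique.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points that lie exactly on y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
slope, intercept = fit_line(xs, ys)
prediction = slope * 4 + intercept  # predicted output for a new input x = 4
```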
4. Recurrent Neural Network
In a standard feedforward neural network, the inputs are processed independently of each other. In a Recurrent Neural Network (RNN), however, the output of the previous step is used as an input to the next step, which lets the network take the order of the words into account. RNNs are commonly used for sentiment analysis: the network reads a tweet word by word and predicts the probability of each class.
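The recurrence itself is simple enough to write out by hand. The sketch below is a deliberately tiny, scalar version of a recurrent step with made-up weights. A real model would use vectors, learned weights, and an output layer that produces class probabilities. It is meant only to show how the previous state feeds into the next step.

```python
import math

def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run h_t = tanh(w_x * x_t + w_h * h_{t-1} + b) over a sequence of scalars."""
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)  # the new state depends on the old state
        states.append(h)
    return states

# Hypothetical weights and inputs, chosen only for illustration.
forward = rnn_forward([1.0, 0.0], w_x=1.0, w_h=1.0, b=0.0)
backward = rnn_forward([0.0, 1.0], w_x=1.0, w_h=1.0, b=0.0)
```

The two runs see the same inputs in a different order and end in different final states, which is exactly the order sensitivity a feedforward network lacks.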
The accuracies achieved on the prediction data using different Machine Learning Models are shown in the table given below.
For the state-wise analysis of tweets, the table above shows that the SVM classifier performs best, with an accuracy of 0.67.
Challenges Faced During the Analysis
1. Most of the tweets did not reflect the impact of the NRC, CAA, or NPR implementation on the public; instead, they criticized the media and the government. These tweets were mainly against the violence and protests across the country and the chaos created among the public. This made it difficult for us to understand their views on CAA, NRC, and NPR.
2. Some of the tweets showed a sarcastic tone, and thus it was a challenging task for the model to accurately predict the sentiment of the user.
User Interface
Libraries Used: Geopandas, Flask, and Dash
GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types. It adds a geometry column to the DataFrame. This column contains Shapely geometries (Shapely is a Python library of geometry types and operations), which lets us access the geometric properties and methods directly on the DataFrame and on individual features.
A shapefile is a simple, nontopological format for storing the geometric location and attribute information of geographic features. In our project, we have used the “Indian States.shapefile” for storing the geographic features of India since we are performing the analysis of the Indian States.
Visualization
We tried to show the impact of CAA, NRC, and NPR across different states with the help of the Indian Map. The visualization is represented with the help of two maps which show the positive and negative impact of CAA, NRC, and NPR in different states respectively.
The maps were created using the geopandas library, and the interface was created using Dash.
The next gif represents the use of Dash to present the drop-down and the positive and negative impact of CAA, NRC, and NPR in different states respectively.
We also tried making an interactive user interface using Python, HTML, and Flask. Flask’s design is lightweight and modular, which makes it easy to build a small web application that needs only a few extensions. Users can click on “Generate Results” to see the final impact of CAA-NRC on different states.
The state-wise analysis of tweets in favor of CAA, NRC, and NPR
The darker regions indicate stronger support for the government’s introduction of these acts, whereas the lighter regions indicate weaker support in the respective states.
The state-wise analysis of tweets against CAA, NRC, and NPR
The darker regions indicate stronger opposition to the government’s introduction of these acts, whereas the lighter regions indicate weaker opposition in the respective states.
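One plausible way to derive the shading values for the two maps is to compute, for each state, the share of tweets labeled in favor and the share labeled against. The exact measure behind the maps is not stated, so this is an assumption. The function name and the records below are invented for illustration and are not results from the project.

```python
from collections import defaultdict

def state_sentiment_shares(records):
    """records: iterable of (state, label) pairs with label in {+1, 0, -1}."""
    totals = defaultdict(int)
    in_favor = defaultdict(int)
    against = defaultdict(int)
    for state, label in records:
        totals[state] += 1
        if label == 1:
            in_favor[state] += 1
        elif label == -1:
            against[state] += 1
    return {
        state: {"in_favor": in_favor[state] / n, "against": against[state] / n}
        for state, n in totals.items()
    }

# Invented sample records, NOT data from the project.
sample = [
    ("State A", 1), ("State A", -1), ("State A", -1), ("State A", 0),
    ("State B", 1), ("State B", 1),
]
shares = state_sentiment_shares(sample)
```

Each map would then color a state by one of the two shares. Because of the neutral category, the two shares do not have to add up to 1.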
Contributions
Harshita Pandey: Data Preprocessing, State-wise Segmentation of tweets, Applied the RNN model, Manual Annotation Of Tweets, User Interface using Dash, Documentation.
Shivani Mittal: Manual Annotation Of Tweets, Text Vectorization, Applied the SVM and Random Forest Model, Hyper-Parameter Tuning, User Interface using Dash, Documentation.
Rupali: Manual Annotation of Tweets, Data Encoding, Applied the Logistic Regression Model, CNN and SGD Optimizer, User Interface using Flask and Geopandas, Documentation.
Acknowledgments:
Asst. Prof. Tanmoy Chakraborty https://www.iiitd.ac.in/tanmoy
Jasmeet Kaur (Ph.D. Scholar, IIITD) https://jasmeetk6.wixsite.com/jasmeetk
Anubhav Shrimal (Mtech. IIITD) http://anubhavshrimal.me/
Abhinav Gupta (Mtech. IIITD), Vrutti Daxeshbhai Patel (Mtech IIITD), Hridoy Sankar Dutta (Ph.D. Scholar, IIITD)