Sentiment Analysis of Tweets using Python
This is my initial research on Sentiment Analysis using the TextBlob and NLTK libraries. This content is solely for educational purposes; please do not use it for any commercial application. I hope this article gives you some insight into sentiment analysis.
a. What is sentiment analysis?
Sentiment analysis is the process of determining the sentiment
of a piece of writing. It can be positive, negative or neutral.
Sentiment analysis refers to the use of text analysis, natural
language processing and computational linguistics to identify
and extract subjective information from a given text. For example:
Text 1: “I love eating at McDonald’s”
Text 2: “That dress is really bad. Do not buy it.”
Reading Text 1, we can conclude that the sentiment behind it is
positive: the text expresses the writer's love of eating at
McDonald's, and love is psychologically classified as a positive
feeling. Similarly, the sentiment behind Text 2 can be classified
as negative, because it expresses the user's distaste for a
particular dress.
b. How is it done?
Sentiment analysis can be performed by looking for the particular
words that make a text positive, negative or neutral. Words with
positive polarity are assigned a positive score according to their
intensity. (Example: "Brilliant" is a more positive word than
"Good", so Brilliant is assigned a higher score than Good.)
Python libraries such as TextBlob and NLTK already have word
sentiments stored in them; using these libraries, we can
perform sentiment analysis on text.
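For instance, a quick sketch of this intensity idea using TextBlob's per-word polarity scores (the exact numbers depend on the TextBlob version installed):

from textblob import TextBlob

# stronger words get a higher (or lower) polarity score in TextBlob's lexicon
for word in ('good', 'brilliant', 'bad', 'horrible'):
    print(word, TextBlob(word).sentiment.polarity)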
Sentiment analysis can also be performed on internet data by
analyzing likes and dislikes. For example, on websites like
Amazon.com a product's performance can be estimated from the
number of stars users give it, and on websites like facebook.com
sentiment analysis can be performed by analyzing smileys, and so
on. Although the stock market is very volatile, sentiment analysis
can also be performed through mathematical calculations on stock
data: it can suggest which stocks have been more beneficial for
users, and if people are buying a stock more, it implies that the
sentiment around that stock is more positive.
Hence, there are a number of ways to apply sentiment analysis
effectively; the right approach varies with the type of data and the analysis algorithm.
c. How is it useful to companies?
A business cannot survive without profit, and profit is made by creating
products that people like and buy. So it is essential for a
company to know what its customers like or dislike, and sentiment
analysis is a very effective tool for knowing that. Sentiment analysis
empowers corporations to understand market trends and plan
their future products accordingly.
d. Significance of Twitter data in sentiment analysis:
We have chosen twitter.com to perform sentiment analysis on
its data for the following reasons:
i. Twitter data is easy to process due to its simplicity. Most
tweets are collections of a limited number of words that people
use to express their emotions; hence, tweets are short and
simple.
ii. Twitter data is easily available through the Twitter API.
iii. Almost all the famous people in the world use Twitter to
express their views. This prompts their fans to follow and
retweet, in turn giving rich data on a subject.
a. What is Contextual Analysis?
Contextual analysis is the analysis of data (in text,
image or multimedia formats) that helps us assess a text within
the context of its historical and cultural setting, but also in terms of
its textuality, the qualities that characterize the text as a text. It
helps us find trends and topics in unstructured data,
including documents, social media and e-mail.
Through contextual analysis we can determine the following things about
a text:
i. What does the text reveal about itself as a text?
ii. What does the text tell us about its apparent intended
audience(s)?
iii. What seems to have been the author’s intention?
iv. What is the occasion for this text? Etc.
b. How is it significant to this project?
Contextual analysis is very important for the accuracy of
sentiment analysis. Companies that have developed their own
effective sentiment analyzers and provide services to other
companies make sure that their data is first contextually analyzed
and only then fed into the software for sentiment analysis. The reason
is that there is always uncertainty about how accurately a system
can predict the exact sentiment of a text; assuming that the
software will be 100% accurate in predicting the exact meaning
and sentiment of a text is purely hypothetical. So, until system
accuracy becomes moderately mature, contextual analysis is
extremely important for assisting the sentiment analysis process,
especially when the sentiment analysis is done through machine learning.
Contextual analysis was very important in the third phase of this
project, which incorporates the Natural Language Toolkit
(NLTK) library for Python, since we had to train the system before
applying sentiment analysis to the data.
c. How is it applied here?
To train the system well, we first fed it tweets that were simpler for the
system to understand and distinguish as positive or negative. Then
we went on to more complex tweets to improve the system's
accuracy.
For example, consider a single raw tweet as returned by the
Twitter API: it comprises the user ID, the text of the tweet,
the time and time zone, images and image IDs, hashtags, the
source and location, the user's screen name, etc.
To process a tweet for sentiment analysis we do not need all of the
above information, just the text of the tweet. So, to improve the
accuracy of the program's sentiment prediction, we eliminate the
extra fields through contextual analysis. Before the final
sentiment analysis is run, the data analyzer goes through
the downloaded tweets and performs the essential changes to the raw
dataset that improve the final output. This process is what we call contextual analysis.
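As a minimal sketch of this cleaning step, the snippet below keeps only the text field of a raw tweet. The JSON string here is a simplified, hypothetical tweet object, not real Twitter output:

import json

# a simplified, hypothetical raw tweet with a few of the fields listed above
raw_tweet = '{"id": 1, "text": "I love Dr.Strange", "user": {"screen_name": "fan42"}, "lang": "en"}'
tweet = json.loads(raw_tweet)
print(tweet['text'])   # only the text is passed on to the sentiment analyzer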
Machine Learning and Natural Language Processing (NLP)
a. What is Machine Learning and Natural Language Processing?
Machine learning is a type of artificial intelligence (AI) that provides
computers with the ability to learn without being explicitly
programmed. Machine learning focuses on the development of
computer programs that can teach themselves to grow and change
when exposed to new data.
The process of machine learning is similar to that of data mining.
Both systems search through data to look for patterns. However,
instead of extracting data for human comprehension, as is the case
in data mining applications, machine learning uses that data to
detect patterns and adjust program actions accordingly.
Natural language processing on the other hand is a field of computer
science, artificial intelligence, and computational linguistics
concerned with the interactions between computers and human
(natural) languages. As such, NLP is related to the area of
human–computer interaction. Many challenges in NLP involve
natural language understanding, that is, enabling computers to derive
meaning from human or natural language input; others involve
natural language generation.
b. Significance of NLP in relation to sentiment analysis:
Digital media represents a huge opportunity for businesses
of any type to capture the opinions, needs and intent that
users share on social media. In fact, the number of Google
searches, WhatsApp messages and emails sent in 60 seconds
is truly impressive (2,315,000 Google searches, 44,000,000
WhatsApp messages, more than 150,000,000 emails). Truly
listening to a customer's voice requires deeply
understanding what they have expressed in natural
language: Natural Language Processing (NLP) is the best way
to understand the language used and uncover the
sentiment behind it.
While people often consider sentiment (in terms of positive
or negative) the most significant value of the opinions
users express via social media, in reality emotions
provide a richer set of information that drives consumer
choices and, in many cases, even determines their decisions.
Because of this, Natural Language Processing for sentiment
analysis focused on emotions proves extremely useful.
Thanks to NLP combined with a powerful social media
monitoring strategy, organizations can understand customer
reactions and act accordingly in improving customer
experience, quickly resolving customer issues and changing
their market positioning.
However, without NLP and without access to the right data,
it is difficult to discover and collect insight necessary for
driving business decisions. NLP makes it easier. For example,
if a customer sends an email about a problem they’re
experiencing with a product or service, an NLP system would
recognize the emotion (angry, disappointed, annoyed) and
mark it for a quick automatic response or forward the email
to the right person. Similarly, a financial services company
could use an NLP application to identify the sentiment in
articles associated with specific stocks, or analyze reports to
judge a stock’s performance and recommend whether to
buy or sell the stock.
Natural Language Processing for sentiment analysis is being
widely adopted by different types of organizations to extract
insight from social data and acknowledge the impact of
social media on brands and products.
c. NLP using Python
Because of Python's rich ecosystem of libraries, NLP can be
performed easily.
Libraries such as NLTK and TextBlob can be used, with pandas helping to handle the data.
The Natural Language Toolkit, or more commonly NLTK, is a
suite of libraries and programs for symbolic and statistical
natural language processing (NLP) for English written in the
Python programming language. It was developed by Steven
Bird and Edward Loper in the Department of Computer and
Information Science at the University of Pennsylvania. NLTK
includes graphical demonstrations and sample data. It is
accompanied by a book that explains the underlying
concepts behind the language processing tasks supported by
the toolkit, plus a cookbook. NLTK is intended to support
research and teaching in NLP or closely related areas,
including empirical linguistics, cognitive science, artificial
intelligence, information retrieval, and machine learning.
NLTK has been used successfully as a teaching tool, as an
individual study tool, and as a platform for prototyping and
building research systems.
TextBlob is a Python library for processing textual data. It
provides a simple API for diving into common natural
language processing (NLP) tasks such as part-of-speech
tagging, noun phrase extraction, sentiment analysis,
classification, translation, etc. NLTK and TextBlob are the two
libraries that we will be using in this project of sentiment analysis.
A few other libraries will be mentioned further on.
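A small sketch of TextBlob's API; on a first run, the corpora it needs can be fetched with python -m textblob.download_corpora:

from textblob import TextBlob

blob = TextBlob("TextBlob makes natural language processing simple.")
print(blob.tags)          # part-of-speech tags
print(blob.noun_phrases)  # noun phrase extraction
print(blob.sentiment)     # Sentiment(polarity=..., subjectivity=...)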
Code and Outputs:
The analysis is performed here in three phases. In the first phase of the
project, the analysis is done on a simple dataset that we have entered
manually into the system.
a. This code will perform sentiment analysis on the strings entered in the list
and will return the sentiment of each string.
Phase-1
Code:
import nltk
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

# first run may require: nltk.download('punkt') for word_tokenize,
# plus the TextBlob corpora (python -m textblob.download_corpora)
sentences = ["I hate sandwich", "I love you", "This sandwich is horrible",
             "You are an idiot", "Are you crazy?"]

for i in sentences:
    # Naive Bayes analyzer: pos/neg classification with probabilities
    sent = TextBlob(i, analyzer=NaiveBayesAnalyzer())
    classification = sent.sentiment.classification
    p_pos = sent.sentiment.p_pos
    p_neg = sent.sentiment.p_neg
    # default pattern analyzer: polarity and subjectivity
    sent = TextBlob(i)
    polarity = sent.sentiment.polarity
    subjectivity = sent.sentiment.subjectivity
    tokens = nltk.word_tokenize(i)
    print("Text: " + str(i))
    print(tokens)
    print("Classification: " + classification)
    print("positivity: " + str(p_pos))
    print("negativity: " + str(p_neg))
    print("Polarity: " + str(polarity))
    print("Subjectivity: " + str(subjectivity))
    print(" ")
Output:
Explanation:
Here the classification of text is clear in terms of its positivity and
negativity. Other aspects of sentimental analysis of the text such as the
Polarity, Subjectivity, Positivity, Negativity are also clear.
Phase-2: Sentiment Analysis of Tweets
In this phase of the project, we apply sentiment analysis to tweets
that are streamed live from Twitter by this program. The
sentiment analysis is applied to the stream in real time.
Polarity:
The polarity of a text tells about its inclination towards an emotion. It can
be positive or negative; if the text is neutral, the polarity often comes
out as zero.
Subjectivity:
Subjectivity focuses on determining subjective words and texts, which
mark the presence of opinions and evaluations, as opposed to objective words
and texts, which are used to present factual information.
Classification:
The classification of a text simply tells whether the text is positive or negative.
Positivity:
The positivity of a text tells about its rating as a positive text, on a scale
from 0 (least positive) to 1 (most positive).
Negativity:
The negativity of a text tells about its rating as a negative text, on the same
scale from 0 to 1, with 1 being the most negative. Example:
"I hate you, you are the most horrible person." might get a negativity score of
0.8 or 0.9.
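Using the example sentence above, a quick sketch of how these five values can be read off with TextBlob (assuming its corpora are installed):

from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

text = "I hate you, you are the most horrible person."
pattern = TextBlob(text).sentiment                # polarity and subjectivity
nb = TextBlob(text, analyzer=NaiveBayesAnalyzer()).sentiment
print(pattern.polarity, pattern.subjectivity)
print(nb.classification, nb.p_pos, nb.p_neg)      # 'neg' expected, with a high p_neg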
Libraries Used:
Tweepy:
Tweepy is used to import data from Twitter using the consumer key and other
secret keys. It also provides the OAuthHandler class, which helps the program get
authenticated by Twitter in order to make a connection and import data.
NLTK:
The Natural Language Toolkit, or more commonly NLTK, is a suite
of libraries and programs for symbolic and statistical natural
language processing (NLP) for English written in the Python
programming language.
CODE:
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import time
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

class listener(StreamListener):

    def __init__(self, api=None):
        super(listener, self).__init__()
        self.num_tweets = 0
        self.limit = 10   # stop after this many tweets

    def on_data(self, data):
        self.num_tweets += 1
        try:
            if self.num_tweets <= self.limit:
                # crude extraction of the tweet text from the raw JSON string
                tweet = data.split(',"text":"')[1].split('","source')[0]
                tweetSen = TextBlob(tweet)
                sent = TextBlob(tweet, analyzer=NaiveBayesAnalyzer())
                p_pos = sent.sentiment.p_pos
                p_neg = sent.sentiment.p_neg
                P = "Polarity : " + str(tweetSen.sentiment.polarity)
                S = "Subjectivity : " + str(tweetSen.sentiment.subjectivity)
                C = "Classification : " + sent.sentiment.classification
                Po = "Positivity : " + str(p_pos)
                N = "Negativity : " + str(p_neg)
                print('***')
                print(tweet)
                print(P)
                print(S)
                print(C)
                print(Po)
                print(N)
                print('***')
                saveThis = str(time.time()) + ':' + tweet + '| |' + P + '| |' + S + '| |' + C + '| |' + Po + '| |' + N
                f = open('Twitdata.txt', 'a')
                if self.num_tweets == 1:
                    # write a header line before the first tweet
                    f.write('\n' + str(self.limit) + ' set of tweets: \n\n')
                f.write(saveThis)
                f.write('\n')
                f.close()
                # rebuild a simple dictionary index of the saved tweets
                with open('Twitdata.txt', 'r') as f:
                    lines = f.read().splitlines()
                ls = []
                sa = {}
                for i, line in enumerate(lines):
                    vals = line.split()
                    if not vals:        # skip the blank header lines
                        continue
                    sa['_id'] = i
                    sa['Uid'] = vals[0]
                    ls.append(dict(sa))
                for i, p in enumerate(ls):
                    if i <= 20:
                        print(p)
                sa2 = open('sadict1.txt', 'w')
                for element in ls:
                    sa2.write(str(element) + '\n')
                sa2.close()
                return True
            else:
                return False
        except BaseException as e:
            print('failed ondata, ' + str(e))
            time.sleep(5)

    def on_error(self, status):
        if status == 420:
            print('rate limited:', status, 'error')
        else:
            print('other error:', status, 'error')

if __name__ == '__main__':
    consumer_key = ''
    consumer_secret = ''
    access_token = ''
    access_secret = ''
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    twitterStream = Stream(auth, listener())
    twitterStream.filter(track=['Dr Strange'])
Explanation:
This code imports data from Twitter (in sets of 10) and performs real-time
sentiment analysis on it. The sentiment of each text is calculated in terms of its
Polarity, Subjectivity, Classification, Positivity and Negativity. The tweets are
stored in a file named Twitdata.txt, which can be used afterwards for further
analysis.
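As a quick check, the saved file can be read back later, for example:

# print the first few saved records from Twitdata.txt
with open('Twitdata.txt', 'r') as f:
    for line in f.read().splitlines()[:5]:
        print(line)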
Time-based Approach:
In the following program we automate the stream in terms of time. The
program does not terminate after a fixed number of tweets as above, but
runs for two minutes, downloading as many tweets as it can in that time.
The stream runs asynchronously in a background thread, so the main thread
simply sleeps for 120 seconds and then disconnects.
Code:
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import time
import logging

consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''

class TwitterListener(StreamListener):

    def __init__(self, phrases):
        auth = OAuthHandler(consumer_key, consumer_secret)
        auth.set_access_token(access_token, access_secret)
        self.__stream = Stream(auth, listener=self)
        # start the stream in a background thread; the 'async' argument
        # was renamed 'is_async' in newer tweepy releases, since async
        # became a reserved word in Python 3.7
        self.__stream.filter(track=phrases, is_async=True)

    def disconnect(self):
        self.__stream.disconnect()

    def on_data(self, data):
        print(data)

    def on_error(self, status):
        print(status)

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    phrases = ['python']
    listener = TwitterListener(phrases)
    # listen for 120 seconds, then stop
    time.sleep(120)
    listener.disconnect()
Dictionaries in Python:
Dictionary is a very useful data type built into Python. Unlike sequences, which
are indexed by a range of numbers, dictionaries are indexed by keys, which can be
any immutable type; strings and numbers can always be keys. Tuples can be used
as keys if they contain only strings, numbers, or tuples; if a tuple contains any
mutable object either directly or indirectly, it cannot be used as a key.
It is best to think of a dictionary as an unordered set of key: value pairs, with the
requirement that the keys are unique (within one dictionary). A pair of braces
creates an empty dictionary: {}. Placing a comma-separated list of key:value pairs
within the braces adds initial key:value pairs to the dictionary; this is also the way
dictionaries are written on output.
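A minimal illustration of creating a dictionary and looking values up by key:

# create a dictionary with two key:value pairs
scores = {'good': 3, 'bad': -3}
scores['brilliant'] = 4      # add a new pair
print(scores['good'])        # lookup by key -> 3
print('great' in scores)     # membership test -> False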
Example:
In the presented example we take the AFINN-111.txt file, which is used for
sentiment analysis and natural language processing, and convert it into a
dictionary of words. The initial AFINN-111.txt file is in a list format: each
line contains a word (or phrase) and its sentiment score, separated by a tab,
e.g. abandon -2. After converting the file, each record is a dictionary such as
{'_id': 0, 'text': 'abandon', 'sentiment': '-2'}.
Code for Creating Dictionary:
with open('AFINN-111.txt', 'r') as f:
    lines = f.read().splitlines()

ls = []
sa = {}
for i, line in enumerate(lines):
    # AFINN-111 is tab-delimited: phrase<TAB>score
    vals = line.split('\t')
    sa['_id'] = i
    sa['text'] = vals[0]
    sa['sentiment'] = vals[1]
    ls.append(dict(sa))

# print the first 21 records as a check
for i, p in enumerate(ls):
    if i <= 20:
        print(p)

# write the list of dict elements to a file
sa2 = open('sadict.txt', 'w')
for element in ls:
    sa2.write(str(element) + '\n')
sa2.close()
Code to read the dictionary:
import ast

def read_file(f):
    with open(f, 'r') as f:
        records = f.read().splitlines()
    return records

def dict_val(f, w, k1, k2, k3):
    # scan the records for the entry whose k1 value equals w
    for i, d in enumerate(f):
        dic = ast.literal_eval(d)   # parse a printed dict back into a dict
        if dic.get(k1) == w:
            return dic.get(k2), dic.get(k3)

f = 'sadict.txt'    # the file written by the code above
lines = read_file(f)
word = 'zealous'
key1 = 'text'       # the word itself is stored under the 'text' key
key2 = 'sentiment'
key3 = '_id'
sentiment = dict_val(lines, word, key1, key2, key3)
print('id:', sentiment[1], 'sentiment:', sentiment[0])
Application in the project:
Once imported from Twitter, the data is saved in a text file. This data is
cleaned to the point where all the different aspects of a tweet, such as the
UID, the user screen name, the text and the sentiment analysis of the text,
can be seen clearly. To store the data in a more effective way we can use a
dictionary. This can help us select any tweet by its user ID or user screen
name, as sketched below.
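For example, a small sketch of selecting a record from the dictionary file written by the streaming code above (the _id value here is arbitrary):

import ast

# load the dictionary records and select one by its _id
with open('sadict1.txt', 'r') as f:
    records = [ast.literal_eval(line) for line in f.read().splitlines() if line]
wanted = [r for r in records if r['_id'] == 3]
print(wanted)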
Moving to NLP using NLTK:
NLTK (Natural Language Toolkit) is one of the most powerful libraries in Python
for natural language processing. It is also the backbone of a lot of other
libraries that help implement NLP. We will use the NLTK library in our program
to show how a system can be trained.
Code:
import nltk

Pos_Sentences = [('I love Dr.Strange', 'positive'),
                 ('This painting is amazing', 'positive'),
                 ('The class was very good and informative', 'positive'),
                 ('I am really excited about my trip', 'positive'),
                 ('I love cookies', 'positive')]
Neg_Sentences = [('I did not like Dr.Strange', 'negative'),
                 ('The painting is horrible', 'negative'),
                 ('The class was a waste', 'negative'),
                 ('I dont want to go to the trip', 'negative'),
                 ('I hate Cookies', 'negative')]

MyWordList = []
for (words, sentiment) in Pos_Sentences + Neg_Sentences:
    # lowercase every word and drop words shorter than 4 letters
    Filtered_Words = [e.lower() for e in words.split() if len(e) >= 4]
    MyWordList.append((Filtered_Words, sentiment))

def get_words(MyWordList):
    all_words = []
    for (words, sentiment) in MyWordList:
        all_words.extend(words)
    return all_words

def word_feature(wordlist):
    # order the vocabulary by frequency of occurrence
    wordlist = nltk.FreqDist(wordlist)
    word_features = [w for w, count in wordlist.most_common()]
    return word_features

word_features = word_feature(get_words(MyWordList))

def extract_feature(Text):
    Text_words = set(Text)
    Final_List = {}
    for word in word_features:
        Final_List['contains(%s)' % word] = (word in Text_words)
    print(Final_List)   # debug: show the feature vector for each sentence
    return Final_List

training_set = nltk.classify.apply_features(extract_feature, MyWordList)
classifier = nltk.NaiveBayesClassifier.train(training_set)
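To see the trained classifier in action, we can feed it a new sentence processed the same way (this test sentence is a hypothetical example, not part of the original dataset):

# classify a new sentence with the trained model
test_sentence = 'The trip was amazing'
test_words = [e.lower() for e in test_sentence.split() if len(e) >= 4]
print(classifier.classify(extract_feature(test_words)))   # expected: 'positive'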
Explanation:
In this code we create two lists, one that stores positive responses and one that stores
negative responses, and we tag them as positive or negative according to our own human
observation. These responses are then stored in tokenized format: the sentences are
converted to lowercase, words shorter than 4 letters are removed, and the result is stored
in another list called MyWordList. Then two functions are created: get_words and word_feature.
get_words collects all the individual words, and word_feature arranges the words according to
the number of their occurrences in the word list. The last function, extract_feature(Text), goes through every sentence, checks it against the word list, and gives a boolean output indicating whether each word in the word list appears in the sentence.