Sentiment Analysis of Tweets Using Python

This is my initial research on sentiment analysis with the TextBlob and NLTK libraries. This content is solely for educational purposes; please do not use it for any commercial application. I hope this article gives you some insight into sentiment analysis.


NOTE: If you copy the code below, replace any curly quotes (' or ") with straight apostrophes (').

a. What is sentiment analysis?

Sentiment analysis is the process of determining the sentiment of a piece of writing, which can be positive, negative, or neutral. It refers to the use of text analysis, natural language processing, and computational linguistics to identify and extract subjective information from a given text. For example:

Text 1: “I love eating at McDonald’s”

Text 2: “That dress is really bad. Do not buy it.”

Reading Text 1, we can conclude that the sentiment behind the text is positive: it expresses the love of a writer who eats at McDonald’s, and love is psychologically classified as a positive feeling. Similarly, the sentiment behind Text 2 can be classified as negative, because it expresses the writer’s distaste for a particular dress.

b. How is it done?

 Sentiment analysis can be performed by looking for the particular words that make a text positive, negative, or neutral. Words with positive polarity are assigned a positive score according to their intensity. (Example: “Brilliant” is a more positive word than “Good”, so “Brilliant” is assigned a higher score.) Python libraries such as TextBlob and NLTK already have these word sentiments stored in them, and with these libraries we can perform sentiment analysis on text.

 Sentiment analysis can also be performed on internet data by analyzing likes and dislikes. For example, on websites like Amazon.com the performance of a product can be estimated from the number of stars users give it, and on websites like facebook.com sentiment can be gauged by analyzing emoticons. Although the stock market is very volatile, sentiment analysis can also be performed through mathematical calculations on stock data: it can suggest which stocks have been more beneficial to users, and if users are buying a stock more, the sentiment around that stock is more positive.

 Hence, there are a number of ways to apply sentiment analysis effectively, varying with the type of data and the analysis algorithm.
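The word-scoring approach described above can be sketched in a few lines of plain Python. The lexicon and its scores below are invented for illustration; real libraries such as TextBlob and NLTK ship much larger word lists:

```python
# Minimal lexicon-based scoring sketch; the word scores here are
# invented for illustration, not taken from any real library.
lexicon = {"brilliant": 3, "good": 2, "love": 3, "bad": -2, "horrible": -3}

def score(text):
    # sum the scores of known words; unknown words count as 0 (neutral)
    return sum(lexicon.get(word, 0) for word in text.lower().split())

print(score("The food was brilliant"))    # 3 -> positive
print(score("That dress is really bad"))  # -2 -> negative
```

Note how “brilliant” outscores “good”, matching the intensity ordering described above.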



c. How is it useful to companies?

 A business cannot survive without profit, and profit is made by creating products that people like and buy. So it is essential for a company to know what its customers like or dislike, and for that, sentiment analysis is a very effective tool. Sentiment analysis empowers corporations to understand market trends and plan their future products accordingly.

d. Significance of Twitter data in sentiment analysis:

 We have chosen twitter.com as the data source for sentiment analysis for the following reasons:

i. Twitter data is easier to process due to its simplicity. Most tweets are a collection of a limited number of words that people use to express their emotions, so tweets are short and simple.

ii. Twitter data is easily available through Twitter applications.

iii. Almost all the famous people in the world use Twitter to express their views. Their fans follow and retweet their tweets, in turn giving rich data on a subject.

a. What is Contextual Analysis?

 Contextual analysis is the analysis of data (in text, image, or multimedia formats) that helps us assess a text within the context of its historical and cultural setting, but also in terms of its textuality: the qualities that characterize the text as a text. It helps us find trends and topics in unstructured data, including documents, social media, and e-mail.

 Through contextual analysis we can determine the following things about a text:

i. What does the text reveal about itself as a text?

ii. What does the text tell us about its apparent intended audience(s)?

iii. What seems to have been the author’s intention?

iv. What is the occasion for this text?

b. How is it significant to this project?

 Contextual analysis is very important for the accuracy of sentiment analysis. Companies that have developed their own effective sentiment analyzers and provide services to other companies make sure that their data is first contextually analyzed and then fed into the software for sentiment analysis. The reason is that there is always uncertainty about how accurately a system can predict the exact sentiment of a text; assuming the software will be 100% accurate is purely hypothetical. So, until a system’s accuracy becomes moderately mature, contextual analysis is extremely important in assisting the sentiment analysis process, especially when sentiment analysis is done through machine learning.

 Contextual analysis was very important in the third phase of this project, which incorporates the Natural Language Toolkit (NLTK) library for Python, since we had to train the system before applying sentiment analysis to the data.

c. How is it applied here?

 To train the system well, we first fed it tweets that were simpler to understand and distinguish as positive or negative, then moved on to more complex tweets to improve the system’s accuracy. For example:

 A single tweet comprises a user ID, the text of the tweet, time and time zone, image, image ID, hashtags, image source and location, user screen name, and so on.

 To process a tweet for sentiment analysis we do not need all of this information, just the text of the tweet. So, to improve the accuracy of the program’s sentiment prediction, we eliminate the other fields through contextual analysis. Before running the final sentiment analysis, the data analyzer goes through the downloaded tweets and performs essential changes in the raw dataset to improve the final output. This process is called contextual analysis.
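Since a streamed tweet arrives as JSON, the “keep only the text” step can be sketched with the standard json module. The sample payload below is invented and much simpler than a real tweet; parsing the JSON is also more robust than splitting the raw string, as the streaming code later in this article does:

```python
import json

# A raw tweet carries many fields (ids, timestamps, user info, ...);
# for sentiment analysis we keep only the "text" field.
# This sample payload is invented for illustration.
raw = ('{"id": 1, "created_at": "Mon May 01 00:00:00 +0000 2017", '
       '"text": "I love this movie", "source": "web", '
       '"user": {"screen_name": "alice"}}')

tweet = json.loads(raw)
text_only = tweet["text"]
print(text_only)  # I love this movie
```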



Machine Learning and Natural Language Processing (NLP)

a) What is Machine Learning and Natural Language Processing?

 Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data.

 The process of machine learning is similar to that of data mining. Both systems search through data to look for patterns. However, instead of extracting data for human comprehension, as is the case in data mining applications, machine learning uses that data to detect patterns and adjust program actions accordingly.

 Natural language processing, on the other hand, is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding, enabling computers to derive meaning from human or natural language input; others involve natural language generation.

a. Significance of NLP in relation to sentiment analysis:

 Digital media represents a huge opportunity for businesses of any type to capture the opinions, needs, and intent that users share on social media. In fact, the number of Google searches, WhatsApp messages, and emails sent in 60 seconds is truly impressive (2,315,000 Google searches, 44,000,000 WhatsApp messages, more than 150,000,000 emails). Truly listening to a customer’s voice requires deeply understanding what they have expressed in natural language: natural language processing (NLP) is the best way to understand the language used and uncover the sentiment behind it.

 While people often consider sentiment (in terms of positive or negative) the most significant value of the opinions users express via social media, the reality is that emotions provide a richer set of information that shapes consumer choices and, in many cases, even determines their decisions. Because of this, NLP for emotion-focused sentiment analysis proves extremely useful. Thanks to NLP combined with a powerful social media monitoring strategy, organizations can understand customer reactions and act accordingly: improving customer experience, quickly resolving customer issues, and changing their market positioning.

 However, without NLP and without access to the right data, it is difficult to discover and collect the insight necessary for driving business decisions. NLP makes it easier. For example, if a customer sends an email about a problem they’re experiencing with a product or service, an NLP system would recognize the emotion (angry, disappointed, annoyed) and mark it for a quick automatic response or forward the email to the right person. Similarly, a financial services company could use an NLP application to identify the sentiment in articles associated with specific stocks, or analyze reports to judge a stock’s performance and recommend whether to buy or sell it.

NLP for sentiment analysis is being widely adopted by different types of organizations to extract insight from social data and gauge the impact of social media on brands and products.

b. NLP using Python

 Because of Python’s rich libraries, NLP can be performed easily.

 Libraries such as NLTK, TextBlob, and pandas can be used.

 The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English, written in the Python programming language. It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania. NLTK includes graphical demonstrations and sample data, and is accompanied by a book that explains the underlying concepts behind the language processing tasks supported by the toolkit, plus a cookbook. NLTK is intended to support research and teaching in NLP and closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning. It has been used successfully as a teaching tool, as an individual study tool, and as a platform for prototyping and building research systems.

 TextBlob is a Python library for processing textual data. It provides a simple API for diving into common NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and translation. These are the two libraries that we will be using in this sentiment analysis project; a few other libraries will be mentioned further on.

Code and Outputs:

The analysis is performed in three phases. In the first phase of the project, the analysis is done on a simple dataset that we have entered manually into the system.

a. This code will perform sentiment analysis on the strings entered in the list and return the sentiment of each string.

Phase-1

Code:

import nltk
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

sentences = ["I hate sandwich", "I love you", "This sandwich is horrible",
             "You are an idiot", "Are you crazy?"]

for i in sentences:
    # NaiveBayesAnalyzer gives a classification plus p_pos/p_neg probabilities
    sent = TextBlob(i, analyzer=NaiveBayesAnalyzer())
    classification = sent.sentiment.classification
    p_pos = sent.sentiment.p_pos
    p_neg = sent.sentiment.p_neg
    # the default analyzer gives polarity and subjectivity
    sent = TextBlob(i)
    polarity = sent.sentiment.polarity
    subjectivity = sent.sentiment.subjectivity
    tokens = nltk.word_tokenize(i)
    print("Text: " + str(i))
    print(tokens)
    print("Classification: " + classification)
    print("positivity: " + str(p_pos))
    print("negativity: " + str(p_neg))
    print("Polarity: " + str(polarity))
    print("Subjectivity: " + str(subjectivity))
    print(" ")

Output:

Explanation:

 Here the classification of the text in terms of its positivity and negativity is clear, as are the other aspects of the sentiment analysis of the text: polarity, subjectivity, positivity, and negativity.

Phase-2: Sentiment Analysis of Tweets

 In this phase of the project, we apply sentiment analysis to tweets streamed live from Twitter using this program. The analysis is applied in real time as the tweets arrive.

Polarity:

Polarity tells about a text’s inclination toward an emotion. It ranges from -1 (most negative) to 1 (most positive); if the text is neutral, the polarity is usually close to zero.

Subjectivity:

Subjectivity focuses on determining subjective words and texts, which mark the presence of opinions and evaluations, versus objective words and texts, which present factual information.

Classification:

Classification simply tells whether the text is positive or negative.

Positivity:

Positivity (p_pos) rates the text as positive on a scale from 0 to 1, where 1 is the most positive score.

Negativity:

Negativity (p_neg) rates the text as negative on the same 0-to-1 scale, where 1 is the most negative. Example: “I hate you, you are the most horrible person.” might get a negativity score of 0.8 or 0.9.
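As a sketch of how these values relate, the classification can be thought of as thresholding the polarity score. The zero thresholds below are illustrative assumptions; TextBlob’s NaiveBayesAnalyzer actually derives its classification from p_pos and p_neg:

```python
# Illustrative reduction of a polarity score to a label;
# the zero thresholds are an assumption for this sketch.
def label(polarity):
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

print(label(0.8))   # positive
print(label(-0.9))  # negative
print(label(0.0))   # neutral
```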

Libraries Used:

Tweepy:

Tweepy is used to import data from Twitter using the consumer key and other secret keys. It also provides the OAuthHandler class, which helps the program get authenticated by Twitter to make a connection and import data.

NLTK:

The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English, written in the Python programming language.

CODE:

from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import time, json
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
import nltk

class listener(StreamListener):
    def __init__(self, api=None):
        super(listener, self).__init__()
        self.num_tweets = 0

    def on_data(self, data):
        self.num_tweets += 1
        self.limit = 10
        try:
            if self.num_tweets <= self.limit:
                tweet = data.split(',"text":"')[1].split('","source')[0]
                tweetSen = TextBlob(tweet)
                sent = TextBlob(tweet, analyzer=NaiveBayesAnalyzer())
                p_pos = sent.sentiment.p_pos
                p_neg = sent.sentiment.p_neg
                P = "Polarity : " + str(tweetSen.sentiment.polarity)
                S = "Subjectivity : " + str(tweetSen.sentiment.subjectivity)
                C = "Classification : " + sent.sentiment.classification
                Po = "Positivity : " + str(p_pos)
                N = "Negativity : " + str(p_neg)
                print('***')
                print(tweet)
                print(P)
                print(S)
                print(C)
                print(Po)
                print(N)
                print('***')
                saveThis = str(time.time()) + ':' + tweet + '| |' + P + '| |' + S + '| |' + C + '| |' + Po + '| |' + N
                f = open('Twitdata.txt', 'a')
                if self.num_tweets == 1:
                    f.write('\n' + str(self.limit) + ' set of tweets: \n\n')
                else:
                    f.write(saveThis)
                    f.write('\n')
                f.close()
                with open('Twitdata.txt', 'r') as f:
                    lines = f.read().splitlines()
                ls = []
                sa = {}
                for i, line in enumerate(lines):
                    vals = line.split()
                    sa['_id'] = i
                    sa['Uid'] = vals[0]
                    ls.append(dict(sa))
                for i, p in enumerate(ls):
                    if i <= 20:
                        print(p)
                sa2 = open('sadict1.txt', 'w')
                for element in ls:
                    print(element, file=sa2)
                sa2.close()
                return True
            else:
                return False
        except BaseException as e:
            print('failed ondata, ', str(e))
            time.sleep(5)

    def on_error(self, status):
        if status == 420:
            print('rate limited:', status, 'error')
        else:
            print('other error:', status, 'error')


if __name__ == '__main__':
    consumer_key = ''
    consumer_secret = ''
    access_token = ''
    access_secret = ''
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    twitterStream = Stream(auth, listener())
    twitterStream.filter(track=['Dr Strange'])

Explanation:

This code imports data from Twitter (in sets of 10) and performs real-time sentiment analysis on it. The sentiment of a text is calculated in terms of its polarity, subjectivity, classification, positivity, and negativity. The tweets are stored in a file named Twitdata.txt, which can be used afterwards for further analysis.

Time-based Approach:

The following program bounds the import by time instead of by tweet count. Rather than terminating after a fixed number of tweets as above, it runs for two minutes, downloading as many tweets as it can in that time.

Code:

from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import time, json
import logging

_log = logging.getLogger(__name__)

consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''

class TwitterListener(StreamListener):
    def __init__(self, phrases):
        auth = OAuthHandler(consumer_key, consumer_secret)
        auth.set_access_token(access_token, access_secret)
        self.__stream = Stream(auth, listener=self)
        # run the stream in a background thread (older tweepy versions
        # call this parameter async instead of is_async)
        self.__stream.filter(track=phrases, is_async=True)

    def disconnect(self):
        self.__stream.disconnect()

    def on_data(self, data):
        print(data)

    def on_error(self, status):
        print(status)

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    phrases = ['python']
    listener = TwitterListener(phrases)
    # listen for 120 seconds, then stop
    time.sleep(120)
    listener.disconnect()

Dictionaries in Python:

A dictionary is a very useful data type built into Python. Unlike sequences, which are indexed by a range of numbers, dictionaries are indexed by keys, which can be any immutable type; strings and numbers can always be keys. Tuples can be used as keys if they contain only strings, numbers, or tuples; if a tuple contains any mutable object, either directly or indirectly, it cannot be used as a key.

It is best to think of a dictionary as an unordered set of key: value pairs, with the requirement that the keys are unique (within one dictionary). A pair of braces creates an empty dictionary: {}. Placing a comma-separated list of key:value pairs within the braces adds initial key:value pairs to the dictionary; this is also the way dictionaries are written on output.
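A minimal example of the key: value behaviour described above (the words and scores are invented for illustration):

```python
# build a small sentiment dictionary, keyed by word
scores = {"love": 3, "hate": -3}
scores["brilliant"] = 4          # add a new key: value pair
print(scores["hate"])            # -3: lookup is by key, not position
print("love" in scores)          # True: membership tests check keys
print(scores.get("meh", 0))      # 0: .get() returns a default for missing keys
```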

Example:

In this example we take the AFINN-111.txt file, which is used for sentiment analysis and natural language processing, and convert it into a dictionary of words. The initial AFINN-111.txt file is in a list format, containing words and their sentiment scores; after conversion, each line becomes a dictionary entry.

Code for Creating Dictionary:

with open('AFINN-111.txt', 'r') as f:
    lines = f.read().splitlines()

ls = []
sa = {}
for i, line in enumerate(lines):
    vals = line.split()
    sa['_id'] = i
    # some AFINN terms contain spaces ("can't stand"), so the score is
    # the last field and the term is everything before it
    sa['text'] = ' '.join(vals[:-1])
    sa['sentiment'] = vals[-1]
    ls.append(dict(sa))

for i, p in enumerate(ls):
    if i <= 20:
        print(p)

'''
write list of dict elements
'''
sa2 = open('sadict.txt', 'w')
for element in ls:
    print(element, file=sa2)
sa2.close()

Code to read the dictionary:

import ast

def read_file(f):
    with open(f, 'r') as f:
        records = f.read().splitlines()
    return records

def dict_val(f, w, k1, k2, k3):
    for i, d in enumerate(f):
        dic = ast.literal_eval(d)
        if dic.get(k1) == w:
            return dic.get(k2), dic.get(k3)

f = 'data/sadict.txt'
lines = read_file(f)
word = 'zealous'
key1 = 'text'       # the word itself is stored under the 'text' key above
key2 = 'sentiment'
key3 = '_id'
sentiment = dict_val(lines, word, key1, key2, key3)
print('id:', sentiment[1], 'sentiment:', sentiment[0])

Application in the project:

Once imported from Twitter, the data is saved in a text file. This data is cleaned to the extent that all the different aspects of a tweet, such as UID, user screen name, text, and the sentiment analysis of the text, can be seen clearly. To store the data more effectively we can use a dictionary, which lets us select any tweet by its user ID or user screen name.

Moving to NLP using NLTK:

NLTK (Natural Language Toolkit) is one of the most powerful Python libraries for natural language processing. It is also the backbone of many other libraries that help implement NLP. We will use the NLTK library in our program to show how a system can be trained.

Code:

import nltk

Pos_Sentences = [('I love Dr.Strange', 'positive'),
                 ('This painting is amazing', 'positive'),
                 ('The class was very good and informative', 'positive'),
                 ('I am really excited about my trip', 'positive'),
                 ('I love cookies', 'positive')]
Neg_Sentences = [('I did not like Dr.Strange', 'negative'),
                 ('The painting is horrible', 'negative'),
                 ('The class was a waste', 'negative'),
                 ('I dont want to go to the trip', 'negative'),
                 ('I hate Cookies', 'negative')]

MyWordList = []
for (words, sentiment) in Pos_Sentences + Neg_Sentences:
    Filtered_Words = [e.lower() for e in words.split() if len(e) >= 4]
    MyWordList.append((Filtered_Words, sentiment))

def get_words(MyWordList):
    all_words = []
    for (words, sentiment) in MyWordList:
        all_words.extend(words)
    return all_words

def word_feature(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features

word_features = word_feature(get_words(MyWordList))

def extract_feature(Text):
    Text_words = set(Text)
    Final_List = {}
    for word in word_features:
        Final_List['contains(%s)' % word] = (word in Text_words)
    print(Final_List)
    return Final_List

training_set = nltk.classify.apply_features(extract_feature, MyWordList)
classifier = nltk.NaiveBayesClassifier.train(training_set)

Explanation:

In this code we create two lists, one that stores positive responses and one that stores negative responses, tagged as positive or negative according to our human observation. These responses are stored in tokenized format: sentences are lowercased, words shorter than 4 letters are removed, and the results are stored in another list called MyWordList. Then two functions are created, get_words and word_feature. get_words collects all the individual words, and word_feature arranges the words according to the number of their occurrences in the word list. The last function, extract_feature(Text), parses each sentence, checks it against the word list, and outputs a boolean for whether each word in the word list appears in the sentence.
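Once trained, a classifier like the one above can label new text with classifier.classify. The self-contained miniature below (training data and features invented for illustration) shows the call on hand-built contains(...) feature dicts like the ones extract_feature produces:

```python
import nltk

# Two hand-labeled feature dicts train a tiny Naive Bayes classifier,
# which then labels an unseen feature dict.
# The training data here is invented for illustration.
train = [({'contains(love)': True,  'contains(hate)': False}, 'positive'),
         ({'contains(love)': False, 'contains(hate)': True},  'negative')]
classifier = nltk.NaiveBayesClassifier.train(train)

new_features = {'contains(love)': True, 'contains(hate)': False}
print(classifier.classify(new_features))  # positive
```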



