I used Twitter's real-time streaming API to download a sample of Tweets containing the keyword "nazi". Between 2018-04-01 and 2018-05-28 I downloaded exactly 2,256,468 Tweets.

How I organized the data

Each Tweet can have more than 2,000 unique fields (features). Storing more than 2 million Tweets with potentially more than 2,000 fields each is awkward in a relational (SQL) database. To organize the data, I used a combination of Raspberry Pi computers and Elasticsearch databases. More precisely, I used the Raspberry Pi computers to download the data and then pushed it to our local Elasticsearch cluster. I will write a separate article on the details of the data organization process.
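To make the "more than 2,000 unique fields" point concrete, here is a minimal Python sketch (not the code used for this article) of how nested Tweet JSON can be flattened into dotted field names and how the unique fields across many Tweets can be counted. The sample Tweets are hypothetical and heavily shortened, and the helper names `flatten` and `unique_fields` are my own.

```python
from typing import Any, Dict, Iterable, Set


def flatten(obj: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts and lists into one dict keyed by dotted paths."""
    flat: Dict[str, Any] = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        # Every list position becomes its own field, which is one reason
        # the number of distinct fields grows so quickly.
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{index}."))
    else:
        flat[prefix.rstrip(".")] = obj
    return flat


def unique_fields(tweets: Iterable[dict]) -> Set[str]:
    """Collect every dotted field name that occurs in at least one Tweet."""
    fields: Set[str] = set()
    for tweet in tweets:
        fields.update(flatten(tweet).keys())
    return fields


# Hypothetical, heavily shortened Tweet objects (assumption, not real data).
sample_tweets = [
    {"id": 1, "lang": "en", "user": {"id": 10, "screen_name": "a"},
     "entities": {"hashtags": [{"text": "x"}, {"text": "y"}]}},
    {"id": 2, "lang": "de", "user": {"id": 11, "location": "Berlin"},
     "entities": {"hashtags": []}},
]

fields = unique_fields(sample_tweets)
print(len(fields), sorted(fields))
```

Because different Tweets carry different optional fields and lists of different lengths, the union of field names grows with the size of the sample, which fits a schema-flexible document store better than a fixed table layout.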

Highlights

  • I used Twitter's real-time streaming API to download more than 2 million Tweets.
  • The data is stored directly in a local Elasticsearch database.
  • More than 90% of the Tweets are in English, Spanish or German.
  • There is a high correlation between the different categories of emotions.
  • The R file is available for download as well!

Analysis

Language of the Tweets

The first question concerns the language of the retrieved Tweets. The downloaded data contains Tweets in 46 different languages. The next plot shows a log-scaled visualization of the number of Tweets per language. More than 90% of the Tweets are in English, Spanish or German. I limited the rest of this analysis to English Tweets, which leaves 1,617,089 Tweets.
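The language breakdown can be sketched in a few lines of Python. This is an illustration only, not the R code behind the plot: it assumes each Tweet's language code (the `lang` field of the Tweet object) has already been extracted into a list, and the sample values below are made up.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


def language_summary(
    langs: Iterable[str], top_n: int = 3
) -> Tuple[List[Tuple[str, int, float]], float]:
    """Return (language, count, log10(count)) rows sorted by count,
    plus the share of all Tweets covered by the top_n languages."""
    counts = Counter(langs)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    rows = [(lang, n, math.log10(n)) for lang, n in counts.most_common()]
    top_share = sum(n for _, n, _ in rows[:top_n]) / total
    return rows, top_share


# Made-up language codes (assumption, not the real distribution).
sample_langs = ["en"] * 70 + ["es"] * 15 + ["de"] * 10 + ["fr"] * 3 + ["pt"] * 2

rows, top_share = language_summary(sample_langs)
for lang, n, log_n in rows:
    print(f"{lang}: {n} Tweets (log10 = {log_n:.2f})")
print(f"Share of the top 3 languages: {top_share:.0%}")
```

The log10 column is what a log-scaled axis displays; it keeps rare languages visible next to a dominant one.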

Number of Tweets per day

The second descriptive question is: how many Nazi-related Tweets are posted each day? This cannot be answered exactly, since the downloaded Tweets are a sample of all Tweets and Twitter does not state how representative the streamed sample is. But we can check the exact date and time of the downloaded Tweets. The next plot shows the number of downloaded English Tweets per day. The unusual number of Tweets on 7th May is due to a technical issue I had with the servers. I assume this does not change the results of my analysis!
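Counting Tweets per day comes down to parsing each Tweet's timestamp and grouping by calendar date. The Python sketch below assumes the timestamps are in the `created_at` format of the classic Twitter API (for example `Sun Apr 01 08:15:00 +0000 2018`); the sample values are made up and the function name is my own.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable

# Assumed timestamp layout of the "created_at" field.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def tweets_per_day(created_at_values: Iterable[str]) -> Dict[str, int]:
    """Count Tweets per calendar day, using the offset given in each timestamp."""
    counts: Counter = Counter()
    for raw in created_at_values:
        timestamp = datetime.strptime(raw, TWITTER_TIME_FORMAT)
        counts[timestamp.date().isoformat()] += 1
    return dict(sorted(counts.items()))


# Made-up timestamps (assumption, not real data).
sample = [
    "Sun Apr 01 08:15:00 +0000 2018",
    "Sun Apr 01 23:59:59 +0000 2018",
    "Mon Apr 02 00:00:01 +0000 2018",
]

print(tweets_per_day(sample))
```

A day with a server outage would simply show up here as an unusually low or high count, which is how an anomaly like the one on 7th May becomes visible.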

Sentiment analysis

"Sentiment analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information". "sentiment analysis aims to determine the attitude of a speaker, writer, or other subject with respect to some topic or the overall contextual polarity or emotional reaction to a document, interaction, or event. The attitude may be a judgment or evaluation, affective state, or the intended emotional communication (that is to say, the emotional effect intended by the author or interlocutor)" (Wikipedia).

Sentiment analysis is a complex task because the structure of language is complex, and there are many different approaches to it. Word list approaches look up each word of a text in a list of words that have been tagged with a sentiment or emotion. More complex algorithms, such as neural network methods, take much more of the structure of the language into consideration. In the current analysis I use two simple word list methods. First, I use the AFINN word list to determine whether the Tweets published on each day are positive or negative. Second, I use the NRC Word-Emotion Association Lexicon to measure different emotions in the Tweets.

AFINN: positive or negative?

Using the AFINN word list, I show that on average the Tweets address the Nazi topic negatively.
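To illustrate how a word list score of this kind is computed, here is a minimal Python sketch. It is not the R code behind the plot. AFINN assigns each listed word an integer valence between -5 and +5; the tiny `TOY_AFINN` dictionary below is only an illustrative stand-in in that style, and the sample Tweets are invented.

```python
import re
from statistics import mean
from typing import Dict, List

# Illustrative AFINN-style excerpt (assumption): word -> valence in [-5, +5].
# The real list is much longer and would be loaded from a file.
TOY_AFINN: Dict[str, int] = {
    "good": 3, "love": 3, "great": 3,
    "bad": -3, "hate": -3, "evil": -3,
}

TOKEN_RE = re.compile(r"[a-z']+")


def afinn_score(text: str, lexicon: Dict[str, int] = TOY_AFINN) -> int:
    """Sum the valence of every listed word; unlisted words count as 0."""
    tokens = TOKEN_RE.findall(text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


def mean_daily_score(tweets_by_day: Dict[str, List[str]]) -> Dict[str, float]:
    """Average the Tweet scores for each day that has at least one Tweet."""
    return {
        day: mean(afinn_score(text) for text in texts)
        for day, texts in tweets_by_day.items()
        if texts
    }


# Invented example Tweets (assumption, not real data).
sample = {
    "2018-04-01": ["I hate this evil ideology", "What a good day"],
    "2018-04-02": ["It is great that people love to stand against nazis"],
}

print(mean_daily_score(sample))
```

The second day scores positive although its only Tweet opposes Nazis. The score reflects word choice, not the author's stance on the topic.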

One should keep in mind that the results of sentiment analysis methods are not straightforward to interpret. For instance, many Tweets condemn Nazi culture using positive language. These Tweets get a positive sentiment score even though they are negative with regard to Nazi culture. For example, see the Tweet below.

The same issue applies to negative Tweets (see below). The presence of negative words in a Tweet does not necessarily mean that the Tweet supports or condemns Nazi culture. A negative sentiment score only shows the overall use of negative words within the text.

One might ask which negative and positive words are used most often in the Tweets. A convenient way to answer this question is a word cloud of the negative and positive words.

As mentioned before, in the last step I measure the other emotions in the downloaded Tweets using the NRC Word-Emotion Association Lexicon.
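The NRC lexicon associates words with eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) as well as positive and negative sentiment. A minimal Python sketch of the counting step could look as follows. The `TOY_NRC` entries are illustrative assumptions and not taken from the real lexicon, and this is not the R code used for the plot.

```python
import re
from collections import Counter
from typing import Dict, Set

# Illustrative word -> emotions mapping (assumption, not real NRC entries).
TOY_NRC: Dict[str, Set[str]] = {
    "hate": {"anger", "disgust", "fear", "sadness"},
    "attack": {"anger", "fear"},
    "hope": {"anticipation", "joy", "surprise", "trust"},
    "friend": {"joy", "trust"},
}


def emotion_counts(text: str, lexicon: Dict[str, Set[str]] = TOY_NRC) -> Counter:
    """Count, per emotion, how many words in the text are associated with it."""
    counts: Counter = Counter()
    for token in re.findall(r"[a-z']+", text.lower()):
        # A single word can contribute to several emotions at once.
        counts.update(lexicon.get(token, ()))
    return counts


print(emotion_counts("Hate and attack, but I still hope"))
```

Summing these counters over all Tweets of a day gives one time series per emotion. Note that one word can raise several emotion counts at the same time.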

The last visualization has some interesting implications. One can see a high correlation between the different emotions. I did not expect this in advance, and I think it is important to understand the reason! I believe the main reason for the high correlation between different emotions is the existence of bots which post random Tweets regardless of content, only to reach a larger audience or to pursue a political goal. One could also go deeper into the content and analyze the Tweets from a specific date in order to understand the chain of events in the real world.
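For readers who want to quantify such a correlation, the Python sketch below computes the Pearson correlation coefficient for every pair of daily emotion series. The numbers are invented and the function names are my own. It only measures how strongly the series move together and says nothing about the cause.

```python
import math
from itertools import combinations
from typing import Dict, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def correlation_table(
    series: Dict[str, Sequence[float]]
) -> Dict[Tuple[str, str], float]:
    """Pearson correlation for every unordered pair of named series."""
    return {
        (a, b): pearson(series[a], series[b])
        for a, b in combinations(sorted(series), 2)
    }


# Invented daily emotion counts (assumption, not real data).
daily = {
    "anger": [10, 20, 30, 40],
    "fear": [12, 22, 31, 45],
    "joy": [5, 9, 16, 20],
}

table = correlation_table(daily)
for pair, r in table.items():
    print(pair, round(r, 3))
```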

What comes next?

As I mentioned, I suspect that the Nazi topic on Twitter is largely driven by bot activity and not by ordinary users. It is important to understand which group contributes more to this topic: normal users or bots! Bots on Twitter have long been a hot topic in the scientific community. In a follow-up article I will try to distinguish between human and bot activity in the downloaded Nazi Tweets.
