Trending Topic Generation on Twitter

YouTube video.

Twitter handled at least 4,600 tweets/sec in 2012, and the volume doubles every year.

04:00 Topic Extraction
We need to parse the entire tweet in different ways.

So we tokenize the tweet using any method we want. Next, we count the frequency of each word.
This could be maintained in a hash table, and we could simply count and show the trending topics based on that. But because of stopwords and generic words, that won’t be a good idea.
We could maintain a dictionary of stopwords and remove those words, but this would be a very big list, and it is difficult to include every possible word.
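
The naive approach described above can be sketched in a few lines of Python. This is only an illustration: the tiny `STOPWORDS` set, the regular expression, and the function names are assumptions made for the example, not something taken from the video.

```python
import re
from collections import Counter

# Illustrative stopword list only; a real one would be far larger
# and, as noted above, still incomplete.
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "for", "on"}


def tokenize(tweet):
    """Lowercase the tweet and split it into words, hashtags and mentions."""
    return re.findall(r"[#@]?\w+", tweet.lower())


def count_terms(tweets):
    """Count every non-stopword token across all tweets (the hash table)."""
    counts = Counter()
    for tweet in tweets:
        counts.update(t for t in tokenize(tweet) if t not in STOPWORDS)
    return counts


counts = count_terms(["The game is on", "Watching the #game tonight"])
print(counts.most_common(3))
```

Note that this tokenizer treats `game` and `#game` as different terms, which is one reason raw counts alone are not enough.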

Basically, we analyze the rate of change of a topic’s count over a period of time, such as a month. We can use the z-score for that.

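Suppose a topic has been tweeted about a large number of times over the past few days. Its count is high, but because it has been high all along, its rate of change is small and its z-score stays low.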
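
The z-score idea can be sketched as follows. This is a minimal illustration under stated assumptions: we assume a list of past per-day counts for a topic (`history`, a hypothetical name) and compare the current count against its mean and standard deviation. How the window is chosen and how the data is stored are not covered here.

```python
import math
from statistics import mean, pstdev


def z_score(history, current):
    """How many standard deviations `current` lies from the mean of `history`."""
    mu = mean(history)
    sigma = pstdev(history)  # population standard deviation of past counts
    if sigma == 0:
        # Flat history: any deviation at all is infinitely unusual.
        if current == mu:
            return 0.0
        return math.copysign(math.inf, current - mu)
    return (current - mu) / sigma


past_counts = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up daily counts
print(z_score(past_counts, 11))
```

A topic that sits near its usual level gets a score close to zero, however large its absolute count is, while a topic that jumps far above its own history gets a large score.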

It is important to maintain a threshold, whatever else happens. Suppose a topic is new: its frequency was zero before, and suddenly it is being tweeted heavily. Although its rate of change is the highest, we apply the threshold and do not show it at the top straight away. For example, we show a topic as trending only if its current count is more than 10,000.
We also need to take the authority of the user into consideration, such as whether they have a lot of followers. So the count of a term is considered along with the follower count of each user who tweeted it.
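
One way to combine the threshold and user authority is sketched below. The 10,000 threshold comes from the text; everything else is an assumption made for illustration, in particular the logarithmic follower weighting and the `min_z` cutoff, since the text does not say how follower counts are folded in.

```python
import math

MIN_COUNT = 10_000  # threshold mentioned in the text


def authority_weight(followers):
    """Assumed weighting: each tweet counts once, plus a log-damped
    bonus so that huge accounts do not drown out everyone else."""
    return 1.0 + math.log10(1 + followers)


def weighted_count(follower_counts):
    """Sum of weights, one entry per tweet that mentioned the term."""
    return sum(authority_weight(f) for f in follower_counts)


def is_trending_candidate(current_count, z, min_count=MIN_COUNT, min_z=2.0):
    """Show a topic only if it clears the count threshold AND is rising.
    `min_z` is a hypothetical cutoff on the rate-of-change score."""
    return current_count > min_count and z >= min_z


print(is_trending_candidate(500, 50.0))     # new topic, huge spike, too small
print(is_trending_candidate(20_000, 3.5))   # large and rising
```

The first call shows the case from the text: a brand-new topic has the highest rate of change but is held back until its count clears the threshold.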

Many tweets contain words that are similar in meaning, misspelled, or short forms, so we would also want to cluster all of those topics together.
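
A rough sketch of such clustering, using only the Python standard library, is shown below. It is an assumption-laden illustration: the `ALIASES` table of short forms is hypothetical, and fuzzy string matching with `difflib` only catches misspellings and hashtag variants. Grouping words that are similar in meaning would need something more, which is not attempted here.

```python
from difflib import get_close_matches

# Hypothetical table mapping short forms to a canonical term.
ALIASES = {"nyc": "newyork", "bday": "birthday"}


def canonical(term, known_topics, cutoff=0.8):
    """Map a raw term onto an existing topic if it is close enough."""
    term = term.lower().lstrip("#")
    term = ALIASES.get(term, term)
    match = get_close_matches(term, known_topics, n=1, cutoff=cutoff)
    return match[0] if match else term


def cluster_counts(counts, cutoff=0.8):
    """Merge counts of variant spellings into one entry per topic."""
    clustered = {}
    # Visit frequent terms first so that rarer variants fold into them.
    for term, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        key = canonical(term, list(clustered), cutoff)
        clustered[key] = clustered.get(key, 0) + n
    return clustered


raw = {"superbowl": 100, "#SuperBowl": 20, "superbowll": 5,
       "birthday": 10, "bday": 3}
print(cluster_counts(raw))
```

The `cutoff` value controls how aggressive the merging is; set too low, it will merge unrelated topics.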

The chi-square test gives a score based on the previous counts for a topic, which shows how much the topic has varied. Chi-square is a better test than a simple ratio because it takes the size of the variation into account.

Chi-square measures the variation between the expected data and the observed data.
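
For a single topic, the chi-square term is (observed - expected)² / expected. The sketch below assumes the expected count is the mean of the topic's previous counts, with a floor of 1 to avoid dividing by zero for brand-new topics. Both choices are assumptions made for illustration, not something stated in the source.

```python
from statistics import mean


def topic_chi_square(current_count, previous_counts):
    """Chi-square term for one topic: (O - E)^2 / E.

    O is the current (observed) count; E is the expected count, taken
    here as the mean of previous counts (floored at 1, an assumption).
    The score is unsigned, so a sharp drop scores as high as a sharp rise.
    """
    expected = max(mean(previous_counts), 1.0)
    return (current_count - expected) ** 2 / expected


# Both topics grew by the same ratio (1.5x), but the scores differ.
print(topic_chi_square(15, [10, 10, 10]))     # 2.5
print(topic_chi_square(150, [100, 100, 100])) # 25.0
```

The two calls illustrate why chi-square is preferred over a simple ratio: a jump from 10 to 15 and a jump from 100 to 150 have the same ratio, yet the larger topic gets the higher score.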
