Twitter had at least 4,600 tweets/sec in 2012, and that rate doubles every year.
04:00 Topic Extraction
We need to parse the entire tweet in different ways.
So we tokenize the tweet using whatever method we like. Next we count the frequency of each word.
These counts could be kept in a hash table, and we could simply show the most frequent words as trending topics. But stopwords and generic words would dominate such a list, so raw counts alone won’t be a good idea.
We could maintain a dictionary of stopwords and remove those words, but the list would be very big, and it is difficult to include every possible word.
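A minimal sketch of the tokenize-and-count step described above, assuming a simple regex tokenizer and a tiny hand-written stopword set. The names `STOPWORDS`, `tokenize` and `count_terms` are made up for illustration and are not from the talk.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword set; a real list would be far longer.
STOPWORDS = {"a", "an", "the", "is", "in", "of", "to", "and", "for", "on"}

def tokenize(tweet):
    """Lowercase the tweet and split it into word and hashtag tokens."""
    return re.findall(r"#?\w+", tweet.lower())

def count_terms(tweets):
    """Count each non-stopword token across all tweets (the 'hash table')."""
    counts = Counter()
    for tweet in tweets:
        counts.update(t for t in tokenize(tweet) if t not in STOPWORDS)
    return counts

tweets = [
    "The World Cup final is on today",
    "Watching the world cup in the pub",
    "Cup of tea and a book",
]
counts = count_terms(tweets)
print(counts.most_common(2))
```

Note that "cup" ends up on top even though the third tweet is about tea, which shows why raw frequency plus a stopword list is not enough on its own.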
Basically we analyze the rate of change of a topic’s count over a period of time, like a month. We can use the z-score for that.
Suppose a topic has been tweeted about a large number of times for the past few days.
What’s important is to maintain a threshold, whatever happens. Suppose a topic is new, so its frequency was zero before and suddenly it is being tweeted about heavily. Although its rate of change is the highest, we keep the threshold and do not push it straight to the top. For example, we show a topic as trending only if its present count is more than 10,000.
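The z-score plus threshold idea could be sketched roughly like this. The 10,000 threshold is from the notes. The `min_z` cutoff of 3.0, the function names, and the use of per-day counts as history are assumptions for illustration only.

```python
from statistics import mean, pstdev

MIN_COUNT = 10_000  # threshold mentioned in the notes

def z_score(history, current):
    """How many standard deviations `current` is above the historical mean."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Flat history: any change is infinitely surprising, no change is not.
        return 0.0 if current == mu else float("inf")
    return (current - mu) / sigma

def is_trending(history, current, min_count=MIN_COUNT, min_z=3.0):
    """Trending only if the count clears the threshold AND the z-score is high.

    min_z is an assumed cutoff, not a value from the talk.
    """
    if current <= min_count:
        return False
    return z_score(history, current) >= min_z

# Sudden spike on an established topic: trending.
print(is_trending([1000, 1200, 900, 1100], 15000))
# Brand-new topic, huge rate of change but below the threshold: not trending.
print(is_trending([0, 0, 0, 0], 500))
# Topic that has been steadily popular for days: high count, low z-score.
print(is_trending([20000, 21000, 19000, 20000], 20500))
```

The last case is the "tweeted a lot for the past few days" situation: the count is large, but it barely deviates from its own history, so the z-score stays small.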
We also need to take into account the authority of the user, for example whether they have a lot of followers. So the count of a term is considered together with the follower count of the user who tweeted it.
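One possible way to fold follower counts into the term count is to give each mention a weight instead of counting it as 1. The notes do not say how the weighting is done, so the log-based `authority_weight` below is purely an assumption, as are the function names.

```python
import math
from collections import defaultdict

def authority_weight(followers):
    """Assumed weighting: 1 for a user with no followers, growing with log10."""
    return 1.0 + math.log10(1 + followers)

def weighted_counts(mentions):
    """Sum authority weights per term; `mentions` is (term, follower_count) pairs."""
    scores = defaultdict(float)
    for term, followers in mentions:
        scores[term] += authority_weight(followers)
    return dict(scores)

mentions = [
    ("worldcup", 0),        # weight 1.0
    ("worldcup", 9),        # weight 2.0
    ("election", 999_999),  # weight 7.0
]
scores = weighted_counts(mentions)
print(scores)
```

The log keeps one celebrity account from drowning out everything else, while still letting it count for more than an account with no followers.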
Many tweets contain words that are similar in meaning, misspelled, or short forms, so we would also want to cluster all those topics together.
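A rough sketch of grouping misspellings and short forms under one topic, assuming a hand-made alias table for short forms and string similarity from the standard library's `difflib` for misspellings. `ALIASES`, `canonical` and the 0.8 cutoff are hypothetical. Words that are similar only in meaning would need something else (a synonym list or embeddings), which this sketch does not cover.

```python
from difflib import get_close_matches

# Hypothetical alias table for short forms.
ALIASES = {"wc": "worldcup", "nyc": "newyork"}

def canonical(term, known_topics, cutoff=0.8):
    """Map a raw term to a known topic if it is an alias or a near-misspelling."""
    term = term.lower()
    if term in ALIASES:
        return ALIASES[term]
    match = get_close_matches(term, known_topics, n=1, cutoff=cutoff)
    return match[0] if match else term

topics = ["worldcup", "election"]
for raw in ["wrldcup", "WC", "elections", "bitcoin"]:
    print(raw, "->", canonical(raw, topics))
```

Counts would then be accumulated under the canonical topic rather than under each raw spelling.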
Chi-square gives a score based on the previous counts for a topic, which basically shows the variation. Chi-square is a better test than a simple ratio because it considers the range of variation.
Chi-square measures the variation between the expected data and the observed data.
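To make the comparison with a simple ratio concrete, here is a small sketch of the usual chi-square statistic, the sum of (observed - expected)^2 / expected. Treating the expected count as something derived from the topic's history is an assumption, and the function name is illustrative.

```python
def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over paired observed and expected counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    total = 0.0
    for o, e in zip(observed, expected):
        if e <= 0:
            raise ValueError("expected counts must be positive")
        total += (o - e) ** 2 / e
    return total

# Both topics doubled, so a simple ratio scores them the same (2.0).
small = chi_square([20], [10])      # expected 10, observed 20
large = chi_square([2000], [1000])  # expected 1000, observed 2000
print(small, large)
```

The ratio cannot tell these two topics apart, while chi-square scores the jump from 1,000 to 2,000 much higher than the jump from 10 to 20. Since the difference is squared, a drop scores the same as a rise, so the direction has to be checked separately.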