Twitter has a very important yet still under-hyped feature: Twitter lists. A Twitter list is a manually created set of users who have something in common. For example, one can create a Twitter list of people who tweet about NoSQL (for example: http://twitter.com/al3xandru/nosql); others can then subscribe to this list to get news from the NoSQL community. So a Twitter list is a unified stream of tweets from all of its members.
For example, here is a list of people who are supposed to tweet often about NoSQL: http://twitter.com/al3xandru/nosql. The list was created by @al3xandru. It unites 74 Twitter users, and 22 users have subscribed to read it.
Lists seem to be a perfect new way of getting real-time news on any topic that interests you. Still, the problem is that lists contain too much noise. People included in a list tend to tweet about other topics too, ones not relevant to the list's topic: their other interests, as well as personal matters, jokes, pictures, Foursquare check-ins, etc. Here, for example, are the lists I am included in: http://twitter.com/mariagrineva/lists/memberships. I post tweets about social media, information retrieval, Zurich, NLP, entrepreneurship, startups and other things. Some of them are relevant to one list and not relevant to another.
As a result, an average Twitter list produces a stream of tweets, most of which are relevant to the list's main topic, while some are on other topics. Normally, relevant tweets dominate the noise. But there are situations when lists reflect global Twitter trends, such as #worldcup or #ash. Then a trending event can dominate the list's main topic.
To filter noise in a Twitter list, one needs to identify the list's main (niche) topic. The words representing the main topic change over time, and new words can appear. That is, once identified, the list's topic must be kept up to date in near real time. The other issue is that trending news can flood a list and even crowd out the main topic for a short period of time. To handle this, we should distinguish two kinds of noise: (1) off-topics: tweets on other topics, not relevant to the list's main topic, and very diverse; (2) trendy off-topics: tweets about a recent important global event that is not relevant to the main topic, but that most of the list members post about.
We (Baris Guc and I) developed a prototype of a system that filters out noise in Twitter lists. Here are the main ideas of our approach.
We treat list filtering as a classification task: each newly arriving tweet is classified as relevant or irrelevant.
Real-time identification of the list's main topic. Our system monitors the list's tweets and creates the list's topic signature: a ranked set of words that represent the main topic of the list. Normally, relevant tweets dominate, and it is easy to create the topic signature. But we also need to deal with trends. For this reason we consider a Twitter list in the context of the global Twitter stream. That is, if everybody on the list is writing about #WorldCup, but it is also a popular topic in the global stream, then it is a trendy off-topic and it should not affect the list's topic signature.
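To make the idea of a topic signature more concrete, here is a minimal sketch in Python. It is not the scoring used in our prototype: the tokenizer, the log-ratio score, the smoothing constant and the function names (`tokenize`, `tweet_frequencies`, `topic_signature`) are all assumptions made for illustration. The only point it shows is that a word which is frequent in the list but just as frequent in the global stream gets a low rank.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer: words, keeping a leading '#' or '@'.
TOKEN_RE = re.compile(r"[#@]?\w+")


def tokenize(text):
    return [tok.lower() for tok in TOKEN_RE.findall(text)]


def tweet_frequencies(tweets):
    """Fraction of tweets in which each token appears at least once."""
    counts = Counter(tok for tweet in tweets for tok in set(tokenize(tweet)))
    total = max(len(tweets), 1)
    return {tok: count / total for tok, count in counts.items()}


def topic_signature(list_tweets, global_tweets, top_k=5, smoothing=0.01):
    """Rank list words by how much more frequent they are in the list
    than in the global stream (assumed score, not the thesis formula)."""
    list_freq = tweet_frequencies(list_tweets)
    global_freq = tweet_frequencies(global_tweets)
    scores = {}
    for tok, freq in list_freq.items():
        ratio = (freq + smoothing) / (global_freq.get(tok, 0.0) + smoothing)
        # Negative when the word is more common globally than in the list.
        scores[tok] = freq * math.log(ratio)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    list_tweets = [
        "New MongoDB release adds sharding",
        "Cassandra vs MongoDB benchmarks",
        "Watching the #worldcup tonight",
        "MongoDB sharding tips",
        "#worldcup goal!",
    ]
    global_tweets = [
        "#worldcup final tonight",
        "what a #worldcup goal",
        "#worldcup fever",
        "lunch time",
    ]
    for word, score in topic_signature(list_tweets, global_tweets):
        print(word, round(score, 2))
```

In this toy data #worldcup appears in two of the five list tweets, but because it is even more common in the global sample its score is negative and it stays out of the signature.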
Combining textual and social features. Filtering can be improved by using information from the Twitter social graph. The members of a Twitter list form a social subgraph. Looking at this subgraph, one can see who is central to the list community and who is an outsider. Usually, the members who are followed by the most other members of the list (that is, those most central in the list subgraph) tend to post relevant tweets more often. We experimented with this idea and found that combining textual features (the topic signature) with social features gives the best classification accuracy.
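As a rough illustration of how a social feature could be combined with a textual one, here is a small Python sketch. Everything specific in it is an assumption: centrality is approximated by the share of other list members who follow a user, the textual feature is the fraction of a tweet's words found in the signature, and the two are mixed with hand-picked weights. The names `in_list_centrality`, `text_score` and `relevance_score` are hypothetical, and the actual features and classifier in our prototype may differ.

```python
def in_list_centrality(follows):
    """follows maps each list member to the set of accounts they follow.
    Returns, per member, the share of other list members following them."""
    members = set(follows)
    follower_counts = {member: 0 for member in members}
    for source, targets in follows.items():
        for target in targets:
            # Only edges inside the list subgraph count.
            if target in members and target != source:
                follower_counts[target] += 1
    denom = max(len(members) - 1, 1)
    return {member: count / denom for member, count in follower_counts.items()}


def text_score(tweet, signature_words):
    """Fraction of the tweet's words that belong to the topic signature."""
    tokens = tweet.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in signature_words)
    return hits / len(tokens)


def relevance_score(tweet, author, signature_words, centrality,
                    w_text=0.7, w_social=0.3):
    """Assumed linear mix of the textual and social features."""
    social = centrality.get(author, 0.0)
    return w_text * text_score(tweet, signature_words) + w_social * social


if __name__ == "__main__":
    follows = {
        "alice": {"bob"},
        "bob": {"alice"},
        "carol": {"alice", "bob"},
        "dave": {"alice"},
    }
    centrality = in_list_centrality(follows)
    signature = {"mongodb", "sharding"}
    print(relevance_score("mongodb sharding tips", "alice", signature, centrality))
    print(relevance_score("lunch time", "dave", signature, centrality))
```

A tweet would then be kept when its score passes some threshold; the weights and the threshold are exactly the kind of parameters that labeled examples would help to tune.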
User feedback. The system starts classifying without any training: it fetches the list's earlier tweets and identifies the topic signature. It can then be improved with feedback: you can mark tweets that were filtered out by mistake, and you can mark noise tweets that were not filtered. The system improves very quickly with user feedback: on average, it needs around 5 labeled tweets to reach its best classification accuracy (86%-90%).
Read the full thesis on Filtering Twitter Lists.