Twitter has a very important yet still under-hyped feature: Twitter lists. A Twitter list is a manually created set of users who have something in common. For example, one can create a list of people who tweet about NoSQL, and others can then subscribe to it to get news from the NoSQL community. In effect, a Twitter list is a single merged stream of tweets from all of its members.
For example, here is a list of people who are expected to tweet often about NoSQL: http://twitter.com/al3xandru/nosql. The list was created by @al3xandru. It brings together 74 Twitter users, and 22 users have subscribed to read it.
Lists seem to be a perfect new way of getting real-time news on any topic you are interested in. The problem is that lists contain too much noise. People included in a list also tweet about things that are not relevant to the list's topic: their other interests, personal matters, jokes, pictures, Foursquare check-ins, etc. Here, for example, are the lists that include me: http://twitter.com/mariagrineva/lists/memberships. I tweet about social media, information retrieval, Zurich, NLP, entrepreneurship, startups and other things. Some of these tweets are relevant to one list and irrelevant to another.
As a result, an average Twitter list produces a stream of tweets most of which are relevant to the list's main topic, mixed with tweets on other topics. Normally, relevant tweets outnumber the noise. But there are situations when lists reflect global Twitter trends, such as #worldcup or #ash. A trending event can then dominate the list's main topic.
To filter noise in a Twitter list, one first needs to identify the list's main (niche) topic. The words representing the main topic change over time, and new words can appear. That is, once identified, the list's topic has to be kept up to date in near real time. In addition, trending news can flood a list and even crowd out the main topic for a short period of time. To handle this, we distinguish two kinds of noise: (1) off-topics: tweets on other topics that are not relevant to the list's main topic and are very diverse; (2) trendy off-topics: tweets about a recent important global event that is not relevant to the main topic, but that most of the list members post about.
We (Baris Guc and I) developed a prototype of a system that filters out noise in Twitter lists. Here are the main ideas of our approach.
We treat list filtering as a classification task: each incoming tweet is classified as relevant or irrelevant.
Real-time identification of the list's main topic. Our system monitors the list's tweets and creates the list's topic signature: a ranked set of words that represent the main topic of the list. Normally, relevant tweets dominate, and it is easy to create the topic signature. But we also need to deal with trends. For this reason, we consider a Twitter list in the context of the global Twitter stream. That is, if everybody on the list is writing about #WorldCup, but it is also a popular topic in the global stream, then it is a trendy off-topic and it should not affect the list's topic signature.
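To make the idea concrete, here is a minimal sketch of a topic signature that discounts words that are also frequent in the global stream. It is an illustration only, not the scoring used in our prototype: the function names, the ratio-based score and the smoothing constant are all assumptions chosen for brevity.

```python
import re
from collections import Counter

# Hypothetical tokenizer: keeps hashtags and mentions as single tokens.
TOKEN = re.compile(r"[#@]?\w+")

def tokenize(text):
    return [t.lower() for t in TOKEN.findall(text)]

def relative_freq(tweets):
    """Map each word to its share of all word occurrences in the tweets."""
    counts = Counter(tok for tweet in tweets for tok in tokenize(tweet))
    total = sum(counts.values()) or 1
    return {word: count / total for word, count in counts.items()}

def topic_signature(list_tweets, global_tweets, top_k=10, smoothing=1e-3):
    """Rank words that are frequent in the list but rare in the global stream.

    The score (list frequency divided by smoothed global frequency) is an
    assumed, illustrative choice; any measure of "over-representation in the
    list" could be plugged in here.
    """
    list_freq = relative_freq(list_tweets)
    global_freq = relative_freq(global_tweets)
    scores = {
        word: freq / (global_freq.get(word, 0.0) + smoothing)
        for word, freq in list_freq.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

list_tweets = [
    "nosql cassandra release",
    "nosql mongodb benchmark",
    "#worldcup final tonight",
    "#worldcup goal",
]
global_tweets = [
    "#worldcup final tonight",
    "#worldcup goal",
    "#worldcup what a match",
    "lunch time",
]
signature = topic_signature(list_tweets, global_tweets, top_k=5)
```

In this toy data, #worldcup is as frequent in the list as nosql, but because it is also frequent in the global sample its score is pushed down and it stays out of the signature.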
Combining textual and social features. Filtering can be improved by using information from the Twitter social graph. The members of a Twitter list form a social subgraph. Looking at this subgraph, one can see who is central to the list community and who is an outsider. Usually, the members who are followed by the most other members of the list (that is, who are more central in the list subgraph) tend to post relevant tweets more often. We experimented with this idea and found that combining textual features (the topic signature) with social features gives the best classification accuracy.
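As a rough sketch of the social feature, one can count how many other list members follow each member and mix that with a textual score. The data layout (a dict of follow sets), the normalization and the fixed mixing weight below are assumptions made for illustration; they are not the feature combination from the thesis.

```python
def list_indegree(follows, members):
    """Count, for each list member, how many other list members follow them.

    follows: dict mapping a user to the set of users they follow
             (hypothetical input format).
    """
    members = set(members)
    indegree = {m: 0 for m in members}
    for follower in members:
        # Only edges that stay inside the list subgraph are counted.
        for followee in follows.get(follower, set()) & members:
            if followee != follower:
                indegree[followee] += 1
    return indegree

def centrality(follows, members):
    """Normalize in-degree to [0, 1] by the number of other members."""
    indegree = list_indegree(follows, members)
    denom = max(len(indegree) - 1, 1)
    return {m: d / denom for m, d in indegree.items()}

def combined_score(text_score, author_centrality, weight=0.7):
    """Assumed linear mix of a textual score and the author's centrality."""
    return weight * text_score + (1 - weight) * author_centrality

follows = {"a": {"b"}, "b": {"a"}, "c": {"b", "x"}}
scores = centrality(follows, ["a", "b", "c"])
```

Here "b" is followed by both other members and ends up most central, while "c" is followed by nobody on the list and would count as an outsider. The follow edge to "x" is ignored because "x" is not a list member.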
User feedback. The system starts classifying without any training: it fetches the list's past tweets and identifies the topic signature. The classification can then be improved with feedback: you can mark tweets that were filtered out by mistake, and you can mark noise tweets that were not filtered. The system improves very quickly with user feedback: on average, it needs around 5 labeled tweets to achieve its best classification accuracy (86%-90%).
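One simple way to picture such a feedback loop is a word-weight filter that starts from the topic signature and is corrected only when the user marks a mistake. The class below is a hypothetical, perceptron-style sketch with assumed names, threshold and step size; it does not reproduce the classifier or the accuracy figures of our prototype.

```python
class FeedbackFilter:
    """Toy relevance filter: signature words start with weight 1.0."""

    def __init__(self, signature, threshold=1.0, step=1.0):
        self.weights = {word: 1.0 for word in signature}
        self.threshold = threshold
        self.step = step

    def _words(self, tweet):
        # Naive whitespace tokenization, enough for the illustration.
        return set(tweet.lower().split())

    def score(self, tweet):
        return sum(self.weights.get(w, 0.0) for w in self._words(tweet))

    def is_relevant(self, tweet):
        return self.score(tweet) >= self.threshold

    def feedback(self, tweet, relevant):
        """Adjust word weights only if the current decision was wrong."""
        if self.is_relevant(tweet) == relevant:
            return
        delta = self.step if relevant else -self.step
        for w in self._words(tweet):
            self.weights[w] = self.weights.get(w, 0.0) + delta

flt = FeedbackFilter(["nosql", "cassandra"])
before = flt.is_relevant("riak cluster tips")      # filtered out by mistake
flt.feedback("riak cluster tips", relevant=True)   # user marks it as relevant
after = flt.is_relevant("riak benchmark")          # "riak" now carries weight
```

A single correction is enough here for a previously unknown word to start counting toward relevance, which is the kind of fast adaptation that a handful of labeled tweets can provide.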
Read the full thesis on Filtering Twitter Lists.
The work on identifying the main topic is, of course, interesting per se. However, I'd challenge the assumption that "an average Twitter list produces stream of tweets most of which are relevant to the main list's topic". It's completely unclear to me why most of user's tweets should be relevant to the main list's topic. He may as well tweet on the topic just occasionally, and still can be a valuable member of the list.
It seems to me that lists in Twitter are not very well thought thru, and this is probably the reason why they haven't been widely adopted. For "topical lists" (like NoSQL list in your example) I think what is really needed is a combination of user/hash tag she uses to identify the tweets on topic. Which in turn raises the question why we need lists at all, for filtering by a hash tag is supported straight out of the box...
Posted by: Dmitry Shaporenkov | August 05, 2010 at 10:43 AM
Agree, not all Twitter lists are "topical lists"; I didn't want to put this issue into the post so as not to overload it.
People use Twitter lists differently: some of them use lists to organize their followers. For example, I have a list "zurich" for people I know who live in Zurich. Of course, such a list does not have a main topic.
Still, in most topical lists, most tweets are about the main topic. We set up a simple experiment for 20+ topical lists of different sizes, on different topics: sort the words in tweets by frequency. The most frequent words of a list always clearly identify the main topic.
Posted by: Mariagrineva | August 06, 2010 at 06:07 AM
So does your experiment mean that most of the people included into a topical list tweet mostly about the topic? I find it quite surprising, given the lack of any filtering in Twitter lists.
Posted by: Dmitry Shaporenkov | August 06, 2010 at 07:43 AM
Okay, it probably doesn't mean that; it would be incorrect to infer that from your experiment's description. Still, I don't see why topical lists should exhibit topic coherence (not that I'm very surprised they often *do*); it shouldn't be very hard to find counterexamples.
Posted by: Dmitry Shaporenkov | August 06, 2010 at 07:47 AM
BTW here's one counter example: a list of NLP which I'm a member of: http://twitter.com/zelandiya/nlproc At the time of posting this comment, none of the top 20 posts is NLP-related
Posted by: Dmitry Shaporenkov | August 06, 2010 at 07:51 AM
The first one :). Anyway, I think if you fetch the last ~100 tweets, the word frequencies would show that it is NLP-related.
Posted by: Mariagrineva | August 06, 2010 at 08:00 AM
The experiments mean that even if many of the list's members tweet about different topics, the tweets about the main topic add up to a distinguishable topic signature. The other topics are diverse, but the main topic is common to all members, so it dominates in the frequencies.
Posted by: Mariagrineva | August 06, 2010 at 08:29 AM
The latter makes perfect sense. Now would be interesting to determine automatically whether a list is a topical one, for a suitable definition of topical. Like "most of time, most of its last N tweets pertain to the topic".
Posted by: Dmitry Shaporenkov | August 06, 2010 at 11:29 PM
"Lists" are simply too much work for most.
And, a "positive filter" as suggested (push certain information) rather than a "negative filter" (remove information) is counter-intuitive for most Users.
We know what we don't want, not necessarily what we do want.
If you can just get rid of all Location check-ins (4square, etc.) the reduction in noise with that alone might make Twitter more valuable. Location check-ins populating a stream are the most invasive "noise" yet, and growing (You are one of the only people I haven't unFollowed due to check-ins; but, give it another week).
Posted by: uguest22 | August 29, 2010 at 01:55 PM
The dyndns link doesnt work atm
Posted by: Bibek | September 30, 2010 at 05:08 PM
I hope i can get the working link here.
Posted by: meizitang | March 31, 2011 at 07:37 AM
I am sorry, but the demo has already been shut down.
Posted by: Maria Grineva | April 05, 2011 at 12:00 PM