-- guest post by Dmitry Lizorkin --
In the poster we presented at WWW2009 in addition to our research paper, we proposed an approach to automatically organize Wikipedia knowledge by locating the implicit groupings of related Wikipedia articles into communities.
In this blog post we describe our background ideas behind the poster, as well as our subsequent considerations we discussed during the poster session.
Our initiating goal for the poster was to provide people with better insight of the knowledge available in Wikipedia using machine aid. To address this goal, we decided to consider Wikipedia as a graph (formed by articles and links between them) and to study what structure this graph had. Namely, we supposed that the graph should contain communities - groups of densely interlinked Wikipedia articles. Our inclination was motivated by the following two reasons:
A. First, Wikipedia is becoming really big, so big that no person alone is now able to mentally grasp the complete knowledge contained in Wikipedia as a whole. Even for a person who is an expert in some field, her field of expertise generally fits into some tens/hundreds Wikipedia articles, and that's just a small bit of Wikipedia as a whole. When humans edit Wikipedia to hyperlink articles or to organize articles into categories, each human is actually working on a tiny bit of Wikipedia that she is able to mentally grasp. And due to collaborative nature of Wikipedia, the global effect of such individual edits is a new formation on its own that goes qualitatively beyond comprehension of each individual contributor.So, we decided to analyze this big storage of knowledge globally with automatic techniques to reveal more coarse-granular groups of information there that would make it easier for humans to comprehend.
B. Second, when we studied communities within a graph formed by terms located in some text (for extracting key terms from text), we were intrigued to observe that these communities closely correspond to main topics considered in the text (recall our recent blog post). Encouraged by this observation, we felt eager to study the same approach in the global scope of Wikipedia as a whole.
With this motivation in mind, we chose the community detection algorithm by Girvan and Newman for achieving our task. The Girvan-Newman algorithm automatically detects the appropriate number of communities in a graph and analyses the global link structure in the graph to agglomerate densely linked groups of articles into communities. We applied the algorithm to the set of articles from the English Wikipedia and the subset of links that are used to convey stronger relationships between articles (in particular, links from Wikipedia "See also" sections). Even with the restricted set of links (made for computational reasons) the input graph was so large that it took the algorithm several days to complete the calculation. But the results we obtained were worth this computation effort.
And the first two notable results are:
1. Wikipedia turns out to have distinct division into communities.
That is, Wikipedia articles actually organize themselves into groups, such that articles within a single group are densely interlinked, and articles from different groups are linked more loosely.
2. Communities in Wikipedia turn out to contain more semantically similar Wikipedia articles.
That is, if you take a community and take a look at its member articles, you can observe that these articles are more similar to each other than if they were selected from Wikipedia merely at random.
Moreover, the smaller the community, the more similar member articles it contains. Even more, for a community that is relatively small (consisting of less than about 200 articles) one can usually notice that all articles in the community share a common topic (e.g., "ecology", "rail transport", "beer").
And that's not all:
When we initiate PageRank computation over such a community, the highest PageRank score typically gets assigned to the article that gives in its title the topic of the whole community.
For example, in the community that contains Wikipedia articles about tea, the highest PageRank is owned by the article entitled "Tea" (see the demo http://modis.ispras.ru/wikipedia/Tea.html)
If we stop for a bit and think of these findings, we realise that they are truly amazing ones:
In all the techniques we use above for analysis, we do not use any textual content from Wikipedia at all. What we are using is solely the link structure of Wikipedia (since both community detection and PageRank rely on link information only). And, having solely link information, we are able to (a) organize Wikipedia articles into communities with semantically similar articles, and (b) infer the title for each community, that gives the generalized topic for all the articles in the community.
This means that the link structure of Wikipedia is actually no less important in conveying global knowledge than the textual content of Wikipedia, - the fact that usually goes underestimated and even unnoticed.
The question that further arises is:
how can we now organize larger communities?
We noted above that for a larger community we usually cannot clearly observe that all its articles have a single close topic: either these are multiple topics or the topic is too broad.
However, we discovered that if we once again apply the Girvan-Newman algorithm to such a community on its own, it in turn exhibits distinct division into (sub)communities.
This means that the community (being the group of densely interlinked articles) desintegrates into distinct subgroups of articles that are interlinked even more densely.
That is, a larger community in fact exhibits a distinct nested division of its own!
By repeating this process while the community detection algorithm approves the good quality of division, we are able to automatically build a complete hierarchy.
The hierarchy we discovered provides the intuition on how knowledge in Wikipedia reveals to organize itself as the result of collaborative effort of Wikipedia contributors.
In particular, if we graphically depict each community with a circle, denote nested communities by nested circles and make circle sizes proportional to community sizes, then the hierarchy implied by Wikipedia looks schematically as follows:
The further natural question is:
Since Wikipedia provides its own mechanism for (manually) organizing articles into the categories, how do the located communities correlate with the Wikipedia category structure?
And our observation is:
In some cases, a community contains almost the same set of articles as some Wikipedia category does.
And in some cases, articles within a community are not associated into any common Wikipedia category.
But the second case does not actually mean that Wikipedia categories provide the "ground truth", while any other data organization is incorrect.
In fact, as we noted in the beginning of this post, a person combines Wikipedia articles into a category based on her local knowledge of the subject represented in these articles.
On the other hand, the ability of a machine to analyze the global link structure of Wikipedia reveals that the collaborative effort of individual human contributors sometimes reflects a different grouping, more appropriate according to the relations between articles expressed with links.
So with our findings we rather provide the insight on how knowledge in Wikipedia organises itself implicitly, and the hierarchy we constructed can be considered as an extension to the category structure existing in Wikipedia.
This poster is a joint work with our colleague Olena Medelyan from the University of Waikato.
Online demonstration of the constructed hierarchy is available at:
http://modis.ispras.ru/wikipedia/
We organized full-text search over the constructed hierarchy, by building full-text index over Wikipedia article titles and their abstracts (the piece of text in Wikipedia from the beginning of an article till its first section).
Thus indexed, search produces results for any search string that matches anything in a article title or abstract in Wikipedia.
Searching gives you the individual communities that match your search string and that are the leaves of the hierarchy, ranked by exactness of the match.
The demonstration allows you to find and navigate over the semantically similar Wikipedia articles on the subject of your interest.
Did you try to analyze community structure of Wikipedia category graph? That is, many category inclusions seem to be rather spurious, and perhaps hierarchical clustering of the category graph could help to detect such spurious inclusions.
Posted by: dsha | May 21, 2009 at 07:00 AM
We haven't analyzed Wikipedia category graph alone for detecting communities there, but this would indeed be the instructive thing to experiment with. And computationally it should be more manageable than for the whole set of Wikipedia articles :)
Thank you for your interesting suggestion!
We will think of performing this experiment and we'll let you know of the results if there be any interesting findings.
Posted by: Dmitry Lizorkin | May 21, 2009 at 09:00 AM
Interestingly the use of this material is one of the clothing I most closely associate with the Confederate flag as a poster for the race.
Posted by: affordable banner printing | April 18, 2011 at 12:06 PM
But the second case does not actually mean that Wikipedia categories provide the "ground truth", while any other data organization is incorrect.
Posted by: Dating agencies Review | June 29, 2011 at 05:59 AM
Its insane, that is what i can say concerning this post. Because this def is what the whole thing is about right? Keep it up!
Posted by: Social Kik | November 29, 2011 at 04:06 AM
I don't mind about this algorithm stuff. Using wikipedia as my close reference is still my main concern.
Posted by: how to find mr. right | January 14, 2013 at 05:18 AM
great post!wikipedia is really reliable source.
Posted by: BOokkepper CAloundra | May 31, 2013 at 11:37 PM