Recently, I presented our paper "Extracting Key Terms from Noisy and Multi-theme Documents" at WWW 2009. Now let me put aside academic nuances and decorations and say a few simple words about what this work is about and what sequence of ideas led to it.
1. Semantic relatedness of terms
Wikipedia itself is a huge opportunity for research and innovation. That was the starting point of the whole sequence :). You can relatively easily get semantic information on almost any subject from Wikipedia; it contains very useful disambiguation pages, redirect pages, and categories. We have an ongoing project on mining Wikipedia to enhance semantic search and advertisement context match.
One of the most important things that you can get from Wikipedia is a measure of how much two articles are related. All Wikipedia articles are interconnected via hyperlinks, and we use these links to compute semantic relatedness for two articles. We take into consideration the type of a link (for example, we give higher weights to see_also links, as they by definition refer to semantically close concepts), whether the two articles refer to each other, and how many common neighbors they refer to.
So, eventually, for any two Wikipedia articles we can tell their semantic relatedness, which is a number in [0,1]: 0 for unrelated articles, 1 when the two articles are the same.
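To make this concrete, here is a toy Python sketch of a link-based relatedness measure in that spirit: a weighted overlap of common neighbors, plus a bonus when the articles link to each other. The exact formula in the paper differs; the link weights, the bonus value, and the sample data below are illustrative assumptions, not real Wikipedia data.

```python
def relatedness(a, b, links):
    """Toy semantic relatedness in [0, 1] for Wikipedia articles a and b.

    `links` maps an article title to {neighbor_title: weight}; a see_also
    link would carry a higher weight than a plain body link.
    """
    if a == b:
        return 1.0
    la, lb = links.get(a, {}), links.get(b, {})
    if not la or not lb:
        return 0.0
    # weighted overlap of common neighbors (a Dice-style coefficient)
    common = set(la) & set(lb)
    overlap = sum(min(la[t], lb[t]) for t in common)
    score = 2.0 * overlap / (sum(la.values()) + sum(lb.values()))
    # small bonus (an illustrative value) when the articles link to each other
    if b in la and a in lb:
        score = min(1.0, score + 0.2)
    return score

# tiny hand-made link structure, just for demonstration
links = {
    "Apple Inc.": {"ITunes": 2.0, "IPod": 1.0},
    "ITunes": {"Apple Inc.": 2.0, "IPod": 1.0},
    "Banana": {"Fruit": 1.0},
}
```

With this data, an article is maximally related to itself, "Apple Inc." and "Banana" come out as unrelated, and "Apple Inc." and "ITunes" land somewhere in between.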
2. Semantic graph of a document
Now that we have semantic relatedness for any two terms, we got interested in how terms are interrelated within a document. What is the semantic structure of a text document? We can build the semantic graph of a document and try to see some regularity in it.
Let's look at this news article: "Apple to make ITunes more accessible to the blind".
Apple will be making iTunes more accessible to blind consumers, under an agreement reached with the Massachusetts attorney general's office and the National Federation of the Blind.
Under the agreement, Apple will make iTunes U--a portion of the iTunes Store dedicated to educational content provided by colleges and universities--fully accessible to the blind by December 31, 2008. It will then work to provide full accessibility of the iTunes application and the remainder of the iTunes Store by June 30, 2009.
The blind and visually impaired will get fuller accessibility to the Apple application and Web site for downloading and purchasing music by means of screen access software that converts on-screen information into Braille or speech.
Apple will also contribute $250,000 to the Massachusetts Commission for the Blind to help the agency buy assistive technology that can make the Internet and computer programs more approachable for the blind.
Future versions of iTunes at the time of their release will have to be fully accessible to the blind, according to the Friday announcements on the agreement. Apple released iTunes 8 earlier this month, along with a revamped iPod Nano and updated iPod Touch.
In August, retailer Target and the National Federation of the Blind settled a class action lawsuit over the accessibility of the Target.com Web site.
We extract all terms of the document, carry some of them through a disambiguation procedure (term disambiguation is a huge topic, worth a PhD thesis of its own, so I'm just skipping it here :)), and eventually map document terms onto Wikipedia article titles.
For every pair of terms we compute semantic relatedness and build the semantic graph: vertices are terms; an edge between two terms indicates that these two terms are semantically related; the weight of the edge is the measure of semantic relatedness of the two connected terms.
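In code, building such a graph is just a pairwise loop over the terms. A minimal sketch (the function name, the threshold, and the toy `sim` function are my own illustrative choices, standing in for a real relatedness framework):

```python
from itertools import combinations

def build_semantic_graph(terms, sim, threshold=0.1):
    """Vertices are terms; keep an edge (a, b) -> weight whenever the
    relatedness sim(a, b) exceeds the threshold."""
    edges = {}
    for a, b in combinations(terms, 2):
        w = sim(a, b)
        if w > threshold:
            edges[(a, b)] = w
    return edges

# toy stand-in for a real semantic relatedness function
def sim(a, b):
    return 0.9 if {a, b} == {"Apple Inc.", "ITunes"} else 0.05

graph = build_semantic_graph(["Apple Inc.", "ITunes", "Blindness"], sim)
```

Here only the "Apple Inc." / "ITunes" pair clears the threshold, so the resulting graph has a single weighted edge; the weakly related pairs are dropped, which is exactly what keeps the graph's community structure visible.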
Now the next idea. We look at the semantic graph and see a very interesting regularity: the terms related to the main topics of the document tend to bunch up into densely interconnected subgraphs, or communities, while unimportant terms fall into weakly interconnected communities or even become isolated vertices. In the figure you can see that the terms related to Apple Inc. and Blindness, the two main topics of the document, organized into two densely interconnected communities. Now we just need to detach them automatically.
So we are going to exploit this feature of the semantic graph, and all we need now is a proper graph analysis technique that automatically detects such term communities in our semantic graph.
Here we take the widely used Girvan-Newman algorithm for detecting community structure in networks. You can find some details on why we chose this technique and not some other in the paper.
We process the semantic graph with the Girvan-Newman algorithm (I am using the beautiful igraph library for this task; thanks, Gábor Csárdi and Tamás Nepusz). The result is a set of term communities (or groups). Now we can just take the most densely interconnected communities (just sum up all edge weights between terms within a community and divide by the number of terms). These communities correspond to the main topics of the document.
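The ranking step in parentheses can be sketched in a few lines of pure Python. The communities and edge weights below are made up for illustration; in the real pipeline the communities would come out of igraph's Girvan-Newman implementation:

```python
def community_density(community, edges):
    """Density of a community: sum of intra-community edge weights
    divided by the number of terms in the community."""
    terms = set(community)
    total = sum(w for (a, b), w in edges.items() if a in terms and b in terms)
    return total / len(terms) if terms else 0.0

# toy semantic graph: (term_a, term_b) -> relatedness weight
edges = {
    ("Apple Inc.", "ITunes"): 0.9,
    ("Apple Inc.", "IPod"): 0.8,
    ("ITunes", "IPod"): 0.7,
    ("Web page", "Menu"): 0.1,
}
communities = [["Apple Inc.", "ITunes", "IPod"], ["Web page", "Menu"]]
ranked = sorted(communities, key=lambda c: community_density(c, edges),
                reverse=True)
```

The densely interconnected Apple community ranks first, while the weakly connected navigational-noise community sinks to the bottom, which is exactly the filtering effect described next.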
This approach is very good at filtering out noise in the document. That is why we apply it to Web pages, as they are flooded with navigational bars, menus, headers, and often irrelevant contextual ads ;). Knowing the topic of a Web page (or its several topics) can greatly improve advertising. At the same time, the absence of a pronounced topic in a Web page can mean the page is spam.
And one last note: the crux of this method is the quality of semantic relatedness. I mean, if computing semantic relatedness produces errors, the subsequent graph analysis will go pell-mell. So, if you want to reproduce this, you need to build a reliable framework for computing semantic relatedness and be ready to tune it.