In my previous post, I described how we use Wikipedia to extract the main topics from a Web page while filtering out noise content.
The beautiful thing about Wikipedia is its very good coverage of many different domains. Still, I sometimes get the question: are you sure Wikipedia is enough to understand any Web page on the WWW? OK, of course Wikipedia has its limits. Sometimes, if you need to dig deeper into a domain, you need more concepts.
To solve this problem we need to extend our Wikipedia-based knowledge base with some other, possibly domain-specific, dictionary or encyclopedia. Take for example TechCrunch's CrunchBase, a database of startup and investor descriptions: a good part of the CrunchBase entries are not present in Wikipedia.
Merging Wikipedia and CrunchBase fully automatically is a bad idea, since it would spoil Wikipedia's benefit of being a clean, human-produced resource. So my student Artem Chuzhmarov and I had a nice idea: do this task semi-automatically using Amazon Mechanical Turk.
How exactly do we need to merge the two encyclopedias? What we use from Wikipedia is its article titles and the hyperlinks between articles. So to merge CrunchBase into Wikipedia we need to add the CrunchBase entries and connect them to Wikipedia concepts via hyperlinks.
The Mechanical Turk worker gets a CrunchBase entry description (the title of a startup or the name of an investor, plus the text describing it) and performs the following task:
1) Check whether there is a corresponding article in Wikipedia; if not,
2) Read the entry text, identify the key terms in it, and find the corresponding Wikipedia articles for these key terms.
The result of the worker's effort for a single CrunchBase entry is a list of key terms and their corresponding Wikipedia articles. With that list we can add the entry to our knowledge base automatically.
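To make the "add the entry automatically" step concrete, here is a minimal sketch of how it could look. The `KnowledgeBase` class, its method names, and the example titles are all made up for illustration; I am assuming the knowledge base is just a map from a title to the set of titles it links to, and that the worker's answer arrives as a key term to Wikipedia title mapping.

```python
class KnowledgeBase:
    """Toy knowledge base: concept titles plus directed hyperlinks (hypothetical)."""

    def __init__(self):
        self.links = {}  # title -> set of titles this concept links to

    def add_article(self, title, links=()):
        """Register a Wikipedia article and its outgoing hyperlinks."""
        self.links.setdefault(title, set()).update(links)

    def add_crunchbase_entry(self, title, term_to_article):
        """Add a CrunchBase entry from a worker's answer.

        term_to_article maps each key term the worker found to the
        Wikipedia article title chosen for it. Titles that are not in
        the knowledge base are skipped rather than created.
        """
        known = {a for a in term_to_article.values() if a in self.links}
        self.links.setdefault(title, set()).update(known)
        return known


kb = KnowledgeBase()
kb.add_article("Venture capital", ["Startup company"])
kb.add_article("Startup company")
kb.add_article("Web search engine")

# Hypothetical worker answer for a made-up CrunchBase entry.
linked = kb.add_crunchbase_entry(
    "ExampleStartup",
    {
        "startup": "Startup company",
        "search engine": "Web search engine",
        "mistyped term": "No such article",
    },
)
```

Skipping unknown titles is one possible design choice: it keeps a worker's typo from silently creating a new concept, at the price of losing that link.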
CrunchBase entries become connected to Wikipedia in one direction only: they link to Wikipedia articles but are not referenced from them. That still seems to be enough to compute semantic relatedness between two CrunchBase entries, or between a CrunchBase entry and a Wikipedia article.
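Here is a small sketch of why outgoing links alone can be enough. It uses the Jaccard overlap of outgoing links as a stand-in measure; this is only an illustration under my own assumptions, not the relatedness measure we actually use, and the function name and example data are made up.

```python
def outlink_relatedness(links, a, b):
    """Jaccard overlap of the outgoing links of concepts a and b.

    links maps a title to the set of titles it links to. Only outgoing
    links are used, so a concept that nothing links to (such as a newly
    added CrunchBase entry) can still be compared.
    """
    out_a = links.get(a, set())
    out_b = links.get(b, set())
    union = out_a | out_b
    return len(out_a & out_b) / len(union) if union else 0.0


# Made-up data: two CrunchBase entries and one Wikipedia article.
links = {
    "StartupA": {"Web search engine", "Startup company"},
    "StartupB": {"Web search engine", "Venture capital"},
    "Venture capital": {"Startup company", "Investment"},
}

entry_vs_entry = outlink_relatedness(links, "StartupA", "StartupB")
entry_vs_article = outlink_relatedness(links, "StartupA", "Venture capital")
```

Any measure that relies on incoming links would see nothing for CrunchBase entries, so only the outgoing side of the graph is usable for them.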
In our experiments we paid about 15 cents per CrunchBase entry. So, to merge all of CrunchBase into Wikipedia we would pay about $3000, and this is a one-time expense.
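As a quick sanity check on these numbers, the two figures together imply roughly how many entries we are talking about. The entry count below is derived from the price and the total, not an official CrunchBase figure, and both inputs are approximate.

```python
# Work in whole cents to avoid floating-point rounding.
COST_PER_ENTRY_CENTS = 15        # approximate price paid per entry
TOTAL_BUDGET_CENTS = 3000 * 100  # approximate total of $3000

# Number of entries implied by the two figures above.
implied_entries = TOTAL_BUDGET_CENTS // COST_PER_ENTRY_CENTS
print(implied_entries)
```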