The Web 2.0 era is characterized by large amounts of user-generated content. People started generating and sharing data on Web services like blogs, social networks, Wikipedia, photo sharing sites and others.
Today, with the emergence of mobile Internet access, the nature of user-generated content has changed. People now contribute more often and with smaller posts, and the life-span of these posts has become shorter. On Twitter, people share short posts about what they are doing or reading right now and discuss breaking news; on services like Foursquare or Facebook Places, they share their current location.
MapReduce/Hadoop has become the state-of-the-art approach for analytical batch processing of user-generated data. However, processing data in batches is becoming too slow for time-sensitive data: accumulated data can lose its importance within hours or even minutes. The real-time Web brings new requirements for analytical systems: they must aggregate values in real time, incrementally, as new data arrives. As a result, workloads become more database-intensive, because aggregate values are not produced all at once, as in batch processing, but are stored in a database and updated constantly.
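To make the contrast concrete, here is a minimal sketch of batch versus incremental aggregation, using hashtag counting as the example. The names `batch_count` and `IncrementalCounter` are made up for illustration, and a plain Python dict stands in for the database that holds the aggregate values.

```python
from collections import Counter


def batch_count(posts):
    """Batch style: recompute all hashtag counts from the full data set."""
    counts = Counter()
    for post in posts:
        for word in post.split():
            if word.startswith("#"):
                counts[word] += 1
    return counts


class IncrementalCounter:
    """Incremental style: keep counts in a store and update them per post."""

    def __init__(self):
        # Stands in for a database table of aggregate values (assumption).
        self.store = {}

    def on_post(self, post):
        # Each arriving post triggers small read-modify-write updates.
        for word in post.split():
            if word.startswith("#"):
                self.store[word] = self.store.get(word, 0) + 1


posts = ["reading #vldb papers", "#vldb demo today", "lunch at #eth"]

live = IncrementalCounter()
for p in posts:
    live.on_post(p)  # counts are up to date after every single post
```

Both approaches end with the same counts, but the incremental version touches the store on every post, which is why the workload shifts toward the database.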
At the Systems Group @ ETH Zurich, we are working on Triggy - a system for real-time analytics. Our system is based on Cassandra, a distributed key-value store. You can find an overview of Cassandra's internals in my presentation embedded below and read about its data model here. We extend Cassandra with push-style procedures and with serialized access to aggregate values. Push-style processing allows us to propagate new data immediately to the analytical computations. Serialization is used to provide light-weight transactions for consistent updates of counters (aggregate values), since Cassandra by itself does not provide any support for transactions.
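Why serialized access matters can be shown with a small sketch. This is not Triggy's or Cassandra's code: `SerializedCounters` is a hypothetical in-memory class that serializes read-modify-write updates per key with a lock, so that concurrent increments of the same counter are not lost.

```python
import threading
from collections import defaultdict


class SerializedCounters:
    """Hypothetical store that serializes updates to each aggregate value."""

    def __init__(self):
        self._values = {}
        self._locks = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, key):
        with self._locks_guard:
            return self._locks[key]

    def update(self, key, fn, initial=0):
        # Read, compute and write happen under the key's lock, so two
        # updates of the same key can never interleave.
        with self._lock_for(key):
            current = self._values.get(key, initial)
            self._values[key] = fn(current)
            return self._values[key]

    def get(self, key, default=0):
        return self._values.get(key, default)


def run_demo(threads=8, increments=1000):
    """Increment one counter from several threads and return the result."""
    counters = SerializedCounters()

    def worker():
        for _ in range(increments):
            counters.update("#vldb", lambda v: v + 1)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counters.get("#vldb")
```

Updates to different keys still proceed independently, since each key has its own lock; only updates to the same aggregate value are forced into a sequence.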
In Triggy, we implemented a programming model similar to MapReduce, but modified it to support incremental processing.
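As a rough illustration of what an incremental variant of MapReduce can look like (this is a sketch under my own assumptions, not Triggy's actual API), the reduce function below receives the current aggregate and one new value instead of the full list of values for a key. `IncrementalJob`, `map_post` and `reduce_incremental` are hypothetical names.

```python
def map_post(post):
    """Map: emit a (hashtag, 1) pair for every hashtag in a post."""
    for word in post.split():
        if word.startswith("#"):
            yield word.lower(), 1


def reduce_incremental(key, current, value):
    """Reduce: fold one new value into the current aggregate for the key."""
    return (current or 0) + value


class IncrementalJob:
    """Hypothetical driver: pushes each new record through map and reduce."""

    def __init__(self, map_fn, reduce_fn):
        self.map_fn = map_fn
        self.reduce_fn = reduce_fn
        self.aggregates = {}  # stands in for the key-value store (assumption)

    def push(self, record):
        # Called as soon as a record arrives, not once per batch.
        for key, value in self.map_fn(record):
            current = self.aggregates.get(key)
            self.aggregates[key] = self.reduce_fn(key, current, value)


job = IncrementalJob(map_post, reduce_incremental)
job.push("Demo at #VLDB")
job.push("#vldb and #cassandra")
```

The point of this shape is that the aggregate is always current: after each `push`, `job.aggregates` already reflects the new record, with no need to reprocess earlier data.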
Here is my presentation about Triggy, where I describe its internals and programming model, compare it to similar systems (Yahoo! S4 and Google Percolator), and discuss applications for Triggy. See the presenter notes for the slides to get more information.
Triggy will be demonstrated at VLDB 2011: Max Grinev, Maria Grineva, Martin Hentschel and Donald Kossmann: "Analytics for the Real-Time Web".