Tuesday, October 25, 2011

MongoDB mapreduce for stackoverflow.com recent tags

I was playing with MongoDB's mapreduce, and wanted to write a query that simulates the list of 'Recent Tags' feature on stackoveflow home page.

I am using mongo-csharp-driver for this experiment.

Here I am taking a guess at stackoverflow's domain model. Here is the Question model with enough properties to demonstrate the mapreduce query.
Every question is associated with a list of Tags and has a CreatedOn property.
public class Question{
public ObjectId Id { get; set; }
public DateTime CreatedOn { get; set; }
public ICollection Tags { get; set; }
}


Create a database with some name and add 'Questions' collection to the mongodb.

We are going to make the following call to find the recent tags.

var recentTags = questions.MapReduce(map, reduce).GetResultsAs<ResentTagResult>();


RecentTagResult holds the results of the mapreduce query and defined as

public class RecentTagResult {
public string Id;
[BsonElement("value")]
public RecentTag Value;

}


MongoDb's mapreduce call outputs a result set with two properties _id and value. So here I am defining a mapper class with a property 'Id' and property 'Value' of type RecentTag. BsonElement attribute in the above code simply maps lower case property 'value' from the result set to title case property 'Value'. _id property from result is automatically mapped to Id by 'GetResultsAs' call.

RecentTag is defined as follows. Here I am expecting that the query results are going to contain Tag, count of tags, last time when a question was created with this tag. As you can guess our map/reduce functions must emit the values that match the following class definition.
public class RecentTag {
public string Tag { get; set; }
public int Count { get; set; }
public DateTime LastSeenOn { get; set; }

}

Now coming to the meat of the problem. A question can have more than one tag. We are looking for a list of tags used by questions asked in the last month. Here is the definition of the map function (which is completely written in javascript) that goes over entire collection of questions and emits a result each time a tag is found if that question is asked in the last month.

private string map =
@"function() {
if(this.CreatedOn >= new Date('Oct 1, 2011') && this.CreatedOn {
var lastseen = this.CreatedOn;
this.Tags.forEach(function(tag) {emit(tag, {Tag: tag, LastSeenOn:lastseen, Count: 1});});
}}";


It is very important that the emit function's value should match our RecentTag type definition.

emit(tag, {Tag: tag, LastSeenOn:lastseen, Count: 1})

Now coming to reduce, we have a bunch of emitted results from map function and we simply count them to find the total count of each tag found in last month.
A tag might appear more than once in the last month. If the tag appears only once, reduce function will never be called.

private string reduce =
@"function (key, arr_values) {
var dates = [];
arr_values.forEach(function(val) {dates.push(val.lastseenon)});
var result = {Tag: key, LastSeenOn: new Date(Math.max.apply(Math, dates )), Count:0};
for(var i in arr_values)
{
temp = arr_values[i];
result.Count += temp.Count;
}
return result;
}}";

Here att_values contain all emits for a single tag. Again it is important that our return type must match with ResultTag definition similar to the map function.
We start with that definition first

var result = {Tag: '', LastSeenOn: new Date(Math.max.apply(Math, dates )), Count:0};

And then iterate through arr_values and simply increment the count to get the final result.

Filling LastSeenOn property is little tricky. Here we are trying to find out the Max of the CreatedOn property of all emitted values of a single tag.

var dates = [];
arr_values.forEach(function(val) {dates.push(val.lastseenon)});

Here we are gathering all the dates from lastseenon property in to an array. And then while defining the result we are applying the javascrpt's Math.Max function to find the last seen date.

LastSeenOn: new Date(Math.max.apply(Math, dates )

That is all to it.