.NET
Raiting:
3

MapReduce in three pictures


image

What is MapReduce?

This is an approach, algorithm or pattern of the parallel processing of large volumes of unprocessed data, for example, the results of crawlers or the logs of web queries. According to statistics 80% tasks could be mapped on is mapped on MapReduce, it just drives NoSQL. There are different implementations of MapReduce. Well known and patented implementation of this algorithm and the approach of Google, for example: MySpace Qizmt - MySpace's Open Source Mapreduce Framework, is also used in Hadoop, MongoDb and there are many different examples that we can give. More details can be found in the article MapReduce: Simplified Data Processing on Large Clusters

The algorithm receives at the input 3 arguments: the source collection, Map function, and Reduce function, and it returns a new collection of data.

Collection MapReduce (Collection source, Function map, Function reduce)

The algorithm is composed by few steps; the first one consists to execute the Map function to each item within the source collection. The Map will return zero or may instances of Key/Value objects

ArrayOfKeyValue Map (object itemFromSourceCollection)

So, we can say that Map’s responsibility is to convert an item from the source collection to zero or many instances of Key/Value objects. We can see this at the picture below:

image

At the next step, the algorithm will sort all Key/Value instances and it will create new object instances where all values will be grouped by Key

image

The last step will executes the Reduce function by each grouped Key/Value instance.

ItemResult Reduce(KeyWithArrayOfValues item)

The Reduce function will return a new item instance that will be included into the result collection.

image

Sample

Let us take a look at a very simple c# implementation of this algorithm, it counts all the vocals in an array of strings.

This is a generic MapReduce functon that represents the algorithm’s orchestration, and I also implemented the Map and Reduce functions that are required as inputs of the MapReduce. The implementation of the Map and Reduce functions are specifics for the task that we want to accomplish, in this sample it is “count all vocals within a set of strings”

using System;
using System.Collections.Generic;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

namespace MapReduceSample
{
// This class represents a collection’s item result
class VocalCount
{
public char Vocal;
public int Count;
}

class Program
{
static void Main(string[] args)
{
// "lines" represents the source collection
var lines = new[] {
"How many vocals do",
"these two lines have?"
};

foreach (var line in lines)
{
Console.WriteLine(line);
}
Console.WriteLine();

// Invokes MapReduce
var results = MapReduce(lines, Map, Reduce);

// Displays result
foreach (var result in results)
{
Console.WriteLine("{0} = {1}", result.Vocal, result.Count);
}

Console.ReadKey();
}

/// <summary>
/// The map function counts vocals in a string
/// </summary>
/// <param name="sourceItem" />A line to be processed</param>
/// <returns>A collection of Key/Value.
/// Where the key is the vocal, and the value is its count.</returns>
static IEnumerable<KeyValuePair<char, int>> Map(string sourceItem)
{
return sourceItem
.ToLower()
.Where(c => "aeiou".Contains(c))
.GroupBy(c => c, (c, instances) => new KeyValuePair<char, int>(c, instances.Count()));

}

/// <summary>
/// The reduce function compute the total count for each vocal
/// </summary>
/// <param name="reduceItem" />Instance Key/Values, where the key is the vocal,
/// and value is an enumeration of all counts</param>
/// <returns>A result instance, VocalCount</returns>
static VocalCount Reduce(KeyValuePair<char, IEnumerable<int>> reduceItem)
{
return new VocalCount
{
Vocal = reduceItem.Key,
Count = reduceItem.Value.Sum() // Computes total count
};
}

/// <summary>
/// A generic implementation of MapReduce
/// </summary>
/// <typeparam name="TSource">Type of the items at the source collection</typeparam>
/// <typeparam name="TKey">Key’s type used by Map and Reduce functions</typeparam>
/// <typeparam name="TValue">Value’s type used by Map and Reduce functions</typeparam>
/// <typeparam name="TResult">Type of the items at the returned collection</typeparam>
/// <param name="source" />Source collection</param>
/// <param name="map" />Map function to apply</param>
/// <param name="reduce" />Reduce function to apply</param>
static IEnumerable<TResult> MapReduce<TSource, TKey, TValue, TResult>(
IEnumerable<TSource> source,
Func<TSource, IEnumerable<KeyValuePair<TKey, TValue>>> map,
Func<KeyValuePair<TKey, IEnumerable<TValue>>, TResult> reduce)
{
// Collection where map’s result will we store
var mapResults = new ConcurrentBag<KeyValuePair<TKey, TValue>>();

// Invokes, in a parallel way, the Map function for each item at the source
Parallel.ForEach(source, sourceItem =>
{
foreach (var mapResult in map(sourceItem))
{
mapResults.Add(mapResult);
}
});

// Groups all values by key
var reduceSources = mapResults.GroupBy(
item => item.Key,
(key, values) => new KeyValuePair<TKey, IEnumerable<TValue>>(key, values.Select(i=>i.Value)));

var resultCollection = new BlockingCollection<TResult>();

// Kick off a reduce task
Task.Factory.StartNew(() =>
{
// Invokes, in a parallel way, the Reduce function for each at reduceSources
Parallel.ForEach(reduceSources,
(reduceItem) => resultCollection.Add(reduce(reduceItem)));

resultCollection.CompleteAdding();
});

return resultCollection.GetConsumingEnumerable();
}
}
}
Tags: .net, mapreduce
ZimerMan 20 september 2011, 10:08
Vote for this post
Bring it to the Main Page
 

Comments

0 jimibt May 16, 2012, 11:28
very nice implementation here, have tried a few and this one is a great starting point for keeoing it simple but effective. thanks for that

jim

btw - the code formatting has got a bit scewed on the line: .Where(c => "aeiou".Contains©)

that should of course be: .Where(c => "aeiou".Contains(c))
0 ZimerMan May 16, 2012, 12:33
You were right, thank you. I've fixed it.

Leave a Reply

B
I
U
S
Help
Avaible tags
  • <b>...</b>highlighting important text on the page in bold
  • <i>..</i>highlighting important text on the page in italic
  • <u>...</u>allocated with tag <u> text shownas underlined
  • <s>...</s>allocated with tag <s> text shown as strikethrough
  • <sup>...</sup>, <sub>...</sub>text in the tag <sup> appears as a superscript, <sub> - subscript
  • <blockquote>...</blockquote>For  highlight citation, use the tag <blockquote>
  • <code lang="lang">...</code>highlighting the program code (supported by bash, cpp, cs, css, xml, html, java, javascript, lisp, lua, php, perl, python, ruby, sql, scala, text)
  • <a href="http://...">...</a>link, specify the desired Internet address in the href attribute
  • <img src="http://..." alt="text" />specify the full path of image in the src attribute