Map Reduce is kinda like “Normalize on the Fly”

One undervalued aspect of data modeling is that you actually get time to consider the form of the data before you get the data. In a MapReduce job, you know that your map phase is going to get the data, and that it is not going to be normalized. I could have said "not likely to be normalized," but the reality is that if you are using MapReduce, you are not going to be handed structured data.

The map step is where you deal with this. You take the data in its CLOB form and turn it into a series of key-value pairs. Strictly speaking, this isn't a map; it is a relation. In a map, every element of the domain is paired with a single element of the codomain (or range, as I learned it). In Hadoop MapReduce, there is no restriction that a given key always carry a unique value, although I suspect that in practice it probably should. Actually, since all of the values for a given key are collected into a list, you technically do get a map, just not at the end of this stage, and really nowhere in the system do you ever see all the elements of that map, just one key's sublist at a time.
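Here is a minimal sketch of that idea in plain Python rather than actual Hadoop code; the log-line format and the field positions are made up for illustration. The mapper emits (key, value) pairs in which the same key may repeat, and it is only the grouping step that turns those pairs into something map-like.

```python
# A toy mapper and shuffle in plain Python (not Hadoop); the input lines
# and their field layout are hypothetical.
from collections import defaultdict

def map_phase(raw_lines):
    """Emit (key, value) pairs from unstructured text.

    Nothing forces a key to appear only once, so the output is a
    relation, not a mathematical map.
    """
    for line in raw_lines:
        parts = line.split()          # crude parse of the "CLOB"
        if len(parts) < 3:
            continue                  # skip lines that don't parse
        ip, _, url = parts[0], parts[1], parts[2]
        yield (url, ip)               # the same url can be emitted many times

def shuffle(pairs):
    """Group all values for a key into a list.

    Only after this grouping does key -> [values] behave like a map,
    and each reducer still sees just one key's list at a time.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

raw = [
    "10.0.0.1 - /index.html",
    "10.0.0.2 - /index.html",
    "10.0.0.1 - /about.html",
]
print(dict(shuffle(map_phase(raw))))
# {'/index.html': ['10.0.0.1', '10.0.0.2'], '/about.html': ['10.0.0.1']}
```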

Regardless of the mathematical correctness of the term "map," a MapReduce program has a step that is responsible for creating a structured representation from an unstructured one. This is very similar to what a developer does when they have to take some data format and decide how to store it in a DBMS. The assumption there is that the DBMS is necessary for processing the events afterwards.
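To make the analogy concrete, here is the RDBMS side sketched with Python's sqlite3 module and a hypothetical one-table schema. The same crude parse from the mapper above runs once, at load time, and every later query works against the already-structured rows.

```python
# Sketch of the DBMS side of the analogy; the schema is made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (ip TEXT, url TEXT)")

raw = [
    "10.0.0.1 - /index.html",
    "10.0.0.2 - /index.html",
    "10.0.0.1 - /about.html",
]
# Normalization happens here, on data entry, and only here.
for line in raw:
    parts = line.split()
    if len(parts) < 3:
        continue
    conn.execute("INSERT INTO hits (ip, url) VALUES (?, ?)", (parts[0], parts[2]))

# Queries after this point never re-parse the raw text.
for url, count in conn.execute("SELECT url, COUNT(*) FROM hits GROUP BY url"):
    print(url, count)
```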

Thus a MapReduce operation defers the cost of normalizing the data, but then potentially pays it multiple times. With an RDBMS, you pay the price of normalizing the data upon data entry, and that cost is then amortized over all subsequent queries of the data. The comparison between MapReduce and SQL can therefore be viewed as an economic decision.
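A back-of-the-envelope version of that economic framing, with entirely made-up cost units, might look like this:

```python
# Toy cost model; the numbers are invented, only the structure matters.
parse_cost = 10      # cost to turn one raw record into structured form
query_cost = 1       # cost to answer a query over structured data
records = 1_000_000
queries = 50

# RDBMS: normalize once at load time, then each query reads structured rows.
rdbms_total = records * parse_cost + queries * records * query_cost

# MapReduce: every job re-pays the parse cost in its map phase,
# but nothing is paid up front for data that is never queried.
mapreduce_total = queries * records * (parse_cost + query_cost)

print(rdbms_total, mapreduce_total)
```

In this toy model the up-front normalization pays off as soon as the data is queried more than once; deferring it only wins when the data is rarely (or never) read back.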
