on 15 October 20
MapReduce in Geographically Distributed Environments
The performance of MapReduce across geographically distributed environments depends heavily on how much data must cross the network and on the bandwidth and latency of the links between sites. Different MapReduce configurations suit different data distribution models. To select an appropriate model, it is important to understand the workload characteristics of the MapReduce job you are trying to run.
Cardosa et al. describe three workload data aggregation schemes for MapReduce jobs (Michael Cardosa, 2011). The first, High Aggregation, occurs when the output of the MapReduce process is orders of magnitude smaller than the input. These are jobs in which input data is categorized and counted, so that large numbers of matches are reduced to simple category counts. Examples include MapReduce grep and word or HREF counts over large amounts of distributed data: the input is a large list of files, and the output is a much smaller list of counts. The second scheme, Net Zero Aggregation, occurs when the output of a MapReduce process is approximately equal in size to the input. Sort is a good example, because the output of a sort job is typically the same size as its input. The third scheme, Ballooning Data, occurs when the MapReduce function produces more records and data than it was given. An example would be a MapReduce job that converts compact formats such as GIF to larger data formats such as JPEG. The amount of data produced is an important factor to consider when architecting a MapReduce solution. In their study, Cardosa et al. found that a geographically distributed environment works well when workloads are highly aggregated. For net zero aggregation or ballooning data, centralizing the data before applying MapReduce is preferred.
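The three schemes can be told apart by comparing output size to input size. The sketch below is illustrative only: the function names and the numeric thresholds (`high_ratio`, `balloon_ratio`) are assumptions made for this example, not values taken from the study by Cardosa et al.

```python
def classify_aggregation(input_bytes: int, output_bytes: int,
                         high_ratio: float = 0.01,
                         balloon_ratio: float = 1.1) -> str:
    """Label a job by its output/input size ratio.

    The thresholds are hypothetical defaults chosen for illustration.
    """
    if input_bytes <= 0:
        raise ValueError("input_bytes must be positive")
    ratio = output_bytes / input_bytes
    if ratio <= high_ratio:
        return "high aggregation"
    if ratio > balloon_ratio:
        return "ballooning data"
    return "net zero aggregation"


def suggested_placement(scheme: str) -> str:
    """Map a scheme to the deployment the text above describes as preferable."""
    if scheme == "high aggregation":
        return "run MapReduce at each site, then combine the results"
    return "centralize the data, then run MapReduce"


if __name__ == "__main__":
    scheme = classify_aggregation(input_bytes=10**9, output_bytes=10**5)
    print(scheme, "->", suggested_placement(scheme))
```

In practice the ratio would have to be estimated from a sample run or from prior knowledge of the job, since the output size is not known before the job executes.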
Text Mining Applications with MapReduce
MapReduce is gaining attention from the scientific community in the area of natural language processing (Atilla Soner Balkir, 2011). Natural language processing models often involve running optimization algorithms over large amounts of data. A long-standing constraint has been the need for high-speed access to large numbers of frequently changing parameter values. In information retrieval, the number of times an index term occurs in a document is called its term frequency. Discovering recurrent phrases in text automatically and with a quick turnaround is important to natural language applications, because a key phrase can help identify the intent of the user. Balkir et al. used MapReduce, implemented with Hadoop, to develop a model that chunks sentences into smaller phrases in order to identify recurrent ones. Their approach made identifying and matching these phrases against large, distributed data banks six times faster. The use of MapReduce is likely to continue to drive advances in natural language processing, speech recognition, and other forms of artificial intelligence.
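The idea of counting phrase frequencies with a map step and a reduce step can be sketched in a few lines. This is a single-process illustration, not the Hadoop model of Balkir et al.: it simply treats every run of `n` consecutive words as a candidate phrase, and all function names and the `min_count` cutoff are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def map_phrases(sentence: str, n: int = 2) -> Iterator[Tuple[str, int]]:
    """Map step: emit (phrase, 1) for every run of n consecutive words."""
    words = sentence.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n]), 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step: sum the emitted counts for each phrase."""
    totals: Dict[str, int] = defaultdict(int)
    for phrase, count in pairs:
        totals[phrase] += count
    return dict(totals)


def recurrent_phrases(sentences: Iterable[str], n: int = 2,
                      min_count: int = 2) -> Dict[str, int]:
    """Keep only phrases that occur at least min_count times."""
    pairs = (pair for s in sentences for pair in map_phrases(s, n))
    counts = reduce_counts(pairs)
    return {p: c for p, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    corpus = ["book a flight to boston", "please book a flight today"]
    print(recurrent_phrases(corpus))
```

On a real cluster the map step would run in parallel over splits of the corpus, and the framework would group the emitted pairs by phrase before the reduce step.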
Software Quality Assurance
Another application of MapReduce is the analysis of software repositories. Shang et al. studied the use of the model for mining software repositories (Weiyi Shang, 2010). The field of Mining Software Repositories involves analyzing source code, deployment logs, and bug repositories to find statistical correlations that can be used to identify and address issues in the code, such as potential security flaws. To illustrate their approach, they created a MapReduce program that counts the number of source code lines. The program had to evaluate each line of a program to determine whether it was an actual instruction or a comment. On extremely large repositories, this type of program showed a 20-fold improvement in performance over traditional approaches. They conclude that automated software engineering tools play an important role in the analysis of software repositories.
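A line-counting job of this kind fits the map and reduce pattern naturally: each file is classified line by line, and the per-file tallies are then merged. The sketch below is a simplified, single-process illustration and not the program written by Shang et al. It assumes that a comment is any line starting with `#` or `//`, ignores block comments, and uses hypothetical function names.

```python
from collections import Counter
from typing import Iterable

# Assumption for this sketch: only single-line comment markers are recognized.
COMMENT_PREFIXES = ("#", "//")


def map_file(source_text: str) -> Counter:
    """Map step: classify each line of one file as blank, comment, or code."""
    counts: Counter = Counter()
    for line in source_text.splitlines():
        stripped = line.strip()
        if not stripped:
            counts["blank"] += 1
        elif stripped.startswith(COMMENT_PREFIXES):
            counts["comment"] += 1
        else:
            counts["code"] += 1
    return counts


def merge_counts(partials: Iterable[Counter]) -> Counter:
    """Reduce step: add up the per-file tallies."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_repository_lines(files: Iterable[str]) -> Counter:
    """Run the map step on every file and reduce the results."""
    return merge_counts(map_file(text) for text in files)


if __name__ == "__main__":
    files = ["// header\nint x = 1;\n\nreturn x;", "# note\nprint(1)"]
    print(count_repository_lines(files))
```

Because each file is processed independently, the map step can be spread across many machines, which is where a large repository would benefit.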