An Optimization Framework for Map-Reduce Queries
University: University of Texas at Arlington
Room: A3 (1st floor)
Time: 11:00am (coffee: 10:30)
The map-reduce programming model is a popular framework for large-scale data analysis on the cloud. Its most popular implementation, Hadoop, has been used extensively by many companies on a very large scale. Currently, the majority of these map-reduce jobs are specified in declarative query languages, such as Hive and Pig. In this talk, I will present a new framework for optimizing complex SQL-like map-reduce queries. It is based on a novel query algebra and uses a small number of higher-order physical operators that are directly implementable on existing map-reduce systems, such as Hadoop. Although our framework is applicable to any SQL-like map-reduce query language, I will focus on a powerful query language called MRQL. Current map-reduce query languages, such as Hive, let users plug custom map-reduce scripts into queries for those jobs that cannot be coded declaratively in the query language, which may result in suboptimal, error-prone, and hard-to-maintain code. In contrast to these languages, MRQL is expressive enough to capture most of these computations in declarative form while remaining amenable to optimization. I will describe how the framework maps the algebraic forms derived from MRQL queries to efficient workflows of map-reduce operations built from our physical plan operators. Finally, I will report on an open-source implementation of MRQL that can handle line-oriented, XML, and JSON data and can efficiently evaluate very complex data analysis queries, such as PageRank and k-means clustering.
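To make the idea of a higher-order map-reduce operator concrete, here is a minimal in-memory sketch in Python. It is an illustration only, not MRQL's actual physical operators or the Hadoop API: the name map_reduce and its signature are assumptions chosen for this example. The operator takes the map and reduce functions as parameters, which is what makes it higher-order.

```python
from collections import defaultdict


def map_reduce(map_fn, reduce_fn, records):
    """Hypothetical higher-order operator (not the MRQL or Hadoop API).

    map_fn:    record -> iterable of (key, value) pairs
    reduce_fn: (key, list of values) -> iterable of output items
    """
    # Map phase followed by an in-memory "shuffle": group values by key.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)

    # Reduce phase: apply reduce_fn once per key group.
    output = []
    for key, values in groups.items():
        output.extend(reduce_fn(key, values))
    return output


# Usage: word count expressed with the single operator above.
lines = ["a b a", "b c"]
counts = map_reduce(
    lambda line: [(word, 1) for word in line.split()],
    lambda word, ones: [(word, sum(ones))],
    lines,
)
```

A query optimizer in this style would choose and compose such operators rather than have users write the map and reduce scripts by hand; a real system distributes the grouping step across machines instead of using a local dictionary.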