October 27th, 2016

Las Vegas: Using Replication to Speed-up MapReduce

Overview

Las Vegas is an inter-collegiate research project whose goal is to dramatically speed up MapReduce systems through correlated sorting and partitioning.

Motivation

MapReduce systems such as Hadoop are becoming dominant because of their ease of use, high scalability, and fault tolerance. They are designed to scale up to thousands of machines and are especially suited to the analysis of large datasets, such as petabytes of image data taken from telescopes.

[Figure: the Shuffle phase]

When a query needs more than one dataset at a time, such as a query involving a JOIN operation, the data management system must transmit a large amount of data over the network. For example, the Shuffle phase in MapReduce systems copies intermediate map output over the network and in many cases causes a significant network-I/O bottleneck.
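To make the bottleneck concrete, the sketch below shows a conventional reduce-side join in Hadoop MapReduce. This is only an illustration, not Las Vegas code; the input file names ("orders*", "customers*") and the assumption that column 0 holds the join key are hypothetical. Every record is re-keyed on the join column, and the Shuffle must copy all of it over the network to the reducer that owns each key.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReduceSideJoinSketch {

  // Map phase: tag each record with its source table and emit it under the join key.
  public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
      String tag = file.startsWith("orders") ? "R" : "S"; // which table this split came from
      String joinKey = line.toString().split(",")[0];     // assume column 0 is the join key
      // The Shuffle now copies this record over the network to the reducer for joinKey.
      ctx.write(new Text(joinKey), new Text(tag + "\t" + line));
    }
  }

  // Reduce phase: all records sharing a join key meet here; buffer one side, join the other.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text joinKey, Iterable<Text> tagged, Context ctx)
        throws IOException, InterruptedException {
      List<String> left = new ArrayList<>();
      List<String> right = new ArrayList<>();
      for (Text t : tagged) {
        String[] parts = t.toString().split("\t", 2);
        (parts[0].equals("R") ? left : right).add(parts[1]);
      }
      for (String l : left)
        for (String r : right)
          ctx.write(joinKey, new Text(l + " | " + r));
    }
  }
}
```

Because neither input is organized by the join key, there is no way to avoid moving essentially the entire input across the network before the reducers can do any joining.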

Our goal is to alleviate this bottleneck by exploiting correlations. Our previous project, CORADD, exploited correlations to design fast and compact indexes. Las Vegas, on the other hand, exploits correlations to speed up, or even eliminate, the Map, Reduce, and Shuffle phases of queries in MapReduce systems.

Our Approach

The Hadoop Distributed File System (HDFS) splits data files into 64MB blocks and replicates each block a few times for fault tolerance. These replicas are identical copies, used only to recover lost blocks from other nodes when a node fails. Our project, by contrast, exploits this redundancy not only for fault tolerance but also for speeding up query execution.
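As a reference point, here is a minimal sketch using the stock Hadoop FileSystem API (not LVFS) that writes a file with an explicit 64MB block size and three replicas; the file path is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/data/example.dat");   // hypothetical path
    short replication = 3;                        // three replicas per block
    long blockSize = 64L * 1024 * 1024;           // 64MB blocks, as described above

    // FileSystem.create(path, overwrite, bufferSize, replication, blockSize)
    try (FSDataOutputStream out =
             fs.create(file, true, 4096, replication, blockSize)) {
      out.writeUTF("example record");
    }

    // Each block of /data/example.dat now lives on three different nodes;
    // stock HDFS uses the extra copies only to recover data when a node fails.
  }
}
```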

[Figure: the Las Vegas File System (LVFS)]

In our new file system for Hadoop, the Las Vegas File System (LVFS), each file is partitioned and sorted on record values, then stored in a compressed columnar file format similar to [Floratou et al. VLDB'11]. The partitioning and sorting speed up both the Map phase and the Reduce phase when the query's grouping column is correlated with the partitioning or sorting column. For JOIN queries in particular, correlated partitioning can completely eliminate the expensive Shuffle phase. Further, instead of treating each replica as an identical copy (mirror), each replica can choose its own partitioning and sorting in order to speed up different sets of queries.
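The sketch below illustrates why correlated partitioning can eliminate the Shuffle: if both inputs of a JOIN are partitioned on the join key by the same function, and each local partition is kept sorted on that key, then every matching pair of records lives in the same partition, and each node can merge-join its local data with no network transfer at all. This is an illustration of the technique under those assumptions, not LVFS internals; the record layout and names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class CoPartitionedMergeJoinSketch {

  // A record in a local partition, already sorted by joinKey on disk (hypothetical layout).
  record Row(int joinKey, String payload) {}

  // Merge-join two partitions that were produced by the same partitioning function
  // and are sorted on joinKey; this runs entirely on the local node, with no Shuffle.
  static List<String> mergeJoin(List<Row> left, List<Row> right) {
    List<String> out = new ArrayList<>();
    int i = 0, j = 0;
    while (i < left.size() && j < right.size()) {
      int l = left.get(i).joinKey(), r = right.get(j).joinKey();
      if (l < r) i++;
      else if (l > r) j++;
      else {
        // Emit every right-side match for the current left row.
        int jStart = j;
        while (j < right.size() && right.get(j).joinKey() == l) {
          out.add(left.get(i).payload() + " | " + right.get(j).payload());
          j++;
        }
        i++;
        j = jStart;  // the next left row with the same key rejoins these right rows
      }
    }
    return out;
  }
}
```

Because replicas need not be mirrors, a second replica of the same file could be partitioned on a different column, letting a different set of joins run shuffle-free against that replica instead.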

The key technical challenges in this project are recovery and speculative query execution with such non-identical replicas, as well as a physical design (partitioning, sorting, and replica placement) that balances query performance against recovery time.

Source Code

All source code, manuscripts, and experimental results are freely available on GitHub.
We are still in the development phase, and a date for an official software release has not yet been set.

People

Brown:

DePaul:

MIT:

UC Berkeley: