August 24th, 2011

A Comparison of Approaches to Large-Scale Data Analysis

Andrew Pavlo
Brown University
Erik Paulson
University of Wisconsin
Alexander Rasin
Brown University
Daniel J. Abadi
Yale University
David J. DeWitt
Microsoft Inc.
Samuel Madden
M.I.T. CSAIL
Michael Stonebraker
M.I.T. CSAIL

Overview

This page documents how others can recreate the benchmark trials used in our SIGMOD 2009 paper. Our experiments were conducted on a 100-node cluster at the University of Wisconsin-Madison; each node had a single 2.40 GHz Intel Core 2 Duo processor running 64-bit Red Hat Enterprise Linux 5 (kernel version 2.6.18) with 4GB RAM and two 250GB SATA-I hard disks.

Publications

  • M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin, "MapReduce and Parallel DBMSs: Friends or Foes?," Communications of the ACM, vol. 53, iss. 1, pp. 64-71, 2010. [PDF] [BIBTEX]
    @article{stonebraker2010cacm,
      author = {Stonebraker, Michael and Abadi, Daniel and DeWitt, David J. and Madden, Sam and Paulson, Erik and Pavlo, Andrew and Rasin, Alexander},
      title = {MapReduce and Parallel DBMSs: Friends or Foes?},
      journal = {Communications of the ACM},
      volume = {53},
      number = {1},
      year = {2010},
      issn = {0001-0782},
      pages = {64--71},
      doi = {http://doi.acm.org/10.1145/1629175.1629197},
      publisher = {ACM},
      address = {New York, NY, USA},
      url = {http://database.cs.brown.edu/papers/stonebraker-cacm2010.pdf},
     }
  • A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker, "A comparison of approaches to large-scale data analysis," in SIGMOD ’09: Proceedings of the 35th SIGMOD international conference on Management of data, New York, NY, USA, 2009, pp. 165-178. [PDF] [BIBTEX]
    @inproceedings{pavlo09,
      author = {Pavlo, Andrew and Paulson, Erik and Rasin, Alexander and Abadi, Daniel J. and DeWitt, David J. and Madden, Samuel and Stonebraker, Michael},
      title = {A comparison of approaches to large-scale data analysis},
      booktitle = {SIGMOD '09: Proceedings of the 35th SIGMOD international conference on Management of data},
      year = {2009},
      isbn = {978-1-60558-551-2},
      pages = {165--178},
      location = {Providence, Rhode Island, USA},
      doi = {http://doi.acm.org/10.1145/1559845.1559865},
      publisher = {ACM},
      address = {New York, NY, USA},
     url = {http://database.cs.brown.edu/papers/benchmarks-sigmod09.pdf}
    }

Updates

  • August 24th, 2011
    The repository has been moved to a new permanent location: https://database.cs.brown.edu/svn/mr-benchmarks/
  • October 13th, 2009
    The following is the benchmarks.conf configuration that we used for the upcoming CACM paper. Thanks to feedback that we received from the Hadoop community, we are now able to get SequenceFiles+Compression to run slightly faster than plain text files.
    COMBINE_RESULTS=1    ## Combine Final Results into Single Output File
    COMPRESS_DATA=1      ## Compress Input Data and Intermediate Results
    SEQUENCE_FILE=1      ## Use SequenceFile Input Formats
    
  • April 14th, 2009
    Previously we incorrectly stated that we used the -client option in our Hadoop configuration; please be aware that we used -server in all of the trials reported in the paper. Note that if you do not enable the JVM reuse option, then you will want to keep mapred.child.java.opts at the default -client setting instead of -server (see the sketch below).
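    For reference, a minimal sketch of how the two combinations look in the property notation used in the Hadoop Configuration section below; only these two settings differ:
    ## With JVM reuse enabled (used for all trials in the paper)
    mapred.job.reuse.jvm.num.tasks  -1
    mapred.child.java.opts          -Xmx512m -server
    ## Without JVM reuse (the default of 1), keep the JVM in -client mode
    mapred.job.reuse.jvm.num.tasks  1
    mapred.child.java.opts          -Xmx512m -client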

Source Code

All of the data generation and Hadoop source code used in the benchmark study is available via our public SVN repository:

svn co https://database.cs.brown.edu/svn/mr-benchmarks/

There is also a WebSVN interface to browse the source code online.

Hadoop Configuration

We changed the following properties in $HADOOP_CONF_DIR/hadoop-site.xml for our Hadoop 0.19 installation:

dfs.block.size                  268435456 # 256MB
mapred.child.java.opts          -Xmx512m -server
mapred.compress.map.output      true
mapred.job.reuse.jvm.num.tasks  -1
io.sort.factor                  100
io.sort.mb                      256
tasktracker.http.threads        60
topology.script.file.name      /path/to/rack-mapper.pl
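
The same hadoop-site.xml typically needs to be deployed on every node in the cluster. One way to push the modified file out is sketched below; this is only an illustration that assumes passwordless SSH and an identical $HADOOP_CONF_DIR path on every host (you could also use the pusher utility described in the Data Generation section):

## Copy the modified configuration to every host listed in the slaves file
for host in $(cat $HADOOP_CONF_DIR/slaves); do
    scp $HADOOP_CONF_DIR/hadoop-site.xml $host:$HADOOP_CONF_DIR/
done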

Data Generation

There are two data sets that need to be generated: (1) the Grep data and (2) the HTML documents and their corresponding data tables. Please note that although the scripts that we provide here assume the network layout and infrastructure of the University of Wisconsin’s DB cluster, they should be easily adaptable to other configurations.

Grep Data Sets

We use a custom Map/Reduce job that generates the Grep data set in HDFS directly inside the map tasks. This job is a modified version of Yahoo!’s TeraGen data generator, which was also used for the TeraByte sort benchmark. Because the SQL systems cannot operate on files stored in HDFS, we provide scripts to download the data into separate files on each node and store them on the local file systems.

To generate the data, execute the following commands. Please note that our code assumes that there are 100 nodes in the cluster; you can modify teragen.pl to tune the parameters for how much data to create. This will create four different data sets (535MBx100, 10GBx100, 20GBx50, and 40GBx25) in the HDFS directory /data/SortGrep/.

svn co https://database.cs.brown.edu/svn/mr-benchmarks/
cd mr-benchmarks/datagen/teragen
start-dfs.sh
start-mapred.sh
perl ./teragen.pl
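
Once teragen.pl finishes, you can sanity-check that the data sets were created in HDFS (assuming the hadoop binary from your installation is on your PATH):

hadoop dfs -ls /data/SortGrep/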

You can then use the script teragen-grabber.pl to pull down the separate data files on each node. Again, please note that this script assumes that each node’s hostname is formatted as d-###, but this can easily be adapted to other naming schemes. You can also use our pusher utility (similar to pssh) to execute teragen-grabber.pl in parallel on each node. This program assumes that there is a shared network filesystem available on each node (e.g., NFS or AFS), even though our benchmarks did not use network storage.

pusher "$HOME/MapReduce/benchmarks/datagen/teragen/teragen-grabber.pl /data/SortGrep/ /path/to/local/directory/" \
   --debug --hosts=$HADOOP_CONF_DIR/slaves
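
To spot-check that the files landed on each node, you can reuse the same pusher syntax with a simple listing command (the directory below is the same placeholder used above):

pusher "ls -lh /path/to/local/directory/" --hosts=$HADOOP_CONF_DIR/slaves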

Analysis Benchmarks Data Sets

The SVN repository also contains code to generate the random HTML data sets. All you need to do is execute the following commands on a single node in your cluster, and it will fire off the generation program on each node in parallel (again, we assume that there is a shared file system available on each node). Make sure that you modify config.txt first in order to change the parameters according to your cluster configuration.

cd htmlgen
nano config.txt
python ./generateData.py

By default, this program generates a set of random HTML documents and derives the Rankings data set (1GB/node) and the UserVisits data set (20GB/node) from them. The SQL schema for loading these tables into a database is in CACM_schema.sql.
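
For example, loading the generated files into PostgreSQL might look like the following. This is only an illustrative sketch: the database name, the data file names, and the delimiter are assumptions about your local setup and the generator's output, not values taken from the paper.

createdb benchmark
psql -f CACM_schema.sql benchmark
psql -c "COPY Rankings FROM '/path/to/local/directory/Rankings.dat' WITH DELIMITER '|'" benchmark
psql -c "COPY UserVisits FROM '/path/to/local/directory/UserVisits.dat' WITH DELIMITER '|'" benchmark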

Hadoop Benchmarks

Execution of the Hadoop benchmarks is completely automated; the main controller of the benchmark execution is the runner.sh shell script. This program takes a configuration file (see benchmarks.conf-sample) that specifies which benchmarks and cluster configurations to execute. You will also need to modify the global configuration file (see runner.conf-sample) to correspond with your local cluster configuration. Once these files have been changed, and after you have downloaded the Grep and Analytical benchmark data onto the local filesystem of each node, you then need to build the JAR files used by the benchmark runner. Once this is complete, you can simply execute runner.sh to run all of the benchmarks:

cd mapreduce
cp runner.conf-sample runner.conf
nano runner.conf
cp benchmarks.conf-sample benchmarks.conf
nano benchmarks.conf
ant jars
./runner.sh ./benchmarks.conf

By default, the runner will automatically reformat your HDFS installation for each cluster size (e.g., 10 nodes, 25 nodes, etc.) and start/stop the Hadoop HDFS Master and JobTracker each time. This ensures that the data is distributed uniformly across all nodes in the cluster.
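
For reference, a rough sketch of what reformatting and restarting Hadoop by hand would involve using the standard 0.19 control scripts (the runner automates these steps for you):

## Stop the MapReduce and HDFS daemons
stop-mapred.sh
stop-dfs.sh
## Reformat the NameNode; note that a clean reformat also requires clearing
## each DataNode's dfs.data.dir so that the namespace IDs match afterwards
hadoop namenode -format
## Restart HDFS and the JobTracker
start-dfs.sh
start-mapred.sh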

You can modify the options in your benchmarks.conf configuration file to change this behavior. The other key options for the benchmarks are found at the top of this file (to enable or disable a particular feature in a benchmark trial, set its value to 1 or 0, respectively):

COMBINE_RESULTS=1    ## Combine Final Results into Single Output File
COMPRESS_DATA=0      ## Compress Input Data and Intermediate Results
TUPLE_DATA=0         ## Use custom tuples for multi-column values
SEQUENCE_FILE=0      ## Use SequenceFile Input Formats
SPLIT_DATA=0         ## Split input files into separate, smaller subsets
INPUT_LIMIT=0        ## Limit on the number of lines read from input files when loading

You will also need to set the SLAVES_CONF_DIR option in benchmarks.conf to point to a directory containing the different slave host configuration files (one for each of the different cluster sizes tested):

## Directory of slave files to use
## This directory should contain the following files:
##    slaves-001  (1 host)
##    slaves-010  (10 hosts)
##    slaves-025  (25 hosts)
##    slaves-050  (50 hosts)
##    slaves-100  (100 hosts)
SLAVES_CONF_DIR="$HADOOP_CONF_DIR/slaves-files"
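
If your full slaves file already lists all 100 hosts, one simple way to create these per-size files (an illustrative sketch; the directory and file names are just the ones expected above) is:

mkdir -p $HADOOP_CONF_DIR/slaves-files
head -n 1   $HADOOP_CONF_DIR/slaves > $HADOOP_CONF_DIR/slaves-files/slaves-001
head -n 10  $HADOOP_CONF_DIR/slaves > $HADOOP_CONF_DIR/slaves-files/slaves-010
head -n 25  $HADOOP_CONF_DIR/slaves > $HADOOP_CONF_DIR/slaves-files/slaves-025
head -n 50  $HADOOP_CONF_DIR/slaves > $HADOOP_CONF_DIR/slaves-files/slaves-050
head -n 100 $HADOOP_CONF_DIR/slaves > $HADOOP_CONF_DIR/slaves-files/slaves-100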

SQL Commands

The following are the SQL commands, taken from the paper, that we used for each of the benchmark tasks. You may need to modify them slightly in order to get them to work on different database systems.

Original MR Benchmark:

SELECT * FROM Data WHERE field LIKE '%XYZ%';

Benchmark #1 – Select Task

SELECT pageURL, pageRank FROM Rankings WHERE pageRank > X;

Benchmark #2 – Aggregate Task

SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;

Benchmark #2a – Aggregate Task (Variant)

SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7);

Benchmark #3 – Join Task

SELECT INTO Temp sourceIP,
                  AVG(pageRank) as avgPageRank,
                  SUM(adRevenue) as totalRevenue
  FROM Rankings AS R, UserVisits AS UV
 WHERE R.pageURL = UV.destURL
   AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22')
 GROUP BY UV.sourceIP;

SELECT sourceIP, totalRevenue, avgPageRank
  FROM Temp ORDER BY totalRevenue DESC LIMIT 1;
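
Some systems do not support the SELECT INTO form shown above; the following is a sketch of an equivalent rewrite of the first query using CREATE TABLE ... AS (a standard alternative, not the exact statement used in our trials):

CREATE TABLE Temp AS
 SELECT UV.sourceIP,
        AVG(pageRank) AS avgPageRank,
        SUM(adRevenue) AS totalRevenue
   FROM Rankings AS R, UserVisits AS UV
  WHERE R.pageURL = UV.destURL
    AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22')
  GROUP BY UV.sourceIP;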

Benchmark #4 – UDF Aggregate

SELECT INTO Temp F(contents) FROM Documents;
SELECT url, SUM(value) AS pageRank
  FROM Temp GROUP BY url;