Welcome to the home page of the Data Management Research Group at Brown University's Department of Computer Science. Our research group focuses on a wide range of problem domains for database management systems, including analytical (OLAP), transactional (OLTP), and scientific workloads.

Latest News

CIDR 2015 Paper Accepted

April 14th, 2015

The Brown Data Management Group has the following paper in CIDR 2015:

  • Tupleware: "Big" Data, Big Analytics, Small Clusters
       Andrew Crotty, Alex Galakatos, Kayhan Dursun, Tim Kraska, Ugur Cetintemel, Stan Zdonik

    There is a fundamental discrepancy between the targeted and actual users of current analytics frameworks. Most systems are designed for the challenges of the Googles and Facebooks of the world: processing petabytes of data distributed across large cloud deployments consisting of thousands of cheap commodity machines. Yet, the vast majority of users analyze relatively small datasets of up to several terabytes in size, perform primarily compute-intensive operations, and operate clusters ranging from only a few to a few dozen nodes. Targeting these users fundamentally changes the way we should build analytics systems. This paper describes our vision for the design of Tupleware, a new system specifically aimed at complex analytics on small clusters. Tupleware's architecture brings together ideas from the database and compiler communities to create a powerful end-to-end solution for data analysis that compiles workflows of user-defined functions into distributed programs. Our preliminary results show performance improvements of up to three orders of magnitude over alternative systems.
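
    As a rough, purely illustrative sketch of the kind of workload the abstract describes (complex analytics written as small user-defined functions that the system fuses and compiles into one distributed program), consider a single k-means iteration expressed as UDFs in Python. The function names and the single-machine driver below are assumptions for illustration only and are not Tupleware's actual API:

        # Hypothetical sketch: a compute-intensive analytics job written as small UDFs.
        # A system in the spirit of Tupleware would fuse these UDFs and compile the
        # whole pipeline into one tight distributed program rather than interpret
        # each operator separately.

        def parse(line):
            # UDF: turn a CSV line into a feature vector
            return [float(x) for x in line.split(",")]

        def nearest(point, centroids):
            # UDF: compute-intensive inner loop (squared Euclidean distance)
            def dist(c):
                return sum((p - q) ** 2 for p, q in zip(point, c))
            return min(range(len(centroids)), key=lambda i: dist(centroids[i]))

        def kmeans_iteration(lines, centroids):
            # map: parse and assign each point; reduce: recompute the centroids
            points = [parse(l) for l in lines]
            labels = [nearest(p, centroids) for p in points]
            new_centroids = []
            for i, c in enumerate(centroids):
                members = [p for p, lab in zip(points, labels) if lab == i]
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)] if members else c
                )
            return new_centroids

        # Example: two 2-D points and two initial centroids
        print(kmeans_iteration(["0,0", "4,4"], [[1.0, 1.0], [3.0, 3.0]]))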

PLANET Accepted at SIGMOD 2014

February 13th, 2014

The Brown Data Management Group has the following paper in SIGMOD 2014:

  • PLANET: Making Progress with Commit Processing in Unpredictable Environments
       Tim Kraska

    Latency unpredictability in a database system can come from many factors, such as load spikes in the workload, inter-query interactions from consolidation, or communication costs in cloud computing or geo-replication. High variance and high latency environments make developing interactive applications difficult, because transactions may take too long to complete, or fail unexpectedly. We propose Predictive Latency-Aware NEtworked Transactions (PLANET), a new transaction programming model and underlying system support to address this issue. The model exposes the internal progress of the transaction, provides opportunities for application callbacks, and incorporates commit likelihood prediction to enable good user experience even in the presence of significant transaction delays. The mechanisms underlying PLANET can be used for admission control, thus improving overall performance in high contention situations. In the SIGMOD 2014 paper “PLANET: Making Progress with Commit Processing in Unpredictable Environments”, we present this new transaction programming model, demonstrate its expressiveness via several use cases, and evaluate its performance using a strongly consistent geo-replicated database across five data centers.
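
    The core idea of the abstract, letting the application observe a transaction's internal progress and a predicted commit likelihood so it can react before the final outcome is known, can be pictured with the small hypothetical sketch below. The class, callback signature, stages, and probabilities are invented for illustration and are not PLANET's actual API:

        # Hypothetical sketch of a latency-aware transaction handle with progress
        # callbacks and a commit-likelihood estimate; not PLANET's real interface.
        from dataclasses import dataclass, field
        from typing import Callable, List

        @dataclass
        class LatencyAwareTxn:
            on_progress: List[Callable[[str, float], None]] = field(default_factory=list)

            def _notify(self, stage, commit_likelihood):
                for cb in self.on_progress:
                    cb(stage, commit_likelihood)

            def execute(self):
                # Each stage reports progress plus an estimated probability of commit,
                # so the application can respond to the user before the slow
                # geo-replicated commit actually finishes.
                self._notify("buffered locally", 0.50)
                self._notify("replicated to a quorum", 0.90)
                self._notify("committed", 1.00)

        def ui_callback(stage, likelihood):
            # Application callback: update the UI optimistically once the commit
            # is judged likely enough, instead of blocking on the final outcome.
            if likelihood >= 0.9:
                print(f"[{stage}] likely to commit ({likelihood:.0%}); show result now")
            else:
                print(f"[{stage}] still pending ({likelihood:.0%})")

        txn = LatencyAwareTxn(on_progress=[ui_callback])
        txn.execute()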

SDM 2014 Accepted Paper

January 2nd, 2014

The Brown Data Management Group has the following paper in SDM 2014 as part of the BigData project:

  • Finding the True Frequent Itemsets
       Matteo Riondato, Fabio Vandin

    Frequent Itemsets (FIs) mining is a fundamental primitive in data mining. It requires identifying all itemsets appearing in at least a fraction θ of a transactional dataset D. Often though, the ultimate goal of mining D is not an analysis of the dataset per se, but the understanding of the underlying process that generated it. Specifically, in many applications D is a collection of samples obtained from an unknown probability distribution π on transactions, and by extracting the FIs in D one attempts to infer itemsets that are frequently (i.e., with probability at least θ) generated by π, which we call the True Frequent Itemsets (TFIs). Due to the inherently stochastic nature of the generative process, the set of FIs is only a rough approximation to the set of TFIs, as it often contains a huge number of false positives, i.e., spurious itemsets that are not among the TFIs. In this work we design and analyze an algorithm to identify a threshold θ̂ such that the collection of itemsets with frequency at least θ̂ in D contains only TFIs with probability at least 1 − δ, for some user-specified δ. Our method uses results from statistical learning theory involving the (empirical) VC-dimension of the problem at hand. This allows us to identify almost all the TFIs without including any false positive. We also experimentally compare our method with the direct mining of D at frequency θ and with techniques based on widely-used standard bounds (i.e., the Chernoff bounds) of the binomial distribution, and show that our algorithm outperforms these methods and achieves even better results than what is guaranteed by the theoretical analysis.
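
    For intuition, the Chernoff/binomial-bound baseline that the abstract compares against can be sketched as follows: raise the mining threshold by a deviation term so that an itemset whose true frequency under π is below θ is reported only with small probability. The function name, the Hoeffding-style deviation bound, and the union bound over a candidate count are illustrative assumptions; the paper's own method uses a tighter VC-dimension-based bound that is not reproduced here:

        # Hypothetical sketch of a Chernoff/Hoeffding-style baseline: mine at a
        # raised threshold theta_hat so that, with probability at least 1 - delta,
        # no itemset with true frequency below theta is reported.
        import math

        def adjusted_threshold(theta, n, delta, num_candidates):
            # For a dataset of n transactions, an itemset with true frequency < theta
            # has empirical frequency >= theta + eps with probability at most
            # exp(-2 * n * eps**2); a union bound over num_candidates itemsets then
            # gives the eps below for an overall failure probability of delta.
            eps = math.sqrt(math.log(num_candidates / delta) / (2 * n))
            return theta + eps

        # Example: theta = 0.05, one million transactions, delta = 0.05,
        # 10,000 candidate itemsets -> a slightly raised threshold.
        print(adjusted_threshold(0.05, 1_000_000, 0.05, 10_000))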