Welcome to the home page of the Data Management Research Group at Brown University's Department of Computer Science. Our research group is focused on a wide-range of problem domains for database management systems, including analytical (OLAP), transactional (OLTP), and scientific workloads.
The Brown Data Management Group has the following paper in CIDR 2015:
- Tupleware: "Big" Data, Big Analytics, Small Clusters
Andrew Crotty, Alex Galakatos, Kayhan Dursun, Tim Kraska, Ugur Cetintemel, Stan Zdonik
There is a fundamental discrepancy between the targeted and actual users of current analytics frameworks. Most systems are designed for the challenges of the Googles and Facebooks of the world: processing petabytes of data distributed across large cloud deployments consisting of thousands of cheap commodity machines. Yet, the vast majority of users analyze relatively small datasets of up to several terabytes in size, perform primarily compute-intensive operations, and operate clusters ranging from only a few to a few dozen nodes. Targeting these users fundamentally changes the way we should build analytics systems.
This paper describes our vision for the design of Tupleware, a new system specifically aimed at complex analytics on small clusters. Tupleware's architecture brings together ideas from the database and compiler communities to create a powerful end-to-end solution for data analysis that compiles workflows of user-defined functions into distributed programs. Our preliminary results show performance improvements of up to three orders of magnitude over alternative systems.
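To make the "workflows of user-defined functions" idea concrete, here is a minimal sketch of the kind of pipeline such a system would take as input. The functions and data are illustrative only; a system like Tupleware would compile the whole pipeline into a single distributed program rather than, as here, interpreting each operator locally.

```python
from functools import reduce

# Hypothetical user-defined functions composing an analytics workflow.
def parse(line):            # map: raw CSV line -> (key, value)
    word, count = line.split(",")
    return word, int(count)

def is_frequent(pair):      # filter: keep pairs seen at least twice
    return pair[1] >= 2

def total(acc, pair):       # reduce: sum the surviving counts
    return acc + pair[1]

lines = ["a,3", "b,1", "c,5"]
pairs = map(parse, lines)
kept = filter(is_frequent, pairs)
result = reduce(total, kept, 0)
print(result)  # 8
```

The point of compiling (rather than interpreting) such a workflow is that the map, filter, and reduce UDFs can be fused into one tight loop, which matters most for the compute-intensive, small-cluster workloads the paper targets.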
The Brown Data Management Group has the following papers in SIGMOD 2014:
- PLANET: Making Progress with Commit Processing in Unpredictable Environments
Latency unpredictability in a database system can come from many factors, such as load spikes in the workload, inter-query interactions from consolidation, or communication costs in cloud computing or geo-replication. High variance and high latency environments make developing interactive applications difficult, because transactions may take too long to complete, or fail unexpectedly. We propose Predictive Latency-Aware NEtworked Transactions (PLANET), a new transaction programming model and underlying system support to address this issue. The model exposes the internal progress of the transaction, provides opportunities for application callbacks, and incorporates commit likelihood prediction to enable good user experience even in the presence of significant transaction delays. The mechanisms underlying PLANET can be used for admission control, thus improving overall performance in high contention situations. In the SIGMOD 2014 paper "PLANET: Making Progress with Commit Processing in Unpredictable Environments", we present this new transaction programming model, demonstrate its expressiveness via several use cases, and evaluate its performance using a strongly consistent geo-replicated database across five data centers.
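The abstract's core idea, exposing transaction progress and a commit-likelihood estimate to application callbacks, can be sketched as follows. This is not PLANET's actual API; the class, stage names, and probabilities are all hypothetical placeholders.

```python
# Illustrative sketch of a transaction handle that exposes internal
# progress stages and a commit-likelihood estimate, so the application
# can respond to the user before the final commit arrives.
class TxnHandle:
    STAGES = ["submitted", "accepted", "replicating", "committed"]

    def __init__(self):
        self.stage = "submitted"
        self.callbacks = {}

    def on(self, stage, fn):
        # Register an application callback for a progress stage.
        self.callbacks[stage] = fn

    def commit_likelihood(self):
        # A real system would predict this from load, contention, and
        # replication lag; here we return fixed placeholder values.
        return {"submitted": 0.5, "accepted": 0.8,
                "replicating": 0.95, "committed": 1.0}[self.stage]

    def advance(self, stage):
        # Called by the (simulated) database as the transaction progresses.
        self.stage = stage
        if stage in self.callbacks:
            self.callbacks[stage](self.commit_likelihood())

txn = TxnHandle()
txn.on("accepted", lambda p: print(f"likely to commit: {p:.2f}"))
txn.advance("accepted")      # prints "likely to commit: 0.80"
txn.advance("committed")
```

In a geo-replicated setting, reacting at the "accepted" stage (with a high predicted commit probability) rather than waiting for the cross-datacenter commit is what keeps the application interactive despite wide-area latencies.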
- Finding the True Frequent Itemsets
Matteo Riondato, Fabio Vandin
Frequent Itemsets (FIs) mining is a fundamental primitive in data mining. It requires identifying all itemsets appearing in at least a fraction θ of a transactional dataset D. Often though, the ultimate goal of mining D is not an analysis of the dataset per se, but the understanding of the underlying process that generated it. Specifically, in many applications D is a collection of samples obtained from an unknown probability distribution π on transactions, and by extracting the FIs in D one attempts to infer itemsets that are frequently (i.e., with probability at least θ) generated by π, which we call the True Frequent Itemsets (TFIs). Due to the inherently stochastic nature of the generative process, the set of FIs is only a rough approximation to the set of TFIs, as it often contains a huge number of false positives, i.e., spurious itemsets that are not among the TFIs. In this work we design and analyze an algorithm to identify a threshold θ̂ such that the collection of itemsets with frequency at least θ̂ in D contains only TFIs with probability at least 1 − δ, for some user-specified δ. Our method uses results from statistical learning theory involving the (empirical) VC-dimension of the problem at hand. This allows us to identify almost all the TFIs without including any false positive. We also experimentally compare our method with the direct mining of D at frequency θ and with techniques based on widely-used standard bounds (i.e., the Chernoff bounds) of the binomial distribution, and show that our algorithm outperforms these methods and achieves even better results than what is guaranteed by the theoretical analysis.
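The raised-threshold idea can be illustrated with a toy example: instead of reporting every itemset with empirical frequency at least θ (which admits false positives), report only those above a corrected threshold θ̂ = θ + ε. In the paper, ε comes from a VC-dimension-based bound; in this sketch it is simply a made-up constant, and the dataset is tiny and hypothetical.

```python
from itertools import combinations
from collections import Counter

def itemset_freqs(dataset):
    # Empirical frequency of every itemset in a list of transactions.
    counts = Counter()
    for txn in dataset:
        for r in range(1, len(txn) + 1):
            for s in combinations(sorted(txn), r):
                counts[s] += 1
    n = len(dataset)
    return {s: c / n for s, c in counts.items()}

dataset = [{"a", "b"}, {"a"}, {"a", "b"}, {"b"}, {"a", "b", "c"}]
theta = 0.4
epsilon = 0.3            # placeholder for the bound-derived slack
freqs = itemset_freqs(dataset)

# Naive mining at theta vs. conservative mining at theta_hat.
fis  = {s for s, f in freqs.items() if f >= theta}
tfis = {s for s, f in freqs.items() if f >= theta + epsilon}
print(sorted(fis))   # itemsets reported at theta
print(sorted(tfis))  # smaller set reported at theta_hat
```

Here {a, b} (empirical frequency 0.6) survives the naive threshold but not the corrected one, which is exactly the kind of borderline itemset that could be a false positive under the generative distribution π. The paper's contribution is choosing ε rigorously so the reported set contains only TFIs with probability at least 1 − δ.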