Archive for the ‘Accepted Papers’ Category

CIKM 2012 Accepted Paper

July 17th, 2012

The Brown Data Management Group has the following paper in CIKM for the Longview project:

  • PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce
       Matteo Riondato, Justin DeBrabant, Rodrigo Fonseca, Eli Upfal

    We present a novel randomized parallel technique for mining Frequent Itemsets and Association Rules. Our mining algorithm, PARMA, achieves near-linear speedup while avoiding costly replication of data. PARMA does this by creating multiple small random samples of the transactional dataset and running a mining algorithm on the samples independently and in parallel. The resulting collections of Frequent Itemsets or Association Rules from each sample are aggregated and filtered to provide a single collection in output. Because PARMA mines random subsets of the dataset, the final result is an approximation of the exact solution. Our probabilistic analysis shows that PARMA provides tight guarantees on the quality of the approximation. The end user specifies accuracy and confidence parameters and PARMA computes an approximation of the collection of interest that satisfies these parameters. We formulated and implemented the algorithm in the MapReduce parallel computation framework. Our experimental results show that in practice the quality of the approximation is even higher than what can be analytically guaranteed. We demonstrate the correctness and scalability of PARMA by testing it on several synthetic datasets of varying size and complexity. We compare our results to two previously proposed exact parallel mining algorithms in MapReduce.
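    As a rough illustration of the sample, mine, and aggregate pattern the abstract describes, here is a minimal single-machine sketch in Python. It is not the PARMA implementation: the function names, the brute-force miner, and the aggregation rule (keep an itemset if it is frequent in at least `min_votes` samples, and report its mean sampled frequency) are all assumptions made for illustration. The abstract does not specify the actual filter or how the sample sizes are derived from the accuracy and confidence parameters, and the real algorithm runs on MapReduce rather than in a loop.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations


def mine_frequent_itemsets(sample, min_freq, max_size=2):
    """Hypothetical brute-force miner: {itemset: frequency} for itemsets
    of up to max_size items whose frequency in the sample is >= min_freq."""
    counts = Counter()
    for transaction in sample:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    n = len(sample)
    return {s: c / n for s, c in counts.items() if c / n >= min_freq}


def sample_mine_aggregate(dataset, num_samples, sample_size, min_freq,
                          min_votes, seed=0):
    """Mine several random samples independently, then aggregate.

    Assumed aggregation rule (not taken from the paper): keep an itemset
    only if it was frequent in at least min_votes samples, and report the
    mean of its per-sample frequencies.
    """
    rng = random.Random(seed)
    estimates = defaultdict(list)
    for _ in range(num_samples):
        # Each sample is drawn uniformly at random, with replacement.
        sample = rng.choices(dataset, k=sample_size)
        for itemset, freq in mine_frequent_itemsets(sample, min_freq).items():
            estimates[itemset].append(freq)
    return {
        itemset: sum(freqs) / len(freqs)
        for itemset, freqs in estimates.items()
        if len(freqs) >= min_votes
    }
```

    Each pass through the loop touches only its own sample, which is what makes the per-sample mining step easy to run in parallel; only the final aggregation needs to see all of the partial results.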

The ACM International Conference on Information and Knowledge Management (CIKM 2012) will be held from October 29 to November 2, 2012, in Maui, USA.

Accepted Papers

ECML PKDD 2012 Accepted Paper

July 17th, 2012

The Brown Data Management Group has the following paper in ECML PKDD for the Longview project:

  • Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees
       Matteo Riondato, Eli Upfal

    The tasks of extracting (top-K) Frequent Itemsets (FI’s) and Association Rules (AR’s) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need to scan the entire dataset, possibly multiple times. High quality approximations of FI’s and AR’s are sufficient for most practical uses, and a number of recent works explored the application of sampling for fast discovery of approximate solutions to the problems. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. In this work we circumvent this issue by applying the statistical concept of Vapnik-Chervonenkis (VC) dimension to develop a novel technique for providing tight bounds on the sample size that guarantees approximation within user-specified parameters. Our technique applies both to absolute and to relative approximations of (top-K) FI’s and AR’s. The resulting sample size is linearly dependent on the VC-dimension of a range space associated with the dataset to be mined. The main theoretical contribution of this work is a proof that the VC-dimension of this range space is upper bounded by an easy-to-compute characteristic quantity of the dataset, namely, the maximum integer d such that the dataset contains at least d transactions of length at least d. We show that this bound is the best possible for a large class of datasets.
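    The characteristic quantity named at the end of the abstract can be computed in a few lines. The sketch below follows the definition exactly as stated there (the maximum integer d such that at least d transactions have length at least d). The function name `max_d` is a hypothetical label, and counting the distinct items in a transaction as its length is an assumption. The sample-size formula itself is not reproduced here, since the abstract says only that it depends linearly on the VC-dimension.

```python
def max_d(transactions):
    """Largest d such that at least d transactions have length >= d.

    A transaction's length is taken to be its number of distinct items
    (an assumption; the abstract only says "length").
    """
    # Sort lengths in decreasing order, so the transaction at rank r
    # (1-based) is the r-th longest one.
    lengths = sorted((len(set(t)) for t in transactions), reverse=True)
    d = 0
    for rank, length in enumerate(lengths, start=1):
        if length >= rank:
            d = rank  # the `rank` longest transactions all have length >= rank
        else:
            break  # lengths only shrink from here, so no larger d can work
    return d
```

    For example, on the three transactions `[1, 2, 3]`, `[1, 2]`, and `[1]`, two transactions have length at least 2 but there are not three of length at least 3, so the function returns 2. After the sort, the scan is a single pass over the transaction lengths, which is what makes the quantity cheap to obtain compared with mining the dataset itself.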

The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) will take place in Bristol, UK, from September 24th to 28th, 2012.

HotCDP 2012 Accepted Paper

April 29th, 2012

The Brown Data Management Research Group has the following paper in HotCDP 2012 from the C-MR Project:

  • Managing Parallelism for Stream Processing in the Cloud
       Nathan Backman, Rodrigo Fonseca, Ugur Cetintemel

    We present a framework that parallelizes and schedules workflows of stream operators, in real-time, to meet latency objectives. It supports data- and task-parallel processing of all workflow operators, by all computing nodes, while maintaining the ordering properties of sorted data streams. We show that a latency-oriented operator scheduling policy coupled with the diversification of computing node responsibilities encourages parallelism models that achieve end-to-end latency-minimization goals. We demonstrate the effectiveness of our framework with preliminary experimental results using a variety of real-world applications on heterogeneous clusters.
