CIKM 2012 Accepted Paper
- PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce
Matteo Riondato, Justin DeBrabant, Rodrigo Fonseca, Eli Upfal
We present a novel randomized parallel technique for mining Frequent Itemsets and Association Rules. Our mining algorithm, PARMA, achieves near-linear speedup while avoiding costly replication of data. PARMA does this by creating multiple small random samples of the transactional dataset and running a mining algorithm on the samples independently and in parallel. The resulting collections of Frequent Itemsets or Association Rules from each sample are aggregated and ﬁltered to provide a single collection in output. Because PARMA mines random subsets of the dataset, the ﬁnal result is an approximation of the exact solution. Our probabilistic analysis shows that PARMA provides tight guarantees on the quality of the approximation. The end user speciﬁes accuracy and conﬁdence parameters and PARMA computes an approximation of the collection of interest that satisﬁes these parameters. We formulated and implemented the algorithm in the MapReduce parallel computation framework. Our experimental results show that in practice the quality of the approximation is even higher than what can be analytically guaranteed. We demonstrate the correctness and scalability of PARMA by testing it on several synthetic datasets of varying size and complexity. We compare our results to two previously proposed exact parallel mining algorithms in MapReduce.