The Brown Data Management Group has the following paper (authored entirely by students!) in WSDM for the BigData project:

**Fast Estimation of Betweenness Centrality through Sampling**

Matteo Riondato, Evgenios M. Kornaropoulos
Betweenness centrality is a fundamental measure in social network analysis, expressing the importance or influence of individual vertices in a network in terms of the fraction of shortest paths that pass through them. Exact computation in large networks is prohibitively expensive, and fast approximation algorithms are required in these cases. We present two efficient randomized algorithms for betweenness estimation. The algorithms are based on random sampling of shortest paths and offer probabilistic guarantees on the quality of the approximation. The first algorithm estimates the betweenness of all vertices: all approximated values are within an additive factor ε of the real values, with probability at least 1 − δ. The second algorithm focuses on the top-K vertices with highest betweenness and approximates their betweenness within a multiplicative factor ε, with probability at least 1 − δ. This is the first algorithm that can compute such an approximation for the top-K vertices. We use results from VC-dimension theory to develop bounds on the sample size needed to achieve the desired approximations. By proving upper and lower bounds on the VC-dimension of a range set associated with the problem at hand, we obtain a sample size that is independent of the number of vertices in the network and depends only on a characteristic quantity that we call the vertex-diameter, that is, the maximum number of vertices in a shortest path. In some cases, the sample size is completely independent of any property of the graph. The extensive experimental evaluation that we performed using real and artificial networks shows that, as the number of vertices in the network grows, our algorithms are significantly faster and much more scalable than previously presented algorithms with similar approximation guarantees.
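The core sampling idea behind the abstract can be sketched in a few lines: repeatedly pick a random vertex pair, choose one shortest path between them uniformly at random, and credit 1/r to each internal vertex of that path. The sketch below is a minimal illustration of that estimator on an unweighted graph (BFS with path counting); it is not the paper's exact algorithm, and in particular it omits the VC-dimension-based choice of the sample size r.

```python
import random
from collections import deque

def sample_shortest_path(adj, u, v):
    """BFS from u, counting shortest paths; then walk back from v,
    picking each predecessor with probability proportional to its
    path count, so the returned shortest u-v path is uniform among
    all shortest u-v paths. Returns None if v is unreachable."""
    dist, sigma, preds = {u: 0}, {u: 1}, {u: []}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y], sigma[y], preds[y] = dist[x] + 1, 0, []
                queue.append(y)
            if dist[y] == dist[x] + 1:
                sigma[y] += sigma[x]
                preds[y].append(x)
    if v not in dist:
        return None
    path = [v]
    while path[-1] != u:
        ps = preds[path[-1]]
        path.append(random.choices(ps, weights=[sigma[p] for p in ps])[0])
    path.reverse()
    return path

def approx_betweenness(adj, r):
    """Estimate normalized betweenness from r sampled pairs:
    each internal vertex of the sampled path gets credit 1/r."""
    nodes = list(adj)
    btw = {x: 0.0 for x in nodes}
    for _ in range(r):
        u, v = random.sample(nodes, 2)
        path = sample_shortest_path(adj, u, v)
        if path:
            for x in path[1:-1]:
                btw[x] += 1.0 / r
    return btw
```

On the path graph a–b–c, only b ever appears as an internal vertex, so its estimate concentrates around its true (normalized) betweenness while a and c stay at zero.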

Matteo Riondato Accepted Papers

The Brown Data Management Group has the following paper in CIKM for the Longview project:

**PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce**

Matteo Riondato, Justin DeBrabant, Rodrigo Fonseca, Eli Upfal
We present a novel randomized parallel technique for mining Frequent Itemsets and Association Rules. Our mining algorithm, PARMA, achieves near-linear speedup while avoiding costly replication of data. PARMA does this by creating multiple small random samples of the transactional dataset and running a mining algorithm on the samples independently and in parallel. The resulting collections of Frequent Itemsets or Association Rules from each sample are aggregated and filtered to provide a single collection in output. Because PARMA mines random subsets of the dataset, the final result is an approximation of the exact solution. Our probabilistic analysis shows that PARMA provides tight guarantees on the quality of the approximation. The end user specifies accuracy and confidence parameters and PARMA computes an approximation of the collection of interest that satisfies these parameters. We formulated and implemented the algorithm in the MapReduce parallel computation framework. Our experimental results show that in practice the quality of the approximation is even higher than what can be analytically guaranteed. We demonstrate the correctness and scalability of PARMA by testing it on several synthetic datasets of varying size and complexity. We compare our results to two previously proposed exact parallel mining algorithms in MapReduce.
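The sample-mine-aggregate pattern described in the abstract can be sketched on a single machine. In the sketch below, the local miner is a deliberate stand-in that only finds frequent single items (a real deployment would run a full itemset miner in each MapReduce reducer), and the aggregation step keeps results returned by a majority of samples; the parameter names and majority-vote filter are illustrative assumptions, not PARMA's exact filtering rule.

```python
import random
from collections import Counter

def mine_frequent(sample, theta):
    """Stand-in local miner: items with relative frequency >= theta
    in the sample. A real miner would return frequent itemsets."""
    counts = Counter(item for t in sample for item in set(t))
    n = len(sample)
    return {item for item, c in counts.items() if c / n >= theta}

def parma_sketch(dataset, theta, n_samples=5, sample_size=100, seed=0):
    """Draw n_samples random samples of the dataset, mine each one
    independently (in parallel, in PARMA's MapReduce setting), then
    aggregate: keep results reported by a majority of the samples."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        sample = [rng.choice(dataset) for _ in range(sample_size)]
        for item in mine_frequent(sample, theta):
            votes[item] += 1
    majority = n_samples // 2 + 1
    return {item for item, v in votes.items() if v >= majority}
```

Because each sample is small, every local mining run is cheap, and the majority vote suppresses items that only appear frequent in an unlucky sample.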

The 21st ACM International Conference on Information and Knowledge Management (CIKM 2012) will be held from October 29 to November 2, 2012 in Maui, USA.


The Brown Data Management Group has the following paper in ECML PKDD for the Longview project:

**Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees**

Matteo Riondato, Eli Upfal
The tasks of extracting (top-K) Frequent Itemsets (FI's) and Association Rules (AR's) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need to scan the entire dataset, possibly multiple times. High-quality approximations of FI's and AR's are sufficient for most practical uses, and a number of recent works have explored the application of sampling for fast discovery of approximate solutions to the problems. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. In this work we circumvent this issue by applying the statistical concept of Vapnik-Chervonenkis (VC) dimension to develop a novel technique for providing tight bounds on the sample size that guarantees approximation within user-specified parameters. Our technique applies both to absolute and to relative approximations of (top-K) FI's and AR's. The resulting sample size is linearly dependent on the VC-dimension of a range space associated with the dataset to be mined. The main theoretical contribution of this work is a proof that the VC-dimension of this range space is upper bounded by an easy-to-compute characteristic quantity of the dataset, namely, the maximum integer d such that the dataset contains at least d transactions of length at least d. We show that this bound is the best possible for a large class of datasets.
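The characteristic quantity in the abstract (the maximum integer d such that the dataset contains at least d transactions of length at least d) is an h-index-style statistic over transaction lengths, and it is indeed easy to compute: sort the lengths in decreasing order and find the last position where the length still meets or exceeds its rank. A minimal sketch (the function name is ours, not the paper's):

```python
def max_d(dataset):
    """Largest d such that at least d transactions have length >= d:
    sort lengths in decreasing order and take the last rank i whose
    length is still >= i. Per the abstract, this quantity upper
    bounds the VC-dimension of the associated range space."""
    lengths = sorted((len(t) for t in dataset), reverse=True)
    d = 0
    for i, ell in enumerate(lengths, start=1):
        if ell >= i:
            d = i
        else:
            break
    return d
```

For example, on a dataset with transaction lengths {4, 3, 2, 1} the quantity is 2: two transactions have length at least 2, but only two (not three) have length at least 3.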

The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) will take place in Bristol, UK from September 24th to 28th, 2012.
