Author Archive

KDD 2015 Accepted Paper

May 13th, 2015
Comments Off

The Brown Data Management Group has the following paper in KDD 2015:

  • Mining Frequent Itemsets through Progressive Sampling with Rademacher Averages
       Matteo Riondato and Eli Upfal

    We present an algorithm to extract an high-quality approximation of the (top-k) Frequent itemsets (FIs) from random samples of a transactional dataset. With high probability the approximation is a superset of the FIs, and no itemset with frequency much lower than the threshold is included in it. The algorithm employs progressive sampling, with a stopping condition based on bounds to the empirical Rademacher average, a key concept from statistical learning theory. The computation of the bounds uses characteristic quantities that can be obtained efficiently with a single scan of the sample. Therefore, evaluating the stopping condition is fast, and does not require an expensive mining of each sample. Our experimental evaluation confirms the practicality of our approach on real datasets, outperforming approaches based on one-shot static sampling.

  • Accepted Papers

SDM 2014 Accepted Paper

January 2nd, 2014
Comments Off

The Brown Data Management Group has the following paper in SDM for the BigData project:

  • Finding the True Frequent Itemsets
       Matteo Riondato, Fabio Vandin

    Frequent Itemsets (FIs) mining is a fundamental primitive in data mining. It requires to identify all itemsets appearing in at least a fraction θ of a transactional dataset D. Often though, the ultimate goal of mining D is not an analysis of the dataset per se, but the understanding of the underlying process that generated it. Specif- ically, in many applications D is a collection of samples obtained from an unknown probability distribution π on transactions, and by extracting the FIs in D one attempts to infer itemsets that are frequently (i.e., with probability at least θ) generated by π, which we call the True Frequent Itemsets (TFIs). Due to the inherently stochastic nature of the generative process, the set of FIs is only a rough approximation to the set of TFIs, as it often contains a huge number of false positives, i.e., spurious itemsets that are not among the TFIs. In this work we design and analyze an algorithm to iden- tify a threshold θˆ such that the collection of itemsets with frequency at least θˆ in D contains only TFIs with probability at least 1 − δ, for some user-specified δ. Our method uses results from statisti- cal learning theory involving the (empirical) VC-dimension of the problem at hand. This allows us to identify almost all the TFIs with- out including any false positive. We also experimentally compare our method with the direct mining of D at frequency θ and with techniques based on widely-used standard bounds (i.e., the Cher- noff bounds) of the binomial distribution, and show that our algo- rithm outperforms these methods and achieves even better results than what is guaranteed by the theoretical analysis.

  • Accepted Papers

ACM TKDD Accepted Paper

December 2nd, 2013
Comments Off

The Brown Data Management Group has the following paper in ACM Transactions on Knowledge Discovery from Data for the BigData and Longview projects:

  • Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees
       Matteo Riondato, Eli Upfal

    The tasks of extracting (top-K) Frequent Itemsets (FI’s) and Association Rules (AR’s) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need of scanning the entire dataset, possibly multiple times. High quality approximations of FI’s and AR’s are sufficient for most practical uses. Sampling techniques can be used for fast discovery of approximate solutions, but works exploring this technique did not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. We circumvent this issue by applying the statistical concept of Vapnik-Chervonenkis (VC) dimension to develop a novel technique for providing tight bounds on the sample size that guarantees approximation of the (top-K) FI’s and AR’s within user-specified parameters. The resulting sample size is linearly dependent on the VC-dimension of a range space associated with the dataset. We analyze the VC-dimension of this range space and show that it is upper bounded by an easy-to-compute characteristic quantity of the dataset, the d-index, namely the maximum integer d such that the dataset contains at least d transactions of length at least d such that no one of them is a superset of or equal to another. We show that this bound is tight for a large class of datasets. The resulting sample size is a significant improvement over previous known results. We present an extensive experimental evaluation of our technique on real and artificial datasets, demonstrating the practicality of our methods, and showing that they achieve even higher quality approximations than what is guaranteed by the analysis.

  • Accepted Papers