QUDE: Quantifying Uncertainty in Data Exploration
Data scientists commonly acquire and integrate multiple data sources for data exploration. Yet even with a perfectly cleaned and merged data set, two fundamental questions remain:
- Is the integrated data set complete?
- What is the impact of any unknown data on data exploration results?
Answering these questions is key to reasoning about both the correctness and the completeness of data exploration results in an open world. Unfortunately, many traditional query processing techniques and statistical survey methodologies fall short: they either ignore the possibility of an incomplete data set (i.e., they make a closed-world assumption) or require prior knowledge of the population, which is often not readily available.
In this project, we develop and analyze techniques to quantify the uncertainty in such scenarios. The key idea is that the overlap between different data sources enables us to estimate the uncertainty and its impact on data exploration results.
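To make the overlap idea concrete, here is a minimal sketch of how overlap between sources can hint at how much data is still missing. It uses the bias-corrected Chao1 species estimator from the ecology literature as a stand-in; this is an illustrative assumption, not necessarily the estimator developed in this project. The function name `chao1_estimate` and the toy sources are hypothetical, and each appearance of an entity in a source is assumed to count as one independent observation.

```python
from collections import Counter


def chao1_estimate(sources):
    """Estimate the total number of distinct entities, seen or unseen.

    `sources` is an iterable of collections of entity identifiers.
    Assumption: each source mentioning an entity counts as one observation.
    """
    # Count in how many sources each entity appears (duplicates within a
    # single source are ignored).
    counts = Counter(item for source in sources for item in set(source))
    observed = len(counts)
    # f1: entities seen in exactly one source; f2: in exactly two sources.
    f1 = sum(1 for n in counts.values() if n == 1)
    f2 = sum(1 for n in counts.values() if n == 2)
    # Bias-corrected Chao1: many singletons relative to doubletons
    # suggest that many entities have not been observed at all.
    return observed + f1 * (f1 - 1) / (2 * (f2 + 1))


# Toy example with three overlapping (hypothetical) sources.
source_a = ["a", "b", "c", "d"]
source_b = ["b", "c", "e"]
source_c = ["c", "d", "f"]

estimate = chao1_estimate([source_a, source_b, source_c])
observed = len(set(source_a) | set(source_b) | set(source_c))
print(f"observed={observed}, estimated total={estimate}")
```

In this toy case six distinct entities are observed, three of them in only one source and two in exactly two sources, so the estimate is slightly above the observed count. With little overlap (many singletons) the estimated gap grows; with heavy overlap it shrinks toward zero.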
- Estimating the Impact of Unknown Unknowns on Aggregate Query Results
Yeounoh Chung, Michael Lind Mortensen, Carsten Binnig, Tim Kraska. 2016 ACM SIGMOD, 861-876