Ad-Hoc, Interactive Search and Exploration over Big Data

Summary

Exploratory data analysis plays a key role in data-driven discovery in a wide range of domains, including science, engineering, and business. This project aims to enable data scientists from many domains to search and explore their large data sets far more easily and quickly than they can today. Rather than spending considerable time setting up exploration pipelines by combining multiple software tools, users will work with a single, general-purpose, and more usable system. Overall, this project will enable fundamentally richer means of data exploration and lead to significant productivity improvements; it will accelerate discovery and breakthroughs in many domains such as e-commerce, finance, and science. This research will be incorporated into undergraduate and graduate coursework. The outreach activities include special research- and education-focused programs geared towards undergraduates and high-school girls.

This research is building a new prototype database system, called Searchlight, that uniquely integrates constraint solving and data management techniques. The result enables rich, highly efficient means for generic ad hoc search, exploration, and mining over large multidimensional data collections. Searchlight allows Constraint Programming (CP) machinery to run efficiently inside a DBMS without the need to extract, transform, and move the data. This marriage combines the rich expressiveness and efficiency of constraint-based search and optimization provided by modern CP solvers with the ability of Database Management Systems (DBMSs) to store and query data at scale.

Searchlight is a transformative step in enriching the functionality of database systems towards new data- and search-intensive applications. We are developing novel approaches for synopsis-based in-memory processing, speculative solving, search query optimization, parallel processing, and load balancing, which collectively yield performance and usability levels that far exceed those of the state of the art.
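To give a rough sense of what synopsis-based search can look like, the following is a minimal illustrative sketch in Python, not Searchlight's actual implementation. It assumes a one-dimensional array, a per-block (min, max) synopsis, and a query that asks for fixed-width windows whose average falls in a given range. The function names (build_synopsis, avg_bounds, search) and the block layout are hypothetical and chosen only for illustration. Candidate windows are first checked against bounds derived from the in-memory synopsis alone, and only the survivors are validated against the real data.

```python
from typing import List, Tuple

def build_synopsis(data: List[float], block: int) -> List[Tuple[float, float]]:
    """Summarize the data as one (min, max) pair per fixed-size block."""
    return [(min(data[i:i + block]), max(data[i:i + block]))
            for i in range(0, len(data), block)]

def avg_bounds(synopsis: List[Tuple[float, float]], block: int,
               start: int, width: int) -> Tuple[float, float]:
    """Bound the average of data[start:start+width] using only the synopsis."""
    lo_sum = hi_sum = 0.0
    pos, end = start, start + width
    while pos < end:
        b = pos // block                       # block that contains position pos
        block_end = min((b + 1) * block, end)  # stop at block or window boundary
        count = block_end - pos                # cells of this block in the window
        bmin, bmax = synopsis[b]
        lo_sum += bmin * count
        hi_sum += bmax * count
        pos = block_end
    return lo_sum / width, hi_sum / width

def search(data: List[float], block: int, width: int,
           lo: float, hi: float) -> Tuple[List[int], int]:
    """Return start offsets of windows with lo <= average <= hi,
    plus the number of windows that had to be validated on real data."""
    synopsis = build_synopsis(data, block)
    results, validated = [], 0
    for start in range(len(data) - width + 1):
        est_lo, est_hi = avg_bounds(synopsis, block, start, width)
        if est_hi < lo or est_lo > hi:
            continue  # the synopsis proves this window cannot qualify
        validated += 1  # candidate: check against the actual data
        avg = sum(data[start:start + width]) / width
        if lo <= avg <= hi:
            results.append(start)
    return results, validated

data = [1, 2, 1, 2, 8, 9, 10, 9, 2, 1, 2, 1]
results, validated = search(data, block=4, width=4, lo=8, hi=10)
print(results, validated)
```

In this toy setting the synopsis bounds never exclude a true answer, so pruning affects only how many windows touch the underlying data, not the result set.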

We are also studying alternative data summarization techniques that can be used profitably for data exploration and search in main-memory systems. Our early results show that deep learning models offer great promise in representing complex multidimensional data distributions and thus can be used effectively for this purpose.

Participants:

  • Faculty:
    • Ugur Cetintemel
    • Stan Zdonik
  • Students:
    • Alex Kalinin (PhD, now at Vertica)
    • Sam Zhao (PhD student)
    • Emir Ilkhechi (PhD student)
    • Michael Mao (Undergraduate)
    • Grace Fan (Undergraduate)
    • Xiran Shi (Undergraduate)

Products:

  • Alexander Kalinin, Sam Zhao, Ugur Cetintemel, Stan Zdonik, Dynamic Query Refinements for Interactive Data Exploration, EDBT’20.
  • Emir Yilkici, Alex Galakatos, Andrew Crotty, Grace Fan, Xiran Shi, Ugur Cetintemel, Deep Semantic Compression for Tabular Data, SIGMOD’20 (under revision).
  • Emir Yilkici, Alex Galakatos, Andrew Crotty, Grace Fan, Xiran Shi, Ugur Cetintemel, Deep Compression for Tabular Data (poster), NEDB Symposium, January 2018.
  • Alexander Kalinin, Ugur Cetintemel, Stan Zdonik, Interactive Search and Exploration of Waveform Data with Searchlight (demo), ACM SIGMOD 2016.
  • Alexander Kalinin, Ugur Cetintemel, Stan Zdonik, Searchlight: Enabling Integrated Search and Exploration over Large Multidimensional Data, PVLDB 2015.
  • A. Elmore, J. Duggan, M. Stonebraker, M. Balazinska, U. Cetintemel, V. Gadepally, J. Heer, B. Howe, J. Kepner, T. Kraska, S. Madden, D. Maier, T. Mattson, S. Papadopoulos, J. Parkhurst, N. Tatbul, M. Vartak, S. Zdonik, A Demonstration of the BigDAWG Polystore System (demo), PVLDB 2015.
  • Alexander Kalinin, Ugur Cetintemel, Stan Zdonik, Interactive Data Exploration Using Semantic Windows, ACM SIGMOD 2014.
  • Ugur Cetintemel, Mitch Cherniack, Justin DeBrabant, Yanlei Diao, Kyriaki Dimitriadou, Alexander Kalinin, Olga Papaemmanouil, Stanley B. Zdonik, Query Steering for Interactive Data Exploration, CIDR 2013.

Acknowledgements:

The Searchlight project is supported by the NSF grant IIS-1526639.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.