Data Management System for Scientific Applications
Summary
The goal of this project is to build an efficient and scalable DBMS specialized for data-intensive scientific applications, such as astronomy analyses that process petabytes of telescope data per year. Traditional (One-Size-Fits-All) DBMSs cannot handle such extremely large data efficiently. Moreover, scientific data is often represented as multi-dimensional arrays and inherently contains uncertainty caused by measurement errors. In 2007, we built a prototype array storage system called ASAP (Array Store And Processor). We are extending that work to address a wider range of problems.
SciDB
SciDB is an open-source project to build a scalable data management system for scientific applications. Several research institutions, including Brown, and commercial companies are jointly working on it. Our work is part of the SciDB project. Here is a web page for SciDB developers.
Uncertain Data Management
Essentially all scientific data that results from real-world observations is fundamentally uncertain. Previous work addressed inaccurate or probabilistic data in traditional databases. Our project focuses on uncertainty in multidimensional array processing. In this context, uncertainty can arise in several ways:
- Value uncertainty: An array value invariably has measurement error, so the actual value is uncertain. This is the kind of uncertainty typically handled by probabilistic data support in databases.
- Position (dimension value) uncertainty: In certain cases, the very position of the measurement is imprecise, as opposed to the obtained data value. Accordingly, the dimension values in the array are uncertain.
- Result uncertainty of functions or predicates: Some functions or predicates, even when applied to deterministic data, produce uncertain results. For example, the LOCATE operator, which does pattern matching, may introduce uncertainty in the results, due to the exact nature of the data and the matching algorithms.
How to succinctly represent uncertain data and efficiently process it in databases has been an open problem for a long time. Our main focus so far has been dealing with value uncertainty. We also have some ongoing work that deals with correlated uncertain values in arrays.
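As a rough illustration of value uncertainty, the sketch below stores each array cell as a Gaussian (mean and standard deviation) and evaluates a probabilistic threshold predicate over a small 2-D array. The Gaussian error model, the independence assumption in `uncertain_sum`, and all names (`UncertainValue`, `select_cells`) are assumptions made for this example; they are not the representation used in ASAP or SciDB.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class UncertainValue:
    """One array cell: a measurement with Gaussian error (assumed model)."""
    mean: float
    std: float  # standard deviation of the measurement error

    def prob_greater_than(self, threshold: float) -> float:
        """P(value > threshold) under the Gaussian assumption."""
        if self.std == 0:
            # Deterministic cell: the predicate is simply true or false.
            return 1.0 if self.mean > threshold else 0.0
        z = (threshold - self.mean) / (self.std * math.sqrt(2.0))
        return 0.5 * (1.0 - math.erf(z))


def uncertain_sum(cells):
    """Sum of cells, assuming their errors are independent.

    Means add and variances add; correlated cells would need covariances.
    """
    cells = list(cells)
    mean = sum(c.mean for c in cells)
    variance = sum(c.std ** 2 for c in cells)
    return UncertainValue(mean, math.sqrt(variance))


def select_cells(array, threshold, min_prob):
    """Return (row, col, probability) for every cell whose value exceeds
    `threshold` with probability at least `min_prob`."""
    hits = []
    for i, row in enumerate(array):
        for j, cell in enumerate(row):
            p = cell.prob_greater_than(threshold)
            if p >= min_prob:
                hits.append((i, j, p))
    return hits


if __name__ == "__main__":
    grid = [
        [UncertainValue(10.0, 1.0), UncertainValue(0.0, 1.0)],
        [UncertainValue(5.0, 0.0), UncertainValue(20.0, 2.0)],
    ]
    for i, j, p in select_cells(grid, threshold=5.0, min_prob=0.9):
        print(f"cell ({i}, {j}) exceeds 5.0 with probability {p:.4f}")
```

A query result in this model is a set of cells with attached probabilities rather than a plain set of cells. Position uncertainty and correlated values would need a richer representation than this per-cell pair.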
Cost-Based Design and Evaluation of Compression Techniques for Multidimensional Data
Compression allows scalable storage of large data sets and alleviates the I/O bottleneck for data-intensive applications. Over the years, a large number of compression schemes have been developed to support various data types. Despite this vast array of alternatives, there has been relatively little work on decision-support tools and algorithms that help decide which compression scheme(s) best match the expectations and constraints of applications and workloads with respect to compression time (encoding and decoding), size (compression ratio), and quality (lossiness). We are also studying a generic decision-support framework that can be used to efficiently and accurately pick the best compression scheme(s) to meet application-specific objectives and constraints defined over the Time-Space-Quality (TSQ) space.
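To make the TSQ idea concrete, here is a minimal sketch that profiles a few codecs from the Python standard library and picks the one with the best compression ratio among those that satisfy time, size, and quality constraints. The `Profile` record, the `measure` and `pick_scheme` helpers, and the "maximize ratio subject to constraints" objective are assumptions for illustration only, not the framework under study. All three codecs here are lossless, so the quality dimension is trivially zero error.

```python
import bz2
import lzma
import time
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Measured position of one scheme in the Time-Space-Quality space."""
    name: str
    seconds: float  # encode + decode time
    ratio: float    # original size / compressed size (higher is better)
    error: float    # 0.0 for lossless schemes


# Standard-library codecs used as stand-ins for candidate schemes.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def measure(name, data):
    """Profile one codec on a non-empty bytes sample."""
    encode, decode = CODECS[name]
    start = time.perf_counter()
    packed = encode(data)
    restored = decode(packed)
    elapsed = time.perf_counter() - start
    error = 0.0 if restored == data else 1.0
    return Profile(name, elapsed, len(data) / len(packed), error)


def pick_scheme(profiles, max_seconds, min_ratio, max_error):
    """Return the feasible profile with the best ratio, or None."""
    feasible = [
        p for p in profiles
        if p.seconds <= max_seconds
        and p.ratio >= min_ratio
        and p.error <= max_error
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p.ratio)


if __name__ == "__main__":
    sample = bytes(range(256)) * 4096
    profiles = [measure(name, sample) for name in CODECS]
    best = pick_scheme(profiles, max_seconds=1.0, min_ratio=2.0, max_error=0.0)
    print(best)
```

Measuring every candidate on the full data set is exactly the cost a decision-support tool would try to avoid; a practical version would estimate profiles from samples or from data statistics.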
This work also addresses the multi-objective compression problem for multidimensional data. First, we compare and analyze how a collection of standard algorithms perform when applied natively to real-world multidimensional scientific data. In addition to exploring extensions that can recognize and leverage multidimensional and temporal patterns, we also introduce an effective hybrid compression scheme that partitions the data into multidimensional chunks and applies different compression schemes across them.
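The hybrid idea can be sketched as follows: split a 2-D array into fixed-size tiles, compress each tile with every candidate codec, and keep whichever output is smallest for that tile. This is only an illustration under simplifying assumptions (a rectangular list of lists holding signed 32-bit integers, standard-library codecs, and "smallest output" as the sole objective); the names `compress_hybrid` and `decompress_hybrid` are hypothetical and do not describe the actual scheme.

```python
import bz2
import lzma
import struct
import zlib

# Candidate codecs; a real system might add array-specific schemes.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def chunks(array, tile):
    """Yield ((row, col) origin, block) for each tile of a 2-D list."""
    rows, cols = len(array), len(array[0])
    for r in range(0, rows, tile):
        for c in range(0, cols, tile):
            yield (r, c), [row[c:c + tile] for row in array[r:r + tile]]


def encode_block(block):
    """Compress one block with every codec and keep the smallest result."""
    flat = [v for row in block for v in row]
    raw = struct.pack(f"<{len(flat)}i", *flat)  # little-endian int32
    name, payload = min(
        ((name, enc(raw)) for name, (enc, _) in CODECS.items()),
        key=lambda item: len(item[1]),
    )
    return name, len(block), len(block[0]), payload


def decode_block(name, n_rows, n_cols, payload):
    """Invert encode_block using the codec recorded for this chunk."""
    raw = CODECS[name][1](payload)
    flat = struct.unpack(f"<{n_rows * n_cols}i", raw)
    return [list(flat[i * n_cols:(i + 1) * n_cols]) for i in range(n_rows)]


def compress_hybrid(array, tile):
    """Map each chunk origin to (codec name, rows, cols, payload)."""
    return {origin: encode_block(block) for origin, block in chunks(array, tile)}


def decompress_hybrid(store, rows, cols):
    """Rebuild the full array from the per-chunk store."""
    out = [[0] * cols for _ in range(rows)]
    for (r, c), (name, n_rows, n_cols, payload) in store.items():
        block = decode_block(name, n_rows, n_cols, payload)
        for i in range(n_rows):
            out[r + i][c:c + n_cols] = block[i]
    return out


if __name__ == "__main__":
    data = [[(r * 7 + c) % 5 for c in range(10)] for r in range(6)]
    store = compress_hybrid(data, tile=4)
    print({origin: entry[0] for origin, entry in store.items()})
    print(decompress_hybrid(store, 6, 10) == data)
```

Because the codec name is stored with each chunk, different regions of the same array can use different schemes, and a reader only needs to decode the chunks that a query touches.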
Publications
- M. Stonebraker, J. Becla, D. Dewitt, K. Lim, D. Maier, O. Ratzesberger, and S. Zdonik, "Requirements for Science Data Bases and SciDB," in Conference on Innovative Data Systems Research (CIDR), 2009. [PDF] [BIBTEX]
@inproceedings{stonebraker09,
author = {Stonebraker, Michael and Becla, Jacek and Dewitt, David and Lim, Kian-Tat and Maier, David and Ratzesberger, Oliver and Zdonik, Stan },
booktitle = {Conference on Innovative Data Systems Research (CIDR)},
location = {Asilomar, CA, USA},
month = {January},
title = {Requirements for Science Data Bases and SciDB},
url = {http://www-db.cs.wisc.edu/cidr/cidr2009/Paper_26.pdf},
year = {2009},
project = {scidb},
}
- P. Cudre-Mauroux, H. Kimura, K. Lim, J. Rogers, R. Simakov, E. Soroush, P. Velikhov, D. L. Wang, M. Balazinska, J. Becla, D. Dewitt, B. Heath, D. Maier, S. Madden, M. Stonebraker, and S. Zdonik, "A Demonstration of SciDB: A Science-Oriented DBMS," in VLDB’09: Proceedings of the 2009 VLDB Endowment, 2009. [PDF] [BIBTEX]
@inproceedings{cmauroux2009ads,
author = {Cudre-Mauroux, Phillipe and Kimura, Hideaki and Lim, Kian-Tat and Rogers, Jennie and Simakov, Roman and Soroush, Emad and Velikhov, Pavel and Wang, Daniel L. and Balazinska, Magdalena and Becla, Jacek and Dewitt, David and Heath, Bobbi and Maier, David and Madden, Samuel and Stonebraker, Michael and Zdonik, Stan },
booktitle = {VLDB'09: Proceedings of the 2009 VLDB Endowment},
location = {Lyon, France},
month = {August},
project = {scidb},
publisher = {VLDB Endowment},
title = {A Demonstration of SciDB: A Science-Oriented DBMS},
url = {http://database.cs.brown.edu/papers/vldb09/scidb.pdf},
year = {2009}
}
- T. Ge, S. Zdonik, and S. Madden, "Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers," in SIGMOD ‘09: Proceedings of the 2009 ACM SIGMOD International Conference, 2009. [PDF] [BIBTEX]
@inproceedings{ge09,
author = {Tingjian Ge and Stan Zdonik and Samuel Madden},
booktitle = {SIGMOD '09: Proceedings of the 2009 ACM SIGMOD International Conference},
location = {Providence, Rhode Island, USA},
month = {June},
organization = {ACM},
title = {{Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers}},
url = {http://db.csail.mit.edu/pubs/sigmod308-ge.pdf},
year = {2009},
project = {scidb},
}
- T. Ge and S. B. Zdonik, "Handling Uncertain Data in Array Database Systems," in ICDE, 2008, pp. 1140-1149. [PDF] [BIBTEX]
@inproceedings{ge2008,
author = {Tingjian Ge and Stanley B. Zdonik},
booktitle = {ICDE},
crossref = {conf/icde/2008},
interHash = {8f22a4d5b12d2db95bb12c59c998f627},
intraHash = {efd67d5723c3d10684a7f86c6a9b6384},
pages = {1140-1149},
publisher = {IEEE},
title = {Handling Uncertain Data in Array Database Systems.},
url = {http://database.cs.brown.edu/papers/icde08_ge.pdf},
year = {2008},
ee = {http://dx.doi.org/10.1109/ICDE.2008.4497523},
project = {scidb},
}
- M. Stonebraker, C. Bear, U. Çetintemel, M. Cherniack, T. Ge, N. Hachem, S. Harizopoulos, J. Lifter, J. Rogers, and S. B. Zdonik, "One Size Fits All? Part 2: Benchmarking Studies," in CIDR ‘07, 2007, pp. 173-184. [PDF] [BIBTEX]
@inproceedings{stonebraker07,
author = {Michael Stonebraker and Chuck Bear and Ugur \c{C}etintemel and Mitch Cherniack and Tingjian Ge and Nabil Hachem and Stavros Harizopoulos and John Lifter and Jennie Rogers and Stanley B. Zdonik},
title = {One Size Fits All? Part 2: Benchmarking Studies},
booktitle = {CIDR '07},
year = {2007},
pages = {173-184},
url = {http://www.cidrdb.org/cidr2007/papers/cidr07p20.pdf},
bibsource = {DBLP, http://dblp.uni-trier.de},
project = {scidb},
}
People
Brown:
MIT:
- Samuel Madden
- Michael Stonebraker