SciDB – Data Management System for Large Scale Scientific Data
SciDB is a multi-institution project that aims to develop a data management platform for data-intensive scientific applications, including astronomy and computational biology. This collaborative project brings together the expertise of five research teams at Brown University, Massachusetts Institute of Technology, Portland State University, University of Washington, and University of Wisconsin-Madison.
Scientific data management has traditionally been performed using the file system, at best using files structured according to a low-level data format. Higher-level data management infrastructure has been task-specific and not reusable in different domains, resulting in millions of dollars of duplicated implementation effort by scientists to manage their data. The goal of this project is the development of a scientific database (SciDB), a system designed and optimized for scientific applications. The aim of SciDB is to do for science what relational databases did for the business world, namely to provide a high performance, commercial-quality and scalable data management system appropriate for many science domains.
In contrast to existing database systems, SciDB is based on a multidimensional array data model and includes several features that are critical for science: provenance, uncertainty, versions, time travel, science-specific operations, and in situ data processing. No existing system offers all these features in a single, highly scalable engine. SciDB thus significantly advances the state of the art in data management in addition to supporting domain scientists in data-driven knowledge discovery. The intellectual merit of SciDB is in exploring novel, high-performance solutions to nested array storage, parallel array query optimization and execution, array language design, and time travel.
The goal of this project is to build an efficient and scalable DBMS specialized for data-intensive scientific applications, for example, an astronomy analysis that processes the petabytes of data that telescopes produce each year. Traditional (one-size-fits-all) DBMSs cannot handle such extremely large data efficiently. Moreover, such scientific data is often represented as multidimensional arrays and inherently contains uncertainty caused by measurement errors.
SciDB is an open source project. Here is a web page for SciDB developers.
- Searchlight: Enabling Integrated Search and Exploration over Large Multidimensional Data.
A. Kalinin, U. Cetintemel, S. Zdonik. PVLDB 8(10): 1094-1105 (2015)
- Automated Analysis of Muscle X-ray Diffraction Imaging with MCMC.
David Williams, Magdalena Balazinska, and Tom Daniel. DMAH workshop with VLDB 2015.
- Data Movement in Hybrid Analytic Systems: A Case for Automation.
Patrick Leyshock, David Maier, Kristin Tufte. SSDBM 2014.
- Minimizing Data Movement Through Query Optimization.
Patrick Leyshock, David Maier, Kristin Tufte. 2014 IEEE International Conference on Big Data, Big Data 2014, Washington, DC, USA, October 2014.
- A Prolegomenon on OLTP Database Systems for Non-Volatile Memory.
Justin DeBrabant, Joy Arulraj, Andrew Pavlo, Michael Stonebraker, Stanley B. Zdonik, Subramanya Dulloor. ADMS Workshop, collocated with VLDB 2014: 57-63
- S-Store: A Streaming NewSQL System for Big Velocity Applications.
Ugur Çetintemel, Jiang Du, Tim Kraska, Samuel Madden, David Maier, John Meehan, Andrew Pavlo, Michael Stonebraker, Erik Sutherland, Nesime Tatbul, Kristin Tufte, Hao Wang, Stanley B. Zdonik. PVLDB 7(13): 1633-1636 (2014)
- Efficient Iterative Processing in the SciDB Parallel Array Engine.
Emad Soroush, Magdalena Balazinska, Simon Krughoff, and Andrew Connolly. SSDBM 2015.
- A Padded Encoding Scheme to Accelerate Scans by Leveraging Skew.
Yinan Li, Craig Chasseur, Jignesh M. Patel. SIGMOD Conference 2015: 1509-1524
- Implications of Emerging 3D GPU Architecture on the Scan Primitive.
Jason Power, Yinan Li, Mark D. Hill, Jignesh M. Patel, David A. Wood. SIGMOD Record 44(1): 18-23 (2015)
- Big data and its technical challenges
H. V. Jagadish, Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstantinou, Jignesh M. Patel, Raghu Ramakrishnan, and Cyrus Shahabi, CACM 57(7), 2014.
- Squeezing a Big Orange into Little Boxes: The AscotDB System for Parallel Processing of Data on a Sphere
Jacob Vanderplas, Emad Soroush, Simon Krughoff, Magdalena Balazinska, and Andrew Connolly, IEEE Data Engineering Bulletin 36(4), 2013.
- GenBase: a complex analytics genomics benchmark
Rebecca Taft, Manasi Vartak, Nadathur Rajagopalan Satish, Narayanan Sundaram, Samuel Madden, and Michael Stonebraker, SIGMOD 2014
- Interactive data exploration using semantic windows
Alexander Kalinin, Ugur Çetintemel, Stanley B. Zdonik, SIGMOD 2014
- Agrios: A hybrid approach to big array analytics
Patrick Leyshock, David Maier, Kristin Tufte, BigData Conference, 2013
- Query Steering for Interactive Data Exploration,
U. Cetintemel, M. Cherniack, J. DeBrabant, Y. Diao, K. Dimitriadou, A. Kalinin, O. Papaemmanouil, S. Zdonik, CIDR 2013.
- SubZero: a Fine-Grained Lineage System for Scientific Databases,
Eugene Wu, Samuel Madden, Michael Stonebraker, ICDE 2013
- Time Travel in a Scientific Array Database
Emad Soroush and Magdalena Balazinska, ICDE 2013
- BitWeaving: Fast Scans for Main Memory Data Processing,
Yinan Li and Jignesh M. Patel: SIGMOD 2013.
- Enabling JSON Document Stores in Relational Systems,
Craig Chasseur, Yinan Li and Jignesh M. Patel, WebDB 2013.
- A Demonstration of Iterative Parallel Array Processing in Support of Telescope Image Analysis,
Matthew Moyers, Emad Soroush, Spencer C Wallace, Simon Krughoff, Jake Vanderplas, Magdalena Balazinska, and Andrew Connolly. VLDB 2013. Demonstration.
- WHAM: A High-Throughput Sequence Alignment Method
Yinan Li, Jignesh M. Patel, Allison Terrell, ACM Transactions on Database Systems, 37(4): 28 (2012)
- Scorpion: Explaining Away Outliers in Aggregate Queries (preprint),
Eugene Wu, Samuel Madden, PVLDB 2013.
- A Demonstration of DBWipes: Clean as You Query,
Eugene Wu, Samuel Madden, Michael Stonebraker, VLDB 2012.
- Efficient Versioning for Scientific Array Databases,
Adam Seering, Philippe Cudre-Mauroux, Samuel Madden, and Michael Stonebraker. ICDE, 2012.
- WHAM: a high-throughput sequence alignment method.
Yinan Li, Allison Terrell, Jignesh M. Patel: SIGMOD 2011.
- ArrayStore: A Storage Manager for Complex Parallel Array Processing
Emad Soroush, Magdalena Balazinska, and Daniel Wang. SIGMOD 2011
- Hybrid Merge/Overlap Execution Technique for Parallel Array Processing
Emad Soroush and Magdalena Balazinska, ArrayDB Workshop (to be held in conjunction with EDBT 2011).
- Overview of SciDB, Large Scale Array Storage, Processing and Analysis, The SciDB Development team, SIGMOD’10, June 6-11, 2010, Indianapolis, Indiana, USA
- A Demonstration of SciDB: A Science-Oriented DBMS, P. Cudre-Mauroux, H. Kimura, K.-T. Lim, J. Rogers, R. Simakov, E. Soroush, P. Velikhov, D.L. Wang, M. Balazinska, J. Becla, D. DeWitt, B. Heath, D. Maier, S. Madden, J. Patel, M. Stonebraker, S. Zdonik, VLDB’09 Volume 2, Number 1, 1534-1537, Lyon, France, August 2009
- SciDB @ U. of Washington
- Agrios website at Portland State
- Wham (High-Throughput Sequence Alignment) @ U. of Wisconsin-Madison
- Intel Science and Technology Center for Big Data
The SciDB project is supported by NSF grants to Brown University (IIS-1111423), Massachusetts Institute of Technology (IIS-1111371), Portland State University (IIS-1110917), University of Washington (IIS-1110370), and University of Wisconsin-Madison (IIS-1110948).
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.