July 26th, 2015

SciDB – Data Management System for Large Scale Scientific Data

Summary

SciDB is a multi-institution project that aims to develop a data management platform for data-intensive scientific applications, including astronomy and computational biology. This collaborative project brings together the expertise of five research teams at Brown University, Massachusetts Institute of Technology, Portland State University, University of Washington, and University of Wisconsin-Madison.

Scientific data management has traditionally been performed using the file system, at best using files structured according to a low-level data format. Higher-level data management infrastructure has been task-specific and not reusable across domains, resulting in millions of dollars of duplicated implementation effort by scientists to manage their data. The goal of this project is the development of a scientific database (SciDB), a system designed and optimized for scientific applications. The aim of SciDB is to do for science what relational databases did for the business world, namely to provide a high-performance, commercial-quality, and scalable data management system appropriate for many science domains.

In contrast to existing database systems, SciDB is based on a multidimensional array data model and includes several features that are critical for science: provenance, uncertainty, versions, time travel, science-specific operations, and in situ data processing. No existing system offers all these features in a single, highly scalable engine. SciDB thus significantly advances the state of the art in data management, in addition to supporting domain scientists in data-driven knowledge discovery. The intellectual merit of SciDB lies in exploring novel, high-performance solutions to nested array storage, parallel array query optimization and execution, array language design, and time travel.
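To make the ideas of an array data model with versions and time travel more concrete, here is a minimal sketch in Python. It is not SciDB code and does not use SciDB's query languages or storage format; the class `VersionedArray` and its methods `commit` and `read` are hypothetical names invented for this illustration. The sketch assumes that each version stores only the cells that changed, and that reading a cell "as of" a version means finding the most recent write at or before that version.

```python
from typing import Dict, List, Optional, Tuple

Coord = Tuple[int, ...]


class VersionedArray:
    """Toy sparse multidimensional array that keeps every committed version.

    Hypothetical illustration only; this is not how SciDB is implemented.
    """

    def __init__(self, shape: Tuple[int, ...]) -> None:
        self.shape = shape
        # One dict of changed cells (a delta) per committed version, oldest first.
        self._deltas: List[Dict[Coord, float]] = []

    def _check(self, coord: Coord) -> None:
        in_bounds = len(coord) == len(self.shape) and all(
            0 <= c < n for c, n in zip(coord, self.shape)
        )
        if not in_bounds:
            raise IndexError(f"coordinate {coord} is outside shape {self.shape}")

    def commit(self, updates: Dict[Coord, float]) -> int:
        """Store a new version holding only the changed cells; return its number."""
        for coord in updates:
            self._check(coord)
        self._deltas.append(dict(updates))
        return len(self._deltas)  # versions are numbered from 1

    def read(self, coord: Coord, version: Optional[int] = None) -> float:
        """Return the cell value as of `version` (default: the latest version)."""
        self._check(coord)
        if version is None:
            version = len(self._deltas)
        if not 0 <= version <= len(self._deltas):
            raise ValueError(f"unknown version {version}")
        # Time travel: walk back from the requested version to the latest write.
        for delta in reversed(self._deltas[:version]):
            if coord in delta:
                return delta[coord]
        raise KeyError(f"cell {coord} is empty at version {version}")


# A 4 x 4 array with two versions; the second overwrites one cell.
sky = VersionedArray(shape=(4, 4))
v1 = sky.commit({(0, 1): 10.0, (2, 3): 7.5})
v2 = sky.commit({(0, 1): 12.0})

latest = sky.read((0, 1))               # 12.0, the current value
as_of_v1 = sky.read((0, 1), version=v1)  # 10.0, the value before the update
untouched = sky.read((2, 3))             # 7.5, carried forward from version 1
```

Storing deltas rather than full copies is only one possible design choice; it keeps old versions cheap to retain at the cost of slower reads for cells that were last written long ago.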

The goal of this project is to build an efficient and scalable DBMS specialized for data-intensive scientific applications, for example an astronomy analysis that processes the petabytes of data produced by telescopes each year. Traditional (one-size-fits-all) DBMSs cannot handle such extremely large data efficiently. Moreover, such scientific data is often represented as multidimensional arrays and inherently contains uncertainty caused by measurement errors.
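As a rough illustration of what it means for array cells to carry measurement uncertainty, the following Python sketch attaches a standard deviation to each value and propagates it through a simple aggregate. This is not SciDB's uncertainty model; the names `Measurement` and `mean` are hypothetical, and the sketch assumes independent errors combined in quadrature (first-order error propagation).

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Measurement:
    """A cell value with the standard deviation of its measurement error."""

    value: float
    sigma: float

    def __add__(self, other: "Measurement") -> "Measurement":
        # Assumption: the two errors are independent, so they add in quadrature.
        return Measurement(
            self.value + other.value,
            math.hypot(self.sigma, other.sigma),
        )


def mean(cells: List[Measurement]) -> Measurement:
    """Average a list of cells, propagating the uncertainty of the sum."""
    if not cells:
        raise ValueError("mean of an empty list is undefined")
    total = cells[0]
    for cell in cells[1:]:
        total = total + cell
    # Dividing by a constant scales the value and the error by the same factor.
    return Measurement(total.value / len(cells), total.sigma / len(cells))


# A tiny 2 x 2 "image" whose pixels each carry an error estimate.
image = [
    [Measurement(10.0, 0.3), Measurement(12.0, 0.4)],
    [Measurement(11.0, 0.3), Measurement(9.0, 0.4)],
]
row_means = [mean(row) for row in image]
```

For the first row, the mean value is 11.0 and the propagated uncertainty is about 0.25, smaller than either input error because independent errors partially cancel when averaged.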

SciDB is an open source project. Here is a web page for SciDB developers.

People

Brown:

MIT:

Portland State:

University of Washington:

University of Wisconsin-Madison:

Publications:

Links:

Acknowledgements

The SciDB project is supported by NSF grants to Brown University (IIS-1111423), Massachusetts Institute of Technology (IIS-1111371), Portland State University (IIS-1110917), University of Washington (IIS-1110370), and University of Wisconsin-Madison (IIS-1110948).

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.