December 14th, 2016

SciDB – Data Management System for Large Scale Scientific Data


SciDB is a multi-institution project that aims to develop a data management platform for data-intensive scientific applications, including astronomy and computational biology. This collaborative project brings together the expertise of five research teams at Brown University, Massachusetts Institute of Technology, Portland State University, University of Washington, and University of Wisconsin-Madison.

Scientific data management has traditionally been performed using the file system, at best using files structured according to a low-level data format. Higher-level data management infrastructure has been task-specific and not reusable across domains, resulting in millions of dollars of duplicated implementation effort by scientists to manage their data. The goal of this project is the development of a scientific database (SciDB), a system designed and optimized for scientific applications. The aim of SciDB is to do for science what relational databases did for the business world, namely to provide a high-performance, commercial-quality, and scalable data management system appropriate for many science domains.

In contrast to existing database systems, SciDB is based on a multidimensional array data model and includes several features that are specific and critical to science: provenance, uncertainty, versions, time travel, science-specific operations, and in situ data processing. No existing system offers all these features in a single, highly scalable engine. SciDB thus significantly advances the state of the art in data management, in addition to supporting domain scientists in data-driven knowledge discovery. The intellectual merit of SciDB is in exploring novel, high-performance solutions to nested array storage, parallel array query optimization and execution, array language design, and time travel.

The goal of this project is to build an efficient and scalable DBMS specialized for data-intensive scientific applications, for example an astronomy analysis that processes the petabytes of data produced each year by telescopes. Traditional one-size-fits-all DBMSs cannot handle such extremely large data efficiently. Moreover, such scientific data is often represented as multi-dimensional arrays and inherently contains uncertainty caused by measurement errors.
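
As a rough illustration of why arrays and uncertainty matter together, the following minimal Python sketch splits a small 2-D array of measurements into chunks, aggregates each chunk independently, and combines the partial results into a global mean with a propagated standard error. It is not SciDB code: the function names (`make_chunks`, `chunk_partial`, `chunked_mean`), the chunk shape, and the assumption that per-cell errors are independent are all introduced here for illustration only.

```python
import math
from itertools import product


def make_chunks(shape, chunk_shape):
    """Yield (row_range, col_range) pairs that tile a 2-D array."""
    rows, cols = shape
    crows, ccols = chunk_shape
    for r0, c0 in product(range(0, rows, crows), range(0, cols, ccols)):
        yield (range(r0, min(r0 + crows, rows)),
               range(c0, min(c0 + ccols, cols)))


def chunk_partial(values, sigmas, row_range, col_range):
    """Partial aggregate for one chunk: (sum, sum of variances, cell count)."""
    total, var_total, count = 0.0, 0.0, 0
    for r in row_range:
        for c in col_range:
            total += values[r][c]
            var_total += sigmas[r][c] ** 2
            count += 1
    return total, var_total, count


def chunked_mean(values, sigmas, chunk_shape):
    """Combine per-chunk partials into a mean and its standard error.

    Assumes independent per-cell errors, so variances add.
    """
    shape = (len(values), len(values[0]))
    partials = [chunk_partial(values, sigmas, rr, cr)
                for rr, cr in make_chunks(shape, chunk_shape)]
    total = sum(p[0] for p in partials)
    var_total = sum(p[1] for p in partials)
    count = sum(p[2] for p in partials)
    return total / count, math.sqrt(var_total) / count


# Hypothetical 3x3 "image" with a standard error of 0.1 on every cell.
values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
sigmas = [[0.1] * 3 for _ in range(3)]
mean, err = chunked_mean(values, sigmas, chunk_shape=(2, 2))
print(f"mean = {mean:.3f} +/- {err:.3f}")
```

Because each chunk yields only a small partial aggregate, the per-chunk step could in principle run on different nodes, with only the partials exchanged. The sketch is meant to convey that general idea, not how SciDB actually implements it.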

SciDB is an open source project. Here is a web page for SciDB developers.

Some Significant Results

  • Searchlight has shown that it is possible to scale constraint-based search problems to datasets that exceed available memory. It has also shown that sophisticated search and optimization over large data collections can be improved by delivering representative results early, so that the user can decide whether a given search is on the right track.
  • In his PhD thesis, Abdussalam Alawini developed the ReDiscover system for automatically predicting relationships, such as containment and complement, that exist between datasets embodied in spreadsheets. The methods used include conditional random fields (CRFs) for labelling cells as data, heading or other, pattern-based column extraction, Bloom filters for column summarization, and support vector machines (SVMs) for column matching and relationship prediction. ReDiscover exhibited markedly better precision and recall in predicting relationships than a method based solely on features a human might use.
  • In our SOCC’16 paper, we found that it was possible to automatically generate source-code extensions that enable big data systems to exchange data in parallel and in binary formats. Data exchange performance can improve by up to 3.8X compared with exporting data to disk in a text format such as CSV and re-importing the CSV from disk. Exchanging data directly over the network (without going to disk) easily cuts data exchange times in half. Additional optimizations, which remove delimiters, transfer data in binary format, and transfer data as blocks of column-formatted relations, cut an extra 30% of the data transfer time.
  • In our image-processing benchmark, we found that the array data model and the chunk-based processing approach are good fits for image analytics. However, support for user-defined operations written in the data scientists’ language of choice (currently Python) is critical to image analytics. Data scientists already have complex pipelines they seek to scale. These pipelines are daunting and extremely expensive to reimplement in a different language. Sometimes reimplementation is even impossible because critical operations are missing from the database’s declarative language. We further found that, as expected, SciDB outperformed the other engines on certain array-specific operations, such as computing mean values for pixels in an array. Interestingly, however, SciDB was not the fastest engine on all operations, indicating that room for improvement and future research remains.
  • A paper on TileDB was accepted into VLDB 2017. This paper shows that TileDB delivers comparable performance to the HDF5 dense array storage manager on dense arrays, while providing much faster random writes. We also show that TileDB offers substantially faster reads and writes than the SciDB array database system with both dense and sparse arrays. Finally, we demonstrate that TileDB is considerably faster than adaptations of relational column-stores for dense array storage management, and at least as fast for the case of sparse arrays.
  • S-Store has shown that it is possible to build a transactional stream processing system without significant loss in performance. It also showed how such a system can be used as a data ingestion front-end for a scientific back-end or for a polystore.
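
To make the Bloom-filter column summarization mentioned for ReDiscover more concrete, here is a minimal Python sketch of the general technique: each spreadsheet column is reduced to a fixed-size bit summary, and two summaries are compared to test whether one column might be contained in another. This page does not describe how ReDiscover builds or uses its filters, so the class and function names (`BloomFilter`, `summarize_column`, `may_contain`), the filter size, the number of hash functions, and the example columns are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def summarize_column(cells):
    """Build a Bloom-filter summary of one column's cell values."""
    summary = BloomFilter()
    for cell in cells:
        summary.add(cell)
    return summary


def may_contain(summary_a, summary_b):
    """True if every bit set for column B is also set for column A.

    A True result means B's values *may* be a subset of A's;
    a False result means B is definitely not contained in A.
    """
    return (summary_b.bits & ~summary_a.bits) == 0


# Hypothetical columns taken from two spreadsheets.
all_states = summarize_column(["WA", "OR", "CA", "ID"])
west_coast = summarize_column(["OR", "CA"])
print(may_contain(all_states, west_coast))
print(may_contain(west_coast, all_states))
```

The appeal of such summaries is that columns can be compared without rescanning the underlying spreadsheets. The cost is that a positive answer is only a hint, and would still need to be confirmed or fed into a downstream classifier.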








The SciDB project is supported by NSF grants to Brown University (IIS-1111423), Massachusetts Institute of Technology (IIS-1111371), Portland State University (IIS-1110917), University of Washington (IIS-1110370), and University of Wisconsin-Madison (IIS-1110948).

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Date last modified: December 14, 2016