October 27th, 2016

CORADD: Correlation Aware Database Designer


CORADD (CORrelation Aware Database Designer)  is an automatic database design tool. The main focus of the project is to design a fast and compact OLAP database  by exploiting correlations in  the database.

Source Code

All source codes, manuscripts and experimental results are provided under the MIT license.  Contact Hideaki for repository access.

Correlation Map (CM)

Correlation Maps Architecture

Correlation Maps Architecture

One key building block of our database designer is Correlation Map (CM), which is a compressed form of secondary index. A CM is a compact data structure that stores mapping between indexed attributes and clustered attributes as shown right. In other words, a CM is a sparse secondary index in contrast to traditional dense secondary indexes such as B+Tree.

Here, the secondary B+Tree index on city is a dense structure, containing an entry for every tuple appearing with each city. In order to satisfy the “Boston” or “Springfield” predicate using a standard B+Tree, the query engine uses the index to look up all corresponding rowids.    The equivalent CM in this example contains only the unique (city, state) pairs. In order to satisfy the same predicate using a CM, the query engine looks up all possible state values corresponding to “Boston” or “Springfield.” The resulting values (“MA”, “NH”, “OH”) correspond to 3 sequential ranges of rowids in the table. These are then scanned and filtered on the original city predicate.

The advantage of CM compared with traditional B+Tree is its small size (usually about 1,000 times smaller) and its low maintenance cost due to the small size. CM further reduces the size by bucketing the indexed and clustered attributes.

CORADD Architecture

CORADD Architecture

CORADD Architecture

The key difference between CORADD and other existing automatic physical database designer (e.g., IBM DB2 Tuning Advisor, Microsoft SQLServer Database Tuning Advisor) is that CORADD is aware of correlations and produces materialized views (MVs) and indexes to maximize the correlations between clusterered and secondary indexes.

This approach has two advantages. First, a well correlated secondary index performs significantly (sometimes 40 times or more) faster than un-correlated case because the correlated secondary index points to small and contiguous area in heap file while that of un-correlated secondary index randomly scatters (imagine a disk access pattern of a query which predicates on city with a clustered index on state and a clustered index on salary). Second, a well correlated secondary index can be  compactly stored as a CM without affecting the query performance. As a CM stores mapping between clustered and unclustered attributes, stronger correlations result in fewer entries and smaller CM size.

After CORADD generates candidate MVs, clustered indexes and CMs based on query workloads, it picks up an optimal set of the candidate objects that maximizess the query performance within a given space budget via Integer Linear Programming (ILP) and the ILP Feedback algorithm.

CORADD produces a substantially (usually 5-6 times) faster database design than existing automatic database designer especially in OLAP (online-analytical processing) setting where a query accesses thousands or millions of tuples in a huge database and returns aggregates over them.