CORADD: Correlation Aware Database Designer
CORADD (CORrelation Aware Database Designer) is an automatic database design tool. The main focus of the project is to design a fast and compact OLAP database by exploiting correlations in the database.
All source codes, manuscripts and experimental results are provided under the MIT license. Contact Hideaki for repository access.
Correlation Map (CM)
One key building block of our database designer is Correlation Map (CM), which is a compressed form of secondary index. A CM is a compact data structure that stores mapping between indexed attributes and clustered attributes as shown right. In other words, a CM is a sparse secondary index in contrast to traditional dense secondary indexes such as B+Tree.
Here, the secondary B+Tree index on city is a dense structure, containing an entry for every tuple appearing with each city. In order to satisfy the “Boston” or “Springfield” predicate using a standard B+Tree, the query engine uses the index to look up all corresponding rowids. The equivalent CM in this example contains only the unique (city, state) pairs. In order to satisfy the same predicate using a CM, the query engine looks up all possible state values corresponding to “Boston” or “Springfield.” The resulting values (“MA”, “NH”, “OH”) correspond to 3 sequential ranges of rowids in the table. These are then scanned and filtered on the original city predicate.
The advantage of CM compared with traditional B+Tree is its small size (usually about 1,000 times smaller) and its low maintenance cost due to the small size. CM further reduces the size by bucketing the indexed and clustered attributes.
The key difference between CORADD and other existing automatic physical database designer (e.g., IBM DB2 Tuning Advisor, Microsoft SQLServer Database Tuning Advisor) is that CORADD is aware of correlations and produces materialized views (MVs) and indexes to maximize the correlations between clusterered and secondary indexes.
This approach has two advantages. First, a well correlated secondary index performs significantly (sometimes 40 times or more) faster than un-correlated case because the correlated secondary index points to small and contiguous area in heap file while that of un-correlated secondary index randomly scatters (imagine a disk access pattern of a query which predicates on city with a clustered index on state and a clustered index on salary). Second, a well correlated secondary index can be compactly stored as a CM without affecting the query performance. As a CM stores mapping between clustered and unclustered attributes, stronger correlations result in fewer entries and smaller CM size.
After CORADD generates candidate MVs, clustered indexes and CMs based on query workloads, it picks up an optimal set of the candidate objects that maximizess the query performance within a given space budget via Integer Linear Programming (ILP) and the ILP Feedback algorithm.
CORADD produces a substantially (usually 5-6 times) faster database design than existing automatic database designer especially in OLAP (online-analytical processing) setting where a query accesses thousands or millions of tuples in a huge database and returns aggregates over them.
- UPI: A Primary Index for Uncertain Databases
Hideaki Kimura, Samuel Madden, Stan Zdonik, Full paper, Proceedings of the 2010 Very Large Data Bases Endowment (VLDB’10), 2010. [Dataset/Sourcecode] [Slides]
- CORADD: Correlation Aware Database Designer for Materialized Views and Indexes
Hideaki Kimura, George Huo, Alexander Rasin, Samuel Madden, Stan Zdonik, Full paper, Proceedings of the 2010 Very Large Data Bases Endowment (VLDB’10), 2010. [Slides]
- Correlation Maps: A Compressed Access Method for Soft Functional Dependencies
Hideaki Kimura, George Huo, Alexander Rasin, Samuel Madden, Stan Zdonik, Full paper, Proceedings of the 2009 Very Large Data Bases Endowment (VLDB’09), 2009.
- Correlation Maps: A Compressed Access Method for Exploiting Attribute Correlation
George Huo, Hideaki Kimura, Alexander Rasin, Samuel Madden, Stan Zdonik, Poster Session, New England Database Day 2009 (NEDBDay’09), January 2009.
- George Huo (Now at Google)
- Samuel Madden