Online analytical processing databases speed up the multi-dimensional queries common in business intelligence applications.
Online analytical processing (OLAP) databases are purpose-built for handling analytical queries. Analytical queries run on online transaction-processing (OLTP) databases often take a long time to return answers. There are several reasons for this.
First, OLTP databases are usually in third normal form, so that analytical queries have to perform complex JOIN operations on many tables, which can be computationally expensive. Second, OLTP databases tend to have relatively few indexes, to optimize write speed, while read-heavy analytical queries often benefit from additional indexes. Third, OLTP databases tend to be constantly busy with small transactions, which can cause contention (mostly for indexes) while long analytical queries are running, slowing down both the transactions and the queries.
OLAP databases solve these issues by providing a separate, optimized database for analytical queries. There are several ways to optimize databases for analysis, as we’ll discuss.
OLAP explained
OLAP databases are designed to speed up multidimensional analysis on large volumes of data from a data warehouse or data mart. High-speed analysis can be accomplished by extracting the relational data into a multidimensional format called an OLAP cube; by loading the data to be analyzed into memory; by storing the data in columnar order; and/or by using many CPUs in parallel (i.e., massively parallel processing, or MPP) to perform the analysis.
ETL and ELT
One barrier to implementing OLAP is establishing a process to get the data out of the transactional database and into the analysis database. That used to be a nightly batch job to extract, transform, and load (ETL) the data. As hardware and software improved, ETL batch jobs were often replaced with continuous data streams, and sometimes the transformation step was deferred to the end of the process, after loading (ELT). ELT is becoming more common, in order to support feature engineering for machine learning running against the analysis database.
Columnar storage
Transactional databases store table rows together, which makes sense when you are constantly accessing whole rows. OLAP databases typically store table columns together, which makes sense when you tend to aggregate field values. In addition, OLAP databases often try to keep active columns in memory, for speed. Another advantage of columnar storage is that columns of similar data compress well.
What is an OLAP cube?
OLAP cubes or hypercubes are a way of organizing data with hierarchical dimensions so that analysis can be performed quickly, without a lot of SQL JOINs and UNIONS. OLAP cubes revolutionized business intelligence (BI) systems. Before OLAP cubes, business analysts would submit queries at the end of the day and then go home, hoping to have answers the next day. After OLAP cubes, the data engineers would run the jobs to create cubes overnight, so that the analysts could run interactive queries against them in the morning.
OLAP cubes support five kinds of “slice and dice” operations. Slicing means extracting a lower-dimensional cube with one dimension set to a single value, for example MONTH=6. Dicing means extracting a sub-cube with multiple dimensions set to single values, for example STORE=95 AND MONTH=6. Drilling down and drilling up allow the analyst to move from viewing summaries (up) to detailed values (down). Roll-up summarizes or aggregates data along a dimension. Pivot rotates a cube to see another perspective on the data. OLAP cube pivoting is much more efficient than pivoting in a spreadsheet. The MDX query language, a variation on SQL, is used to query OLAP cubes.
OLAP cubes have largely been replaced in recent years by data warehouses that use compressed columnar storage (preferably in-memory) and MPP.
What is MOLAP?
Multi-dimensional online analytical processing (MOLAP) is the classic form of OLAP that uses multi-dimensional OLAP cubes. While MOLAP leads to very fast analysis, preprocessing the OLAP cubes can be very time-consuming. MOLAP is most efficient when the facts (data fields) are numeric and can be aggregated.
What is ROLAP?
Relational OLAP (ROLAP) works directly with relational databases, and doesn’t require the creation of OLAP cubes. Usually, the analytical database for ROLAP is separate from the OLTP database, and an ETL or ELT process updates the data warehouse or data mart from the OLTP database periodically, and creates aggregate tables as part of the process. For efficiency, the ETL or ELT process usually works with incremental data rather than recreating the data warehouse from scratch.
Instead of MDX queries, analysts interrogate a ROLAP database with SQL, often relying heavily on the newer analysis operators. The GROUP BY clause groups aggregates by a specified column. The ROLLUP operator extends GROUP BY to multiple columns, essentially calculating subtotals and grand totals. The CUBE operator calculates subtotals and grand totals for all permutations of the specified columns.
What is HOLAP?
Hybrid online analytical processing (HOLAP) is a combination of ROLAP and MOLAP. HOLAP allows storing part of the data in a MOLAP store and another part of the data in a ROLAP store. Typically, there is a cache for aggregates from both the cube and the relational database. Microsoft Analysis Services and SAP BI Accelerator implement HOLAP.
As we’ve discussed, dedicated analytical databases can speed up queries for business intelligence. While OLAP cubes dominated the field for decades, it is more common today for companies to maintain data warehouses that use relational databases with compressed columnar storage and massive parallel processing.