Data Warehousing

Data Warehousing — Architecture, OLTP vs OLAP

Storing and analysing historical data at scale for business intelligence.


Definition

A Data Warehouse (DW) is a central repository of integrated, subject-oriented, time-variant, non-volatile data collected from multiple operational systems. Unlike OLTP databases optimised for transactions, data warehouses are optimised for analytical queries (OLAP) over large historical datasets. Modern examples include Amazon Redshift, Google BigQuery, Snowflake, and Databricks. Data warehousing is the foundation of business intelligence, reporting, and machine learning pipelines.

Real-life analogy: The company memory bank

Your operational databases (CRM, ERP, e-commerce) are like the busy front office — constantly processing today's transactions. At day's end, key information is copied into a secure archive — the data warehouse. The warehouse never changes past records, combines data from all systems in one place, and is optimised for questions like 'what was our revenue by region over the past 5 years?' — impossible to answer efficiently from live transactional systems.

OLTP vs OLAP — the fundamental distinction

| Property | OLTP (Transactional) | OLAP (Analytical) |
| --- | --- | --- |
| Purpose | Day-to-day business operations | Analysis, reporting, decision-making |
| Operations | INSERT, UPDATE, DELETE (many writes) | SELECT with complex aggregations (read-only) |
| Data volume | Current data only (GB scale) | Historical data — years (TB to PB scale) |
| Query type | Simple, fast, predefined (ms) | Complex, long-running (seconds to minutes) |
| Tables | Normalised (3NF/BCNF) — avoid redundancy | Denormalised (star/snowflake) — optimise reads |
| Users | Many concurrent users (1000s) | Few analytical users (10s) |
| Examples | MySQL, PostgreSQL, Oracle | BigQuery, Redshift, Snowflake, Synapse |
| Update frequency | Real-time (every transaction) | Batch (nightly/weekly ETL loads) |
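The contrast can be sketched at toy scale in Python with SQLite: a single-row update by primary key stands in for the OLTP workload, and a full-scan aggregation stands in for the OLAP workload. The table and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders (region, amount) VALUES (?, ?)",
                 [("EU", 100.0), ("EU", 250.0), ("US", 400.0)])

# OLTP-style: touch one row, located by primary key (fast point operation)
conn.execute("UPDATE orders SET amount = 120.0 WHERE id = 1")

# OLAP-style: scan and aggregate every row (the warehouse workload)
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('EU', 370.0), ('US', 400.0)]
```

At three rows both queries are instant; the table above is about what happens when the scan covers billions of rows, which is where row-store OLTP engines fall over and columnar warehouses take over.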

Data Warehouse architecture — 6-layer pipeline

  1. Source systems: Operational databases (CRM, ERP, web analytics, IoT sensors, flat files). Data is raw, inconsistent, in different formats.
  2. ETL Layer (Extract, Transform, Load): Extract from sources, clean/transform (standardise formats, handle NULLs, deduplicate), load into warehouse. Modern variant: ELT — extract and load raw first, transform inside the warehouse using SQL.
  3. Staging Area: Temporary holding area where data is cleaned before loading into the warehouse. Not exposed to end users.
  4. Data Warehouse: Central repository organised in dimensional model (star/snowflake schema). Contains historical, integrated, subject-oriented data.
  5. Data Marts: Subject-specific subsets (Sales Mart, Finance Mart, HR Mart). Faster queries for specific departments.
  6. Presentation Layer: BI tools (Tableau, Power BI, Looker), SQL clients, ML pipelines, analytics notebooks.
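Steps 2 and 3 can be sketched in miniature. The snippet below assumes a hypothetical raw extract and shows the transform work (standardise formats, default NULLs, deduplicate) followed by a load into a staging table, using SQLite via Python's sqlite3:

```python
import sqlite3

# Hypothetical raw extract: inconsistent casing, a NULL, and a duplicate row
raw = [
    {"id": 1, "region": " eu ", "revenue": 100.0},
    {"id": 2, "region": "US",   "revenue": None},   # NULL to handle
    {"id": 1, "region": " eu ", "revenue": 100.0},  # duplicate to drop
]

# Transform: standardise region codes, default NULL revenue, dedupe on the key
seen, clean = set(), []
for row in raw:
    if row["id"] in seen:
        continue
    seen.add(row["id"])
    clean.append((row["id"], row["region"].strip().upper(), row["revenue"] or 0.0))

# Load into a staging table (not exposed to end users)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_sales (id INTEGER PRIMARY KEY, region TEXT, revenue REAL)")
conn.executemany("INSERT INTO stg_sales VALUES (?, ?, ?)", clean)
print(conn.execute("SELECT * FROM stg_sales ORDER BY id").fetchall())
```

In a real pipeline this role is played by orchestration and transformation tools (or, in ELT, by SQL inside the warehouse), but the shape of the work is the same.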

Analytical SQL query — the kind only a data warehouse can handle efficiently

```sql
-- Revenue performance by region and product — multi-year analysis
-- This query joins 4 tables and aggregates millions of rows in seconds on Redshift/BigQuery
SELECT
    d_time.Year,
    d_time.Quarter,
    d_region.RegionName,
    d_product.ProductCategory,
    SUM(f.Revenue)                   AS TotalRevenue,
    SUM(f.UnitsSold)                 AS UnitsSold,
    ROUND(AVG(f.Revenue), 2)         AS AvgRevenue,
    SUM(f.Revenue)
      / SUM(SUM(f.Revenue)) OVER (PARTITION BY d_time.Year)
                                     AS YearlyRevenueShare,
    SUM(f.Revenue)
      - LAG(SUM(f.Revenue)) OVER (PARTITION BY d_region.RegionName,
                                               d_product.ProductCategory
                                  ORDER BY d_time.Year, d_time.Quarter)
                                     AS RevenueGrowthVsPrevQuarter
FROM fact_sales f
JOIN dim_time    d_time    ON f.DateKey    = d_time.DateKey
JOIN dim_region  d_region  ON f.RegionKey  = d_region.RegionKey
JOIN dim_product d_product ON f.ProductKey = d_product.ProductKey
WHERE d_time.Year BETWEEN 2020 AND 2024
GROUP BY d_time.Year, d_time.Quarter, d_region.RegionName, d_product.ProductCategory
HAVING SUM(f.Revenue) > 50000
ORDER BY d_time.Year, d_time.Quarter, TotalRevenue DESC;
```
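The same window-function pattern can be tried locally at toy scale. This is a sketch using SQLite via Python's sqlite3, with a hypothetical, simplified fact table (window functions require SQLite 3.25+); the grouped sums are computed in a subquery, then the share-of-year and quarter-over-quarter deltas are windowed over the grouped rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (year INT, quarter INT, region TEXT, revenue REAL);
INSERT INTO fact_sales VALUES
  (2023, 1, 'EU', 100), (2023, 2, 'EU', 300),
  (2023, 1, 'US', 200), (2023, 2, 'US', 400);
""")

# Aggregate first, then apply window functions over the grouped rows
rows = conn.execute("""
    SELECT year, quarter, region, total,
           total / SUM(total) OVER (PARTITION BY year)      AS yearly_share,
           total - LAG(total) OVER (PARTITION BY region
                                    ORDER BY year, quarter) AS growth
    FROM (SELECT year, quarter, region, SUM(revenue) AS total
          FROM fact_sales
          GROUP BY year, quarter, region) AS g
    ORDER BY region, quarter
""").fetchall()
for r in rows:
    print(r)  # e.g. (2023, 2, 'EU', 300.0, 0.3, 200.0)
```

The first row per region has `growth = NULL`, since LAG finds no previous quarter inside its partition; the big-warehouse version behaves the same way.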

Kimball vs Inmon architecture

Inmon (top-down): Build one enterprise-wide normalised DW first, then derive data marts from it. A single source of truth, but complex and expensive to build initially.

Kimball (bottom-up): Build dimensional data marts (star schemas) first, then integrate them via a data warehouse bus. Faster to deliver value.

Modern data platforms (Lakehouse) blend both: store raw data in a data lake, and model it with dimensional schemas on top.

Practice questions

  1. Why is an OLTP database not suitable for analytical queries? (Answer: OLTP schemas are normalised into many tables, so analytical questions need many JOINs, and the engine is optimised for single-row operations. Aggregating millions of rows across many tables forces slow full scans on a row-store OLTP system.)
  2. Four characteristics of a data warehouse per Bill Inmon: (Answer: Subject-oriented, Integrated, Time-variant, Non-volatile. Data is organised around business subjects, unified from multiple sources, preserves historical snapshots, and never deleted or updated once loaded.)
  3. Difference between data warehouse and data lake: (Answer: Warehouse: structured, processed, schema-on-write, SQL analytics. Lake: raw, unprocessed, schema-on-read, any format. Lakehouse combines both.)
  4. What is a data mart? (Answer: A subject-specific subset of the warehouse serving one department — Sales Mart, Finance Mart. Faster for specific queries; warehouse provides enterprise-wide single source of truth.)
  5. Purpose of the staging area in DW architecture: (Answer: Temporary ETL work area — data extracted here, cleaned (handle NULLs, fix formats, deduplicate), validated before loading into the actual warehouse. Protects warehouse from bad data.)

On LumiChats

LumiChats can design a data warehouse schema for any business, write ETL pipeline code in Python/SQL, and generate analytical SQL for BigQuery, Redshift, or Snowflake. Describe your data sources and reporting needs in plain English.
