The Monolith vs The Engine
Scaling data engineering is not about hiring more engineers to write more pipelines. It's about writing less code that does more work.
When your data platform grows from 10 tables to 1,000, the "one pipeline per table" approach collapses under its own weight. Maintenance becomes a nightmare, and "Time to Market" for new data explodes.
The Solution? A Metadata-Driven Framework. We separate the "What" (Metadata) from the "How" (Code).
This architecture has handled 2TB+ daily ingestion with 99.9% reliability. Let's build it, step by step.
Step 1: The Control Plane (Metadata)
We start by defining our "State of the World" in a dedicated Azure SQL Database. This table acts as the brain of our orchestrator. It tells the system what to ingest.
-- 1. Create the Control Table
CREATE SCHEMA config;
GO
CREATE TABLE config.PipelineMetadata (
    PipelineID INT IDENTITY(1,1) PRIMARY KEY,
    -- Source Definitions
    SourceSystem VARCHAR(50) NOT NULL,   -- e.g. 'SAP_HANA', 'Salesforce'
    SourceDatabase VARCHAR(100),         -- e.g. 'ECC', 'S4'
    SourceTable VARCHAR(100) NOT NULL,   -- e.g. 'KNA1', 'Account'
    -- Target Definitions (Unity Catalog)
    TargetCatalog VARCHAR(50) DEFAULT 'raw_catalog',
    TargetSchema VARCHAR(50) DEFAULT 'erp',
    TargetTable VARCHAR(100),
    -- Load Configuration
    LoadType VARCHAR(20) CHECK (LoadType IN ('Full', 'Incremental', 'SCD2')),
    WatermarkColumn VARCHAR(50),         -- e.g. 'LastModifiedDate'
    PrimaryKeyList VARCHAR(200),         -- e.g. 'ID,CountryCode' used for MERGE
    -- Operational Flags
    IsActive BIT DEFAULT 1,
    CreatedDate DATETIME DEFAULT GETDATE()
);
-- 2. Register a new dataset (Zero code deployment!)
INSERT INTO config.PipelineMetadata
(SourceSystem, SourceTable, LoadType, WatermarkColumn, PrimaryKeyList)
VALUES
('SAP', 'KNA1', 'Incremental', 'ERDAT', 'KUNNR');
Step 2: The Orchestrator (ADF)
Azure Data Factory is not used for data movement (Copy Activity). It is purely for logic and orchestration (Notebook Activity).
The Pipeline Logic:
- Lookup Activity: runs SELECT * FROM config.PipelineMetadata WHERE IsActive = 1 to get the list of active tables.
- ForEach Loop: iterates over the output array of the Lookup.
- Databricks Notebook Activity: inside the loop, we call the Generic Ingestion Notebook, passing dynamic parameters.
Security Integration: ADF authenticates to Databricks using its Managed Identity (MSI), eliminating the need to manage PAT tokens.
/* ADF Dynamic Configuration */
{
    "source_system": "@{item().SourceSystem}",
    "source_table": "@{item().SourceTable}",
    "target_path": "abfss://raw@datalake.dfs.core.windows.net/@{item().SourceSystem}/@{item().SourceTable}",
    "watermark_col": "@{item().WatermarkColumn}",
    "primary_keys": "@{item().PrimaryKeyList}"
}
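On the Databricks side, the Generic Notebook picks these base parameters up as widgets. A minimal sketch, assuming widget names that mirror the JSON keys above:
# Read the parameters passed by the ADF ForEach loop (dbutils is a notebook global).
# The empty defaults exist only so the notebook can be run interactively for debugging.
dbutils.widgets.text("source_system", "")
dbutils.widgets.text("source_table", "")
dbutils.widgets.text("watermark_col", "")
dbutils.widgets.text("primary_keys", "")

config = {
    "source_system": dbutils.widgets.get("source_system"),
    "source_table": dbutils.widgets.get("source_table"),
    "watermark_col": dbutils.widgets.get("watermark_col"),
    "primary_keys": dbutils.widgets.get("primary_keys"),
}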
Step 3: The Processing Engine (PySpark)
This is where the magic happens. A single Generic Notebook (Ingest_Generic.py) handles every single table in the organization.
Secret Management (Key Vault):
We never hardcode passwords. The notebook connects to Azure Key Vault at runtime using the dbutils.secrets API to fetch source system credentials securely.
The Ingestion Class
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp


class DataIngestor:
    def __init__(self, config: dict):
        self.config = config
        self.spark = SparkSession.builder.getOrCreate()

    def read_source(self):
        """Dynamic JDBC Reader with Key Vault Integration"""
        # Securely fetch credentials from the Key Vault-backed secret scope
        # (dbutils.secrets is available as a global in Databricks notebooks)
        user = dbutils.secrets.get(scope="kv-core", key=f"{self.config['source_system']}-user")
        pwd = dbutils.secrets.get(scope="kv-core", key=f"{self.config['source_system']}-pwd")
        url = dbutils.secrets.get(scope="kv-core", key=f"{self.config['source_system']}-url")

        df = (self.spark.read
              .format("jdbc")
              .option("url", url)
              .option("dbtable", self.config['source_table'])
              .option("user", user)
              .option("password", pwd)
              .option("fetchsize", "10000")
              .load())
        return df

    def write_sink(self, df):
        """Universal Merge Logic (Upsert)"""
        target_table = f"{self.config['target_catalog']}.{self.config['target_schema']}.{self.config['target_table']}"

        # SCHEMA EVOLUTION: automatically handle new columns during MERGE
        self.spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

        if self.config['load_type'] == 'Full':
            df.write.format("delta").mode("overwrite").option("mergeSchema", "true").saveAsTable(target_table)

        elif self.config['load_type'] == 'Incremental':
            # First run: the target does not exist yet, so create it instead of merging
            if not self.spark.catalog.tableExists(target_table):
                df.write.format("delta").saveAsTable(target_table)
                return

            # Construct the dynamic join condition for MERGE from the primary key list
            pks = self.config['primary_keys'].split(',')
            join_cond = " AND ".join([f"source.{k.strip()} = target.{k.strip()}" for k in pks])

            delta_table = DeltaTable.forName(self.spark, target_table)
            (delta_table.alias("target")
             .merge(df.alias("source"), join_cond)
             .whenMatchedUpdateAll()
             .whenNotMatchedInsertAll()
             .execute())
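The notebook's entry point then becomes a few lines. A sketch, assuming the config dict also carries the target names and load type from the metadata row:
# Entry point of Ingest_Generic.py: one code path for every registered table.
# Assumption: 'config' also contains target_catalog, target_schema, target_table
# and load_type, passed through the same ADF parameter mechanism shown in Step 2.
ingestor = DataIngestor(config)
source_df = ingestor.read_source()   # JDBC read with Key Vault credentials
ingestor.write_sink(source_df)       # Full overwrite, or MERGE for incremental loads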
Step 4: Governance (Unity Catalog)
Governance is enforced at the platform level. We use Unity Catalog to ensure that while our pipelines are automated, access control is strict.
- Centralized Access Control (ACLs): We define permissions once at the Catalog/Schema level, and they propagate to all dynamically created tables.
  GRANT USE CATALOG raw_catalog TO `data_engineers`;
  GRANT SELECT ON SCHEMA raw_catalog.erp TO `analysts`;
- Audit Logs: Every read/write action by our Generic Notebook is logged. We can trace exactly which Pipeline Run ID modified the Account table (see the query sketch after this list).
- Data Lineage: Unity Catalog automatically captures lineage from the specific JDBC source -> Bronze table -> Silver table, visualizing dependencies in real time.
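If system tables are enabled on the account, that audit trail is also queryable directly. A minimal sketch against the system.access.audit table (the time filter and selected columns are our choice):
# Recent audit events against Unity Catalog securables (requires system tables to be enabled).
audit_df = spark.sql("""
    SELECT event_time, user_identity.email AS principal, action_name, request_params
    FROM system.access.audit
    WHERE event_date >= date_sub(current_date(), 7)
    ORDER BY event_time DESC
""")
audit_df.show(truncate=False)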
Step 5: Handling Schema Drift
In data engineering, change is constant. A source system will add a column without telling you. If your pipeline fails, you're doing it wrong.
The Strategy: Evolution, not Failure. We utilize Delta Lake's native Schema Evolution capabilities to gracefully handle upstream changes.
- Option 1: Merge Schema (mergeSchema). When enabled, if the incoming DataFrame has columns that don't exist in the target Delta table, they are automatically added to the end of the schema.
  df.write.option("mergeSchema", "true").mode("append").saveAsTable("target")
- Option 2: Spark Configuration. For MERGE INTO operations (Upserts), we must enable it explicitly in the session config:
  spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
Safeguards:
We rely on Type Checking. If a column changes type (e.g., Integer to String), we default to failure to protect downstream BI reports from breaking. We do not automagically cast types unless explicitly configured.
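A minimal sketch of that guard, assuming we compare the incoming DataFrame's dtypes against the existing target table before writing (the helper name is ours):
def assert_no_type_changes(spark, df, target_table: str):
    """Fail fast if an existing column arrives with a different data type."""
    target_types = dict(spark.table(target_table).dtypes)    # e.g. {'KUNNR': 'string'}
    incoming_types = dict(df.dtypes)
    changed = {
        column: (target_types[column], new_type)
        for column, new_type in incoming_types.items()
        if column in target_types and target_types[column] != new_type
    }
    if changed:
        raise ValueError(f"Type drift detected, aborting load: {changed}")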
Step 6: Performance Architecture
Ingesting data is easy; querying it fast is hard. We employ advanced layout strategies to ensure sub-second query performance.
Liquid Clustering vs Z-Order
For years, Z-ORDER was the gold standard. However, it requires static columns and expensive re-computation. We have migrated to Liquid Clustering.
| Feature | Z-Order (Legacy) | Liquid Clustering (Modern) |
|---|---|---|
| Definition | Static multi-dimensional sorting | Dynamic, incremental clustering |
| Maintenance | Requires full Rewrite (OPTIMIZE) | Incremental, efficient background jobs |
| Skew Handling | Poor (Sensitive to data skew) | Excellent (Adapts to data distribution) |
| Use Case | Static historical tables | High-frequency ingestion & streaming |
Recommendation:
For all new Bronze/Silver ingestion tables, we enable Liquid Clustering on high-cardinality keys (e.g., Timestamp, CustomerID).
-- Dynamic execution of Liquid Clustering
ALTER TABLE raw_catalog.erp.sap_kna1
CLUSTER BY (ERDAT, KUNNR);
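In the framework the statement is generated rather than hard-coded. A sketch, assuming a hypothetical cluster_columns entry in the config (e.g. "ERDAT,KUNNR"), which is not part of the DDL shown in Step 1:
# Apply Liquid Clustering to a dynamically created table, driven by metadata.
target_table = f"{config['target_catalog']}.{config['target_schema']}.{config['target_table']}"
cluster_columns = config.get("cluster_columns")   # hypothetical metadata field
if cluster_columns:
    spark.sql(f"ALTER TABLE {target_table} CLUSTER BY ({cluster_columns})")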
Optimization Routine
We don't rely on luck. We run a weekly maintenance job to compact small files and clean up old versions.
spark.sql(f"OPTIMIZE raw_catalog.erp.{table_name}")
spark.sql(f"VACUUM raw_catalog.erp.{table_name} RETAIN 7 DAYS") # Compliance
Step 7: Storage Strategy (Managed vs External)
In Unity Catalog, you have two choices for where data lives. Choosing wrong impacts disaster recovery.
1. Managed Tables (The Default)
- Location: Root storage of the Metastore/Catalog.
- Pros: Easy setup, "Interchangeable" compute.
- Cons: Harder to access if the Workspace goes down.
- Verdict: Use for Bronze and Silver layers (Transient/Internal data).
2. External Tables
- Location: Explicit Azure Data Lake Gen2 path (controlled by you).
- Pros: Data survives even if Databricks is deleted. Easier to share with non-Databricks tools (e.g., Synapse, Fabric).
- Cons: Requires EXTERNAL LOCATION setup.
- Verdict: Use for Gold layer (Business Data) and critical backups (see the sketch below).
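Once the EXTERNAL LOCATION exists, registering a Gold table against an explicit path is a single statement. A sketch with hypothetical catalog, schema, column, and storage names:
# Gold-layer table pinned to a storage path we control (all names are illustrative).
spark.sql("""
    CREATE TABLE IF NOT EXISTS gold_catalog.sales.customer_summary (
        customer_id    STRING,
        total_revenue  DECIMAL(18, 2),
        last_updated   TIMESTAMP
    )
    USING DELTA
    LOCATION 'abfss://gold@datalake.dfs.core.windows.net/sales/customer_summary'
""")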
Step 8: Continuous Integration (CI/CD)
The code (Notebooks, ADF JSON) is versioned in Git. We use Azure DevOps (YAML) to deploy changes.
Validation Gate:
Before merging to main, we run Nutter tests to ensure the Generic Ingestor still works.
# azure-pipelines.yml
trigger:
  - main

pool:
  vmImage: 'ubuntu-latest'

steps:
  - script: |
      pip install nutter
      nutter run /tests/test_generic_ingestor.py --cluster_id $(DEV_CLUSTER_ID)
    displayName: 'Run Unit Tests on Databricks'

  - task: AzureDataFactoryV2Publish@1
    inputs:
      ResourceGroupName: 'rg-data-prod'
      DataFactoryName: 'adf-core-prod'
      PublishProfile: 'adb_publish'
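The test itself is a notebook following Nutter's fixture pattern. A minimal sketch of what test_generic_ingestor.py could look like; the fixture table name and assertions are ours:
from runtime.nutterfixture import NutterFixture


class GenericIngestorTest(NutterFixture):
    """Smoke test: a small fixture table must survive the load path intact."""

    def before_all(self):
        spark.sql("CREATE OR REPLACE TABLE raw_catalog.test.nutter_fixture (id INT, name STRING)")
        spark.sql("INSERT INTO raw_catalog.test.nutter_fixture VALUES (1, 'a'), (2, 'b')")

    def run_fixture_load(self):
        self.row_count = spark.table("raw_catalog.test.nutter_fixture").count()

    def assertion_fixture_load(self):
        assert self.row_count == 2

    def after_all(self):
        spark.sql("DROP TABLE IF EXISTS raw_catalog.test.nutter_fixture")


result = GenericIngestorTest().execute_tests()
print(result.to_string())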
The Result
By implementing this framework:
- Onboarding Velocity: New tables are added via a SQL INSERT. Time-to-market dropped from 3 days to 5 minutes.
- Resilience: Handling schema drift allows upstream teams to iterate without breaking pipelines.
- Governance: Unity Catalog provides a live map of the entire data estate.
- Cost: Using Job Clusters for the generic notebook reduced Databricks DBUs by 40% compared to interactive pools.