A Hybrid Data Platform with Azure Data Factory & dbt - Part 1

Series Overview

This is Part 1 of our series on building a hybrid data platform. Here are all the articles in this series:

Introduction

Amid the current AI hype, it is easy to forget that AI is only as useful as its underlying data. In the technology landscape of 2025, data isn’t a nice-to-have asset; it is the backbone of your competitive advantage. If you’re not weaponizing your data, your competitors certainly are.

In this technical series, I’ll walk you through our architecture for a modern data platform built on Azure Data Factory and dbt, with on-premises SQL Server as our data warehouse system. It isn’t a theoretical exercise, but a working architecture I’ve implemented in production, with all the scars and lessons to prove it. To set expectations straight: if you are expecting the latest bells and whistles, you will be disappointed. Instead, you will be treated to a down-to-earth architecture with tried and proven technology, resulting in extremely low operational overhead, high stability, and plenty of room to evolve.

The Evolution of Our Data Strategy

Let’s start with Ground Zero, which I would label as ‘something with Data’. Most companies begin there. Your data infrastructure consists of fragmented databases connected by brittle scripts. You build dashboards that tell you what happened yesterday. You analyze quarterly trends. You justify gut decisions with data retroactively.

That’s exactly where we were. Our tech stack included multiple on-prem MSSQL databases with scheduled jobs copying data between them. Reporting was manual, integration points frequently broke, and we had zero confidence in data consistency.

This fragmented approach meant we were constantly fighting fires rather than leveraging our data for strategic advantage. The advent of generative AI added urgency.

Recognizing these limitations, I started arguing for a transformation in the way we use and process data, evolving from merely having some data to treating Data-as-a-Product. Only if we manage data with the same care, quality standards, and customer focus as any other product will data become a strategic enabler for the business.

The journey typically follows this progression:

  • Data-Driven Business: Using data to inform decisions
  • Data-Driven Products:
    • Empowering product management with data
    • Embedding data capabilities into products and services
  • Data as a Product: Treating data itself as a product with users, requirements, and quality standards

This reflects the understanding that data is not merely a byproduct of business operations but a valuable asset that can drive innovation, efficiency, and competitive advantage when properly managed and leveraged.

As CTO, this is where I insisted we needed to be. Data as a product means applying the same engineering discipline to data that you apply to your software products. Your data has defined schemas, versioning, automated testing, documentation, and clear ownership.

The Medallion Data Architecture: Technical Implementation

Let’s talk implementation details. After evaluating several architectural patterns, we selected the medallion architecture—a multi-layered approach that separates concerns in the data pipeline. It’s not revolutionary, but it’s proven, maintainable, and scales well with organizational complexity:

Bronze Layer: Raw Data Preservation (The System of Record)

The bronze layer is fundamentally a copy of the sourced data:

  • Zero transformations, zero business logic
  • The baseline data record, a.k.a. the ‘crude oil’ of the data refinement process

From a technical implementation standpoint, we use Azure Data Factory (ADF) exclusively for the EL (Extract and Load) operations from third-party systems to our data warehouse. (As a side note, you will still find a lot of references to ETL (Extract, Transform, and Load) data pipelines. For a medallion-style data architecture, ETL becomes EL-T, for reasons I will explain below.) ADF connects to our on-prem SQL Server instances via self-hosted integration runtimes (an on-prem ADF component) running in our data center.

For API sources like Pipedrive, we’ve implemented both cursor-based and offset pagination patterns to handle large datasets efficiently. The bronze layer is locked down and we never modify this data after it lands.
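To illustrate the difference between the two patterns, here is a minimal Python sketch of generic pagination loops. This is illustrative only: in our platform the actual extraction is configured in ADF copy activities, and the `fetch` callables below stand in for whatever HTTP call a given source API requires.

```python
def paginate_offset(fetch, page_size=100):
    """Offset pagination: request fixed-size pages until a short
    (or empty) page signals the end of the dataset."""
    offset = 0
    while True:
        page = fetch(offset, page_size)
        yield from page
        if len(page) < page_size:
            break
        offset += page_size


def paginate_cursor(fetch):
    """Cursor pagination: follow the opaque next-cursor token
    returned by the source until it returns none."""
    cursor = None
    while True:
        page, cursor = fetch(cursor)
        yield from page
        if cursor is None:
            break
```

As a general rule, offset pagination can skip or duplicate records when the source data changes between requests, while cursor-based pagination is more robust for large, frequently changing datasets, which is one reason to favor it when an API offers it.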

Why on-premises databases, you might ask? Because I believe in an evolutionary approach over revolutions. Our IT team had been running those MSSQL databases for years, and a significant investment had been made in the underlying hardware and licenses right before I started. So rather than get bogged down in a protracted discussion about data security and wasted IT investment, my focus was on getting the big picture right while staying flexible in the details.

Silver Layer: Cleansed and Transformed Data (The Engineering Layer)

The silver layer is where the data engineering happens:

  • Type casting and schema enforcement
  • Deduplication algorithms and conflict resolution
  • Data quality validation with automated testing
  • Business logic implementation and field derivation

This layer is entirely implemented using dbt, deployed via Docker containers in Azure Container Instances. We’re not using Azure Databricks or Synapse here, since that would be overkill for our workload profile. The dbt containers connect back to on-prem using Microsoft’s open-source Azure Relay Bridge, which establishes a secure relay connection from on-prem via the Azure Relay service. Needless to say, all dbt models are version-controlled: they are committed to Azure DevOps repos and injected into the dbt Docker container during the pipeline build.
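To make the silver-layer work concrete: the deduplication step typically boils down to keeping the most recent version of each record, which in a dbt model is usually expressed as a `row_number() over (partition by ... order by ... desc)` filter. Here is a Python sketch of that same logic; the field names `id` and `_loaded_at` are hypothetical stand-ins for a business key and a load timestamp.

```python
def latest_per_key(rows, key="id", order_by="_loaded_at"):
    """Keep only the newest version of each record: the Python
    equivalent of filtering on
    row_number() over (partition by key order by order_by desc) = 1."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[order_by] > latest[k][order_by]:
            latest[k] = row
    return list(latest.values())
```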

Gold Layer: Business-Ready Data Products (The Consumption Layer)

The gold layer is purpose-built for consumption:

  • Star schemas and denormalized structures for analytical workloads
  • Pre-aggregated tables for common dimensional queries
  • Materialized views to optimize query patterns
  • Domain-specific terminology and metrics

For implementation, we use SQL stored procedures and views maintained by our data analyst. These transformations convert silver-layer data into consumption-ready tables, which then feed directly into Power BI dashboards. The gold layer is both the end of our pipeline and the beginning of our feedback loop: user behavior in analytics tools informs future data product development. Like the dbt models, the SQL views and stored procedures are committed to Azure DevOps, in the form of JSON Azure Data Factory pipeline templates.
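As an illustration, a gold-layer pre-aggregation is conceptually just a GROUP BY over silver data. Sketched here in Python with hypothetical field names (`order_date`, `amount`); in production this is a SQL view or stored procedure, not application code.

```python
def monthly_order_totals(orders):
    """Pre-aggregate order amounts per calendar month, mirroring the
    kind of GROUP BY a gold-layer view performs before Power BI
    picks it up."""
    totals = {}
    for order in orders:
        month = order["order_date"][:7]  # 'YYYY-MM' from an ISO date string
        totals[month] = totals.get(month, 0.0) + order["amount"]
    return totals
```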

Technical Architecture: The Component Breakdown

Let me break down our technical stack and explain why each component was selected:

  • On-Premises MSSQL Server - We kept our source databases on-prem for two pragmatic reasons: we had already paid for the licenses (why pay twice?), and our security team sleeps better knowing sensitive data remains within our physical control.
  • Azure Data Factory - For ELT orchestration, ADF was the obvious choice in the Microsoft ecosystem. We considered other tools, but the native Azure integration with ADF significantly reduced implementation complexity.
  • Self-Hosted Integration Runtimes - These lightweight ADF agents run in our data center, creating secure tunnels for ADF to reach our on-prem SQL servers without opening inbound firewall ports.
  • Azure Storage - Standard blob storage serves as intermediate storage. Cost-effective and virtually unlimited scaling.
  • dbt in Docker Containers - We containerized dbt to ensure consistent execution environments. These containers run on-demand in Azure Container Instances as job runners, with practically zero operational cost.
  • Azure Relay Bridge - This open-source Microsoft application provides the same connectivity benefits for dbt that Integration Runtimes provide for ADF, allowing secure access to on-prem resources via the Azure Relay Service.
  • Azure DevOps - Contains our repos for infrastructure code, ADF pipeline definitions, and dbt models. All changes go through the CI/CD pipelines with automated deployment.
  • Terraform - Our entire infrastructure is defined as code, with shared modules to eliminate boilerplate. We maintain separate state files for play, test, and production environments in Azure storage containers.

The architecture follows domain-driven design principles, with Data Factories structured along business and system domain boundaries. This isn’t accidental; it is fundamental to my approach of evolutionary architecture. Conway’s Law suggests that software architecture will mirror organizational structure, so we explicitly designed the data engineering platform with our organizational structure in mind.

The Technical ROI: Why This Actually Matters

As a CTO, I need to justify every architectural decision in terms of business value. Here’s the measurable impact of our implementation:

  • From Days to Hours: Acceleration of Analytical Insights

Our previous ETL processes took 2-3 days from data collection to actionable report, accessible only to a select few. With the new architecture, we’ve been able to start implementing Power BI dashboards showing daily updated MRR Forecast and Order Intake figures across a wide range of company functions, from product management to business leadership. I would like to believe that when your CEO can see yesterday’s orders before today’s first meeting, you’re operating at a different level.

  • From Fragile to Resilient: Operational Stability

Our previous infrastructure experienced periodic failures, each requiring manual intervention. The new architecture has been running without any significant outage for the last six months, with almost no operational intervention needed, while we have steadily added more and more data sources.

  • From Silos to Unified: Reduction in Conflicting Data Reports

Previously, different departments cited conflicting metrics in management meetings because they pulled from different source systems. With more and more reporting based on our gold layer as the single source of truth, these discrepancies have dropped.

  • From Bottlenecks to Scalability: An Evolutionary Architecture Built for Change

Our architecture is evolving to incorporate more and more data sources by leveraging our growing library of shared infrastructure patterns and components. This gives us the flexibility to move beyond mere business data and on to consuming product data, the next stage in our data journey.

Next Steps: Infrastructure as Code Deep Dive

In my next article, I’ll explore our Terraform implementation, including the parameterized modules we built to eliminate boilerplate code across environments. I’ll share actual code patterns we use to provision ADF instances, storage accounts, and containerized dbt runners across our play, test, and production environments.