Series Overview
This is Part 2 of our series on building a hybrid data platform. If you’re joining mid-series, here is the full roadmap:
- Part 1: From Something-with-Data to Data-as-a-Product - Medallion architecture and business transformation
- Part 2: Infrastructure as Code Foundation with Terraform - IaC patterns and module design
- Part 3: Domain-Driven Design for Data Engineering - Source system separation and Conway’s Law
- Part 4: Hybrid Connectivity Architecture - Integration runtimes and Azure Relay Bridge
- Part 5: Extract and Load Pipeline Evolution - Four-pipeline pattern and deletion detection
- Part 6: Data Transformation Architecture - Dual-track approach with dbt and analyst SQL
- Part 7: CI/CD as Organizational Strategy - Selective deployment and complexity placement
- Part 8: DATEV Integration Patterns - Hardcoding Clients and Embracing Failure
- Part 9: Integrating Product Telemetry - Bringing OpenTelemetry Into Unified Analytics
- Part 10: RevOps Funnel Analytics - Building Bowtie GTM Metrics
Introduction
In my previous article, I introduced our hybrid data platform architecture that combines on-premises SQL Server with Azure Data Factory and dbt. We explored the medallion architecture pattern and the business value it delivers. As promised, this second installment dives into how we manage our infrastructure using Terraform.
One of my core principles is that modern infrastructure should be treated as code – versioned, tested, and deployed through automated pipelines. Manual configuration of cloud resources is a recipe for inconsistency, security vulnerabilities, and operational headaches.
Our journey to Infrastructure as Code (IaC) wasn’t optional; it was a necessity driven by three key factors:
- Consistency: Each environment (play, test, production) needed to be identical in structure, differing only in scale and specific configuration values.
- Auditability: Every change to our infrastructure needed to be documented, reviewed, and traceable.
- Repeatability: The ability to recreate environments from scratch or recover from disaster scenarios quickly.
In this article, I’ll walk you through our Terraform implementation, share our module architecture, and provide practical examples of how we provision Azure resources for our hybrid data platform.
Terraform Fundamentals for Our Azure Data Platform
Before diving into specific modules, let’s establish the foundation of our Terraform setup. If you’re already familiar with Terraform, you can skim this section – but I’ve found that even experienced teams can benefit from revisiting fundamentals.
Project Organization
Our repository structure follows a pattern that separates generic modules from environment-specific configurations:
```
terraform/
├── modules/             # Reusable, parameterized modules
│   ├── az-datafactory/  # Azure Data Factory modules
│   ├── az-keyvault/     # Key Vault modules
│   ├── az-relay/        # Azure Relay modules
│   ├── az-storage/      # Storage Account modules
│   └── az-container/    # Container Registry modules
├── euc-play/            # Development environment
├── euc-test/            # Testing environment
└── euc-prod/            # Production environment
```
We also have shared modules in a separate repository for higher-level patterns:
```
shared/terraform/modules/
├── az-datafactory-db-datasets/  # Dataset creation
├── az-keyvault-secret/          # Secret management
├── az-container/                # Container configuration
└── elt-template/                # Complete ELT pipeline template
```
This structure enables us to maintain a clean separation between reusable components and their specific implementations. Each environment directory contains a complete Terraform configuration that references the shared modules with environment-specific parameters.
State Management Strategy
A critical decision in any Terraform implementation is how to manage state. For our setup, we use Azure Storage as a remote backend. Each environment has its own state files, completely isolated from others. Before any Terraform operations, we run a bootstrap pipeline that creates the necessary resource group and storage account for state management:
```yaml
- task: AzureCLI@2
  displayName: Create resource group
  inputs:
    azureSubscription: '${{parameters.serviceConnection}}'
    scriptType: 'bash'
    scriptLocation: 'inlineScript'
    inlineScript: 'az group create -l ${LOCATION} -n $(PROJECT)$(REGION)$(ENVIRONMENT)'

- task: AzureCLI@2
  displayName: Create storage account
  inputs:
    azureSubscription: '${{parameters.serviceConnection}}'
    scriptType: 'bash'
    scriptLocation: 'inlineScript'
    inlineScript: 'az storage account create -n $(STORAGE_ACCOUNT_NAME) -g $(PROJECT)$(REGION)$(ENVIRONMENT)'
```
Once the storage infrastructure is in place, our Terraform configurations use it for state management:
```yaml
- task: TerraformTaskV4@4
  displayName: terraformInit
  inputs:
    provider: 'azurerm'
    command: 'init'
    workingDirectory: '$(System.DefaultWorkingDirectory)/infra_deploy/terraform/$(REGION)-$(ENVIRONMENT)'
    backendServiceArm: '${{parameters.serviceConnection}}'
    backendAzureRmResourceGroupName: $(PROJECT)$(REGION)$(ENVIRONMENT)
    backendAzureRmStorageAccountName: $(STORAGE_ACCOUNT_NAME)
    backendAzureRmContainerName: $(STORAGE_CONTAINER_NAME)
    backendAzureRmKey: '$(Build.Repository.Name)-$(TF_VAR_NAME).tfstate'
```
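On the Terraform side, the backend block stays nearly empty: a partial configuration whose concrete values the pipeline supplies at `init` time. A sketch of what this looks like in each environment directory (file name assumed):

```hcl
# backend.tf -- partial backend configuration; the pipeline injects the
# resource group, storage account, container, and state key at init time
terraform {
  backend "azurerm" {}

  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
    }
  }
}
```

Keeping the backend values out of the code is what lets the same configuration target a different state file per environment.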
This isolation serves two purposes:
- It prevents accidental changes to production when working on development environments
- It allows different teams to work on different environments concurrently
Module Design Principles
When designing Terraform modules, I follow these core principles:
- Single Responsibility: Each module does one thing well
- Sensible Defaults: Modules work with minimal configuration
- Complete Documentation: Each variable is well-documented
- Consistent Outputs: Output formats are consistent across modules
- Standard Structure: All modules follow the same file organization (main.tf, variables.tf, outputs.tf)
Let’s look at a simplified example of our Data Factory module to illustrate these principles:
```hcl
locals {
  resource_group_name = "${var.PROJECT}${var.REGION}${var.ENVIRONMENT}"
}

data "azurerm_resource_group" "main" {
  name = local.resource_group_name
}

# Data Factory
resource "azurerm_data_factory" "adf_infra_shared" {
  name                = "${local.resource_group_name}-adf${var.TYPE}"
  location            = var.LOCATION
  resource_group_name = local.resource_group_name

  identity {
    type = "SystemAssigned"
  }

  tags = {
    PURPOSE = var.NAME
    OWNER   = var.OWNER
  }
}

# Data Factory self-hosted integration runtime
resource "azurerm_data_factory_integration_runtime_self_hosted" "integration_runtime" {
  name            = var.SELF_HOSTED_SHARED_RUNTIME_NAME
  data_factory_id = azurerm_data_factory.adf_infra_shared.id
}
```
This module encapsulates the creation of an Azure Data Factory instance with a self-hosted integration runtime. It follows our standard naming conventions and creates a system-assigned managed identity for authentication.
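The accompanying variables.tf is where the Sensible Defaults and Complete Documentation principles show up. A trimmed sketch of what such a file looks like (the descriptions and default value are illustrative, not the literal file):

```hcl
variable "PROJECT" {
  description = "Short project prefix used in all resource names"
  type        = string
}

variable "TYPE" {
  description = "Suffix distinguishing the factory's purpose, e.g. a shared base factory"
  type        = string
  default     = "shared"
}

variable "SELF_HOSTED_SHARED_RUNTIME_NAME" {
  description = "Name of the self-hosted integration runtime owned by this factory"
  type        = string
}
```

Typed, documented variables make `terraform plan` errors self-explanatory and let the module work with minimal configuration.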
Building a Data Pipeline with Shared Modules
Now let’s see how we compose multiple modules to create a complete data pipeline. Here’s a simplified example from one of our environment configurations:
```hcl
module "azure_datafactory" {
  source = "../modules/az-datafactory"

  PROJECT                         = var.PROJECT
  NAME                            = var.NAME
  LOCATION                        = var.LOCATION
  ENVIRONMENT                     = var.ENVIRONMENT
  REGION                          = var.REGION
  OWNER                           = var.OWNER
  SELF_HOSTED_SHARED_RUNTIME_NAME = var.SELF_HOSTED_SHARED_RUNTIME_NAME
}

module "key_vault" {
  source = "../modules/az-keyvault"

  PROJECT     = var.PROJECT
  NAME        = var.NAME
  LOCATION    = var.LOCATION
  ENVIRONMENT = var.ENVIRONMENT
  REGION      = var.REGION
  OWNER       = var.OWNER
}

module "integration_runtime_secret" {
  source = "../../../shared/terraform/modules/az-keyvault-secret"

  PROJECT      = var.PROJECT
  NAME         = var.NAME
  LOCATION     = var.LOCATION
  ENVIRONMENT  = var.ENVIRONMENT
  REGION       = var.REGION
  OWNER        = var.OWNER
  SECRET_NAME  = var.SELF_HOSTED_SHARED_RUNTIME_SECRET
  SECRET_VALUE = module.azure_datafactory.integration_runtime_self_hosted_primary_key

  depends_on = [
    module.azure_datafactory,
    module.key_vault
  ]
}
```
This pattern creates the core infrastructure components and automatically stores the generated integration runtime key in Key Vault. Notice how we reference outputs from one module as inputs to another, creating a dependency chain that Terraform manages for us.
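For this chain to work, the az-datafactory module has to expose the runtime's authentication key as an output. A sketch of the relevant outputs.tf entries, with names inferred from the usage above:

```hcl
output "integration_runtime_self_hosted_primary_key" {
  description = "Primary key used to register on-premises runtime nodes"
  value       = azurerm_data_factory_integration_runtime_self_hosted.integration_runtime.primary_authorization_key
  sensitive   = true
}

output "data_factory_id" {
  description = "Resource ID of the shared data factory"
  value       = azurerm_data_factory.adf_infra_shared.id
}
```

Marking the key `sensitive` keeps it out of plan output while still letting downstream modules consume it.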
Three-Environment Deployment Strategy
Our data platform uses three distinct environments:
- Play: Development environment for building and testing new features
- Test: Validation environment to ensure configurations work correctly
- Prod: Production environment for business operations
Each environment is completely self-contained with its own resources, state files, and secrets. This isolation is intentional - it prevents cross-environment dependencies and ensures that we can evolve each environment independently as needed.
I’ve learned from experience that while striving for identical environments is the goal, differences inevitably emerge. One hard lesson was discovering that configurations developed in Play didn’t always deploy cleanly to Production. This highlighted the importance of having a Test environment as a validation step, even for infrastructure code.
Our approach to environment management follows an evolutionary pattern rather than trying to over-engineer from the start. While we currently don’t have formal approval gates between environments, our isolated environment design would make adding them straightforward when needed.
Hybrid Connectivity Architecture
The most challenging aspect of our implementation wasn’t Terraform itself but establishing reliable connectivity between cloud services and on-premises databases. We solved this with two complementary approaches:
Self-hosted Integration Runtimes for Azure Data Factory
For Azure Data Factory, we use self-hosted integration runtimes installed on-premises:
```hcl
resource "azurerm_data_factory_integration_runtime_self_hosted" "contoso_dwh_runtime" {
  name            = var.SELF_HOSTED_SHARED_RUNTIME_NAME
  data_factory_id = azurerm_data_factory.adf_infra_shared.id
}
```
These runtimes establish outbound connections to Azure and allow ADF to interact with on-premises SQL Server without requiring inbound firewall rules.
To minimize on-premises infrastructure, we use a shared runtime model. Each environment has a single base Azure Data Factory that owns the self-hosted runtime, which is then shared with domain-specific data factories through RBAC:
```hcl
resource "azurerm_role_assignment" "role_assignment_self_hosted_runtime" {
  scope                = data.azurerm_resource_group.adf_rg_infra_shared.id
  role_definition_name = "Contributor"
  principal_id         = azurerm_data_factory.adf_elt_pipeline.identity[0].principal_id
}
```
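With the role assignment in place, each domain factory declares a linked runtime that points back at the shared one through its ARM resource ID. A sketch, assuming the variable name for the shared runtime's resource ID:

```hcl
resource "azurerm_data_factory_integration_runtime_self_hosted" "linked_runtime" {
  name            = var.SELF_HOSTED_SHARED_RUNTIME_NAME
  data_factory_id = azurerm_data_factory.adf_elt_pipeline.id

  # Link to the runtime owned by the shared base factory instead of
  # registering new on-premises nodes for this factory
  rbac_authorization {
    resource_id = var.SHARED_RUNTIME_RESOURCE_ID
  }
}
```

One on-premises runtime installation thus serves every domain factory in an environment.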
Azure Relay Bridge for dbt Connectivity
For dbt running in Azure Container Instances, we use the open-source Azure Relay Bridge to establish secure connectivity to on-premises databases:
```hcl
resource "azurerm_container_group" "container_instance" {
  name                = local.dbt_container_instance_name
  location            = var.LOCATION
  resource_group_name = azurerm_resource_group.rg_elt_pipeline.name

  # Container group configuration...

  container {
    name  = var.DBT_CONTAINER_NAME
    image = "${data.azurerm_container_registry.cr_infra_shared.login_server}/${var.DBT_CONTAINER_NAME}:${var.DBT_CONTAINER_TAG}"
    # dbt container config...
  }

  container {
    name  = var.AZBRIDGE_CONTAINER_NAME
    image = "${data.azurerm_container_registry.cr_infra_shared.login_server}/${var.AZBRIDGE_CONTAINER_NAME}:${var.AZBRIDGE_CONTAINER_TAG}"
    # azbridge container config...
  }
}
```
This creates a sidecar container pattern: the azbridge container establishes connectivity to on-premises databases via Azure Relay, and the dbt container routes its database connections through it. The two containers communicate via a shared volume mounted from Azure Storage, using a file-based semaphore system for coordination.
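The shared volume the semaphore files live on is itself declared in Terraform. A sketch of the volume fragment inside one of the container blocks (the mount path and share variable names are assumptions):

```hcl
container {
  name  = var.DBT_CONTAINER_NAME
  image = "..."

  # Both containers mount the same Azure Files share; dbt waits for a
  # ready-file that the azbridge container writes once the tunnel is up
  volume {
    name                 = "shared-state"
    mount_path           = "/mnt/shared"
    share_name           = var.SHARED_FILE_SHARE_NAME
    storage_account_name = var.STORAGE_ACCOUNT_NAME
    storage_account_key  = var.STORAGE_ACCOUNT_KEY
  }
}
```

Mounting the same share into both containers gives them a coordination channel without any network dependency between them.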
Both approaches provide secure hybrid connectivity without requiring inbound firewall rules, making the solution more secure and easier to deploy within typical corporate network constraints.
Security Implementation
Security is a fundamental concern for any data platform. Our approach focuses on three key areas:
1. Secret Management with Key Vault
All sensitive information is stored in Azure Key Vault, with each environment having its own isolated vault. Terraform automatically stores generated secrets during deployment:
```hcl
module "azure_relay_local_secret" {
  source = "../../../shared/terraform/modules/az-keyvault-secret"

  # Configuration...
  SECRET_NAME  = var.AZBRIDGE_LOCAL_SECRET
  SECRET_VALUE = module.azure_relay.relay_hybrid_connection_send_connection_string
}
```
When secrets need to be managed outside of Terraform (for example, when they’re rotated manually), we use lifecycle blocks to prevent Terraform from attempting to revert the changes:
```hcl
lifecycle {
  ignore_changes = [
    value,
    tags
  ]
}
```
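In context, the block sits on the secret resource inside the shared az-keyvault-secret module. A sketch of how that fits together (resource and variable names assumed):

```hcl
resource "azurerm_key_vault_secret" "secret" {
  name         = var.SECRET_NAME
  value        = var.SECRET_VALUE
  key_vault_id = var.KEY_VAULT_ID

  # After initial creation, external rotation wins: Terraform ignores
  # drift on the value and tags instead of reverting it on apply
  lifecycle {
    ignore_changes = [
      value,
      tags
    ]
  }
}
```

This lets Terraform own the secret's existence while operators own its current value.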
2. Managed Identities for Authentication
Whenever possible, we use system-assigned managed identities rather than service principals:
```hcl
identity {
  type = "SystemAssigned"
}
```
This eliminates the need to manage credentials and reduces the risk of secret leakage. Services access resources using RBAC assignments:
```hcl
resource "azurerm_key_vault_access_policy" "elt_kv_access_policy" {
  key_vault_id = data.azurerm_key_vault.kv_infra_shared.id
  tenant_id    = azurerm_data_factory.adf_elt_pipeline.identity[0].tenant_id
  object_id    = azurerm_data_factory.adf_elt_pipeline.identity[0].principal_id

  secret_permissions = [
    "Get", "List"
  ]
}
```
3. Outbound-Only Connectivity Model
Our hybrid connectivity architecture uses outbound-only connections, eliminating the need for inbound firewall rules and reducing the attack surface. This simplifies network security by focusing on authentication rather than complex network segmentation.
CI/CD Pipeline Integration
Our Terraform implementation is fully integrated with Azure DevOps pipelines. The main pipeline orchestrates the entire deployment process:
```yaml
stages:
  - stage: az_bootstrap
    displayName: Bootstrap infrastructure
    jobs:
      - template: pipelines/az-bootstrap.yml@shared
  - stage: infra_deploy
    displayName: Deploy infrastructure
    jobs:
      - template: azure-pipeline.yml@infra_deploy
  - stage: dbt_docker_image
    displayName: Build and push dbt container image
    jobs:
      - template: azure-pipeline.yml@image_dbt
  # Additional stages...
```
Each repository contains its own pipeline template that defines the specific deployment steps:
```yaml
steps:
  - task: TerraformTaskV4@4
    displayName: terraformInit
    # Configuration...
  - task: TerraformTaskV4@4
    displayName: terraformValidate
    # Configuration...
  - task: TerraformTaskV4@4
    displayName: terraformPlan
    # Configuration...
  - task: TerraformTaskV4@4
    displayName: terraformApply
    # Configuration...
```
Environment-specific variables are managed through Azure DevOps variable groups, which can optionally be linked to Key Vault for secure storage of sensitive values.
One particularly powerful feature of Azure DevOps is the ability to reference pipeline templates from other repositories. This allows us to maintain a clean separation of concerns while reusing common pipeline patterns.
Azure Data Factory Pipeline Deployment
For Azure Data Factory pipelines, we’ve implemented a hybrid approach that combines the best of visual design and Infrastructure as Code:
- Design and iterate on pipelines using the Azure Data Factory visual editor
- Export the pipeline definition to JSON once finalized
- Store the JSON in the repository and deploy via Terraform:
```hcl
resource "azurerm_data_factory_pipeline" "product_pipeline_daily" {
  name            = "${var.NAME}-daily"
  data_factory_id = data.azurerm_data_factory.adf_elt_pipeline.id

  activities_json = <<JSON
${jsonencode(local.product_pipeline_daily_json.properties.activities)}
JSON
}
```
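The `local.product_pipeline_daily_json` reference implies that the exported JSON file is parsed nearby. A sketch of that glue, with the file path being an assumption:

```hcl
locals {
  # Parse the pipeline definition exported from the ADF visual editor,
  # checked into the repository alongside the Terraform code
  product_pipeline_daily_json = jsondecode(
    file("${path.module}/pipelines/product-pipeline-daily.json")
  )
}
```

Because the JSON is version-controlled, a pipeline change shows up in code review like any other infrastructure change.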
This approach gives us the ease of use of visual tools for the development process, while still maintaining the benefits of infrastructure as code for deployment and governance.
Feature flags allow us to control pipeline behavior across environments without changing the underlying code:
```hcl
resource "azurerm_data_factory_trigger_schedule" "product_pipeline_daily_trigger" {
  name            = "Out once every workday"
  data_factory_id = data.azurerm_data_factory.adf_elt_pipeline.id
  pipeline_name   = azurerm_data_factory_pipeline.product_pipeline_daily.name
  activated       = var.ADF_PIPELINE_TRIGGER_ACTIVE == "true"
  # Schedule configuration...
}
```
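The flag itself is just an environment-specific variable value, so Play can keep triggers dormant while Production runs on schedule. A sketch of the per-environment settings (file locations assumed):

```hcl
# euc-play/terraform.tfvars
ADF_PIPELINE_TRIGGER_ACTIVE = "false"

# euc-prod/terraform.tfvars
ADF_PIPELINE_TRIGGER_ACTIVE = "true"
```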
Lessons Learned and Best Practices
Through implementing this architecture, I’ve learned several valuable lessons:
Keep It Simple, Stupid (KISS)
There’s an unfortunate tendency in engineering to prematurely optimize, leading to overly complicated code without a clear need. I’ve found that starting with simple, straightforward implementations and evolving them as real requirements emerge leads to more maintainable infrastructure.
Test Your Infrastructure Code
As mentioned earlier, configurations that worked in Play didn’t always deploy cleanly to Production. An intermediate Test environment catches these discrepancies before they reach business-critical systems, and that validation step is just as valuable for infrastructure code as it is for application code.
Best of Both Worlds for Tooling
For Azure Data Factory, combining visual design tools with infrastructure as code deployment gives us the best of both worlds - ease of development with the governance benefits of IaC.
Domain-Driven Infrastructure
Organizing our Terraform code and pipelines along business domain boundaries has proven effective. Each data domain has its own repositories and pipelines, which aligns our technical organization with the business structure.
Conclusion and Next Steps
Infrastructure as Code is foundational for any cloud platform. It provides consistency, auditability, and repeatability that would be impossible to achieve with manual processes. Terraform’s declarative approach and the mature Azure provider have made implementation straightforward, allowing us to focus on solving the more challenging aspects of hybrid connectivity.
In my next article, I’ll explore how we’ve applied Domain-Driven Design principles to our data engineering practice. I’ll share how we’ve structured our data factories and transformations along business domain boundaries.
Until then, I encourage you to evaluate your own data infrastructure and consider whether an evolutionary approach to Infrastructure as Code might benefit your organization. The goal isn’t to create the perfect architecture from day one, but to establish a foundation that can evolve with your needs while maintaining operational stability and security.
