Series Overview
This is Part 2 of our series on building a hybrid data platform. If you’re joining mid-series, here is the full roadmap:
- Part 1: From Something-with-Data to Data-as-a-Product - Medallion architecture and business transformation
- Part 2: Infrastructure as Code Foundation with Terraform - IaC patterns and module design
- Part 3: Domain-Driven Design for Data Engineering - Source system separation and Conway’s Law
- Part 4: Hybrid Connectivity Architecture - Integration runtimes and Azure Relay Bridge
- Part 5: Extract and Load Pipeline Evolution - Four-pipeline pattern and deletion detection
- Part 6: Data Transformation Architecture - Dual-track approach with dbt and analyst SQL
- Part 7: CI/CD as Organizational Strategy - Selective deployment and complexity placement
- Part 8: DATEV Integration Patterns - Hardcoding Clients and Embracing Failure
- Part 9: Integrating Product Telemetry - Bringing OpenTelemetry Into Unified Analytics
- Part 10: RevOps Funnel Analytics - Building Bowtie GTM Metrics
Introduction
In my previous article, I introduced our hybrid data platform architecture that combines on-premises SQL Server with Azure Data Factory and dbt. We explored the medallion architecture pattern and the business value it delivers. As promised, this second installment dives into how we manage our infrastructure using Terraform.
One of my core principles is that modern infrastructure should be treated as code – versioned, tested, and deployed through automated pipelines. Manual configuration of cloud resources is a recipe for inconsistency, security vulnerabilities, and operational headaches.
Our journey to Infrastructure as Code (IaC) wasn’t optional; it was a necessity driven by three key factors:
- Consistency: Each environment (play, test, production) needed to be identical in structure, differing only in scale and specific configuration values.
- Auditability: Every change to our infrastructure needed to be documented, reviewed, and traceable.
- Repeatability: The ability to recreate environments from scratch or recover from disaster scenarios quickly.
In this article, I’ll walk you through our Terraform implementation, share our module architecture, and provide practical examples of how we provision Azure resources for our hybrid data platform.
Terraform Fundamentals for Our Azure Data Platform
Before diving into specific modules, let’s establish the foundation of our Terraform setup. If you’re already familiar with Terraform, you can skim this section – but I’ve found that even experienced teams can benefit from revisiting fundamentals.
Project Organization
Our repository structure follows a pattern that separates generic modules from environment-specific configurations:
```
terraform/
├── modules/             # Reusable, parameterized modules
│   ├── az-datafactory/  # Azure Data Factory modules
│   ├── az-keyvault/     # Key Vault modules
│   ├── az-relay/        # Azure Relay modules
│   ├── az-storage/      # Storage Account modules
│   └── az-container/    # Container Registry modules
├── euc-play/            # Development environment
├── euc-test/            # Testing environment
└── euc-prod/            # Production environment
```
We also have shared modules in a separate repository for higher-level patterns:
```
shared/terraform/modules/
├── az-datafactory-db-datasets/  # Dataset creation
├── az-keyvault-secret/          # Secret management
├── az-container/                # Container configuration
└── elt-template/                # Complete ELT pipeline template
```
This structure enables us to maintain a clean separation between reusable components and their specific implementations. Each environment directory contains a complete Terraform configuration that references the shared modules with environment-specific parameters.
State Management Strategy
A critical decision in any Terraform implementation is how to manage state. For our setup, we use Azure Storage as a remote backend. Each environment has its own state files, completely isolated from others. Before any Terraform operations, we run a bootstrap pipeline that creates the necessary resource group and storage account for state management:
```yaml
- task: AzureCLI@2
  displayName: Create resource group
  inputs:
    azureSubscription: '${{parameters.serviceConnection}}'
    scriptType: 'bash'
    scriptLocation: 'inlineScript'
    inlineScript: 'az group create -l ${LOCATION} -n $(PROJECT)$(REGION)$(ENVIRONMENT)'

- task: AzureCLI@2
  displayName: Create storage account
  inputs:
    azureSubscription: '${{parameters.serviceConnection}}'
    scriptType: 'bash'
    scriptLocation: 'inlineScript'
    inlineScript: 'az storage account create -n $(STORAGE_ACCOUNT_NAME) -g $(PROJECT)$(REGION)$(ENVIRONMENT)'
```
Once the storage infrastructure is in place, our Terraform configurations use it for state management:
```yaml
- task: TerraformTaskV4@4
  displayName: terraformInit
  inputs:
    provider: 'azurerm'
    command: 'init'
    workingDirectory: '$(System.DefaultWorkingDirectory)/infra_deploy/terraform/$(REGION)-$(ENVIRONMENT)'
    backendServiceArm: '${{parameters.serviceConnection}}'
    backendAzureRmResourceGroupName: $(PROJECT)$(REGION)$(ENVIRONMENT)
    backendAzureRmStorageAccountName: $(STORAGE_ACCOUNT_NAME)
    backendAzureRmContainerName: $(STORAGE_CONTAINER_NAME)
    backendAzureRmKey: '$(Build.Repository.Name)-$(TF_VAR_NAME).tfstate'
```
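On the Terraform side, the backend block stays nearly empty: a partial configuration whose concrete values the pipeline supplies at `init` time. A sketch of what this looks like in each environment directory (file name assumed):

```hcl
# backend.tf -- partial backend configuration; the pipeline injects the
# resource group, storage account, container, and state key at init time
terraform {
  backend "azurerm" {}

  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
    }
  }
}
```

Keeping the backend values out of the code is what lets the same configuration target a different state file per environment.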
This isolation serves two purposes:
- It prevents accidental changes to production when working on development environments
- It allows different teams to work on different environments concurrently
Module Design Principles
When designing Terraform modules, I follow these core principles:
- Single Responsibility: Each module does one thing well
- Sensible Defaults: Modules work with minimal configuration
- Complete Documentation: Each variable is well-documented
- Consistent Outputs: Output formats are consistent across modules
- Standard Structure: All modules follow the same file organization (main.tf, variables.tf, outputs.tf)
Let’s look at a simplified example of our Data Factory module to illustrate these principles:
```hcl
locals {
  resource_group_name = "${var.PROJECT}${var.REGION}${var.ENVIRONMENT}"
}

data "azurerm_resource_group" "main" {
  name = local.resource_group_name
}

# Data Factory
resource "azurerm_data_factory" "adf_infra_shared" {
  name                = "${local.resource_group_name}-adf${var.TYPE}"
  location            = var.LOCATION
  resource_group_name = local.resource_group_name

  identity {
    type = "SystemAssigned"
  }

  tags = {
    PURPOSE = var.NAME
    OWNER   = var.OWNER
  }
}

# Data Factory self-hosted integration runtime
resource "azurerm_data_factory_integration_runtime_self_hosted" "integration_runtime" {
  name            = var.SELF_HOSTED_SHARED_RUNTIME_NAME
  data_factory_id = azurerm_data_factory.adf_infra_shared.id
}
```
This module encapsulates the creation of an Azure Data Factory instance with a self-hosted integration runtime. It follows our standard naming conventions and creates a system-assigned managed identity for authentication.
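The accompanying variables.tf is where the Sensible Defaults and Complete Documentation principles show up. A trimmed sketch of what such a file looks like (the descriptions and default value are illustrative, not the literal file):

```hcl
variable "PROJECT" {
  description = "Short project prefix used in all resource names"
  type        = string
}

variable "TYPE" {
  description = "Suffix distinguishing the factory's purpose, e.g. a shared base factory"
  type        = string
  default     = "shared"
}

variable "SELF_HOSTED_SHARED_RUNTIME_NAME" {
  description = "Name of the self-hosted integration runtime owned by this factory"
  type        = string
}
```

Typed, documented variables make `terraform plan` errors self-explanatory and let the module work with minimal configuration.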
Building a Data Pipeline with Shared Modules
Now let’s see how we compose multiple modules to create a complete data pipeline. Here’s a simplified example from one of our environment configurations:
```hcl
module "azure_datafactory" {
  source = "../modules/az-datafactory"

  PROJECT                         = var.PROJECT
  NAME                            = var.NAME
  LOCATION                        = var.LOCATION
  ENVIRONMENT                     = var.ENVIRONMENT
  REGION                          = var.REGION
  OWNER                           = var.OWNER
  SELF_HOSTED_SHARED_RUNTIME_NAME = var.SELF_HOSTED_SHARED_RUNTIME_NAME
}

module "key_vault" {
  source = "../modules/az-keyvault"

  PROJECT     = var.PROJECT
  NAME        = var.NAME
  LOCATION    = var.LOCATION
  ENVIRONMENT = var.ENVIRONMENT
  REGION      = var.REGION
  OWNER       = var.OWNER
}

module "integration_runtime_secret" {
  source = "../../../shared/terraform/modules/az-keyvault-secret"

  PROJECT      = var.PROJECT
  NAME         = var.NAME
  LOCATION     = var.LOCATION
  ENVIRONMENT  = var.ENVIRONMENT
  REGION       = var.REGION
  OWNER        = var.OWNER
  SECRET_NAME  = var.SELF_HOSTED_SHARED_RUNTIME_SECRET
  SECRET_VALUE = module.azure_datafactory.integration_runtime_self_hosted_primary_key

  depends_on = [
    module.azure_datafactory,
    module.key_vault
  ]
}
```
This pattern creates the core infrastructure components and automatically stores the generated integration runtime key in Key Vault. Notice how we reference outputs from one module as inputs to another, creating a dependency chain that Terraform manages for us.
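For this chain to work, the az-datafactory module has to expose the runtime's authentication key as an output. A sketch of the relevant outputs.tf entries, with names inferred from the usage above:

```hcl
output "integration_runtime_self_hosted_primary_key" {
  description = "Primary key used to register on-premises runtime nodes"
  value       = azurerm_data_factory_integration_runtime_self_hosted.integration_runtime.primary_authorization_key
  sensitive   = true
}

output "data_factory_id" {
  description = "Resource ID of the shared data factory"
  value       = azurerm_data_factory.adf_infra_shared.id
}
```

Marking the key `sensitive` keeps it out of plan output while still letting downstream modules consume it.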
Three-Environment Deployment Strategy
Our data platform uses three distinct environments:
- Play: Development environment for building and testing new features
- Test: Validation environment to ensure configurations work correctly
- Prod: Production environment for business operations
Each environment is completely self-contained with its own resources, state files, and secrets. This isolation is intentional - it prevents cross-environment dependencies and ensures that we can evolve each environment independently as needed.
I’ve learned from experience that while striving for identical environments is the goal, differences inevitably emerge. One hard lesson was discovering that configurations developed in Play didn’t always deploy cleanly to Production. This highlighted the importance of having a Test environment as a validation step, even for infrastructure code.
Our approach to environment management follows an evolutionary pattern rather than trying to over-engineer from the start. While we currently don’t have formal approval gates between environments, our isolated environment design would make adding them straightforward when needed.
Hybrid Connectivity Architecture
The most challenging aspect of our implementation wasn’t Terraform itself but establishing reliable connectivity between cloud services and on-premises databases. We solved this with two complementary approaches:
Self-hosted Integration Runtimes for Azure Data Factory
For Azure Data Factory, we use self-hosted integration runtimes installed on-premises:
```hcl
resource "azurerm_data_factory_integration_runtime_self_hosted" "contoso_dwh_runtime" {
  name            = var.SELF_HOSTED_SHARED_RUNTIME_NAME
  data_factory_id = azurerm_data_factory.adf_infra_shared.id
}
```
These runtimes establish outbound connections to Azure and allow ADF to interact with on-premises SQL Server without requiring inbound firewall rules.
To minimize on-premises infrastructure, we use a shared runtime model. Each environment has a single base Azure Data Factory that owns the self-hosted runtime, which is then shared with domain-specific data factories through RBAC:
```hcl
resource "azurerm_role_assignment" "role_assignment_self_hosted_runtime" {
  scope                = data.azurerm_resource_group.adf_rg_infra_shared.id
  role_definition_name = "Contributor"
  principal_id         = azurerm_data_factory.adf_elt_pipeline.identity[0].principal_id
}
```
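With the role assignment in place, each domain factory declares a linked runtime that points back at the shared one through its ARM resource ID. A sketch, assuming the variable name for the shared runtime's resource ID:

```hcl
resource "azurerm_data_factory_integration_runtime_self_hosted" "linked_runtime" {
  name            = var.SELF_HOSTED_SHARED_RUNTIME_NAME
  data_factory_id = azurerm_data_factory.adf_elt_pipeline.id

  # Link to the runtime owned by the shared base factory instead of
  # registering new on-premises nodes for this factory
  rbac_authorization {
    resource_id = var.SHARED_RUNTIME_RESOURCE_ID
  }
}
```

One on-premises runtime installation thus serves every domain factory in an environment.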
Azure Relay Bridge for dbt Connectivity
For dbt running in Azure Container Instances, we use the open-source Azure Relay Bridge to establish secure connectivity to on-premises databases:
```hcl
resource "azurerm_container_group" "container_instance" {
  name                = local.dbt_container_instance_name
  location            = var.LOCATION
  resource_group_name = azurerm_resource_group.rg_elt_pipeline.name

  # Container group configuration...

  container {
    name  = var.DBT_CONTAINER_NAME
    image = "${data.azurerm_container_registry.cr_infra_shared.login_server}/${var.DBT_CONTAINER_NAME}:${var.DBT_CONTAINER_TAG}"
    # dbt container config...
  }

  container {
    name  = var.AZBRIDGE_CONTAINER_NAME
    image = "${data.azurerm_container_registry.cr_infra_shared.login_server}/${var.AZBRIDGE_CONTAINER_NAME}:${var.AZBRIDGE_CONTAINER_TAG}"
    # azbridge container config...
  }
}
```
This creates a sidecar container pattern: the azbridge container establishes connectivity to on-premises databases via Azure Relay, and the dbt container routes its database connections through it. The two containers communicate via a shared volume mounted from Azure Storage, using a file-based semaphore system for coordination.
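The shared volume the semaphore files live on is itself declared in Terraform. A sketch of the volume fragment inside one of the container blocks (the mount path and share variable names are assumptions):

```hcl
container {
  name  = var.DBT_CONTAINER_NAME
  image = "..."

  # Both containers mount the same Azure Files share; dbt waits for a
  # ready-file that the azbridge container writes once the tunnel is up
  volume {
    name                 = "shared-state"
    mount_path           = "/mnt/shared"
    share_name           = var.SHARED_FILE_SHARE_NAME
    storage_account_name = var.STORAGE_ACCOUNT_NAME
    storage_account_key  = var.STORAGE_ACCOUNT_KEY
  }
}
```

Mounting the same share into both containers gives them a coordination channel without any network dependency between them.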
Both approaches provide secure hybrid connectivity without requiring inbound firewall rules, making the solution more secure and easier to deploy within typical corporate network constraints.
Security Implementation
Security is a fundamental concern for any data platform. Our approach focuses on three key areas:
1. Secret Management with Key Vault
All sensitive information is stored in Azure Key Vault, with each environment having its own isolated vault. Terraform automatically stores generated secrets during deployment:
```hcl
module "azure_relay_local_secret" {
  source = "../../../shared/terraform/modules/az-keyvault-secret"

  # Configuration...
  SECRET_NAME  = var.AZBRIDGE_LOCAL_SECRET
  SECRET_VALUE = module.azure_relay.relay_hybrid_connection_send_connection_string
}
```
When secrets need to be managed outside of Terraform (for example, when they’re rotated manually), we use lifecycle blocks to prevent Terraform from attempting to revert the changes:
```hcl
lifecycle {
  ignore_changes = [
    value,
    tags
  ]
}
```
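In context, the block sits on the secret resource inside the shared az-keyvault-secret module. A sketch of how that fits together (resource and variable names assumed):

```hcl
resource "azurerm_key_vault_secret" "secret" {
  name         = var.SECRET_NAME
  value        = var.SECRET_VALUE
  key_vault_id = var.KEY_VAULT_ID

  # After initial creation, external rotation wins: Terraform ignores
  # drift on the value and tags instead of reverting it on apply
  lifecycle {
    ignore_changes = [
      value,
      tags
    ]
  }
}
```

This lets Terraform own the secret's existence while operators own its current value.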
2. Managed Identities for Authentication
Whenever possible, we use system-assigned managed identities rather than service principals:
```hcl
identity {
  type = "SystemAssigned"
}
```
This eliminates the need to manage credentials and reduces the risk of secret leakage. Services access resources using RBAC assignments:
```hcl
resource "azurerm_key_vault_access_policy" "elt_kv_access_policy" {
  key_vault_id = data.azurerm_key_vault.kv_infra_shared.id
  tenant_id    = azurerm_data_factory.adf_elt_pipeline.identity[0].tenant_id
  object_id    = azurerm_data_factory.adf_elt_pipeline.identity[0].principal_id

  secret_permissions = [
    "Get", "List"
  ]
}
```
3. Outbound-Only Connectivity Model
Our hybrid connectivity architecture uses outbound-only connections, eliminating the need for inbound firewall rules and reducing the attack surface. This simplifies network security by focusing on authentication rather than complex network segmentation.
CI/CD Pipeline Integration
Our Terraform implementation is fully integrated with Azure DevOps pipelines. The main pipeline orchestrates the entire deployment process:
```yaml
stages:
  - stage: az_bootstrap
    displayName: Bootstrap infrastructure
    jobs:
      - template: pipelines/az-bootstrap.yml@shared
  - stage: infra_deploy
    displayName: Deploy infrastructure
    jobs:
      - template: azure-pipeline.yml@infra_deploy
  - stage: dbt_docker_image
    displayName: Build and push dbt container image
    jobs:
      - template: azure-pipeline.yml@image_dbt
  # Additional stages...
```
Each repository contains its own pipeline template that defines the specific deployment steps:
```yaml
steps:
  - task: TerraformTaskV4@4
    displayName: terraformInit
    # Configuration...
  - task: TerraformTaskV4@4
    displayName: terraformValidate
    # Configuration...
  - task: TerraformTaskV4@4
    displayName: terraformPlan
    # Configuration...
  - task: TerraformTaskV4@4
    displayName: terraformApply
    # Configuration...
```
Environment-specific variables are managed through Azure DevOps variable groups, which can optionally be linked to Key Vault for secure storage of sensitive values.
One particularly powerful feature of Azure DevOps is the ability to reference pipeline templates from other repositories. This allows us to maintain a clean separation of concerns while reusing common pipeline patterns.
Azure Data Factory Pipeline Deployment
For Azure Data Factory pipelines, we’ve implemented a hybrid approach that combines the best of visual design and Infrastructure as Code:
- Design and iterate on pipelines using the Azure Data Factory visual editor
- Export the pipeline definition to JSON once finalized
- Store the JSON in the repository and deploy via Terraform:
```hcl
resource "azurerm_data_factory_pipeline" "product_pipeline_daily" {
  name            = "${var.NAME}-daily"
  data_factory_id = data.azurerm_data_factory.adf_elt_pipeline.id

  activities_json = <<JSON
${jsonencode(local.product_pipeline_daily_json.properties.activities)}
JSON
}
```
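The `local.product_pipeline_daily_json` reference implies that the exported JSON file is parsed nearby. A sketch of that glue, with the file path being an assumption:

```hcl
locals {
  # Parse the pipeline definition exported from the ADF visual editor,
  # checked into the repository alongside the Terraform code
  product_pipeline_daily_json = jsondecode(
    file("${path.module}/pipelines/product-pipeline-daily.json")
  )
}
```

Because the JSON is version-controlled, a pipeline change shows up in code review like any other infrastructure change.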
This approach gives us the ease of use of visual tools for the development process, while still maintaining the benefits of infrastructure as code for deployment and governance.
Feature flags allow us to control pipeline behavior across environments without changing the underlying code:
```hcl
resource "azurerm_data_factory_trigger_schedule" "product_pipeline_daily_trigger" {
  name            = "Out once every workday"
  data_factory_id = data.azurerm_data_factory.adf_elt_pipeline.id
  pipeline_name   = azurerm_data_factory_pipeline.product_pipeline_daily.name
  activated       = var.ADF_PIPELINE_TRIGGER_ACTIVE == "true"
  # Schedule configuration...
}
```
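The flag itself is just an environment-specific variable value, so Play can keep triggers dormant while Production runs on schedule. A sketch of the per-environment settings (file locations assumed):

```hcl
# euc-play/terraform.tfvars
ADF_PIPELINE_TRIGGER_ACTIVE = "false"

# euc-prod/terraform.tfvars
ADF_PIPELINE_TRIGGER_ACTIVE = "true"
```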
Lessons Learned and Best Practices
Through implementing this architecture, I’ve learned several valuable lessons:
Keep It Simple, Stupid (KISS)
There’s an unfortunate tendency in engineering to prematurely optimize, leading to overly complicated code without a clear need. I’ve found that starting with simple, straightforward implementations and evolving them as real requirements emerge leads to more maintainable infrastructure.
Test Your Infrastructure Code
As mentioned earlier, configurations that worked in Play didn’t always deploy cleanly to Production. An intermediate Test environment catches these discrepancies before they reach business-critical systems, and that validation step is just as valuable for infrastructure code as it is for application code.
Best of Both Worlds for Tooling
For Azure Data Factory, combining visual design tools with infrastructure as code deployment gives us the best of both worlds - ease of development with the governance benefits of IaC.
Domain-Driven Infrastructure
Organizing our Terraform code and pipelines along business domain boundaries has proven effective. Each data domain has its own repositories and pipelines, which aligns our technical organization with the business structure.
Conclusion and Next Steps
Infrastructure as Code is foundational for any cloud platform. It provides consistency, auditability, and repeatability that would be impossible to achieve with manual processes. Terraform’s declarative approach and the mature Azure provider have made implementation straightforward, allowing us to focus on solving the more challenging aspects of hybrid connectivity.
In my next article, I’ll explore how we’ve applied Domain-Driven Design principles to our data engineering practice. I’ll share how we’ve structured our data factories and transformations along business domain boundaries.
Until then, I encourage you to evaluate your own data infrastructure and consider whether an evolutionary approach to Infrastructure as Code might benefit your organization. The goal isn’t to create the perfect architecture from day one, but to establish a foundation that can evolve with your needs while maintaining operational stability and security.
