Series Overview
This is Part 3 of our series on building a hybrid data platform. If you’re joining mid-series, here is the full series for context:
- Part 1: From Something-with-Data to Data-as-a-Product - Medallion architecture and business transformation
- Part 2: Infrastructure as Code Foundation with Terraform - IaC patterns and module design
- Part 3: Domain-Driven Design for Data Engineering - Source system separation and Conway’s Law
- Part 4: Hybrid Connectivity Architecture - Integration runtimes and Azure Relay Bridge
- Part 5: Extract and Load Pipeline Evolution - Four-pipeline pattern and deletion detection
- Part 6: Data Transformation Architecture - Dual-track approach with dbt and analyst SQL
- Part 7: CI/CD as Organizational Strategy - Selective deployment and complexity placement
- Part 8: DATEV Integration Patterns - Hardcoding Clients and Embracing Failure
- Part 9: Integrating Product Telemetry - Integrating Open Telemetry Into Unified Analytics
- Part 10: RevOps Funnel Analytics - Building Bowtie GTM Metrics
Introduction
In my previous articles, I walked you through our hybrid data platform architecture and the Infrastructure as Code foundation that supports it. Today, I want to tackle something that’s often overlooked in data engineering discussions: how to build data systems that can evolve with your organization, even when you can’t predict exactly how that evolution will happen.
Conway’s Law states that “organizations design systems that mirror their own communication structure.” Most engineering teams fight this principle, trying to create “ideal” architectures that ignore organizational reality. I’ve learned to embrace it instead – not by predicting exactly how our organization will change, but by designing for change itself.
This is where Domain-Driven Design principles become invaluable, but not in the way most people think. It’s not about creating perfect bounded contexts from day one. It’s about building evolutionary architecture that makes future organizational changes easy to accommodate.
From Somewhere to Source System Boundaries
Let me start with where we were. Our initial data infrastructure was scattered across multiple Azure Data Factories and scripts with no clear logic about what ran where. When schema changes happened, we’d discover dependencies in unexpected places. Data pipelines would mysteriously break when we updated data structures. The blast radius of any change was unpredictable.
The key insight wasn’t about predicting the future, but about limiting blast radius and making dependencies explicit. I decided to separate our data pipelines by source system: Salesforce, Product, Datev, and other external systems would each get their own Azure Data Factory.
This wasn’t based on organizational boundaries that existed, but on a simple principle: schema changes happen by source system. When Salesforce adds a new field or changes a data type, I want that change isolated to pipelines that actually process Salesforce data.
Here’s our current structure:
Repositories:
├── infra_deploy/       # Shared infrastructure (one per environment)
├── elt_template/       # Blueprint for ADF + dbt pipeline
├── sfdc_pipeline/      # Salesforce-specific ADF
├── product_pipeline/   # Product telemetry ADF
├── datev_pipeline/     # Datev-specific ADF
└── image_dbt/          # All dbt models in one Docker image
    └── models/
        ├── sfdc-ingentis_dwh/     # Salesforce transforms
        ├── product-ingentis_dwh/  # Product data transforms
        └── datev-ingentis_dwh/    # Financial integration
Each source system gets its own repository and ADF, but they all share infrastructure and dbt models. This gives us the benefits of isolation without the overhead of completely separate deployments.
The ELT Template Pattern: Building for Evolution
The key insight was creating a reusable template that could be instantiated for each source system. Our elt_template contains the blueprint for a complete ELT pipeline:
module "elt_template" {
  source = "../../../shared/terraform/modules/elt-template"

  PROJECT                         = var.PROJECT
  NAME                            = var.NAME # Source system identifier
  LOCATION                        = var.LOCATION
  ENVIRONMENT                     = var.ENVIRONMENT
  DBT_CONTAINER_NAME              = var.DBT_CONTAINER_NAME
  DBT_CONTAINER_TAG               = var.DBT_CONTAINER_TAG
  DWH_DATABASE_HOST               = var.DWH_DATABASE_HOST
  SELF_HOSTED_SHARED_RUNTIME_NAME = var.SELF_HOSTED_SHARED_RUNTIME_NAME
  AZBRIDGE_LOCAL_SECRET           = var.AZBRIDGE_LOCAL_SECRET
}
Each instantiation creates:
- A dedicated Azure Data Factory for that source system
- Container group with dbt and Azure Bridge containers
- Linked services to shared infrastructure (integration runtime, Key Vault, storage)
- Independent CI/CD pipeline
The beauty of this approach is that adding a new source system is trivial – just create a new repository, fork the template, and you’re ready to go. The cost of separation now is minimal compared to trying to split a complex, monolithic system later.
Solving the Deletion Detection Problem
One of our biggest early challenges was handling deletions in delta updates. When you’re only loading changed records, how do you detect when something was deleted? This led to gradual data drift as deletions accumulated over time without being reflected in our data warehouse.
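To make the drift concrete, here’s a toy sketch (plain Python dicts standing in for source and warehouse tables, not our actual pipeline code) of how a deletion slips past delta loads until a full reload catches it:

```python
# A record deleted at the source survives in the warehouse, because
# delta loads only see rows that changed since the last watermark.
source = {1: "Acme", 2: "Globex"}
warehouse = dict(source)   # initial full load

del source[1]              # deletion at the source...
changed_rows = {}          # ...produces no "changed" rows to load
warehouse.update(changed_rows)

assert 1 in warehouse      # drift: the deleted row lingers

warehouse = dict(source)   # daily truncate + full reload
assert 1 not in warehouse  # drift is bounded to a single day
```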
Our solution evolved into a pragmatic three-pipeline pattern for each source system:
Init Pipeline (Manual)
Triggered when schema changes occur:
- Drops and recreates table schemas
- Creates or alters views and stored procedures
- Runs the daily pipeline to populate with fresh data
Daily Pipeline (Morning)
Runs first thing each morning:
- Truncates target tables
- Performs full reload to catch any deletions
- Updates financial KPIs and gold layer calculations
Hourly Pipeline (Throughout Day)
Runs every hour for delta updates:
- Uses dynamic lookup to get last modified timestamp
- Loads only changed records since last run
- Handles real-time operational needs
Here’s our delta lookup pattern:
SELECT
  CASE
    WHEN COUNT(*) = 0 THEN '2010-01-01T00:00:00Z'
    ELSE CONVERT(varchar(32), MAX(AC_LastModifiedDate), 127) + 'Z'
  END AS AC_LastModifiedDate
FROM raw.SF_Account
If the table is empty (first run), we use an early date to capture everything. Otherwise, we use the latest modification timestamp from existing data. The corresponding copy activity then uses this timestamp:
Where LastModifiedDate >= @{activity('DWH_AccountLastModified').output.firstRow.AC_LastModifiedDate}
This pattern works well, even though deletion tracking varies by source system. For instance, Salesforce main objects do support includeDeletedObjects: true, but subscription data for MRR calculations doesn’t. This is why we still need the daily full reload – it’s our most robust way to limit eventual drift to a single day.
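The watermark logic above can be sketched in Python – an illustration of the SQL lookup, not code we actually run:

```python
from datetime import datetime

FALLBACK = "2010-01-01T00:00:00Z"

def delta_watermark(timestamps):
    """Return the timestamp the hourly copy activity filters on:
    the latest LastModifiedDate seen so far, or an early fallback
    on the first run so the initial load captures everything."""
    if not timestamps:  # empty raw table: the COUNT(*) = 0 branch
        return FALLBACK
    # mirrors CONVERT(varchar(32), MAX(...), 127) + 'Z' (ISO 8601, UTC)
    return max(timestamps).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
```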
Matching Data Freshness to Business Requirements
The three-pipeline pattern also reflects different business requirements for data freshness and consistency:
Financial KPIs: Daily Consistency
Our financial controlling data cannot accept drift. MRR forecasts, revenue recognition, and compliance reporting must be accurate and consistent. But daily updates are sufficient – we don’t need real-time financial KPIs.
The daily pipeline handles all financial calculations and is the only pipeline that updates the gold layer for financial data. This ensures that any query on financial KPIs throughout the day returns consistent results based on the morning’s data snapshot.
Operational Systems: Hourly Updates
Customer portal access, license management, and business dashboards need faster updates. When a customer purchases our product, they expect to access the customer portal within a reasonable time.
The hourly pipeline updates the bronze layer and triggers dbt transformations for operational gold layer data. In our B2B context of contracts, legal reviews, and formal order processes, a maximum one-hour delay is perfectly acceptable.
The Right Tool for the Job: dbt vs. Direct SQL
One important lesson was recognizing when dbt is the right tool and when it isn’t. We use dbt for two main purposes:
Bronze to Silver Transformations:
- Data validation and cleansing
- Standardized pre-aggregations
- Repeatable transformation patterns
Operational Gold Layer:
- Business system outputs
- Customer portal data
- License management feeds
- Standard reporting views
But for complex financial calculations authored by our data analyst, we found that forcing everything through dbt was counterproductive. Converting sophisticated SQL scripts into dbt models was error-prone and time-consuming, and it slowed us down without adding any benefit other than making change painful.
Our pragmatic solution: store the analyst’s SQL scripts directly in Azure Data Factory script actions. This gives us versioning through ADF while allowing the analyst to write optimized SQL without translation overhead.
So we actually have two types of gold layer:
- Operational gold (dbt-generated) for business systems
- Financial gold (SQL script-generated) for KPIs and controlling
Container-Based Model Selection
All our dbt models live in a single Docker image, but model selection happens through environment variables set in the container instance:
container {
  name  = var.DBT_CONTAINER_NAME
  image = "${data.azurerm_container_registry.cr_infra_shared.login_server}/${var.DBT_CONTAINER_NAME}:${var.DBT_CONTAINER_TAG}"

  environment_variables = {
    "ENV_DBT_MODEL"         = "${var.NAME}" # Source system identifier
    "ENV_DWH_DATABASE_HOST" = "${local.dbt_db_host}"
    "ENV_DWH_DATABASE_NAME" = "${local.dbt_db_database}"
  }
}
Each ADF triggers its own container instance with the appropriate model selection. This gives us the benefits of a single dbt codebase while maintaining execution isolation by source system.
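Conceptually, the container entrypoint only has to translate that environment variable into a dbt selector. A minimal sketch – an assumed entrypoint for illustration, not the actual contents of our image:

```python
import os

def dbt_invocation(environ=os.environ):
    """Build the dbt command for one source system: a single image
    holds all models, but ENV_DBT_MODEL narrows execution to that
    system's model directory (e.g. 'sfdc' -> models/sfdc-ingentis_dwh)."""
    source = environ["ENV_DBT_MODEL"]  # "sfdc", "product", or "datev"
    return ["dbt", "run", "--select", f"models/{source}-ingentis_dwh"]
```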
Evolutionary Architecture in Practice
The key principle underlying all these decisions is evolutionary architecture – designing for change without trying to predict exactly what those changes will be.
Building for the Unknown
Right now, our data team owns everything. But I’ve structured the architecture anticipating future organizational evolution:
- sfdc-ingentis_dwh could migrate to a Sales Operations team
- product-ingentis_dwh could be owned by Product Development
- datev-ingentis_dwh could belong to Finance
- Each has independent deployment pipelines ready for delegation
I don’t know if or when these transitions will happen, but the cost of enabling them now is trivial compared to splitting a monolithic system later.
Managing Dependencies Explicitly
We eliminated implicit dependencies that made the system brittle. Previously, some models would join across databases from the same host, creating undocumented dependencies between source systems.
Our rule now: if you need data from another source, load it into the bronze layer first. This makes dependencies explicit and traceable. If the product pipeline needs customer data, it must extract and load that data explicitly rather than joining across databases.
Time-Triggered Orchestration
Instead of building complex orchestration, we use simple time-triggered sequences. Each source system pipeline runs on its own schedule, with implicit dependencies managed through timing:
- 5 AM: Salesforce daily pipeline (full customer data reload)
- 6 AM: Product daily pipeline (can depend on fresh customer data)
- 7 AM: Financial controlling (all source data refreshed)
This works well for our current scale and avoids over-engineering orchestration that we might not need.
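The staggered timing above is the entire dependency mechanism. As a toy illustration (hypothetical names, not our actual ADF trigger definitions):

```python
from datetime import time

# Staggered triggers: each pipeline starts after the ones whose
# data it implicitly depends on.
SCHEDULE = {
    "sfdc_daily": time(5, 0),
    "product_daily": time(6, 0),
    "financial_controlling": time(7, 0),
}

def timing_satisfies(downstream, upstream):
    """The only 'orchestration' rule: a downstream pipeline sees fresh
    upstream data simply by being scheduled to run after it."""
    return SCHEDULE[downstream] > SCHEDULE[upstream]
```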
Lessons Learned: Conway’s Law as Feature
The biggest lesson has been treating Conway’s Law as a feature, not a bug. Fighting organizational structure in your technical architecture creates unnecessary friction. Instead:
Start with Logical Boundaries
Source system separation was our logical first step because schema changes naturally happen by source system. This creates immediate operational benefits while setting up future organizational transitions.
Make Change Easy, Not Perfect
Evolutionary architecture isn’t about building the perfect end state. It’s about making future changes easy. The template pattern means adding new source systems is trivial. The separation means delegating ownership is straightforward when the time comes.
Balance Pragmatism with Principles
We could have built more sophisticated orchestration, forced everything through dbt, or created perfect domain boundaries. Instead, we chose solutions that work well now and can evolve later: time triggers, direct SQL for complex calculations, and team boundaries that are anticipated but not yet enforced.
Informal Coordination Can Work
With a small team, informal coordination between ADF schema changes and dbt model updates works fine. As we grow, we can formalize these processes, but over-engineering coordination now would slow us down without clear benefits.
Conclusion and Looking Forward
Domain-driven design for data engineering isn’t about creating perfect boundaries from day one. It’s about working with organizational reality to create systems that can evolve naturally with business needs.
By separating along source system boundaries, we’ve reduced blast radius from schema changes, clarified deployment ownership, and prepared for future evolution. The template-based approach means we can easily add new source systems or delegate ownership when teams are ready.
Most importantly, we’ve learned that evolutionary architecture is about making change easy, not predicting exactly what those changes will be. Our data platform can now evolve with our organization rather than constraining it.
In my next article, I’ll dive into the technical details of our hybrid connectivity architecture. I’ll share how we implemented the self-hosted integration runtime sharing pattern and the container-based coordination between dbt and Azure Relay Bridge that makes this architecture possible.
The goal isn’t to create perfect domain boundaries from day one, but to establish an architecture that can evolve with your organization while maintaining operational stability and clear separation of concerns.
