Series Overview
This is Part 5 of our series on building a hybrid data platform. If you’re joining mid-series, here are the previous articles:
- Part 1: From Something-with-Data to Data-as-a-Product - Medallion architecture and business transformation
- Part 2: Infrastructure as Code Foundation with Terraform - IaC patterns and module design
- Part 3: Domain-Driven Design for Data Engineering - Source system separation and Conway’s Law
- Part 4: Hybrid Connectivity Architecture - Integration runtimes and Azure Relay Bridge
- Part 5: Extract and Load Pipeline Evolution - Four-pipeline pattern and deletion detection
- Part 6: Data Transformation Architecture - Dual-track approach with dbt and analyst SQL
- Part 7: CI/CD as Organizational Strategy - Selective deployment and complexity placement
- Part 8: DATEV Integration Patterns - Hardcoding Clients and Embracing Failure
- Part 9: Integrating Product Telemetry - Integrating Open Telemetry Into Unified Analytics
- Part 10: RevOps Funnel Analytics - Building Bowtie GTM Metrics
Introduction
In the first article of this series, I walked you through our medallion architecture and the domain-driven organization that shapes our data platform. The bronze layer for our raw data is where our extract and load patterns matter most. How you get data into that bronze layer, how often you refresh it, and how you handle the inevitable complexities of real-world APIs determines whether your medallion architecture delivers on its promise.
Today I want to share how our extract and load patterns evolved through production reality. This isn’t a story about getting it right from day one, but about how each iteration led us to refine our approach until we arrived at a pattern that balances data consistency with operational needs.
The medallion architecture creates clear data quality expectations: bronze for raw preservation, silver for cleansed transformation, and gold for business consumption. But the medallion pattern doesn’t tell you how often to refresh your bronze layer, or how to handle the trade-offs between data freshness and consistency. Those decisions emerge from understanding your business requirements and learning from production experience.
The Single Pipeline Era: Where It All Started
Like most data platform implementations, we started simple. One pipeline, running daily, pulling data from source systems into our bronze layer. The logic was straightforward: extract data, load it into our data warehouse, and let dbt handle the transformations through silver to gold.
This worked perfectly until we moved to production and the controlling team began validating the data. They noticed discrepancies between the financial reports built on the data warehouse and those from the previous database. It turned out that some subscriptions remained active in the DWH despite having been deleted in Salesforce. When we investigated, we discovered a fundamental flaw in our daily pipeline approach: deleted records were being retained indefinitely.
Our pipeline was performing upserts based on record IDs. When a record was deleted in the source system, it simply didn’t appear in the next day’s extract. The upsert operation had nothing to update, so the old record remained and continued to flow into our financial calculations. We were calculating based on stale data that included subscriptions that no longer existed.
This discovery forced us to confront a basic principle: incremental loading without deletion detection creates data drift over time. The medallion architecture’s bronze layer was supposed to be our system of record, but we were inadvertently creating a historical archive.
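To make that drift concrete, here is a minimal Python sketch (an in-memory stand-in for the bronze table, not our actual pipeline) showing how an upsert-only load quietly preserves records that no longer exist in the source:

```python
# Hypothetical in-memory "bronze table", keyed by record ID.
bronze = {}

def upsert(extract):
    """Upsert-only load: inserts and updates rows, never deletes."""
    for record in extract:
        bronze[record["id"]] = record

# Day 1: the source system contains two subscriptions.
upsert([{"id": 1, "status": "active"}, {"id": 2, "status": "active"}])

# Day 2: subscription 2 was deleted in the source, so it is simply
# absent from the extract -- the upsert never sees it.
upsert([{"id": 1, "status": "active"}])

# Record 2 is still in bronze and still flows into financial KPIs.
print(sorted(bronze))  # -> [1, 2]
```

Nothing in the upsert ever learns that record 2 is gone, which is exactly why incremental loading needs an explicit deletion-detection mechanism.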
The Daily vs Hourly Split: Balancing Consistency and Freshness
Solving the deletion problem led us to our first major evolution. We implemented a daily morning pipeline that would truncate all bronze layer tables and perform a full reload from source systems. This ensured that deletions were captured. If a record wasn’t in the source system, it wouldn’t be in our bronze layer after the morning refresh.
But this created a new challenge. Our business operations needed more current data than once-daily refreshes could provide. The backoffice team would enter new contracts into Salesforce, and customers expected to access their licenses and documentation through our customer portal within hours, not the next day. Support staff might modify a license during a phone call and want to tell the customer they could download the updated version within the hour.
This led us to implement hourly delta updates throughout the day. But here’s where we learned an important lesson about different data consistency requirements:
Financial data cannot tolerate drift. MRR forecasts, revenue recognition, and compliance reporting must be accurate and consistent. When the controlling team runs morning reports, those numbers need to be reliable throughout the day.
Operational data can accept some drift. If the customer portal shows yesterday’s license status for an hour or two, it’s inconvenient but not critical. The business processes that generate these changes typically play out over hours to days: An account is created in Salesforce, then enabled in the customer portal, then generates usage data that flows back to Salesforce.
This understanding led us to a key architectural decision: treat financial and operational data differently.
- Financial data: Daily full reload for consistency, no hourly updates
- Operational data: Daily full reload plus hourly deltas for freshness
The morning daily pipeline became our financial consistency baseline. All financial KPIs and projections are calculated from this stable, complete dataset. Throughout the day, hourly pipelines update operational tables with delta changes, providing the freshness needed for customer-facing systems while preserving financial data integrity.
Delta Logic: The Database Design Principle
Making delta updates work reliably requires a fundamental database design decision. As a matter of principle, when designing any long-lived database schema, always include a last_modified timestamp to enable delta syncs. Salesforce provides this with LastModifiedDate on every object, which is why it works so well for incremental loading. Many homegrown database schemas, unfortunately, do not include this basic requirement.
Our delta lookup pattern handles both full and incremental loads with the same logic:
SELECT
    CASE
        WHEN COUNT(*) = 0 THEN '2010-01-01T00:00:00Z'
        ELSE CONVERT(varchar(32), MAX(AC_LastModifiedDate), 127) + 'Z'
    END AS AC_LastModifiedDate
FROM raw.SF_Account
When the daily pipeline truncates the table, COUNT(*) = 0, so the logic returns a fixed early date that triggers a full download. When the hourly pipeline runs on a populated table, it returns the actual MAX(LastModifiedDate), enabling true delta processing.
This pattern eliminates the need for complex orchestration between daily and hourly pipelines. The table state determines the behavior: empty table equals full load, populated table equals delta load.
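The same state-driven branching can be sketched outside of SQL. This Python version (a hypothetical helper, not our production code) mirrors the lookup: an empty table yields the fixed early watermark that triggers a full download, a populated table yields its newest timestamp:

```python
FULL_LOAD_WATERMARK = "2010-01-01T00:00:00Z"

def delta_watermark(rows):
    """Return the LastModifiedDate watermark for the next extract.

    rows: ISO-8601 LastModifiedDate strings currently in the bronze
    table. Empty table (after the daily truncate) -> fixed early date,
    which triggers a full download; otherwise -> MAX(LastModifiedDate).
    """
    if not rows:
        return FULL_LOAD_WATERMARK
    return max(rows)  # ISO-8601 strings in UTC sort chronologically

# After the daily truncate: full load.
print(delta_watermark([]))  # -> 2010-01-01T00:00:00Z

# Hourly run on a populated table: true delta from the newest timestamp.
print(delta_watermark(["2024-05-01T08:00:00Z", "2024-05-01T09:30:00Z"]))
```

Because the branching lives in a single lookup, the hourly and daily pipelines can share the exact same extract logic.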
Financial Data Protection: Transaction-Level Integrity
The distinction between financial and operational data led to another important pattern. For financial data, we wrap all gold layer calculations in database transactions:
BEGIN TRY
    BEGIN TRANSACTION;
    -- Copy to gold layer
    -- Run KPI calculations
    -- Run forecasting scripts
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
END CATCH
If any part of the financial calculation fails, everything rolls back. We also record a separate timestamp for gold layer generation, so both the controlling team and I can easily check whether the calculations are current as of this morning or whether a stale timestamp indicates a pipeline failure.
This approach means that financial data could be outdated by a day in case of failures, but it will never be internally inconsistent. For a B2B business with formal order processes and contract reviews, this trade-off works well.
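The all-or-nothing behavior can be demonstrated with a small simulation (using SQLite in Python rather than our actual warehouse; table and value are invented for illustration): a failure in any step leaves the gold table exactly as it was before the run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gold_kpi (name TEXT, value REAL)")
conn.execute("INSERT INTO gold_kpi VALUES ('mrr', 100.0)")
conn.commit()

try:
    # "with conn" opens a transaction: commit on success, rollback on error.
    with conn:
        conn.execute("UPDATE gold_kpi SET value = 120.0 WHERE name = 'mrr'")
        raise RuntimeError("forecasting script failed")  # simulated failure
except RuntimeError:
    pass  # the pipeline would log this and alert

# The partial update was rolled back; yesterday's consistent value remains.
value = conn.execute(
    "SELECT value FROM gold_kpi WHERE name = 'mrr'"
).fetchone()[0]
print(value)  # -> 100.0
```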
Schema Evolution: The Development Reality
During development, we discovered another need that led to our third pipeline type. Database schemas evolve constantly during development: new fields, changed data types, additional tables. Each schema change required manual intervention to update target tables before running data pipelines.
This led us to implement an init pipeline that handles schema evolution:
- Drops and recreates table schemas
- Creates or alters views and stored procedures
- Runs the daily pipeline to populate with fresh data
The init pipeline is manually triggered when schema changes occur, making development iteration much smoother and reducing the risk of schema-related pipeline failures.
The Four-Pipeline Reality
What started as a single pipeline evolved into four distinct patterns, each serving a specific purpose:
- Init Pipeline (Manual): Schema changes and maintenance
- Daily Pipeline (Morning): Full reload for financial consistency and deletion detection
- Hourly Pipeline (Throughout day): Delta updates for operational freshness
- dbt Pipeline (Called by hourly): Transformation layer from bronze to silver to gold
The hourly pipeline calls the pre-configured dbt pipeline for transformations, maintaining the separation between extraction/loading and transformation concerns.
Production Discoveries: Learning from Real APIs
Working with APIs from external systems taught us a couple of real-life lessons.
During our Pipedrive migration, we encountered a 5,000-record limit on lookup activities. While this limitation is documented, it isn’t apparent during development; once your lookup results exceed the limit in production, the remainder is silently cut off. We only noticed when our record counts didn’t match expectations.
Similarly, during both Pipedrive and DATEV integrations, we encountered Azure Data Factory’s behavior with empty JSON arrays. When a REST source includes an embedded array property and the copy activity is configured to flatten the array, ADF treats the array as a “collection” in the mapping. If the array is empty, instead of creating one row with NULL array properties, ADF skips the entire record. The solution is to uncheck the collection mapping in the copy activity, treating the collection as optional rather than mandatory.
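A small Python sketch of the two flattening behaviors (the record shape is invented for illustration); the "optional" variant corresponds to what unchecking the collection mapping gives you:

```python
def flatten_mandatory(record, array_key):
    """Array treated as a mandatory collection: empty array -> zero rows,
    so the whole record is silently dropped."""
    return [{**record, array_key: item} for item in record[array_key]]

def flatten_optional(record, array_key):
    """Array treated as optional: empty array -> one row with None,
    so the record survives even without array items."""
    items = record[array_key] or [None]
    return [{**record, array_key: item} for item in items]

record = {"id": 42, "phones": []}

print(flatten_mandatory(record, "phones"))  # -> []  (record vanishes)
print(flatten_optional(record, "phones"))   # -> [{'id': 42, 'phones': None}]
```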
Error Handling: Graceful Degradation Over System Failure
Our production experience taught us to design for graceful degradation rather than system perfection. API sources like Pipedrive and DATEV aren’t always 100% stable, and certain records that should be readable sometimes aren’t.
For unstable APIs, we use a pattern of adding set-variable actions on the error path of copy activities within ForEach loops. This acts as a catch statement, capturing the error but allowing the pipeline to continue processing other records. The system achieves eventual consistency through hourly delta downloads even if individual runs encounter errors.
The worst that can happen is operational inconsistencies over a period of two hours or so. As a safeguard, the daily pipeline only copies data to the gold layer if there have been no errors, ensuring financial data integrity is never compromised by partial loads.
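The ForEach error-handling pattern amounts to a per-record try/catch. Sketched in Python (with a hypothetical copy function standing in for the ADF copy activity), it captures each failure and keeps going, while reporting errors so the daily run can refuse to promote to gold:

```python
def copy_all(records, copy_record):
    """Copy each record; capture failures instead of aborting the run.

    copy_record: callable standing in for the ADF copy activity.
    Returns (copied_count, errors); a non-empty errors list is the
    signal to skip the gold layer promotion.
    """
    errors = []
    copied = 0
    for record in records:
        try:
            copy_record(record)
            copied += 1
        except Exception as exc:  # the "set variable on error path" step
            errors.append((record["id"], str(exc)))
    return copied, errors

def flaky_copy(record):
    """Simulated copy activity that fails for one unstable record."""
    if record["id"] == 2:
        raise RuntimeError("source API returned 500")

copied, errors = copy_all([{"id": 1}, {"id": 2}, {"id": 3}], flaky_copy)
print(copied, errors)  # -> 2 [(2, 'source API returned 500')]
```

The failed record is simply retried on the next hourly delta, which is how the system converges to consistency despite unstable sources.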
Dependencies: Microservice-Style Design
We designed each data factory to operate standalone, like microservices, favoring explicit dependencies over implicit coupling. While there are dependencies between our domain pipelines (e.g., operational data requires linkage between Salesforce IDs and customer portal user accounts), these dependencies play out over hours to days, not in real time.
Since business processes unfold over this timescale (account creation, portal enablement, usage generation), our hourly sync frequency provides adequate dependency resolution without complex orchestration between domain pipelines.
Lessons Learned: Evolutionary Architecture in Practice
This evolution from one pipeline to four taught us several important lessons:
Start simple, evolve based on real requirements. We didn’t anticipate the deletion detection problem or the different consistency needs for financial versus operational data. These requirements emerged from production use and business feedback.
Different data has different reliability requirements. Financial data demands consistency and integrity, while operational data prioritizes freshness and availability. Architecture should reflect these different needs rather than applying one-size-fits-all solutions.
Time-based dependencies can be simpler than real-time coordination. Since our business processes unfold over hours to days, we can use simple scheduled syncs rather than complex event-driven orchestration.
Expect API quirks and design for resilience. Every external integration teaches you something new about the source system’s behavior. Error handling and safety nets become more important than perfect execution.
The medallion architecture provides the foundation, but business requirements drive the timing. Bronze, silver, and gold layers define data quality expectations, but how often you refresh each layer depends on your specific business needs.
Looking Forward
Our four-pipeline pattern has proven stable and flexible. It handles the business requirements we know about today while remaining adaptable for future needs. The pattern scales well as we add new source systems: each new domain can use the same template approach while having system-specific pipelines.
In my next article, I’ll explore how we’ve balanced the transformation layer, examining when to use dbt versus direct SQL for different types of data processing, and how the architectural decisions from domain-driven design continue to shape our approach to data transformation.
