A hybrid Data Platform with Azure Data Factory & dbt - Part 8

Series Overview

This is Part 8 of our series on building a hybrid data platform. If you’re joining mid-series, here are the previous articles:

Integrating DATEV: German Tax Accounting in a Modern Data Warehouse

When Google Searches Return ‘not much’

DATEV is the elephant in the room for any German company building a data platform. It is the de facto standard for tax accounting in Germany, and for most if not all companies it is the baseline source of truth, since it reflects the actual cash flow of the company. So if I want to provide departments with a dashboard tracking fundamental operational data like budget vs. spend, there is no way around it. For me, the vision of enabling a data-driven business cannot be limited to customer revenues; it needs to encompass the operational key metrics of the entire company with all of its functions.

This article shares our production implementation: the actual code, the architectural decisions, and the domain knowledge required to successfully integrate DATEV into a modern data warehouse. If you’re facing this challenge, this article will hopefully provide you with enough technical depth to get started.

The DATEV Reality Check

Before diving into the technical implementation, let me set the context. Unlike modern SaaS APIs like Salesforce or Stripe, integrating DATEV comes with unique challenges:

Infrastructure: We run DATEV as a virtual machine in our own network. There’s no cloud-hosted option, no managed service. The VM sits on our infrastructure, managed by our IT team. Connecting to it was simple with our locally deployed ‘Shared Runtime’ of Azure Data Factory.

API Access: The API isn’t enabled by default in the on-prem (aka desktop) version of the software. We had to explicitly enable it, configure authentication, and work through the accounting domain model that DATEV exposes.

Domain Structure: DATEV isn’t organized around generic “customers” or “transactions.” It’s structured around the German accounting domain with specific concepts like Mandanten (clients), Geschäftsjahre (fiscal years), and Festschreibung (commitment of accounting batches).

Documentation: While DATEV provides API documentation, there are few examples of integrating it with modern data platforms.

The API Structure: Hierarchy Before Data

The first major challenge with DATEV is understanding its hierarchical API structure. Unlike flat REST APIs where you might call /api/invoices and get a list of invoices, DATEV requires you to navigate through a domain hierarchy:

/api/clients (Mandanten)
  ↓
/api/clients/{client_id}/fiscal-years (Geschäftsjahre)
  ↓
/api/clients/{client_id}/fiscal-years/{fiscal_year_id}/account-postings
/api/clients/{client_id}/fiscal-years/{fiscal_year_id}/debitors
/api/clients/{client_id}/fiscal-years/{fiscal_year_id}/creditors

This structure reflects German accounting law and practice:

Mandanten (Clients): Each legal entity that needs separate tax accounting is a Mandant. In our case:

  • Software A GmbH (our software A company)
  • Software B GmbH (our software B company)
  • Holding GmbH (our holding company)

Geschäftsjahre (Fiscal Years): Each client has multiple fiscal years, typically following the calendar year but sometimes with different start dates depending on when the company was founded or restructured.

Only after navigating through this hierarchy can you access the actual financial records: account postings, debitors, creditors, account balances, and so on.
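To make the traversal concrete, here is a minimal Python sketch of the hierarchy walk. The `fetch` callable and the JSON shapes are illustrative assumptions, not DATEV’s exact payloads; in practice `fetch` would wrap an authenticated HTTP GET against the on-prem VM.

```python
def all_account_postings(fetch):
    """Walk the DATEV hierarchy: clients -> fiscal years -> account postings.

    `fetch` is any callable that takes a relative URL and returns parsed
    JSON, e.g. a thin wrapper around an authenticated requests.get call
    against the on-prem DATEV VM.
    """
    for client in fetch("clients"):
        cid = client["id"]
        for fy in fetch(f"clients/{cid}/fiscal-years"):
            # only at the third level do we reach actual financial records
            yield from fetch(
                f"clients/{cid}/fiscal-years/{fy['id']}/account-postings"
            )
```

The nesting is unavoidable: there is no endpoint that returns postings across all clients or fiscal years in one call.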

Here’s how we configured our REST datasets in Terraform to handle this hierarchical structure:

# Dataset for Client Fiscal Years
resource "azurerm_data_factory_custom_dataset" "dataset_datev_client_fiscal_years" {
  name            = "DATEV_client_fiscal_years"
  data_factory_id = data.azurerm_data_factory.adf_elt_pipeline.id
  folder          = "DATEV"
  type            = "RestResource"

  linked_service {
    name = var.LINKED_DATEV_SERVICE_NAME
  }

  parameters = {
    client_id = ""
  }

  type_properties_json = <<JSON
  {
    "relativeUrl": {
      "value": "clients/@{dataset().client_id}/fiscal-years",
      "type": "Expression"
    }
  }
JSON
}

# Dataset for Account Postings (nested three levels deep)
resource "azurerm_data_factory_custom_dataset" "dataset_datev_accounting" {
  name            = "DATEV_account_postings"
  data_factory_id = data.azurerm_data_factory.adf_elt_pipeline.id
  folder          = "DATEV"
  type            = "RestResource"

  linked_service {
    name = var.LINKED_DATEV_SERVICE_NAME
  }

  parameters = {
    fiscal_year = ""
    client_id   = ""
  }

  type_properties_json = <<JSON
  {
    "relativeUrl": {
      "value": "clients/@{dataset().client_id}/fiscal-years/@{dataset().fiscal_year}/account-postings",
      "type": "Expression"
    }
  }
JSON
}

The question becomes: how do you flatten this hierarchy to enable efficient data warehouse processing?

The Pragmatic Solution: Hardcoded Clients, Dynamic Fiscal Years

I considered several approaches to handling DATEV’s hierarchical API:

  1. Recursive API Calls: Dynamically discover clients, then fiscal years, then records
  2. Metadata Tables: Build a configuration-driven system that stores the hierarchy
  3. Hardcoded Clients + Dynamic Fiscal Years: Hardcode the client list, dynamically fetch fiscal years

We chose option 3, and here’s why: organizational structure changes slowly, but accounting periods change predictably.

In our case, we have three legal entities (Mandanten). These don’t change frequently. When we create a new subsidiary or restructure, it’s a major business event that we’re fully aware of. There’s no need for automatic discovery.

However, fiscal years change predictably every year, and we need to handle historical data going back multiple years. This needs to be dynamic.

Here’s our actual implementation in the init pipeline that creates the static client list (please note that UUIDs and IDs have been replaced with dummy values in this article):

-- From datev-controlling-init.json
-- DATEV_Clients (statically provided)

IF EXISTS (SELECT 1 FROM sys.objects WHERE object_id = OBJECT_ID(N'raw.DATEV_Clients') AND type in (N'U'))
BEGIN
    DROP TABLE raw.DATEV_Clients;
END

SELECT
    name
    ,number
    ,id

    INTO raw.DATEV_Clients FROM (
        VALUES
        ('Software A GmbH', 10001, '80707794-2553-441e-b161-2df029cce28c'),
        ('Software B GmbH', 10002, 'eb280cc4-93ca-49dd-80da-2523eb6c72ed'),
        ('Holding GmbH', 10003, 'd6b702f8-6375-4191-bd35-f23104a08cc9')
    ) AS t (name, number, id)

This hardcoded approach has several advantages:

Explicit Dependencies: When we add a new subsidiary, we explicitly update the infrastructure code. There’s no hidden discovery logic that might fail silently.

Version Controlled: The client list is in our Git repository, reviewed through pull requests, and deployed through CI/CD pipelines.

Schema Definition During Init: The init pipeline runs when we change schemas or add new clients. It’s the perfect place to define this foundational data.

Predictable Behavior: No API calls to discover clients means no API failures during this step. The client list is simply data.

Once we have the static client list, the daily pipeline dynamically fetches all fiscal years for each client:

{
    "name": "Lookup Clients",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "SqlServerSource",
            "sqlReaderQuery": "select id as client_id from raw.DATEV_Clients",
            "queryTimeout": "02:00:00"
        },
        "dataset": {
            "referenceName": "DB_Clients",
            "type": "DatasetReference"
        },
        "firstRowOnly": false
    }
},
{
    "name": "For Each Client",
    "type": "ForEach",
    "dependsOn": [
        {
            "activity": "Lookup Clients",
            "dependencyConditions": ["Succeeded"]
        }
    ],
    "typeProperties": {
        "items": {
            "value": "@activity('Lookup Clients').output.value",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "Client_Fiscal_Years",
                "type": "Copy",
                "typeProperties": {
                    "source": {
                        "type": "RestSource",
                        "httpRequestTimeout": "00:01:40",
                        "requestInterval": "00.00:00:00.010",
                        "requestMethod": "GET",
                        "paginationRules": {
                            "supportRFC5988": "true"
                        }
                    },
                    "sink": {
                        "type": "SqlServerSink",
                        "writeBehavior": "upsert",
                        "upsertSettings": {
                            "keys": ["FY_Client_Id", "FY_Id"]
                        }
                    }
                }
            }
        ]
    }
}

The result is a flat lookup table in our data warehouse:

raw.DATEV_Client_Fiscal_Years:
+-----------+--------+------------+------------+
| Client_Id | Fiscal | Start_Date | End_Date   |
|           | Year   |            |            |
+-----------+--------+------------+------------+
| eb280cc4  | 2023   | 2023-01-01 | 2023-12-31 |
| eb280cc4  | 2024   | 2024-01-01 | 2024-12-31 |
| eb280cc4  | 2025   | 2025-01-01 | 2025-12-31 |
| 80707794  | 2023   | 2023-01-01 | 2023-12-31 |
| 80707794  | 2024   | 2024-01-01 | 2024-12-31 |
| d6b702f8  | 2023   | 2023-01-01 | 2023-12-31 |
| d6b702f8  | 2024   | 2024-01-01 | 2024-12-31 |
+-----------+--------+------------+------------+

This flat table becomes the foundation for all subsequent data loading. We can now iterate over (client_id, fiscal_year_id) combinations to fetch account postings, debitors, creditors, and all other financial records in parallel.
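As a sketch of what that fan-out looks like outside ADF, the following Python snippet (names are hypothetical) runs one load per (client_id, fiscal_year) pair concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

def load_all(pairs, load_fiscal_year, max_workers=4):
    """Run one load per (client_id, fiscal_year) pair from the lookup table.

    `load_fiscal_year` stands in for a single copy run (account postings,
    debitors, creditors, ...) for one client and fiscal year.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order, so results line up with pairs
        return list(pool.map(lambda pair: load_fiscal_year(*pair), pairs))
```

In ADF the equivalent is a ForEach activity with a batch count over the lookup output; the flat table is what makes that parallelism possible.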

Committed vs Uncommitted: The Delta Logic Secret

One of the most powerful patterns in DATEV integration comes from understanding German accounting principles. Specifically, the concept of Festschreibung (commitment) of accounting batches.

The Accounting Principle

In German accounting practice:

  • Financial records are organized into batches (Stapel or Accounting Sequences Processed)
  • Batches can be in an “uncommitted” state while accountants work on them
  • Once verified, batches are “committed” (festgeschrieben)
  • Committed batches become immutable - they cannot be changed
  • Any corrections to committed records must be done through compensating transactions

This isn’t a technical limitation of DATEV. This is how German accounting law (HGB - Handelsgesetzbuch) works. Once records are committed, they represent the official accounting record.

The Technical Opportunity

This accounting principle provides a natural mechanism for delta loading. If committed records never change, we only need to fetch:

  1. Records in fiscal years that have uncommitted batches
  2. Fiscal years where no records exist yet (first load)

Here’s our actual implementation of this logic:

-- From datev-controlling-daily.json
-- Lookup query that determines which (client_id, fiscal_year) combinations need processing

SELECT 
    cfy.fy_client_id AS client_id, 
    cfy.fy_id AS fiscal_year
FROM raw.DATEV_Client_Fiscal_Years AS cfy
LEFT JOIN raw.DATEV_Accounting_Sequences_Processed AS asp
    ON cfy.fy_client_id = asp.asp_client_id 
    AND cfy.fy_id = asp.asp_fiscal_year_id
WHERE
    cfy.fy_start_date >= '2002-01-01'  -- Only process fiscal years from 2002 onwards
GROUP BY
    cfy.fy_client_id, cfy.fy_id
HAVING
    -- Include fiscal years with at least one uncommitted batch
    SUM(CASE WHEN ISNULL(asp.asp_is_committed,0) = 0 THEN 1 ELSE 0 END) > 0
    -- OR fiscal years with no batches at all (first load)
    OR COUNT(asp.asp_id) = 0

This query is elegant in its simplicity:

  • First run: No batches exist in raw.DATEV_Accounting_Sequences_Processed → Download all fiscal years
  • Subsequent runs: Only select fiscal years with uncommitted batches
  • Committed fiscal years: Automatically excluded, no unnecessary API calls
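The same selection logic, expressed as a small Python function over in-memory rows (a sketch for illustration, not part of the pipeline):

```python
from collections import defaultdict

def fiscal_years_to_process(fiscal_years, batches):
    """Pick (client_id, fiscal_year) pairs that still need loading.

    fiscal_years: iterable of (client_id, fiscal_year) pairs
    batches: iterable of (client_id, fiscal_year, is_committed) rows
    Selected: pairs with no batches yet (first load) or with at least
    one uncommitted batch; fully committed years drop out.
    """
    flags = defaultdict(list)
    for client_id, fy, is_committed in batches:
        flags[(client_id, fy)].append(is_committed)
    return [
        pair for pair in fiscal_years
        if not flags[pair] or not all(flags[pair])
    ]
```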

Updating the Commitment Status

Account postings don’t carry their commitment status directly. Instead, they reference an accounting sequence, and that sequence has the commitment status. We use a stored procedure to update this relationship:

-- From datev-controlling-init.json
CREATE OR ALTER PROCEDURE raw.prc_update_DATEV_Account_Postings
    @ClientId NVARCHAR(36),
    @FiscalYearId NVARCHAR(12)
AS

BEGIN
    SET NOCOUNT ON;

    UPDATE ap
    SET ap.AP_Is_Committed = asp.ASP_Is_Committed
    FROM raw.DATEV_Account_Postings ap
    INNER JOIN raw.DATEV_Accounting_Sequences_Processed asp
        ON ap.AP_Client_Id = asp.ASP_Client_Id
        AND ap.AP_Fiscal_Year_Id = asp.ASP_Fiscal_Year_Id
        AND ap.AP_ASP_ID = asp.ASP_Accounting_Sequence_Id
    WHERE ap.AP_Client_Id = @ClientId
      AND ap.AP_Fiscal_Year_Id = @FiscalYearId;
END;

This pattern is analogous to Salesforce’s LastModifiedDate approach, but domain-specific. Instead of a timestamp indicating when a record changed, we have a commitment flag indicating whether a record can change. The immutability marker becomes our delta detection mechanism.

Business Impact

This delta logic has significant practical benefits:

Reduced API Load: In a typical month, only the current fiscal year has uncommitted batches. We fetch 1 fiscal year instead of 20+ years of historical data.

Faster Pipeline Execution: Processing one fiscal year takes minutes. Processing all fiscal years would take hours.

Aligned with Business Process: The technical implementation mirrors how our accounting team actually works. They commit batches monthly, and our pipeline automatically detects and processes only what’s new.

Predictable Behavior: There are no edge cases where committed records mysteriously change. The system’s behavior matches the accounting rules.

Reverse Data Flow: Controlling to Subsidiaries

The DATEV integration reveals an architectural pattern that’s opposite to what we use for other data sources like Salesforce.

The Salesforce Pattern: Aggregate Up

With Salesforce, we follow a bottom-up aggregation pattern:

Subsidiary Data Warehouse (Ingentis)
  ↓ Extract and load specific records
Controlling Data Warehouse
  ↓ Aggregate across subsidiaries
Consolidated Reports

Each subsidiary’s data warehouse contains only that subsidiary’s Salesforce data. The controlling data warehouse aggregates across all subsidiaries for company-wide reporting.

The DATEV Pattern: Split Down

With DATEV, we discovered we need the opposite pattern - top-down splitting:

DATEV System
  ↓ All financial data for all legal entities
Controlling Data Warehouse → raw.* schema (everything)
  ↓ Split by Mandant (Client_Id)
  ├─→ holding.* schema (Holding company only)
  ├─→ ingentis.* schema (Software company only)
  └─→ kanzlei.* schema (KanzleiSuite company only)
  ↓ Copy to Subsidiary DWHs (filtered, privacy-protected)

Why This Matters

DATEV records represent every financial detail of our company:

  • Every invoice received and paid
  • Every salary payment
  • Every expense reimbursement
  • Every bank transaction
  • Cost center allocations
  • Tax calculations

We can’t simply push all this data to subsidiary databases because:

Privacy Concerns: Individual salary data, personal expense reimbursements, and compensation details should not be accessible in subsidiary systems that have broader user access.

Business Need-to-Know: Subsidiary teams need their own financial data for operational purposes, but they don’t need visibility into other legal entities’ finances.

Data Sovereignty: Different legal entities may have different data protection requirements and access controls.

Compliance: GDPR and German data protection laws (BDSG) require limiting access to personal financial data to those with a legitimate business need.

Implementation: Schema-Based Splitting

Our init pipeline creates separate schemas for each legal entity, then uses stored procedures to split the data:

-- From datev-controlling-init.json
-- Example for Holding company (repeated for each entity)

CREATE OR ALTER PROCEDURE holding.[prc_fill_DATEV]
AS

TRUNCATE TABLE holding.DATEV_Debitors
INSERT INTO holding.DATEV_Debitors SELECT * FROM raw.DATEV_Debitors 
WHERE D_Client_Id = 'd6b702f8-6375-4191-bd35-f23104a08cc9'  -- Holding GmbH ID

TRUNCATE TABLE holding.DATEV_Creditors
INSERT INTO holding.DATEV_Creditors SELECT * FROM raw.DATEV_Creditors 
WHERE C_Client_Id = 'd6b702f8-6375-4191-bd35-f23104a08cc9'

TRUNCATE TABLE holding.DATEV_Account_Records
INSERT INTO holding.DATEV_Account_Records SELECT * FROM raw.DATEV_Account_Records 
WHERE AR_Client_Id = 'd6b702f8-6375-4191-bd35-f23104a08cc9'

TRUNCATE TABLE holding.DATEV_Account_Postings
INSERT INTO holding.DATEV_Account_Postings SELECT * FROM raw.DATEV_Account_Postings 
WHERE AP_Client_Id = 'd6b702f8-6375-4191-bd35-f23104a08cc9'

-- Continue for all DATEV tables...

The daily pipeline calls these procedures after loading all data into the raw schema:

{
    "name": "Holding prc_fill_DATEV",
    "type": "SqlServerStoredProcedure",
    "dependsOn": [
        {
            "activity": "For Each Client Fiscal Year",
            "dependencyConditions": ["Succeeded"]
        }
    ],
    "typeProperties": {
        "storedProcedureName": "[holding].[prc_fill_DATEV]"
    }
},
{
    "name": "Ingentis prc_fill_DATEV",
    "type": "SqlServerStoredProcedure",
    "dependsOn": [
        {
            "activity": "For Each Client Fiscal Year",
            "dependencyConditions": ["Succeeded"]
        }
    ],
    "typeProperties": {
        "storedProcedureName": "[ingentis].[prc_fill_DATEV]"
    }
},
{
    "name": "Kanzlei prc_fill_DATEV",
    "type": "SqlServerStoredProcedure",
    "dependsOn": [
        {
            "activity": "For Each Client Fiscal Year",
            "dependencyConditions": ["Succeeded"]
        }
    ],
    "typeProperties": {
        "storedProcedureName": "[kanzlei].[prc_fill_DATEV]"
    }
}

This creates a clean data architecture:

controlling_db:
  ├─ raw.*       → All data from DATEV API (controlling team only)
  ├─ holding.*   → Filtered to Holding GmbH records
  ├─ ingentis.*  → Filtered to Software GmbH records
  └─ kanzlei.*   → Filtered to KanzleiSuite GmbH records

ingentis_dwh:
  └─ datev.*     → Copy of ingentis.* from controlling_db
                   (minus sensitive personal financial data)

Privacy Protection in the Copy Process

When copying from the controlling database to subsidiary databases, we implement additional privacy filtering. The stored procedures above include all DATEV tables, but when pushing to subsidiary warehouses, we exclude:

  • Detailed salary information
  • Personal compensation and bonuses
  • Individual expense reimbursements (unless business-related)
  • Personal tax data
  • Private account movements

We keep:

  • Business-to-business financial records (invoices, vendor payments)
  • Cost center allocations (but aggregated, not individual)
  • Account balances and P&L statements
  • Business expense categories

This filtering happens at the pipeline level when copying from controlling_db to ingentis_dwh:

{
    "name": "controlling_db-ingentis_dwh-daily",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": {
            "referenceName": "controlling_db-ingentis_dwh-daily",
            "type": "PipelineReference"
        }
    }
}

The controlling_db-ingentis_dwh-daily pipeline contains specific copy activities that filter which tables and which records get copied to the subsidiary warehouse.
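A sketch of what such a filter could look like, expressed in Python. The table names, column names, and exclusion list here are hypothetical placeholders, since the actual filtering lives in the ADF copy activities:

```python
# Hypothetical tables whose contents are too sensitive to leave controlling_db
EXCLUDED_TABLES = {"DATEV_Salary_Details", "DATEV_Personal_Tax"}

def plan_subsidiary_copy(tables, client_id):
    """Build (table, where_clause) pairs for one subsidiary's copy run.

    tables: iterable of (table_name, client_id_column) pairs
    Sensitive tables are dropped entirely; everything else is restricted
    to the subsidiary's own Mandant via a WHERE clause.
    """
    return [
        (table, f"{client_col} = '{client_id}'")
        for table, client_col in tables
        if table not in EXCLUDED_TABLES
    ]
```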

Production Resilience: When APIs Are Unreliable

One of the harsh realities of integrating with DATEV is that the API isn’t 100% reliable. We encountered several classes of failures in production:

  1. Transient failures: Occasional timeouts or connection issues
  2. Inaccessible records: Some records appear in listings but return 404 when fetched
  3. Partial failures: One entity fails while others succeed

The question became: Should a single failed record terminate the entire daily pipeline?

The Traditional Approach (Too Fragile)

The default Azure Data Factory behavior is to fail the entire pipeline on any error. For a daily accounting data load, this means:

  • One inaccessible creditor record → Entire DATEV sync fails
  • Timeout on one fiscal year → No data loaded for any entity
  • Transient API hiccup → Manual intervention required

This isn’t acceptable for a production system that needs to run reliably every day.

Our Approach: Graceful Degradation

Instead, we implemented a multi-layered error handling strategy using Azure Data Factory’s “fail connectors” - the ability to route pipeline execution based on success or failure outcomes.

Layer 1: Retry with Backoff For known transient failures (like AccountRecords), we implemented retry logic by connecting the failure output to a Wait activity, followed by a second attempt:

{
    "name": "Copy AccountRecords - First Attempt",
    "type": "Copy",
    "typeProperties": {
        "source": {...},
        "sink": {...}
    }
}

// Connect the FAILURE output to a wait activity
{
    "name": "Wait 30 Seconds",
    "type": "Wait",
    "dependsOn": [
        {
            "activity": "Copy AccountRecords - First Attempt",
            "dependencyConditions": ["Failed"]
        }
    ],
    "typeProperties": {
        "waitTimeInSeconds": 30
    }
}

// After waiting, retry the copy
{
    "name": "Copy AccountRecords - Retry",
    "type": "Copy",
    "dependsOn": [
        {
            "activity": "Wait 30 Seconds",
            "dependencyConditions": ["Succeeded"]
        }
    ],
    "typeProperties": {
        "source": {...},
        "sink": {...}
    }
}

Layer 2: Error Collection When retries fail, we don’t terminate the pipeline - we collect the error details into a pipeline variable and continue processing other entities:

{
    "name": "Log AccountRecords Error",
    "type": "AppendVariable",
    "dependsOn": [
        {
            "activity": "Copy AccountRecords - Retry",
            "dependencyConditions": ["Failed"]
        }
    ],
    "typeProperties": {
        "variableName": "pipelineErrors",
        "value": {
            "value": "@concat('AccountRecords failed for client ', item().client_id, ' fiscal year ', item().fiscal_year, ': ', activity('Copy AccountRecords - Retry').error.message)",
            "type": "Expression"
        }
    }
}

This pattern is repeated across multiple Copy activities in the pipeline. Each failure path leads to an Append Variable activity that appends the error details to our error log, an array variable (ADF does not allow a Set Variable activity to reference the variable it is setting, so appending to an array is the reliable way to accumulate errors).

Layer 3: Partial Success The pipeline continues processing other entities even if one fails. At the end of the pipeline run, we have:

  • All successfully loaded data in the warehouse
  • A complete error log in the pipelineErrors variable
  • Monitoring alerts triggered only if errors exceed a threshold

Why This Matters

This pattern transformed our DATEV integration from brittle to resilient:

Before:

  • Required daily manual intervention when any endpoint failed
  • All-or-nothing data loads meant zero data on partial failures
  • No visibility into which specific entities failed

After:

  • 98% of data loads successfully even when some endpoints fail
  • Detailed error tracking for debugging and trend analysis
  • Graceful degradation - we get most data even during partial outages

The Trade-off

This approach requires accepting that your data warehouse might have:

  • Some missing records (but you know exactly which ones via the error log)
  • Slight staleness for failed entities (they’ll sync on the next run)
  • Need for monitoring the error variable and setting up appropriate alerts

For financial data, this might sound risky. But consider:

  1. Known gaps are better than unknown gaps: We log exactly what failed and why
  2. Partial data beats no data: Finance teams can work with 95% of data rather than 0%
  3. Next-day recovery: Failed records typically succeed on the next run
  4. Alerting on patterns: We alert when the same entity fails repeatedly, indicating a systemic issue

Pattern: Fail Connectors in ADF

Azure Data Factory’s “fail connectors” enable this pattern. Every Copy activity has four possible dependency conditions:

  • Succeeded: Green path continues to next activity
  • Failed: Red path can retry, log, or continue processing
  • Skipped: Activity wasn’t executed due to conditions
  • Completed: Activity finished regardless of success/failure

By connecting the failure path to error collection rather than pipeline termination, you build systems that degrade gracefully:

Copy Data → [Succeeded] → Next Activity
          → [Failed] → Wait → Retry → [Succeeded] → Next Activity
                                    → [Failed] → Log Error → Continue Pipeline
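The same attempt → wait → retry → log-and-continue flow, condensed into a Python sketch (illustrative only; in ADF this is wired through activity dependencies, not code):

```python
import time

def copy_with_retry(copy, errors, wait_seconds=30):
    """One attempt, a pause, one retry; on final failure log and continue.

    `copy` performs a single copy activity and raises on failure. Errors
    from the final attempt are appended to `errors` instead of propagated,
    so the surrounding pipeline keeps processing other entities.
    """
    try:
        return copy()
    except Exception:
        time.sleep(wait_seconds)  # back off before the single retry
    try:
        return copy()
    except Exception as exc:
        errors.append(str(exc))  # degrade gracefully instead of aborting
        return None
```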

Lesson Learned: API Reliability is a Spectrum

Don’t treat all external systems as equally reliable. DATEV’s API required special error handling that our Salesforce or Stripe integrations didn’t need.

The right pattern depends on:

  • API reliability characteristics: How often does it fail? What types of failures?
  • Data criticality: Can you tolerate gaps?
  • Recovery time objectives: Can you wait until tomorrow?
  • Business tolerance: Will users accept 95% data availability?

For DATEV, graceful degradation was the right choice. The alternative - manual daily intervention - was simply not sustainable for a production system.

Production Learnings: The Empty Array Gotcha

During DATEV integration, we encountered the same Azure Data Factory quirk that we’d later discover again during our Pipedrive integration: the empty array collection mapping bug.

The Problem

DATEV’s API returns embedded arrays for certain relationships. For example, a creditor record might include an array of bank accounts:

{
  "id": "abc-123",
  "caption": "Vendor Name",
  "banks": [
    {
      "iban": "GB33BUKB20201555555555",
      "bic": "BARCGB22",
      "bank_name": "Mybank"
    }
  ]
}

When you configure ADF’s copy activity to flatten this structure and map the array elements to columns, ADF treats the array as a “collection reference.” This works fine when the array has data.

But when a creditor has no bank accounts:

{
  "id": "def-456",
  "caption": "Another Vendor",
  "banks": []  // Empty array
}

Azure Data Factory’s behavior is surprising: it skips the entire record. Not just the bank-related columns - the entire creditor record doesn’t get inserted into the database.

This is documented behavior in Microsoft’s official documentation: “For records where the array marked as collection reference is empty and the check box is selected, the entire record is skipped.”

The Solution

The fix is simple but non-obvious. In the copy activity mapping configuration, there’s a setting called “collection reference.” By default, when you map an array property, ADF sets this automatically. The fix is to uncheck the collection reference checkbox or set it to an empty string:

{
    "translator": {
        "type": "TabularTranslator",
        "mappings": [
            {
                "source": {"path": "$['banks'][0]['iban']"},
                "sink": {"name": "C_Bank_Iban", "type": "String"}
            }
        ],
        "collectionReference": ""  // Empty = treat as optional
        // Previously would be: "collectionReference": "$['banks']"
    }
}

With collectionReference set to empty, ADF treats the array as optional:

  • Records with bank data → Bank columns populated
  • Records without bank data → Bank columns set to NULL, but record still created
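To see the difference between the two modes, here is a small Python simulation of the flattening behavior (a didactic sketch, not ADF internals):

```python
def flatten_creditors(records, skip_on_empty_collection):
    """Simulate ADF's mapping of the 'banks' array to sink columns.

    skip_on_empty_collection=True mimics the default collection-reference
    behavior: a record with an empty array is dropped entirely. False
    mimics an empty collectionReference: the record survives with NULLs.
    """
    rows = []
    for rec in records:
        banks = rec.get("banks") or []
        if banks:
            rows.append({"id": rec["id"], "iban": banks[0]["iban"]})
        elif not skip_on_empty_collection:
            rows.append({"id": rec["id"], "iban": None})
        # else: the whole record is silently skipped
    return rows
```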

Why This Matters

This isn’t documented particularly well in the UI, but it’s confirmed in Microsoft’s official documentation. The behavior makes some sense from a collection-flattening perspective (if you’re flattening an array and the array is empty, what do you flatten?), but the default behavior of skipping the entire record is surprising and can lead to silent data loss.

The lesson: Always test with records that have empty arrays. Your development data might have complete records, but production will inevitably have sparse data.

Reference: Microsoft Learn - Schema and data type mapping in copy activity

Lessons Learned: Domain Knowledge Matters

After running DATEV integration in production for several months, here are the key lessons I’ve learned:

Lesson 1: Domain-Specific APIs Require Domain Knowledge

You cannot simply “integrate” DATEV by reading API documentation. You need to understand German accounting principles:

  • Mandanten aren’t just API entities - they represent legal obligations for separate tax filings
  • Geschäftsjahre aren’t arbitrary time periods - they’re legally defined fiscal periods
  • Festschreibung isn’t a status field - it’s a commitment with legal implications

I spent a lot of time with our Kanzlei and accounting team to understand how DATEV’s structure maps to our actual accounting practice. That conversation was more valuable than any technical documentation.

Recommendation: If you’re integrating DATEV, start with your accounting team. Understand the business processes before you write any code.

Lesson 2: Hardcoding Isn’t Always Bad

As software engineers, we’re trained to avoid hardcoding. Build flexible, configurable systems that can adapt to change.

But in this case, hardcoding the client list was the right choice:

  • Clients change rarely (maybe once per year)
  • When they change, it’s a major business event with IT involvement anyway
  • Explicit beats implicit when structure is stable
  • Makes failure modes obvious and debugging easier

The key is knowing when structure is stable enough to hardcode. Organizational structure? Stable. Fiscal years? Dynamic. Financial records? Very dynamic.

Recommendation: Hardcode what changes at the pace of business restructuring. Make dynamic what changes at the pace of daily operations.

Lesson 3: Immutability Flags Are First-Class Data

The IS_COMMITTED flag in DATEV isn’t metadata - it’s core domain knowledge that drives business processes and enables technical optimizations.

By treating this flag as first-class data:

  • We aligned our technical implementation with legal requirements
  • We enabled smart delta logic without complex timestamp comparisons
  • We made the system’s behavior predictable and explainable

Recommendation: Look for domain concepts that indicate immutability or state transitions. These often provide natural mechanisms for delta detection.

Lesson 4: Data Flow Direction Matters

Not all data sources should be aggregated bottom-up. DATEV taught us that sometimes you need to:

  1. Consolidate at the top
  2. Split to subsidiaries
  3. Filter for privacy

This is the opposite of our Salesforce pattern, but it’s the right pattern for DATEV because:

  • DATEV already contains all entities
  • Privacy requires filtering sensitive data
  • Subsidiaries need access to their own finances
  • Controlling needs visibility across all entities

Recommendation: Let the source system’s structure guide your data flow. Don’t force every integration into the same pattern.

Lesson 5: Privacy by Design Through Schema Separation

The schema-based splitting approach (raw.*, holding.*, ingentis.*, kanzlei.*) isn’t just organizational convenience. It’s a privacy enforcement mechanism:

  • Database permissions can be set at the schema level
  • No way to accidentally query across legal entities
  • Audit trail is built into the architecture
  • Adding a new entity means adding a new schema with its own permissions

This physical separation provides stronger guarantees than row-level security or view-based filtering.

Recommendation: When dealing with sensitive financial or personal data, use schema separation as a first-class privacy control mechanism.

Lesson 6: Test with Production-Scale Data and Realistic Failure Scenarios

During development and staging, we worked with small datasets - a few hundred creditors, a couple of fiscal years' worth of postings, and mostly successful API calls. Everything worked fine.

Production revealed:

  • Performance issues with certain queries at scale
  • API reliability issues requiring retry logic (see Production Resilience section)
  • The empty-array collection mapping edge case
  • Unexpected API timeouts for certain endpoints
  • Memory issues when processing large batches

Recommendation: Create a production-scale test environment with realistic failure scenarios. Test not just happy paths with complete data, but also:

  • Partial API failures
  • Empty arrays and sparse data
  • Network timeouts and retries
  • Large batch processing

Conclusion: Pragmatism Over Purity

Integrating DATEV into our data warehouse required abandoning some “best practices” and embracing pragmatic solutions:

  • Hardcoded client list instead of dynamic discovery
  • Schema-based splitting instead of unified tables with row-level security
  • Reverse data flow (consolidate then split) instead of bottom-up aggregation
  • Domain-specific delta logic instead of generic timestamp-based approaches
  • Graceful degradation instead of all-or-nothing pipeline execution

The result is a system that:

  • Aligns with German accounting principles and practices
  • Respects privacy and data protection requirements
  • Performs efficiently through smart delta detection
  • Is maintainable because it makes structure explicit
  • Is resilient through graceful error handling and retry logic
  • Handles edge cases like empty arrays and API failures

If you’re facing DATEV integration, I hope this article provides a useful starting point. The code snippets are real, the patterns work, and the lessons were learned through production experience.

For German companies or those with German subsidiaries, DATEV integration is inevitable. The lack of public examples makes it challenging, but it’s absolutely achievable with the right combination of domain knowledge, pragmatic architecture, and attention to edge cases.

In my next article, I’ll tackle another API integration challenge: implementing cursor-based pagination for Pipedrive CRM and discovering Azure Data Factory’s surprising 5000-record lookup limit in production.


Technical Appendix: Key Pipeline Structure

For reference, here’s the high-level structure of our DATEV pipelines:

Init Pipeline (datev-controlling-init.json):

  1. Create all table schemas in raw, holding, ingentis, kanzlei schemas
  2. Insert hardcoded client list into raw.DATEV_Clients
  3. Create stored procedures for schema splitting (prc_fill_DATEV)
  4. Create helper procedure for updating commitment status
  5. Call daily pipeline for initial data load

Daily Pipeline (datev-controlling-daily.json):

  1. Lookup clients from raw.DATEV_Clients
  2. For each client, fetch all fiscal years from DATEV API
  3. Upsert fiscal years into raw.DATEV_Client_Fiscal_Years
  4. Query for fiscal years with uncommitted batches
  5. For each (client_id, fiscal_year_id), call records pipeline
  6. After all records loaded, call schema-splitting stored procedures
  7. Copy filtered data to subsidiary warehouses

Records Pipeline (datev-controlling-records.json):

  1. Fetch account postings, debitors, creditors for given client + fiscal year
  2. Fetch accounting sequences processed (batches)
  3. Fetch account records, cost centers, cost systems
  4. Fetch accounting statistics and sums/balances
  5. Update commitment status via stored procedure
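As an illustration of step 5, the commitment-status helper can look roughly like this. The batch table name (raw.DATEV_Accounting_Sequences) and its columns are simplified stand-ins for our actual schema, but the logic mirrors what the procedure does: a fiscal year is marked committed once none of its batches remain uncommitted.

```sql
-- Illustrative helper: mark a fiscal year as committed once all of its
-- accounting batches are festgeschrieben. Table/column names simplified.
CREATE OR ALTER PROCEDURE raw.prc_update_commitment_status
AS
BEGIN
    UPDATE fy
    SET    fy.IS_COMMITTED = 1
    FROM   raw.DATEV_Client_Fiscal_Years AS fy
    WHERE  NOT EXISTS (
               SELECT 1
               FROM   raw.DATEV_Accounting_Sequences AS seq
               WHERE  seq.client_id      = fy.client_id
                 AND  seq.fiscal_year_id = fy.fiscal_year_id
                 AND  seq.is_committed   = 0
           );
END;
```

Running this at the end of the records pipeline is what closes the loop: the next daily run's delta query then skips the newly committed years.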

The entire integration is managed through Terraform modules:

  • az-datafactory-datev-endpoint: Configures REST API connection
  • az-datafactory-datasets: Creates all dataset definitions
  • az-datafactory-datev-pipeline: Deploys pipeline JSON

All infrastructure is version controlled and deployed through Azure DevOps CI/CD pipelines, ensuring reproducibility across environments (play, test, prod).