A hybrid Data Platform with Azure Data Factory & dbt - Part 9

Series Overview

This is Part 9 of our series on building a hybrid data platform. If you’re joining mid-series, here are the previous articles:

Introduction

In previous articles I’ve walked you through extracting and transforming data from external systems like Salesforce and DATEV. But there’s a fundamental gap in our data: visibility into how our product is actually being used, and the ability to tie that back to customer data.

Until now, our data platform has captured what happens in our business (contracts, finances, invoices) but not what happens through our product. We cannot see how a particular customer uses our product, when they hit limits, or where they experience problems.

This gap is particularly acute for a B2B SaaS company. Product usage data isn’t just interesting; it’s operationally critical. When a customer approaches their master record limit, we want to know before the transformation fails. When we release a feature, we would like to know how relevant it has been to our customers. When usage patterns shift, we want to understand why. The ability to understand cause and effect directly shapes our ability to create the right kind of value for our customers.

The challenge is that integrating product telemetry into a unified data warehouse is fundamentally different from integrating business systems. Product data is generated at scale, distributed across multiple regions, and may contain customer-sensitive data that requires careful handling.

Today I’ll share how this could be solved with OpenTelemetry, regional collectors, and a pragmatic approach to data privacy. This is a proof-of-concept implementation, but it demonstrates the architectural patterns that I believe will work at scale.

Why Product Telemetry Matters

Before diving into technical architecture, let me establish why this matters beyond the obvious observability benefits.

Our business operates across multiple regions: our SaaS offering runs in the US, Europe, and Canada, plus there is a partner-controlled deployment in Australia. Each region runs its own cluster hosting our SaaS application.

For years, we managed logs and traces purely from an operational standpoint and in isolation from our business data. Support teams could get access to application traces when debugging issues. Product teams could obtain some usage metrics. Finance would get cost allocations. But nobody could correlate: “Did this customer’s usage spike align with a new contract?” or “Which features drive the most MRR for high-value accounts?” or “How does license utilization differ by industry segment?”

Integrating product telemetry into our data warehouse would change this. We would be able to:

  • Alert customers approaching license limits before they hit them (operational efficiency)
  • Analyze feature adoption rates by customer segment (product strategy)
  • Track resource consumption vs. billing (financial accuracy)

The data platform would become the source of truth along the entire value chain of our business.
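To make the first of these concrete, here is a deliberately simplified Python sketch of how license-limit alerting could combine contract data with product telemetry. The customer names, limits, and usage numbers are invented for illustration; in practice both sides would come from warehouse tables.

```python
# Hypothetical illustration: flag customers whose master-record usage
# approaches their licensed limit, joining business data with telemetry.
licenses = {"acme": 10_000, "globex": 5_000}   # limits from contract data
usage    = {"acme": 9_400, "globex": 1_200}    # counts from product telemetry

def approaching_limit(threshold: float = 0.9) -> list[str]:
    """Customers whose usage exceeds the threshold share of their license."""
    return [
        customer
        for customer, limit in licenses.items()
        if usage.get(customer, 0) / limit >= threshold
    ]

print(approaching_limit())
# ['acme']
```

In the real platform this join would live in a dbt model, but the shape of the question is the same: telemetry provides the numerator, business data the denominator.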

The Data Privacy Challenge

The challenge with integrating product telemetry into a data warehouse is that product data contains sensitive information that business data typically doesn’t.

Business data from Salesforce contains customer names and contact information, but these are intended to be shared. In contrast, product telemetry may contain:

  • Personal data: User IDs, email addresses, potentially demographic information
  • Business intelligence: Feature usage patterns, query details, resource consumption
  • Operational details: Database query text, configuration settings, internal identifiers
  • Compliance concerns: GDPR and local data residency

My approach: redact at collection time, not at query time. Rather than storing all data and trying to filter it later, we redact sensitive information as the OpenTelemetry collector pipeline processes the telemetry before it’s persisted.

This “privacy by design” approach has several advantages:

  • Sensitive data never reaches storage, reducing liability
  • No performance impact from row-level security queries
  • Audit trail is clean as we can prove sensitive data was never stored
  • Regional operator teams can access their own regional data

The trade-off is that we need to be precise about what we redact, because once redacted, we can’t recover it later.
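To illustrate what redaction at collection time means in practice, here is a small Python sketch using the same IPv4 and email regexes that appear in the collector configuration later in this article. The mask string is an arbitrary choice for this example.

```python
import re

# Regexes mirroring the collector's blocked_values (IPv4 and email patterns)
BLOCKED = [
    re.compile(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}'),
    re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'),
]

def redact(value: str, mask: str = "****") -> str:
    """Replace any blocked pattern in an attribute value with a mask."""
    for pattern in BLOCKED:
        value = pattern.sub(mask, value)
    return value

print(redact("login from 10.0.0.1 by alice@example.com"))
# login from **** by ****
```

The collector's redaction processor applies this substitution to attribute values before the exporter runs, so the masked form is all that ever reaches blob storage.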

Architecture: Regional Collectors with Data Locality

Our implementation uses a regional deployment pattern with OpenTelemetry collectors distributed across our production regions: the Europe cluster has a dedicated OTel collector co-located with it, as do the US and Canada clusters.

Each collector receives telemetry from its co-located cluster, processes it locally, and exports it to local blob storage. From there, selective replication brings relevant data to a central blob storage accessible by the Azure Data Factories for data processing.

The regional approach serves multiple purposes:

Data Locality Compliance: Regulations require data to remain within geographic boundaries. By processing regionally first, we satisfy these requirements while maintaining architectural flexibility.

Improved Resilience and Reduced Overhead: Processing near the source reduces data moving across the network, improving performance and reducing bandwidth costs. It also makes the initial data storage independent of any central service or region.

Local Compliance and Flexibility: Local storage in the region allows us to implement region-specific data retention policies. In addition, it gives us the flexibility to customize telemetry data regionally. If we need to redact additional fields in Europe due to GDPR concerns, we can do that without affecting other regions.

Two Architectural Approaches

We evaluated two approaches for integrating regional collectors into the central data warehouse.

Approach 1: Azure Blob Storage Replication (Current PoC)

Architecture:

Regional Collector (EU)
        ↓ (Azure Blob Exporter)
Regional Blob Storage (EU)
        ↓ (blob replication)
Central Blob Storage (DE)
        ↓ (Azure Data Factory)
DWH (bronze layer)

How it works:

  1. OpenTelemetry collectors in each region receive telemetry (traces, logs, metrics) via OTLP protocol
  2. Collectors process telemetry (redaction, sampling, enrichment) and export to regional Azure Storage accounts
  3. Azure Storage blob replication automatically copies part of this data to a central storage account
  4. Azure Data Factory loads from central storage into DWH bronze layer on schedule (daily)

Advantages:

  • Simple, well-understood pattern (we already use blob storage → ADF for other data transformations)
  • Leverages existing Azure infrastructure and CI/CD patterns
  • Each region maintains local storage for a limited window (e.g., 14 days), which enables local-first support in the future
  • Decouples collection (real-time, regional) from consumption (batch, centralized)

Challenges:

  • Blob replication adds storage overhead (we’re storing data twice during replication window)
  • Requires blob versioning/change detection enabled to track updates in the Azure Storage

Approach 2: Central OpenTelemetry Collector Forwarding (Alternative)

Architecture:

Regional Collector (EU) -> (Azure Blob Exporter) -> Regional Blob Storage (EU)
        ↓ (OTLP)
Central OpenTelemetry Collector (DE)
        ↓ (Azure Blob Exporter)
Central Blob Storage
        ↓ (Azure Data Factory)
DWH (bronze layer)

How it works:

  1. OpenTelemetry collectors in each region receive telemetry (traces, logs, metrics) via OTLP protocol
  2. Collectors process telemetry (redaction, sampling, enrichment) and
    1. Export to regional Azure Storage accounts
    2. Forward to central collector via OTLP
  3. Central collector receives and processes the forwarded telemetry (cross-regional sampling, aggregation)
  4. Central collector exports to central Azure Storage account
  5. Azure Data Factory loads from central storage into DWH bronze layer on schedule (daily)

Advantages:

  • Natural extension point for future central processing requirements
  • Easier to implement complex transformations before storage
  • Potentially simpler storage management (less duplication)

Challenges:

  • Additional network hop for all telemetry data
  • Single point of failure for central processing
  • Requires careful capacity planning for central collector

For this initial proof of concept I chose to implement Approach 1 (blob replication).

Implementation: Terraform and Configuration

Here’s the actual implementation of our regional collector deployment. It is structured as a Terraform module that instantiates an Azure container group together with an Azure Blob Storage account in a specific deployment region.

OTel Docker Image

The OpenTelemetry collector container image is built and maintained in a separate image_otel repository, following the same pattern as image_dbt and image_azbridge described in earlier blog articles.

This separation serves an important purpose: the collector’s evolution—updating to newer OpenTelemetry versions, adding custom processors, tuning batch sizes—can happen through CI/CD pipelines focused entirely on observability needs, without requiring infrastructure changes or coordination with other deployments.

The image_otel repository contains a Dockerfile that simply pulls the OpenTelemetry Contrib (otel-contrib) distribution, which includes all the processors and exporters we need, including the redaction processor and the Azure Blob exporter. Each build produces a container image tagged with a semantic version (e.g., otel-contrib:0.92.0) and pushes it to the shared Azure Container Registry.

Module Structure

product-contoso_dwh/
├── terraform/
│   └── modules/
│       └── az-container-otel-collector/
│           ├── main.tf           # Container instance, storage accounts
│           ├── variables.tf      # Configuration inputs
│           ├── outputs.tf        # Exported values
│           ├── config.tmpl       # OTel collector configuration template
│           └── README.md         # Documentation

Main Terraform Configuration

This creates the regional infrastructure for an OpenTelemetry collector:

# File: terraform/modules/az-container-otel-collector/main.tf

locals {
  # Infra Shared Variables
  infra_shared_resource_group_name    = "${var.PROJECT}${var.REGION}${var.ENVIRONMENT}"
  infra_shared_log_analytics_ws_name  = "${local.infra_shared_resource_group_name}-law${var.TYPE}"
  infra_shared_storage_account_name   = "${var.PROJECT}${var.REGION}${var.ENVIRONMENT}"

  # Local Resource Variables
  resource_group_name = "${var.NAME}${var.REGION}${var.ENVIRONMENT}"
  otel_container_name = "${lower(var.NAME)}-otel-collector"
}

# Data Block -> Local resource group

data "azurerm_resource_group" "rg" {
  name = local.resource_group_name
}

# Data Block -> Infra Shared Resources

data "azurerm_resource_group" "rg_infra_shared" {
  name = local.infra_shared_resource_group_name
}

data "azurerm_storage_account" "stacc_infra_shared" {
  name                = local.infra_shared_storage_account_name
  resource_group_name = data.azurerm_resource_group.rg_infra_shared.name
}

data "azurerm_container_registry" "acr" {
  name                = "${var.PROJECT}acr"
  resource_group_name = data.azurerm_resource_group.rg_infra_shared.name
}

data "azurerm_log_analytics_workspace" "law_infra_shared" {
    name                = local.infra_shared_log_analytics_ws_name
    resource_group_name = data.azurerm_resource_group.rg_infra_shared.name
}

# Resource Block -> Local storage account

# Storage account for local telemetry (logs, traces, metrics)
resource "azurerm_storage_account" "storage_otel_regional" {
  name                     = "${lower(var.PROJECT)}otel${lower(var.REGION)}"
  resource_group_name      = data.azurerm_resource_group.rg.name
  location                 = var.LOCATION
  account_tier             = "Standard"
  account_replication_type = "LRS"  # Local redundancy only; replication happens separately

  # enabling for blob replication
  blob_properties {
    versioning_enabled            = true
    change_feed_enabled           = true
    change_feed_retention_in_days = 1
  }

  tags = {
    PURPOSE = "OTel local storage"
    OWNER   = var.OWNER
  }
}

# Create blob containers for each telemetry signal
resource "azurerm_storage_container" "container_logs" {
  name                  = "logs"
  storage_account_name  = azurerm_storage_account.storage_otel_regional.name
  container_access_type = "private"
}

resource "azurerm_storage_container" "container_traces" {
  name                  = "traces"
  storage_account_name  = azurerm_storage_account.storage_otel_regional.name
  container_access_type = "private"
}

resource "azurerm_storage_container" "container_metrics" {
  name                  = "metrics"
  storage_account_name  = azurerm_storage_account.storage_otel_regional.name
  container_access_type = "private"
}

# Lifecycle policy: delete blobs after 14 days
resource "azurerm_storage_management_policy" "lifecycle_otel" {
  storage_account_id = azurerm_storage_account.storage_otel_regional.id

  rule {
    name    = "delete-old-telemetry"
    enabled = true

    filters {
      blob_types   = ["blockBlob"]
      prefix_match = ["logs/", "traces/", "metrics/"]
    }

    actions {
      base_blob {
        delete_after_days_since_modification_greater_than = 14
      }
      version {
        delete_after_days_since_creation = 14
      }
    }
  }
}

# File share for mounting OTel collector configuration
resource "azurerm_storage_share" "container_share" {
  name                 = "otel-config"
  storage_account_name = azurerm_storage_account.storage_otel_regional.name
  quota                = 1024
  enabled_protocol     = "SMB"

  depends_on  = [
    azurerm_storage_account.storage_otel_regional
  ]
}

# Config file for otel collector from template
resource "local_file" "config_template" {
  filename  = "${azurerm_storage_account.storage_otel_regional.name}-config.yaml"
  content   = templatefile("../modules/az-container-otel-collector/config.tmpl", {
      connection_string = azurerm_storage_account.storage_otel_regional.primary_connection_string
      # Add more variables and interpolation as needed
    }
  )
}

# Upload otel collector config file to file share
resource "azurerm_storage_share_file" "config_yaml" {
  name             = "config.yaml"
  storage_share_id = azurerm_storage_share.container_share.id
  source           = local_file.config_template.filename
  content_md5      = filemd5(local_file.config_template.filename)

  depends_on  = [
    azurerm_storage_share.container_share,
    local_file.config_template
  ]
}

# Resource Block -> Replicated Logs in central storage account

# Storage container for logs in central storage
resource "azurerm_storage_container" "replicated_logs" {
  name                  = "${azurerm_storage_account.storage_otel_regional.name}-${azurerm_storage_container.container_logs.name}"
  storage_account_name  = data.azurerm_storage_account.stacc_infra_shared.name
  container_access_type = "private"

  depends_on  = [
    azurerm_storage_container.container_logs
  ]
}

# Replication policy from local logs storage to central storage
resource "azurerm_storage_object_replication" "replication_policy" {
  source_storage_account_id      = azurerm_storage_account.storage_otel_regional.id
  destination_storage_account_id = data.azurerm_storage_account.stacc_infra_shared.id

  rules {
    source_container_name      = azurerm_storage_container.container_logs.name
    destination_container_name = azurerm_storage_container.replicated_logs.name
  }

  depends_on  = [
    azurerm_storage_container.replicated_logs
  ]
}

# Retention policy for replicated logs
resource "azurerm_storage_management_policy" "replicated_retention_policy" {
  storage_account_id = data.azurerm_storage_account.stacc_infra_shared.id

  rule {
    name    = "otel-replicated-retention"
    enabled = true
    filters {
      prefix_match = ["${azurerm_storage_container.replicated_logs.name}/"]
      blob_types   = ["blockBlob"]
    }
    actions {
      base_blob {
        delete_after_days_since_creation_greater_than = 14
      }
      version {
        delete_after_days_since_creation = 14
      }
    }
  }
}

# Resource Block -> OTel collector container instance

# Container Instance for running OpenTelemetry Collector
resource "azurerm_container_group" "otel_collector" {
  name                = local.otel_container_name
  location            = var.LOCATION
  resource_group_name = data.azurerm_resource_group.rg.name
  os_type             = "Linux"
  ip_address_type     = "Public"
  restart_policy      = "Always"

  container {
    name   = "otel-collector"
    image  = "${data.azurerm_container_registry.acr.login_server}/${var.OTEL_CONTAINER_NAME}:${var.OTEL_CONTAINER_TAG}"
    cpu    = "1.0"
    memory = "2"

    # Expose ports for OTLP receivers
    ports {
      port     = 4317
      protocol = "TCP"
    }

    ports {
      port     = 4318
      protocol = "TCP"
    }

    # Health check endpoint
    ports {
      port     = 13133
      protocol = "TCP"
    }

    # Mount configuration from file share
    volume {
      name                 = "otel-config"
      # volume mount_path for the contrib image is /etc/otelcol-contrib, for the core image, it’s /etc/otelcol
      mount_path           = "/etc/otelcol-contrib"
      storage_account_name = azurerm_storage_account.storage_otel_regional.name
      storage_account_key  = azurerm_storage_account.storage_otel_regional.primary_access_key
      share_name           = azurerm_storage_share.container_share.name
    }
  }

  image_registry_credential {
    server   = data.azurerm_container_registry.acr.login_server
    username = data.azurerm_container_registry.acr.admin_username
    password = data.azurerm_container_registry.acr.admin_password
  }

  identity {
    type = "SystemAssigned"
  }

  diagnostics {
    log_analytics {
      workspace_id  = data.azurerm_log_analytics_workspace.law_infra_shared.workspace_id
      workspace_key = data.azurerm_log_analytics_workspace.law_infra_shared.primary_shared_key
    }
  }

  tags = {
    PURPOSE = "OpenTelemetry Collector for ${var.REGION} region"
    OWNER   = var.OWNER
  }

  depends_on  = [
    azurerm_storage_share_file.config_yaml
  ]
}

Variables Configuration

# File: terraform/modules/az-container-otel-collector/variables.tf

variable "PROJECT" {
  description = "Project name"
  type        = string
}

variable "NAME" {
  description = "Component name"
  type        = string
  default     = "OTel"
}

variable "LOCATION" {
  description = "Azure region location (e.g., westeurope, eastus)"
  type        = string
}

variable "REGION" {
  description = "Region identifier (euc, use, cac)"
  type        = string
}

variable "ENVIRONMENT" {
  description = "Environment (play, test, prod)"
  type        = string
}

variable "OWNER" {
  description = "Team/person responsible for this resource"
  type        = string
}

variable "OTEL_CONTAINER_NAME" {
  description = "OTel collector container image name"
  type        = string
  default     = "otel-collector"
}

variable "OTEL_CONTAINER_TAG" {
  description = "OTel collector container image tag"
  type        = string
  default     = "latest"
}

Outputs

# File: terraform/modules/az-container-otel-collector/outputs.tf

output "container_instance_id" {
  value = azurerm_container_group.otel_collector.id
}

output "container_instance_fqdn" {
  value = azurerm_container_group.otel_collector.fqdn
}

output "regional_storage_account_name" {
  value = azurerm_storage_account.storage_otel_regional.name
}

output "otlp_grpc_endpoint" {
  value = "grpc://${azurerm_container_group.otel_collector.fqdn}:4317"
}

output "otlp_http_endpoint" {
  value = "http://${azurerm_container_group.otel_collector.fqdn}:4318"
}

output "health_check_endpoint" {
  value = "http://${azurerm_container_group.otel_collector.fqdn}:13133"
}

Replication Scope and Strategy

As discussed above, in this PoC we use Azure Object Replication and replicate only the logs container to central storage. The traces and metrics containers remain local to each region with a 14-day retention policy. This is partly operational (we haven’t yet built the downstream processing for metrics, so there’s no point replicating them), but it’s also a deliberate scaling decision.

Traces tend to be voluminous. In a cloud-native application generating distributed traces, you can easily produce millions of spans per day. Replicating all traces from all regions to central storage would create substantial data movement and storage costs, especially during the first few days while you’re tuning sampling. By keeping traces local, we give each region operator the ability to debug their cluster independently. They can access 14 days of local trace history without incurring cross-region replication costs.

Logs, by contrast, are smaller and operationally critical. Application logs often contain information needed for timely troubleshooting: error messages, correlation IDs, request timing. By replicating logs to central storage, we make them available to the data warehouse for correlation with customer and business data.

As we mature the system and add metrics processing to the data warehouse, we’ll likely expand replication to include metrics containers. This is an evolutionary decision: replicate what you need now, add more as new requirements emerge. The architecture supports it. It’s just a matter of adding another replication rule and updating the ADF pipeline to load from the new containers.

The 14-day local retention window serves two purposes. First, it’s long enough for regional operators to investigate issues without needing to route through the central data warehouse. Second, it’s short enough to keep storage costs reasonable (regional blob storage is inexpensive, but data egress for replication isn’t free). If we later decide that 14 days isn’t enough, changing the lifecycle policy is a single Terraform variable update.

OpenTelemetry Collector Configuration Template

The collector configuration template defines what telemetry is accepted, how it’s processed, and where it’s exported.

# File: terraform/modules/az-container-otel-collector/config.tmpl

extensions:
  health_check:
    endpoint: "0.0.0.0:13133" 
    path: "/health/status" 
    check_collector_pipeline: 
      enabled: true 
      interval: "5m" 
      exporter_failure_threshold: 5

  text_encoding:
    encoding: utf8
    marshaling_separator: "\n"
    unmarshaling_separator: "\r?\n"

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # Batch processor: groups telemetry for efficient export
  batch:
    send_batch_size: 1000
    timeout: 10s

  # Redaction processor: example for removing sensitive data
  redaction:
    allow_all_keys: true
    blocked_values:
        # IPv4 addresses
        - '(?:[0-9]{1,3}\.){3}[0-9]{1,3}'
        # Email patterns
        - '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

exporters:
  azureblob:
    auth:
      type: connection_string
      connection_string: ${connection_string}
    
    container:
      logs: logs
      traces: traces
      metrics: metrics
    
    # Export text format (UTF-8, newline-separated JSON)
    encodings:
      logs: text_encoding
      traces: text_encoding

service:
  extensions: [health_check, text_encoding]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [redaction]
      exporters: [azureblob]
    logs:
      receivers: [otlp]
      processors: [redaction]
      exporters: [azureblob]

The Template File Pattern: Bridging Infrastructure as Code with Service Configuration

You’ll notice the Terraform code uses a pattern that deserves explanation because it’s more sophisticated than it might first appear. Rather than embedding the OpenTelemetry configuration directly into Terraform or storing it as a static file in the repository, we use a three-layer approach:

Layer 1 - Template File (config.tmpl): Contains the generic OpenTelemetry configuration with placeholders like ${connection_string}. This file is safe to commit to git because it contains no actual secrets but just placeholder names. The template can be edited to add new processors, change receiver ports, or modify export behavior without touching Terraform.

Layer 2 - Terraform Rendering: The templatefile() function takes the template and a map of variable values, substituting placeholders with actual Terraform-managed secrets. This happens during terraform apply, and the rendered output (with real connection strings) is never stored in git or state files in a way that leaks secrets.

Layer 3 - Container Mount: The rendered configuration is uploaded to an Azure File Share and mounted into the container at runtime. The container reads /etc/otelcol-contrib/config.yaml on startup, which contains the actual connection string for the regional storage account.

The trade-off is that you can’t test the rendered configuration easily without running Terraform. If you want to validate the configuration before applying it, you can use terraform console to test the templatefile() function locally:

templatefile("path/to/config.tmpl", {
  connection_string = "DefaultEndpoint=..."
})

This pattern is generic enough to apply beyond OpenTelemetry. It can be used for any service that needs Terraform-managed secrets injected into fixed configuration files, like database connection strings in Java property files, API keys in environment-specific configuration, certificate paths in nginx configs. Once you have the pattern working, it becomes a reusable building block.

Health Check

The container configuration includes a health check endpoint on port 13133. This endpoint does more than just report whether the container is running. It actually validates whether the OpenTelemetry collector is healthy and able to process telemetry.

The health check configuration includes an important parameter: check_collector_pipeline: enabled: true. This tells the health check to validate not just that the collector process is running, but that the actual data pipeline is functional. It checks whether telemetry is flowing through receivers → processors → exporters without blockages. This is more robust than simply checking if the process is alive.

Azure Container Instances could be configured to use this health check to determine if the container should remain running.
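As a sketch of how an external system might probe this endpoint, here is a minimal Python check. The FQDN is a placeholder; the port and path match the health_check extension configuration above, and treating any non-200 response or connection failure as unhealthy is an assumption for this example.

```python
import urllib.request

# Hypothetical endpoint; substitute the collector container group's FQDN.
HEALTH_URL = "http://<collector-fqdn>:13133/health/status"

def collector_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # DNS failure, refused connection, HTTP error, timeout
        return False
```

A monitoring job could call this periodically and alert when a regional collector stops reporting healthy.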

Regional Deployment Configuration

Here’s how I instantiate the module for each region:

# File: euc-prod/main.tf (Europe production environment)

module "otel_collector_eu" {
  source = "../../../shared/terraform/modules/az-container-otel-collector"
  
  PROJECT              = var.PROJECT
  NAME                 = "OTel-Collector"
  LOCATION             = "westeurope"
  REGION               = "euc"
  ENVIRONMENT          = "prod"
  OWNER                = "devops"
  OTEL_CONTAINER_NAME  = "otel-contrib"
  OTEL_CONTAINER_TAG   = var.OTEL_VERSION
}

# File: use-prod/main.tf (US production environment)

module "otel_collector_us" {
  source = "../../../shared/terraform/modules/az-container-otel-collector"
  
  PROJECT              = var.PROJECT
  NAME                 = "OTel-Collector"
  LOCATION             = "eastus"
  REGION               = "use"
  ENVIRONMENT          = "prod"
  OWNER                = "devops"
  OTEL_CONTAINER_NAME  = "otel-contrib"
  OTEL_CONTAINER_TAG   = var.OTEL_VERSION
}

Each instantiation creates independent regional infrastructure:

  • Its own storage account (regional data stays regional)
  • Its own container instance (independent scaling per region)
  • Its own lifecycle policies (region-specific retention)
  • Its own OTLP endpoints for instrumented services to connect

The Path Forward: Open Questions

The current implementation is a proof-of-concept that demonstrates the technical feasibility. Several important questions remain open and will be addressed as the concept moves toward production:

1. Partner-controlled Data Residency

Our Australian cluster is partner-controlled under stringent data residency requirements. We could:

  • Replicate to central: Australian telemetry flows to central storage alongside data from the other regions
  • Keep isolated: Australian data remains in regional storage, queried separately by partner

We haven’t decided which approach aligns with our legal and operational requirements. The architecture allows for both patterns.

2. Additional PII Redaction Patterns

IP addresses are just the beginning. We need to identify:

  • Customer identifiers and user IDs (how are they formatted?)
  • Internal resource identifiers (database names, table identifiers?)
  • Business-sensitive query details (what should be redacted from logs?)
  • Regional configuration that should stay regional?

3. Aggregation and Sampling Strategy

Should we:

  • Store everything (14-day retention per region)?
  • Implement sampling at collector level?
  • Aggregate metrics differently by region?
  • Apply time-based decay (hour-level detail → day-level aggregates)?

4. Production Security

The proof-of-concept implementation currently has no security in front of the OpenTelemetry collectors. Anyone who knows the FQDN can send telemetry to ports 4317 and 4318.

One way to implement network security would be to front the OTel collector with an Azure Application Gateway whose IP whitelist restricts OTLP connections to known good sources (the Kubernetes cluster in that region plus testing sources).

Conclusion: Bringing Product Insights to the Data Warehouse

Integrating product telemetry into a data warehouse closes a critical gap. Business data shows what happened; product telemetry shows how it happened. Together, they enable insights that neither provides alone.

The regional collector architecture with local-first processing demonstrates how to deal with data locality responsibly. We’re not storing everything and filtering later. We’re making intentional decisions about what data matters, redacting sensitive information at collection time, and maintaining data locality for compliance.

The proof-of-concept is infrastructure-complete but operationally immature. We have not yet instrumented our production systems, we haven’t run a full analytics pipeline against the telemetry, and security is minimal.


Additional resources

Testing the collector

For validating basic functionality, we use simple test commands to send sample telemetry:

# Send test trace via HTTP
curl -X POST http://<ip-address-of-aci>:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{
    "resourceSpans": [{
      "resource": {
        "attributes": [
          {"key": "service.name", "value": {"stringValue": "test-service"}}
        ]
      },
      "scopeSpans": [{
        "scope": {"name": "test-scope"},
        "spans": [{
          "traceId": "0af7651916cd43dd8448eb211c80319c",
          "spanId": "b7ad6b7169203331",
          "name": "test-span",
          "startTimeUnixNano": "1614556800000000000",
          "endTimeUnixNano": "1614556801000000000"
        }]
      }]
    }]
  }'

The collector processes this and exports to Azure Blob Storage as:

traces/2025/01/24/traces_10_30_45.json

The JSON structure can then be processed by dbt models that normalize it into business-ready analytics tables.
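Before wiring up dbt, it can help to sanity-check an exported file locally. This Python sketch flattens one newline-delimited OTLP JSON line into span names; the sample line is abridged to the fields from the test trace above.

```python
import json

# One exported line of newline-delimited OTLP JSON (abridged sample);
# the nesting follows resourceSpans -> scopeSpans -> spans.
sample = (
    '{"resourceSpans": [{"resource": {"attributes": '
    '[{"key": "service.name", "value": {"stringValue": "test-service"}}]}, '
    '"scopeSpans": [{"spans": [{"name": "test-span", '
    '"traceId": "0af7651916cd43dd8448eb211c80319c"}]}]}]}'
)

def span_names(line: str) -> list[str]:
    """Flatten one exported OTLP JSON line into a list of span names."""
    payload = json.loads(line)
    return [
        span["name"]
        for rs in payload.get("resourceSpans", [])
        for ss in rs.get("scopeSpans", [])
        for span in ss.get("spans", [])
    ]

print(span_names(sample))
# ['test-span']
```

A dbt model does the same flattening in SQL, but iterating over a blob line by line like this is a quick way to confirm that redaction and encoding behave as expected.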

Additional articles