Data Segmentation With Azure Data Factory & SQL Data Warehouse

A solution for segmenting advertising data, built on top of Azure data services

A well-known player in the Internet advertising industry asked VIAcode to help one of their clients perform complex data integration of analytics data from AWS to Azure.

The Problem

The client was looking for an automated solution that would let them segment the advertising data accumulated for their different products. This required a new system to migrate telemetry data from AWS to Azure and then segment that data by various parameters using analytic tools and machine learning technologies.

Requirements

The system had to handle a high load (an expected 860 million requests and 190 GB per month). Its requirements were:

  • Great performance
  • Ease of maintainability
  • Simple configuration of segmentation rules
  • Implementation of a robust monitoring solution
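To put the stated load in perspective, the monthly figures can be converted into average rates. The sketch below is a back-of-the-envelope calculation only: it assumes a 30-day month, assumes the monthly volume is 190 decimal gigabytes, and assumes traffic is spread evenly, none of which the original states. The function and key names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def load_profile(requests_per_month, bytes_per_month, days_in_month=30):
    """Convert monthly totals into average rates (assumes evenly spread traffic)."""
    seconds = days_in_month * SECONDS_PER_DAY
    return {
        "requests_per_second": requests_per_month / seconds,
        "requests_per_day": requests_per_month / days_in_month,
        "bytes_per_request": bytes_per_month / requests_per_month,
        "gigabytes_per_day": bytes_per_month / days_in_month / 1e9,
    }


# Figures from the requirements: 860 million requests and 190 GB per month.
profile = load_profile(860_000_000, 190 * 10**9)
for name, value in profile.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions the average works out to roughly 330 requests per second and about 220 bytes per request, so the workload is made up of many small records rather than a few large files. Real traffic would peak well above the average.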

Delivering this system, along with documentation and user training classes, required a team of four people (developers and DevOps specialists) and two months of work.

Building the Solution With Azure Data Factory and Data Lake

The client originally wanted to build the solution on top of Azure Data Factory (ADF), uploading data into Azure Data Lake storage daily. The data would then be available for analysis through Data Lake Analytics, and Azure Functions would apply the segmentation rules and transfer the segmented data to the Data API.

Diagram of the originally planned data pipeline: Data Factory to Blob Storage to Azure Functions to the Data API

Proposed Solution

VIAcode engineers revised the client's design and suggested the following changes to improve performance and simplify maintenance while keeping complexity and cost as low as possible:

  • Replacing Azure Data Lake Analytics (ADLA) with Azure SQL Data Warehouse (Azure SQL DWH), because ADLA does not support Azure Data Lake Storage Gen2, which was released after the initial draft of the system design was created.
  • Using Azure Data Factory instead of Azure Functions for data segmentation. Processing data in Azure Functions is not optimal; the required operations are better performed with map-reduce or SQL-like technologies.
  • Adding a way to update the set of heuristic segmentation rules in production on the fly, with no change to the existing infrastructure or source code.
  • Using Azure Monitor Alerts to monitor the whole solution, including Azure Data Factory.

Diagram of the new data flow from AWS through Azure Data Factory and SQL Data Warehouse

Benefits of Our Solution

The design offered by VIAcode is also based on Azure Data Factory. After initial processing, data is uploaded into Azure Data Lake Storage Gen2, where it is available in raw format for analysis in Azure SQL Data Warehouse. Analysts can therefore work with familiar T-SQL queries instead of learning a new tool.

At the same time, the data that has just been stored in the Data Lake storage is processed by the dedicated ADF pipeline that performs segmentation based on the existing rules. Once segmentation is complete, the data is sent to Data API by a component built with Azure Functions.
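The idea behind rule-based segmentation with updatable rules can be sketched in a few lines. This is an illustration only: the production system ran segmentation inside an ADF pipeline, not in Python, and the source does not describe its rule format. The JSON layout, field names, and the functions `load_rules` and `segments_for` are all hypothetical. The point is that rules stored as data can be replaced without touching the code that applies them.

```python
import json
import operator

# Hypothetical rule format; in practice this text would be loaded from
# storage at run time so rules can change without a redeployment.
RULES_JSON = """
[
  {"segment": "high_value", "field": "purchases", "op": ">=", "value": 5},
  {"segment": "mobile", "field": "device", "op": "==", "value": "phone"}
]
"""

# Comparison operators a rule is allowed to use.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def load_rules(text):
    """Parse rules and reject any that use an unknown operator."""
    rules = json.loads(text)
    for rule in rules:
        if rule["op"] not in OPS:
            raise ValueError(f"unsupported operator: {rule['op']}")
    return rules


def segments_for(record, rules):
    """Return the names of all segments whose rule matches the record."""
    matched = []
    for rule in rules:
        value = record.get(rule["field"])
        # Records missing the field simply do not match the rule.
        if value is not None and OPS[rule["op"]](value, rule["value"]):
            matched.append(rule["segment"])
    return matched


rules = load_rules(RULES_JSON)
records = [
    {"user": "a", "purchases": 7, "device": "phone"},
    {"user": "b", "purchases": 1, "device": "desktop"},
]
segmented = {r["user"]: segments_for(r, rules) for r in records}
print(segmented)  # {'a': ['high_value', 'mobile'], 'b': []}
```

Because the evaluator only interprets the rule data, adding or changing a segment means editing the rule document, which mirrors the "no changes to infrastructure or source code" goal described in this case study.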

Deployment of the solution was implemented through Azure Pipelines so it could be deployed and redeployed with ease in any environment on Azure. This allowed the customer to promptly create an instance of the solution for development or testing purposes in just one click.

The Results

VIAcode quickly provided a robust solution that achieved the stated goals:

  • Performance. Overall system performance improved by 50% after the segmentation logic was moved from Azure Functions to Azure Data Factory, and maintenance costs were cut in half.
  • Maintainability.
    • The solution is based on services and solutions native to Azure Cloud Platform and covered by monitoring with Azure Monitor.
    • No changes are required in the existing infrastructure or source code when updating the set of segmentation rules.
    • Azure Pipelines CI/CD is configured so that the entire solution can be deployed in any Azure subscription with just one click.

VIAcode provides services for migration, optimization and management for Azure.