COVID-19 Data Engineering Pipeline (Azure)

End-to-end Azure pipeline from ingestion to curated SQL tables and Power BI analytics.

Azure Data Factory · ADLS Gen2 · Databricks · PySpark · Azure SQL · Power BI

Overview

An end-to-end Azure data engineering solution that ingests public COVID-19 datasets, transforms raw files into curated layers, and serves analytics-ready tables for reporting.

Problem

  • Public health datasets arrive as separate files in inconsistent formats, making unified analytics difficult.
  • Reporting requires a reliable pipeline from ingestion through to curated, query-friendly outputs.

Solution

  • Ingest ECDC COVID-19 CSV datasets and population TSV files into the ADLS Gen2 raw zone with Azure Data Factory.
  • Transform and standardize the data using ADF Mapping Data Flows and Databricks PySpark notebooks.
  • Load cleaned datasets into Azure SQL Database to serve Power BI reporting.

Architecture

  1. Public datasets and blob files -> ADF ingestion
  2. ADLS Gen2 raw zone -> ADF/Databricks transformations
  3. Curated clean zone in ADLS Gen2
  4. ADF load into Azure SQL analytical tables
  5. Power BI dashboards on top of Azure SQL

Outcomes

  • Unified multiple COVID-19 source feeds into a single analytics model.
  • Enabled interactive dashboards for trends, country filters, and date-range analysis.
  • Established a reusable raw-to-curated architecture for future data products.

Highlights

  • Hybrid transformation approach using low-code ADF plus code-based PySpark.
  • Clear separation of raw and curated layers for maintainability.
  • Architecture prepared for CI/CD, incremental loads, and data quality checks.