COVID-19 Data Engineering Pipeline (Azure)
End-to-end Azure pipeline from ingestion to curated SQL tables and Power BI analytics.
Azure Data Factory · ADLS Gen2 · Databricks · PySpark · Azure SQL · Power BI
Overview
An end-to-end Azure data engineering solution that ingests public COVID-19 datasets, transforms raw files into curated layers, and serves analytics-ready tables for reporting.
Problem
- Public health datasets arrive as separate files in inconsistent formats, making unified analytics difficult.
- Reporting requires a reliable pipeline from ingestion to curated, query-friendly outputs.
Solution
- Ingest ECDC COVID-19 CSV datasets and population TSV files into the ADLS Gen2 raw zone with Azure Data Factory.
- Transform and standardize data using ADF Mapping Data Flows and Databricks PySpark notebooks.
- Load the clean datasets into Azure SQL Database for downstream reporting in Power BI.
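The standardization step can be sketched in plain Python. In the pipeline this logic runs as ADF Mapping Data Flows or Databricks PySpark; the column names below (`dateRep`, `countriesAndTerritories`) follow the public ECDC feed's conventions, and the sample rows are illustrative:

```python
import csv
import io
from datetime import datetime

# Illustrative raw extract in the ECDC layout (dd/mm/yyyy dates, verbose names).
RAW_CSV = """dateRep,countriesAndTerritories,cases,deaths
15/03/2021,Germany,1000,20
16/03/2021,Germany,1100,25
"""

def standardize(row):
    """Map raw ECDC-style fields to curated names and ISO-8601 dates."""
    return {
        "report_date": datetime.strptime(row["dateRep"], "%d/%m/%Y").date().isoformat(),
        "country": row["countriesAndTerritories"].replace("_", " "),
        "cases": int(row["cases"]),
        "deaths": int(row["deaths"]),
    }

curated = [standardize(r) for r in csv.DictReader(io.StringIO(RAW_CSV))]
```

The same renames and date parsing translate directly to PySpark (`withColumnRenamed`, `to_date`) when run at scale in Databricks.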
Architecture
- Public datasets and blob files -> ADF ingestion
- ADLS Gen2 raw zone -> ADF/Databricks transformations
- Curated clean zone in ADLS Gen2
- ADF load into Azure SQL analytical tables
- Power BI dashboards on top of Azure SQL
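The raw and curated zones above map onto ADLS Gen2 paths. A minimal sketch of one possible path convention, with a hypothetical storage account name and date-partitioned datasets (not necessarily the layout used in the actual deployment):

```python
ACCOUNT = "mydatalake"  # hypothetical ADLS Gen2 storage account name
CONTAINERS = {"raw": "raw", "curated": "curated"}

def adls_path(zone: str, dataset: str, run_date: str) -> str:
    """Build an abfss:// path for a zone/dataset/date partition in ADLS Gen2."""
    return (f"abfss://{CONTAINERS[zone]}@{ACCOUNT}.dfs.core.windows.net/"
            f"{dataset}/{run_date}/")
```

Keeping raw and curated data in separate containers with identical partitioning makes the ADF copy activities and Databricks reads symmetric across zones.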
Metrics
- Unified multiple COVID-19 source feeds into a single analytics model.
- Enabled interactive dashboards for trends, country filters, and date-range analysis.
- Established a reusable raw-to-curated architecture for future data products.
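Unifying the case and population feeds amounts to a join plus a derived rate. A toy Python sketch of that enrichment, with invented sample figures (the pipeline performs the equivalent join in PySpark or SQL):

```python
# Illustrative curated rows and a country -> population lookup.
cases = [{"country": "Germany", "report_date": "2021-03-15", "cases": 1000}]
population = {"Germany": 83_000_000}

def enrich(rows, pop):
    """Join case counts to population and derive cases per 100k inhabitants."""
    out = []
    for r in rows:
        p = pop.get(r["country"])
        rate = round(r["cases"] / p * 100_000, 2) if p else None
        out.append({**r, "cases_per_100k": rate})
    return out
```

The null-safe lookup mirrors a left join, so countries missing from the population feed still reach the model rather than being silently dropped.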
Highlights
- Hybrid transformation approach using low-code ADF plus code-based PySpark.
- Clear separation of raw and curated layers for maintainability.
- Architecture prepared for CI/CD, incremental loads, and data quality checks.
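One common way to prepare for incremental loads is a high-watermark filter: each run loads only rows newer than the last successfully loaded date. A minimal sketch of that pattern (the watermark storage and orchestration are left to ADF; the dates here are illustrative):

```python
def rows_to_load(rows, watermark: str):
    """Keep only rows newer than the last loaded date.

    ISO-8601 date strings compare correctly lexicographically,
    so plain string comparison is safe here.
    """
    return [r for r in rows if r["report_date"] > watermark]

history = [
    {"report_date": "2021-03-14", "cases": 900},
    {"report_date": "2021-03-15", "cases": 1000},
    {"report_date": "2021-03-16", "cases": 1100},
]
delta = rows_to_load(history, "2021-03-14")  # only the two newer days
```

In a full implementation the watermark would be persisted (e.g. in a control table in Azure SQL) and advanced only after a successful load.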
