COVID-19 Data Engineering Pipeline (Azure)
End-to-end Azure pipeline from ingestion to curated SQL tables and Power BI analytics.
Azure Data Factory · ADLS Gen2 · Databricks · PySpark · Azure SQL · Power BI
Overview
An end-to-end Azure data engineering solution that ingests public COVID-19 datasets, transforms raw files into curated layers, and serves analytics-ready tables for reporting.
Problem
- Public health datasets arrive as separate files in inconsistent formats, making unified analytics difficult.
- Reporting requires a reliable pipeline from ingestion to curated, query-friendly outputs.
Solution
- Ingest ECDC COVID-19 CSV datasets and population TSV files into the ADLS Gen2 raw zone with Azure Data Factory.
- Transform and standardize data using ADF Mapping Data Flows and Databricks PySpark notebooks.
- Load the clean datasets into Azure SQL Database for downstream reporting in Power BI.
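The standardization step can be sketched in plain Python. In the pipeline this logic runs as ADF Mapping Data Flows or Databricks PySpark; the column names below (`dateRep`, `countriesAndTerritories`) follow the public ECDC feed's conventions, and the sample rows are illustrative:

```python
import csv
import io
from datetime import datetime

# Illustrative raw extract in the ECDC layout (dd/mm/yyyy dates, verbose names).
RAW_CSV = """dateRep,countriesAndTerritories,cases,deaths
15/03/2021,Germany,1000,20
16/03/2021,Germany,1100,25
"""

def standardize(row):
    """Map raw ECDC-style fields to curated names and ISO-8601 dates."""
    return {
        "report_date": datetime.strptime(row["dateRep"], "%d/%m/%Y").date().isoformat(),
        "country": row["countriesAndTerritories"].replace("_", " "),
        "cases": int(row["cases"]),
        "deaths": int(row["deaths"]),
    }

curated = [standardize(r) for r in csv.DictReader(io.StringIO(RAW_CSV))]
```

The same renames and date parsing translate directly to PySpark (`withColumnRenamed`, `to_date`) when run at scale in Databricks.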
Architecture
- Public datasets and blob files -> ADF ingestion
- ADLS Gen2 raw zone -> ADF/Databricks transformations
- Curated clean zone in ADLS Gen2
- ADF load into Azure SQL analytical tables
- Power BI dashboards on top of Azure SQL
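The raw and curated zones above map onto ADLS Gen2 paths. A minimal sketch of one possible path convention, with a hypothetical storage account name and date-partitioned datasets (not necessarily the layout used in the actual deployment):

```python
ACCOUNT = "mydatalake"  # hypothetical ADLS Gen2 storage account name
CONTAINERS = {"raw": "raw", "curated": "curated"}

def adls_path(zone: str, dataset: str, run_date: str) -> str:
    """Build an abfss:// path for a zone/dataset/date partition in ADLS Gen2."""
    return (f"abfss://{CONTAINERS[zone]}@{ACCOUNT}.dfs.core.windows.net/"
            f"{dataset}/{run_date}/")
```

Keeping raw and curated data in separate containers with identical partitioning makes the ADF copy activities and Databricks reads symmetric across zones.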
Metrics
- Unified multiple COVID-19 source feeds into a single analytics model.
- Enabled interactive dashboards for trends, country filters, and date-range analysis.
- Established a reusable raw-to-curated architecture for future data products.
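Unifying the case and population feeds amounts to a join plus a derived rate. A toy Python sketch of that enrichment, with invented sample figures (the pipeline performs the equivalent join in PySpark or SQL):

```python
# Illustrative curated rows and a country -> population lookup.
cases = [{"country": "Germany", "report_date": "2021-03-15", "cases": 1000}]
population = {"Germany": 83_000_000}

def enrich(rows, pop):
    """Join case counts to population and derive cases per 100k inhabitants."""
    out = []
    for r in rows:
        p = pop.get(r["country"])
        rate = round(r["cases"] / p * 100_000, 2) if p else None
        out.append({**r, "cases_per_100k": rate})
    return out
```

The null-safe lookup mirrors a left join, so countries missing from the population feed still reach the model rather than being silently dropped.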
Highlights
- Hybrid transformation approach using low-code ADF plus code-based PySpark.
- Clear separation of raw and curated layers for maintainability.
- Architecture prepared for CI/CD, incremental loads, and data quality checks.
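One common way to prepare for incremental loads is a high-watermark filter: each run loads only rows newer than the last successfully loaded date. A minimal sketch of that pattern (the watermark storage and orchestration are left to ADF; the dates here are illustrative):

```python
def rows_to_load(rows, watermark: str):
    """Keep only rows newer than the last loaded date.

    ISO-8601 date strings compare correctly lexicographically,
    so plain string comparison is safe here.
    """
    return [r for r in rows if r["report_date"] > watermark]

history = [
    {"report_date": "2021-03-14", "cases": 900},
    {"report_date": "2021-03-15", "cases": 1000},
    {"report_date": "2021-03-16", "cases": 1100},
]
delta = rows_to_load(history, "2021-03-14")  # only the two newer days
```

In a full implementation the watermark would be persisted (e.g. in a control table in Azure SQL) and advanced only after a successful load.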
