Data Platform Prep
This is a collection of books and courses I personally recommend. They are valuable for any data engineering learner, and I have used or owned each of them in my professional work.
Together they support implementing a robust Data Platform Design framework, combining Data Engineering and Automation for Data Platform Operations and Analytics.
Data Engineering Fundamentals
What is Data Platform Design?
The Data Platform Design Framework goes beyond traditional data scaling.
Data Platform Design is a set of practices and processes for managing the data lifecycle, from data ingestion to processing and analysis, in a way that ensures high quality and reliability.
The framework provides a variety of tools to manage the data lifecycle, automate data processing and analysis, and maintain high data quality. It helps companies reduce the effort of data operations and take full advantage of data insights.
Goals: Improve collaboration between data professionals, enhance data quality, and speed up data-related tasks.
Set of expectations for data platform design:
- Highly available, redundant configuration services run within the platform.
- Zero-downtime capability with granular monitoring.
- Auto-scaling across services.
- All services are maintained and governed by Governance, the backbone of the platform.
I describe the Data Platform Design Framework in five layers to help readers conceptualize and contextualize it.
I’ve started DataPods - Open Source Data Platform Ops to help readers understand the Data Platform Design Framework and how to implement it. DataPods is a comprehensive starter kit that provides:
1. Production-like configurations
2. Easy deployment options (K8s/Docker)
3. Best-in-class open source tools
Core Abstraction
- DAG: the flow of data through the platform (sketched in code below)
- Node: a unit of data, the data object being processed
- Edge: the transformation logic that converts data between nodes
- Foundation: the groundwork that ensures the platform is built right
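A minimal sketch of these abstractions in plain Python, assuming nothing beyond the standard library (all class and variable names here are hypothetical, not part of DataPods):

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Node:
    """A unit of data: the data object being processed."""
    name: str
    data: Any = None

@dataclass
class Edge:
    """Transformation logic that converts data from source to target."""
    source: Node
    target: Node
    transform: Callable[[Any], Any]

@dataclass
class DAG:
    """A flow of data: nodes wired together by edges."""
    edges: list = field(default_factory=list)

    def run(self) -> None:
        # Naive pass: assumes edges are listed in dependency order;
        # a real scheduler would topologically sort them first.
        for edge in self.edges:
            edge.target.data = edge.transform(edge.source.data)

# Usage: raw -> cleaned, with the cleaning logic carried by the edge.
raw = Node("raw", data=[" a ", "b", None])
cleaned = Node("cleaned")
dag = DAG(edges=[Edge(raw, cleaned, lambda rows: [r.strip() for r in rows if r])])
dag.run()
print(cleaned.data)  # ['a', 'b']
```

The Foundation layer has no direct equivalent in code; it is the operational groundwork (infrastructure, monitoring, governance) on which these abstractions run.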
Design Pattern
- Batch: processes a wide window of data in one run
- Real-time: micro-batches with very low latency
- Stream
- Lambda
- Idempotence (see the sketch after this list)
- Fan-out
- Fan-in
- Parallel: run jobs at the same time
- Dependencies: run a job only after its preceding jobs complete
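To make the idempotence pattern concrete, here is a minimal PySpark sketch, assuming daily-partitioned Parquet data; the bucket, paths, and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("idempotent-batch").getOrCreate()

def load_daily_partition(run_date: str) -> None:
    """Idempotent daily load: re-running for the same run_date yields
    the same result, because the target partition is fully replaced."""
    df = (
        spark.read.parquet(f"s3://my-bucket/raw/events/dt={run_date}/")
             .dropDuplicates(["event_id"])  # guard against replayed input
    )
    # Overwrite (never append), so retries cannot create duplicates.
    df.write.mode("overwrite").parquet(
        f"s3://my-bucket/curated/events/dt={run_date}/"
    )

load_daily_partition("2024-12-12")
```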
Utility Function
- Clean
- Transform
- Derive
- DQ: data quality checks
- Dead-letter: set failed records aside for later inspection
- Change capture: change data capture (CDC)
- Load: merge/upsert into the target (see the sketch after this list)
- Audit
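As an illustration of the Load (merge/upsert) utility, here is a minimal sketch using Spark SQL, assuming Delta Lake tables registered in the metastore; the table and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-upsert").getOrCreate()

# MERGE keeps the load idempotent: matched keys are updated in place,
# new keys are inserted, and re-running with the same staging data
# leaves the target unchanged.
spark.sql("""
    MERGE INTO gold.customers AS t
    USING staging.customer_updates AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```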
Design Data Platform
Fact: every data platform I have designed has the five essential components I describe in my book, the Data Engineering Handbook.
For data scaling techniques, I focus on Data Warehouse Scaling and Data Pipeline Scaling.
Another book of mine, Serverless Data Platform (WIP), covers the Data Platform Design Framework and serves as a guide for getting DataOps settled; in it I explain how the platform works, with examples of the AWS services used in a data platform.
AWS, Azure, and GCP are service providers for centralizing the control, maintenance, operation, and management of the data platform and data infrastructure.
Note: this is the list of books, blogs, and courses that I have personally finished or am on the way to finishing; I highly recommend them to everyone.
Updated 2024-12-12: I have added notes for Platform Ops alongside Data Engineering.
Overview
- How Open Source Applications Work
- Serverless Data Pipeline
- Specification of Designing Data Pipeline
- Building the Data Warehouse, Bill Inmon
- Data Modeling with Snowflake, Serge Gershkovich
- The Data Engineering Cookbook, Andreas Kretz
- Data Engineering Patterns on the Cloud, Bartosz Konieczny
- C4 Architecture
- Introduction to Data Engineering, Daniel Beach
- Data With Rust - Re-write Data Engineering in Rust, Karim Jedda
- Data Pipelines Pocket Reference, James Densmore
- Designing Data-Intensive Applications
- DAMA-DMBOK: Data Management Body of Knowledge
- Streaming Systems, Tyler Akidau, Slava Chernyak, Reuven Lax
- High Performance Spark, Holden Karau, Rachel Warren
- Data Pipelines with Apache Airflow
- Fundamentals of Data Observability, Andy Petrella
- Scaling Machine Learning with Spark, Adi Polak
- Deciphering Data Architectures, James Serra
- Architecture Patterns with Python
- Learning Spark, Brooke Wenig, Denny Lee, Tathagata Das, Jules Damji
- The Unified Star Schema, Bill Inmon
- Data Engineering Book by Oleg Agapov: Accumulated knowledge and experience in the field of Data Engineering
Resources
From Internet
- Designing Data-Intensive Applications - Legit
- Building the Data Warehouse, Bill Inmon - Legit
- Data Engineering Nanodegree (Udacity) - Overview, Demo
- Big Data Specialization (Coursera)
- Learning Spark
- The Data Warehouse Lifecycle Toolkit by Ralph Kimball and Laura Reeves - Legit
- Data Engineering
- Pattern of DE: Online Data Engineering Design Patterns by Simon
- Open Modern Data Platform: Starburst Galaxy
- Open Source Data Stack Summary
- Summary of books I have read: DEH-Books
Papers
- Bolt-on causal consistency by Peter Bailis, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica
- BigQuery: Creating a table definition file for an external data source
- FSST: Fast random access string compression by Peter Boncz, Thomas Neumann, and Viktor Leis
- Building a database on S3 by Matthias Brantner, Daniela Florescu, David Graf, Donald Kossmann, and Tim Kraska
- Data validation for machine learning by Eric Breck, Marty Zinkevich, Neoklis Polyzotis, Steven Whang, and Sudip Roy
- No silver bullet – essence and accidents of software engineering by Fred Brooks
- The Snowflake elastic data warehouse by Benoit Dageville et al.
- GraphFrames: An integrated API for mixing graph and relational queries by Ankur Dave et al.
- How to move beyond a monolithic data lake to a distributed data mesh by Zhamak Dehghani
- The BigDAWG polystore system by Jennie Duggan et al.
- White-box compression: Learning and exploiting compact table representations by Bogdan Ghita et al.
- Feature store: The missing data layer in ML pipelines? by Kim Hammar and Jim Dowling
- Here are my data files. Here are my queries. Where are my results? by Stratos Idreos et al.
- Enabling and optimizing non-linear feature interactions in factorized linear algebra by Shangyu Li et al.
- MLlib: Machine learning in Apache Spark by Xiangrui Meng et al.
- A computer oriented geodetic data base; and a new technique in file sequencing by G. M. Morton
- Starling: A scalable query engine on cloud functions by Matthew Perron et al.
- Missing the forest for the trees: End-to-end AI application performance in edge data centers by Daniel Richins et al.
- Presto: SQL on everything by Raghav Sethi et al.
- Why the ‘data lake’ is really a ‘data swamp’ by Michael Stonebraker
- Skipping-oriented partitioning for columnar layouts by Li Sun et al.
- Hive - a petabyte scale data warehouse using Hadoop by Ashish Thusoo et al.
- SparkR: Scaling R programs with Spark by Shivaram Venkataraman et al.
- Accelerating the machine learning lifecycle with MLflow by Matei Zaharia et al.
- Dimensions based data clustering and zone maps by M. Ziauddin et al.
Case studies
Practical Examples for Data Engineering
- Big Data Framework
- Kappa Data Pipeline (real-time) using AWS
- Data Modeling and Analytic Engineering
- Data pipeline with Open Source Mage AI and ClickHouse
- AWS Ingestion Pipeline
- Azure Data Pipeline in 1 hour
- Design ETL Pipeline for Interview Assessment
- How to do everything
- https://www.ssp.sh/blog/open-data-stack-core-tools/
Check out the documentation: Hands-on with Data Open Source.
Bonus
Additional Recommendations:
- Certifications: Consider certifications like AWS Certified Data Engineer, GCP Certified Data Engineer, or Azure Data Engineer Associate.
- Open-source projects: Contribute to open-source data engineering projects to gain practical experience.
- Online communities: Engage with data engineering communities on platforms like Stack Overflow, Reddit, and LinkedIn.
- Networking: Build relationships with other data engineers to learn from their experiences.
- Remember: This is a general roadmap. The specific courses, books, and practices may vary depending on your experience level, industry, and technology stack.
My universe
- longdatadevlog.com: My knowledge hub and digital garden
- LinkedIn: Professional profile and LinkedIn presence
- Data camping & Hands-on with Data Open Source: Coding projects and GitHub portfolio
- dotfile: My development environment setup
- https://www.longdatadevlog.com/brain: Curated insights and updates
- Mini ETL Starter Kit and One Data Governance: Showcasing my creative projects
- payhip.com/longdatadevlog & use.longdatadevlog.com: hustle and get paid
- de-book.longdatadevlog.com & de-handbook-pro-production.up.railway.app: An unfinished book about data engineering
- longdatadevlog.com/brain: The Thoughts section is what I’m currently focused on
Templates
@startuml
actor User
participant "Power BI" as PBI
participant "API Management" as API
participant "Data Factory" as DF
participant "OneLake" as OL
participant "Data Engineering" as DE
participant "Microsoft Purview" as MP
database "External Sources" as ES
database "Bronze Layer" as BL
database "Silver Layer" as SL
database "Gold Layer" as GL
User -> PBI: Create Reports
PBI -> OL: Query Data
PBI -> MP: Check Governance
ES -> API: Data Ingestion
API -> DF: Orchestrate Pipelines
DF -> OL: Store Data
OL -> BL: Raw Data
OL -> SL: Cleansed Data
OL -> GL: Curated Data
DF -> DE: Transform Data
DE -> BL: Process Raw
DE -> SL: Cleanse Data
DE -> GL: Curate Data
MP -> OL: Apply Governance
MP --> DF: Enforce Policies
@enduml
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#ffffff', 'primaryTextColor': '#333333', 'primaryBorderColor': '#666666', 'lineColor': '#666666', 'secondaryColor': '#ffffff', 'tertiaryColor': '#ffffff' }}}%%
flowchart TD
subgraph Data_Processing_NoChange
direction LR
Connecting --> Buffering --> Processing --> Storing --> Visualizing
end
subgraph Backbone
Automation
Governance
end
Automation --> Data_Processing_NoChange
Governance --> Data_Processing_NoChange
Data Project Requirement Documentation - PRD
Product Requirements Document (PRD) for Data Platform
1. Overview
1.1 Purpose
This document outlines the requirements for developing a Data Platform using Microsoft Fabric, designed to centralize data ingestion, processing, storage, and analytics for [Company Name], enabling self-service business intelligence (BI), advanced analytics, and data-driven decision-making.
1.2 Scope
The platform will include core functionalities such as data ingestion from multiple sources, a medallion architecture (bronze, silver, gold layers), data governance, and Power BI integration for visualization. Out-of-scope items include real-time streaming analytics and third-party integrations beyond specified APIs (e.g., Azure API Management).
1.3 Stakeholders
- Product Owner: [Name/Role, e.g., Data Platform Lead]
- Development Team: [Team Name, e.g., Data Engineering Team]
- End Users: Data Engineers, Data Analysts, Business Users
- Other: Data Governance Team, IT Operations
2. Features List
2.1 Core Features
| Feature ID | Feature Name | Description | Priority | Acceptance Criteria |
| --- | --- | --- | --- | --- |
| F1 | Data Ingestion | Ingest data from external sources (e.g., APIs, databases, files) into OneLake via Azure Data Factory. | High | Supports CSV, JSON, SQL, and API inputs. Pipelines process 1M rows in < 5 minutes. Error handling for failed ingestions (see the sketch after this table). |
| F2 | Medallion Architecture | Implement bronze, silver, and gold layers in OneLake for raw, cleansed, and curated data. | High | Bronze layer stores raw data as-is. Silver layer applies cleansing. Gold layer produces curated datasets. Data lineage is maintained across layers. |
| F3 | Data Governance | Enforce data quality and security using Microsoft Purview. | High | PII data classification completed. Access controls restrict sensitive data. Audit logs track data access. |
| F4 | Data Processing | Transform data using Data Engineering (Spark) and Data Factory for ETL/ELT pipelines. | High | Pipelines transform bronze to silver in < 10 minutes. Supports SQL and PySpark scripts. Error logs are generated. |
| F5 | Power BI Integration | Enable self-service BI reporting via Fabric’s Power BI. | Medium | Users can create reports from the gold layer. Dashboards load in < 3 seconds. Supports role-based access. |
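The error-handling criterion in F1 can be pictured with a small retry-plus-dead-letter sketch. This is a hypothetical Python illustration of the behavior, not the platform implementation; in Fabric the equivalent logic would be configured through Data Factory retry policies, and the `sink` and `dead_letter` interfaces below are assumed:

```python
import time

def ingest_with_retry(record: dict, sink, dead_letter, max_retries: int = 3) -> None:
    """Write a record to the sink with retries; after max_retries failures,
    park it in the dead-letter store instead of failing the whole pipeline."""
    for attempt in range(1, max_retries + 1):
        try:
            sink.write(record)
            return
        except Exception as exc:
            if attempt == max_retries:
                dead_letter.write({"record": record, "error": str(exc)})
                return
            time.sleep(2 ** attempt)  # exponential backoff between attempts
```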
2.2 Future Features
- Real-time data streaming with Event Hubs.
- Machine learning model integration via Data Science.
- Advanced data marketplace for internal data sharing.
3. Architectural Diagram
The platform follows a medallion architecture within Microsoft Fabric, leveraging OneLake for storage and Azure services for processing. Below is a PlantUML diagram (importable to draw.io).
@startuml
actor User
participant "Power BI" as PBI
participant "API Management" as API
participant "Data Factory" as DF
participant "OneLake" as OL
participant "Data Engineering" as DE
participant "Microsoft Purview" as MP
database "External Sources" as ES
database "Bronze Layer" as BL
database "Silver Layer" as SL
database "Gold Layer" as GL
User -> PBI: Create Reports
PBI -> OL: Query Data
PBI -> MP: Check Governance
ES -> API: Data Ingestion
API -> DF: Orchestrate Pipelines
DF -> OL: Store Data
OL -> BL: Raw Data
OL -> SL: Cleansed Data
OL -> GL: Curated Data
DF -> DE: Transform Data
DE -> BL: Process Raw
DE -> SL: Cleanse Data
DE -> GL: Curate Data
MP -> OL: Apply Governance
MP --> DF: Enforce Policies
@enduml
3.1 Architecture Description
- External Sources: APIs, SQL databases, CSV/JSON files.
- API Management: Securely manages data ingestion APIs.
- Data Factory: Orchestrates ETL/ELT pipelines.
- OneLake: Central data lake with bronze (raw), silver (cleansed), and gold (curated) layers.
- Data Engineering: Spark-based processing for transformations (see the sketch after this list).
- Microsoft Purview: Data governance, cataloging, and lineage.
- Power BI: Self-service BI and reporting for business users.
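To make the bronze-to-silver step concrete, here is a PySpark sketch of the kind of cleansing transform the Data Engineering workload might run; the paths and column names are illustrative, not part of this PRD:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

# Read raw data from the bronze layer (illustrative Lakehouse path).
bronze = spark.read.format("delta").load("Tables/bronze/orders")

silver = (
    bronze.dropDuplicates(["order_id"])                        # remove replays
          .filter(F.col("order_id").isNotNull())               # basic quality gate
          .withColumn("order_ts", F.to_timestamp("order_ts"))  # enforce types
          .withColumn("_ingested_at", F.current_timestamp())   # lineage metadata
)

# Write the cleansed result to the silver layer.
silver.write.format("delta").mode("overwrite").save("Tables/silver/orders")
```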
4. Non-Functional Requirements
4.1 Performance
- Ingest 1M rows in < 5 minutes.
- Report generation in < 3 seconds for datasets < 100MB.
4.2 Scalability
- Supports up to 10TB of data in OneLake.
- Horizontal scaling via Azure Synapse for compute.
4.3 Security
- Data encrypted at rest (AES-256) and in transit (TLS 1.3).
- PII protected via Purview’s data classification.
- Compliance with GDPR, CCPA.
4.4 Reliability
- 99.95% uptime for Fabric services.
- Automated daily backups in OneLake.
5. Assumptions and Constraints
5.1 Assumptions
- Azure infrastructure (e.g., OneLake, Data Factory) is pre-configured.
- Users have access to Power BI Pro licenses.
- Stable API connectivity to external sources.
5.2 Constraints
- Development timeline: 6–8 weeks.
- Budget: $100,000.
- Limited to Microsoft Fabric ecosystem for core functionalities.
6. Dependencies
- External APIs: Azure API Management for secure ingestion.
- Azure Services: Data Factory, Synapse Analytics, Purview, Power BI.
- Libraries: PySpark, SQL for data processing.
- Infrastructure: Azure (OneLake, VMs, Blob Storage).
7. Risks and Mitigations
| Risk | Impact | Mitigation |
| --- | --- | --- |
| Data source API failures | High | Implement retry logic and error notifications in Data Factory. |
| Data quality issues | Medium | Use Purview for automated data quality checks. |
| Scalability bottlenecks | Medium | Conduct load testing with 10TB datasets. |
| Governance compliance | High | Regular Purview audits and policy updates. |
8. Acceptance Criteria
- All core features (F1–F4) pass integration tests in Fabric.
- Architectural diagram aligns with implemented platform.
- Non-functional requirements (e.g., performance, security) are verified.
- Stakeholder approval after Power BI demo with sample reports.
- Data lineage and governance policies enforced via Purview.
9. Glossary
- ETL/ELT: Extract, Transform, Load / Extract, Load, Transform.
- Medallion Architecture: Bronze (raw), Silver (cleansed), Gold (curated) data layers.
- OneLake: Microsoft Fabric’s centralized data lake.
- Microsoft Purview: Data governance and cataloging tool.