Olist End-to-End Data Engineering Pipeline

A Medallion pipeline on GCP that processes 112K+ transactions with PySpark, dbt, and Airflow: production-grade data engineering that runs within the GCP Free Tier.

A personal project designed to demonstrate end-to-end data engineering capability. It uses the public Olist dataset (Brazilian e-commerce), which contains an entity resolution problem as complex as those found in real-world systems.

Bronze → Silver → Gold architecture: Raw data is ingested into the Bronze layer with PySpark, enforcing explicit StructType schemas to create a fail-fast data contract that stops schema drift at the source. The data is then transformed with dbt, using an incremental load strategy plus table partitioning and clustering by customer_unique_id, which reduces BigQuery scan cost by roughly 90%.
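The fail-fast data contract idea can be illustrated with a minimal sketch. This is a plain-Python stand-in, not the PySpark StructType API and not the project's actual code; the contract name ORDERS_CONTRACT, the column list, the SchemaDriftError class, and the validate_record helper are all assumptions made for illustration.

```python
# Hypothetical contract for an orders table; column names and types are
# illustrative assumptions, not the project's real schema definition.
ORDERS_CONTRACT = {
    "order_id": str,
    "customer_id": str,
    "order_status": str,
}


class SchemaDriftError(ValueError):
    """Raised when a record does not match the declared contract."""


def validate_record(record: dict, contract: dict) -> dict:
    """Return the record unchanged, or raise on the first contract violation."""
    missing = contract.keys() - record.keys()
    extra = record.keys() - contract.keys()
    if missing or extra:
        # Fail fast: added or dropped columns are rejected at ingestion time.
        raise SchemaDriftError(
            f"missing={sorted(missing)} extra={sorted(extra)}"
        )
    for name, expected_type in contract.items():
        if not isinstance(record[name], expected_type):
            raise SchemaDriftError(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(record[name]).__name__}"
            )
    return record
```

Rejecting a bad record at ingestion means downstream layers never see a column that silently changed type or disappeared, which is the property the explicit schemas are meant to provide.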

Solving "The Olist Trap": The entity resolution problem in Olist is that customer_id changes every time a customer makes a new purchase, which makes Customer Lifetime Value (LTV) calculations wrong across the whole system. The fix is to map each order back to customer_unique_id in the Silver layer so that the Gold marts produce correct results.
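A small sketch shows why the choice of key matters. The rows below are made up for illustration and the table shape is simplified (a payment_value attached directly to each order is an assumption); ltv_by is a hypothetical helper, not the project's dbt model.

```python
from collections import defaultdict

# Made-up sample rows: customer u1 bought twice and received a new
# customer_id each time.
customers = [
    {"customer_id": "c1", "customer_unique_id": "u1"},
    {"customer_id": "c2", "customer_unique_id": "u1"},
    {"customer_id": "c3", "customer_unique_id": "u2"},
]
orders = [
    {"customer_id": "c1", "payment_value": 100.0},
    {"customer_id": "c2", "payment_value": 50.0},
    {"customer_id": "c3", "payment_value": 30.0},
]


def ltv_by(key: str, orders: list, customers: list) -> dict:
    """Sum payment_value per customer, grouped by the chosen customer key."""
    id_map = {c["customer_id"]: c[key] for c in customers}
    totals = defaultdict(float)
    for order in orders:
        totals[id_map[order["customer_id"]]] += order["payment_value"]
    return dict(totals)


# Grouping by customer_id splits one person into two "customers".
naive = ltv_by("customer_id", orders, customers)
# Grouping by customer_unique_id merges the repeat purchases.
resolved = ltv_by("customer_unique_id", orders, customers)

print(naive)     # {'c1': 100.0, 'c2': 50.0, 'c3': 30.0}
print(resolved)  # {'u1': 150.0, 'u2': 30.0}
```

With the per-order key, every customer looks like a one-time buyer, so repeat-purchase value is never attributed to the same person.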

Orchestration & Observability: Apache Airflow runs idempotent DAGs (any number of reruns yields the same result), together with a custom JobLogger for end-to-end BigQuery observability. The entire system is designed to run within the GCP Always Free tier.
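The two ideas above, idempotent reruns and per-job logging, can be sketched together in plain Python. This is only an illustration under stated assumptions: the JobLogger class here is a hypothetical in-memory stand-in (the project's real JobLogger and its BigQuery target are not shown in the source), and load_partition models a partition overwrite using a dict instead of a warehouse table.

```python
import time
from contextlib import contextmanager


class JobLogger:
    """Hypothetical stand-in: collects one log row per job run in memory."""

    def __init__(self):
        self.rows = []

    @contextmanager
    def track(self, job_name: str, run_date: str):
        row = {"job": job_name, "run_date": run_date, "status": "running"}
        start = time.monotonic()
        try:
            yield row  # the job can attach its own metrics to the row
            row["status"] = "success"
        except Exception as exc:
            row["status"] = "failed"
            row["error"] = repr(exc)
            raise
        finally:
            row["duration_s"] = time.monotonic() - start
            self.rows.append(row)


def load_partition(table: dict, run_date: str, records: list) -> int:
    """Overwrite the partition for run_date instead of appending to it."""
    table[run_date] = list(records)
    return len(table[run_date])


table = {}
logger = JobLogger()
batch = [{"order_id": "o1"}, {"order_id": "o2"}]

# Rerunning the same logical date three times leaves the same final state.
for _ in range(3):
    with logger.track("load_orders", "2018-01-01") as row:
        row["rows_written"] = load_partition(table, "2018-01-01", batch)
```

Because each run replaces the partition for its logical date, a retry or backfill does not duplicate rows, while the logger still records every attempt for later inspection.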