
beam-ad-click-lakehouse

Create a cloud data platform that tracks advertising performance across channels, updating dashboards every 90 seconds to help marketers optimize spending

Python
View on GitHub

Project Overview

What You'll Build

A complete data lakehouse on Google Cloud that ingests advertising data from multiple platforms (Google Ads, Facebook, etc.), processes it in near real time, and provides dashboards showing ROI metrics. Marketing teams can see which ads are performing well within minutes, not days.

Why This Project Matters

  • Business Value: Marketers waste millions on underperforming ads. Real-time visibility helps redirect budget to winning campaigns
  • Modern Architecture: Learn the "lakehouse" pattern combining the best of data lakes and warehouses
  • Cloud-Native Skills: Master Google Cloud Platform's data stack, used by thousands of companies

Tech Stack Explained for Beginners

| Technology | What it is | Why you'll use it |
| --- | --- | --- |
| Apache Beam | Unified programming model for batch and streaming | Write one pipeline that handles both real-time clicks and historical reprocessing |
| Google Pub/Sub | Message queue service (like Kafka, but serverless) | Receives click events from ad platforms without managing servers |
| Delta Lake | Storage layer with ACID transactions | Ensures data consistency: no duplicate clicks or lost conversions |
| BigQuery | Google's serverless data warehouse | Runs SQL queries on billions of rows in seconds |
| dbt (data build tool) | SQL-based transformation framework | Builds clean, tested data models that analysts can trust |
| Airflow | Workflow orchestrator | Schedules and monitors all your data pipelines |
| Monte Carlo | Data observability platform | Alerts you when data is late or looks wrong |
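To make the "one pipeline for both modes" idea in the Beam row concrete, here is a minimal sketch (not the repository's actual pipeline) that reads live clicks from Pub/Sub when run with a streaming flag and replays historical files from GCS otherwise. The project, topic, and bucket names are placeholders.

```python
# Minimal Beam skeleton: one pipeline, two input modes (streaming vs. batch).
# Topic, bucket, and project names below are placeholders, not the repo's config.
import argparse
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--streaming", action="store_true")
    args, beam_args = parser.parse_known_args(argv)

    options = PipelineOptions(beam_args)
    options.view_as(StandardOptions).streaming = args.streaming

    with beam.Pipeline(options=options) as p:
        if args.streaming:
            # Live click events from the ad platforms
            raw = p | "ReadPubSub" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/ad-clicks")  # placeholder
        else:
            # Historical reprocessing from files exported to GCS
            raw = p | "ReadGCS" >> beam.io.ReadFromText(
                "gs://my-bucket/raw/ad-clicks/*.json")  # placeholder

        events = raw | "Parse" >> beam.Map(json.loads)
        # Dedup, enrichment, and Delta Lake writes would be shared by both modes.
        events | "DebugPrint" >> beam.Map(print)


if __name__ == "__main__":
    run()
```

The same pipeline code can run locally for testing or on Dataflow; only the options change.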

Step-by-Step Build Plan

  1. Week 1: Set up GCP account and create sample ad click data
  2. Week 2: Build Pub/Sub ingestion for multiple ad platforms (see the publisher sketch after this list)
  3. Week 3: Implement Beam pipeline for click deduplication and enrichment
  4. Week 4: Design Delta Lake schema and implement incremental updates
  5. Week 5: Create dbt models for ROI calculations and attribution
  6. Week 6: Build Looker Studio dashboards and set up monitoring
  7. Week 7: Implement cost controls and optimization
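For the Week 1-2 steps, a hedged sketch of a synthetic event publisher: it pushes fake click events to a Pub/Sub topic so the pipeline has something to ingest before real ad-platform exports are wired up. The project ID, topic name, and event fields are illustrative assumptions, not the repository's schema.

```python
# Publish synthetic ad click events to Pub/Sub for local testing.
# PROJECT_ID, TOPIC_ID, and the event fields are placeholders.
import json
import random
import time
import uuid

from google.cloud import pubsub_v1

PROJECT_ID = "my-project"   # placeholder
TOPIC_ID = "ad-clicks"      # placeholder
PLATFORMS = ["google_ads", "meta", "linkedin"]

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)


def make_click():
    """Build one synthetic click event with the fields the pipeline expects."""
    return {
        "click_id": str(uuid.uuid4()),
        "platform": random.choice(PLATFORMS),
        "campaign_id": f"camp-{random.randint(1, 50)}",
        "cost_micros": random.randint(10_000, 5_000_000),
        "event_ts": time.time(),
    }


if __name__ == "__main__":
    for _ in range(1000):
        data = json.dumps(make_click()).encode("utf-8")
        future = publisher.publish(topic_path, data)
        future.result()  # block until Pub/Sub accepts the message
```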

Detailed Requirements

Functional Requirements

  • Data Sources:
    • Ingest from 3+ ad platforms (Google, Meta, LinkedIn)
    • Handle click, impression, and conversion events
    • Support both real-time streaming and daily batch loads
  • Transformations:
    • Deduplicate clicks using 24-hour windows (see the windowing sketch after this list)
    • Join clicks with conversions for attribution
    • Calculate metrics: CTR (click-through rate), CPA (cost per acquisition), ROAS (return on ad spend), and LTV (customer lifetime value)
    • Build cohort analysis for customer retention
  • Output:
    • Executive dashboard with hourly KPI updates
    • Campaign performance drill-downs
    • Anomaly alerts for sudden metric changes
    • Cost allocation reports by team/product
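The 24-hour deduplication requirement could look roughly like the Beam snippet below: key each event by click_id, window into fixed 24-hour windows, and keep the first event per key. The field names (click_id, event_ts) are assumptions for illustration.

```python
# Rough sketch of 24-hour click deduplication in Beam.
# Assumes each event is a dict with a unique 'click_id' and 'event_ts' (epoch seconds).
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue


def dedupe_clicks(events):
    """events: PCollection of click dicts; returns one event per click_id per 24h window."""
    return (
        events
        | "AttachEventTime" >> beam.Map(
            lambda e: TimestampedValue(e, e["event_ts"]))
        | "KeyByClickId" >> beam.Map(lambda e: (e["click_id"], e))
        | "Window24h" >> beam.WindowInto(FixedWindows(24 * 60 * 60))
        | "GroupDuplicates" >> beam.GroupByKey()
        | "KeepFirst" >> beam.MapTuple(lambda _click_id, dupes: next(iter(dupes)))
    )
```

Note that with default triggering, grouped results only emit when a window closes; a streaming deployment would add early-firing triggers or a stateful DoFn so deduplicated clicks still meet the 90-second freshness goal.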

Technical Requirements

  • Performance:
    • End-to-end data freshness < 90 seconds
    • Support 1 billion events/day
    • Dashboard queries return in < 5 seconds
  • Cost Efficiency:
    • Total GCP cost < $50/day
    • Automatic data lifecycle: delete raw data after 30 days (see the lifecycle sketch after this list)
    • Use spot instances where possible
  • Data Quality:
    • 99.9% data completeness SLA
    • Automated testing for all transformations
    • Row count validation between stages
    • Schema change detection
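One way to implement the 30-day raw-data lifecycle rule, assuming the raw click events land in an ingestion-time, day-partitioned BigQuery table. The table name is a placeholder, and the Delta Lake files on GCS would need a separate bucket lifecycle rule.

```python
# Hedged sketch: expire raw BigQuery partitions 30 days after their partition date.
# Assumes the table is already day-partitioned; the table name is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.raw_ads.click_events")  # placeholder

table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    expiration_ms=30 * 24 * 60 * 60 * 1000,  # 30 days
)
client.update_table(table, ["time_partitioning"])
```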

Prerequisites & Learning Path

  • Required: Basic SQL and Python
  • Helpful: Familiarity with cloud services (any provider)
  • You'll Learn: Cloud data engineering, streaming systems, DataOps practices

Success Metrics

  • Process 100M test events with <90s end-to-end lag (see the freshness check after this list)
  • All dbt tests pass on every run
  • Dashboard loads in <3 seconds
  • Monthly GCP bill stays under $1,500
  • Zero data quality incidents in production
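A simple way to check the <90s freshness target is to query the serving table for the gap between now and the newest ingestion timestamp. The table and column names below are illustrative, not the repository's schema.

```python
# Measure end-to-end lag by comparing the newest ingest timestamp to "now".
# Table and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(ingest_ts), SECOND) AS lag_seconds
    FROM `my-project.marts.ad_performance`
"""
row = next(iter(client.query(query).result()))
print(f"End-to-end lag: {row.lag_seconds}s (target: under 90s)")
```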

Technologies Used

Beam · Dataflow · Delta Lake · BigQuery · dbt · Airflow · Monte Carlo

Project Topics

#beam #delta-lake #gcp #dbt

Ready to explore the code?

Dive deep into the implementation, check out the documentation, and feel free to contribute!

Open in GitHub →
