
clickhouse-htap-streaming

Create a hybrid system that handles both real-time IoT analytics and historical analysis, processing 50,000 events per second with sub-second query response

Scala

Project Overview

What You'll Build

A high-performance analytics platform that combines real-time stream processing with lightning-fast historical queries. Perfect for IoT scenarios where you need instant insights on sensor data while maintaining years of history for trend analysis.

Why This Project Matters

  • Cutting Edge: HTAP (Hybrid Transactional/Analytical Processing) is the future of data systems
  • Real Scale: Learn to handle true big data volumes (50K events/second = 4.3B events/day)
  • Cost Efficient: One system instead of separate streaming and batch platforms

Tech Stack Explained for Beginners

  • Apache Spark Streaming: a distributed stream processing engine that spreads the workload across multiple machines to handle massive data volumes.
  • Apache Hudi: a data lake storage format that enables updates and deletes on cloud storage, which is normally append-only.
  • ClickHouse: a columnar OLAP database that queries billions of rows in milliseconds.
  • Apache Kafka: a distributed message queue that buffers incoming IoT data reliably.
  • Grafana: a metrics dashboard for visualizing real-time sensor readings.
  • Soda SQL: a data quality tool that validates data consistency between systems.

Step-by-Step Build Plan

  1. Week 1: Set up local ClickHouse and generate IoT sensor data
  2. Week 2: Build the Kafka ingestion pipeline for sensor events (a generator sketch follows this list)
  3. Week 3: Implement Spark Streaming transformations
  4. Week 4: Configure Hudi for incremental updates (see the Hudi sketch after this list)
  5. Week 5: Optimize ClickHouse schema and queries
  6. Week 6: Build real-time dashboards and alerts
  7. Week 7: Performance tuning and chaos testing
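
To make Weeks 1-2 concrete, here is a minimal sketch of a sensor event generator: a Scala producer that pushes randomized JSON readings into Kafka. The broker address, topic name, and JSON field names are illustrative assumptions, not part of the project spec.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import scala.util.Random

object SensorEventGenerator {
  private val sensorTypes =
    Seq("temperature", "pressure", "humidity", "vibration", "voltage")

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.ACKS_CONFIG, "all") // no silent loss on broker failover

    val producer = new KafkaProducer[String, String](props)
    val rnd = new Random()

    while (true) {
      val sensorId   = f"sensor-${rnd.nextInt(1000)}%04d"
      val sensorType = sensorTypes(rnd.nextInt(sensorTypes.length))
      // ~5% of events are back-dated up to an hour, to exercise the
      // out-of-order handling required later in the build.
      val lateMs = if (rnd.nextInt(100) < 5) rnd.nextInt(3600 * 1000) else 0
      val ts     = System.currentTimeMillis() - lateMs
      val value  = 20.0 + rnd.nextGaussian() * 5.0
      val json =
        s"""{"sensor_id":"$sensorId","sensor_type":"$sensorType","ts":$ts,"value":$value}"""
      // Keying by sensor_id keeps each sensor's events ordered within a partition.
      producer.send(new ProducerRecord("sensor-events", sensorId, json))
      // In a real run you would pace this loop to dial in the 50,000 events/second target.
    }
  }
}
```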
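
Week 4 is often the least familiar step, so here is a sketch of what a Hudi upsert from Spark might look like, assuming the hudi-spark bundle is on the classpath; the table name, base path, and the event_id key column are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Upsert a micro-batch into a Hudi table on object storage. "event_id" is a
// hypothetical unique key your events would carry; the base path is a placeholder.
def upsertToHudi(batch: DataFrame): Unit = {
  batch.write
    .format("hudi")
    .option("hoodie.table.name", "sensor_events")
    // The record key identifies a row; the precombine field picks the winner
    // when two records share a key, which is what makes updates incremental.
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.partitionpath.field", "sensor_type")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode(SaveMode.Append)
    .save("s3a://your-bucket/lake/sensor_events")
}
```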

Detailed Requirements

Functional Requirements

  • Data Ingestion:
    • Handle 5 types of IoT sensors (temperature, pressure, etc.)
    • Support out-of-order events (up to 1 hour late)
    • Deduplicate messages using 5-minute windows
    • Enrich with reference data (sensor location, type)
  • Stream Processing:
    • Calculate 1-minute aggregates (avg, min, max); see the streaming sketch after this list
    • Detect anomalies (values more than 3 standard deviations from the sensor's running mean)
    • Generate alerts for threshold breaches
    • Maintain running statistics per sensor
  • Analytics Queries:
    • Real-time dashboard (last 5 minutes)
    • Historical trends (last 30 days)
    • Sensor comparison reports
    • Anomaly investigation tools
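
Here is a sketch of how the ingestion and aggregation rules above could map onto Spark Structured Streaming, reusing the JSON layout assumed in the generator sketch. The watermark implements the 1-hour lateness bound; the 5-minute dedup window would need dropDuplicatesWithinWatermark (Spark 3.5+) or custom state, so plain dropDuplicates stands in here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("sensor-stream").getOrCreate()
import spark.implicits._

// Schema of the generator's JSON payload (an assumption, not a fixed contract).
val schema = new StructType()
  .add("sensor_id", StringType)
  .add("sensor_type", StringType)
  .add("ts", LongType) // epoch millis
  .add("value", DoubleType)

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "sensor-events")
  .load()
  .select(from_json($"value".cast("string"), schema).as("e"))
  .select($"e.*")
  .withColumn("event_time", ($"ts" / 1000).cast("timestamp"))
  // Accept events up to 1 hour late, per the requirement above.
  .withWatermark("event_time", "1 hour")
  // Drop exact duplicates within the watermark's state horizon.
  .dropDuplicates("sensor_id", "event_time")

// 1-minute tumbling-window aggregates per sensor.
val aggregates = events
  .groupBy(window($"event_time", "1 minute"), $"sensor_id")
  .agg(avg($"value").as("avg_v"), min($"value").as("min_v"), max($"value").as("max_v"))

aggregates.writeStream
  .outputMode("append")                      // a window is emitted once the watermark passes it
  .format("console")                         // stand-in for the Hudi/ClickHouse sinks
  .option("checkpointLocation", "/tmp/ckpt") // required for fault-tolerant, exactly-once state
  .start()
  .awaitTermination()
```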

Technical Requirements

  • Performance Targets:
    • Sustain 50,000 events/second ingestion
    • End-to-end latency < 5 seconds
    • Dashboard queries < 1 second (p95)
    • Historical queries < 5 seconds (1 year of data)
  • Storage Efficiency:
    • Achieve a 10x compression ratio
    • Automatic data tiering (hot/warm/cold); see the schema sketch after this list
    • Retain raw data for 1 year
    • Keep aggregates for 3 years
  • Reliability:
    • No data loss during node failures
    • Exactly-once processing semantics
    • Automatic rebalancing
    • Self-healing capabilities
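
For the storage targets above, here is a sketch of a ClickHouse table created over JDBC from Scala. The storage policy and volume names depend on how your cluster's disks are configured, so 'hot_to_cold' and 'cold' are placeholders, and the codecs are one plausible route to the 10x compression goal, not a guarantee.

```scala
import java.sql.DriverManager

// DDL aimed at the tiering/retention targets; the 'hot_to_cold' policy and
// 'cold' volume must already exist in the server's storage configuration.
val ddl =
  """CREATE TABLE IF NOT EXISTS sensor_events (
    |  sensor_id   LowCardinality(String),
    |  sensor_type LowCardinality(String),
    |  ts          DateTime64(3),
    |  value       Float64 CODEC(Gorilla, ZSTD)  -- float-oriented codec aids compression
    |) ENGINE = MergeTree
    |PARTITION BY toYYYYMM(ts)
    |ORDER BY (sensor_id, ts)
    |TTL toDateTime(ts) + INTERVAL 1 MONTH TO VOLUME 'cold',
    |    toDateTime(ts) + INTERVAL 1 YEAR DELETE  -- raw retention: 1 year
    |SETTINGS storage_policy = 'hot_to_cold'""".stripMargin

val conn = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default")
try conn.createStatement().execute(ddl)
finally conn.close()
```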

Prerequisites & Learning Path

  • Required: Java/Scala basics, SQL, understanding of distributed systems
  • Helpful: Experience with any streaming platform
  • You'll Learn: HTAP architecture, stream processing, columnar databases

Success Metrics

  • Process 1 billion events without data loss
  • Maintain p95 query latency < 1 second
  • Achieve 10x data compression
  • Handle node failures with < 30s recovery
  • Pass 48-hour stress test at full load

Technologies Used

ClickHouse, Hudi, Spark, Kafka, Grafana

Project Topics

#clickhouse #hudi #spark

Ready to explore the code?

Dive deep into the implementation, check out the documentation, and feel free to contribute!

Open in GitHub →
