
clickhouse-htap-streaming

Create a hybrid system that handles both real-time IoT analytics and historical analysis, processing 50,000 events per second with sub-second query response

Scala

Project Overview

What You'll Build

A high-performance analytics platform that combines real-time stream processing with lightning-fast historical queries. Perfect for IoT scenarios where you need instant insights on sensor data while maintaining years of history for trend analysis.

Why This Project Matters

  • Cutting Edge: HTAP (Hybrid Transactional/Analytical Processing) is the future of data systems
  • Real Scale: Learn to handle true big data volumes (50K events/second = 4.3B events/day)
  • Cost Efficient: One system instead of separate streaming and batch platforms

Tech Stack Explained for Beginners

  • Apache Spark Streaming: a distributed stream processing engine that spreads the workload across multiple machines to handle massive data volumes.
  • Apache Hudi: a data lake storage format that enables updates and deletes on cloud storage, which is normally append-only.
  • ClickHouse: a columnar OLAP database that queries billions of rows in milliseconds.
  • Apache Kafka: a distributed message queue that buffers incoming IoT data reliably.
  • Grafana: a metrics dashboard for visualizing real-time sensor readings.
  • Soda SQL: a data quality tool that validates data consistency between systems.

Step-by-Step Build Plan

  1. Week 1: Set up local ClickHouse and generate IoT sensor data
  2. Week 2: Build the Kafka ingestion pipeline for sensor events (a generator sketch follows this list)
  3. Week 3: Implement Spark Streaming transformations
  4. Week 4: Configure Hudi for incremental updates (see the Hudi sketch after this list)
  5. Week 5: Optimize ClickHouse schema and queries
  6. Week 6: Build real-time dashboards and alerts
  7. Week 7: Performance tuning and chaos testing
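
To make Weeks 1-2 concrete, here is a minimal sketch of a sensor event generator: a Scala producer that pushes randomized JSON readings into Kafka. The broker address, topic name, and JSON field names are illustrative assumptions, not part of the project spec.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import scala.util.Random

object SensorEventGenerator {
  private val sensorTypes =
    Seq("temperature", "pressure", "humidity", "vibration", "voltage")

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.ACKS_CONFIG, "all") // no silent loss on broker failover

    val producer = new KafkaProducer[String, String](props)
    val rnd = new Random()

    while (true) {
      val sensorId   = f"sensor-${rnd.nextInt(1000)}%04d"
      val sensorType = sensorTypes(rnd.nextInt(sensorTypes.length))
      // ~5% of events are back-dated up to an hour, to exercise the
      // out-of-order handling required later in the build.
      val lateMs = if (rnd.nextInt(100) < 5) rnd.nextInt(3600 * 1000) else 0
      val ts     = System.currentTimeMillis() - lateMs
      val value  = 20.0 + rnd.nextGaussian() * 5.0
      val json =
        s"""{"sensor_id":"$sensorId","sensor_type":"$sensorType","ts":$ts,"value":$value}"""
      // Keying by sensor_id keeps each sensor's events ordered within a partition.
      producer.send(new ProducerRecord("sensor-events", sensorId, json))
      // In a real run you would pace this loop to dial in the 50,000 events/second target.
    }
  }
}
```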
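
Week 4 is often the least familiar step, so here is a sketch of what a Hudi upsert from Spark might look like, assuming the hudi-spark bundle is on the classpath; the table name, base path, and the event_id key column are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Upsert a micro-batch into a Hudi table on object storage. "event_id" is a
// hypothetical unique key your events would carry; the base path is a placeholder.
def upsertToHudi(batch: DataFrame): Unit = {
  batch.write
    .format("hudi")
    .option("hoodie.table.name", "sensor_events")
    // The record key identifies a row; the precombine field picks the winner
    // when two records share a key, which is what makes updates incremental.
    .option("hoodie.datasource.write.recordkey.field", "event_id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.partitionpath.field", "sensor_type")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode(SaveMode.Append)
    .save("s3a://your-bucket/lake/sensor_events")
}
```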

Detailed Requirements

Functional Requirements

  • Data Ingestion:
    • Handle 5 types of IoT sensors (temperature, pressure, etc.)
    • Support out-of-order events (up to 1 hour late)
    • Deduplicate messages using 5-minute windows
    • Enrich with reference data (sensor location, type)
  • Stream Processing:
    • Calculate 1-minute aggregates (avg, min, max); see the streaming sketch after this list
    • Detect anomalies (values more than 3 standard deviations from the sensor's running mean)
    • Generate alerts for threshold breaches
    • Maintain running statistics per sensor
  • Analytics Queries:
    • Real-time dashboard (last 5 minutes)
    • Historical trends (last 30 days)
    • Sensor comparison reports
    • Anomaly investigation tools
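
Here is a sketch of how the ingestion and aggregation rules above could map onto Spark Structured Streaming, reusing the JSON layout assumed in the generator sketch. The watermark implements the 1-hour lateness bound; the 5-minute dedup window would need dropDuplicatesWithinWatermark (Spark 3.5+) or custom state, so plain dropDuplicates stands in here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("sensor-stream").getOrCreate()
import spark.implicits._

// Schema of the generator's JSON payload (an assumption, not a fixed contract).
val schema = new StructType()
  .add("sensor_id", StringType)
  .add("sensor_type", StringType)
  .add("ts", LongType) // epoch millis
  .add("value", DoubleType)

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "sensor-events")
  .load()
  .select(from_json($"value".cast("string"), schema).as("e"))
  .select($"e.*")
  .withColumn("event_time", ($"ts" / 1000).cast("timestamp"))
  // Accept events up to 1 hour late, per the requirement above.
  .withWatermark("event_time", "1 hour")
  // Drop exact duplicates within the watermark's state horizon.
  .dropDuplicates("sensor_id", "event_time")

// 1-minute tumbling-window aggregates per sensor.
val aggregates = events
  .groupBy(window($"event_time", "1 minute"), $"sensor_id")
  .agg(avg($"value").as("avg_v"), min($"value").as("min_v"), max($"value").as("max_v"))

aggregates.writeStream
  .outputMode("append")                      // a window is emitted once the watermark passes it
  .format("console")                         // stand-in for the Hudi/ClickHouse sinks
  .option("checkpointLocation", "/tmp/ckpt") // required for fault-tolerant, exactly-once state
  .start()
  .awaitTermination()
```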

Technical Requirements

  • Performance Targets:
    • Sustain 50,000 events/second ingestion
    • End-to-end latency < 5 seconds
    • Dashboard queries < 1 second (p95)
    • Historical queries < 5 seconds (1 year of data)
  • Storage Efficiency:
    • Achieve a 10x compression ratio
    • Automatic data tiering (hot/warm/cold); see the schema sketch after this list
    • Retain raw data for 1 year
    • Keep aggregates for 3 years
  • Reliability:
    • No data loss during node failures
    • Exactly-once processing semantics
    • Automatic rebalancing
    • Self-healing capabilities
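
For the storage targets above, here is a sketch of a ClickHouse table created over JDBC from Scala. The storage policy and volume names depend on how your cluster's disks are configured, so 'hot_to_cold' and 'cold' are placeholders, and the codecs are one plausible route to the 10x compression goal, not a guarantee.

```scala
import java.sql.DriverManager

// DDL aimed at the tiering/retention targets; the 'hot_to_cold' policy and
// 'cold' volume must already exist in the server's storage configuration.
val ddl =
  """CREATE TABLE IF NOT EXISTS sensor_events (
    |  sensor_id   LowCardinality(String),
    |  sensor_type LowCardinality(String),
    |  ts          DateTime64(3),
    |  value       Float64 CODEC(Gorilla, ZSTD)  -- float-oriented codec aids compression
    |) ENGINE = MergeTree
    |PARTITION BY toYYYYMM(ts)
    |ORDER BY (sensor_id, ts)
    |TTL toDateTime(ts) + INTERVAL 1 MONTH TO VOLUME 'cold',
    |    toDateTime(ts) + INTERVAL 1 YEAR DELETE  -- raw retention: 1 year
    |SETTINGS storage_policy = 'hot_to_cold'""".stripMargin

val conn = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default")
try conn.createStatement().execute(ddl)
finally conn.close()
```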

Prerequisites & Learning Path

  • Required: Java/Scala basics, SQL, understanding of distributed systems
  • Helpful: Experience with any streaming platform
  • You'll Learn: HTAP architecture, stream processing, columnar databases

Success Metrics

  • Process 1 billion events without data loss
  • Maintain p95 query latency < 1 second
  • Achieve 10x data compression
  • Handle node failures with < 30s recovery
  • Pass 48-hour stress test at full load

Technologies Used

ClickHouse, Hudi, Spark, Kafka, Grafana

Project Topics

#clickhouse #hudi #spark

Ready to explore the code?

Dive deep into the implementation, check out the documentation, and feel free to contribute!

Open in GitHub →
