clickhouse-htap-streaming
Create a hybrid system that handles both real-time IoT analytics and historical analysis, processing 50,000 events per second with sub-second query response
Scala
ClickHouse
Hudi
Spark
Kafka
Grafana
Project Overview
What You'll Build
A high-performance analytics platform that combines real-time stream processing with lightning-fast historical queries. Perfect for IoT scenarios where you need instant insights on sensor data while maintaining years of history for trend analysis.
Why This Project Matters
- Cutting Edge: HTAP (Hybrid Transactional/Analytical Processing) is the future of data systems
- Real Scale: Learn to handle true big data volumes (50K events/second = 4.3B events/day)
- Cost Efficient: One system instead of separate streaming and batch platforms
Tech Stack Explained for Beginners
| Technology | What it is | Why you'll use it |
| --- | --- | --- |
| Apache Spark Streaming | Distributed stream processing engine | Handles massive data volumes across multiple machines |
| Apache Hudi | Data lake storage format | Enables updates/deletes on cloud storage (usually append-only) |
| ClickHouse | Columnar OLAP database | Queries billions of rows in milliseconds |
| Apache Kafka | Distributed message queue | Buffers incoming IoT data reliably |
| Grafana | Metrics dashboard | Visualizes real-time sensor readings |
| Soda SQL | Data quality tool | Validates data consistency between systems |
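To keep the sketches further down concrete, they all share one event shape. This is a working assumption rather than a fixed schema; every field name is a placeholder you can rename.

```scala
// Minimal event model assumed by the later sketches; all field names
// (eventId, sensorId, sensorType, ts, value) are placeholders, not a spec.
case class SensorReading(
  eventId: String,    // unique per message; used for deduplication
  sensorId: String,   // which device produced the reading
  sensorType: String, // e.g. "temperature" or "pressure"
  ts: Long,           // event time, epoch milliseconds
  value: Double       // the measurement itself
)
```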
Step-by-Step Build Plan
- Week 1: Set up local ClickHouse and generate IoT sensor data (see the generator sketch after this plan)
- Week 2: Build Kafka ingestion pipeline for sensor events
- Week 3: Implement Spark Streaming transformations (see the pipeline sketch after this plan)
- Week 4: Configure Hudi for incremental updates
- Week 5: Optimize ClickHouse schema and queries
- Week 6: Build real-time dashboards and alerts
- Week 7: Performance tuning and chaos testing
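A minimal sketch of the Week 1–2 data generator, assuming a local broker at `localhost:9092`, a topic named `sensor-events`, and five made-up sensor types; all of those are placeholders.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.util.Random

object SensorDataGenerator {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // placeholder broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("acks", "all") // wait for all replicas so a broker failover loses nothing

    val sensorTypes = Seq("temperature", "pressure", "humidity", "vibration", "flow")
    val rnd = new Random()
    val producer = new KafkaProducer[String, String](props)

    while (true) {
      val sensorId = s"sensor-${rnd.nextInt(1000)}"
      val json =
        s"""{"eventId":"${java.util.UUID.randomUUID()}","sensorId":"$sensorId",""" +
          s""""sensorType":"${sensorTypes(rnd.nextInt(sensorTypes.size))}",""" +
          s""""ts":${System.currentTimeMillis()},"value":${20 + 5 * rnd.nextGaussian()}}"""
      // Keying by sensorId keeps each sensor's events ordered within one partition.
      producer.send(new ProducerRecord[String, String]("sensor-events", sensorId, json))
    }
  }
}
```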
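For Weeks 3–4, here is one possible shape of the ingest job, using Spark Structured Streaming (the current streaming API; the legacy DStream-based Spark Streaming also works but is in maintenance mode) to parse the Kafka topic and write into a Hudi table. The table name, paths, and partitioning column are assumptions, not requirements of either library.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("htap-ingest").getOrCreate()
import spark.implicits._

// JSON schema matching the SensorReading sketch above.
val schema = new StructType()
  .add("eventId", StringType)
  .add("sensorId", StringType)
  .add("sensorType", StringType)
  .add("ts", LongType)
  .add("value", DoubleType)

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "sensor-events")
  .load()
  .select(from_json($"value".cast("string"), schema).as("e"))
  .select($"e.*")
  .withColumn("eventTime", ($"ts" / 1000).cast("timestamp")) // epoch millis -> timestamp

events.writeStream
  .format("hudi")
  .option("hoodie.table.name", "sensor_events")
  .option("hoodie.datasource.write.recordkey.field", "eventId")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "sensorType")
  .option("checkpointLocation", "/tmp/checkpoints/sensor_events") // placeholder path
  .outputMode("append")
  .start("/data/hudi/sensor_events")                             // placeholder path
```

Keying Hudi records by `eventId` with `ts` as the precombine field makes retried micro-batches idempotent, which is what the exactly-once requirement below leans on.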
Detailed Requirements
Functional Requirements
- Data Ingestion:
  - Handle 5 types of IoT sensors (temperature, pressure, etc.)
  - Support out-of-order events (up to 1 hour late)
  - Deduplicate messages using 5-minute windows
  - Enrich with reference data (sensor location, type)
- Stream Processing (see the aggregation sketch after this list):
  - Calculate 1-minute aggregates (avg, min, max)
  - Detect anomalies (values outside 3 sigma)
  - Generate alerts for threshold breaches
  - Maintain running statistics per sensor
- Analytics Queries (see the query sketch after this list):
  - Real-time dashboard (last 5 minutes)
  - Historical trends (last 30 days)
  - Sensor comparison reports
  - Anomaly investigation tools
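One way to cover the ingestion and stream-processing requirements in Structured Streaming, continuing the sketch above; `events` is the parsed stream from the ingest job. The 3-sigma check here is deliberately simplified: it flags a window whose max strays more than 3 sigma from that window's own mean, while the "running statistics per sensor" item would need a stateful operator such as `flatMapGroupsWithState`.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// `events` is the parsed streaming DataFrame from the ingest sketch, with
// columns (eventId, sensorId, sensorType, eventTime, value).
def minuteAggregates(events: DataFrame): DataFrame = {
  val deduped = events
    .withWatermark("eventTime", "1 hour")   // tolerate events up to 1 hour late
    .dropDuplicates("eventId", "eventTime") // watermark bounds the dedup state,
                                            // comfortably covering the 5-minute window

  deduped
    .groupBy(window(col("eventTime"), "1 minute"), col("sensorId"))
    .agg(
      avg("value").as("avgValue"),
      min("value").as("minValue"),
      max("value").as("maxValue"),
      stddev("value").as("stddevValue"))
    // Simplified 3-sigma flag within each window; a production version would
    // compare against longer-running per-sensor statistics instead.
    .withColumn("isAnomalous",
      col("maxValue") > col("avgValue") + lit(3) * col("stddevValue"))
}
```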
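On the query side, two illustrative ClickHouse queries over JDBC for the dashboard and trend reports. The `sensor_readings` table matches the schema sketch under Technical Requirements below, and the sensor id is a made-up example.

```scala
import java.sql.DriverManager

// Assumes the ClickHouse JDBC driver (com.clickhouse:clickhouse-jdbc) is on
// the classpath and a local server is listening on the default HTTP port.
val conn = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default")
val stmt = conn.createStatement()

// Real-time dashboard: per-sensor stats over the last 5 minutes.
val recent = stmt.executeQuery(
  """SELECT sensor_id, avg(value) AS avg_value, max(value) AS max_value
    |FROM sensor_readings
    |WHERE ts >= now() - INTERVAL 5 MINUTE
    |GROUP BY sensor_id""".stripMargin)

// Historical trend: daily averages for one sensor over the last 30 days.
val trend = stmt.executeQuery(
  """SELECT toDate(ts) AS day, avg(value) AS avg_value
    |FROM sensor_readings
    |WHERE sensor_id = 'sensor-42' AND ts >= now() - INTERVAL 30 DAY
    |GROUP BY day
    |ORDER BY day""".stripMargin)
```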
Technical Requirements
- Performance Targets:
  - Sustain 50,000 events/second ingestion
  - End-to-end latency < 5 seconds
  - Dashboard queries < 1 second (p95)
  - Historical queries < 5 seconds (over 1 year of data)
- Storage Efficiency (see the schema sketch after this list):
  - Achieve a 10x compression ratio
  - Automatic data tiering (hot/warm/cold)
  - Retain raw data for 1 year
  - Keep aggregates for 3 years
- Reliability:
  - No data loss during node failures
  - Exactly-once processing semantics
  - Automatic rebalancing
  - Self-healing capabilities
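A possible ClickHouse schema for the raw readings, showing how the storage targets map onto MergeTree features: per-column codecs for the compression target, and TTL clauses for tiering and the 1-year retention. The `tiered` storage policy and volume name are assumptions that need matching disk configuration on the server.

```scala
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default")
conn.createStatement().execute(
  """CREATE TABLE IF NOT EXISTS sensor_readings
    |(
    |  sensor_id   LowCardinality(String),
    |  sensor_type LowCardinality(String),
    |  ts          DateTime CODEC(DoubleDelta, ZSTD), -- timestamps compress well delta-encoded
    |  value       Float64  CODEC(Gorilla, ZSTD)      -- Gorilla suits slowly-changing gauges
    |)
    |ENGINE = MergeTree
    |PARTITION BY toYYYYMMDD(ts)
    |ORDER BY (sensor_id, ts)                   -- matches the dashboard access pattern
    |TTL ts + INTERVAL 30 DAY TO VOLUME 'warm', -- hot -> warm after 30 days
    |    ts + INTERVAL 1 YEAR DELETE            -- raw retention: 1 year
    |SETTINGS storage_policy = 'tiered'         -- assumed server-side policy
    |""".stripMargin)
```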
Prerequisites & Learning Path
- Required: Java/Scala basics, SQL, understanding of distributed systems
- Helpful: Experience with any streaming platform
- You'll Learn: HTAP architecture, stream processing, columnar databases
Success Metrics
- Process 1 billion events without data loss
- Maintain p95 query latency < 1 second
- Achieve 10x data compression
- Handle node failures with < 30s recovery
- Pass 48-hour stress test at full load
Project Topics
#clickhouse #hudi #spark
Ready to explore the code?
Dive into the implementation, check out the documentation, and feel free to contribute or use it in your own projects.
Open in GitHub →