duckdb-polars-cost-explorer
Build a lightning-fast cost-analysis tool that runs on your laptop, analyzing millions of financial records in milliseconds with no cloud bill
Python
DuckDB
Polars
Streamlit
Prefect
Great Expectations
Project Overview
What You'll Build
A desktop application that analyzes cloud spending data (AWS/GCP/Azure) with sub-second query performance. Finance teams can slice and dice costs by team, service, or tag without waiting for slow warehouse queries or paying for compute.
Why This Project Matters
- Immediate Value: Industry surveys regularly put average cloud-spend waste around 30%. Your tool helps find that waste
- Zero Infrastructure: Runs entirely on a laptop - no cloud bills or DevOps complexity
- Modern Tools: Learn DuckDB and Polars, two of the fastest-growing tools in data analytics
Tech Stack Explained for Beginners
| Technology | What it is | Why you'll use it |
|---|---|---|
| DuckDB | "SQLite for analytics": an in-process OLAP database | Execute complex SQL on millions of rows without a server |
| Polars | Rust-powered DataFrame library, typically much faster than pandas | Load and transform data using all CPU cores efficiently |
| Streamlit | Python framework for data apps | Build an interactive web UI in pure Python, no JavaScript needed |
| Great Expectations | Data quality framework | Ensure cost data is complete and accurate before analysis |
| Prefect | Modern workflow orchestration | Schedule daily cost data updates and quality checks |
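To see how the two core pieces snap together, here is a minimal sketch of the DuckDB-Polars interop this stack is built around: DuckDB can query a Polars DataFrame that is in scope (via Arrow replacement scans) and hand the result back as Polars. All data and column names here are illustrative, not a real billing schema.

```python
import duckdb
import polars as pl

# Illustrative cost data; a real export would have many more columns.
costs = pl.DataFrame({
    "service": ["EC2", "S3", "EC2", "RDS"],
    "team":    ["core", "data", "core", "data"],
    "cost":    [120.0, 15.5, 98.2, 61.0],
})

# DuckDB's replacement scan finds `costs` in the local Python scope.
by_service = duckdb.sql("""
    SELECT service, SUM(cost) AS total_cost
    FROM costs
    GROUP BY service
    ORDER BY total_cost DESC
""").pl()  # materialize the result back as a Polars DataFrame

print(by_service)
```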
Step-by-Step Build Plan
- Week 1: Set up environment and generate sample cost data (a generator sketch follows this list)
- Week 2: Build data ingestion pipeline for cloud cost exports
- Week 3: Implement DuckDB schema and optimization
- Week 4: Create Polars transformations for cost allocation
- Week 5: Build Streamlit dashboard with interactive filters
- Week 6: Add data quality checks and scheduling
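For Week 1, here is a sketch of a synthetic data generator to develop against before wiring up real AWS/GCP exports. The schema below is a simplified assumption, not the actual CUR format:

```python
import datetime as dt
import random

import polars as pl

random.seed(42)
N = 1_000_000  # enough rows to make performance work meaningful

services = ["EC2", "S3", "RDS", "Lambda", "CloudFront"]
teams = ["core", "data", "ml", "web"]
start = dt.date(2024, 1, 1)

# Illustrative columns: a real CUR export has dozens more.
df = pl.DataFrame({
    "usage_date": [start + dt.timedelta(days=random.randrange(365)) for _ in range(N)],
    "service":    [random.choice(services) for _ in range(N)],
    "team_tag":   [random.choice(teams) for _ in range(N)],
    "cost":       [round(random.uniform(0.01, 50.0), 4) for _ in range(N)],
})

# Parquet keeps load times and disk usage low for the later DuckDB steps.
df.write_parquet("sample_costs.parquet")
```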
Detailed Requirements
Functional Requirements
- Data Processing:
  - Ingest AWS Cost and Usage Reports (CUR format)
  - Support GCP BigQuery billing export
  - Handle 10M+ cost line items
  - Parse complex tag structures
- Analysis Features:
  - Cost breakdown by service, account, or tag
  - Month-over-month variance analysis
  - Anomaly detection, flagging >20% month-over-month increases (see the SQL sketch after this list)
  - Forecast next month's spend
  - Chargeback reports by team
- User Interface:
  - Interactive filters for date range, service, and tags
  - Drill-down from summary to line items
  - Export reports to Excel/CSV
  - Shareable dashboard links
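As a taste of the analysis layer, here is a sketch of the month-over-month variance query with the >20% anomaly flag, run against the synthetic Parquet file from the Week 1 sketch; the schema and threshold are illustrative:

```python
import duckdb
import polars as pl

# Flag services whose spend jumped more than 20% month over month.
anomalies = duckdb.sql("""
    WITH monthly AS (
        SELECT date_trunc('month', usage_date) AS month,
               service,
               SUM(cost) AS total_cost
        FROM read_parquet('sample_costs.parquet')
        GROUP BY 1, 2
    ),
    with_prev AS (
        SELECT *,
               LAG(total_cost) OVER (PARTITION BY service ORDER BY month) AS prev_cost
        FROM monthly
    )
    SELECT month,
           service,
           total_cost,
           total_cost / NULLIF(prev_cost, 0) - 1 AS mom_change,
           total_cost > 1.20 * prev_cost AS anomaly
    FROM with_prev
    ORDER BY service, month
""").pl()

# Show only the flagged months (first month per service has no baseline).
print(anomalies.filter(pl.col("anomaly")))
```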
Technical Requirements
- Performance:
  - Load 10M rows in < 10 seconds
  - Interactive queries return in < 300ms
  - Use < 8GB RAM on a laptop
  - Support incremental data updates
- Data Quality:
  - Validate cost data completeness (no missing billing periods)
  - Check for duplicate charges
  - Ensure tag coverage > 80%
  - Alert on data anomalies (a scheduled-check sketch follows this list)
- Deployment:
  - Single executable or Docker container
  - Auto-update capability
  - Work offline with cached data
  - Secure credential storage
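One possible shape for the quality-and-scheduling wiring, using Prefect 2.x-style @flow/@task decorators. The hand-rolled Polars checks below stand in for fuller Great Expectations suites, and all file and column names are the illustrative ones used in earlier sketches:

```python
import polars as pl
from prefect import flow, task

@task
def check_duplicates(path: str) -> int:
    # Count fully duplicated line items (stand-in for a GE expectation).
    df = pl.read_parquet(path)
    return int(df.is_duplicated().sum())

@task
def check_tag_coverage(path: str) -> float:
    # Share of rows carrying a team tag; the requirement is > 80%.
    df = pl.read_parquet(path)
    return float(df["team_tag"].is_not_null().mean())

@flow
def daily_quality_checks(path: str = "sample_costs.parquet") -> None:
    dupes = check_duplicates(path)
    coverage = check_tag_coverage(path)
    if dupes > 0 or coverage < 0.80:
        # A failed flow run doubles as the alert signal in Prefect's UI.
        raise ValueError(f"quality gate failed: dupes={dupes}, tag coverage={coverage:.1%}")

if __name__ == "__main__":
    # Registers a deployment that Prefect runs every day at 06:00.
    daily_quality_checks.serve(name="daily-cost-checks", cron="0 6 * * *")
```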
Prerequisites & Learning Path
- Required: Python basics and SQL fundamentals
- Helpful: Understanding of cloud billing concepts
- You'll Learn: Modern OLAP, efficient data processing, data quality engineering
Success Metrics
- Query 10M cost records in <300ms on a laptop
- 95% of data quality checks pass
- Identify at least 3 cost-saving opportunities in test data
- Dashboard stays responsive with 50+ filter values applied at once (see the dashboard sketch below)
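A minimal sketch of the dashboard layer referenced above: sidebar filters feeding a parameterized DuckDB query plus a CSV export, assuming the same illustrative files and columns as the earlier sketches. Save as app.py and run with `streamlit run app.py`:

```python
import duckdb
import streamlit as st

st.title("Cloud Cost Explorer")

con = duckdb.connect()  # in-memory; the Parquet file is the store of record

# Populate the filter from the data itself.
services = [r[0] for r in con.sql(
    "SELECT DISTINCT service FROM read_parquet('sample_costs.parquet') ORDER BY 1"
).fetchall()]
selected = st.sidebar.multiselect("Services", services, default=services)

# Bind the selected services as a prepared-statement LIST parameter.
df = con.execute("""
    SELECT service, team_tag, SUM(cost) AS total_cost
    FROM read_parquet('sample_costs.parquet')
    WHERE list_contains(?, service)
    GROUP BY 1, 2
    ORDER BY total_cost DESC
""", [selected]).pl()

st.dataframe(df)
st.download_button("Export CSV", df.write_csv(), file_name="costs.csv", mime="text/csv")
```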
Project Topics
#duckdb #polars #analytics
Ready to explore the code?
Dive deep into the implementation, check out the documentation, and feel free to contribute!
Open in GitHub →