duckdb-polars-cost-explorer

Build a lightning-fast cost analysis tool that runs on your laptop, analyzing millions of financial records in milliseconds without cloud expenses

Project Overview

What You'll Build

A desktop application that analyzes cloud spending data (AWS/GCP/Azure) with sub-second query performance. Finance teams can slice and dice costs by team, service, or tag without waiting for slow warehouse queries or paying for compute.

Why This Project Matters

  • Immediate Value: Industry surveys consistently estimate that roughly 30% of cloud spend is wasted. Your tool helps find that waste
  • Zero Infrastructure: Runs entirely on a laptop - no cloud bills or DevOps complexity
  • Modern Tools: Learn DuckDB and Polars, two of the fastest-growing tools in data analytics

Tech Stack Explained for Beginners

| Technology | What it is | Why you'll use it |
| --- | --- | --- |
| DuckDB | "SQLite for analytics": an in-process analytical SQL engine | Execute complex SQL on millions of rows without a server |
| Polars | Rust-powered DataFrame library, often benchmarked at several times pandas' speed | Load and transform data using all CPU cores efficiently |
| Streamlit | Python framework for data apps | Build an interactive web UI in pure Python, no JavaScript needed |
| Great Expectations | Data quality framework | Ensure cost data is complete and accurate before analysis |
| Prefect | Modern workflow orchestration | Schedule daily cost data updates and quality checks |
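
To make the DuckDB + Polars pairing concrete, here is a minimal sketch of DuckDB running SQL directly against a Polars DataFrame in the same process. The `service` and `cost` columns are invented for illustration, and DuckDB's Polars integration needs `pyarrow` installed.

```python
# Minimal sketch: DuckDB queries a Polars DataFrame in-process by variable
# name via its replacement scan. Column names are illustrative only.
import duckdb
import polars as pl

costs = pl.DataFrame({
    "service": ["EC2", "S3", "EC2", "RDS"],
    "cost": [120.50, 3.20, 98.10, 45.00],
})

per_service = duckdb.sql(
    "SELECT service, SUM(cost) AS total FROM costs "
    "GROUP BY service ORDER BY total DESC"
).pl()  # .pl() hands the result back as a Polars DataFrame
print(per_service)
```

No server and no copy into a warehouse: the query runs against the in-memory frame.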

Step-by-Step Build Plan

  1. Week 1: Set up the environment and generate sample cost data (see the sketch after this list)
  2. Week 2: Build data ingestion pipeline for cloud cost exports
  3. Week 3: Implement DuckDB schema and optimization
  4. Week 4: Create Polars transformations for cost allocation
  5. Week 5: Build Streamlit dashboard with interactive filters
  6. Week 6: Add data quality checks and scheduling

Detailed Requirements

Functional Requirements

  • Data Processing:
    • Ingest AWS Cost and Usage Reports (CUR format)
    • Support GCP BigQuery billing export
    • Handle 10M+ cost line items
    • Parse complex tag structures (see the ingestion sketch below)
  • Analysis Features:
    • Cost breakdown by service, account, tag
    • Month-over-month variance analysis
    • Anomaly detection for >20% month-over-month increases (see the variance query below)
    • Forecast next month's spend
    • Chargeback reports by team
  • User Interface:
    • Interactive filters for date range, service, and tags (see the Streamlit sketch below)
    • Drill-down from summary to line items
    • Export reports to Excel/CSV
    • Shareable dashboard links

Technical Requirements

  • Performance:
    • Load 10M rows in < 10 seconds
    • Interactive queries return in < 300ms
    • Use < 8GB RAM on laptop
    • Support incremental data updates
  • Data Quality (see the checks sketch after this section):
    • Validate cost data completeness (no gaps)
    • Check for duplicate charges
    • Ensure tag coverage > 80%
    • Alert on data anomalies
  • Deployment:
    • Single executable or Docker container
    • Auto-update capability
    • Work offline with cached data
    • Secure credential storage
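
To make the quality rules concrete, here is a stripped-down sketch of the three checks in plain Polars; in the real build they would live in a Great Expectations suite. Column names reuse the synthetic schema from the Week 1 sketch.

```python
# Simplified quality checks (a stand-in for a Great Expectations suite);
# schema assumptions carried over from the Week 1 sketch.
import polars as pl

df = pl.read_parquet("sample_costs.parquet")

# 1. Completeness: every day in the covered range should have line items.
days_covered = df["usage_date"].n_unique()
days_expected = (df["usage_date"].max() - df["usage_date"].min()).days + 1
assert days_covered == days_expected, f"{days_expected - days_covered} day(s) missing"

# 2. Duplicate charges: fully identical rows are suspect.
dupes = df.filter(df.is_duplicated())
assert dupes.is_empty(), f"{dupes.height} possible duplicate charges"

# 3. Tag coverage: at least 80% of spend must carry a team tag.
tagged = df.filter(pl.col("team_tag") != "")["cost"].sum() / df["cost"].sum()
assert tagged > 0.80, f"tag coverage only {tagged:.0%}"
```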

Prerequisites & Learning Path

  • Required: Python basics and SQL fundamentals
  • Helpful: Understanding of cloud billing concepts
  • You'll Learn: Modern OLAP, efficient data processing, data quality engineering

Success Metrics

  • Query 10M cost records in <300ms on a laptop (see the benchmark sketch below)
  • 95% of data quality checks pass
  • Identify at least 3 cost-saving opportunities in test data
  • Dashboard stays responsive with 50+ filter combinations applied at once
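
A simple way to keep yourself honest on the 300ms target is to time a representative aggregation end to end. The file and schema below are assumptions carried over from the earlier sketches.

```python
# Micro-benchmark sketch for the <300ms query target.
import time
import duckdb

con = duckdb.connect()
con.sql("CREATE TABLE costs AS SELECT * FROM 'sample_costs.parquet'")

t0 = time.perf_counter()
con.sql("SELECT service, SUM(cost) FROM costs GROUP BY service").fetchall()
elapsed_ms = (time.perf_counter() - t0) * 1000
print(f"group-by over full table: {elapsed_ms:.1f} ms")
```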

Technologies Used

DuckDB · Polars · Streamlit · Prefect · Great Expectations

Project Topics

#duckdb #polars #analytics

Ready to explore the code?

Dive deep into the implementation, check out the documentation, and feel free to contribute!

Open in GitHub →
