Project Overview
Built and delivered a cloud resource utilization reporting system to give business units visibility into AWS resource consumption and to identify cost-optimization opportunities. The system continuously collects, validates, and stores utilization metrics (CPU %, memory, disk I/O, etc.) across EC2, RDS, and other AWS resources, automating what used to be a manual cost-tracking process.
- Environment: AWS (EC2 / RDS / MSK), Prometheus + YACE exporter, PolarDB, Python
- Role: End-to-end owner covering architecture, data pipeline, validation, alerting, and documentation
Background
Problem
Previously, business teams lacked a central view of resource usage across accounts. Data came from disparate dashboards and manual CloudWatch queries, which incur per-API-call charges and were therefore expensive to run frequently. This made cost optimization reactive and error-prone.
Pain Points
- Fragmented data sources (CloudWatch, OpsDB, Prometheus)
- High manual effort for monthly cost analysis
- No alerting or data validation pipeline
Goal
Build a system that:
- Gives business units clear, unified visibility into AWS resource utilization
- Helps teams identify and reduce underutilized resources, driving cost optimization
- Automates manual reporting workflows to improve efficiency and reliability
- Is itself robust, low-cost, and reliable, minimizing operational overhead
Core Architecture
System Architecture
+-----------------------------+
|        AWS Services         |   (EC2, RDS, EBS, etc.)
+--------------+--------------+
               |
  [YACE Exporter / CloudWatch API]
               |
+--------------v--------------+
|       Data Collection       |   (Prometheus scrape, OpsDB)
+--------------+--------------+
               |
+--------------v--------------+
| Transformation + Validation |
+--------------+--------------+
               |
+--------------v--------------+
|        Storage Layer        |   (PolarDB + File System)
+--------------+--------------+
               |
+--------------v--------------+
|    Monitoring + Alerting    |
+-----------------------------+
- Collector – multi-source metric collection
- Processor – data transformation & NaN handling
- Storage – dual storage (file + DB) with idempotent writes
- Monitor – system health and data quality alerts
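The four stages map naturally onto small modules. Below is a minimal skeleton of how they might fit together; the class and method names are illustrative, not the project's actual module names.

```python
# Illustrative skeleton of the four stages; class and method names are hypothetical.
from dataclasses import dataclass


@dataclass
class MetricRow:
    """Unified schema row produced by the Processor."""
    resource_id: str
    ts: int            # epoch seconds
    cpu_pct: float
    mem_pct: float
    disk_io: float
    tags: dict


class Collector:
    """Pulls raw samples from Prometheus (YACE-exported CloudWatch metrics) and OpsDB."""
    def collect(self) -> list[dict]:
        raise NotImplementedError


class Processor:
    """Normalizes raw samples into MetricRow, handling NaNs and timestamp alignment."""
    def process(self, raw: list[dict]) -> list[MetricRow]:
        raise NotImplementedError


class Storage:
    """Dual-writes rows to an append-only file archive and PolarDB, idempotently."""
    def write(self, rows: list[MetricRow]) -> None:
        raise NotImplementedError


class Monitor:
    """Emits health and data-quality alerts for each run."""
    def report(self, ok: bool, detail: str) -> None:
        raise NotImplementedError


def run_pipeline(collector: Collector, processor: Processor,
                 storage: Storage, monitor: Monitor) -> None:
    """One scheduled run: collect, process, store, and report the outcome."""
    try:
        rows = processor.process(collector.collect())
        storage.write(rows)
        monitor.report(True, f"stored {len(rows)} rows")
    except Exception as exc:   # any stage failure becomes an alert
        monitor.report(False, str(exc))
        raise
```

Keeping the stages behind narrow interfaces like these is what makes it cheap to add new AWS services or swap storage backends later.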
Data Flow
1. Collect
   - CloudWatch → YACE scrapes once per metric namespace (cost-efficient) and exposes Prometheus metrics; Prometheus scrapes YACE at fixed intervals.
   - OpsDB / other sources are polled for metadata or missing dimensions (owner, BU, tags).
2. Transform & Validate
   - ETL job (Python) loads from Prometheus and normalizes to a unified schema (resource_id, ts, cpu%, mem%, disk_io, tags).
   - Runs completeness checks, range checks, timestamp alignment, and mapping accuracy; bad rows are quarantined for retry/backfill (see the first sketch below).
3. Store
   - Dual-write: append-only files (cheap raw archive) + PolarDB (query/BI). Writes are idempotent to tolerate retries (see the second sketch below).
4. Serve & Monitor
   - BI/SQL consumers read from PolarDB; scheduled reports power cost-optimization and capacity planning.
   - Health alerts on collection failures, data gaps, and pipeline errors ("monitor the monitoring").
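Steps 1 and 2 can be illustrated with a simplified Python sketch. It assumes the Prometheus HTTP API is reachable at an internal endpoint and uses YACE-style metric and label names (aws_ec2_cpuutilization_average, dimension_InstanceId, tag_*); the exact names, thresholds, and schema fields are placeholders, not the production values.

```python
import math

import requests

PROM_URL = "http://prometheus:9090"   # assumed internal endpoint


def fetch_cpu_utilization(window: str = "5m") -> list[dict]:
    """Instant query against the Prometheus HTTP API for a YACE-exported metric."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": f"avg_over_time(aws_ec2_cpuutilization_average[{window}])"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]


def normalize(samples: list[dict]) -> tuple[list[dict], list[dict]]:
    """Map Prometheus samples onto the unified schema; quarantine rows that fail checks."""
    good, quarantined = [], []
    for s in samples:
        ts, value = s["value"]            # [timestamp, "string value"]
        row = {
            "resource_id": s["metric"].get("dimension_InstanceId", ""),
            "ts": int(float(ts)),
            "cpu_pct": float(value),
            "tags": {k: v for k, v in s["metric"].items() if k.startswith("tag_")},
        }
        # Completeness, NaN, and range checks (0-100 for a percentage metric).
        if (not row["resource_id"]
                or math.isnan(row["cpu_pct"])
                or not 0.0 <= row["cpu_pct"] <= 100.0):
            quarantined.append(row)       # retried / backfilled by a later job
        else:
            good.append(row)
    return good, quarantined
```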
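Step 3's idempotent dual-write can be sketched roughly as follows, assuming a MySQL-compatible PolarDB endpoint accessed through a DB-API driver such as pymysql; the table and column names are illustrative. The upsert keyed on a unique (resource_id, ts) index is what makes retried batches safe.

```python
import json


def dual_write(rows: list[dict], conn, archive_path: str) -> None:
    """Append rows to the raw file archive, then upsert them into PolarDB."""
    # 1) Cheap append-only archive: one JSON object per line.
    with open(archive_path, "a", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

    # 2) Idempotent upsert into the query/BI store. Requires a unique key on
    #    (resource_id, ts) so a retried batch updates rather than duplicates.
    sql = (
        "INSERT INTO resource_utilization (resource_id, ts, cpu_pct, tags) "
        "VALUES (%s, %s, %s, %s) "
        "ON DUPLICATE KEY UPDATE cpu_pct = VALUES(cpu_pct), tags = VALUES(tags)"
    )
    with conn.cursor() as cur:
        cur.executemany(
            sql,
            [(r["resource_id"], r["ts"], r["cpu_pct"], json.dumps(r["tags"]))
             for r in rows],
        )
    conn.commit()
```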
Design Highlights
- Efficient Data Collection: Integrated YACE to scrape CloudWatch metrics once and expose them to Prometheus, avoiding repetitive API calls. Added Prometheus recording rules (1m/5m aggregates) to reduce query load and speed up ETL processing.
- Reliable Processing Pipeline: Developed a Python ETL job that retrieves Prometheus metrics, performs transformation and validation, and handles missing values, NaNs, and timestamp misalignments through retries and normalization.
- Robust Storage Design: Implemented a dual-storage strategy (file system + PolarDB) with idempotent writes, ensuring consistency, recoverability, and low operational cost.
- Comprehensive Monitoring: Added built-in health checks and alerts for collection failures, processing errors, and data anomalies, with detailed logging for troubleshooting (a freshness-check sketch follows this list).
- Scalable & Maintainable: Modular architecture and parameterized configs make it easy to extend to new AWS services or integrate with future cost-management and analytics platforms.
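As one example of the "monitor the monitoring" idea, the sketch below flags a data gap when the newest stored row for a resource falls too far behind wall-clock time; the table name, threshold, and alert hook are hypothetical.

```python
import time

MAX_LAG_SECONDS = 2 * 3600   # illustrative threshold: alert if a resource is >2h stale


def check_freshness(conn, alert) -> None:
    """Alert when the newest stored row for any resource is older than the allowed lag."""
    sql = (
        "SELECT resource_id, MAX(ts) AS last_ts "
        "FROM resource_utilization GROUP BY resource_id"
    )
    now = int(time.time())
    with conn.cursor() as cur:
        cur.execute(sql)
        for resource_id, last_ts in cur.fetchall():
            if now - int(last_ts) > MAX_LAG_SECONDS:
                alert(f"data gap: {resource_id} last reported {now - int(last_ts)}s ago")
```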
Impact
| Dimension | Before | After |
|---|---|---|
| Visibility | Scattered CloudWatch dashboards | Unified utilization view in PolarDB + Grafana |
| Cost Insight | Manual analysis per team | Automated reports identifying idle resources |
| Data Quality | Ad hoc queries, incomplete | 100% collection success, 0 data loss |
| Efficiency | Manual collection hours / week | Fully automated nightly runs + alerting |
Business Impact
- Enabled data-driven cost reduction and capacity planning
- Eliminated manual aggregation workload
- Provided foundation for future predictive analytics