AI Processing Platform
A 3-phase evolutionary architecture for scalable AI task processing
Architecture Evolution
From MVP to Scale — infrastructure grows with business demand
Task Processing Flow
From audio upload to summarized result — fully async pipeline
Capacity Planner
Calculate GPU requirements and costs for your workload
- STT GPUs Needed: 1
- LLM GPUs Needed: 1
- Estimated Monthly Cost (Self-hosted): $1,340
- Managed Service Cost: $3,850
Cost Crossover Analysis
When do self-hosted GPUs become cheaper than managed services?
Technology Selection
Technology choices evolve with each phase — here's what we chose and why
ECS Fargate
No GPU needed — zero cluster management, per-second billing. Control plane cost ($73/mo for EKS) not justified at this scale.
- Rejected alternative: high ops burden with no advantage over EKS for GPU workloads
- Rejected alternative: no auto-scaling; suitable only for local dev
AWS Transcribe
Zero GPU ops, pay-per-minute ($0.024/min). Acceptable cost below 50K tasks/month. Validate product first.
- Rejected alternative: similar per-minute pricing, less integrated with the AWS ecosystem
- Rejected alternative: higher latency (cross-cloud), less cost-effective at scale
Amazon Bedrock (Claude/Titan)
Zero GPU infrastructure. Access to frontier models. ~$0.003/task. Ideal for POC validation.
- Rejected alternative: higher per-call cost, external dependency, no self-hosting option
SQS (Single Queue)
One queue with message attributes to distinguish STT vs LLM tasks. Simplest setup for low volume.
- Rejected alternative (Kafka on MSK): event streaming semantics are overkill for a task queue; MSK starts at $200+/mo
- Rejected alternative: requires self-hosting; used in local dev only (docker-compose)
RDS PostgreSQL (Single-AZ)
ACID for task state. Single-AZ keeps cost low (~$15/mo). Acceptable downtime risk for POC.
- Rejected alternative (Aurora): 3x the cost; 6-way replication is overkill for this workload
- Rejected alternative: poor at the relational queries needed for task state management
ElastiCache Redis (Single Node)
Result caching, idempotency locks, rate limiting. Single node sufficient for POC (~$12/mo).
- Rejected alternative: no persistence, no pub/sub, no SETNX for distributed locks
CloudWatch
Built-in with AWS. Zero setup. Sufficient for basic metrics and logs at POC stage.
- Rejected alternative: per-host pricing scales expensively with GPU nodes
- Rejected alternative: similar per-host cost concern
ECS Rolling Update
Simple rolling update. No canary needed for POC. Fast iteration.
- Rejected alternative (blue/green deployment): requires 2x resources during the deployment window, more costly
Go + Echo
Goroutines (~2KB each) excel at I/O-bound HTTP calls. Single binary ~10MB Docker image. Echo provides mature middleware.
- Rejected alternative (Python): the GIL limits concurrency, and its ML ecosystem is irrelevant since workers only make HTTP calls
- Rejected alternative: higher memory use (~1MB per thread), slower cold start
Architecture Characteristics
Six dimensions of system quality
KEDA + Cluster Autoscaler for proactive, queue-depth-driven scaling
- KEDA scales workers based on SQS queue depth (proactive, not reactive)
- Cluster Autoscaler provisions GPU nodes when pods are unschedulable
- Spot + On-Demand GPU mix for cost optimization (40-60% savings)
- API pods scale via standard HPA on CPU utilization
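The queue-depth trigger above maps onto a KEDA `ScaledObject` using its `aws-sqs-queue` scaler. A minimal sketch; the deployment name, queue URL, and replica bounds are placeholder assumptions:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: stt-worker-scaler
spec:
  scaleTargetRef:
    name: stt-worker              # Deployment running the STT workers (assumed name)
  minReplicaCount: 0              # scale to zero when the queue is empty
  maxReplicaCount: 20
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/tasks  # placeholder
        queueLength: "5"          # target messages per replica
        awsRegion: us-east-1
```

When pending replicas exceed node capacity, the pods go unschedulable and the Cluster Autoscaler provisions the GPU nodes, completing the chain described above.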
Deployment & Operations
From local development to production with zero-downtime canary releases