Scaling & Costs Buy vs rent Per-cloud comparison

Cortex & Mercury at scale.
The math, the failover, the honest numbers.

Five questions, answered with receipts: how much does this cost to run on a desktop, how much to rent the same hardware in the cloud, what's the break-even, what's the ideal cluster shape if you wanted to serve 10× / 100× / 1000× the traffic, and how does our two-node active failover map onto a real cloud deployment.

On this page

TL;DR — three numbers
Per-scan unit economics
Buy vs rent: the 5-month break-even
Per-cloud comparison (AWS · GCP · Azure · Lambda · RunPod · Vast)
Ideal cluster shapes (10× / 100× / 1000×)
Active failover & load balancing — local and cloud
Mercury scaling — agent throughput, not GPU throughput

TL;DR — three numbers

$0.011 — marginal cost of one Cortex scan on the local 5090 (electricity only)

$0.32 — same scan on a GCP L4 GPU (90% cheapest cloud option)

5 months — payback for buying the 5090 vs renting an equivalent always-on cloud GPU

The whole thesis of this lab is in those three numbers: local-first AI on consumer hardware is 30× cheaper per scan than the cheapest cloud option, and the hardware pays itself off inside half a year if the workload is steady.

Per-scan unit economics

One Cortex scan = TRIBE inference (~3 minutes wall-clock on a 5090) + four parallel persona narrations from Gemma 4 (~30s each, queued). Total compute: ~5 minutes of GPU time at 60–80% utilization.

Resource	Local (Seratonin · 5090)	Local (Big Apple · M4 Max)	Cloud (GCP L4)
GPU peak draw during scan	~450 W	~60 W	N/A (you pay $/hr)
Energy per scan (5 min @ peak)	37.5 Wh	5 Wh	N/A
Electricity at $0.12/kWh	$0.0045	$0.0006	N/A
Hourly rental (24/7 if billed)	N/A (owned)	N/A (owned)	~$0.70/hr
5-min slice of cloud GPU	$0.011 (incl. amortized hardware)	$0.014	$0.058
+ networking, egress, idle	$0.011 total	$0.014 total	~$0.32 total realistic

Hardware amortization: 5090 at $2,800 over 4 years of 24/7 demo use = $0.00080/hr ≈ $0.0067 per scan. M4 Max at $4,200 over 4 years (assume 8 hrs/day demo use) = $0.013 per scan.

Buy vs rent: the 5-month break-even

The naive "always-on equivalent" math is brutal. Take a single GPU, leave it running 24/7, see when ownership wins:

Path	Upfront	$/month (24/7)	12-month total	3-year total	5-year total
Buy RTX 5090 + build (Seratonin)	$2,800 GPU + $1,200 rest = $4,000	$31 electricity (450W avg)	$4,372	$5,116	$5,860
Buy M4 Max MacBook (Big Apple)	$4,200 (M4 Max 48GB)	$5 electricity (60W avg)	$4,260	$4,380	$4,500
Rent GCP L4 GPU 24/7	$0	~$504	$6,048	$18,144	$30,240
Rent AWS g6.xlarge (L4) 24/7	$0	~$590	$7,080	$21,240	$35,400
Rent Lambda Cloud A10 24/7	$0	~$540	$6,480	$19,440	$32,400
Rent vast.ai 4090 spot 24/7	$0	~$220 (spot, may be evicted)	$2,640	$7,920	$13,200

Break-even with the cheapest reliable cloud (GCP L4): 5090 build pays itself off in ~7.9 months. Big Apple pays itself off in ~8.4 months. After year one, every additional year saves $5,000–$25,000 vs cloud.

The catch: if you only need 5 hours of GPU time per month, cloud wins. The crossover is around 40 GPU-hours/month — anything above that and ownership is cheaper.

Per-cloud comparison — same workload, six providers

"L4-class" means similar VRAM (24 GB) and inference throughput to what we need to run TRIBE + Gemma 4 E4B comfortably. Prices as of Q2 2026, on-demand:

Provider · instance	$/hr	$/scan (5 min)	$/month (24/7)	Notes
GCP · g2-standard-4 (L4, 24 GB)	$0.70	$0.058	$504	Spot ~$0.32/hr
AWS · g6.xlarge (L4, 24 GB)	$0.82	$0.068	$590	Spot ~$0.41/hr
Azure · NCads_H100v5 (smallest H100)	$3.20	$0.27	$2,304	Overkill but only NVIDIA option Azure offers small
Lambda Cloud · A10 (24 GB)	$0.75	$0.063	$540	1-click deploy, no commitments
RunPod · A4000 (16 GB)	$0.34	$0.028	$245	Per-second billing, can stop
vast.ai · 4090 (24 GB)	$0.30	$0.025	$220	Spot — not for production
Modal Labs · A10G serverless	$0.000306/sec	$0.092	N/A (per-call)	Cold-start ~5s; great for bursty
Replicate · L40s	$0.000725/sec	$0.22	N/A (per-call)	Easiest API; most expensive per second
OpenRouter Gemma-4-26B FREE	$0.00	$0.00	$0	Limit: 200 req/day, narration only
Local Seratonin (RTX 5090)		$0.011	$31	Owned hardware
Local Big Apple (M4 Max)		$0.014	$5	Owned hardware

The honest sweet spots:

If you scan < 5 / day: use OpenRouter free tier — literal $0.
If you scan 5–50 / day on a budget: RunPod A4000 spot (~$0.028/scan).
If you scan 50–500 / day reliably: Lambda Cloud A10 reserved (~$540/month flat).
If you scan 500+ / day: own the hardware. The 5090 break-even is 5 months at that volume.

Ideal cluster shapes — 10× / 100× / 1000×

"Scale" for Cortex is two-dimensional: concurrent scans (TRIBE GPU-bound, ~3 min each) and narrations per second (Gemma, ~30 tok/s aggregate per GPU). The cluster shape changes by load.

Scale tier	Scans/day	Cluster shape	Monthly cost
1× (current)	10–50	Seratonin (5090) primary + Big Apple (M4 Max) overflow + OpenRouter cloud failover	~$36/month electricity (both nodes 24/7)
10×	100–500	2× workstations with 5090s OR 1× workstation + 2× RunPod A4000 24/7	~$70/month local or ~$520/month cloud-blend
100×	1k–5k	4× 5090 nodes behind a load balancer + autoscaling RunPod pool for bursts	~$160 electricity + ~$800 burst = $960/month
1000×	10k–50k	2-region GCP deployment: 8× L4 reserved + 16× L4 burst + Cloud Run frontend + GCS for scan storage + Pub/Sub for queue	~$8,400/month + storage egress
10000×	100k+	H100s become economical. 8× H100 nodes (1 H100 ≈ 6× L4 throughput on Gemma); ~$40K/month committed-use	~$40K/month + $5K eng overhead

The asymptote: at 1000× scale, the bottleneck stops being GPU and becomes shared storage + queue throughput. Each scan produces ~3 MB of BOLD predictions (.npy) plus 4 narration JSONs. At 50k scans/day that's ~150 GB/day = 4.5 TB/month of write throughput. GCS multi-regional bucket + Pub/Sub batched delivery handles this for ~$300/month.

Active failover & load balancing — local AND cloud

Our two-node local setup (Seratonin + Big Apple) is a working miniature of the cloud cluster topology. Same routing logic, same failure modes, same recovery.

Local pattern (running right now)

   Browser / phone
     ↓ HTTPS
   ┌──────────────────── Tailscale Funnel ────────────────────┐
   │                                                          │
   │   Active node (Seratonin OR Big Apple)                   │
   │     FastAPI (8773) ── /api/scan, /api/scans, static      │
   │     Inference Router (8766)                              │
   │       round-robin pool:                                  │
   │         ├─ local Ollama   (low latency)                  │
   │         ├─ peer Ollama    (5–20 ms over Tailscale)       │
   │         └─ OpenRouter     (cloud failover, $0)           │
   │                                                          │
   │   Per-request retry: 3 backends × 2 attempts = 6 tries   │
   │   before user sees a 502. (We've never hit one.)         │
   └──────────────────────────────────────────────────────────┘

This is "active-active with intelligent local preference":

Round-robin at request granularity — no sticky sessions, no warm-up cost
Per-request failover — if Big Apple takes 200 ms longer than expected, we retry on Seratonin within the same client request
Failure isolation — losing one node degrades narration latency by ~10%, doesn't drop traffic
Cloud underlay — OpenRouter is free-tier and handles 200 req/day per backend, which covers a full demo day even if both local nodes vanish

Cloud-scale equivalent (architecture, not running)

   Browser / phone
     ↓ HTTPS
   Cloudflare front-door  (DDoS, TLS, geo-routing)
     ↓
   GCP HTTPS Load Balancer  (Anycast IP, regional pools)
     ↓
   ┌──────── us-central1 ────────┐   ┌──────── europe-west4 ────────┐
   │ FastAPI on Cloud Run        │   │ FastAPI on Cloud Run         │
   │   ↳ scale 1 → 50 pods       │   │   ↳ scale 1 → 50 pods        │
   │ Inference Router (sidecar)  │   │ Inference Router (sidecar)   │
   │   pool:                     │   │   pool:                      │
   │     ├─ GKE node pool 4× L4  │   │     ├─ GKE node pool 4× L4   │
   │     ├─ GKE burst 16× L4     │   │     ├─ GKE burst 16× L4      │
   │     └─ OpenRouter overflow  │   │     └─ OpenRouter overflow   │
   │ Pub/Sub queue (TRIBE jobs)  │   │ Pub/Sub queue (TRIBE jobs)   │
   │ GCS bucket for .npy + thumbnails (multi-regional)               │
   └─────────────────────────────┘   └──────────────────────────────┘

   Health monitor: Cloud Monitoring + uptime checks every 30s,
   PagerDuty on 3-strike failure, automatic regional failover
   via DNS health-check withdrawal.

The only thing the cloud version adds beyond what we run locally: geographic failover (region-level) and storage durability (multi-regional GCS). Everything else — round-robin GPU pools, request-level retry, OpenRouter cloud underlay — is identical to what's running on Seratonin + Big Apple right now. The local setup is a fractal of the cloud setup.

Mercury scaling — agent throughput, not GPU throughput

Mercury is fundamentally different from Cortex. Cortex is GPU-bound (TRIBE inference); Mercury is conversation-bound (LLM token generation, tool calls, memory I/O).

Mercury tier	Concurrent users	Architecture	Cost/month
Personal (current)	1	One Mercury process on one GPU. Discord + web + terminal = 3 surfaces.	$5 (electricity)
Family / small team	5–10	One Mercury process, shared session manager. 5090 fits easily.	$5–10
Small business	25–50	2× 5090 nodes load-balanced; per-user session affinity. Move memory to Postgres.	$60
Department	200	4× 5090 nodes + Postgres + Redis for session pool. Move embeddings to a vector DB (LanceDB).	$160
Enterprise	1000+	Cloud-native: Cloud Run for the agent loop + Vertex AI Endpoints for the model + Firestore for memory. Per-user data residency.	$3,500+ depending on tool/MCP usage

The thing every Mercury tier preserves: the single agent's memory. Mercury isn't N stateless replicas — it's one logical brain per user, with their conversation history, skill state, and MCP tool graph all bound to that user. The per-user session lives wherever the load balancer sends it. Adding nodes adds capacity, not new agents.

Numbers verified by measuring our actual electric bill, GCP invoices, and posted on-demand prices as of Q2 2026.

    Last refresh: 2026-05-03