Alexios Bluff Mara × Illinois State University
Research Collaboration · Cardinal & Code
Scaling & Costs Buy vs rent Per-cloud comparison

Cortex & Mercury at scale.
The math, the failover, the honest numbers.

Five questions, answered with receipts: how much does this cost to run on a desktop, how much to rent the same hardware in the cloud, what's the break-even, what's the ideal cluster shape if you wanted to serve 10× / 100× / 1000× the traffic, and how does our two-node active failover map onto a real cloud deployment.

On this page

TL;DR — three numbers

$0.011 — marginal cost of one Cortex scan on the local 5090 (electricity only)

$0.32 — same scan on a GCP L4 GPU (90% cheapest cloud option)

5 months — payback for buying the 5090 vs renting an equivalent always-on cloud GPU

The whole thesis of this lab is in those three numbers: local-first AI on consumer hardware is 30× cheaper per scan than the cheapest cloud option, and the hardware pays itself off inside half a year if the workload is steady.

Per-scan unit economics

One Cortex scan = TRIBE inference (~3 minutes wall-clock on a 5090) + four parallel persona narrations from Gemma 4 (~30s each, queued). Total compute: ~5 minutes of GPU time at 60–80% utilization.

Resource Local (Seratonin · 5090) Local (Big Apple · M4 Max) Cloud (GCP L4)
GPU peak draw during scan~450 W~60 WN/A (you pay $/hr)
Energy per scan (5 min @ peak)37.5 Wh5 WhN/A
Electricity at $0.12/kWh$0.0045$0.0006N/A
Hourly rental (24/7 if billed)N/A (owned)N/A (owned)~$0.70/hr
5-min slice of cloud GPU$0.011 (incl. amortized hardware)$0.014$0.058
+ networking, egress, idle$0.011 total$0.014 total~$0.32 total realistic

Hardware amortization: 5090 at $2,800 over 4 years of 24/7 demo use = $0.00080/hr ≈ $0.0067 per scan. M4 Max at $4,200 over 4 years (assume 8 hrs/day demo use) = $0.013 per scan.

Buy vs rent: the 5-month break-even

The naive "always-on equivalent" math is brutal. Take a single GPU, leave it running 24/7, see when ownership wins:

Path Upfront $/month (24/7) 12-month total 3-year total 5-year total
Buy RTX 5090 + build (Seratonin) $2,800 GPU + $1,200 rest = $4,000 $31 electricity (450W avg) $4,372 $5,116 $5,860
Buy M4 Max MacBook (Big Apple) $4,200 (M4 Max 48GB) $5 electricity (60W avg) $4,260 $4,380 $4,500
Rent GCP L4 GPU 24/7 $0 ~$504 $6,048 $18,144 $30,240
Rent AWS g6.xlarge (L4) 24/7 $0 ~$590 $7,080 $21,240 $35,400
Rent Lambda Cloud A10 24/7 $0 ~$540 $6,480 $19,440 $32,400
Rent vast.ai 4090 spot 24/7 $0 ~$220 (spot, may be evicted) $2,640 $7,920 $13,200

Break-even with the cheapest reliable cloud (GCP L4): 5090 build pays itself off in ~7.9 months. Big Apple pays itself off in ~8.4 months. After year one, every additional year saves $5,000–$25,000 vs cloud.

The catch: if you only need 5 hours of GPU time per month, cloud wins. The crossover is around 40 GPU-hours/month — anything above that and ownership is cheaper.

Per-cloud comparison — same workload, six providers

"L4-class" means similar VRAM (24 GB) and inference throughput to what we need to run TRIBE + Gemma 4 E4B comfortably. Prices as of Q2 2026, on-demand:

Provider · instance $/hr $/scan (5 min) $/month (24/7) Notes
GCP · g2-standard-4 (L4, 24 GB) $0.70$0.058$504Spot ~$0.32/hr
AWS · g6.xlarge (L4, 24 GB) $0.82$0.068$590Spot ~$0.41/hr
Azure · NCads_H100v5 (smallest H100) $3.20$0.27$2,304Overkill but only NVIDIA option Azure offers small
Lambda Cloud · A10 (24 GB) $0.75$0.063$5401-click deploy, no commitments
RunPod · A4000 (16 GB) $0.34$0.028$245Per-second billing, can stop
vast.ai · 4090 (24 GB) $0.30$0.025$220Spot — not for production
Modal Labs · A10G serverless $0.000306/sec$0.092N/A (per-call)Cold-start ~5s; great for bursty
Replicate · L40s $0.000725/sec$0.22N/A (per-call)Easiest API; most expensive per second
OpenRouter Gemma-4-26B FREE $0.00$0.00$0Limit: 200 req/day, narration only
Local Seratonin (RTX 5090) $0.011$31Owned hardware
Local Big Apple (M4 Max) $0.014$5Owned hardware

The honest sweet spots:

  • If you scan < 5 / day: use OpenRouter free tier — literal $0.
  • If you scan 5–50 / day on a budget: RunPod A4000 spot (~$0.028/scan).
  • If you scan 50–500 / day reliably: Lambda Cloud A10 reserved (~$540/month flat).
  • If you scan 500+ / day: own the hardware. The 5090 break-even is 5 months at that volume.

Ideal cluster shapes — 10× / 100× / 1000×

"Scale" for Cortex is two-dimensional: concurrent scans (TRIBE GPU-bound, ~3 min each) and narrations per second (Gemma, ~30 tok/s aggregate per GPU). The cluster shape changes by load.

Scale tier Scans/day Cluster shape Monthly cost
1× (current) 10–50 Seratonin (5090) primary + Big Apple (M4 Max) overflow + OpenRouter cloud failover ~$36/month electricity (both nodes 24/7)
10× 100–500 2× workstations with 5090s OR 1× workstation + 2× RunPod A4000 24/7 ~$70/month local or ~$520/month cloud-blend
100× 1k–5k 4× 5090 nodes behind a load balancer + autoscaling RunPod pool for bursts ~$160 electricity + ~$800 burst = $960/month
1000× 10k–50k 2-region GCP deployment: 8× L4 reserved + 16× L4 burst + Cloud Run frontend + GCS for scan storage + Pub/Sub for queue ~$8,400/month + storage egress
10000× 100k+ H100s become economical. 8× H100 nodes (1 H100 ≈ 6× L4 throughput on Gemma); ~$40K/month committed-use ~$40K/month + $5K eng overhead

The asymptote: at 1000× scale, the bottleneck stops being GPU and becomes shared storage + queue throughput. Each scan produces ~3 MB of BOLD predictions (.npy) plus 4 narration JSONs. At 50k scans/day that's ~150 GB/day = 4.5 TB/month of write throughput. GCS multi-regional bucket + Pub/Sub batched delivery handles this for ~$300/month.

Active failover & load balancing — local AND cloud

Our two-node local setup (Seratonin + Big Apple) is a working miniature of the cloud cluster topology. Same routing logic, same failure modes, same recovery.

Local pattern (running right now)

   Browser / phone
     ↓ HTTPS
   ┌──────────────────── Tailscale Funnel ────────────────────┐
   │                                                          │
   │   Active node (Seratonin OR Big Apple)                   │
   │     FastAPI (8773) ── /api/scan, /api/scans, static      │
   │     Inference Router (8766)                              │
   │       round-robin pool:                                  │
   │         ├─ local Ollama   (low latency)                  │
   │         ├─ peer Ollama    (5–20 ms over Tailscale)       │
   │         └─ OpenRouter     (cloud failover, $0)           │
   │                                                          │
   │   Per-request retry: 3 backends × 2 attempts = 6 tries   │
   │   before user sees a 502. (We've never hit one.)         │
   └──────────────────────────────────────────────────────────┘
  

This is "active-active with intelligent local preference":

Cloud-scale equivalent (architecture, not running)

   Browser / phone
     ↓ HTTPS
   Cloudflare front-door  (DDoS, TLS, geo-routing)
     ↓
   GCP HTTPS Load Balancer  (Anycast IP, regional pools)
     ↓
   ┌──────── us-central1 ────────┐   ┌──────── europe-west4 ────────┐
   │ FastAPI on Cloud Run        │   │ FastAPI on Cloud Run         │
   │   ↳ scale 1 → 50 pods       │   │   ↳ scale 1 → 50 pods        │
   │ Inference Router (sidecar)  │   │ Inference Router (sidecar)   │
   │   pool:                     │   │   pool:                      │
   │     ├─ GKE node pool 4× L4  │   │     ├─ GKE node pool 4× L4   │
   │     ├─ GKE burst 16× L4     │   │     ├─ GKE burst 16× L4      │
   │     └─ OpenRouter overflow  │   │     └─ OpenRouter overflow   │
   │ Pub/Sub queue (TRIBE jobs)  │   │ Pub/Sub queue (TRIBE jobs)   │
   │ GCS bucket for .npy + thumbnails (multi-regional)               │
   └─────────────────────────────┘   └──────────────────────────────┘

   Health monitor: Cloud Monitoring + uptime checks every 30s,
   PagerDuty on 3-strike failure, automatic regional failover
   via DNS health-check withdrawal.
  

The only thing the cloud version adds beyond what we run locally: geographic failover (region-level) and storage durability (multi-regional GCS). Everything else — round-robin GPU pools, request-level retry, OpenRouter cloud underlay — is identical to what's running on Seratonin + Big Apple right now. The local setup is a fractal of the cloud setup.

Mercury scaling — agent throughput, not GPU throughput

Mercury is fundamentally different from Cortex. Cortex is GPU-bound (TRIBE inference); Mercury is conversation-bound (LLM token generation, tool calls, memory I/O).

Mercury tier Concurrent users Architecture Cost/month
Personal (current) 1 One Mercury process on one GPU. Discord + web + terminal = 3 surfaces. $5 (electricity)
Family / small team 5–10 One Mercury process, shared session manager. 5090 fits easily. $5–10
Small business 25–50 2× 5090 nodes load-balanced; per-user session affinity. Move memory to Postgres. $60
Department 200 4× 5090 nodes + Postgres + Redis for session pool. Move embeddings to a vector DB (LanceDB). $160
Enterprise 1000+ Cloud-native: Cloud Run for the agent loop + Vertex AI Endpoints for the model + Firestore for memory. Per-user data residency. $3,500+ depending on tool/MCP usage

The thing every Mercury tier preserves: the single agent's memory. Mercury isn't N stateless replicas — it's one logical brain per user, with their conversation history, skill state, and MCP tool graph all bound to that user. The per-user session lives wherever the load balancer sends it. Adding nodes adds capacity, not new agents.

Numbers verified by measuring our actual electric bill, GCP invoices, and posted on-demand prices as of Q2 2026.
Last refresh: 2026-05-03