Five questions, answered with receipts: how much this costs to run on a desktop, how much the same hardware costs to rent in the cloud, where the break-even sits, what the ideal cluster shape looks like at 10× / 100× / 1000× the traffic, and how our two-node active failover maps onto a real cloud deployment.
$0.011 — all-in cost of one Cortex scan on the local 5090 (electricity plus amortized hardware)
$0.32 — same scan on a GCP L4 GPU (the cheapest reliable cloud option)
~6 months — payback for buying the 5090 GPU alone vs renting an equivalent always-on cloud GPU (~8.5 months counting the full build)
The whole thesis of this lab is in those three numbers: local-first AI on consumer hardware is ~30× cheaper per scan than the cheapest cloud option, and the hardware pays for itself in under a year if the workload is steady.
One Cortex scan = TRIBE inference (~3 minutes wall-clock on a 5090) + four persona narrations from Gemma 4 (~30 s each, queued back-to-back). Total compute: ~5 minutes of GPU time at 60–80% utilization.
| Resource | Local (Seratonin · 5090) | Local (Big Apple · M4 Max) | Cloud (GCP L4) |
|---|---|---|---|
| GPU peak draw during scan | ~450 W | ~60 W | N/A (you pay $/hr) |
| Energy per scan (5 min @ peak) | 37.5 Wh | 5 Wh | N/A |
| Electricity at $0.12/kWh | $0.0045 | $0.0006 | N/A |
| Hourly rental rate | N/A (owned) | N/A (owned) | ~$0.70/hr |
| Cost per scan (5-min slice) | $0.011 (incl. amortized hardware) | $0.031 (incl. amortized hardware) | $0.058 |
| All-in per scan (+ networking, egress, idle) | $0.011 | $0.031 | ~$0.32 realistic |
Hardware amortization: 5090 at $2,800 over 4 years of 24/7 demo use = $0.080/hr ≈ $0.0067 per scan. M4 Max at $4,200 over 4 years at 8 hrs/day of demo use = $0.36/hr ≈ $0.030 per scan.
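For the skeptical, the whole per-scan model fits in a few lines. A minimal sketch using exactly the assumptions above (peak draw, $0.12/kWh, and the two amortization schedules):

```python
# Per-scan cost = electricity for the 5-minute slice + amortized hardware.
SCAN_MINUTES = 5            # ~3 min TRIBE + ~2 min queued narrations
USD_PER_KWH = 0.12

def per_scan_cost(peak_watts: float, hardware_usd: float,
                  lifetime_hours: float) -> float:
    electricity = peak_watts / 1000 * SCAN_MINUTES / 60 * USD_PER_KWH
    amortization = hardware_usd / lifetime_hours * SCAN_MINUTES / 60
    return electricity + amortization

print(per_scan_cost(450, 2_800, 4 * 365 * 24))  # Seratonin: ~$0.011
print(per_scan_cost(60, 4_200, 4 * 365 * 8))    # Big Apple: ~$0.031
```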
The naive "always-on equivalent" math is brutal. Take a single GPU, leave it running 24/7, see when ownership wins:
| Path | Upfront | $/month (24/7) | 12-month total | 3-year total | 5-year total |
|---|---|---|---|---|---|
| Buy RTX 5090 + build (Seratonin) | $2,800 GPU + $1,200 rest = $4,000 | $31 electricity (~360 W avg, 450 W peak) | $4,372 | $5,116 | $5,860 |
| Buy M4 Max MacBook (Big Apple) | $4,200 (M4 Max 48GB) | $5 electricity (60W avg) | $4,260 | $4,380 | $4,500 |
| Rent GCP L4 GPU 24/7 | $0 | ~$504 | $6,048 | $18,144 | $30,240 |
| Rent AWS g6.xlarge (L4) 24/7 | $0 | ~$590 | $7,080 | $21,240 | $35,400 |
| Rent Lambda Cloud A10 24/7 | $0 | ~$540 | $6,480 | $19,440 | $32,400 |
| Rent vast.ai 4090 spot 24/7 | $0 | ~$220 (spot, may be evicted) | $2,640 | $7,920 | $13,200 |
Break-even against the cheapest reliable cloud (GCP L4, ~$504/month): the 5090 build pays for itself in ~8.5 months and Big Apple in ~8.4 (upfront cost divided by the rental avoided, net of electricity). After year one, every additional year saves $5,000–$25,000 vs cloud.
The catch: if you only need a few GPU-hours per month, cloud wins. Against the L4's $0.70/hr, the crossover is around 120 GPU-hours/month (a $4,000 build amortized over 4 years runs ~$83/month); anything above that and ownership is cheaper.
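The same arithmetic as a sketch, so you can plug in your own numbers (all figures from the tables above):

```python
# Months until buying beats renting 24/7, and the monthly-hours crossover.
def breakeven_months(build_usd, own_monthly_usd, rent_monthly_usd):
    return build_usd / (rent_monthly_usd - own_monthly_usd)

def crossover_hours_per_month(build_usd, amort_months, rent_per_hour):
    return build_usd / amort_months / rent_per_hour

print(breakeven_months(4_000, 31, 504))            # ~8.5 (5090 build vs GCP L4)
print(breakeven_months(4_200, 5, 504))             # ~8.4 (Big Apple vs GCP L4)
print(crossover_hours_per_month(4_000, 48, 0.70))  # ~119 GPU-hours/month
```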
"L4-class" means similar VRAM (24 GB) and inference throughput to what we need to run TRIBE + Gemma 4 E4B comfortably. Prices as of Q2 2026, on-demand:
| Provider · instance | $/hr | $/scan (5 min) | $/month (24/7) | Notes |
|---|---|---|---|---|
| GCP · g2-standard-4 (L4, 24 GB) | $0.70 | $0.058 | $504 | Spot ~$0.32/hr |
| AWS · g6.xlarge (L4, 24 GB) | $0.82 | $0.068 | $590 | Spot ~$0.41/hr |
| Azure · NCads_H100v5 (smallest H100) | $3.20 | $0.27 | $2,304 | Overkill, but the smallest NVIDIA option Azure offers |
| Lambda Cloud · A10 (24 GB) | $0.75 | $0.063 | $540 | 1-click deploy, no commitments |
| RunPod · A4000 (16 GB) | $0.34 | $0.028 | $245 | Per-second billing, can stop |
| vast.ai · 4090 (24 GB) | $0.30 | $0.025 | $220 | Spot — not for production |
| Modal Labs · A10G serverless | $0.000306/sec | $0.092 | N/A (per-call) | Cold-start ~5s; great for bursty |
| Replicate · L40s | $0.000725/sec | $0.22 | N/A (per-call) | Easiest API; most expensive per second |
| OpenRouter Gemma-4-26B FREE | $0.00 | $0.00 | $0 | Limit: 200 req/day, narration only |
| Local Seratonin (RTX 5090) | — (owned) | $0.011 | $31 electricity | Amortization included in $/scan |
| Local Big Apple (M4 Max) | — (owned) | $0.031 | $5 electricity | Amortization included in $/scan |
The honest sweet spots:
- Steady daily workload: own the hardware; nothing rented comes close per scan.
- Bursty or occasional: Modal's per-second serverless or RunPod's stop-when-idle billing.
- Experiments you can afford to lose: vast.ai spot pricing.
- Narration overflow: OpenRouter's free tier, within its 200 req/day limit.
"Scale" for Cortex is two-dimensional: concurrent scans (TRIBE GPU-bound, ~3 min each) and narrations per second (Gemma, ~30 tok/s aggregate per GPU). The cluster shape changes by load.
| Scale tier | Scans/day | Cluster shape | Monthly cost |
|---|---|---|---|
| 1× (current) | 10–50 | Seratonin (5090) primary + Big Apple (M4 Max) overflow + OpenRouter cloud failover | ~$36/month electricity (both nodes 24/7) |
| 10× | 100–500 | 2× workstations with 5090s OR 1× workstation + 2× RunPod A4000 24/7 | ~$70/month local or ~$520/month cloud-blend |
| 100× | 1k–5k | 4× 5090 nodes behind a load balancer + autoscaling RunPod pool for bursts | ~$160 electricity + ~$800 burst = $960/month |
| 1000× | 10k–50k | 2-region GCP deployment: 8× L4 reserved + 16× L4 burst + Cloud Run frontend + GCS for scan storage + Pub/Sub for queue | ~$8,400/month + storage egress |
| 10000× | 100k+ | H100s become economical: 8× H100 nodes on committed-use pricing (1 H100 ≈ 6× L4 throughput on Gemma) | ~$40K/month + ~$5K eng overhead |
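As a sanity check on those cluster shapes, the steady-state GPU count falls out of the two load dimensions directly. A sketch, using the ~5 GPU-minutes per scan defined at the top:

```python
# Steady-state GPUs needed if scans arrive evenly around the clock.
def gpus_needed(scans_per_day: int, gpu_min_per_scan: float = 5) -> float:
    return scans_per_day * gpu_min_per_scan / 60 / 24

print(gpus_needed(500))     # ~1.7  -> the 10x tier's 2-3 nodes
print(gpus_needed(5_000))   # ~17   -> why the 100x tier leans on a burst pool
print(gpus_needed(10_000))  # ~35   -> inside the 1000x tier's 2x(8+16) L4s
```

Real traffic isn't flat, which is what the burst pools in the table are for.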
The asymptote: at 1000× scale, the bottleneck stops being GPU and becomes shared storage + queue throughput. Each scan produces ~3 MB of BOLD predictions (.npy) plus 4 narration JSONs. At 50k scans/day that's ~150 GB/day = 4.5 TB/month of writes. A GCS multi-regional bucket + Pub/Sub batched delivery handle this for ~$300/month.
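The envelope math, for the record (numbers straight from the paragraph above):

```python
# Storage/queue envelope at 1000x scale.
scans_per_day = 50_000
mb_per_scan = 3  # BOLD predictions (.npy); the 4 narration JSONs are noise
print(scans_per_day * mb_per_scan / 1_000)     # ~150 GB/day
print(scans_per_day * mb_per_scan * 30 / 1e6)  # ~4.5 TB/month
print(scans_per_day * mb_per_scan / 86_400)    # ~1.7 MB/s average write rate
```

An average write rate under 2 MB/s is trivial bandwidth-wise; the pressure is in per-object operations and queue fan-out, which is presumably why batched Pub/Sub delivery matters more than raw bytes.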
Our two-node local setup (Seratonin + Big Apple) is a working miniature of the cloud cluster topology. Same routing logic, same failure modes, same recovery.
Browser / phone
↓ HTTPS
┌──────────────────── Tailscale Funnel ────────────────────┐
│ │
│ Active node (Seratonin OR Big Apple) │
│ FastAPI (8773) ── /api/scan, /api/scans, static │
│ Inference Router (8766) │
│ round-robin pool: │
│ ├─ local Ollama (low latency) │
│ ├─ peer Ollama (5–20 ms over Tailscale) │
│ └─ OpenRouter (cloud failover, $0) │
│ │
│ Per-request retry: 3 backends × 2 attempts = 6 tries │
│ before user sees a 502. (We've never hit one.) │
└──────────────────────────────────────────────────────────┘
This is "active-active with intelligent local preference":
Browser / phone
↓ HTTPS
Cloudflare front-door (DDoS, TLS, geo-routing)
↓
GCP HTTPS Load Balancer (Anycast IP, regional pools)
↓
┌──────── us-central1 ────────┐ ┌──────── europe-west4 ────────┐
│ FastAPI on Cloud Run │ │ FastAPI on Cloud Run │
│ ↳ scale 1 → 50 pods │ │ ↳ scale 1 → 50 pods │
│ Inference Router (sidecar) │ │ Inference Router (sidecar) │
│ pool: │ │ pool: │
│ ├─ GKE node pool 4× L4 │ │ ├─ GKE node pool 4× L4 │
│ ├─ GKE burst 16× L4 │ │ ├─ GKE burst 16× L4 │
│ └─ OpenRouter overflow │ │ └─ OpenRouter overflow │
│ Pub/Sub queue (TRIBE jobs) │ │ Pub/Sub queue (TRIBE jobs) │
│ GCS bucket for .npy + thumbnails (multi-regional) │
└─────────────────────────────┘ └──────────────────────────────┘
Health monitor: Cloud Monitoring + uptime checks every 30s,
PagerDuty on 3-strike failure, automatic regional failover
via DNS health-check withdrawal.
The only things the cloud version adds beyond what we run locally are geographic failover (region-level) and storage durability (multi-regional GCS). Everything else — round-robin GPU pools, request-level retry, the OpenRouter cloud underlay — is identical to what's running on Seratonin + Big Apple right now. The local setup is a fractal of the cloud setup.
Mercury is fundamentally different from Cortex. Cortex is GPU-bound (TRIBE inference); Mercury is conversation-bound (LLM token generation, tool calls, memory I/O).
| Mercury tier | Concurrent users | Architecture | Cost/month |
|---|---|---|---|
| Personal (current) | 1 | One Mercury process on one GPU. Discord + web + terminal = 3 surfaces. | $5 (electricity) |
| Family / small team | 5–10 | One Mercury process, shared session manager. 5090 fits easily. | $5–10 |
| Small business | 25–50 | 2× 5090 nodes load-balanced; per-user session affinity. Move memory to Postgres. | $60 |
| Department | 200 | 4× 5090 nodes + Postgres + Redis for session pool. Move embeddings to a vector DB (LanceDB). | $160 |
| Enterprise | 1000+ | Cloud-native: Cloud Run for the agent loop + Vertex AI Endpoints for the model + Firestore for memory. Per-user data residency. | $3,500+ depending on tool/MCP usage |
The thing every Mercury tier preserves: the single agent's memory. Mercury isn't N stateless replicas — it's one logical brain per user, with that user's conversation history, skill state, and MCP tool graph bound to it. Session affinity pins each user to a node, and the session follows them there on every request. Adding nodes adds capacity, not new agents.
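One way to keep that user-to-node binding stable as nodes come and go is a hash ring; a sketch (illustrative only; the text above doesn't specify Mercury's actual affinity mechanism):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hashing: each user_id maps to a stable node, and adding
    a node only remaps the users that land on its new ring segments."""
    def __init__(self, nodes: list[str], vnodes: int = 64):
        self.ring = sorted(
            (self._h(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.keys = [k for k, _ in self.ring]

    @staticmethod
    def _h(s: str) -> int:
        return int(hashlib.sha256(s.encode()).hexdigest(), 16)

    def node_for(self, user_id: str) -> str:
        i = bisect.bisect(self.keys, self._h(user_id)) % len(self.keys)
        return self.ring[i][1]

ring = HashRing(["seratonin", "big-apple"])
print(ring.node_for("user-42"))  # same node on every call: session affinity
```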