Scaling

By default an app runs as one Fargate task doing everything — Octane plus the bundled queue worker and scheduler. That's the cheap floor and fine at low scale. The three workloads have different scaling shapes, though, so each can be extracted into its own ECS service that scales independently:

| Service | How it scales | Opt in with |
| --- | --- | --- |
| web | target tracking (request concurrency + CPU), `min`→`max` | `tasks.web.autoscaling` |
| queue | backlog-per-task, scales to zero | top-level `tasks.queue` |
| scheduler | never: pinned singleton (exactly one task) | top-level `tasks.scheduler` |

Extraction is opt-in by presence: there are no tasks.web.queue / tasks.web.scheduler flags. Add a top-level tasks.queue block to peel the worker tier (the queue worker and the scheduler) out of web, leaving web as pure Octane. Add tasks.scheduler as well to give cron its own pinned-singleton task (see The scheduler).

You can scale the web (and queue) service two ways:

Autoscaling — let AWS adjust the task count automatically from live metrics.
yolo scale — set the capacity yourself, out of band, without a deploy.

Autoscaling bounds (min/max) live in the manifest and are reconciled by sync, so they're declarative and never drift — with a guard so a stale manifest can't scale production down unattended (see Reducing capacity is guarded). A fixed service's desired count is create-only — set once, then owned by yolo scale, never reset by a routine sync or deploy.

Autoscaling

autoscaling is required on web and queue — there's no implicit default, and neither accepts the bare tasks.web: true / tasks.queue: true shorthand (only the scheduler does). yolo init scaffolds new apps with tasks.web.autoscaling: true (bounds 1–5), so a fresh app scales out of the box. To set your own bounds, expand the shorthand into a block:

```yaml
tasks:
  web:
    autoscaling:
      min: 1
      max: 6
      cpu-utilization: 65   # optional: the safety-net policy's target
```

The scaffolded shorthand takes the defaults (min: 1, max: 5):

```yaml
tasks:
  web:
    autoscaling: true       # shorthand for the defaults; `false` = a fixed single task
```

An enabled web tier must declare autoscaling — omitting it (or using the bare tasks.web: true shorthand) hard-fails the manifest check, so a tier's scaling posture is always a deliberate decision rather than an inherited default. With true (or a block), the next yolo sync / yolo sync:app registers an Application Auto Scaling scalable target on the ECS service (bounded by min/max) and attaches target-tracking policies to it. Set autoscaling: false to keep the service a fixed single task instead.

Two metrics, composed

YOLO runs two target-tracking policies at once. Application Auto Scaling takes the maximum desired count any policy asks for, so they compose rather than fight — scale-out always wins.

| Policy | Metric | Role |
| --- | --- | --- |
| Request concurrency | in-flight requests per task (derived) | The default, leading signal: concurrency climbs the instant traffic does, ahead of CPU. Scales the web tier under normal HTTP load. No tuning: its target comes from the task's pinned worker pool (sized from vCPU). |
| CPU | `ECSServiceAverageCPUUtilization` | The safety net. Catches load that pegs the CPU without raising request concurrency, such as a few heavy, low-rate requests. Target defaults to 65%. |

Both are on the moment the web tier declares autoscaling: true (or a block) — there's nothing to seed from a load test first. Scaling on the requests a task is actively serving rather than trailing CPU means faster responses need fewer tasks for the same traffic, and a spike is caught as it arrives.
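As a rough illustration of how the two policies compose, the sketch below takes the largest task count any policy asks for and clamps it to the manifest bounds. The function name and the policy labels are hypothetical, chosen only for this example; the real decision is made inside Application Auto Scaling.

```python
def combined_desired_count(policy_requests, min_tasks, max_tasks):
    """Pick the largest count any policy asks for, clamped to min/max.

    policy_requests maps a (hypothetical) policy label to the task count
    that policy would like to see.
    """
    wanted = max(policy_requests.values())
    return max(min_tasks, min(max_tasks, wanted))


# Concurrency asks for 4 tasks, CPU only for 2: scale-out wins.
requests = {"request-concurrency": 4, "cpu": 2}
print(combined_desired_count(requests, min_tasks=1, max_tasks=6))
```

Because the maximum is taken, the CPU policy can only ever add capacity on top of what concurrency already asked for, which is why it works as a safety net rather than a competing signal.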

How the concurrency target is derived

The ALB doesn't publish in-flight concurrency, so YOLO derives it with CloudWatch metric math from two metrics the ALB does publish, request rate and response time (Little's Law, concurrency = rate × latency):

concurrency_per_task = (RequestCountPerTarget / 60) × TargetResponseTime

and target-tracks it against the task's pinned worker pool held at 70% utilisation. The web tier pins its FrankenPHP worker count from the task's real vCPU allocation — 16 × vCPU, capped by memory — rather than letting FrankenPHP auto-detect it, which on Fargate reads the microVM's fixed ~2 vCPUs and so pins ~4 workers on every task whatever its size. Each worker serves one request at a time and blocks for that request's whole lifetime (including an SSR render it can't yield during), so the pool size is the per-task concurrency ceiling. A 1 vCPU task → 16 workers → a target of ~11 concurrent requests, leaving headroom for the within-minute peak and the next task's cold start. Resize the task (tasks.web.cpu) and both the pool and the target follow; there's no separate knob.
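The arithmetic above can be worked through in a few lines. This is a minimal sketch using only the figures quoted in this section (16 workers per vCPU, a 70% utilisation target); the function names are made up for the example, and the memory cap on the worker pool is left out because its formula isn't given here.

```python
WORKERS_PER_VCPU = 16       # from the text: 16 x vCPU
TARGET_UTILISATION = 0.70   # from the text: pool held at 70%


def worker_pool(vcpu):
    """Pinned worker count for a task size (memory cap not modelled)."""
    return int(WORKERS_PER_VCPU * vcpu)


def concurrency_target(vcpu):
    """Concurrent requests per task that the policy tracks towards."""
    return TARGET_UTILISATION * worker_pool(vcpu)


def concurrency_per_task(request_count_per_target, target_response_time_s):
    """Little's Law: requests per second x seconds per request."""
    return (request_count_per_target / 60) * target_response_time_s


# A 1 vCPU task: 16 workers, target of about 11 concurrent requests.
print(worker_pool(1), f"{concurrency_target(1):.1f}")
# 3000 requests/minute per target at 200 ms each is about 10 in flight.
print(f"{concurrency_per_task(3000, 0.2):.1f}")
```

In this example the measured concurrency (about 10) sits just under the 1 vCPU target (about 11), so the tier would hold steady; a rise in either request rate or latency pushes it over.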

Because the signal includes latency, a slow downstream dependency (a struggling database) raises concurrency and scales the web tier out even when more tasks won't help — the max bound is the backstop there, since CPU stays low when the stall is downstream.

Faster scale-out: burst

The two policies above scale on ALB metrics, which are 1-minute resolution — a good signal, but ~1 min behind a sudden spike. So once you're autoscaling, YOLO also runs a burst path. There's no knob for it: it's near-free and fails safe, and no app wants slower scaling, so — like the concurrency and CPU policies — it's just part of how web autoscaling works, provisioned wherever the scalable target is. (The signal is FrankenPHP's worker metrics, which only worker mode — Octane, the default — populates; a classic-mode tier simply never emits it, so the alarm sits inert and burst is a no-op there. Nothing to switch on or off.)

Burst adds a step-scaling policy driven by a high-resolution (10s) alarm on a signal the container reports about itself. Each web task publishes its saturation (the peak in-flight request count divided by the worker-pool size), which is an earlier indicator than the ALB because concurrency fills the pool before latency even climbs.

The numerator is counted directly (the task brackets every request) rather than read from FrankenPHP's busy_workers gauge. That gauge is sampled from the same after-response hook that publishes, so it under-reports the very pin burst exists to catch: the sampling worker counts itself as idle and observes the pool at the instant a worker has just been freed, so a genuinely pinned box reads low (high when idle, low when pinned: the signal is inverted). The pool size is still read from FrankenPHP, since it is a static gauge that reads correctly even under load.

Detection drops from ~60s to ~10–15s: once saturation clears 70% burst adds a task, and beyond 80% it adds two. The threshold sits at 70% rather than higher because saturation quantises to in-flight ÷ pool, so a 70% line trips a step below a full pin (e.g. 6-of-8 busy on a 0.5 vCPU task), while a tighter threshold would need a sustained 100% pin that rarely holds. Scale-in stays with the target-tracking policies, so burst can only ever scale out faster, never fight them.
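The step thresholds can be sketched as a small function. This is an illustration only: the names are hypothetical, and treating the 70% and 80% lines as inclusive is an assumption, since the text says only that saturation "clears" them.

```python
def saturation(peak_in_flight, pool_size):
    """Peak in-flight requests as a fraction of the worker pool."""
    return peak_in_flight / pool_size


def burst_step(sat):
    """Tasks the step policy would add for a given saturation.

    Assumption: thresholds are treated as inclusive (>=).
    """
    if sat >= 0.80:
        return 2
    if sat >= 0.70:
        return 1
    return 0


# A 0.5 vCPU task has a pool of 8 workers, so saturation moves in 12.5% steps.
for busy in (5, 6, 7):
    print(busy, saturation(busy, 8), burst_step(saturation(busy, 8)))
```

On an 8-worker pool the readings jump from 62.5% to 75% to 87.5%, which shows the quantisation the threshold choice is working around: 6-of-8 already trips the first step.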

How it works, and what it costs:

YOLO's service provider publishes the saturation directly via PutMetricData from an after-response hook on the autoscaling web tier, so the work rides a request that already holds a CPU slice rather than a separate loop competing for one on a pinned box. It publishes only while hot (≥50%, the step just below the trip, so the alarm is fed a not-breaching reading as load ramps), debounces to at most one read + put per ~5s per task (via Redis), and holds the cooldown after a tripping datapoint, so CloudWatch is touched only during a spike. If the pool-size scrape fails under load, it corroborates with a cheap local cgroup CPU read and breaches when CPU is high. The worker can always do that read, independent of the (possibly starved) endpoint. The reading is taken as a percentage of the task's allocated vCPU, which YOLO injects on the task-def: the Fargate microVM reports more vCPUs than a fractional task is throttled to, so a percent-of-visible-cores reading would never trip. Going direct lands the datapoint synchronously. An EMF log line would instead ride the logs pipeline, where the ECS awslogs driver exposes no flush-interval knob (AWS recommends ≤5s for high-res EMF alarms) and extraction is async, so it would surface on a cadence you don't control. The cost is one namespace-scoped cloudwatch:PutMetricData grant on the task role plus the aws/aws-sdk-php SDK, which ships in the image transitively via YOLO itself, a production dependency of every deployed app (a build preflight hard-fails if YOLO is only a dev dependency).
To turn that endpoint on, YOLO runs Octane against a Caddyfile it generates — your installed Octane stub with the top-level metrics global option added, passed via --caddyfile. It's generated only for an autoscaling Octane web tier and is your stub untouched bar that one line (so Octane still fills its own placeholders). An env var won't do here: octane:start rebuilds CADDY_GLOBAL_OPTIONS itself, discarding any value set on the task. The endpoint binds container-loopback (localhost:2019) only — never the load-balanced port — so it adds no external surface.
Cost is the one high-resolution alarm ($0.30/app/month) plus the custom metric (another ~$0.30, but only in months the service actually bursts, since a metric is billed only when it receives data). The puts themselves are effectively free: they fit inside CloudWatch's 1M-request/month API free tier, and cost $0.01 per 1,000 requests beyond it.
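Two pieces of the publishing logic above lend themselves to a short sketch: the hot-only, debounced publish gate, and the CPU fallback measured against the allocated vCPU. Everything here is a simplified model with hypothetical names; the debounce state lives in memory rather than Redis, and the post-trip cooldown is not modelled.

```python
HOT_THRESHOLD = 0.50      # from the text: publish only at >= 50% saturation
DEBOUNCE_SECONDS = 5.0    # from the text: at most one put per ~5s per task


class PublishGate:
    """Decides whether a request's after-response hook should publish."""

    def __init__(self):
        self.last_publish = None  # stand-in for the shared Redis timestamp

    def should_publish(self, sat, now):
        if sat < HOT_THRESHOLD:
            return False  # not hot: CloudWatch is never touched
        if self.last_publish is not None and now - self.last_publish < DEBOUNCE_SECONDS:
            return False  # another request published recently
        self.last_publish = now
        return True


def cpu_percent_of_allocation(cpu_seconds_used, wall_seconds, allocated_vcpu):
    """CPU use as a percentage of what the task is allowed to use."""
    return 100.0 * cpu_seconds_used / (wall_seconds * allocated_vcpu)


# A 0.5 vCPU task burning 0.5 CPU-seconds per second is fully pinned...
print(cpu_percent_of_allocation(0.5, 1.0, allocated_vcpu=0.5))
# ...but the same usage measured against 2 visible cores looks like 25%.
print(cpu_percent_of_allocation(0.5, 1.0, allocated_vcpu=2.0))
```

The second calculation is the reason the allocation is injected on the task definition: measured against the cores the microVM reports, a throttled fractional task could never look busy enough to breach.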

Burst is not a substitute for warm capacity

Even instant detection still waits ~55s for the new task to boot and pass ALB health. So reactive scaling — burst included — bottoms out at ~1 min to relief; below that you need a task that's already running (min ≥ N). Burst makes the spike that exceeds your warm headroom land faster; it doesn't remove the need for the headroom.

The in-request publish is also best-effort: on a single hard-pinned task (min 1 on a small box at ~99% CPU) where no request even completes, nothing inside the container escapes and burst can go dark. The CPU fallback covers the busy-but-serving case, but the CPU/concurrency target-tracking policies are the guaranteed backstop and min ≥ 2 or a larger task is the lever — burst sharpens the light-pin and multi-task case, it isn't a substitute for either.

The burst signal is graphed on the app's CloudWatch dashboard: the Worker saturation panel charts the busiest task's saturation with the Burst trip threshold (70%) drawn as a reference line, so you can see how close the tier runs to a burst and when one fired. The panel appears only on an autoscaling Octane web tier — the only place the metric exists.

The burst alarm and step policy aren't taggable, so (like the target-tracking policies) they don't appear in yolo audit; setting autoscaling: false (or switching the web tier to classic mode) deletes both on the next sync.

Shedding SSR under load

On an Inertia SSR app, the same worker-saturation reading drives a second, instant lever. Scale-out still bottoms out at ~1 min to relief (a new task has to boot), so to cover that window YOLO routes SSR through a saturation-aware gateway: while a task is flagged hot (the burst trip), it skips the Node render and serves CSR instead. Server-side rendering is the most expensive per-request CPU on the box, so shedding it the instant saturation trips frees the worker immediately and keeps the task responsive while the new capacity lands — one signal, a slow lever (add a task) and an instant local one (stop rendering). The flag is per task and self-expires on the burst cooldown, so SSR resumes automatically once the task stops tripping. The same gateway also bounds each render with a timeout so a single slow render can't pin a worker (a worker stuck on a synchronous CPU-bound render is how one hot task spirals into a health-check death-loop). It needs nothing in the manifest — it talks the stable Inertia SSR config/protocol so it's version-agnostic across Inertia v2/v3, is active wherever burst runs and Inertia SSR is enabled, a no-op on a non-Inertia app, and degrades to CSR on any failure, never an error.
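The gateway's decision flow can be modelled in a few lines. The sketch below is a language-neutral illustration in Python, not YOLO's PHP implementation: the class name, the `render` callable and its `timeout_s` argument are all hypothetical, and `None` stands in for "fall back to CSR".

```python
class SheddingGateway:
    """Toy model: skip SSR while the task is hot, never raise to the caller."""

    def __init__(self, render, cooldown_s, timeout_s):
        self.render = render          # hypothetical callable: render(page, timeout_s) -> html
        self.cooldown_s = cooldown_s  # how long a hot flag lasts (value is an assumption)
        self.timeout_s = timeout_s    # upper bound handed to each render
        self.hot_until = 0.0

    def mark_hot(self, now):
        """Called when the task's saturation trips; the flag self-expires."""
        self.hot_until = now + self.cooldown_s

    def dispatch(self, page, now):
        if now < self.hot_until:
            return None  # shed: serve CSR while the task is flagged hot
        try:
            return self.render(page, self.timeout_s)
        except Exception:
            return None  # any failure, including a timeout, degrades to CSR


def fake_render(page, timeout_s):
    return f"<div>{page}</div>"


gateway = SheddingGateway(fake_render, cooldown_s=60.0, timeout_s=1.0)
print(gateway.dispatch("Home", now=0.0))    # rendered on the server
gateway.mark_hot(now=10.0)
print(gateway.dispatch("Home", now=20.0))   # None: shed while hot
print(gateway.dispatch("Home", now=80.0))   # rendered again after expiry
```

The point of the shape is that every path returns either markup or the CSR fallback, so a hot task or a broken renderer changes how the page is rendered but never turns into an error response.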

YOLO owns the Inertia Gateway binding

On the autoscaling web tier YOLO binds Inertia\Ssr\Gateway to its saturation-aware gateway during its own service-provider boot. Container bindings are last-writer-wins, so an app that rebinds Inertia\Ssr\Gateway in its own provider silently drops the load-shedding — the saturation bypass and the render timeout both vanish, with no error, re-opening the death-loop the gateway exists to close. If you need custom SSR behaviour, extend Codinglabs\Yolo\Runtime\Ssr\SaturationAwareSsrGateway and call parent::dispatch() rather than binding the interface fresh.

Turning autoscaling off

Autoscaling is declarative — sync reconciles live state down to what the manifest asks for. Since the key is required you turn it off by setting autoscaling: false (not by removing it); that deregisters the scalable target on the next yolo sync, which cascades the delete to every policy and alarm on it.

Deregistering doesn't drop tasks — the service reverts to a fixed task count frozen at its current live count. Bring it down with yolo scale if you no longer need the extra capacity.

What isn't tagged

Application Auto Scaling targets and policies can't carry tags, so they don't show up in yolo audit — they're reconciled by config (above) rather than by the tag-driven audit.

Manual scaling

yolo scale changes capacity without a build or deploy. Like env:push, it shows a current → new comparison and asks before applying.

```bash
yolo scale production --web --min=3 --max=10     # web autoscaled: set the bounds
yolo scale production --web 3                      # web fixed: set the desired count
yolo scale production --queue --min=0 --max=20     # queue bounds (min 0 = scale to zero)
```

Under autoscaling you set the bounds (--min/--max), never a desired count — the policies own desired count and would override it. Crucially, scale writes the bounds back to the manifest (surgically — your comments and formatting survive), so the manifest stays the single source of truth and the next yolo sync reconciles to the same values rather than clobbering your change.

For a fixed service (autoscaling: false — web or queue) a positional count sets the ECS desired count directly. An autoscaling service (autoscaling: true or a block) only takes --min/--max. The scheduler is a singleton and can't be scaled (--scheduler errors out).

Reducing capacity is guarded

Because the manifest is authoritative, a yolo sync run with a stale manifest could otherwise scale production down — exactly the wrong thing during an incident. So lowering a live bound is gated:

yolo scale (lowering a bound) → an explicit confirm that defaults to no.
yolo sync (interactive) → the reduction shows in the plan and the normal confirm gate guards it; abort and nothing changes.
yolo sync --force / non-interactive → the reduction is refused (skipped + warned). Lowering capacity must be deliberate and attended — an interactive sync or yolo scale. Raises always apply.

So an emergency yolo scale production --web --min=10 is durable: it's written to the manifest and live, and no unattended sync can quietly walk it back.
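The guard rules above reduce to a small decision function. This is a sketch of the behaviour as described, with hypothetical names; it models a single bound and treats "attended" as an interactive sync or a yolo scale run.

```python
def reconcile_bound(live, manifest, attended, confirmed=False):
    """Return the bound that is live after reconciling one min/max value.

    live      -- the value currently applied in AWS
    manifest  -- the value the manifest asks for
    attended  -- True for an interactive run, False for --force / non-interactive
    confirmed -- whether the operator accepted the confirm prompt
    """
    if manifest >= live:
        return manifest   # raises (and no-ops) always apply
    if not attended:
        return live       # unattended reduction: refused, skipped with a warning
    return manifest if confirmed else live


# A stale manifest (min 2) meets an emergency live bound (min 10).
print(reconcile_bound(live=10, manifest=2, attended=False))                 # stays 10
print(reconcile_bound(live=10, manifest=2, attended=True, confirmed=True))  # drops to 2
print(reconcile_bound(live=3, manifest=10, attended=False))                 # raised to 10
```

The asymmetry is deliberate: an unattended run can only ever leave capacity where it is or add to it.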

The queue (scale to zero)

Add a top-level tasks.queue block to give the queue worker its own ECS service, separate from web:

```yaml
tasks:
  web:
    autoscaling: true
  queue:
    autoscaling:
      min: 0          # scale to zero when idle (opt-in; the default floor is 1)
      max: 20
      backlog-per-task: 100
    spot: true        # optional: ~70% cheaper interruptible capacity
```

Like web, the queue must declare autoscaling — true takes the defaults (min: 1, max: 5), false pins a fixed single task. It scales on backlog per task — ApproximateNumberOfMessagesVisible / RunningTaskCount, computed with CloudWatch metric math (no Lambda) and held at backlog-per-task messages per running task. As the backlog grows it scales out toward max; as it drains it scales back in toward min.

With autoscaling.min: 0 the queue scales to zero: no tasks and no compute cost when idle. Target tracking can't lift it off zero (dividing by zero running tasks is undefined), so YOLO also attaches a step-scaling alarm that sets the service to exactly one task the instant a message becomes visible; target tracking owns it from one upward. The cost is a ~30–60s cold start (image pull + boot) on the first message after idle.
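To make the backlog arithmetic concrete, here is an idealised steady-state sketch. It is a simplification under stated assumptions: real target tracking moves towards this count gradually (and scales in conservatively) rather than jumping to it, and the function name is hypothetical.

```python
import math


def queue_desired_count(visible, running, backlog_per_task, min_tasks, max_tasks):
    """Idealised task count for a given backlog, clamped to the bounds."""
    if running == 0:
        # Backlog per task is undefined at zero tasks, so the step alarm
        # takes over: exactly one task as soon as a message is visible.
        wanted = 1 if visible > 0 else 0
    else:
        # Enough tasks that each holds at most backlog_per_task messages.
        wanted = math.ceil(visible / backlog_per_task)
    return max(min_tasks, min(max_tasks, wanted))


# Using the manifest above: min 0, max 20, backlog-per-task 100.
print(queue_desired_count(0, 0, 100, 0, 20))      # idle: stays at zero
print(queue_desired_count(1, 0, 100, 0, 20))      # first message: one task
print(queue_desired_count(450, 1, 100, 0, 20))    # 450 messages: five tasks
print(queue_desired_count(5000, 3, 100, 0, 20))   # capped at max
```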

That makes the choice of where the queue lives a latency decision:

| Topology | Idle cost | Pickup latency | Use for |
| --- | --- | --- | --- |
| Bundled (no `tasks.queue` block) | included in web | instant (worker always warm) | light, latency-sensitive jobs |
| Standalone, `min: 0` | ~$0 | ~30–60s cold start from idle | bursty, latency-tolerant async |
| Standalone, `min: 1+` | one always-on task | instant, then autoscales | high-volume, always-busy |

For multi-tenant apps, a single queue service works the app's default queue; per-tenant queue fan-out is on the roadmap and isn't covered here.

The scheduler

The scheduler (supercronic firing schedule:run every minute) must run as a singleton — if it runs on N tasks, every scheduled job fires N times (N× emails, N× billing, N× reports). The queue is safe to multiply (SQS hands each message to one worker); the scheduler is not. There's no stable per-task identity on Fargate to elect one from, so pick one of two strategies.

1. `->onOneServer()`

Keep the scheduler bundled in its default container (the web container, or the standalone queue if you've extracted one) and add Laravel's onOneServer() to every scheduled task. It takes an atomic lock in the shared cache so only one replica runs each task per minute:

```php
$schedule->command('reports:send')->daily()->onOneServer();
```

This requires a shared lock store (the Valkey/Redis cache YOLO provisions, or a database cache) — which production apps run anyway. It keeps the simple single-service topology and lets the bundled task scale freely.

The catch: the lock is per scheduled task, so every scheduled task needs the annotation. A scheduled task registered by a package (Telescope pruning, backups, etc.) that you can't annotate will still multi-fire, which is your signal to reach for strategy 2.
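The difference between an annotated and an un-annotated scheduled task can be shown with a toy model. This is not Laravel's implementation: the in-memory `LockStore` stands in for the shared cache, and the lock key format is an assumption made for the example.

```python
class LockStore:
    """In-memory stand-in for the shared cache's atomic lock."""

    def __init__(self):
        self._held = set()

    def acquire(self, key):
        if key in self._held:
            return False
        self._held.add(key)
        return True


def runs_in_one_minute(replicas, task_name, minute, store=None):
    """Count how many replicas actually run a scheduled task this minute."""
    runs = 0
    for _ in range(replicas):
        # Without a store the task is un-annotated and every replica runs it.
        if store is None or store.acquire(f"{task_name}:{minute}"):
            runs += 1
    return runs


print(runs_in_one_minute(3, "reports:send", "09:00"))               # 3: fires on every replica
print(runs_in_one_minute(3, "reports:send", "09:00", LockStore()))  # 1: only the lock winner runs
```

A package-registered task behaves like the first call: with no lock to contend for, it fires once per running replica.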

2. Extract the scheduler (recommended once web scales)

Give the scheduler its own service with a top-level tasks.scheduler block:

```yaml
tasks:
  web:
    autoscaling: { min: 1, max: 6 }
  scheduler: {}     # its own pinned-singleton service
```

YOLO pins it at exactly one task (never a scalable target) and deploys it stop-then-start (minimumHealthyPercent: 0 / maximumPercent: 100) so a rollout stops the old cron before starting the new one — a deploy never briefly runs two schedulers (a missed cron minute is harmless; a double-run isn't). This removes the onOneServer() requirement entirely — it's genuinely a singleton now — though leaving onOneServer() on is harmless. The web tier then scales without any scheduler concern.
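The effect of those two deployment percentages is easy to check by hand. The sketch below applies ECS's rounding rules (the minimum rounds up, the maximum rounds down) to a one-task service; the function name is hypothetical.

```python
import math


def rollout_task_bounds(desired, minimum_healthy_percent, maximum_percent):
    """Fewest and most tasks ECS may run during a rolling deployment."""
    lower = math.ceil(desired * minimum_healthy_percent / 100)
    upper = math.floor(desired * maximum_percent / 100)
    return lower, upper


# Scheduler settings: may drop to 0 tasks, never more than 1.
print(rollout_task_bounds(1, 0, 100))
# A typical start-then-stop rollout: never below 1, briefly up to 2.
print(rollout_task_bounds(1, 100, 200))
```

With an upper bound of one task, the old scheduler has to stop before the replacement can start, which is what rules out the double-run.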

TIP

When the scheduler is bundled into a host that runs more than one task — an autoscaling web task, or a standalone queue (both must declare autoscaling) — yolo sync lists an advisory under the plan's Warnings section pointing at these two strategies. It's a nudge, not a gate — YOLO can't see inside your kernel to know whether you've used onOneServer().
