Scaling out with state persistence
When you run Solara across multiple instances, a websocket reconnect can land on an instance that never held the session's kernel. Sticky sessions keep clients pinned to their original instance and remain the routing fast path — but they cannot help when that instance is gone: a crash or OOM kill, an autoscaler scale-in, spot reclamation, an AZ failover, or a rolling deploy whose sessions outlive the drain window.
State persistence is the recovery layer for those cases. Opted-in reactive variables are written to a shared Redis backend; when a reconnect lands on a fresh instance it restores them and the client re-mounts in place. See State persistence and failover recovery for the application-side API and — importantly — the recovery model your app must follow. This page is the operations side.
Single-instance deployments should not enable this. With one process the kernel and its state die together; a networked backend only adds latency and a shorter orphan-cull window.
Enabling Redis
Install the Redis client (imported lazily, only when the backend is redis):
$ pip install redis # or: pip install "solara-ui[redis]"
Generate a strong secret key (do this once and store it in your secrets manager — every instance in the fleet must use the same value):
# Python (always available with Solara):
python -c "import secrets; print(secrets.token_urlsafe(32))"
# ...or with openssl:
openssl rand -base64 32
Then point Solara at your Redis and set the secret:
export SOLARA_STATE_BACKEND=redis
export SOLARA_STATE_URL=redis://localhost:6379/0
export SOLARA_STATE_SECRET_KEYS="paste-the-generated-secret-here"
# recommended hardening: the session cookie is a state-recovery credential, keep it out of JS
export SOLARA_SESSION_HTTP_ONLY=true
The server validates this configuration at startup and refuses to start on a misconfiguration
(missing secrets when a backend is enabled, or SOLARA_STATE_ALLOW_PICKLE=true with default
secrets). It also warns if SOLARA_SESSION_HTTP_ONLY is off while persistence is on.
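The startup rules above can be mirrored in a deploy-time pre-check, so a bad configuration is caught before the rollout rather than by a crashing pod. This is a minimal sketch, not Solara's own validation code: the `preflight` helper is hypothetical, it assumes `redis` is the only value that enables a networked backend, and it assumes the boolean setting accepts `true`/`1`. It cannot check for "default secrets" because those values are not listed on this page.

```python
import os


def preflight(env=None):
    """Return (errors, warnings) for the persistence settings in env.

    Hypothetical helper that mirrors the documented startup rules.
    """
    env = os.environ if env is None else env
    errors, warnings = [], []

    # Assumption: only the redis backend needs these checks.
    if env.get("SOLARA_STATE_BACKEND", "").strip().lower() != "redis":
        return errors, warnings

    # The setting is a comma-separated list; at least one key is required.
    raw_keys = env.get("SOLARA_STATE_SECRET_KEYS", "")
    keys = [k.strip() for k in raw_keys.split(",") if k.strip()]
    if not keys:
        errors.append("SOLARA_STATE_SECRET_KEYS is empty while a backend is enabled")

    # Assumption: "true" and "1" are accepted spellings of an enabled flag.
    http_only = env.get("SOLARA_SESSION_HTTP_ONLY", "").strip().lower()
    if http_only not in ("true", "1"):
        warnings.append("SOLARA_SESSION_HTTP_ONLY is off while persistence is on")

    return errors, warnings


if __name__ == "__main__":
    errs, warns = preflight()
    for message in warns:
        print("WARNING:", message)
    for message in errs:
        print("ERROR:", message)
    raise SystemExit(1 if errs else 0)
```

Run it in the same environment the server will start in (for example as a CI step or an init container) so it sees the same variables.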
Valkey, KeyDB and Dragonfly
The Redis backend speaks the Redis protocol through redis-py, so it works transparently on
Valkey, KeyDB and Dragonfly as well — useful since much managed "Redis" is Valkey after the
license fork. Managed Redis or Redis Sentinel is sufficient; async-replication failover may lose
the last few writes, which is acceptable for a recovery cache. For Redis Cluster, the per-kernel
hash is a single key, so it lives in one slot.
Secret keys and rotation
SOLARA_STATE_SECRET_KEYS signs every stored value with HMAC-SHA-256, verified before anything
is deserialized. It is required and must be non-default whenever a backend is enabled. It is a
dedicated secret — deliberately not your session/OAuth cookie secret — so that one secret does
not span cookie forgery, state tampering and (if you enable pickle) code execution.
Every instance in the fleet must use the same keys. Mismatched keys across replicas are the single most common cause of silent restore failure: cross-instance restores fail HMAC verification and bail out on one side of the fleet only, with no automatic fleet-wide check.
The setting is a comma-separated list, which enables zero-downtime rotation. Verification accepts any listed key; new envelopes are always signed with the first. Rotate in two phases so both old and new instances can verify each other during the roll:
- Add-new-verify-only: generate a new key (python -c "import secrets; print(secrets.token_urlsafe(32))") and prepend it, keeping the old: SOLARA_STATE_SECRET_KEYS="new-secret,old-secret". New writes are signed with new-secret; envelopes signed with old-secret still verify. Roll this out to the whole fleet.
- Drop-old: once every instance signs with the new key and old-signed envelopes have aged out (past the TTL), set SOLARA_STATE_SECRET_KEYS="new-secret".
Restores are rotation-safe end to end: state written under the old key still authenticates against the new key set (verify-any) and is migrated to the new primary on its next write, so promoting a new key does not orphan in-flight sessions.
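The sign-with-first, verify-any rule is easy to see in code. The sketch below uses only Python's standard `hmac` and `hashlib` modules; the function names are illustrative, and the real envelope format (what exactly is signed and how the signature is stored) is not described on this page, so treat this as a model of the rule rather than Solara's implementation.

```python
import hashlib
import hmac


def parse_keys(setting):
    """Split a comma-separated key list. The first key is the signing key."""
    return [k.strip().encode() for k in setting.split(",") if k.strip()]


def sign(payload, keys):
    """Sign with the first (primary) key only."""
    return hmac.new(keys[0], payload, hashlib.sha256).digest()


def verify(payload, signature, keys):
    """Accept the signature if any listed key produces it."""
    for key in keys:
        expected = hmac.new(key, payload, hashlib.sha256).digest()
        # compare_digest avoids leaking the match position through timing.
        if hmac.compare_digest(expected, signature):
            return True
    return False


before = parse_keys("old-secret")
phase_one = parse_keys("new-secret,old-secret")
phase_two = parse_keys("new-secret")

old_envelope = sign(b"payload", before)
print(verify(b"payload", old_envelope, phase_one))  # True: old key is still listed
print(verify(b"payload", old_envelope, phase_two))  # False: old key was dropped
```

The second result is why the drop-old phase waits for old-signed envelopes to age out past the TTL.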
Redis memory and eviction policy
The recovery backend must never double as a shared allkeys-lru cache. Under memory pressure
an LRU policy would silently evict live session state — the invisible worst case.
Configure the Redis instance with:
- maxmemory-policy noeviction: refuse writes rather than evict live sessions. Solara already puts a TTL on every key, so abandoned sessions are reclaimed without eviction.
- A memory alert well below maxmemory, so you scale before noeviction starts rejecting writes (a rejected write degrades gracefully, see Degraded mode, but it is a signal to size up).
- A dedicated Redis instance (or at least a dedicated database / key namespace via SOLARA_STATE_PREFIX), scoped with Redis AUTH/ACL, TLS across networks, and never exposed to the internet. Persisted state may contain PII; treat it as sensitive.
Solara logs the server's maxmemory-policy at startup so a misconfigured allkeys-lru is visible.
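If you prefer to check the policy yourself before deploying, redis-py exposes `CONFIG GET` as `config_get`. The following is a sketch under assumptions: the helper name is made up, it expects the reply as a plain dict of strings (redis-py's usual shape for this command), and some managed services disable `CONFIG` entirely, in which case the call raises and you need the provider's console instead.

```python
def eviction_policy_problem(client):
    """Return a description of the problem, or None if the policy is fine.

    client is any object with redis-py's config_get(pattern) -> dict method.
    """
    reply = client.config_get("maxmemory-policy")
    policy = reply.get("maxmemory-policy")
    if policy != "noeviction":
        return f"maxmemory-policy is {policy!r}, expected 'noeviction'"
    return None


# Usage against a real server (requires the redis package):
#   import os, redis
#   client = redis.Redis.from_url(os.environ["SOLARA_STATE_URL"])
#   print(eviction_policy_problem(client) or "ok")
```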
Sizing
A rule of thumb: memory ≈ concurrent sessions × opted-in state per session. With the JSON-first codec, opted-in state is typically tens of KB per session, so 10,000 concurrent sessions is on the order of hundreds of MB worst case. Large dataframes do not belong here — persist a reference (an id, a URL) and recompute the frame from your database on restore.
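As a worked instance of the rule of thumb, the arithmetic for a given fleet looks like this. The 30 KB per-session figure is an illustrative value inside the "tens of KB" range, not a measurement; use the `sync_bytes_total` and `sync_by_kernel` numbers from /resourcez to find your own.

```python
def estimate_redis_bytes(concurrent_sessions, state_bytes_per_session):
    """Rule of thumb: memory is about sessions times opted-in state per session."""
    return concurrent_sessions * state_bytes_per_session


estimate = estimate_redis_bytes(10_000, 30_000)  # 10,000 sessions at 30 KB each
print(f"{estimate / 1e6:.0f} MB")  # 300 MB
```

Leave headroom above the estimate so that the memory alert fires well before `maxmemory` is reached.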
Two guard rails bound a single runaway variable: a value whose serialized envelope exceeds
SOLARA_STATE_WARN_VALUE_BYTES (default 1 MB) is logged, and one exceeding
SOLARA_STATE_MAX_VALUE_BYTES (default 5 MB) is skipped — the rest of that session's state
still persists, and the skip is counted as sync_oversize_dropped on /resourcez. So one oversize
reactive can never fill Redis or spike server memory; it simply will not be restored. Set
SOLARA_STATE_MAX_VALUE_BYTES=0 to disable the hard cap.
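The two thresholds amount to a three-way decision per serialized value. A minimal sketch of that decision follows, assuming 1 MB and 5 MB mean 1,000,000 and 5,000,000 bytes (the page does not say whether these are decimal or binary units) and using a hypothetical `classify` helper:

```python
WARN_BYTES = 1_000_000  # SOLARA_STATE_WARN_VALUE_BYTES default (assumed decimal MB)
MAX_BYTES = 5_000_000   # SOLARA_STATE_MAX_VALUE_BYTES default; 0 disables the cap


def classify(envelope_size, warn_bytes=WARN_BYTES, max_bytes=MAX_BYTES):
    """Return 'skip', 'warn' or 'ok' for one serialized envelope."""
    # A max of 0 means the hard cap is disabled, so nothing is skipped.
    if max_bytes and envelope_size > max_bytes:
        return "skip"  # not persisted; counted as sync_oversize_dropped
    if envelope_size > warn_bytes:
        return "warn"  # persisted, but logged
    return "ok"


for size in (20_000, 2_000_000, 6_000_000):
    print(size, classify(size))
```

A value that lands in `warn` regularly is a good candidate for persisting a reference instead of the value.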
Lifetime: TTL versus culling
Two independent timers govern how long state lives:
- Backend TTL (SOLARA_STATE_TTL, default the 24h kernel.cull_timeout) is refreshed on every write and on every connect. This is how long a session's state survives in Redis after the last activity, and therefore how long a late reconnect can still restore.
- Orphan cull (SOLARA_STATE_ORPHAN_CULL_TIMEOUT, default 5m) is how long a disconnected in-memory kernel is kept alive on an instance before it is culled. With a shared backend there is no reason to hold an orphaned kernel in memory for 24h, because its state is safe in Redis.
The honest trade: the shortened orphan cull is exactly what reduces the same-instance
live-kernel window from 24h to ~5m. A slow reconnect within 5m (to the same instance) still gets
everything, including non-persisted state. A reconnect after 5m gets only the opted-in state
back — which is the intended behavior for multi-instance, and precisely why the recovery model
matters. (For single-instance / memory-backend setups this shortened cull does not apply; state
and kernel die together, so shortening would only lose state.)
State is deleted from Redis only on a genuine tab close (a fenced delete). Culls, supersessions and server shutdown flush-and-leave the state for the TTL to reclaim — so the state survives the exact events (deploys, scale-in) the feature exists for.
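Taken together, the two timers decide what a reconnect gets back. The sketch below models that decision with the default values. It is a simplification: it uses one elapsed time for both timers (in practice the cull counts from the disconnect and the TTL from the last write or connect), the boundary comparisons are an assumption, and the function name is illustrative.

```python
from datetime import timedelta

ORPHAN_CULL = timedelta(minutes=5)  # SOLARA_STATE_ORPHAN_CULL_TIMEOUT default
STATE_TTL = timedelta(hours=24)     # SOLARA_STATE_TTL default


def reconnect_outcome(elapsed, same_instance,
                      orphan_cull=ORPHAN_CULL, ttl=STATE_TTL):
    """Classify what a reconnect finds after `elapsed` time away."""
    if same_instance and elapsed < orphan_cull:
        return "live-kernel"  # everything, including non-persisted state
    if elapsed < ttl:
        return "restored"     # opted-in state only, read from the backend
    return "fresh"            # TTL expired, nothing left to restore


print(reconnect_outcome(timedelta(minutes=2), same_instance=True))   # live-kernel
print(reconnect_outcome(timedelta(minutes=2), same_instance=False))  # restored
print(reconnect_outcome(timedelta(hours=25), same_instance=True))    # fresh
```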
Graceful shutdown
The final best-effort flush on shutdown only runs if uvicorn's lifespan teardown runs — and
uvicorn waits for connections indefinitely by default, so a lingering websocket can stall the
drain until Kubernetes SIGKILLs the pod and nothing flushes. Set uvicorn's
--timeout-graceful-shutdown (comfortably below your terminationGracePeriodSeconds) so the
drain is bounded; the shutdown flush is one bounded, batched pass over all sessions.
A clean tab-close, client-initiated close, observed TCP reset and lifespan shutdown all get a final flush. An OOM kill, SIGKILL, spot reclamation or a silent half-open TCP produce no observed disconnect and so get no final flush — their loss window is the debounce interval at best. This is the at-most-once guarantee stated plainly.
Observability
Solara has no separate metrics system; state-persistence health is exposed on the existing
/resourcez endpoint under a state block (no backend I/O; it answers "is the feature on right now?"):
"state": {
"status": "healthy", // off | healthy | degraded
"circuit_breaker": "closed", // closed | half_open | open
"restore_attempts": 128,
"restore_success": 120,
"restore_bailout": 0,
"restore_miss": 8,
"restore_schema_reset": 0,
"flush_ok": 4210,
"flush_rejected": 0,
"flush_failures": 0,
"breaker_transitions": 0,
"superseded_closes": 2,
"superseded_while_connected": 0,
"backend_last_ok_age_seconds": 0.4,
"backend_last_error": null,
"sync_count": 8420, // fields written and ACKed
"sync_bytes_total": 1264000, // envelope bytes actually written
"sync_mb_total": 1.264,
"restore_bytes_total": 52000, // envelope bytes read back on restores
"sync_keys_dropped": 0, // per-key table overflow (capped at 500, then "(other)")
"sync_kernels_dropped": 0,
"sync_by_key": [ // top keys by bytes: WHICH VARIABLE costs the most
{"key": "myapp.filters", "syncs": 4200, "bytes": 940000, "bytes_per_sync": 223}
],
"sync_by_kernel": [ // top kernels by bytes: which session syncs the most
{"kernel": "3308f3b8", "syncs": 120, "bytes": 26000, "bytes_per_sync": 216}
]
}
The numbers that matter most: the restore success ratio after a rolling deploy is the number
that says the feature works; superseded_while_connected is the signature of broken stickiness,
cross-instance multi-tab, or an attack, and should be loud; status / circuit_breaker tell you
whether the feature quietly turned itself off.
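A small script can turn those fields into a post-deploy check. This is a sketch under assumptions: it takes the already-parsed `state` block as a dict, defines the success ratio as `restore_success / restore_attempts` (the page does not give a formula), and uses an arbitrary 0.9 threshold that you should tune. The function names are illustrative.

```python
def restore_success_ratio(state):
    """Return restore_success / restore_attempts, or None with no attempts."""
    attempts = state.get("restore_attempts", 0)
    if attempts == 0:
        return None
    return state.get("restore_success", 0) / attempts


def health_findings(state, min_ratio=0.9):
    """List human-readable problems found in a /resourcez state block."""
    findings = []
    if state.get("status") == "degraded" or state.get("circuit_breaker") != "closed":
        findings.append("persistence is degraded or the breaker is not closed")
    if state.get("superseded_while_connected", 0) > 0:
        findings.append("superseded_while_connected is non-zero")
    ratio = restore_success_ratio(state)
    if ratio is not None and ratio < min_ratio:
        findings.append(f"restore success ratio {ratio:.2f} is below {min_ratio}")
    return findings


sample = {
    "status": "healthy",
    "circuit_breaker": "closed",
    "restore_attempts": 128,
    "restore_success": 120,
    "superseded_while_connected": 0,
}
print(restore_success_ratio(sample))  # 0.9375
print(health_findings(sample))        # []
```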
The sync tables answer "are we writing too much?": sync_by_key aggregates per persisted
variable across all kernels (a variable with a large bytes_per_sync is a candidate for
persisting a reference instead of the value), sync_by_kernel shows which session writes the
most (kernel ids are truncated to an 8-character prefix — enough to grep the full id in the
solara.state log lines). Both tables show the top 10 by default and the top 100 with
/resourcez?verbose=1. Only ACKed writes are counted, so retries never double-count.
Persist key names can embed identifiers (e.g. key=f"user:{id}:filters"), so in production
/resourcez hashes the sync_by_key labels ("key": "sha256:…") — the byte/count numbers stay
visible, only the identifier is hidden. To see the raw labels for live debugging, generate a token
and set it:
# generate once, store as a secret:
export SOLARA_RESOURCEZ_TOKEN="$(python -c 'import secrets; print(secrets.token_urlsafe(32))')"
# then read the un-redacted breakdown:
curl -H "Authorization: Bearer $SOLARA_RESOURCEZ_TOKEN" "https://your-host/resourcez?verbose=1"
In non-production mode the labels are shown in full and no token is needed. A missing or wrong
token simply falls back to the hashed view (never an error), and /readyz is never gated.
/readyz is deliberately backend-independent — gating readiness on a recovery cache would turn
a Redis blip into a fleet-wide NotReady. Backend health appears only on /resourcez.
Structured logs
The solara.state logger emits one greppable, alertable line per event:
restore result=success|timeout|bailout|miss|fresh-schema kernel=… key=… cause=hmac|codec
flush result=ok|rejected|error kernel=… n_fields=…
breaker transition=open|half_open|closed reason=…
close reason=page-close|cull|superseded|server-shutdown|evicted deleted=true|false
Alert on: a rising restore result=bailout rate, any breaker transition=open, and a rising rate
of close reason=superseded.
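Because each line is an event name followed by key=value pairs, it can be parsed without a log-processing stack. The sketch below assumes you feed it the message part only (no timestamp or logger prefix) and that values contain no spaces, which may not hold for free-text fields such as reason. Both helper names are made up for this example.

```python
def parse_state_log(message):
    """Split 'event k1=v1 k2=v2' into (event, {k1: v1, k2: v2})."""
    event, _, rest = message.strip().partition(" ")
    fields = {}
    for token in rest.split():
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
    return event, fields


def alert_category(event, fields):
    """Map a parsed line to one of the three alertable categories, else None."""
    if event == "restore" and fields.get("result") == "bailout":
        return "restore-bailout"   # alert on a rising rate
    if event == "breaker" and fields.get("transition") == "open":
        return "breaker-open"      # alert on any occurrence
    if event == "close" and fields.get("reason") == "superseded":
        return "close-superseded"  # alert on a rising rate
    return None


line = "restore result=bailout kernel=3308f3b8 key=myapp.filters cause=hmac"
event, fields = parse_state_log(line)
print(event, fields["cause"], alert_category(event, fields))
```

Count the returned categories per time window; only `breaker-open` is worth paging on a single occurrence.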
Runbook
"Users see refresh dialogs after a deploy." Expected once iff the client bundle (served JS/assets) changed — the browser genuinely needs the new assets. If the bundle did not change, this is an incident:
- Check restore_bailout on /resourcez. A spike means envelopes are failing verification/decode.
- Check that all replicas share the same SOLARA_STATE_SECRET_KEYS, the #1 real cause. There is no automatic fleet-level check; verify by hand.
- Check for SOLARA_STATE_SCHEMA_TAG stragglers (an instance on an old tag). A schema mismatch is a graceful reset (soft-remount, no dialog), so a dialog points at bailout or secrets, not the tag.
"State is not restoring (fresh start, no dialog)."
- /resourcez state block first: status and circuit_breaker. degraded / open means the backend is unreachable and Solara has degraded to a stateless fresh start (see below).
- Backend reachability from the instances (redis-cli -u $SOLARA_STATE_URL ping).
- Secret-key uniformity across replicas.
- Whether the affected variable actually had a stable explicit key=; a renamed or derived key resets state by design.
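Both runbook entries ask you to verify secret-key uniformity by hand. One way to do that without printing the secrets is to compare short fingerprints per replica. This is a hypothetical helper, not a Solara feature, and it is only reasonable for high-entropy generated keys (a truncated hash of a weak, guessable secret could still be brute-forced).

```python
import hashlib
import os


def key_fingerprints(setting):
    """Return a short SHA-256 fingerprint per key, preserving list order."""
    return [
        hashlib.sha256(key.strip().encode()).hexdigest()[:12]
        for key in setting.split(",")
        if key.strip()
    ]


if __name__ == "__main__":
    # Run on every replica and compare the output line by line.
    for index, fingerprint in enumerate(
        key_fingerprints(os.environ.get("SOLARA_STATE_SECRET_KEYS", ""))
    ):
        role = "signing" if index == 0 else "verify-only"
        print(index, role, fingerprint)
```

Order matters as well as membership: the first key is the one new envelopes are signed with, so two replicas with the same keys in a different order sign differently.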
Degraded mode
Redis is a recovery cache, not a database: when it is down, Solara degrades to the behavior it has
without persistence, a stateless fresh start. A per-process circuit breaker opens after SOLARA_STATE_BREAKER_FAILURES consecutive
backend errors and stays open for SOLARA_STATE_BREAKER_WINDOW before a single half-open probe.
While open, connects skip the takeover read instantly (they do not each pay the connect timeout
during a brownout) and writes are skipped — so a Redis outage never taxes the interaction path.
Writes are off-thread, and keys stay dirty until acknowledged, so recovery is complete once the
backend returns. Every breaker transition is logged and counted.
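The breaker described above follows the usual closed / open / half-open pattern. The sketch below is a generic, single-threaded illustration of that pattern, not Solara's implementation: the class and method names are assumptions, and `failures` and `window` stand in for SOLARA_STATE_BREAKER_FAILURES and SOLARA_STATE_BREAKER_WINDOW.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures, then allows one probe per window."""

    def __init__(self, failures, window, clock=time.monotonic):
        self.failures = failures  # consecutive errors before opening
        self.window = window      # seconds to stay open before probing
        self.clock = clock        # injectable so tests can control time
        self.state = "closed"
        self._consecutive = 0
        self._opened_at = 0.0

    def allow(self):
        """Return True if a backend call may be attempted right now."""
        if self.state == "closed":
            return True
        if self.state == "open" and self.clock() - self._opened_at >= self.window:
            self.state = "half_open"  # exactly one probe gets through
            return True
        return False  # still open, or a probe is already in flight

    def record_success(self):
        self._consecutive = 0
        self.state = "closed"

    def record_failure(self):
        self._consecutive += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state == "half_open" or self._consecutive >= self.failures:
            self.state = "open"
            self._opened_at = self.clock()
```

Callers check `allow()` before touching the backend and skip the read or write instantly when it returns False, which is what keeps an outage off the interaction path. A real implementation would also need a lock, since writes happen off-thread.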
See the full configuration reference for all SOLARA_STATE_* settings.