State persistence and failover recovery
The problem
A Solara virtual kernel is stateful: each browser tab owns a kernel on one server instance, holding its widget tree, reactive values and hook state. Behind a load balancer a websocket reconnect can land on a different instance, which has no kernel for that kernel id. Today the client then shows the blocking "Could not restore session. Please refresh." dialog and all state is lost.
Sticky sessions keep a client pinned to its original instance and remain the routing fast path. But stickiness cannot cover the cases where the original instance is gone: a pod crash or OOM kill, an autoscaler scale-in, spot-instance reclamation, an availability-zone failover, or a rolling deploy whose sessions outlive the drain window.
State persistence is the recovery layer for exactly those cases. You opt selected reactive variables into a shared backend (Redis). When a reconnect lands on a fresh instance, that instance restores the opted-in values from the backend and the client re-mounts in place — same URL, opted-in state intact, no refresh dialog.
When to enable it
- Multi-instance deployments where sessions can outlive the instance that created them (Kubernetes, autoscaling groups, rolling deploys, spot instances).
When not to enable it
- Single-instance deployments. With one process the kernel and its state die together; there is nothing on another instance to recover to, and enabling a networked backend only adds latency and a shortened orphan-cull window (see the deployment guide). Keep it off (the default).
Persistence is opt-in per reactive variable. Most state is derived, or cheaply recoverable from a database, and much of it does not serialize well. You declare what actually matters.
Enabling persistence
Persistence is off until you configure a backend. In production that backend is Redis — see
Scaling out with state persistence
for the full Redis, security and ops story. For local development you can use the in-process
memory backend, which stores the same signed envelope bytes Redis would (see
Testing failover in development below).
The minimum to turn it on:
export SOLARA_STATE_BACKEND=redis
export SOLARA_STATE_URL=redis://localhost:6379/0
# generate a strong secret once (same value on every instance); REQUIRED whenever a backend is enabled
export SOLARA_STATE_SECRET_KEYS="$(python -c 'import secrets; print(secrets.token_urlsafe(32))')"
A full configuration reference is at the bottom of this page.
The API
Persistence is a keyword on the existing solara.reactive:
import solara
count = solara.reactive(0, persist=True, key="myapp.count")
persist=True opts the variable in; key= is its stable, cross-process persistence key.
The same key must resolve to the same variable across every instance and every deploy, so an
explicit key= is the norm for anything meant to be long-lived (it survives file moves and
renames).
Derived keys (prototype convenience)
If you omit key=, Solara derives one from the definition site — but only for the two
statically unambiguous patterns:
import solara
# module-level single-name assignment -> key "myapp.pages:count"
count = solara.reactive(0, persist=True)
class Settings:
# class attribute -> key "myapp.pages:Settings.theme"
theme = solara.reactive("light", persist=True)
Derived keys are name-only (no line numbers), so reformatting is safe. But renaming or moving
the variable changes the derived key, which discards the old state (a clean reset — the same
semantics as a schema-tag bump, below). Each auto-derived key is logged once at INFO
(derived persistence key "myapp.pages:count"), so a rename-induced reset is greppable. Use
derived keys for prototyping; switch to an explicit key= before you rely on the state.
The factory refusal
Anything ambiguous — a reactive created inside a function, factory, loop, conditional, a
tuple/attribute target, or a source-less environment — raises ValueError demanding an
explicit key, rather than silently guessing. This is deliberate: the naive "fix" of a constant
key inside a factory would make every user session share one backend field, disclosing one
user's state to another on restore.
def make_query_state():
return solara.reactive("", persist=True) # raises: no stable per-instance key
The error names the definition site and shows the correct per-instance fix:
persist=True requires an explicit key= here: a reactive created inside a function or
factory produces one instance per call; instances cannot share a key.
Definition site: myapp/state.py:42: `return solara.reactive("", persist=True)`
Give each instance a unique key, e.g.: solara.reactive(..., persist=True, key=f"user:{user_id}:query")
NEVER use a constant key for reactives created per-instance - instances would overwrite each other's state.
The store pattern
Solara's store pattern
creates reactives inside __init__, which the derivation refuses (one store instance per call).
Give each reactive a per-instance key built from an id you already have:
import solara
class UserQueryStore:
def __init__(self, user_id: str):
prefix = f"user:{user_id}"
self.query = solara.reactive("", persist=True, key=f"{prefix}:query")
self.page = solara.reactive(0, persist=True, key=f"{prefix}:page")
The user_id discriminator is what keeps two users' stores in two different backend fields.
Never use a constant key here.
PersistConfig
persist=True is sugar for persist=solara.PersistConfig(key=..., serializer="json"). Pass a
PersistConfig when you need a non-default serializer:
import solara
filters = solara.reactive(
{"country": "NL", "year": 2024},
persist=solara.PersistConfig(key="myapp.filters", serializer="json"),
)
PersistConfig has exactly two fields: key and serializer. An explicit key= argument
always wins over PersistConfig.key.
Serialization
Every persisted value is serialized by a codec and wrapped in an HMAC-signed envelope that is verified before anything is ever deserialized.
The default JSON codec
The default "json" codec is strict but coerces the types that come straight out of real
widget, filter and form state. It round-trips:
| Type | Result |
|---|---|
None, bool, int, float, str | round-trips faithfully |
list, dict (with the members below) | round-trips faithfully |
set, frozenset | round-trips faithfully |
datetime.datetime / date / time | round-trips faithfully (ISO format) |
uuid.UUID, decimal.Decimal | round-trips faithfully |
bytes | round-trips faithfully (base64) |
enum.Enum (incl. IntEnum / StrEnum) | round-trips faithfully |
Pydantic BaseModel (v2, v1 fallback) | round-trips faithfully (self-describing, see below) |
dataclass instances (incl. nested) | round-trips faithfully (self-describing, see below) |
tuple | coerced to list (JSON has no tuple type) |
bytearray | coerced to bytes |
numpy integer / float scalar | coerced to Python int / float |
Anything else raises a SerializeError. The strict codec fails deterministically on the first
write, so you hit it immediately in development, never silently in production. For exotic
objects, register a custom codec (see below).
Pydantic models and dataclasses
Models and dataclasses are self-describing: the envelope records the class
(module:qualname) with the value, so deserialization never needs a type declared anywhere —
which is exactly what makes the common Optional[Model] pattern work:
user = solara.reactive(None, persist=True, key="myapp.user") # holds None or a User
When a User lands in the reactive, the envelope tags it; on restore the class is imported,
checked (only BaseModel subclasses and dataclasses are ever instantiated — never an arbitrary
class), and validated with model_validate (pydantic) or reconstructed field-by-field
(dataclasses; init=False fields are recomputed by __post_init__, not stored). Nested
dataclasses, enums and dates inside models all round-trip. Containers of models work too:
a reactive holding [User, User] — or models nested in dicts/lists at any depth — restores
real model instances, because every element carries its own tag (a tuple of models comes
back as a list of models, the usual tuple caveat). Pydantic semantics are preserved
exactly: a typed field (members: list[User]) reconstructs its models, an untyped list field
round-trips to dicts — the same as pydantic's own model_validate(model_dump()). (Untyped
dataclass fields do reconstruct models — dataclass fields go through solara's recursive
tagging rather than model_dump().)
Two caveats:
Renaming or moving a persisted class breaks old envelopes (the stored
module:qualnameno longer resolves) — the restore bails out and the user gets the refresh dialog. BumpSOLARA_STATE_SCHEMA_TAGwhen you rename model classes for a clean reset instead.Model shape changes follow pydantic rules: added optional fields are compatible; a new required field makes old envelopes fail validation (bail-out). Breaking shape changes are what the schema tag is for. Validators and
__post_init__run during restore, so keep them side-effect-free.Watch out for the
tuple->listround trip. A reactive holding atuplerestores as alist. If your code branches onisinstance(x, tuple), normalize on read or hold alist.In-place mutation is not persisted. With mutation detection off (the production default), mutating a persisted value in place (
items.value.append(...)) fires no change and is therefore never persisted. The existing "re-assign, don't mutate" rule becomes load-bearing here:items.value = [*items.value, x].
Custom codecs
Register a (dumps, loads) pair under a name and select it with serializer=. Use this for
types the JSON codec does not cover, or when you want rename-robust envelopes (a custom codec
names its class in code, not in the stored data, so renaming the class cannot break old
envelopes as long as the codec keeps up):
import json
import solara
import solara.state
solara.state.register_codec(
"geo",
lambda value: value.to_wkb(),
lambda blob: shapely.from_wkb(blob),
)
area = solara.reactive(
DEFAULT_AREA, persist=solara.PersistConfig(key="myapp.area", serializer="geo")
)
The pickle codec
serializer="pickle" is fast and general but version-skew fragile, and — because pickle
executes arbitrary code on load — an RCE risk if the backend is ever compromised. It is
therefore behind a deployer-side gate: it raises unless the deployment sets
SOLARA_STATE_ALLOW_PICKLE=true. Application or library code cannot silently opt a deployment
into a pickle-deserialize path. The HMAC envelope narrows the exposure ("Redis write access"
alone is not enough — an attacker also needs the app secret), but JSON or a custom codec is the
safer default. Prefer them.
Tightening JSON decode (optional)
The default JSON codec restores tagged dataclasses / pydantic models / enums by resolving the
class named in the (HMAC-verified) envelope, gated to that kind. If you want a stricter boundary —
so that even an attacker holding the secret can only reconstruct types you explicitly allow — call
solara.state.allow_decode_types("myapp.models:Renter", "myapp.models:Unit", ...) at startup.
Once any type is registered the allow-list is enforced; leaving it empty keeps the default
(kind-gated) behavior, which breaks nothing.
The recovery model
This is the heart of using persistence well. Persistence restores your opted-in reactives; everything else must be re-derivable.
The invariant
After a failover, your app is re-run from scratch as a pure function of:
- the URL / route (preserved by the client on reconnect),
- the persisted reactives (restored before the first render), and
- external sources of truth (your database, object store, APIs).
Everything else re-derives automatically, because the re-mount re-runs the whole component
tree: use_memo recomputes, use_task / use_effect re-fire, Computed recomputes. A task
keyed on persisted inputs recomputes its dataframe from the database on the new instance exactly
as it did on the original cold start. Recovering "the rest" is not a special recovery hook — it
is the normal first render doing its normal job.
The discipline in one line: persist inputs, recompute outputs. Never persist what a task or memo derives, and never hold non-derivable truth in a non-persisted reactive.
Dependent selections: the orphaned-selection trap
A selection that indexes into a derived list — Continent → Country → City — is where
persist=True on the wrong reactive produces the subtlest failure: a restore that reports
success and is still wrong.
# WRONG
continent = solara.reactive("") # not persisted (looks harmless)
@solara.lab.computed
def country_options() -> list[str]: # derived: recomputed on the new kernel
return countries_in(continent.value) # continent == "" -> []
country = solara.reactive("", persist=True, key="geo.country") # persisted leaf into a non-persisted set
After a failover continent is back at "", country_options recomputes to [], and the
restore installs country = "France" — a value not in the list it was chosen from. Depending on
the consumer you get a select rendering a phantom value absent from its items, an
options.index(country.value) raising ValueError, or a widget that validates/clamps on mount
and silently rewrites the persisted value. lastRestore.status is "success" throughout —
the persistence layer did its job; the set of keys you chose was not self-consistent.
The principle: persisted keys are restored onto a blank kernel together, and alone. Persisting a selection obligates you to (a) make the set it selects from reconstructible on that blank kernel, and (b) re-validate the selection against the recomputed set on restore (the new instance may also see newer data). A decision ladder:
- Persist the leaf identifier only; derive the parents upward. If the leaf has a stable
unique id,
country/continentare a pure lookup from it, so a restore cannot be internally inconsistent — and it is fewer persisted keys, not more:
city_id = solara.reactive(None, persist=True, key="geo.city_id")
@solara.lab.computed
def country() -> str | None:
return city_lookup(city_id.value).country if city_id.value is not None else None
This is the opposite of the reflex "then I must persist the continent too… and the list… and
it recurses." Persisting the whole chain works but is usually the worst fix: each key flushes
on its own debounce window, so a continent changed at t and a
country picked at t+2s are separate writes — a failover between them restores a torn pair —
and you still owe the re-validation. Never persist a selection as an object embedding its
parents (a stale copy), nor as a list index (drifts when the list reorders).
- When upward derivation is impossible, persist the top input and recompute + re-validate downward — reset an orphaned value, never render it:
continent = solara.reactive("", persist=True, key="geo.continent")
country = solara.reactive("", persist=True, key="geo.country")
@solara.component
def CountrySelect():
options = country_options.value # recomputed from the restored continent
if country.value and country.value not in options:
country.set("") # orphaned by restore or a data change: reset
...
The guard lives in the component (not in a restore hook) because an orphaned selection is not failover-specific — any data refresh can invalidate a list under its selection.
- Never persist the derived list (
country_options) — recompute it; a persisted copy can disagree with what the new instance computes from fresh data, and it may be large.
"But the derivation is expensive." Persisting the expensive output is still the wrong
answer — a failover can restore a stale output that disagrees with its recomputed inputs, with no
signal. Let it recompute on the new instance (a failover is a cold start, which already pays that
cost — the examples/state-persistence-failover/ demo recomputes deliberately), or cache it in a
store keyed on and invalidated by its inputs (a cache, not persist). Persisting the output
plus a persisted hash of its inputs, discarded on mismatch, is a last resort — that is
hand-rolling the same cache inside the state envelope.
Litmus test: if you restore only the persist=True keys onto a blank kernel, is each one
valid on its own — or does it depend on a sibling you did not persist? If it names a sibling,
derive that sibling from the persisted key (rung 1) or persist the root input and add the
re-validation guard (rung 2).
The honest residue
Be aware of what does not come back for free:
- Non-derivable state you did not opt in is lost, by definition. The fix is to opt it in (or save it continuously to a database). No framework can conjure it back.
- Mid-flight external side effects. A wizard fired the step-3 email, state says step 3, but
failover happened mid-step. State is at-most-once; external effects need app-level
idempotency. Use
solara.state_generation()as a fencing token for the strict cases. - In-flight input at the moment of the drop (the keystroke you were typing) is lost.
- A background computation lost mid-run re-runs from scratch on re-mount — fine as long as it is pure/idempotent.
Fencing external side effects
solara.state_generation() returns a monotonically increasing epoch for this kernel (it needs
no backend round trip). It increases on every failover, and returns None when persistence is
off or this instance does not own the state. Pass it into your own external writes so that a
superseded instance — an old kernel whose background task comes back to life after another
instance took over — cannot re-apply a stale effect:
import solara
order_id = solara.reactive("", persist=True, key="checkout.order_id")
def charge():
epoch = solara.state_generation() # increases on failover; None when unowned
payment.charge(
order_id=order_id.value,
idempotency_key=order_id.value, # stable across a re-run -> charged at most once
fencing_epoch=epoch, # lets the payment service reject an older instance
)
The idempotency key (derived from persisted state) makes a re-run safe; the fencing epoch lets your external system reject a write from a superseded, lower-epoch instance.
Failure semantics
Two outcomes are treated deliberately differently on reconnect:
- Schema reset (expected, graceful). If the persisted value shape changed — a live redeploy,
a new Solara version, or an explicit
SOLARA_STATE_SCHEMA_TAGbump — the whole stored hash is discarded, the kernel starts fresh, and the client soft-remounts into fresh state. Live redeploys "just work", with a clean state reset. No dialog. - Bail-out (unexpected, visible). If any restored envelope fails HMAC verification or decoding, restore is all-or-nothing: it discards everything, deletes the poisoned hash (so a bookmarked kernel id cannot loop forever), and the client shows the refresh dialog. A partially restored app — some variables restored, others silently at defaults — is a state no author ever tested; a visible "we lost, please refresh" is more honest than a silently inconsistent app.
Because writes are debounced and best-effort, the guarantee is at-most-once: on failover you get your state back as of at most the flush-debounce window before the disconnect — when the disconnect is observed or the shutdown is graceful. A hard kill or a silent network drop can lose a little more.
Testing failover in development
Static verification is impossible — "can my app recover?" has the same status as "is my app correct after F5?" It is achieved through discipline plus cheap, continuous testing.
You can exercise real failover in a single solara run process, with zero infrastructure,
using the in-process memory backend plus a dev-only eviction route:
export SOLARA_STATE_BACKEND=memory
export SOLARA_STATE_SECRET_KEYS="dev-secret" # required for any backend, including memory
export SOLARA_STATE_TEST_EVICTION=true # dev/test only; refused in production
solara run myapp.py
Then, from the browser devtools console, run:
solara.debug.simulateFailover()
This evicts the in-memory kernel and drops the socket, so the reconnect behaves exactly like landing on a fresh instance — only backend state survives, and restore runs for real.
The documented practice: walk every page of your app, trigger a failover at each interesting state, and verify what you see. Cheap enough to become habit, like checking hot-reload. If a value you expected is gone, opt it in (or make sure it re-derives from URL + persisted inputs + database).
Stopping user threads cleanly
If you spawn your own threads, capture the kernel's closed event inside the kernel context at spawn time and stop when it fires. This covers every close reason — a supersession or cull, not just a tab close:
import threading
import solara
def start_worker():
closed = solara.kernel_closed_event() # capture inside the kernel context
def run():
while not closed.is_set():
do_some_work()
closed.wait(timeout=1.0)
threading.Thread(target=run, daemon=True).start()
Configuration reference
Every setting is read from an environment variable with the SOLARA_STATE_ prefix (or a .env
file).
| Environment variable | Default | Meaning |
|---|---|---|
SOLARA_STATE_BACKEND | "" (disabled) | Backend name: redis, memory, or empty to disable persistence. |
SOLARA_STATE_URL | "" | Backend DSN, e.g. redis://localhost:6379/0. |
SOLARA_STATE_SECRET_KEYS | "" | Comma-separated HMAC keys (verify-any, sign-first). Required and must be non-default whenever a backend is enabled. |
SOLARA_STATE_ALLOW_PICKLE | False | Deployer gate for serializer="pickle"; it raises without this. |
SOLARA_STATE_TTL | (kernel.cull_timeout, 24h) | Backend key lifetime, refreshed on every write and on connect. |
SOLARA_STATE_ORPHAN_CULL_TIMEOUT | 5m | How long a disconnected kernel lives before culling — applies only with a shared backend. |
SOLARA_STATE_PREFIX | solara:state: | Backend key prefix / table name. |
SOLARA_STATE_FLUSH_DEBOUNCE | 300ms | Coalescing window for write-behind flushes; also the at-most-once loss window. |
SOLARA_STATE_CONNECT_TIMEOUT | 0.3 | Hard cap (seconds) on any takeover/flush backend call. |
SOLARA_STATE_BREAKER_FAILURES | 3 | Consecutive backend failures before the circuit breaker opens. |
SOLARA_STATE_BREAKER_WINDOW | 30s | How long the breaker stays open before a half-open probe. |
SOLARA_STATE_SCHEMA_TAG | "" (derived from the Solara version) | Value-shape tag; a mismatch triggers a clean state reset + soft-remount. |
SOLARA_STATE_AUTO_REMOUNT | None (on iff a backend is set) | Force client soft-remount on/off. |
SOLARA_STATE_BAILOUT_STORM_THRESHOLD | 0.5 | Fraction of recent restores that may bail out before the refresh dialog is suppressed and the fleet degrades to a silent fresh start. |
SOLARA_STATE_WARN_VALUE_BYTES | 1000000 | Serialized-envelope size (per variable) above which a flush logs a warning; still persisted. |
SOLARA_STATE_MAX_VALUE_BYTES | 5000000 | Serialized-envelope size above which a value is skipped (not persisted); the rest of the session still flushes. 0 disables the cap. |
SOLARA_STATE_TEST_EVICTION | False | Enables the dev/test kernel-eviction route for simulateFailover(); refused in production mode. |
These non-SOLARA_STATE_ settings also matter when persistence is enabled:
| Environment variable | Default | Meaning |
|---|---|---|
SOLARA_SESSION_HTTP_ONLY | False | Make the session cookie unreadable to JavaScript. Recommended with persistence (the cookie is a state-recovery credential); the server warns if it is off. |
SOLARA_KERNEL_MAX_PER_SESSION | 100 | Max live kernels one session cookie may create (bounds the client-chosen-kernel_id amplifier). Enforced only when a backend is configured; 0 disables. |
SOLARA_RESOURCEZ_TOKEN | "" | Bearer token that unlocks the un-redacted /resourcez per-key breakdown in production. Unset ⇒ production hashes the key labels. |