Why Dynamic Configuration Delivery is Hard
At Airbnb, configuration changes happen several times per minute. Every change must reach thousands of service instances reliably, within tens of seconds, without requiring a redeploy. This is the job of Sitar-agent: a lightweight Kubernetes sidecar that runs alongside every subscribed service pod.
In this post, we'll walk through the lifecycle of a configuration change, the key architectural decisions the team made during a complete rewrite from Ruby to Java, and how they balanced reliability, performance, and operational simplicity.
This is an engineering insight from Airbnb's team. The original post can be found here.
The Configuration Delivery Lifecycle
Here's how a config change travels from a developer's commit to a running pod:
- Config creation/update – Developer commits via Git or web UI. Changes are stored with full versioning and ACL enforcement.
- Hourly snapshot upload – The Snapshot Service packages all config groups and uploads compressed snapshots to S3.
- Pod startup – The sidecar downloads the latest snapshot from S3 (fast bootstrap), then syncs with Sitar Service for any changes since the snapshot. Only after this does the main container start.
- Periodic update – Every ~10 seconds, the agent polls the Sitar Service for changes.
- Read config – The application reads from a shared mounted disk via a client library with in-memory cache.
This snapshot-based preload is a key optimization: it dramatically reduces cold start time and decouples startup reliability from the availability of the Sitar Service itself.
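The startup sequence above can be sketched roughly as follows. This is a minimal illustration, not Airbnb's implementation: the `bootstrap` function, the `(version, key, value)` change format, and the fake snapshot and service functions are all hypothetical stand-ins for S3 and the Sitar Service. Falling back to the snapshot when the service is unreachable is also an assumption, based on the stated goal of decoupling startup from service availability.

```python
from typing import Callable, Dict, List, Tuple

Change = Tuple[int, str, str]  # hypothetical format: (version, key, value)


def bootstrap(
    load_snapshot: Callable[[], Tuple[int, Dict[str, str]]],
    fetch_changes_since: Callable[[int], List[Change]],
) -> Tuple[int, Dict[str, str]]:
    """Preload configs from a snapshot, then apply newer changes on top."""
    version, snapshot = load_snapshot()
    configs = dict(snapshot)  # copy so the snapshot itself is not mutated
    try:
        changes = fetch_changes_since(version)
    except ConnectionError:
        # Assumption: if the config service is unreachable, start with the
        # (possibly stale) snapshot instead of blocking pod startup.
        return version, configs
    for change_version, key, value in sorted(changes):
        configs[key] = value
        version = max(version, change_version)
    return version, configs


# Hypothetical stand-ins for the S3 snapshot and the Sitar Service.
def fake_snapshot() -> Tuple[int, Dict[str, str]]:
    return 100, {"search.timeout_ms": "250", "checkout.enabled": "true"}


def fake_service(since: int) -> List[Change]:
    log = [(99, "search.timeout_ms", "200"), (101, "search.timeout_ms", "300")]
    return [change for change in log if change[0] > since]


def unavailable_service(since: int) -> List[Change]:
    raise ConnectionError("config service unreachable")


version, configs = bootstrap(fake_snapshot, fake_service)
print(version, configs)
```

The main container would start only after `bootstrap` returns. The periodic update then amounts to calling `fetch_changes_since` again with the latest known version.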
Key Design Decisions
Sidecar vs Library
The team considered moving the config agent into the main container as a library, which would save memory and CPU (no separate JVM). However, the drawbacks outweighed the savings:
- Multi-language complexity – Airbnb uses Java, Python, Go, TypeScript, and Ruby. A library would need to be implemented in all of them.
- No isolation – A bug in config logic could crash the main container.
- Operational noise – Logs and metrics would be mixed.
Decision: Keep the sidecar pattern. The cost savings were insufficient to justify the reliability and maintenance tradeoffs.
Pull Model vs Push Model
A push-based architecture (e.g., gRPC streaming or message queue) could reduce server load and propagation latency. But the team opted for a simple pull model with two optimizations:
- Server-side cache with short TTL (10s) – Most requests hit the cache, avoiding heavy database access.
- Token-based pagination – On cache miss, the request includes a token indicating the last scanned DB row, so the server only scans new changes.
Decision: Keep the pull model. Most config changes are manual, so a few seconds of delay is acceptable. The stateless simplicity is a strong operational advantage.
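To make the two optimizations concrete, here is one possible reading of them as a small in-memory sketch. The `ChangeFeed` class, its method names, and the row format are hypothetical. The real service's API is not described in the source, and a real server would run an indexed database query where this sketch uses `bisect` over a list.

```python
import bisect
import time
from typing import Callable, Dict, List, Tuple

Change = Tuple[int, str, str]  # hypothetical format: (row_id, key, value)


class ChangeFeed:
    """Serves changes newer than a token, caching each response for a short TTL."""

    def __init__(self, ttl_seconds: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ids: List[int] = []      # stand-in for an indexed row-id column
        self._rows: List[Change] = []  # stand-in for the change table
        self._ttl = ttl_seconds
        self._clock = clock
        # Unbounded per-token cache; a real server would evict old entries.
        self._cache: Dict[int, Tuple[float, List[Change]]] = {}
        self.db_scans = 0

    def append(self, key: str, value: str) -> int:
        row_id = self._ids[-1] + 1 if self._ids else 1
        self._ids.append(row_id)
        self._rows.append((row_id, key, value))
        return row_id

    def _scan_after(self, token: int) -> List[Change]:
        # Only rows after the token are touched, like `WHERE row_id > :token`.
        self.db_scans += 1
        start = bisect.bisect_right(self._ids, token)
        return self._rows[start:]

    def poll(self, token: int) -> Tuple[int, List[Change]]:
        now = self._clock()
        entry = self._cache.get(token)
        if entry is None or now - entry[0] >= self._ttl:
            entry = (now, self._scan_after(token))  # cache miss or expired
            self._cache[token] = entry
        changes = entry[1]
        next_token = changes[-1][0] if changes else token
        return next_token, changes


feed = ChangeFeed()
feed.append("search.timeout_ms", "250")
token, changes = feed.poll(0)
print(token, changes)
```

Because agents that are equally up to date send the same token, they share a cache entry, and the server stays stateless with respect to individual clients. The cost is staleness: a change written just after a response is cached is not served until that entry expires.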
Local Datastore: SQLite vs RocksDB vs Sparkey
The legacy datastore was Sparkey-backed, but it had critical limitations:
- No native concurrency support – required an external lock that blocked writes.
- Re-indexes the entire datastore on each write – expensive with frequent updates.
- Limited multi-language support.
The team benchmarked SQLite and RocksDB. Here's a summary:
| Dimension | SQLite | RocksDB |
|---|---|---|
| Raw performance | Good enough for workload | Best across all tests |
| Multi-language support | Excellent (official bindings for all Airbnb languages) | Less mature, unevenly maintained |
| Operational complexity | Simple (single file, WAL mode) | Requires tuning (compaction, block cache, etc.) |
| Concurrency | Native WAL mode supports concurrent reads during writes | Excellent, but more complex to configure |
Decision: SQLite. While RocksDB had better raw performance, SQLite's simplicity and first-class multi-language support made it the better fit for a team supporting multiple runtimes.
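The concurrency row in the table is easy to check with Python's standard `sqlite3` module. The sketch below is illustrative only: the `configs` table, its columns, and the file location are assumptions, not Sitar-agent's actual schema. It shows a reader getting the last committed value while a write transaction is still open.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "configs.db")

# isolation_level=None disables implicit transactions so BEGIN/COMMIT are explicit.
writer = sqlite3.connect(db_path, isolation_level=None)
mode = writer.execute("PRAGMA journal_mode=WAL").fetchone()[0]
writer.execute(
    "CREATE TABLE configs (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
)
writer.execute("INSERT INTO configs VALUES ('search.timeout_ms', '250')")

# A second connection plays the role of the application reading configs.
reader = sqlite3.connect(db_path, isolation_level=None)


def read_value(conn: sqlite3.Connection, key: str) -> str:
    cur = conn.execute("SELECT value FROM configs WHERE key = ?", (key,))
    row = cur.fetchone()
    cur.close()  # release the read transaction promptly
    return row[0]


# Open a write transaction and leave it uncommitted for a moment.
writer.execute("BEGIN IMMEDIATE")
writer.execute(
    "UPDATE configs SET value = '300' WHERE key = 'search.timeout_ms'"
)

# In WAL mode the reader is not blocked and sees the last committed value.
during = read_value(reader, "search.timeout_ms")

writer.execute("COMMIT")
after = read_value(reader, "search.timeout_ms")

print(mode, during, after)
reader.close()
writer.close()
```

WAL mode still allows only one writer at a time, which matches a design where a single agent process writes and application processes only read.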
Safe Migration: Shadow Reads + Feature Flags
Migrating from Sparkey to SQLite across thousands of services required extreme caution. The team used:
- Shadow reads – Before switching any service, both datastores were read in parallel, and results compared.
- Feature-flag-gated rollout – Migration started with the least critical services and progressed to Tier 0 services last, with dedicated coordination at each step.
This two-pronged approach ensured that any discrepancy would be caught before impacting production.
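A shadow read combined with a tier-based flag might look like the sketch below. All names here (`ShadowReader`, `use_new_store`, the tier numbering) are hypothetical. For simplicity it reads the two stores one after the other instead of in parallel, and it records mismatches in a list where a real system would emit metrics.

```python
from typing import Callable, List, Optional

Reader = Callable[[str], Optional[str]]


class ShadowReader:
    """Serves reads from the primary store and compares them to a shadow store."""

    def __init__(self, primary: Reader, shadow: Reader) -> None:
        self._primary = primary
        self._shadow = shadow
        self.mismatches: List[str] = []

    def get(self, key: str) -> Optional[str]:
        value = self._primary(key)
        try:
            shadow_value = self._shadow(key)
        except Exception:
            # A failing shadow store must never affect the caller.
            self.mismatches.append(key)
            return value
        if shadow_value != value:
            self.mismatches.append(key)
        return value  # callers always get the primary's answer


def use_new_store(service_tier: int, enabled_down_to_tier: int) -> bool:
    """Assumed flag semantics: Tier 0 is most critical and is enabled last.

    Rollout starts with a high `enabled_down_to_tier` and lowers it step by step.
    """
    return service_tier >= enabled_down_to_tier


legacy_store = {"search.timeout_ms": "250", "checkout.enabled": "true"}
new_store = {"search.timeout_ms": "250", "checkout.enabled": "false"}

shadow_reader = ShadowReader(legacy_store.get, new_store.get)
for config_key in legacy_store:
    shadow_reader.get(config_key)
print(shadow_reader.mismatches)
```

The point of this shape is that the shadow path is purely observational: a service keeps getting answers from the legacy store until the mismatch list stays empty and its tier is enabled.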
Limitations and Caveats
- Polling latency is not real-time. If your use case requires sub-second propagation of config changes (e.g., feature flags for critical incidents), a push-based or streaming model would be more appropriate.
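- SQLite lookups can slow down as data grows. Indexed reads use B-trees and scale roughly logarithmically, but unindexed scans grow linearly with table size. For very large config datasets (gigabytes), RocksDB's more sophisticated indexing might become necessary.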
- The sidecar pattern adds resource overhead. Each pod runs an extra JVM process. For very small or ephemeral pods, this overhead might be significant.
Next Steps for Learning
- Explore push-based config delivery using gRPC streaming or Apache Kafka for lower latency.
- Learn about Kubernetes sidecar lifecycle hooks (preStop, postStart) to handle graceful shutdown.
- Benchmark SQLite against RocksDB on your own workload before choosing between them.
Conclusion
Airbnb's Sitar-agent is a well-engineered solution to a hard problem: delivering dynamic configuration at scale across a polyglot service fleet. The key takeaway is that every decision – sidecar vs library, pull vs push, SQLite vs RocksDB – came back to the same constraints: availability, propagation speed, and operational simplicity. There's no one-size-fits-all answer, but the same tradeoff analysis applies broadly to other infrastructure decisions.


