How Airbnb Built a Reliable Dynamic Config Sidecar at Scale

Why Dynamic Configuration Delivery is Hard

At Airbnb, configuration changes happen several times per minute. Every change must reach thousands of service instances reliably, within tens of seconds, without requiring a redeploy. This is the job of Sitar-agent: a lightweight Kubernetes sidecar that runs alongside every subscribed service pod.

In this post, we'll walk through the lifecycle of a configuration change, the key architectural decisions the team made during a complete rewrite from Ruby to Java, and how they balanced reliability, performance, and operational simplicity.

This post summarizes an engineering write-up from Airbnb's team; see the original post for full details.

The Configuration Delivery Lifecycle

Here's how a config change travels from a developer's commit to a running pod:

Config creation/update – Developer commits via Git or web UI. Changes are stored with full versioning and ACL enforcement.
Hourly snapshot upload – The Snapshot Service packages all config groups and uploads compressed snapshots to S3.
Pod startup – The sidecar downloads the latest snapshot from S3 (fast bootstrap), then syncs with Sitar Service for any changes since the snapshot. Only after this does the main container start.
Periodic update – Every ~10 seconds, the agent polls the Sitar Service for changes.
Read config – The application reads from a shared mounted disk via a client library with in-memory cache.

This snapshot-based preload is a key optimization: it dramatically reduces cold start time and decouples startup reliability from the availability of the Sitar Service itself.
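The startup and polling steps above can be sketched roughly as follows. This is a minimal illustration, not Airbnb's implementation: the class name ConfigAgent, the load_snapshot and fetch_delta callables, and the integer token are all hypothetical stand-ins for the S3 download and the Sitar Service call. Falling back to the snapshot alone when the service is unreachable is also an assumption, made here to show how the preload can decouple startup from service availability.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Config = Dict[str, str]
# Hypothetical contracts: a snapshot loader returns (config, token), and a
# delta fetcher returns (changes, new_token) for everything after a token.
SnapshotLoader = Callable[[], Tuple[Config, int]]
DeltaFetcher = Callable[[int], Tuple[Config, int]]


class ConfigAgent:
    """Illustrative sidecar agent: snapshot bootstrap, then incremental polling."""

    def __init__(self, load_snapshot: SnapshotLoader, fetch_delta: DeltaFetcher) -> None:
        self._load_snapshot = load_snapshot
        self._fetch_delta = fetch_delta
        self.store: Config = {}
        self.token = 0
        self.ready = False

    def poll_once(self) -> int:
        """Fetch changes after the current token and apply them; return the count."""
        changes, new_token = self._fetch_delta(self.token)
        self.store.update(changes)
        self.token = new_token
        return len(changes)

    def bootstrap(self) -> None:
        # Step 1: bulk-load the latest snapshot (the S3 download in the article).
        self.store, self.token = self._load_snapshot()
        # Step 2: catch up on changes made since the snapshot was taken.
        # Assumption: if the config service is down, start from the snapshot alone.
        try:
            self.poll_once()
        except ConnectionError:
            pass
        # Only at this point would the main container be allowed to start.
        self.ready = True

    def run(self, interval_s: float = 10.0, max_polls: Optional[int] = None) -> None:
        """Poll on a fixed interval; max_polls exists only to make the loop testable."""
        polls = 0
        while max_polls is None or polls < max_polls:
            try:
                self.poll_once()
            except ConnectionError:
                pass  # keep serving the last known values
            time.sleep(interval_s)
            polls += 1
```

In this sketch the application never talks to the agent directly; it would read whatever the agent has written to the shared disk, which is why a failed poll only delays updates instead of breaking reads.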

Key Design Decisions

Sidecar vs Library

The team considered moving the config agent into the main container as a library, which would save memory and CPU (no separate JVM). However, the drawbacks outweighed those savings:

Multi-language complexity – Airbnb uses Java, Python, Go, TypeScript, and Ruby. A library would need to be implemented in all of them.
No isolation – A bug in config logic could crash the main container.
Operational noise – Logs and metrics would be mixed.

Decision: Keep the sidecar pattern. The cost savings were insufficient to justify the reliability and maintenance tradeoffs.

Pull Model vs Push Model

A push-based architecture (e.g., gRPC streaming or message queue) could reduce server load and propagation latency. But the team opted for a simple pull model with two optimizations:

Server-side cache with short TTL (10s) – Most requests hit the cache, avoiding heavy database access.
Token-based pagination – On cache miss, the request includes a token indicating the last scanned DB row, so the server only scans new changes.

Decision: Keep the pull model. Most config changes are manual, so a few seconds of delay is acceptable. The stateless simplicity is a strong operational advantage.
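To make the two optimizations concrete, here is a small sketch of a pull endpoint with a short-TTL cache and token-based scanning. Everything in it is hypothetical: the ChangeServer class, the in-memory list standing in for the database, the dense integer row ids used as tokens, and the choice to key the cache by token are assumptions for illustration, not details from Airbnb's service.

```python
import time
from typing import Callable, Dict, List, Tuple

Row = Tuple[int, str, str]  # (row_id, key, value)


class ChangeServer:
    """Illustrative pull endpoint: a short-TTL cache in front of a change log."""

    def __init__(self, ttl_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._log: List[Row] = []  # stands in for the database table
        # token -> (cached_at, rows, new_token); unbounded here, a real cache
        # would evict old entries.
        self._cache: Dict[int, Tuple[float, List[Row], int]] = {}
        self._ttl_s = ttl_s
        self._clock = clock
        self.db_scans = 0  # counts how often the "database" is actually scanned

    def write(self, key: str, value: str) -> None:
        # Row ids are dense and start at 1 in this sketch.
        self._log.append((len(self._log) + 1, key, value))

    def _scan_after(self, token: int) -> List[Row]:
        # Because ids are dense, rows after `token` start at list index `token`.
        # A real table would do an indexed range scan on the row id instead.
        self.db_scans += 1
        return self._log[token:]

    def get_changes(self, token: int) -> Tuple[List[Row], int]:
        """Return (rows after token, new token), served from cache when fresh."""
        now = self._clock()
        hit = self._cache.get(token)
        if hit is not None and now - hit[0] < self._ttl_s:
            return hit[1], hit[2]
        rows = self._scan_after(token)
        new_token = rows[-1][0] if rows else token
        self._cache[token] = (now, rows, new_token)
        return rows, new_token
```

Agents that are equally up to date send the same token, so within one TTL window they share a single database scan. The cost is that a response can be up to one TTL stale, which adds to the polling interval when estimating propagation delay.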

Local Datastore: SQLite vs RocksDB vs Sparkey

The legacy datastore was Sparkey-backed, but it had critical limitations:

No native concurrency support – required an external lock that blocked writes.
Re-indexes the entire datastore on each write – expensive with frequent updates.
Limited multi-language support.

The team benchmarked SQLite and RocksDB. Here's a summary:

Dimension	SQLite	RocksDB
Raw performance	Good enough for workload	Best across all tests
Multi-language support	Excellent (mature, well-maintained bindings for all Airbnb languages)	Less mature, unevenly maintained
Operational complexity	Simple (single file, WAL mode)	Requires tuning (compaction, block cache, etc.)
Concurrency	Native WAL mode supports concurrent reads during writes	Excellent, but more complex to configure

Decision: SQLite. While RocksDB had better raw performance, SQLite's simplicity and first-class multi-language support made it the better fit for a team supporting multiple runtimes.
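As a rough illustration of the "single file, WAL mode" setup from the table, the sketch below uses Python's standard sqlite3 module. The table name, schema, and helper names (open_store, put, get) are assumptions made for this example; the article does not describe the actual schema the agent uses.

```python
import sqlite3
from typing import Optional


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) a single-file config store in WAL mode."""
    conn = sqlite3.connect(path)
    # In WAL mode, readers keep reading the last committed state while a
    # writer appends to the write-ahead log.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS config ("
        " key TEXT PRIMARY KEY,"
        " value TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def put(conn: sqlite3.Connection, key: str, value: str) -> None:
    """Insert or overwrite one config value in its own transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO config (key, value) VALUES (?, ?)",
            (key, value),
        )


def get(conn: sqlite3.Connection, key: str) -> Optional[str]:
    """Look up one value by primary key; None if the key is absent."""
    row = conn.execute(
        "SELECT value FROM config WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else None
```

In a sidecar layout, the agent would hold the writing connection and each application process would open its own connection to the same file on the shared volume. Note that WAL mode needs a real file on a local filesystem; an in-memory database does not use it.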

Safe Migration: Shadow Reads + Feature Flags

Migrating from Sparkey to SQLite across thousands of services required extreme caution. The team used:

Shadow reads – Before switching any service, both datastores were read in parallel, and results compared.
Feature-flag-gated rollout – Migration started with the least critical services and progressed to Tier 0 services last, with dedicated coordination at each step.

This two-pronged approach ensured that any discrepancy would be caught before impacting production.
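The combination of shadow reads and a flag-gated rollout might look something like the sketch below. The names ShadowReader and candidate_enabled are hypothetical, the tier numbering beyond "Tier 0 is most critical" is an assumption, and the two stores are read one after the other here for simplicity, whereas the article describes reading them in parallel.

```python
import logging
from typing import Callable, Optional

log = logging.getLogger("shadow_read")
Reader = Callable[[str], Optional[str]]


def candidate_enabled(service_tier: int, min_enabled_tier: int) -> bool:
    """Flag check: higher tier numbers are less critical and migrate first.

    Lowering min_enabled_tier step by step toward 0 widens the rollout until
    Tier 0 services are included last.
    """
    return service_tier >= min_enabled_tier


class ShadowReader:
    """Read from both stores, record mismatches, and serve per the flag."""

    def __init__(self, legacy: Reader, candidate: Reader,
                 use_candidate: Callable[[], bool]) -> None:
        self._legacy = legacy
        self._candidate = candidate
        self._use_candidate = use_candidate
        self.mismatches = 0

    def get(self, key: str) -> Optional[str]:
        old = self._legacy(key)
        try:
            new = self._candidate(key)
        except Exception:
            # A failing candidate store must never break the caller.
            log.exception("candidate read failed for %s", key)
            self.mismatches += 1
            return old
        if new != old:
            self.mismatches += 1
            log.warning("shadow mismatch for %s", key)
        return new if self._use_candidate() else old
```

While the flag is off, callers only ever see legacy values and the mismatch counter acts as the signal for whether it is safe to flip the flag for the next tier.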

Limitations and Caveats

Polling latency is not real-time. If your use case requires sub-second propagation of config changes (e.g., feature flags for critical incidents), a push-based or streaming model would be more appropriate.
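SQLite was chosen for being fast enough, not fastest. Lookups on an indexed key scale roughly logarithmically with table size rather than linearly, but for very large config datasets (gigabytes) or much heavier write rates, RocksDB's better raw performance might become necessary.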
The sidecar pattern adds resource overhead. Each pod runs an extra JVM process. For very small or ephemeral pods, this overhead might be significant.

Next Steps for Learning

Explore push-based config delivery using gRPC streaming or Apache Kafka for lower latency.
Learn about Kubernetes container lifecycle hooks (postStart, preStop) to handle sidecar startup ordering and graceful shutdown.
Compare SQLite vs RocksDB on your own workload, using a benchmark that mirrors your actual read/write mix.

Conclusion

Airbnb's Sitar-agent is a well-engineered solution to a hard problem: delivering dynamic configuration at scale across a polyglot service fleet. The key takeaway is that every decision – sidecar vs library, pull vs push, SQLite vs RocksDB – came back to the same constraints: availability, propagation speed, and operational simplicity. There's no one-size-fits-all answer, but the tradeoff analysis framework used here is universally applicable.



This content was drafted using AI tools based on reliable sources, and has been reviewed by our editorial team before publication. It is not intended to replace professional advice.
