The success of modern apps hinges on personalization—delivering tailored experiences to individual users. Simultaneously, experimentation is crucial for improving and evaluating these personalization systems. Interestingly, leading companies like Spotify maintain a clear separation between the technology stacks powering these two domains. Why? The reasons run deeper than what a single unified tool can address.

Personalization vs. Experimentation: A Fundamental Divergence in Goals
- Goal of Personalization: To build a system that delivers the single best experience for each individual user. It uses sophisticated ML models (neural networks, LLMs, reinforcement learning) to process rich features and generate context-aware, real-time recommendations.
- Goal of Experimentation: To compare and evaluate which alternative (e.g., a different button design, a different recommendation algorithm) performs better. It enables data-driven decisions through A/B tests or multi-armed bandits.
Contextual bandits blur this boundary. A contextual bandit serves different 'arms' based on user features, which makes it inherently a personalization system. The bandit itself therefore becomes a 'system' that must be the subject of an experiment, compared against an alternative system (e.g., the old static button). The experimentation platform's role is to evaluate the value of this personalization system, not to build the system itself.
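To make the division of labor concrete, here is a minimal Python sketch under assumed names (`assign_system`, `contextual_bandit_arm`, and the arm labels are all illustrative): the experimentation layer deterministically splits users between the two systems under comparison, and the bandit, standing in for the ML stack, picks an arm from user features. The feature rule is a stand-in for a learned policy, not a real recommender.

```python
# Hypothetical sketch: the experiment randomizes users between two *systems*;
# one of those systems happens to be a contextual bandit. The experimentation
# layer evaluates the bandit, it does not run it. All names are illustrative.
import hashlib
import random

def assign_system(user_id: str, experiment_salt: str = "button-test-v1") -> str:
    """Deterministic 50/50 split between the two systems under test."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    return "contextual_bandit" if int(digest, 16) % 2 == 0 else "static_button"

def contextual_bandit_arm(user_features: dict) -> str:
    """Stand-in policy: a real bandit would score arms with a learned model."""
    if user_features.get("is_new_user"):
        return "pulsing_button"
    return random.choice(["blue_button", "green_button"])

def serve(user_id: str, user_features: dict) -> str:
    system = assign_system(user_id)                  # experimentation stack's job
    if system == "contextual_bandit":
        return contextual_bandit_arm(user_features)  # ML stack's job
    return "static_button"                           # the old baseline system

print(serve("user-42", {"is_new_user": True}))
```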
The Decisive Reasons for Separate Tech Stacks
- Diverging Infrastructure Needs:
  - ML Stack: Requires low-latency, real-time feature access, fast model inference, and training/serving infrastructure for diverse model types (boosting, random forests, neural networks, LLMs).
  - Experimentation Stack: Optimized for accurate randomization, metric aggregation, and statistical significance testing. (A minimal sketch of these primitives follows this list.)
  Forcing unification can lead to increased hidden technical debt in ML systems (Sculley et al., 2015) or limit the sophistication of personalization.
- Practical Limitations of Multi-Armed Bandits:
  - Single-Objective Optimization: Most bandits optimize for a single metric (e.g., short-term click-through rate). In practice, balancing multiple metrics, such as long-term satisfaction and discovery, is critical.
  - Misconception About Decision Speed: Important business metrics (e.g., 2-week retention) take time to observe, making it difficult to update bandit weights quickly (see the second sketch after this list). At Spotify, simple, reliable A/B testing, which lets 300+ teams run thousands of experiments concurrently, has delivered more business value than theoretically superior but complex bandits.
- Efficiency at Scale: Each stack scales more efficiently when it focuses on its core competency. The ML platform standardizes building personalization systems at scale, while the experimentation platform (Confidence) enables evaluating these systems in parallel with thousands of other experiments.
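To ground the infrastructure point, here is a minimal Python sketch of the two experimentation primitives named above: deterministic salted bucketing and a two-proportion z-test for significance. The function names are illustrative, not any platform's actual API; per-experiment salts are also what keep thousands of concurrent experiments statistically independent of one another.

```python
# Minimal sketch of two experimentation-stack primitives: deterministic,
# salted bucketing and a two-proportion z-test. Names are illustrative.
import hashlib
from math import sqrt
from statistics import NormalDist

def bucket(user_id: str, experiment_salt: str, variants: list[str]) -> str:
    """Same user + same salt -> same variant, every time, on every server."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in conversion rates."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Distinct salts keep two concurrent experiments independent of each other.
v1 = bucket("user-42", "exp-home-feed", ["control", "treatment"])
v2 = bucket("user-42", "exp-search-ranking", ["control", "treatment"])
print(v1, v2, two_proportion_z_test(120, 1000, 150, 1000))
```

Because assignment is a pure function of (salt, user_id), any server can compute it with no coordination or shared state, which is part of why simple randomized experiments scale so well.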
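The decision-speed limitation can also be made concrete. The toy simulation below (hypothetical, not production code) assumes a reward that only resolves 14 days after each decision, as with a 2-week retention metric: until the first window closes, the bandit has zero resolved rewards to learn from, and thereafter it always acts on estimates that lag true arm performance by two weeks.

```python
# Illustrative simulation: with a 14-day retention reward, arm estimates can
# only be updated two weeks after each decision, so most traffic is served
# from stale statistics. All names and numbers here are invented.
from collections import defaultdict, deque

HORIZON_DAYS = 14  # the "retained after 2 weeks?" reward resolves with this lag

class DelayedRewardBandit:
    def __init__(self, arms):
        self.arms = arms
        self.pulls = defaultdict(int)
        self.wins = defaultdict(int)
        self.pending = deque()  # (resolve_day, arm, reward)

    def choose(self, today: int) -> str:
        # Resolve only the rewards whose 14-day window has already closed.
        while self.pending and self.pending[0][0] <= today:
            _, arm, reward = self.pending.popleft()
            self.pulls[arm] += 1
            self.wins[arm] += reward
        # Try unseen arms first, then go greedy on observed (stale) win rates.
        for arm in self.arms:
            if self.pulls[arm] == 0:
                return arm
        return max(self.arms, key=lambda a: self.wins[a] / self.pulls[a])

    def record(self, today: int, arm: str, retained: bool):
        self.pending.append((today + HORIZON_DAYS, arm, int(retained)))

bandit = DelayedRewardBandit(["blue", "green"])
for day in range(30):
    arm = bandit.choose(day)
    bandit.record(day, arm, retained=(arm == "green"))  # toy ground truth
# Until day 14 the bandit has resolved zero rewards and is effectively blind.
```

Every choice made inside the measurement window is based on estimates that lag the true arm performance, which is why a simple A/B test that waits out the window is often the more reliable tool for slow-moving business metrics.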

Practical Advice for Implementation
- Separate from Day One: When starting with personalization, it's tempting to opt for an all-in-one tool. However, given the fundamentally different infrastructure needs, investing in a proper ML stack early pays off long-term.
- Let Each Do What It Does Best: The ML stack should focus on serving recommendations, while the experimentation stack focuses on evaluating recommendation systems.
- Design for Smooth Integration: Following the example of Spotify's Confidence platform, expose seamless API integrations with external systems (the ML platform, ad systems, etc.) so teams can set up experiments without extra steps; one possible shape of such an integration is sketched below.
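As one illustration of what such an integration could look like from the ML serving path, here is a hypothetical Python sketch; the host, endpoint paths, payloads, and the `rank_with_model` stub are all invented for this example and are not Confidence's actual API.

```python
# Hypothetical integration sketch: the ML serving path asks the experimentation
# service for an assignment, serves with the assigned variant, then logs an
# exposure. Endpoints and payloads are invented for illustration only.
import json
import urllib.request

EXPERIMENTATION_HOST = "https://experiments.internal.example.com"  # placeholder

def _post(path: str, payload: dict) -> dict:
    req = urllib.request.Request(
        EXPERIMENTATION_HOST + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=0.05) as resp:  # tight latency budget
        return json.load(resp)

def rank_with_model(user_id: str, model: str) -> list:
    """Stub for the ML stack's serving call; a real system would run inference."""
    return [f"track-{i}-{model}" for i in range(3)]

def serve_home_feed(user_id: str) -> dict:
    # 1. Experimentation stack decides which recommender variant this user sees.
    assignment = _post("/v1/assign", {"experiment": "home-feed-ranker",
                                      "unit": user_id})
    # 2. ML stack serves recommendations with the assigned model variant.
    recs = rank_with_model(user_id, model=assignment["variant"])
    # 3. Log the exposure so metrics can be attributed to the variant.
    _post("/v1/exposures", {"experiment": "home-feed-ranker", "unit": user_id,
                            "variant": assignment["variant"]})
    return {"user": user_id, "items": recs}
```

The key design property is that the experimentation call sits in the serving path but stays thin: one assignment lookup and one exposure log, with all statistics handled offline by the experimentation platform.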
In conclusion, while personalization and experimentation are complementary, their underlying technological approaches are most powerful when kept distinct. Infrastructure design that respects the unique requirements of each domain is key to sustainable innovation.