Netflix's Ranker service, responsible for the personalized rows on your homepage, operates at a massive scale. Profiling revealed a significant hotspot: the 'video serendipity scoring' feature, which consumed about 7.5% of total CPU per node. What began as a simple idea to batch this feature evolved into a comprehensive optimization journey. This post shares the key insights from that process, moving beyond theory to practical implementation details. You can find the original case study on the Netflix Tech Blog.

The Problem: An O(M×N) Inefficiency
The serendipity score answers, "How different is this new title from the user's watch history?" It involves comparing embeddings for M candidate videos against N history items, resulting in M×N cosine similarity calculations.
The initial implementation was straightforward but costly: a nested loop fetching embeddings and computing dot products one pair at a time, leading to poor cache locality and repeated memory access.
// Simplified nested-loop approach (Java-style pseudo-code)
for (Video candidate : candidates) {           // M iterations
    Vector c = embedding(candidate);
    double maxSim = -1.0;
    for (Video h : history) {                  // N iterations
        Vector v = embedding(h);
        double sim = cosine(c, v);             // dot product happens here
        maxSim = Math.max(maxSim, sim);
    }
    double serendipity = 1.0 - maxSim;         // most similar history item dominates
    emitFeature(candidate, serendipity);
}
// Total: M x N separate dot products, each fetching embeddings independently
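Assuming embeddings are pre-normalized (so cosine similarity reduces to a plain dot product), the same computation can be recast as a single batched pass over flat, row-major buffers, which is the shape the optimization journey below converges on. The following is a minimal plain-Java sketch; the class and method names are illustrative, not Netflix's production code:

```java
// Sketch: all M x N similarities as one batched pass (effectively C = A * B^T
// followed by a per-row max). candidatesFlat holds M rows of D doubles,
// historyFlat holds N rows of D doubles, both row-major.
final class BatchedScorer {
    static double[] maxSimilarities(double[] candidatesFlat, int M,
                                    double[] historyFlat, int N, int D) {
        double[] maxSim = new double[M];
        java.util.Arrays.fill(maxSim, -1.0);
        for (int i = 0; i < M; i++) {
            for (int j = 0; j < N; j++) {
                double dot = 0.0;
                // contiguous accesses: row i of A against row j of B
                for (int k = 0; k < D; k++) {
                    dot += candidatesFlat[i * D + k] * historyFlat[j * D + k];
                }
                maxSim[i] = Math.max(maxSim[i], dot);
            }
        }
        return maxSim; // serendipity(i) = 1.0 - maxSim[i]
    }
}
```

The payoff is layout, not arithmetic: the inner loop now walks two contiguous arrays, which is exactly the access pattern caches and, later, SIMD units reward.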

The 5-Step Optimization: Why Fundamentals Matter
- Batching & Matrix Transformation: Converted the M×N small dot products into a single matrix multiplication (C = A * B^T), a shape CPUs are optimized for.
- Batching Wasn't Enough: Surprisingly, this caused a 5% regression. The culprits were GC pressure from short-lived double[][] allocations and non-contiguous memory access.
- Flat Buffers & ThreadLocal Reuse: Switched to flat double[] buffers with a ThreadLocal pool for reuse, drastically reducing allocation overhead and improving cache efficiency.
- The BLAS Pitfall: Native BLAS libraries introduced JNI transition overhead and layout-conversion costs, negating their theoretical gains in this pure-Java context.
- Enter the JDK Vector API: This was the game-changer. It allows expressing SIMD (Single Instruction, Multiple Data) operations in pure Java. Scalar operations were replaced with vectorized FMA (Fused Multiply-Add) instructions, fully utilizing the CPU's vector units.
// Inner loop using the JDK Vector API (simplified)
static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

DoubleVector acc = DoubleVector.zero(SPECIES);
int k = 0;
for (; k + SPECIES.length() <= D; k += SPECIES.length()) {
    DoubleVector a = DoubleVector.fromArray(SPECIES, candidatesFlat, i * D + k);
    DoubleVector b = DoubleVector.fromArray(SPECIES, historyFlat, j * D + k);
    acc = a.fma(b, acc); // vectorized fused multiply-add
}
double dot = acc.reduceLanes(VectorOperators.ADD);
for (; k < D; k++) { // scalar tail for dimensions beyond the last full vector
    dot += candidatesFlat[i * D + k] * historyFlat[j * D + k];
}
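The flat buffers feeding loops like the one above are pooled per thread rather than allocated per request. A minimal sketch of the ThreadLocal reuse idea from steps 2/3; the class name, method name, and growth policy are hypothetical, not the production implementation:

```java
// Per-thread reusable scratch buffer: avoids allocating a fresh double[]
// (or worse, a double[][]) for every scoring request, cutting GC pressure.
final class ScratchBuffers {
    private static final ThreadLocal<double[]> FLAT =
        ThreadLocal.withInitial(() -> new double[0]);

    // Returns a flat buffer of at least `size` doubles. Grows geometrically,
    // so steady-state requests allocate nothing at all.
    static double[] acquire(int size) {
        double[] buf = FLAT.get();
        if (buf.length < size) {
            buf = new double[Math.max(size, buf.length * 2)];
            FLAT.set(buf);
        }
        return buf;
    }
}
```

Callers treat the buffer as scratch space valid only for the current request; because each thread owns its own copy, no synchronization is needed on the hot path.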
| Optimization Stage | Core Change | Primary Benefit | Consideration |
|---|---|---|---|
| 1. Batching | Nested Loop → Matrix Multiply | Algorithmic Efficiency | Requires data layout design |
| 2/3. Memory Opt. | double[][] → double[] + ThreadLocal | Better Cache Locality, Less GC | Adds buffer management logic |
| 5. JDK Vector API | Scalar Ops → SIMD Vector Ops | Maximizes CPU Hardware Efficiency | Depends on incubating module |

Conclusion: The Fastest Library Isn't the Answer
The key lesson was that getting the computation shape and data layout right, and eliminating overhead, matters more than finding the "fastest library." Once these fundamentals were in place, the JDK Vector API became the perfect tool to harness SIMD performance in pure Java, without JNI overhead.
The results were substantial: ~7% lower CPU utilization, ~12% lower average latency, and the hotspot's CPU footprint dropped from 7.5% to around 1%. This success came from re-architecting the problem, not just tuning code. When considering performance improvements, start by examining your data flow and shape before jumping to new libraries.