The Hidden Cost of “How Different Is This New Title?”
Every time you open Netflix, the homepage is personalized by a service called Ranker. One of its most expensive features is video serendipity scoring — a simple question: “How different is this new title from what you’ve been watching?” That single feature consumed 7.5% of total CPU on every node.
At Netflix’s scale, that’s a massive operational cost. The team set out to optimize it, and the journey — from naive loops to batching, flat buffers, and finally the JDK Vector API — is a masterclass in applied systems engineering.
The Hotspot: Nested Loops and Poor Cache Locality
The original implementation was straightforward but expensive:

    for (Video candidate : candidates) {
        Vector c = embedding(candidate);
        double maxSim = -1.0;
        for (Video h : history) {
            Vector v = embedding(h);
            double sim = cosine(c, v);
            maxSim = Math.max(maxSim, sim);
        }
        double serendipity = 1.0 - maxSim;
        emitFeature(candidate, serendipity);
    }

This is O(M×N) separate dot products — one per candidate-history pair. Each call fetches an embedding, does a scalar dot product, and stores the result. The memory access pattern is scattered, causing poor cache locality. A flamegraph confirmed this was the top hotspot.
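To make the scoring rule concrete, here is a minimal sketch of the scalar baseline on plain double[] arrays. The class and method names are illustrative and not taken from Netflix's code, and it assumes embeddings are non-zero vectors of equal length.

```java
final class SerendipityBaseline {
    /** Cosine similarity of two equal-length, non-zero vectors. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            normA += a[k] * a[k];
            normB += b[k] * b[k];
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Serendipity = 1 - (highest similarity to any title in the history). */
    static double serendipity(double[] candidate, double[][] history) {
        double maxSim = -1.0;
        for (double[] h : history) {
            maxSim = Math.max(maxSim, cosine(candidate, h));
        }
        return 1.0 - maxSim;
    }
}
```

A candidate identical to something already watched scores 0, while a candidate whose closest history item is orthogonal to it scores 1. Every call to cosine recomputes both norms, which is part of the redundant work that the batched version avoids by normalizing each row once.
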

Step 1: Batching – From Nested Loops to Matrix Multiply
The first insight: treat the problem as a single matrix multiplication. If D is the embedding dimension:
- Pack all candidate embeddings into matrix A (M×D)
- Pack all history embeddings into matrix B (N×D)
- Normalize rows to unit length
- Compute C = A × Bᵀ (M×N cosine similarities)

    // Build matrices (each row is filled in below)
    double[][] A = new double[M][]; // candidates, M rows of length D
    double[][] B = new double[N][]; // history, N rows of length D
    for (int i = 0; i < M; i++) {
        A[i] = embedding(candidates[i]).toArray();
    }
    for (int j = 0; j < N; j++) {
        B[j] = embedding(history[j]).toArray();
    }
    // Normalize rows to unit vectors so that dot products equal cosine similarities
    normalizeRows(A);
    normalizeRows(B);
    // Compute C = A * B^T (M x N)
    double[][] C = matmul(A, B);
    // Derive serendipity from the row-wise maximum
    for (int i = 0; i < M; i++) {
        double maxSim = -1.0;
        for (int j = 0; j < N; j++) {
            maxSim = Math.max(maxSim, C[i][j]);
        }
        double serendipity = 1.0 - maxSim;
        emitFeature(candidates[i], serendipity);
    }

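The snippet above calls normalizeRows and matmul without showing them. A minimal sketch of what such helpers might look like follows; the DenseOps class name and the choice to leave all-zero rows unchanged are assumptions, not details from the original post.

```java
final class DenseOps {
    /** Scales each row to unit length in place; all-zero rows are left unchanged. */
    static void normalizeRows(double[][] m) {
        for (double[] row : m) {
            double sumSq = 0.0;
            for (double v : row) {
                sumSq += v * v;
            }
            if (sumSq == 0.0) {
                continue; // avoid dividing by zero
            }
            double inv = 1.0 / Math.sqrt(sumSq);
            for (int k = 0; k < row.length; k++) {
                row[k] *= inv;
            }
        }
    }

    /** Returns C = A * B^T, where A is M x D and B is N x D, so C is M x N. */
    static double[][] matmul(double[][] a, double[][] b) {
        double[][] c = new double[a.length][b.length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < b.length; j++) {
                double dot = 0.0;
                for (int k = 0; k < a[i].length; k++) {
                    dot += a[i][k] * b[j][k];
                }
                c[i][j] = dot;
            }
        }
        return c;
    }
}
```

Because both operands are stored one embedding per row, multiplying by the transpose of B means every inner loop walks two rows left to right, with no strided column access.
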
This replaces M×N separate dot-product calls with a single matrix multiply, the kind of dense, regular computation CPUs are optimized for. The amount of arithmetic is unchanged; only its shape is. But the first implementation caused a 5% regression. Why?

Step 2: When Batching Isn’t Enough – Memory Layout Matters
The problem wasn’t the algorithm. It was the implementation details:
- double[][] is non-contiguous memory → pointer chasing, poor cache behavior
- Large per-request allocations → GC pressure
- Scalar Java matrix multiply → no SIMD
Lesson: Algorithmic improvements don’t matter if memory layout and allocation strategy work against you.
Step 3: Flat Buffers & ThreadLocal Reuse
They reworked the data layout to flat double[] buffers in row-major order, and used ThreadLocal to reuse buffers across requests:

    class BufferHolder {
        double[] candidatesFlat = new double[0];
        double[] historyFlat = new double[0];

        // Grow only when the current buffer is too small; otherwise reuse it
        double[] getCandidatesFlat(int required) {
            if (candidatesFlat.length < required) {
                candidatesFlat = new double[required];
            }
            return candidatesFlat;
        }

        double[] getHistoryFlat(int required) {
            if (historyFlat.length < required) {
                historyFlat = new double[required];
            }
            return historyFlat;
        }
    }

    private static final ThreadLocal<BufferHolder> threadBuffers =
        ThreadLocal.withInitial(BufferHolder::new);

This eliminated per-request allocations and improved cache locality.
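As an illustration of how a reused flat buffer gets filled, here is a small sketch that copies row vectors into a per-thread, row-major scratch array. FlatPacker and its single shared buffer are hypothetical simplifications; the holder in the article keeps separate buffers for candidates and history.

```java
final class FlatPacker {
    // One reusable scratch buffer per thread; it only grows, never shrinks.
    private static final ThreadLocal<double[]> SCRATCH =
            ThreadLocal.withInitial(() -> new double[0]);

    /**
     * Copies rows into the calling thread's flat buffer in row-major order,
     * so element (i, k) lands at index i * dim + k, and returns that buffer.
     */
    static double[] pack(double[][] rows, int dim) {
        int required = rows.length * dim;
        double[] flat = SCRATCH.get();
        if (flat.length < required) {
            flat = new double[required];
            SCRATCH.set(flat);
        }
        for (int i = 0; i < rows.length; i++) {
            System.arraycopy(rows[i], 0, flat, i * dim, dim);
        }
        return flat;
    }
}
```

Two consequences of reuse are worth keeping in mind. The returned array can be longer than rows.length * dim and may hold stale values past that point, so callers have to carry the logical dimensions alongside it. And a second pack call on the same thread overwrites the first result, which is why candidates and history need separate buffers.
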
Step 4: BLAS – Great in Tests, Not in Production
They tried BLAS (Basic Linear Algebra Subprograms). Microbenchmarks looked great, but in production:
- Default netlib-java used F2J (Fortran-to-Java) BLAS, not native
- JNI transitions added overhead
- Java row-major vs BLAS column-major required conversions and temporary buffers
Result: Gains didn’t materialize.
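To give a feel for the layout mismatch, here is a sketch of the kind of conversion a row-major Java buffer may need before it matches a column-major routine. The article does not show how Netflix's code handled this, so the helper below is purely illustrative.

```java
final class LayoutConversion {
    /** Copies a rows x cols row-major matrix into a newly allocated column-major array. */
    static double[] rowMajorToColumnMajor(double[] rowMajor, int rows, int cols) {
        double[] colMajor = new double[rows * cols]; // temporary buffer, allocated on every call
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                // Row-major index is r * cols + c; column-major index is c * rows + r
                colMajor[c * rows + r] = rowMajor[r * cols + c];
            }
        }
        return colMajor;
    }
}
```

Each conversion touches every element once and allocates a second buffer, which works against the allocation-free flat layout set up in the previous step.
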
Step 5: JDK Vector API – Pure Java SIMD
The final piece: replace BLAS with a pure-Java SIMD implementation using the JDK Vector API (incubating). This lets you write data-parallel operations that the JIT maps to SSE/AVX2/AVX-512 instructions — no JNI, no native dependencies.

    // Vector API inner loop (simplified)
    // SPECIES could be, for example, DoubleVector.SPECIES_PREFERRED
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            DoubleVector acc = DoubleVector.zero(SPECIES);
            int k = 0;
            // Main loop: one full vector of lanes per iteration
            for (; k + SPECIES.length() <= D; k += SPECIES.length()) {
                DoubleVector a = DoubleVector.fromArray(SPECIES, candidatesFlat, i * D + k);
                DoubleVector b = DoubleVector.fromArray(SPECIES, historyFlat, j * D + k);
                acc = a.fma(b, acc); // fused multiply-add: acc = a * b + acc
            }
            double dot = acc.reduceLanes(VectorOperators.ADD);
            // Scalar tail for the remaining elements k..D-1
            for (; k < D; k++) {
                dot += candidatesFlat[i * D + k] * historyFlat[j * D + k];
            }
            similaritiesFlat[i * N + j] = dot;
        }
    }

At class load time, a factory selects the best implementation:
- Vector API if available (needs --add-modules=jdk.incubator.vector)
- Otherwise, a highly optimized scalar fallback (inspired by Lucene)
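One way such a class-load-time selection could be structured is sketched below. Every name here (DotKernel, ScalarDotKernel, DotKernels, and the reflectively loaded VectorDotKernel) is hypothetical, and the unrolled scalar loop is only a common shape for a fallback, not the one the article describes.

```java
interface DotKernel {
    /** Dot product of len elements starting at a[aOff] and b[bOff]. */
    double dot(double[] a, int aOff, double[] b, int bOff, int len);
}

final class ScalarDotKernel implements DotKernel {
    @Override
    public double dot(double[] a, int aOff, double[] b, int bOff, int len) {
        // Four independent accumulators, so the additions do not form one long dependency chain
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int k = 0;
        for (; k + 4 <= len; k += 4) {
            s0 += a[aOff + k] * b[bOff + k];
            s1 += a[aOff + k + 1] * b[bOff + k + 1];
            s2 += a[aOff + k + 2] * b[bOff + k + 2];
            s3 += a[aOff + k + 3] * b[bOff + k + 3];
        }
        double sum = (s0 + s1) + (s2 + s3);
        for (; k < len; k++) { // leftover elements
            sum += a[aOff + k] * b[bOff + k];
        }
        return sum;
    }
}

final class DotKernels {
    /** Chosen once, when this class is initialized. */
    static final DotKernel BEST = select();

    private static DotKernel select() {
        // The incubator module is only resolved when the JVM was started with --add-modules
        if (ModuleLayer.boot().findModule("jdk.incubator.vector").isPresent()) {
            try {
                // Hypothetical class, compiled separately against the incubator module
                return (DotKernel) Class.forName("VectorDotKernel")
                        .getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException | LinkageError e) {
                // Fall through to the scalar kernel
            }
        }
        return new ScalarDotKernel();
    }
}
```

Loading the vectorized class by name keeps the rest of the code free of compile-time references to jdk.incubator.vector, so the service still starts on a JVM launched without the flag and simply uses the scalar path.
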
Production Results
| Metric | Before | After | Improvement |
|---|---|---|---|
| CPU (feature) | 7.5% | ~1% | -87% |
| CPU/RPS | baseline | -10% | -10% |
| Average latency | baseline | -12% | -12% |
At the assembly level, the shift was clear: from loop-unrolled scalar dot products to vectorized matrix multiply on AVX-512.
Limitations & Caveats
- Vector API is still incubating (requires runtime flag). The fallback path is essential for safety.
- Not all workloads benefit. This optimization works because the hot loop is dominated by large numbers of dot products on contiguous double[] buffers.
- Benchmarking must include production context. Microbenchmarks for BLAS looked great, but real-world gains depended on memory layout and allocation patterns.
Next Steps
If you’re considering the Vector API for your service:
- Profile first – confirm your hotspot is a data-parallel loop.
- Fix memory layout before touching compute kernels. Flat buffers and reuse are often 80% of the gain.
- Design a fallback – Vector API is not yet stable across all JVM versions.
- Measure at the system level – CPU/RPS and latency, not just microbenchmarks.
This article is based on a Netflix Tech Blog post by Harshad Sane and the Performance Engineering team.

Conclusion
This optimization wasn’t about finding the “fastest library.” It was about getting the fundamentals right:
- Algorithmic shape – batching turned O(M×N) dot products into a single matrix multiply
- Memory layout – flat buffers and ThreadLocal reuse eliminated GC pressure and improved cache locality
- Compute kernel – JDK Vector API provided pure-Java SIMD without JNI overhead
When those pieces aligned, the Vector API became a natural fit, delivering roughly 10% lower CPU per request (CPU/RPS) and 12% lower average latency with readable, maintainable Java code.
Have you tried the Vector API in a real service? What workloads did it help (or not)? Share your experience in the comments.