The Hidden Cost of “How Different Is This New Title?”

Every time you open Netflix, the homepage is personalized by a service called Ranker. One of its most expensive features is video serendipity scoring — a simple question: “How different is this new title from what you’ve been watching?” That single feature consumed 7.5% of total CPU on every node.

At Netflix’s scale, that’s a massive operational cost. The team set out to optimize it, and the journey — from naive loops to batching, flat buffers, and finally the JDK Vector API — is a masterclass in applied systems engineering.


The Hotspot: Nested Loops and Poor Cache Locality

The original implementation was straightforward but expensive:

for (Video candidate : candidates) {
    Vector c = embedding(candidate);
    double maxSim = -1.0;
    for (Video h : history) {
        Vector v = embedding(h);
        double sim = cosine(c, v);
        maxSim = Math.max(maxSim, sim);
    }
    double serendipity = 1.0 - maxSim;
    emitFeature(candidate, serendipity);
}

This is O(M×N) separate dot products — one per candidate-history pair. Each call fetches an embedding, does a scalar dot product, and stores the result. The memory access pattern is scattered, causing poor cache locality. A flamegraph confirmed this was the top hotspot.
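For reference, here is a minimal sketch of what each `cosine(c, v)` call in that loop has to do. The class and method names are hypothetical and the embeddings are shown as plain `double[]` arrays, which is an assumption; the original service's `Vector` type is not shown in the source.

```java
// Hypothetical sketch: scalar cosine similarity and the serendipity score built on it.
final class CosineSketch {

    /** Cosine similarity of two equal-length vectors; returns 0.0 if either has zero norm. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            normA += a[k] * a[k];
            normB += b[k] * b[k];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // assumption: treat zero vectors as "no similarity"
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Serendipity = 1 - max cosine similarity between the candidate and any history item. */
    static double serendipity(double[] candidate, double[][] history) {
        double maxSim = -1.0;
        for (double[] h : history) {
            maxSim = Math.max(maxSim, cosine(candidate, h));
        }
        return 1.0 - maxSim;
    }
}
```

Every pair recomputes both norms and walks two separately allocated arrays, which is part of why the per-pair version is costly.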


Step 1: Batching – From Nested Loops to Matrix Multiply

The first insight: treat the problem as a single matrix multiplication. If D is the embedding dimension:

  • Pack all candidate embeddings into matrix A (M×D)
  • Pack all history embeddings into matrix B (N×D)
  • Normalize rows to unit length
  • Compute C = A × Bᵀ (M×N cosine similarities)

// Build matrices
double[][] A = new double[M][D]; // candidates, M x D
double[][] B = new double[N][D]; // history, N x D

for (int i = 0; i < M; i++) {
    A[i] = embedding(candidates[i]).toArray();
}
for (int j = 0; j < N; j++) {
    B[j] = embedding(history[j]).toArray();
}

// Normalize rows to unit vectors so that dot products equal cosine similarities
normalizeRows(A);
normalizeRows(B);

// Compute C = A * B^T (matmul is assumed to transpose its second argument)
double[][] C = matmul(A, B);

// Derive serendipity from the largest similarity in each row of C
for (int i = 0; i < M; i++) {
    double maxSim = -1.0;
    for (int j = 0; j < N; j++) {
        maxSim = Math.max(maxSim, C[i][j]);
    }
    double serendipity = 1.0 - maxSim;
    emitFeature(candidates[i], serendipity);
}
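
The `normalizeRows` and `matmul` helpers are not defined in the snippet. A minimal sketch of what they might look like on `double[][]` follows; the class name and `matmulTransposed` are hypothetical stand-ins, and skipping all-zero rows is an assumption rather than something the source specifies.

```java
// Hypothetical sketch of the helpers behind the batched version.
final class BatchedSimilaritySketch {

    /** Scales each row to unit length in place; all-zero rows are left untouched. */
    static void normalizeRows(double[][] m) {
        for (double[] row : m) {
            double sumSq = 0.0;
            for (double v : row) {
                sumSq += v * v;
            }
            if (sumSq == 0.0) {
                continue;
            }
            double inv = 1.0 / Math.sqrt(sumSq);
            for (int k = 0; k < row.length; k++) {
                row[k] *= inv;
            }
        }
    }

    /** Computes C = A * B^T where A is M x D and B is N x D; the result is M x N. */
    static double[][] matmulTransposed(double[][] a, double[][] b) {
        int m = a.length;
        int n = b.length;
        double[][] c = new double[m][n];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                double dot = 0.0;
                for (int k = 0; k < a[i].length; k++) {
                    dot += a[i][k] * b[j][k];
                }
                c[i][j] = dot;
            }
        }
        return c;
    }
}
```

Note that this naive triple loop performs the same arithmetic as the nested version; the batched shape only pays off once layout and the inner kernel are tuned.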

This turns O(M×N) separate dot products into a single matrix multiply — exactly what CPUs are optimized for. But the first implementation caused a 5% regression. Why?


Step 2: When Batching Isn’t Enough – Memory Layout Matters

The problem wasn’t the algorithm. It was the implementation details:

  • double[][] is non-contiguous memory → pointer chasing, poor cache behavior
  • Large per-request allocations → GC pressure
  • Scalar Java matrix multiply → no SIMD

Lesson: Algorithmic improvements don’t matter if memory layout and allocation strategy work against you.

Step 3: Flat Buffers & ThreadLocal Reuse

They reworked the data layout to flat double[] buffers in row-major order, and used ThreadLocal to reuse buffers across requests:

class BufferHolder {
    double[] candidatesFlat = new double[0];
    double[] historyFlat = new double[0];

    double[] getCandidatesFlat(int required) {
        if (candidatesFlat.length < required) {
            candidatesFlat = new double[required];
        }
        return candidatesFlat;
    }

    double[] getHistoryFlat(int required) {
        if (historyFlat.length < required) {
            historyFlat = new double[required];
        }
        return historyFlat;
    }
}

private static final ThreadLocal<BufferHolder> threadBuffers =
    ThreadLocal.withInitial(BufferHolder::new);

This eliminated per-request allocations and improved cache locality.
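
To make the layout concrete, here is a minimal sketch of packing embeddings into a reusable row-major buffer, where element k of row i lives at index i*D + k. The class and method names are hypothetical, and it uses a single thread-local array instead of the holder class for brevity.

```java
// Hypothetical sketch: pack M embeddings of dimension D into one reusable flat buffer.
final class FlatPackSketch {

    // One buffer per thread; it grows only when a request needs more room.
    private static final ThreadLocal<double[]> CANDIDATES =
        ThreadLocal.withInitial(() -> new double[0]);

    /**
     * Copies the rows into the thread's buffer in row-major order and returns it.
     * The returned array may be longer than rows * d; only the first rows * d
     * entries are valid for the current request.
     */
    static double[] packRowMajor(double[][] embeddings, int d) {
        int required = embeddings.length * d;
        double[] flat = CANDIDATES.get();
        if (flat.length < required) {
            flat = new double[required];
            CANDIDATES.set(flat);
        }
        for (int i = 0; i < embeddings.length; i++) {
            System.arraycopy(embeddings[i], 0, flat, i * d, d);
        }
        return flat;
    }
}
```

Because the buffer is reused, callers need to carry the logical size (M and D) alongside it rather than relying on the array length.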

Step 4: BLAS – Great in Tests, Not in Production

They tried BLAS (Basic Linear Algebra Subprograms). Microbenchmarks looked great, but in production:

  • Default netlib-java used F2J (Fortran-to-Java) BLAS, not native
  • JNI transitions added overhead
  • Java row-major vs BLAS column-major required conversions and temporary buffers

Result: Gains didn’t materialize.
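
The layout mismatch is easy to underestimate. As a rough illustration (hypothetical names, not the team's code), converting a row-major buffer to the column-major order that BLAS routines conventionally expect means a full copy into a temporary array:

```java
// Hypothetical sketch: the extra copy a row-major to column-major conversion implies.
final class LayoutSketch {

    /** Returns a new column-major copy of a rows x cols matrix stored row-major. */
    static double[] toColumnMajor(double[] rowMajor, int rows, int cols) {
        double[] colMajor = new double[rows * cols]; // temporary buffer per conversion
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                colMajor[c * rows + r] = rowMajor[r * cols + c];
            }
        }
        return colMajor;
    }
}
```

A copy like this on every request works against the allocation savings from the previous step.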

Step 5: JDK Vector API – Pure Java SIMD

The final piece: replace BLAS with a pure-Java SIMD implementation using the JDK Vector API (incubating). This lets you write data-parallel operations that the JIT maps to SSE/AVX2/AVX-512 instructions — no JNI, no native dependencies.

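// Vector API inner loop (simplified)
// Assumes: static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;
for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
        DoubleVector acc = DoubleVector.zero(SPECIES);
        int k = 0;
        // Process SPECIES.length() doubles per iteration
        for (; k + SPECIES.length() <= D; k += SPECIES.length()) {
            DoubleVector a = DoubleVector.fromArray(SPECIES, candidatesFlat, i*D + k);
            DoubleVector b = DoubleVector.fromArray(SPECIES, historyFlat, j*D + k);
            acc = a.fma(b, acc);  // acc = a * b + acc (fused multiply-add)
        }
        double dot = acc.reduceLanes(VectorOperators.ADD);
        // Scalar tail for the remaining elements k..D-1
        for (; k < D; k++) {
            dot += candidatesFlat[i*D + k] * historyFlat[j*D + k];
        }
        similaritiesFlat[i*N + j] = dot;
    }
}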
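
Once `similaritiesFlat` is filled, the serendipity score for each candidate is a max over one contiguous row. A minimal sketch, assuming the M×N row-major layout used above (the class and method names are hypothetical):

```java
// Hypothetical sketch: reduce each row of the flat M x N similarity buffer to a score.
final class SerendipityFromFlat {

    /** Returns 1 - max(similarity) for each of the m candidates. */
    static double[] serendipity(double[] similaritiesFlat, int m, int n) {
        double[] out = new double[m];
        for (int i = 0; i < m; i++) {
            double maxSim = -1.0; // cosine similarity is never below -1
            int base = i * n;     // start of row i in the flat buffer
            for (int j = 0; j < n; j++) {
                maxSim = Math.max(maxSim, similaritiesFlat[base + j]);
            }
            out[i] = 1.0 - maxSim;
        }
        return out;
    }
}
```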

At class load time, a factory selects the best implementation:

  • Vector API if available (needs --add-modules=jdk.incubator.vector)
  • Otherwise, a highly optimized scalar fallback (inspired by Lucene)
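
One way such a factory could be structured is sketched below. This is an illustration under assumptions, not the team's implementation: the interface, the class names, and the reflectively loaded `VectorDotKernel` are all hypothetical, and the unrolled scalar loop is only in the spirit of the Lucene-style fallback the source mentions.

```java
// Hypothetical sketch: pick a dot-product kernel once, falling back to scalar code.
final class KernelSelectorSketch {

    interface DotKernel {
        double dot(double[] a, int aOff, double[] b, int bOff, int len);
    }

    /** Scalar fallback with four independent accumulators to expose instruction-level parallelism. */
    static final class ScalarUnrolled implements DotKernel {
        @Override
        public double dot(double[] a, int aOff, double[] b, int bOff, int len) {
            double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
            int k = 0;
            int bound = len & ~3; // largest multiple of 4 not exceeding len
            for (; k < bound; k += 4) {
                s0 += a[aOff + k] * b[bOff + k];
                s1 += a[aOff + k + 1] * b[bOff + k + 1];
                s2 += a[aOff + k + 2] * b[bOff + k + 2];
                s3 += a[aOff + k + 3] * b[bOff + k + 3];
            }
            double sum = (s0 + s1) + (s2 + s3);
            for (; k < len; k++) {
                sum += a[aOff + k] * b[bOff + k];
            }
            return sum;
        }
    }

    /** True when the JVM was started with the incubator module resolved. */
    static boolean vectorModulePresent() {
        return ModuleLayer.boot().findModule("jdk.incubator.vector").isPresent();
    }

    /** Tries the (hypothetical) Vector API kernel first, otherwise returns the scalar one. */
    static DotKernel select() {
        if (vectorModulePresent()) {
            try {
                // "VectorDotKernel" is a placeholder for a separately compiled class.
                return (DotKernel) Class.forName("VectorDotKernel")
                    .getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException | LinkageError | ClassCastException e) {
                // Fall through to the scalar implementation.
            }
        }
        return new ScalarUnrolled();
    }

    // Selected once when this class is initialized.
    static final DotKernel KERNEL = select();
}
```

Loading the vectorized class reflectively keeps the scalar path usable on JVMs where the incubator module is not resolved, since nothing on that path links against `jdk.incubator.vector`.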

Production Results

Metric            Before     After    Improvement
CPU (feature)     7.5%       ~1%      -87%
CPU/RPS           baseline   -10%     -10%
Average latency   baseline   -12%     -12%
At the assembly level, the shift was clear: from loop-unrolled scalar dot products to vectorized matrix multiply on AVX-512.

Limitations & Caveats

  • Vector API is still incubating (requires runtime flag). The fallback path is essential for safety.
  • Not all workloads benefit. This optimization works because the hot loop is dominated by large numbers of dot products on contiguous double[] buffers.
  • Benchmarking must include production context. Microbenchmarks for BLAS looked great, but real-world gains depended on memory layout and allocation patterns.

Next Steps

If you’re considering the Vector API for your service:

  1. Profile first – confirm your hotspot is a data-parallel loop.
  2. Fix memory layout before touching compute kernels. Flat buffers and buffer reuse can account for a large share of the gain.
  3. Design a fallback – Vector API is not yet stable across all JVM versions.
  4. Measure at the system level – CPU/RPS and latency, not just microbenchmarks.


This article is based on a Netflix Tech Blog post by Harshad Sane and the Performance Engineering team.


Conclusion

This optimization wasn’t about finding the “fastest library.” It was about getting the fundamentals right:

  • Algorithmic shape – batching turned O(M×N) dot products into a single matrix multiply
  • Memory layout – flat buffers and ThreadLocal reuse eliminated GC pressure and improved cache locality
  • Compute kernel – JDK Vector API provided pure-Java SIMD without JNI overhead

When those pieces aligned, the Vector API became a natural fit, delivering a 10% reduction in cluster footprint with readable, maintainable Java code.


