Building a Production-Grade Multistage Multimodal Recommender System on Amazon EKS

Why This Architecture Matters

Building a recommender system that works at scale is not just about choosing the right model—it's about designing a multistage pipeline that balances latency, accuracy, and operational complexity. This post breaks down a production-style system deployed on Amazon EKS, covering everything from data preparation to autoscaling.

Core Challenges Addressed

Cold-start: Anonymous users and new items get meaningful recommendations through feature masking during training.
Latency: In-memory caching of item features reduced feature lookup time by 99.7%.
Freshness: Daily fine-tuning pipelines update models without full retraining.
Scale: Autoscaling Triton Inference Server on Kubernetes handles fluctuating request loads.

The full architecture is open-source and available on GitHub.

System Overview

The system follows the classic retrieve-rank-rerank pattern, with a filtering step between retrieval and ranking:

Retrieval: Two-Tower model + FAISS ANN index for fast candidate generation.
Filtering: Bloom filter removes already-seen items.
Ranking: DLRM model scores candidates using user, item, and context features.
Reranking: Score-based diversity sampling for final recommendations.
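The filtering and reranking stages listed above can be illustrated with a small sketch. It is an illustration under assumptions, not the project's code: the class SeenItemsBloom, the function softmax_sample, the parameters num_bits, num_hashes and temperature, and the scores are all hypothetical.

```python
import hashlib

import numpy as np


class SeenItemsBloom:
    """Tiny Bloom filter: no false negatives, a small rate of false positives."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = np.zeros(num_bits, dtype=bool)

    def _positions(self, item_id):
        # Derive num_hashes bit positions by salting a single hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item_id}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item_id):
        for pos in self._positions(item_id):
            self.bits[pos] = True

    def __contains__(self, item_id):
        return all(self.bits[pos] for pos in self._positions(item_id))


def softmax_sample(item_ids, scores, k, temperature=1.0, rng=None):
    """Sample k distinct items, favouring higher scores via softmax(scores / T) weights."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(scores, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())  # subtract the max for numerical stability
    probs /= probs.sum()
    k = min(k, len(item_ids))
    chosen = rng.choice(len(item_ids), size=k, replace=False, p=probs)
    return [item_ids[i] for i in chosen]


# Items 101 and 102 were already shown to this (hypothetical) user.
seen = SeenItemsBloom()
for item in (101, 102):
    seen.add(item)

candidates = [101, 102, 103, 104, 105]
ranker_scores = {101: 2.0, 102: 1.5, 103: 1.2, 104: 0.7, 105: 0.1}  # made-up scores

unseen = [c for c in candidates if c not in seen]
recs = softmax_sample(unseen, [ranker_scores[c] for c in unseen], k=2,
                      rng=np.random.default_rng(0))
print(recs)
```

Sampling instead of taking a strict top-k is what gives the final list some diversity: a lower temperature moves the behaviour closer to plain top-k, a higher one closer to uniform sampling.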

Key Components and Code Patterns

1. Cold-Start Handling via Feature Masking

To make the model robust to unknown users, about 5% of training rows have their user features (user_id, gender, top_category) replaced with sentinel values, and an independently drawn 5% of rows have device_type masked:

import gc
import os

import cudf
import cupy

# Sentinel values for anonymous users and out-of-vocabulary (OOV) features
ANONYMOUS_USER = -1
OOV_GENDER = -1
OOV_TOP_CATEGORY = -1
OOV_DEVICE = -1
MASK_PROB = 0.05  # probability of masking a given row

# input_path and train_days are defined earlier in the pipeline
masked_train_dir = os.path.join(input_path, "masked_train")
os.makedirs(masked_train_dir, exist_ok=True)

for i in range(train_days):
    day = cudf.read_parquet(os.path.join(input_path, f"train_day_{i:02d}.parquet"))
    n = len(day)

    # Mask all user-side features together so the row looks like an anonymous user
    user_mask = cupy.random.random(n) < MASK_PROB
    day.loc[user_mask, "user_id"] = ANONYMOUS_USER
    day.loc[user_mask, "gender"] = OOV_GENDER
    day.loc[user_mask, "top_category"] = OOV_TOP_CATEGORY

    # Mask the device feature independently of the user mask
    device_mask = cupy.random.random(n) < MASK_PROB
    day.loc[device_mask, "device_type"] = OOV_DEVICE

    day.to_parquet(os.path.join(masked_train_dir, f"train_day_{i:02d}.parquet"), index=False)

    # Free GPU memory before loading the next day
    del day
    gc.collect()

This forces the model to rely on the remaining context features (such as time, and device type when it is not masked) and on the embeddings learned for the sentinel values, so even new or anonymous users get meaningful recommendations.
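Because the user mask and the device mask in the loop above are drawn independently, only about 0.05 × 0.05 = 0.25% of rows lose both, while roughly 9.75% lose at least one. The following CPU-only NumPy check is a sketch of that arithmetic; the row count and seed are arbitrary assumptions, and it does not touch the real training data.

```python
import numpy as np

MASK_PROB = 0.05      # same probability as in the masking loop
n_rows = 1_000_000    # hypothetical number of rows in one training day

rng = np.random.default_rng(seed=0)
user_mask = rng.random(n_rows) < MASK_PROB
device_mask = rng.random(n_rows) < MASK_PROB

frac_user = user_mask.mean()
frac_device = device_mask.mean()
frac_both = (user_mask & device_mask).mean()     # expected near 0.0025
frac_either = (user_mask | device_mask).mean()   # expected near 0.0975

print(f"user features masked:   {frac_user:.4f}")
print(f"device feature masked:  {frac_device:.4f}")
print(f"both masked:            {frac_both:.4f}")
print(f"at least one masked:    {frac_either:.4f}")
```

If fully anonymous requests with no device information are common in production, a joint mask or a higher probability might be worth testing, since very few training rows currently look like that.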

2. In-Memory Caching to Fix Latency Bottleneck

Profiling revealed that feast_item_lookup consumed 195 ms per request (52% of total latency). The fix was replacing network calls with a local NumPy cache:

import numpy as np


class FeastItemLookup:
    def __init__(self, feast_client, item_ids):
        # At initialization, fetch every item's features once and keep them in memory
        self.cache = {}
        for item_id in item_ids:
            features = feast_client.get_online_features(...)
            self.cache[item_id] = np.array(features)

    def lookup(self, item_ids):
        # O(1) dictionary lookup per item instead of a network call
        return np.array([self.cache[i] for i in item_ids])

Result: 99.7% latency reduction for feature lookup, 54% end-to-end improvement, and 310% throughput gain at concurrency=4.
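One possible variation on the cache described above is to keep all item features in a single 2-D array and gather rows with one indexing operation, rather than building the result from a Python list of per-item arrays. This is a sketch under assumptions, not the project's implementation: the class name DenseItemFeatureCache and the feature values are hypothetical, and it assumes every item has a fixed-length numeric feature vector.

```python
import numpy as np


class DenseItemFeatureCache:
    """Item features in one 2-D array, addressed through an id -> row index."""

    def __init__(self, item_ids, feature_matrix):
        feature_matrix = np.asarray(feature_matrix, dtype=np.float32)
        if len(item_ids) != feature_matrix.shape[0]:
            raise ValueError("one feature row is required per item id")
        self.features = feature_matrix
        self.row_of = {item_id: row for row, item_id in enumerate(item_ids)}

    def lookup(self, item_ids):
        # One dict probe per id, then a single vectorized gather.
        rows = np.fromiter((self.row_of[i] for i in item_ids),
                           dtype=np.int64, count=len(item_ids))
        return self.features[rows]


# Made-up values; in a real system they would be fetched once from the feature store.
cache = DenseItemFeatureCache(
    item_ids=[10, 20, 30],
    feature_matrix=[[0.1, 1.0], [0.2, 2.0], [0.3, 3.0]],
)
batch = cache.lookup([30, 10])
print(batch.shape)
```

Either layout trades memory and freshness for speed: features are only as current as the last load, so the cache needs to be rebuilt when item features change.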

3. Triton Inference Server Ensemble

The serving graph is a DAG of 14 models orchestrated by a Triton ensemble. The startup script explicitly loads each of them, plus the ensemble itself:

# Triton startup script
set -e
MODELS_DIR=${1:-"/model/triton_model_repository"}
echo "Starting Triton Inference Server"
echo "Models directory: $MODELS_DIR"

tritonserver \
  --model-repository="$MODELS_DIR" \
  --model-control-mode=explicit \
  --load-model=nvt_user_transform \
  --load-model=nvt_item_transform \
  --load-model=nvt_context_transform \
  --load-model=multimodal_embedding_lookup \
  --load-model=query_tower \
  --load-model=faiss_retrieval \
  --load-model=dlrm_ranking \
  --load-model=item_id_decoder \
  --load-model=feast_user_lookup \
  --load-model=feast_item_lookup \
  --load-model=filter_seen_items \
  --load-model=softmax_sampling \
  --load-model=context_preprocessor \
  --load-model=unroll_features \
  --load-model=ensemble_model

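With --model-control-mode=explicit, Triton loads only the models named with --load-model at startup. This keeps server startup predictable and makes it explicit which models a given deployment serves.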
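If the load list needs to be shared between the startup script and other tooling, the same command line can be generated from a single list. The helper below is a hypothetical sketch (build_triton_command is not part of Triton or of the project); it only reuses the flags and model names that appear in the script above.

```python
import shlex

# Model names copied from the startup script, in the same order.
MODELS = [
    "nvt_user_transform",
    "nvt_item_transform",
    "nvt_context_transform",
    "multimodal_embedding_lookup",
    "query_tower",
    "faiss_retrieval",
    "dlrm_ranking",
    "item_id_decoder",
    "feast_user_lookup",
    "feast_item_lookup",
    "filter_seen_items",
    "softmax_sampling",
    "context_preprocessor",
    "unroll_features",
    "ensemble_model",
]


def build_triton_command(models_dir, models):
    """Return the tritonserver argument list for explicit model control."""
    if len(set(models)) != len(models):
        raise ValueError("duplicate model names in load list")
    cmd = [
        "tritonserver",
        f"--model-repository={models_dir}",
        "--model-control-mode=explicit",
    ]
    cmd += [f"--load-model={name}" for name in models]
    return cmd


command = build_triton_command("/model/triton_model_repository", MODELS)
print(shlex.join(command))
```

The returned list can be passed to subprocess.run as-is, which avoids shell quoting issues if a repository path ever contains spaces.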

Figure: Data pipeline diagram showing feature engineering with NVTabular and Kubeflow for recommender models.

Limitations and Critical Considerations

1. Context Only at Ranking Stage

Currently, request-side context (device, time) is only used by the ranker, not the retriever. This means items that are relevant for a given context may never be retrieved, and the ranker cannot recover candidates it never sees. Fix: Add context features to the query tower. This has been implemented in a separate branch.
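To make the idea of a context-aware query tower concrete, here is a minimal NumPy sketch. It is not the implementation from the separate branch: the dimensions, the randomly initialised tables, and the functions query_vector and retrieve are all assumptions, and exact inner-product search stands in for the FAISS index.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 8                                  # hypothetical output size shared by both towers
N_USERS, N_DEVICES, N_HOURS = 100, 3, 24     # hypothetical vocabulary sizes

# Randomly initialised tables stand in for trained parameters.
user_table = rng.normal(size=(N_USERS, 16))
device_table = rng.normal(size=(N_DEVICES, 4))
hour_table = rng.normal(size=(N_HOURS, 4))
W = rng.normal(size=(16 + 4 + 4, EMB_DIM))

item_vectors = rng.normal(size=(1000, EMB_DIM))  # stand-in for item-tower outputs


def query_vector(user_id, device_id, hour):
    """Concatenate user and context embeddings, project, and L2-normalise."""
    x = np.concatenate([user_table[user_id], device_table[device_id], hour_table[hour]])
    q = np.tanh(x @ W)
    return q / np.linalg.norm(q)


def retrieve(q, k=10):
    """Exact top-k by inner product; an ANN index would approximate this."""
    scores = item_vectors @ q
    return np.argsort(-scores)[:k]


# Same user, two different request contexts.
q_mobile_morning = query_vector(user_id=7, device_id=0, hour=8)
q_desktop_evening = query_vector(user_id=7, device_id=2, hour=21)

print(retrieve(q_mobile_morning))
print(retrieve(q_desktop_evening))
```

Because the context embeddings enter the query vector, the same user can get a different candidate set on a different device or at a different time, which is exactly what a ranker-only use of context cannot achieve.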

2. Data Versioning and Reproducibility

The training pipeline lacks explicit data snapshots. While a git commit pins code, the same commit with new interaction data may produce different results. Recommended: Use tools like DVC or LakeFS for data versioning.
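Short of adopting DVC or LakeFS, a lightweight stopgap is to record a content fingerprint of the training files next to each run, so that two runs from the same commit can at least be told apart. The sketch below is a stand-in for a proper data versioning tool, not a replacement for one; the function names are hypothetical, and the demo files contain placeholder bytes rather than real Parquet data.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dataset_manifest(data_dir, pattern="*.parquet"):
    """Map each matching file name to its SHA-256 digest."""
    files = sorted(Path(data_dir).glob(pattern))
    return {p.name: file_sha256(p) for p in files}


def dataset_fingerprint(manifest):
    """Single digest over the manifest; changes if a file is added, removed or edited."""
    payload = json.dumps(manifest, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


# Demo with placeholder files (not real Parquet content).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "train_day_00.parquet").write_bytes(b"day 0")
    before = dataset_fingerprint(dataset_manifest(tmp))

    Path(tmp, "train_day_01.parquet").write_bytes(b"day 1")
    after = dataset_fingerprint(dataset_manifest(tmp))

print(before != after)
```

Logging the fingerprint together with the git commit hash gives a minimal code-plus-data identifier for each training run.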

3. No Online Quality Monitoring

Latency and throughput are tracked, but recommendation quality (e.g., CTR, engagement) is not monitored in production. Performance drift can go undetected. Next step: Integrate a monitoring component that triggers alerts when online metrics deviate from baselines.
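As a rough illustration of such a monitoring component, the sketch below tracks a rolling click-through rate and flags a deviation from a baseline. The class CtrMonitor, the window size, the 20% relative tolerance, and the click pattern are all assumptions chosen for the example; a production setup would more likely export the metric to an existing monitoring stack and alert from there.

```python
from collections import deque


class CtrMonitor:
    """Rolling click-through rate over the last `window` impressions."""

    def __init__(self, baseline_ctr, window=1000, rel_tolerance=0.2):
        self.baseline_ctr = baseline_ctr
        self.rel_tolerance = rel_tolerance
        self.events = deque(maxlen=window)

    def record(self, clicked):
        self.events.append(1 if clicked else 0)

    @property
    def ctr(self):
        return sum(self.events) / len(self.events) if self.events else None

    def drifted(self):
        """True once the window is full and CTR is outside baseline ± tolerance."""
        if len(self.events) < self.events.maxlen:
            return False
        allowed = self.rel_tolerance * self.baseline_ctr
        return abs(self.ctr - self.baseline_ctr) > allowed


monitor = CtrMonitor(baseline_ctr=0.10, window=100, rel_tolerance=0.2)

for i in range(100):
    monitor.record(clicked=(i % 10 == 0))   # 10 clicks in 100 impressions
healthy = monitor.drifted()

for i in range(100):
    monitor.record(clicked=(i % 25 == 0))   # 4 clicks in 100 impressions
degraded = monitor.drifted()

print(healthy, degraded, monitor.ctr)
```

The monitor stays silent until the window is full, which avoids alerts on the first handful of impressions after a deployment.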

4. Session Modeling Gap

The current top_category feature is a crude proxy for short-term interest. A session-based transformer encoder would capture richer behavioral patterns.

Future Work

Context-aware retrieval: Add device/time features to the query tower for better candidate quality.
Experiment tracking: Integrate MLflow or Weights & Biases to compare model variants.
Model registry: Formalize lineage between training runs, data, and deployed models.
Online evaluation: Add an A/B testing framework for monitoring recommendation quality.

Figure: Kubernetes HPA and Karpenter autoscaling Triton Inference Server pods for production ML serving.

Conclusion

This architecture demonstrates a production-ready approach to building recommender systems that are:

Fast: In-memory caching and multistage design keep latency under control.
Fresh: Daily fine-tuning adapts to new user behavior without full retraining.
Scalable: Kubernetes HPA scales the Triton pods, and Karpenter provisions GPU nodes on demand.
Robust: Cold-start handling and Bloom-filter removal of already-seen items prevent common failure modes.

The complete code is available in the GitHub repository. For more on how large-scale AI platforms handle optimization, see this analysis of Meta's Unified AI Agent Platform.

Next Steps for Learning

Dive deeper into Two-Tower models with the NVIDIA Merlin documentation.
Explore Kubeflow Pipelines for MLOps automation.
Experiment with session-based models for short-term interest modeling.

This article is based on a production-grade open-source project. All code and configurations are shared for educational purposes.

This content was drafted using AI tools based on reliable sources, and has been reviewed by our editorial team before publication. It is not intended to replace professional advice.
