Why This Architecture Matters
Building a recommender system that works at scale is not just about choosing the right model—it's about designing a multistage pipeline that balances latency, accuracy, and operational complexity. This post breaks down a production-style system deployed on Amazon EKS, covering everything from data preparation to autoscaling.
Core Challenges Addressed
- Cold-start: Anonymous users and new items get meaningful recommendations through feature masking during training.
- Latency: In-memory caching of item features reduced feature lookup time by 99.7%.
- Freshness: Daily fine-tuning pipelines update models without full retraining.
- Scale: Autoscaling Triton Inference Server on Kubernetes handles fluctuating request loads.
The full architecture is open-source and available on GitHub.
System Overview
The system follows the classic retrieve-rank-rerank pattern, with a filtering step between retrieval and ranking:
- Retrieval: Two-Tower model + FAISS ANN index for fast candidate generation.
- Filtering: Bloom filter removes already-seen items.
- Ranking: DLRM model scores candidates using user, item, and context features.
- Reranking: Score-based diversity sampling for final recommendations.
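To make the filtering stage concrete, here is a minimal sketch of a Bloom filter used to drop already-seen items from a candidate list. It is an illustration under stated assumptions, not the project's implementation: the `BloomFilter` class, the `filter_seen` helper, the bit-array size, and the number of hash functions are all hypothetical choices.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, small false-positive rate."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        # Sizes are illustrative assumptions, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item_id: int):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item_id}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item_id: int) -> None:
        for pos in self._positions(item_id):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item_id: int) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item_id)
        )


def filter_seen(candidates, seen_filter):
    """Keep only candidates the filter does not report as seen."""
    return [c for c in candidates if c not in seen_filter]


seen = BloomFilter()
for item in (101, 205, 309):
    seen.add(item)

fresh = filter_seen([101, 150, 205, 400], seen)
```

The trade-off is that a Bloom filter can occasionally report an unseen item as seen, so a small number of valid candidates may be dropped, but it never lets a seen item through and uses far less memory than storing full interaction histories.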
Key Components and Code Patterns
1. Cold-Start Handling via Feature Masking
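To make the model robust to unknown users, the user-level features (user_id, gender, top_category) are replaced with sentinel values in a random 5% of training rows. The device_type feature is masked independently with the same probability: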
import gc
import os

import cudf
import cupy

# Sentinel values for anonymous users and out-of-vocabulary (OOV) features
ANONYMOUS_USER = -1
OOV_GENDER = -1
OOV_TOP_CATEGORY = -1
OOV_DEVICE = -1

# input_path and train_days are assumed to be defined earlier in the pipeline
masked_train_dir = os.path.join(input_path, "masked_train")
os.makedirs(masked_train_dir, exist_ok=True)

for i in range(train_days):
    day = cudf.read_parquet(os.path.join(input_path, f"train_day_{i:02d}.parquet"))
    n = len(day)

    # Mask all user-level features together with 5% probability
    user_mask = cupy.random.random(n) < 0.05
    day.loc[user_mask, "user_id"] = ANONYMOUS_USER
    day.loc[user_mask, "gender"] = OOV_GENDER
    day.loc[user_mask, "top_category"] = OOV_TOP_CATEGORY

    # Mask device type independently, also with 5% probability
    device_mask = cupy.random.random(n) < 0.05
    day.loc[device_mask, "device_type"] = OOV_DEVICE

    day.to_parquet(os.path.join(masked_train_dir, f"train_day_{i:02d}.parquet"), index=False)

    # Free GPU memory before loading the next day
    del day
    gc.collect()
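This forces the model to learn useful embeddings for the sentinel values and to lean on whatever context is still available (device type, time) when user features are missing, so even new or anonymous users get meaningful, context-aware results.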
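Masking at training time only pays off if the serving path feeds the same sentinel values for unknown users. The sketch below shows one way that mapping could look. It is an assumption about the serving side, not code from the project: `build_request_features`, `user_store`, and `KNOWN_DEVICES` are hypothetical names introduced for illustration.

```python
# Sentinels must match the values used when masking the training data.
ANONYMOUS_USER = -1
OOV_GENDER = -1
OOV_TOP_CATEGORY = -1
OOV_DEVICE = -1

# Hypothetical set of encoded device types seen during training.
KNOWN_DEVICES = {0, 1, 2}


def build_request_features(user_id, user_store, device_type):
    """Map a request to model inputs, falling back to sentinels when unknown."""
    profile = user_store.get(user_id) if user_id is not None else None
    if profile is None:
        # Anonymous or never-seen user: use the same sentinels as in training.
        user_feats = {
            "user_id": ANONYMOUS_USER,
            "gender": OOV_GENDER,
            "top_category": OOV_TOP_CATEGORY,
        }
    else:
        user_feats = {
            "user_id": user_id,
            "gender": profile["gender"],
            "top_category": profile["top_category"],
        }
    device = device_type if device_type in KNOWN_DEVICES else OOV_DEVICE
    return {**user_feats, "device_type": device}


user_store = {42: {"gender": 1, "top_category": 7}}
known = build_request_features(42, user_store, device_type=0)
anonymous = build_request_features(None, user_store, device_type=9)
```

Here `known` carries the stored profile, while `anonymous` is filled entirely with sentinel values, which is exactly the input pattern the masked training rows prepared the model for.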
2. In-Memory Caching to Fix Latency Bottleneck
Profiling revealed that feast_item_lookup consumed 195 ms per request (52% of total latency). The fix was replacing network calls with a local NumPy cache:
import numpy as np

class FeastItemLookup:
    def __init__(self, feast_client, item_ids):
        # At initialization, load all item features once into memory
        self.cache = {}
        for item_id in item_ids:
            features = feast_client.get_online_features(...)
            self.cache[item_id] = np.array(features)

    def lookup(self, item_ids):
        # O(1) in-memory lookup per item instead of a network call
        return np.array([self.cache[i] for i in item_ids])
Result: 99.7% latency reduction for feature lookup, 54% end-to-end improvement, and 310% throughput gain at concurrency=4.
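As a variation on the same idea, the cache can be stored as one dense NumPy matrix so that a batch lookup becomes a single fancy-indexing operation, with a fallback row for item IDs that are missing from the cache. This is a sketch under assumptions, not the project's code: `DenseItemCache` and the all-zero fallback row are hypothetical choices, and the features are passed in directly rather than fetched from Feast.

```python
import numpy as np


class DenseItemCache:
    """Item features in one matrix, plus a zero row for unknown item ids."""

    def __init__(self, item_ids, feature_matrix):
        features = np.asarray(feature_matrix, dtype=np.float32)
        # Append an all-zero fallback row used for ids missing from the cache.
        fallback = np.zeros((1, features.shape[1]), dtype=np.float32)
        self.features = np.vstack([features, fallback])
        self.row_of = {int(item_id): row for row, item_id in enumerate(item_ids)}
        self.fallback_row = len(item_ids)

    def lookup(self, item_ids):
        # Translate ids to row indices, then gather all rows in one operation.
        rows = np.fromiter(
            (self.row_of.get(int(i), self.fallback_row) for i in item_ids),
            dtype=np.int64,
            count=len(item_ids),
        )
        return self.features[rows]


cache = DenseItemCache(item_ids=[10, 20], feature_matrix=[[1.0, 2.0], [3.0, 4.0]])
batch = cache.lookup([20, 99, 10])  # 99 is not cached and maps to the zero row
```

Returning a fallback row instead of raising `KeyError` keeps a request alive when a freshly added item has not yet reached the cache, at the cost of scoring that item on empty features until the next reload.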
3. Triton Inference Server Ensemble
The serving graph is a DAG of 14 models orchestrated by Triton through an ensemble model. The startup script loads each of them explicitly:
# Triton startup script
set -e

MODELS_DIR=${1:-"/model/triton_model_repository"}

echo "Starting Triton Inference Server"
echo "Models directory: $MODELS_DIR"

tritonserver \
    --model-repository="$MODELS_DIR" \
    --model-control-mode=explicit \
    --load-model=nvt_user_transform \
    --load-model=nvt_item_transform \
    --load-model=nvt_context_transform \
    --load-model=multimodal_embedding_lookup \
    --load-model=query_tower \
    --load-model=faiss_retrieval \
    --load-model=dlrm_ranking \
    --load-model=item_id_decoder \
    --load-model=feast_user_lookup \
    --load-model=feast_item_lookup \
    --load-model=filter_seen_items \
    --load-model=softmax_sampling \
    --load-model=context_preprocessor \
    --load-model=unroll_features \
    --load-model=ensemble_model
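With --model-control-mode=explicit, Triton loads only the models listed with --load-model, so server startup is predictable, nothing else in the repository is served by accident, and model rollouts stay under explicit control.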
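After startup, it can be useful to confirm that every expected model actually loaded. Triton exposes a per-model readiness endpoint over HTTP at `/v2/models/<name>/ready`, and the sketch below polls it for a list of model names. The helper names, the subset of models checked, and the base URL (Triton's default HTTP port 8000 on localhost) are assumptions for illustration, not part of the project.

```python
import urllib.error
import urllib.request

# A subset of the models from the startup script, chosen for illustration.
EXPECTED_MODELS = ["query_tower", "faiss_retrieval", "dlrm_ranking", "ensemble_model"]


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:
        # Covers URLError, connection refused, and timeouts.
        return 0


def unready_models(base_url, model_names, get_status=http_status):
    """List the models whose readiness endpoint does not answer with 200."""
    base = base_url.rstrip("/")
    return [
        name
        for name in model_names
        if get_status(f"{base}/v2/models/{name}/ready") != 200
    ]


# Example call against a running server (not executed here):
# missing = unready_models("http://localhost:8000", EXPECTED_MODELS)
```

Passing `get_status` as a parameter keeps the check testable without a live server, and the same function could back a Kubernetes readiness probe or a deployment smoke test.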

Limitations and Critical Considerations
1. Context Only at Ranking Stage
Currently, request-side context (device, time) is only used by the ranker, not the retriever. This means candidates that are relevant for a given context may never be retrieved, so the ranker cannot recover them however well it scores. Fix: Add context features to the query tower; this has been implemented in a separate branch.
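The idea of a context-aware query tower can be sketched in a few lines of NumPy: the query vector is computed from the user embedding together with device and time features, so the same user retrieves different candidates in different contexts. Everything here is an illustrative assumption rather than the project's model: the tower is a single untrained linear layer with random weights, the dimensions are arbitrary, and a brute-force inner-product search stands in for the FAISS index.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 8       # hypothetical embedding width
NUM_DEVICES = 3   # hypothetical number of device types
NUM_ITEMS = 100   # hypothetical catalog size


def query_tower(user_emb, device_id, hour, weights):
    """One linear layer over [user embedding | one-hot device | cyclic hour]."""
    device_onehot = np.eye(NUM_DEVICES, dtype=np.float32)[device_id]
    angle = 2 * np.pi * hour / 24.0
    time_feats = np.array([np.sin(angle), np.cos(angle)], dtype=np.float32)
    x = np.concatenate([user_emb, device_onehot, time_feats])
    q = weights @ x
    return q / np.linalg.norm(q)


def top_k(query, item_embs, k):
    """Brute-force inner-product search, returning indices by descending score."""
    scores = item_embs @ query
    idx = np.argpartition(-scores, k - 1)[:k]
    return idx[np.argsort(-scores[idx])]


# Random, untrained parameters purely for demonstration.
weights = rng.normal(size=(EMB_DIM, EMB_DIM + NUM_DEVICES + 2)).astype(np.float32)
item_embs = rng.normal(size=(NUM_ITEMS, EMB_DIM)).astype(np.float32)
user_emb = rng.normal(size=EMB_DIM).astype(np.float32)

q_mobile_morning = query_tower(user_emb, device_id=0, hour=8, weights=weights)
q_desktop_night = query_tower(user_emb, device_id=2, hour=23, weights=weights)
candidates_mobile = top_k(q_mobile_morning, item_embs, k=10)
candidates_desktop = top_k(q_desktop_night, item_embs, k=10)
```

Because the context enters the query vector itself, the item embeddings and the ANN index can stay unchanged; only the query tower needs retraining and redeployment.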
2. Data Versioning and Reproducibility
The training pipeline lacks explicit data snapshots. While a git commit pins code, the same commit with new interaction data may produce different results. Recommended: Use tools like DVC or LakeFS for data versioning.
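Short of adopting a full data-versioning tool, a training run can at least record a content hash of its input files next to the git commit. The sketch below is a lightweight stand-in for that idea, not a replacement for DVC or LakeFS: `build_manifest`, the `*.parquet` pattern, and the manifest layout are hypothetical choices.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large Parquet files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir, pattern="*.parquet"):
    """Return per-file hashes plus one snapshot id covering the whole directory."""
    files = sorted(Path(data_dir).glob(pattern))
    entries = {f.name: file_sha256(f) for f in files}
    # The snapshot id changes if any file is added, removed, or modified.
    snapshot_id = hashlib.sha256(
        json.dumps(entries, sort_keys=True).encode()
    ).hexdigest()
    return {"snapshot_id": snapshot_id, "files": entries}
```

Logging the resulting `snapshot_id` with each training run makes it possible to tell whether two runs at the same commit saw the same data, even though it does not store the data itself.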
3. No Online Quality Monitoring
Latency and throughput are tracked, but recommendation quality (e.g., CTR, engagement) is not monitored in production. Performance drift can go undetected. Next step: Integrate a monitoring component that triggers alerts when online metrics deviate from baselines.
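A first version of such a check can be very simple: compare the click-through rate of a recent window against a baseline and raise an alert when the relative drop exceeds a threshold. The sketch below illustrates that logic only. The function name, the 20% threshold, and the minimum sample size are assumptions, and a production monitor would also need proper statistical testing and a source for the click and impression counts.

```python
from dataclasses import dataclass


@dataclass
class DriftResult:
    ctr: float
    relative_change: float
    alert: bool


def check_ctr_drift(clicks, impressions, baseline_ctr,
                    max_relative_drop=0.2, min_impressions=1000):
    """Flag a window whose CTR fell more than max_relative_drop below baseline."""
    if impressions < min_impressions:
        # Too little traffic to judge: report no alert rather than a noisy one.
        return DriftResult(ctr=float("nan"), relative_change=0.0, alert=False)
    ctr = clicks / impressions
    relative_change = (ctr - baseline_ctr) / baseline_ctr
    return DriftResult(ctr, relative_change, alert=relative_change < -max_relative_drop)


healthy = check_ctr_drift(clicks=480, impressions=10_000, baseline_ctr=0.05)
degraded = check_ctr_drift(clicks=300, impressions=10_000, baseline_ctr=0.05)
```

In this example `healthy` is 4% below baseline and does not alert, while `degraded` is 40% below baseline and does.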
4. Session Modeling Gap
The current top_category feature is a crude proxy for short-term interest. A session-based transformer encoder would capture richer behavioral patterns.
Future Work
- Context-aware retrieval: Add device/time features to the query tower for better candidate quality.
- Experiment tracking: Integrate MLflow or Weights & Biases to compare model variants.
- Model registry: Formalize lineage between training runs, data, and deployed models.
- Online evaluation: Add an A/B testing framework for monitoring recommendation quality.

Conclusion
This architecture demonstrates a production-ready approach to building recommender systems that are:
- Fast: In-memory caching and multistage design keep latency under control.
- Fresh: Daily fine-tuning adapts to new user behavior without full retraining.
- Scalable: Kubernetes HPA scales Triton pods, and Karpenter provisions GPU nodes on demand.
- Robust: Cold-start handling and Bloom-filter-based seen-item filtering prevent common failure modes.
The complete code is available in the GitHub repository.
Next Steps for Learning
- Dive deeper into Two-Tower models with the NVIDIA Merlin documentation.
- Explore Kubeflow Pipelines for MLOps automation.
- Experiment with session-based models for short-term interest modeling.
This article is based on a production-grade open-source project. All code and configurations are shared for educational purposes.