Introduction: The 24-Hour Challenge
Training a competitive text-to-image diffusion model used to cost millions of dollars and weeks of GPU time. Times have changed. In this post, we share the complete recipe behind a 24-hour speedrun using 32 H200 GPUs (approx. $1500 total compute). The goal was to stack every trick that actually works — not just measure them in isolation — and see how far we could push performance under a strict budget.
We open-sourced all the code in the PRX repository, so you can reproduce, modify, and extend the experiments yourself. This post summarizes the key architectural decisions and training settings. For a detailed experimental breakdown of individual components, check out our earlier posts (Part 1 and Part 2).

The Training Recipe: A Stack of Proven Techniques
1. Pixel-Space Training with X-Prediction
We adopted the x-prediction formulation from Back to Basics: Let Denoising Generative Models Denoise (Li & He, 2025). Unlike latent diffusion models that rely on a VAE, this approach trains directly in pixel space. It simplifies the pipeline and removes the VAE bottleneck.
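As a concrete illustration, here is a minimal sketch (not the PRX training step) of what x-prediction means in practice: the network regresses the clean image directly rather than noise or velocity. The interpolation convention and function signature below are assumptions.

    import torch

    def xpred_flow_matching_loss(model, x0, cond):
        """One training step under x-prediction: the network outputs an
        estimate of the clean image x0 rather than noise or velocity.
        Illustrative convention: x_t = (1 - t) * x0 + t * noise."""
        noise = torch.randn_like(x0)
        t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1, 1)
        x_t = (1 - t) * x0 + t * noise
        x0_pred = model(x_t, t.flatten(), cond)   # model predicts the clean image
        return torch.mean((x0_pred - x0) ** 2)

The key architecture settings: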
- Patch size: 32
- Bottleneck dimension: 256 in the initial token projection layer
- Resolution schedule: Start at 512px → fine-tune at 1024px (skipping 256px)
At 512px with 32-pixel patches, the sequence length is 256 tokens; at 1024px it grows to 1,024 tokens. With modern hardware, both are entirely manageable.
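To make the token arithmetic concrete, here is a minimal patchify-and-project sketch; the class name and transformer width are illustrative, while the patch size and bottleneck dimension match the settings above.

    import torch
    import torch.nn as nn

    PATCH = 32        # patch size
    BOTTLENECK = 256  # initial token projection dimension
    HIDDEN = 1024     # transformer width (illustrative)

    class PatchEmbed(nn.Module):
        """Split an RGB image into 32x32 patches, project each patch through
        a narrow bottleneck, then expand to the transformer width."""
        def __init__(self):
            super().__init__()
            self.proj = nn.Conv2d(3, BOTTLENECK, kernel_size=PATCH, stride=PATCH)
            self.expand = nn.Linear(BOTTLENECK, HIDDEN)

        def forward(self, x):
            tokens = self.proj(x)                       # (B, 256, H/32, W/32)
            tokens = tokens.flatten(2).transpose(1, 2)  # (B, N, 256)
            return self.expand(tokens)                  # (B, N, 1024)

    embed = PatchEmbed()
    print(embed(torch.randn(1, 3, 512, 512)).shape)    # 256 tokens at 512px
    print(embed(torch.randn(1, 3, 1024, 1024)).shape)  # 1024 tokens at 1024px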
2. Perceptual Losses (LPIPS + DINOv2)
A major advantage of pixel-space training is that you can directly apply classic perceptual losses. We added two auxiliary objectives on top of the standard flow-matching loss:
- LPIPS loss (weight: 0.1) — captures low-level perceptual similarity.
- DINOv2-based perceptual loss (weight: 0.01) — provides stronger semantic alignment.
We applied these losses to the full predicted image (rather than patch-wise) at all noise levels, which gave consistently better results than the original paper's approach.
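Below is a minimal sketch of how such auxiliary terms can be wired up. It assumes the `lpips` package and a DINOv2 backbone from `torch.hub`; the resizing, normalization, and feature choice are simplifications, not the PRX implementation, and only the loss weights match the values above.

    import torch
    import torch.nn.functional as F
    import lpips  # pip install lpips

    lpips_fn = lpips.LPIPS(net="vgg").eval()
    dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
    IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
    IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

    def dino_feats(img):
        """Map [-1, 1] images to DINOv2 global features."""
        img = F.interpolate((img + 1) / 2, size=224, mode="bilinear", align_corners=False)
        img = (img - IMAGENET_MEAN.to(img)) / IMAGENET_STD.to(img)
        return dino(img)

    def perceptual_terms(pred_img, target_img, w_lpips=0.1, w_dino=0.01):
        """Auxiliary perceptual losses on the full predicted image,
        added on top of the standard flow-matching loss."""
        loss_lpips = lpips_fn(pred_img, target_img).mean()
        loss_dino = F.mse_loss(dino_feats(pred_img), dino_feats(target_img))
        return w_lpips * loss_lpips + w_dino * loss_dino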
3. Token Routing with TREAD
To reduce computational cost per step, we used TREAD (Krause et al., 2025). It randomly selects 50% of tokens and routes them from the 2nd transformer block to the penultimate block, skipping the middle layers entirely.
We chose TREAD over SPRINT for its simplicity. Because vanilla CFG loses quality under routing, we implemented a self-guidance scheme that contrasts the dense (unrouted) conditional prediction with the routed conditional prediction.
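Here is a minimal sketch of the routing idea (not the PRX or reference TREAD code): a random half of the tokens leave the trunk after the early blocks, the middle blocks run only on the remaining tokens, and the two sets are merged back before the final blocks.

    import torch

    def routed_forward(blocks, tokens, route_ratio=0.5, start=2, end=-2):
        """blocks: list of transformer blocks; tokens: (B, N, D).
        Routed tokens skip the middle blocks and rejoin before the last blocks."""
        B, N, D = tokens.shape
        for blk in blocks[:start]:                 # dense early blocks
            tokens = blk(tokens)
        n_keep = int(N * (1 - route_ratio))        # tokens that stay in the trunk
        perm = torch.rand(B, N, device=tokens.device).argsort(dim=1)
        kept_idx, routed_idx = perm[:, :n_keep], perm[:, n_keep:]
        kept = torch.gather(tokens, 1, kept_idx[..., None].expand(-1, -1, D))
        routed = torch.gather(tokens, 1, routed_idx[..., None].expand(-1, -1, D))
        for blk in blocks[start:end]:              # middle blocks see only kept tokens
            kept = blk(kept)
        merged = torch.empty_like(tokens)
        merged.scatter_(1, kept_idx[..., None].expand(-1, -1, D), kept)
        merged.scatter_(1, routed_idx[..., None].expand(-1, -1, D), routed)
        for blk in blocks[end:]:                   # dense final blocks
            merged = blk(merged)
        return merged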
4. Representation Alignment with REPA + DINOv3
REPA (Yu et al., 2024) aligns the model's intermediate representations with a pretrained teacher. We used DINOv3 as the teacher and applied the alignment loss at the 8th transformer block (weight: 0.5).
Since we combine REPA with TREAD routing, we only compute the alignment loss on non-routed tokens to keep the signal consistent.
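A minimal sketch of what the alignment term can look like under these settings; the projection head, teacher dimension, and negative-cosine form are illustrative choices rather than the PRX implementation, and the teacher features are assumed to be computed elsewhere.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class REPALoss(nn.Module):
        """Align intermediate hidden states (e.g. block 8) with frozen teacher
        patch features via negative cosine similarity, only on non-routed tokens."""
        def __init__(self, hidden_dim=1024, teacher_dim=768, weight=0.5):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
                nn.Linear(hidden_dim, teacher_dim),
            )
            self.weight = weight

        def forward(self, hidden, teacher_feats, kept_idx):
            # hidden: (B, N_kept, D) states for the non-routed tokens
            # teacher_feats: (B, N, D_t); pick the positions that were not routed
            d_t = teacher_feats.shape[-1]
            target = torch.gather(teacher_feats, 1, kept_idx[..., None].expand(-1, -1, d_t))
            pred = F.normalize(self.proj(hidden), dim=-1)
            target = F.normalize(target, dim=-1)
            return -self.weight * (pred * target).sum(dim=-1).mean()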
5. Optimizer: Muon
We switched from Adam to Muon for all 2D parameters (matrices). Biases, norms, and embeddings remained on Adam. The configuration:
| Parameter Group | Optimizer | Key Hyperparameters |
|---|---|---|
| 2D parameters | Muon | lr=1e-4, momentum=0.95, nesterov=true, ns_steps=5 |
| All others | Adam | lr=1e-4, betas=(0.9, 0.95), eps=1e-8 |
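The sketch below shows one way to implement this split. The `Muon` import and constructor signature are assumptions (any open-source Muon implementation exposing these hyperparameters would do), and excluding embedding tables by name is an illustrative heuristic rather than the PRX logic.

    import torch
    from muon import Muon  # assumption: a vendored/open-source Muon optimizer class

    def build_optimizers(model):
        """2D weight matrices go to Muon; biases, norms, and embedding
        tables (2D, but excluded by name here) stay on Adam."""
        muon_params, adam_params = [], []
        for name, p in model.named_parameters():
            if p.ndim == 2 and "embed" not in name:
                muon_params.append(p)
            else:
                adam_params.append(p)
        opt_muon = Muon(muon_params, lr=1e-4, momentum=0.95, nesterov=True, ns_steps=5)
        opt_adam = torch.optim.Adam(adam_params, lr=1e-4, betas=(0.9, 0.95), eps=1e-8)
        return opt_muon, opt_adam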
Training Schedule & Data
We used three synthetic datasets:
- FLUX-generated (1.7M samples)
- FLUX-Reason-6M (6M samples)
- Midjourney v6 (1M samples, re-captioned with Gemini 1.5 for consistency)
The schedule:
- 512px: 100k steps, batch size 1024
- 1024px: 20k steps, batch size 512 (REPA turned off)
We kept an EMA of the weights (smoothing=0.999, update every 10 steps, start from step 0).
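For reference, a minimal sketch of this EMA scheme with standard PyTorch; only the 0.999 decay and the 10-step update interval come from the recipe above.

    import copy
    import torch

    def make_ema(model):
        """Frozen copy of the model that tracks an exponential moving average."""
        ema = copy.deepcopy(model).eval()
        for p in ema.parameters():
            p.requires_grad_(False)
        return ema

    @torch.no_grad()
    def update_ema(ema_model, model, decay=0.999):
        """ema <- decay * ema + (1 - decay) * online weights."""
        for ema_p, p in zip(ema_model.parameters(), model.parameters()):
            ema_p.lerp_(p, 1.0 - decay)

    # In the training loop, starting at step 0 and updating every 10 steps:
    # if step % 10 == 0:
    #     update_ema(ema_model, model)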
Code Example: Configuring the Training Loop
Below is a simplified Python snippet showing how to configure the key components. The full code is in the PRX repository.
    # config.py (simplified)
    import torch
    from prx import Trainer, TREADRouter, REPAAlignment, PerceptualLoss

    # Model configuration
    model_config = {
        "patch_size": 32,        # 32x32 pixel patches
        "bottleneck_dim": 256,   # initial token projection dimension
        "num_layers": 24,
        "hidden_dim": 1024,
    }

    # Placeholder: instantiate the PRX pixel-space transformer from model_config
    # (see the repository for the actual model class).
    model = ...

    # Training components
    router = TREADRouter(
        route_ratio=0.5,   # 50% of tokens skip the middle blocks
        start_block=2,
        end_block=-2,      # re-enter at the penultimate block
    )

    repa = REPAAlignment(
        teacher_name="dinov3",
        loss_weight=0.5,
        apply_at_block=8,
    )

    perceptual_loss = PerceptualLoss(
        lpips_weight=0.1,
        dino_weight=0.01,
        apply_at_all_noise_levels=True,
    )

    # Optimizer: Muon for 2D weight matrices, Adam for everything else
    param_groups = [
        {"params": model.muon_params, "optimizer": "muon", "lr": 1e-4, "momentum": 0.95},
        {"params": model.adam_params, "optimizer": "adam", "lr": 1e-4, "betas": (0.9, 0.95)},
    ]

    trainer = Trainer(
        model=model,
        router=router,
        repa=repa,
        perceptual_loss=perceptual_loss,
        param_groups=param_groups,
        batch_size=1024,
        steps_512=100_000,   # 512px stage
        steps_1024=20_000,   # 1024px fine-tune (REPA disabled)
    )
    trainer.run()

Results & Limitations
After 24 hours, the model produces coherent, aesthetically pleasing images with strong prompt adherence. The 1024px fine-tuning stage sharpens details without breaking composition. However, some artifacts remain:
- Occasional texture glitches
- Weird anatomy on complex prompts
- Performance degrades on very hard or out-of-distribution prompts
These issues are consistent with undertraining and limited data diversity, not a structural flaw in the recipe. With more compute and broader data coverage, the same setup should continue improving predictably.
What This Means for the Field
This speedrun demonstrates that diffusion training has become remarkably accessible. By combining pixel-space training, efficient routing, representation alignment, and lightweight perceptual guidance, a meaningful model is achievable in a single day on a budget that would have been unimaginable a few years ago.
Limitations & Cautions
- Synthetic data bias: All training data was AI-generated. Real-world images might behave differently.
- Hardware dependency: The recipe assumes H200 GPUs with large memory. Results on consumer hardware will vary significantly.
- No safety filter: The model was not fine-tuned for safety or bias mitigation. Use with caution in production.
Next Steps
This 24-hour run is just the beginning. Future work will focus on:
- Scaling up with more compute and broader datasets
- Improving caption quality and diversity
- Exploring self-guidance schemes further
We invite the community to build on this work. All code is open-source in the PRX repository.

Conclusion
Training a competitive text-to-image diffusion model from scratch in 24 hours is no longer science fiction. Pixel-space training, TREAD routing, REPA alignment, and the Muon optimizer combine to deliver strong results on a modest budget.
The key takeaway: diffusion research has been democratized. You don't need millions of dollars to contribute meaningful work. Start with the PRX codebase, tweak the knobs, and see what you can discover.
Thanks for reading. Join the community discussion on Discord and stay tuned for the next round of experiments. 🚀