Introduction: The 24-Hour Challenge
Training a competitive text-to-image diffusion model used to cost millions of dollars and weeks of GPU time. Times have changed. In this post, we share the complete recipe behind a 24-hour speedrun using 32 H200 GPUs (approx. $1500 total compute). The goal was to stack every trick that actually works — not just measure them in isolation — and see how far we could push performance under a strict budget.
We open-sourced all the code in the PRX repository, so you can reproduce, modify, and extend the experiments yourself. This post summarizes the key architectural decisions and training settings. For a detailed experimental breakdown of individual components, check out our earlier posts (Part 1 and Part 2).

The Training Recipe: A Stack of Proven Techniques
1. Pixel-Space Training with X-Prediction
We adopted the x-prediction formulation from Back to Basics: Let Denoising Generative Models Denoise (Li & He, 2025). Unlike latent diffusion models that rely on a VAE, this approach trains directly in pixel space. It simplifies the pipeline and removes the VAE bottleneck.
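As a concrete illustration, here is a minimal sketch (not the PRX training step) of what x-prediction means in practice: the network regresses the clean image directly rather than noise or velocity. The interpolation convention and function signature below are assumptions.

    import torch

    def xpred_flow_matching_loss(model, x0, cond):
        """One training step under x-prediction: the network outputs an
        estimate of the clean image x0 rather than noise or velocity.
        Illustrative convention: x_t = (1 - t) * x0 + t * noise."""
        noise = torch.randn_like(x0)
        t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1, 1)
        x_t = (1 - t) * x0 + t * noise
        x0_pred = model(x_t, t.flatten(), cond)   # model predicts the clean image
        return torch.mean((x0_pred - x0) ** 2)

The key architecture settings: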
- Patch size: 32
- Bottleneck dimension: 256 in the initial token projection layer
- Resolution schedule: Start at 512px → fine-tune at 1024px (skipping 256px)
At 512px with 32-pixel patches, the sequence length is 256 tokens; at 1024px it grows to 1,024 tokens. With modern hardware, both are entirely manageable.
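To make the token arithmetic concrete, here is a minimal patchify-and-project sketch; the class name and transformer width are illustrative, while the patch size and bottleneck dimension match the settings above.

    import torch
    import torch.nn as nn

    PATCH = 32        # patch size
    BOTTLENECK = 256  # initial token projection dimension
    HIDDEN = 1024     # transformer width (illustrative)

    class PatchEmbed(nn.Module):
        """Split an RGB image into 32x32 patches, project each patch through
        a narrow bottleneck, then expand to the transformer width."""
        def __init__(self):
            super().__init__()
            self.proj = nn.Conv2d(3, BOTTLENECK, kernel_size=PATCH, stride=PATCH)
            self.expand = nn.Linear(BOTTLENECK, HIDDEN)

        def forward(self, x):
            tokens = self.proj(x)                       # (B, 256, H/32, W/32)
            tokens = tokens.flatten(2).transpose(1, 2)  # (B, N, 256)
            return self.expand(tokens)                  # (B, N, 1024)

    embed = PatchEmbed()
    print(embed(torch.randn(1, 3, 512, 512)).shape)    # 256 tokens at 512px
    print(embed(torch.randn(1, 3, 1024, 1024)).shape)  # 1024 tokens at 1024px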
2. Perceptual Losses (LPIPS + DINOv2)
A major advantage of pixel-space training is that you can directly apply classic perceptual losses. We added two auxiliary objectives on top of the standard flow-matching loss:
- LPIPS loss (weight: 0.1) — captures low-level perceptual similarity.
- DINOv2-based perceptual loss (weight: 0.01) — provides stronger semantic alignment.
We applied these losses to the full predicted image (rather than patch-wise) at all noise levels, which gave consistently better results than the original paper's approach.
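Below is a minimal sketch of how such auxiliary terms can be wired up. It assumes the `lpips` package and a DINOv2 backbone from `torch.hub`; the resizing, normalization, and feature choice are simplifications, not the PRX implementation, and only the loss weights match the values above.

    import torch
    import torch.nn.functional as F
    import lpips  # pip install lpips

    lpips_fn = lpips.LPIPS(net="vgg").eval()
    dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
    IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
    IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

    def dino_feats(img):
        """Map [-1, 1] images to DINOv2 global features."""
        img = F.interpolate((img + 1) / 2, size=224, mode="bilinear", align_corners=False)
        img = (img - IMAGENET_MEAN.to(img)) / IMAGENET_STD.to(img)
        return dino(img)

    def perceptual_terms(pred_img, target_img, w_lpips=0.1, w_dino=0.01):
        """Auxiliary perceptual losses on the full predicted image,
        added on top of the standard flow-matching loss."""
        loss_lpips = lpips_fn(pred_img, target_img).mean()
        loss_dino = F.mse_loss(dino_feats(pred_img), dino_feats(target_img))
        return w_lpips * loss_lpips + w_dino * loss_dino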
3. Token Routing with TREAD
To reduce computational cost per step, we used TREAD (Krause et al., 2025). It randomly selects 50% of tokens and routes them from the 2nd transformer block to the penultimate block, skipping the middle layers entirely.
We chose TREAD over SPRINT for its simplicity. Because vanilla CFG loses quality under routing, we implemented a self-guidance scheme that contrasts the dense (unrouted) conditional prediction with the routed conditional prediction.
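Here is a minimal sketch of the routing idea (not the PRX or reference TREAD code): a random half of the tokens leave the trunk after the early blocks, the middle blocks run only on the remaining tokens, and the two sets are merged back before the final blocks.

    import torch

    def routed_forward(blocks, tokens, route_ratio=0.5, start=2, end=-2):
        """blocks: list of transformer blocks; tokens: (B, N, D).
        Routed tokens skip the middle blocks and rejoin before the last blocks."""
        B, N, D = tokens.shape
        for blk in blocks[:start]:                 # dense early blocks
            tokens = blk(tokens)
        n_keep = int(N * (1 - route_ratio))        # tokens that stay in the trunk
        perm = torch.rand(B, N, device=tokens.device).argsort(dim=1)
        kept_idx, routed_idx = perm[:, :n_keep], perm[:, n_keep:]
        kept = torch.gather(tokens, 1, kept_idx[..., None].expand(-1, -1, D))
        routed = torch.gather(tokens, 1, routed_idx[..., None].expand(-1, -1, D))
        for blk in blocks[start:end]:              # middle blocks see only kept tokens
            kept = blk(kept)
        merged = torch.empty_like(tokens)
        merged.scatter_(1, kept_idx[..., None].expand(-1, -1, D), kept)
        merged.scatter_(1, routed_idx[..., None].expand(-1, -1, D), routed)
        for blk in blocks[end:]:                   # dense final blocks
            merged = blk(merged)
        return merged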
4. Representation Alignment with REPA + DINOv3
REPA (Yu et al., 2024) aligns the model's intermediate representations with a pretrained teacher. We used DINOv3 as the teacher and applied the alignment loss at the 8th transformer block (weight: 0.5).
Since we combine REPA with TREAD routing, we only compute the alignment loss on non-routed tokens to keep the signal consistent.
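A minimal sketch of what the alignment term can look like under these settings; the projection head, teacher dimension, and negative-cosine form are illustrative choices rather than the PRX implementation, and the teacher features are assumed to be computed elsewhere.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class REPALoss(nn.Module):
        """Align intermediate hidden states (e.g. block 8) with frozen teacher
        patch features via negative cosine similarity, only on non-routed tokens."""
        def __init__(self, hidden_dim=1024, teacher_dim=768, weight=0.5):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
                nn.Linear(hidden_dim, teacher_dim),
            )
            self.weight = weight

        def forward(self, hidden, teacher_feats, kept_idx):
            # hidden: (B, N_kept, D) states for the non-routed tokens
            # teacher_feats: (B, N, D_t); pick the positions that were not routed
            d_t = teacher_feats.shape[-1]
            target = torch.gather(teacher_feats, 1, kept_idx[..., None].expand(-1, -1, d_t))
            pred = F.normalize(self.proj(hidden), dim=-1)
            target = F.normalize(target, dim=-1)
            return -self.weight * (pred * target).sum(dim=-1).mean()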
5. Optimizer: Muon
We switched from Adam to Muon for all 2D parameters (matrices). Biases, norms, and embeddings remained on Adam. The configuration:
| Parameter Group | Optimizer | Key Hyperparameters |
|---|---|---|
| 2D parameters | Muon | lr=1e-4, momentum=0.95, nesterov=true, ns_steps=5 |
| All others | Adam | lr=1e-4, betas=(0.9, 0.95), eps=1e-8 |
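The sketch below shows one way to implement this split. The `Muon` import and constructor signature are assumptions (any open-source Muon implementation exposing these hyperparameters would do), and excluding embedding tables by name is an illustrative heuristic rather than the PRX logic.

    import torch
    from muon import Muon  # assumption: a vendored/open-source Muon optimizer class

    def build_optimizers(model):
        """2D weight matrices go to Muon; biases, norms, and embedding
        tables (2D, but excluded by name here) stay on Adam."""
        muon_params, adam_params = [], []
        for name, p in model.named_parameters():
            if p.ndim == 2 and "embed" not in name:
                muon_params.append(p)
            else:
                adam_params.append(p)
        opt_muon = Muon(muon_params, lr=1e-4, momentum=0.95, nesterov=True, ns_steps=5)
        opt_adam = torch.optim.Adam(adam_params, lr=1e-4, betas=(0.9, 0.95), eps=1e-8)
        return opt_muon, opt_adam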
Training Schedule & Data
We used three synthetic datasets:
- FLUX-generated (1.7M samples)
- FLUX-Reason-6M (6M samples)
- Midjourney v6 (1M samples, re-captioned with Gemini 1.5 for consistency)
The schedule:
- 512px: 100k steps, batch size 1024
- 1024px: 20k steps, batch size 512 (REPA turned off)
We kept an EMA of the weights (smoothing=0.999, update every 10 steps, start from step 0).
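For reference, a minimal sketch of this EMA scheme with standard PyTorch; only the 0.999 decay and the 10-step update interval come from the recipe above.

    import copy
    import torch

    def make_ema(model):
        """Frozen copy of the model that tracks an exponential moving average."""
        ema = copy.deepcopy(model).eval()
        for p in ema.parameters():
            p.requires_grad_(False)
        return ema

    @torch.no_grad()
    def update_ema(ema_model, model, decay=0.999):
        """ema <- decay * ema + (1 - decay) * online weights."""
        for ema_p, p in zip(ema_model.parameters(), model.parameters()):
            ema_p.lerp_(p, 1.0 - decay)

    # In the training loop, starting at step 0 and updating every 10 steps:
    # if step % 10 == 0:
    #     update_ema(ema_model, model)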
Code Example: Configuring the Training Loop
Below is a simplified Python snippet showing how to configure the key components. The full code is in the PRX repository.
    # config.py (simplified)
    import torch
    from prx import Trainer, TREADRouter, REPAAlignment, PerceptualLoss

    # Model configuration
    model_config = {
        "patch_size": 32,        # 32x32 pixel patches
        "bottleneck_dim": 256,   # initial token projection dimension
        "num_layers": 24,
        "hidden_dim": 1024,
    }

    # Placeholder: instantiate the PRX pixel-space transformer from model_config
    # (see the repository for the actual model class).
    model = ...

    # Training components
    router = TREADRouter(
        route_ratio=0.5,   # 50% of tokens skip the middle blocks
        start_block=2,
        end_block=-2,      # re-enter at the penultimate block
    )

    repa = REPAAlignment(
        teacher_name="dinov3",
        loss_weight=0.5,
        apply_at_block=8,
    )

    perceptual_loss = PerceptualLoss(
        lpips_weight=0.1,
        dino_weight=0.01,
        apply_at_all_noise_levels=True,
    )

    # Optimizer: Muon for 2D weight matrices, Adam for everything else
    param_groups = [
        {"params": model.muon_params, "optimizer": "muon", "lr": 1e-4, "momentum": 0.95},
        {"params": model.adam_params, "optimizer": "adam", "lr": 1e-4, "betas": (0.9, 0.95)},
    ]

    trainer = Trainer(
        model=model,
        router=router,
        repa=repa,
        perceptual_loss=perceptual_loss,
        param_groups=param_groups,
        batch_size=1024,
        steps_512=100_000,   # 512px stage
        steps_1024=20_000,   # 1024px fine-tune (REPA disabled)
    )
    trainer.run()

Results & Limitations
After 24 hours, the model produces coherent, aesthetically pleasing images with strong prompt adherence. The 1024px fine-tuning stage sharpens details without breaking composition. However, some artifacts remain:
- Occasional texture glitches
- Weird anatomy on complex prompts
- Performance degrades on very hard or out-of-distribution prompts
These issues are consistent with undertraining and limited data diversity, not a structural flaw in the recipe. With more compute and broader data coverage, the same setup should continue improving predictably.
What This Means for the Field
This speedrun demonstrates that diffusion training has become remarkably accessible. By combining pixel-space training, efficient routing, representation alignment, and lightweight perceptual guidance, a meaningful model is achievable in a single day on a budget that would have been unimaginable a few years ago.
Limitations & Cautions
- Synthetic data bias: All training data was AI-generated. Real-world images might behave differently.
- Hardware dependency: The recipe assumes H200 GPUs with large memory. Results on consumer hardware will vary significantly.
- No safety filter: The model was not fine-tuned for safety or bias mitigation. Use with caution in production.
Next Steps
This 24-hour run is just the beginning. Future work will focus on:
- Scaling up with more compute and broader datasets
- Improving caption quality and diversity
- Exploring self-guidance schemes further
We invite the community to build on this work. All code is open-source in the PRX repository.

Conclusion
Training a competitive text-to-image diffusion model from scratch in 24 hours is no longer science fiction. Pixel-space training, TREAD routing, REPA alignment, and the Muon optimizer combine to deliver strong results on a modest budget.
The key takeaway: diffusion research has been democratized. You don't need millions of dollars to contribute meaningful work. Start with the PRX codebase, tweak the knobs, and see what you can discover.
Thanks for reading. Join the community discussion on Discord and stay tuned for the next round of experiments. 🚀