How the Community Trained Gemma to Think: A Full Reasoning Recipe Using Tunix and TPUs

Why General Reasoning Training Matters

Most open-source reasoning tutorials focus on narrow, verifiable tasks like math or coding. But real-world applications such as medical diagnosis, legal analysis, and robotics planning require models to reason step by step about problems that have no single correct answer. A practical, reproducible recipe for this kind of general reasoning has been hard to find.

The Google Tunix Hackathon on Kaggle changed that. With only 9 hours of Kaggle TPU v5e-8 compute, over 300 high-quality submissions proved that community-driven reasoning training is not only possible but surprisingly effective. This article distills the core innovations from the winners into a blueprint you can apply today.

Grounding source: The full hackathon results and technical details are available in the official Google Developers Blog post.

The Winning Recipe: Three-Stage Post-Training Pipeline

All top submissions followed a similar pattern: SFT → Alignment → RL. The stages below follow the approach described for the first-place team (G-RaR).

Stage 1: Supervised Fine-Tuning (SFT)

Model: Gemma-2-2B-IT
Dataset: ~33k prompts with structured reasoning traces
Technique: LoRA fine-tuning to teach the model the <reasoning>...</reasoning> format
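
As a rough sketch of what one SFT training pair in this format might look like, the helper below wraps a reasoning trace in the tags and appends the final answer. The function name, the prompt/completion field layout, and the placement of the answer after the closing tag are assumptions for illustration; the source does not specify the dataset schema.

```python
def build_sft_example(prompt: str, reasoning: str, answer: str) -> dict:
    """Build one prompt/completion pair with the reasoning wrapped in tags.

    Hypothetical schema: the real dataset layout is not given in the article.
    """
    completion = (
        f"<reasoning>\n{reasoning.strip()}\n</reasoning>\n{answer.strip()}"
    )
    return {"prompt": prompt.strip(), "completion": completion}


example = build_sft_example(
    prompt="A patient reports chest pain after exercise. What should be ruled out first?",
    reasoning="Exertional chest pain suggests reduced blood flow to the heart.",
    answer="Rule out cardiac ischemia first.",
)
print(example["completion"])
```

Keeping the tags and the answer in one fixed layout gives the later format reward something unambiguous to check.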

Stage 2: Preference Optimization (SimPO)

Why SimPO over DPO? SimPO is memory-efficient — critical when you only have 8 TPU cores.
Goal: Enforce strict XML formatting and prevent verbosity hacks (models that "yap" without logic).
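
To make the memory and verbosity points concrete, here is a minimal sketch of the SimPO objective for a single preference pair, written in plain Python. It follows the published SimPO formulation (length-normalized log-probabilities, a target margin, no reference model); the `beta` and `gamma` defaults are illustrative and are not the winning team's settings.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def simpo_loss(
    logp_chosen: float,
    len_chosen: int,
    logp_rejected: float,
    len_rejected: int,
    beta: float = 2.0,   # reward scale (illustrative default)
    gamma: float = 1.0,  # target reward margin (illustrative default)
) -> float:
    """SimPO loss for one (chosen, rejected) pair.

    logp_* are summed token log-probabilities under the policy;
    len_* are the response lengths in tokens.
    """
    # Implicit rewards are average per-token log-probs: no reference model.
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)
```

Because each reward is divided by the response length, a rejected response that is twice as long but has the same average per-token log-probability produces the same loss, so length alone carries no advantage.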

Stage 3: GRPO with LLM-as-Judge

Reward System:
- Format Reward: Checks <reasoning> tags
- Exact Answer Reward: For verifiable tasks
- G-RaR Score: A novel rubric-based reward from a larger judge model (Gemma-3-12B)
Infrastructure: Split-mesh architecture on a single TPU v5e-8 — policy model on one mesh, judge model on the other for true parallel execution.

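# Simplified GRPO training loop (illustrative pseudocode).
# NOTE: load_model, load_tokenizer, GRPOTrainer, and GRaRReward are placeholder
# names used for illustration; check the Tunix documentation for the actual API.
import re

import tunix
from tunix import GRPOTrainer
from tunix.rewards import GRaRReward

# Load the policy model and tokenizer
model = tunix.load_model("gemma-2-2b-it")
tokenizer = tunix.load_tokenizer("gemma-2-2b-it")

# Matches a complete <reasoning>...</reasoning> block, opening tag first
REASONING_BLOCK = re.compile(r"<reasoning>.*?</reasoning>", re.DOTALL)

# Reward functions share one signature: (output, target) -> float
def format_reward(output, target=None):
    # 1.0 if the output contains a well-ordered reasoning block, else 0.0
    return 1.0 if REASONING_BLOCK.search(output) else 0.0

def exact_answer_reward(output, target):
    # Compare only the text after the closing tag (assumed here to hold the
    # final answer), so the reasoning trace itself cannot break the match
    answer = output.rsplit("</reasoning>", 1)[-1]
    return 1.0 if answer.strip() == target.strip() else 0.0

# G-RaR: rubric-based LLM-judge reward
grar_reward = GRaRReward(judge_model="gemma-3-12b", rubrics=["logic_flow", "completeness"])

# Configure the GRPO trainer
trainer = GRPOTrainer(
    model=model,
    reward_functions=[format_reward, exact_answer_reward, grar_reward],
    learning_rate=1e-5,
    batch_size=4,
    gradient_accumulation_steps=2
)

# Run training (9 hours on Kaggle TPU v5e-8)
trainer.train(dataset="reasoning_dataset.jsonl", num_epochs=1)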
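
GRPO turns per-completion rewards like the ones listed above into a training signal by comparing several completions sampled for the same prompt. The sketch below shows that group-relative normalization step in isolation, using only the standard library. The function name and the choice of population standard deviation are assumptions for illustration, not Tunix internals.

```python
from statistics import mean, pstdev


def group_relative_advantages(
    rewards: list[float], eps: float = 1e-6
) -> list[float]:
    """Normalize the rewards of one prompt's sampled completions.

    Each completion's advantage is its reward minus the group mean,
    divided by the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: two earned the reward, two did not.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every completion in a group receives the same reward, all advantages are zero and that prompt contributes no gradient, which is one reason a continuous reward such as a rubric score can be more informative than a binary check.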

Figure: GRPO reinforcement learning pipeline for chain-of-thought reasoning.

Key Innovations from the Winners

1. G-RaR: Rubrics as Rewards (1st Place)

Problem: Exact-match rewards fail for open-ended tasks.
Solution: Use a larger judge model to evaluate reasoning quality based on task-specific rubrics (e.g., logical flow, evidence use).
Result: Continuous, normalized feedback that improves reasoning without requiring a single correct answer.
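
One way to picture a rubric-based reward is as a weighted average of per-rubric judge scores, rescaled to the range 0 to 1. The sketch below is a generic illustration and not the G-RaR team's implementation: the 0 to 10 scoring scale, the weights, the `judge` callable interface, and the `toy_judge` stand-in are all assumptions.

```python
from typing import Callable, Mapping


def rubric_reward(
    response: str,
    judge: Callable[[str, str], float],
    rubric_weights: Mapping[str, float],
    max_score: float = 10.0,
) -> float:
    """Weighted mean of per-rubric judge scores, normalized to [0, 1]."""
    total_weight = sum(rubric_weights.values())
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    weighted = 0.0
    for rubric, weight in rubric_weights.items():
        raw = judge(response, rubric)
        # Clip in case the judge returns a score outside the expected scale.
        clipped = min(max(raw, 0.0), max_score)
        weighted += weight * clipped / max_score
    return weighted / total_weight


def toy_judge(response: str, rubric: str) -> float:
    """Hypothetical stand-in for a call to a larger judge model."""
    fixed_scores = {"logical flow": 8.0, "evidence use": 6.0}
    return fixed_scores[rubric]


score = rubric_reward(
    "some model response",
    toy_judge,
    {"logical flow": 2.0, "evidence use": 1.0},
)
print(round(score, 3))
```

In a real pipeline the `judge` callable would prompt the judge model with the rubric text and parse a numeric score from its reply; clipping guards against replies that fall outside the scale.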

2. SimPO Over DPO (2nd Place)

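Why it matters: DPO scores every batch with a frozen reference model as well as the policy, which is where the roughly 2x memory cost per batch comes from. SimPO drops the reference model and uses a length-normalized log-probability as the implicit reward, making it feasible on limited TPU memory.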
Customization: The team injected a custom SimPO loss function into Tunix's DPOTrainer.

3. TF-IDF Reward (3rd Place)

Problem: LLM judges are slow and memory-heavy.
Solution: Replace the judge with a fast TF-IDF reward that scores reasoning traces based on domain-specific vocabulary relevance.
Result: Non-blocking, CPU-based reward calculation that adds no load to the accelerator.
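
A TF-IDF reward of this kind can be sketched with the standard library alone. The version below scores a reasoning trace by its cosine similarity to a domain reference text, with IDF weights estimated from a small domain corpus. This is a generic illustration under stated assumptions, not the third-place team's code: the tokenizer, the smoothing, the use of a single reference text, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(corpus: list[str]) -> tuple[dict[str, float], float]:
    """Return smoothed IDF weights plus the weight used for unseen terms."""
    n_docs = len(corpus)
    doc_freq: Counter = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return idf, math.log(1 + n_docs) + 1.0


def tfidf_vector(text: str, idf: dict[str, float], default: float) -> dict[str, float]:
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf.get(t, default) for t, c in counts.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def tfidf_reward(trace: str, reference: str, idf: dict[str, float], default: float) -> float:
    """Reward in [0, 1]: TF-IDF cosine similarity of trace and reference."""
    return cosine(tfidf_vector(trace, idf, default), tfidf_vector(reference, idf, default))


corpus = [
    "hypertension treatment requires careful dosage",
    "check each contraindication before treatment",
    "dosage depends on patient weight",
]
idf, default_idf = build_idf(corpus)
reference = "hypertension treatment dosage contraindication"
on_topic = tfidf_reward("the dosage must account for hypertension", reference, idf, default_idf)
off_topic = tfidf_reward("the weather is nice today", reference, idf, default_idf)
print(on_topic > off_topic)
```

Since the computation involves only string handling and a few dictionary lookups, it can run on the host CPU while the accelerator continues generating or training.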

Honorable Mentions

On-Policy Distillation: Dynamically generate reasoning traces from a teacher model during training, creating a tighter feedback loop.
Domain-Specific Reasoning: Medical, chemistry, legal, and robotics — all achieved strong results using the same three-stage recipe.

Limitations and Caveats

Compute Budget: 9 hours on a single TPU v5e-8 is impressive but still limited. Larger models (7B+) may require more resources.
Judge Model Bias: Using an LLM-as-judge introduces potential bias — the judge may favor its own reasoning style.
Generalization: The recipes work best for structured reasoning tasks. Creative or highly open-ended tasks may need further tuning.

Next Steps: Train Your Own Reasoning Model

Ready to build? Here's your action plan:

Explore Tunix on GitHub: Access the official repository with code, docs, and community examples.
Try a Colab Tutorial: Spin up a free TPU instance and run your first SFT or RL loop.
Dive Deeper into RL: Read the Tunix reinforcement learning documentation to understand advanced reward shaping.

For a broader understanding of multi-agent architectures in production, check out our guide on Deconstructing Complexity: A Multi-Agent Architecture for Intelligent Advertising. And if you're interested in generative AI beyond text, see How We Trained a Text-to-Image Model in 24 Hours (Full Recipe).


Conclusion: The Democratization of Reasoning Training

The Tunix Hackathon showed that general reasoning training is no longer reserved for big labs with unlimited compute. With open-source tools, free Kaggle TPUs, and the recipes shared here, a developer can turn a small instruction-tuned model such as Gemma-2-2B-IT into a structured reasoning engine within a 9-hour TPU budget.

The key takeaway: Combine SFT for foundational skills, SimPO for formatting discipline, and GRPO with creative reward functions (G-RaR, TF-IDF) for logical depth. Start small, iterate fast, and share your results — the community is waiting.

This content was drafted using AI tools based on reliable sources, and has been reviewed by our editorial team before publication. It is not intended to replace professional advice.
