Why General Reasoning Training Matters
Most open-source reasoning tutorials focus on narrow, verifiable tasks like math or coding. But real-world applications — medical diagnosis, legal analysis, robotics planning — require models to reason step-by-step without a single correct answer. Until now, a practical, reproducible recipe for general reasoning was missing.
The Google Tunix Hackathon on Kaggle changed that. With only 9 hours of Kaggle TPU v5e-8 compute, over 300 high-quality submissions proved that community-driven reasoning training is not only possible but surprisingly effective. This article distills the core innovations from the winners into a blueprint you can apply today.
Grounding source: The full hackathon results and technical details are available in the official Google Developers Blog post.
The Winning Recipe: Three-Stage Post-Training Pipeline
All top submissions followed a similar pattern: SFT → Alignment → RL. Here's the exact approach used by the first-place team (G-RaR).
Stage 1: Supervised Fine-Tuning (SFT)
- Model: Gemma-2-2B-IT
- Dataset: ~33k prompts with structured reasoning traces
- Technique: LoRA fine-tuning to teach the model the `<reasoning>...</reasoning>` format
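To make the target format concrete, here is a minimal sketch of how one SFT record could be assembled. The `<answer>` tag, the `prompt`/`completion` field names, and the helper names are assumptions for illustration; the text above only specifies the `<reasoning>` tags.

```python
def build_sft_target(reasoning: str, answer: str) -> str:
    """Wrap a reasoning trace and a final answer in the tag layout to be learned."""
    return (
        f"<reasoning>\n{reasoning.strip()}\n</reasoning>\n"
        f"<answer>{answer.strip()}</answer>"  # <answer> tag is an assumption
    )


def build_sft_example(prompt: str, reasoning: str, answer: str) -> dict:
    """Return one prompt/completion record (field names are hypothetical)."""
    return {
        "prompt": prompt.strip(),
        "completion": build_sft_target(reasoning, answer),
    }


example = build_sft_example(
    "A train travels 120 km in 2 hours. What is its average speed?",
    "Average speed is distance divided by time. 120 km / 2 h = 60 km/h.",
    "60 km/h",
)
print(example["completion"])
```

Keeping the wrapper in one function means the SFT data and any later format reward can share a single definition of what a well-formed output looks like.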
Stage 2: Preference Optimization (SimPO)
- Why SimPO over DPO? SimPO is reference-free, so no frozen copy of the model has to sit in memory next to the policy. That matters when you only have 8 TPU cores.
- Goal: Enforce strict XML formatting and prevent verbosity hacks (models that "yap" without logic).
Stage 3: GRPO with LLM-as-Judge
- Reward System:
  - Format Reward: Checks for `<reasoning>` tags
  - Exact Answer Reward: For verifiable tasks
  - G-RaR Score: A novel rubric-based reward from a larger judge model (Gemma-3-12B)
- Infrastructure: Split-mesh architecture on a single TPU v5e-8 — policy model on one mesh, judge model on the other for true parallel execution.
# Simplified Tunix GRPO training loop (for illustration only).
# The tunix names used here (load_model, load_tokenizer, GRPOTrainer, GRaRReward)
# are a sketch and may not match the actual Tunix API; check the Tunix docs.
import re

import tunix
from tunix import GRPOTrainer
from tunix.rewards import GRaRReward

# Load base model and tokenizer
model = tunix.load_model("gemma-2-2b-it")
tokenizer = tunix.load_tokenizer("gemma-2-2b-it")

# Define reward functions
def format_reward(output):
    # Reward only a correctly ordered <reasoning>...</reasoning> block
    return 1.0 if re.search(r"<reasoning>.*?</reasoning>", output, re.DOTALL) else 0.0

def exact_answer_reward(output, target):
    # Reward an exact match with the reference answer (verifiable tasks only)
    return 1.0 if output.strip() == target.strip() else 0.0

# G-RaR: rubric-based LLM judge reward
grar_reward = GRaRReward(judge_model="gemma-3-12b", rubrics=["logic_flow", "completeness"])

# Configure GRPO trainer
trainer = GRPOTrainer(
    model=model,
    reward_functions=[format_reward, exact_answer_reward, grar_reward],
    learning_rate=1e-5,
    batch_size=4,
    gradient_accumulation_steps=2,
)

# Run training (9 hours on Kaggle TPU v5e-8)
trainer.train(dataset="reasoning_dataset.jsonl", num_epochs=1)
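
The loop above hides what GRPO does with the reward values. As a rough sketch, the core step is to score a group of completions sampled for the same prompt and normalize each reward against the group mean and standard deviation. The weights and scores below are made-up illustrative numbers, and the helper names are hypothetical rather than part of Tunix.

```python
from statistics import mean, pstdev


def combined_reward(scores, weights):
    """Weighted sum of the per-function reward scores for one completion."""
    return sum(w * s for w, s in zip(weights, scores))


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, each scored as (format, exact answer, judge).
group_scores = [
    (1.0, 1.0, 0.8),
    (1.0, 0.0, 0.6),
    (0.0, 0.0, 0.3),
    (1.0, 1.0, 0.9),
]
weights = (0.2, 0.4, 0.4)  # hypothetical weighting of the three rewards

rewards = [combined_reward(s, weights) for s in group_scores]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.2f}")
```

Because advantages are relative to the group, a completion is reinforced only when it beats its siblings, which is why no separate value model is needed.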

Key Innovations from the Winners
1. G-RaR: Rubrics as Rewards (1st Place)
- Problem: Exact-match rewards fail for open-ended tasks.
- Solution: Use a larger judge model to evaluate reasoning quality based on task-specific rubrics (e.g., logical flow, evidence use).
- Result: Continuous, normalized feedback that improves reasoning without requiring a single correct answer.
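One way to picture a rubric-based reward is a weighted average of per-criterion judge scores, rescaled to the range 0 to 1. The sketch below assumes a 1 to 5 scoring scale and uses a hard-coded stub in place of the judge model; `rubric_reward`, `stub_judge`, and the criterion names are hypothetical and not taken from the winning submission.

```python
from typing import Callable, Dict


def rubric_reward(
    response: str,
    rubric_weights: Dict[str, float],
    judge: Callable[[str, str], int],
    max_score: int = 5,
) -> float:
    """Return a weighted rubric score normalized to [0, 1].

    `judge(response, criterion)` is assumed to return an integer
    from 1 to `max_score` for a single rubric criterion.
    """
    total_weight = sum(rubric_weights.values())
    weighted = 0.0
    for criterion, weight in rubric_weights.items():
        raw = judge(response, criterion)
        clipped = min(max(raw, 1), max_score)  # guard against out-of-range output
        weighted += weight * (clipped - 1) / (max_score - 1)
    return weighted / total_weight


def stub_judge(response: str, criterion: str) -> int:
    """Stand-in for a judge model call; scores are hard-coded for the demo."""
    return {"logical_flow": 4, "evidence_use": 2}[criterion]


score = rubric_reward(
    "<reasoning>Symptoms A and B point to X because of Y.</reasoning>",
    {"logical_flow": 2.0, "evidence_use": 1.0},
    stub_judge,
)
print(round(score, 3))
```

Normalizing to a fixed range keeps the judge signal on the same scale as the binary format and exact-answer rewards, so the three can be mixed without one dominating.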
2. SimPO Over DPO (2nd Place)
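- Why it matters: DPO scores every preference pair against a frozen reference model, which adds memory and an extra forward pass per batch. SimPO drops the reference model and uses the length-normalized log-probability of each response as the implicit reward, making it feasible on limited TPU memory.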
- Customization: The team injected a custom SimPO loss function into Tunix's `DPOTrainer`.
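For reference, the SimPO objective for a single preference pair can be written in a few lines of plain Python. This is a sketch of the published loss, not the team's Tunix integration; the `beta` and `gamma` defaults are illustrative values, and the per-token log-probabilities are made up.

```python
import math


def simpo_loss(chosen_token_logps, rejected_token_logps, beta=2.0, gamma=1.0):
    """SimPO loss for one preference pair.

    Each argument is a list of per-token log-probabilities of a response
    under the policy. No reference model is involved.
    """
    # Implicit reward: length-normalized (average) log-probability, scaled by beta.
    chosen_reward = beta * sum(chosen_token_logps) / len(chosen_token_logps)
    rejected_reward = beta * sum(rejected_token_logps) / len(rejected_token_logps)
    margin = chosen_reward - rejected_reward - gamma
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


chosen = [-0.2, -0.3, -0.1]   # short, confident response
rejected = [-0.9] * 10        # longer, less likely response
loss = simpo_loss(chosen, rejected)

# Doubling the rejected response's length at the same per-token
# log-probability leaves the loss essentially unchanged.
loss_longer = simpo_loss(chosen, [-0.9] * 20)
print(round(loss, 4), round(loss_longer, 4))
```

Averaging over tokens is what removes the incentive to pad responses: a longer answer gains nothing unless its per-token likelihood improves.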
3. TF-IDF Reward (3rd Place)
- Problem: LLM judges are slow and memory-heavy.
- Solution: Replace the judge with a fast TF-IDF reward that scores reasoning traces based on domain-specific vocabulary relevance.
- Result: Non-blocking, CPU-based reward calculation that adds no load to the TPU.
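A TF-IDF reward of this kind could look roughly like the following standard-library sketch, which scores a trace by its cosine similarity to a small domain corpus. The corpus, the smoothed IDF formula, and the function names are assumptions for illustration; the third-place team's actual scoring may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def build_idf(corpus):
    """Smoothed inverse document frequency over a small domain corpus."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """TF-IDF weights for the in-vocabulary terms of `text`."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tfidf_reward(trace, domain_corpus):
    """Cosine similarity between a trace and the pooled domain corpus, in [0, 1]."""
    idf = build_idf(domain_corpus)
    reference = tfidf_vector(" ".join(domain_corpus), idf)
    return cosine(tfidf_vector(trace, idf), reference)


domain_corpus = [
    "Chest pain with elevated troponin suggests myocardial infarction.",
    "Fever and productive cough may indicate bacterial pneumonia.",
    "Elevated blood glucose and thirst are signs of diabetes.",
]
on_topic = tfidf_reward(
    "Elevated troponin with chest pain points to myocardial infarction.",
    domain_corpus,
)
off_topic = tfidf_reward(
    "The stock market rallied after earnings beat expectations.",
    domain_corpus,
)
print(on_topic > off_topic)
```

A reward like this only measures vocabulary overlap, so it is best paired with a format or exact-answer reward that checks the reasoning actually leads somewhere.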
Honorable Mentions
- On-Policy Distillation: Dynamically generate reasoning traces from a teacher model during training, creating a tighter feedback loop.
- Domain-Specific Reasoning: Medical, chemistry, legal, and robotics — all achieved strong results using the same three-stage recipe.
Limitations and Caveats
- Compute Budget: 9 hours on a single TPU v5e-8 is impressive but still limited. Larger models (7B+) may require more resources.
- Judge Model Bias: Using an LLM-as-judge introduces potential bias — the judge may favor its own reasoning style.
- Generalization: The recipes work best for structured reasoning tasks. Creative or highly open-ended tasks may need further tuning.
Next Steps: Train Your Own Reasoning Model
Ready to build? Here's your action plan:
- Explore Tunix on GitHub: Access the official repository with code, docs, and community examples.
- Try a Colab Tutorial: Spin up a free TPU instance and run your first SFT or RL loop.
- Dive Deeper into RL: Read the Tunix reinforcement learning documentation to understand advanced reward shaping.

Conclusion: The Democratization of Reasoning Training
The Tunix Hackathon proved that general reasoning training is no longer reserved for big labs with unlimited compute. With open-source tools, free Kaggle TPUs, and the recipes shared here, any developer can turn a small instruction-tuned model such as Gemma-2-2B-IT into a structured reasoning engine in under 24 hours.
The key takeaway: Combine SFT for foundational skills, SimPO for formatting discipline, and GRPO with creative reward functions (G-RaR, TF-IDF) for logical depth. Start small, iterate fast, and share your results — the community is waiting.