TorchTPU Running PyTorch Natively on Google TPUs at Scale

Why TorchTPU Matters

Modern AI workloads demand hardware that can scale across thousands of accelerators. Google’s Tensor Processing Units (TPUs) are custom ASICs designed for this, powering models like Gemini and Veo. But the machine learning ecosystem runs on PyTorch. Without native PyTorch support, TPUs remained inaccessible to a huge community of developers.

TorchTPU changes that. It’s a new stack that lets you take an existing PyTorch script, change your device initialization to "tpu", and run training loops without modifying core logic. No wrappers, no subclasses — just familiar PyTorch tensors on TPU hardware.

This isn’t just about compatibility. TorchTPU is built on three engineering principles: usability, portability, and performance. Let’s look under the hood.

Google TPU pod server rack with inter-chip interconnect for distributed AI training

Architecture: Eager First, Fused for Speed

TorchTPU integrates at the deepest level via PyTorch’s PrivateUse1 interface. It prioritizes eager execution — the default PyTorch experience — and offers three modes for different stages of development:

Mode	Behavior	Use Case
Debug Eager	One op at a time, syncs CPU after each	Debugging shape mismatches, NaNs
Strict Eager	Single-op dispatch, async execution	Default training loop
Fused Eager	On-the-fly op fusion into dense chunks	Maximum throughput (50–100%+ faster)

# Example: Switching to TPU with TorchTPU
import torch
import torchtpu

# Change device to "tpu" — that's it
device = torch.device("tpu")
model = MyModel().to(device)
optimizer = torch.optim.Adam(model.parameters())

for batch in dataloader:
    x = batch["input"].to(device)
    y = batch["label"].to(device)
    
    optimizer.zero_grad()
    loss = model(x, y)
    loss.backward()
    optimizer.step()

All three modes share a Compilation Cache that learns your workload, reducing recompilation over time. For peak performance, TorchTPU integrates with torch.compile via XLA as the backend compiler, capturing FX graphs with Torch Dynamo and mapping operators directly into StableHLO.

Distributed Training Made Natural

TorchTPU supports DDP, FSDPv2, and DTensor out of the box. A key improvement over its predecessor (PyTorch/XLA) is handling divergent execution (MPMD). In real PyTorch code, rank 0 often does extra logging — TorchTPU isolates communication primitives to preserve correctness without forcing SPMD purity.

# FSDPv2 example — works unchanged
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = FSDP(MyModel(), device_id=torch.device("tpu"))

Custom Kernels for Hardware Optimization

For advanced users, TorchTPU supports custom kernels written in Pallas and JAX. Decorate a JAX function with @torch_tpu.pallas.custom_jax_kernel to write low-level instructions that interface directly with the TPU lowering path.

import jax
import torchtpu.pallas as tp

@tp.custom_jax_kernel
def my_kernel(x):
    # Low-level TPU operations here
    return jax.nn.relu(x)

Limitations & Caveats

While TorchTPU is production-ready for training and serving, there are important considerations:

Optimal model design differs from GPU. For example, TPU TensorCores achieve peak efficiency at head dimensions of 128 or 256, not 64. You may need to refactor attention layers.
Recompilation overhead remains a challenge for dynamic sequence lengths and batch sizes. The team is working on bounded dynamism in XLA to mitigate this.
Ecosystem maturity: Some third-party libraries that rely on CUDA-specific ops may not work without adaptation. Check the TPU Developer Hub for compatibility lists.

What’s Next: 2026 Roadmap

Precompiled TPU kernel library to reduce first-iteration latency.
Helion kernel support alongside existing Pallas/JAX integration.
Bounded dynamism in XLA to handle shape changes without recompilation.
Deeper integration with PyTorch’s distributed ecosystem (e.g., torch.distributed.pipeline).

Pro tip: If you’re porting a model from GPU, start by verifying correctness with Debug Eager, then switch to Fused Eager for performance. Use our upcoming deep-dive guidelines to identify suboptimal architectures like hardcoded head dimensions.

Cloud infrastructure diagram showing TPU clusters connected via ICI torus topology IT Technology Image

Conclusion

TorchTPU is a major step toward making Google’s TPU infrastructure accessible to the entire PyTorch community. By prioritizing eager execution, supporting distributed APIs out of the box, and enabling custom kernel injection, it removes the friction that previously kept developers away.

If you’re building next-generation AI models and want to leverage TPU’s scale and efficiency, now is the time to experiment with TorchTPU. Start with the official documentation and join the community on GitHub.

TorchTPU Running PyTorch Natively on Google TPUs at Scale

Why TorchTPU Matters

Architecture: Eager First, Fused for Speed

Distributed Training Made Natural

Custom Kernels for Hardware Optimization

Limitations & Caveats

What’s Next: 2026 Roadmap

Conclusion

Related Reading

Share this post

Did you find this post helpful?
It helps the author a lot!

Subscribe

RSS / Atom Feed

Real-time Alerts

Comments 0

Why TorchTPU Matters

Architecture: Eager First, Fused for Speed

Distributed Training Made Natural

Custom Kernels for Hardware Optimization

Limitations & Caveats

What’s Next: 2026 Roadmap

Conclusion

Related Reading

Share this post

Did you find this post helpful?It helps the author a lot!

Subscribe

RSS / Atom Feed

Real-time Alerts

Comments 0

Did you find this post helpful?
It helps the author a lot!