The Challenge: Migrating Thousands of Dataset Consumers
Any data team knows the pain of deprecating a heavily used dataset. When Spotify needed to sunset two of its most critical user datasets to release new versions with enhanced dimensions, the scale was staggering: ~1,800 direct downstream data pipelines, indirectly impacting thousands more across the company. The deadline was six months, and the pipelines spanned three radically different frameworks: SQL-based BigQuery Runner, dbt, and Scala-based Scio. Manual migration effort was estimated at 10 engineering weeks.
This is where their internal background coding agent, codenamed Honk, entered the picture. Honk is not a ChatGPT wrapper—it's a purpose-built agent that integrates deeply with Spotify's Backstage developer portal and Fleet Management platform to autonomously rewrite code at scale.
For a deeper look at how AI agents can deliver measurable ROI beyond the hype, check out our strategic guide on maximizing AI ROI and managing costs.

How Honk Handled the Migration: Context Engineering Was Everything
Step 1: Finding the Needles in the Haystack with Backstage
Before any code changes, the team had to understand the full lineage of the deprecated datasets. Backstage's endpoint lineage and Codesearch plugins made this possible. Each endpoint's Backstage page showed a clear list of downstream consumers, and Codesearch queries identified target repositories across Spotify's GitHub Enterprise landscape. These were marked in-scope using the Fleetshift plugin.
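For teams without Spotify's internal plugins, a rough equivalent of this discovery step can be scripted against Backstage's catalog REST API. The sketch below is a minimal illustration under assumed names: BACKSTAGE_URL, the dataset identifiers, and the dependsOn-based filter are hypothetical stand-ins, not Spotify's actual lineage setup.
# Hypothetical sketch: enumerate downstream consumers of a deprecated
# dataset via Backstage's catalog REST API. Names and filter are assumed.
import requests

BACKSTAGE_URL = "https://backstage.example.com"  # assumed internal instance
DEPRECATED_DATASETS = ["user-profile-v1", "user-sessions-v1"]  # made-up names

def downstream_consumers(dataset: str) -> list[str]:
    """Return the names of catalog entities that depend on the dataset."""
    resp = requests.get(
        f"{BACKSTAGE_URL}/api/catalog/entities",
        params={"filter": f"spec.dependsOn=resource:default/{dataset}"},
        timeout=30,
    )
    resp.raise_for_status()
    return [entity["metadata"]["name"] for entity in resp.json()]

for dataset in DEPRECATED_DATASETS:
    consumers = downstream_consumers(dataset)
    print(f"{dataset}: {len(consumers)} direct downstream pipelines")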
Step 2: Context Engineering—The Make-or-Break Phase
As discussed in Part 2 of Spotify's Honk series, context engineering is the most critical—and most time-consuming—part of working with background coding agents. The major challenge here was the heterogeneity of pipeline frameworks:
- BigQuery Runner & dbt: Relatively consistent in style across teams.
- Scio: Highly flexible, with wildly varying implementations per team.
At the time of this migration, Honk lacked support for Claude Skills and custom configurability (a deliberate design choice for safety). This meant the prompt had to be fully comprehensive upfront—Honk couldn't read external schemas or docs on its own.
The team tried two approaches:
- Repurposing human-written migration guides: Failed because the context was too vague. Honk made incorrect assumptions about field mappings.
- Explicit mapping tables in the context file: Succeeded. By providing clear, tabular mappings for every field transformation, Honk delivered solid performance across most repositories.
This teaches a crucial lesson: when an agent can't gather its own context, your context file must leave zero ambiguity.
# Example: Context file snippet for BigQuery Runner migration
MAPPING_TABLE = {
    "user_id": {
        "old_column": "user_id",
        "new_column": "user_identifier",
        "transform": "CAST(user_id AS STRING)",
    },
    "session_start": {
        "old_column": "session_start",
        "new_column": "session_start_ts",
        "transform": "TIMESTAMP_MILLIS(session_start)",
    },
}
# Honk uses this table to rewrite SQL SELECT statements.
# If a field requires a human judgment call, Honk leaves it unchanged
# and adds a comment with a link to the human migration guide.
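To see why tabular mappings leave no room for interpretation, here is a minimal sketch of the deterministic rewrite such a table encodes. The rewrite_select helper is our illustration only; Honk applies the table through its language model rather than a regex pass.
# Illustration only: a naive, deterministic application of MAPPING_TABLE.
import re

def rewrite_select(sql: str, mapping: dict) -> str:
    """Replace the first reference to each old column with its transform
    expression, aliased to the new column name."""
    for spec in mapping.values():
        pattern = rf"\b{re.escape(spec['old_column'])}\b"
        replacement = f"{spec['transform']} AS {spec['new_column']}"
        sql = re.sub(pattern, replacement, sql, count=1)
    return sql

print(rewrite_select("SELECT user_id, session_start FROM events", MAPPING_TABLE))
# SELECT CAST(user_id AS STRING) AS user_identifier,
#   TIMESTAMP_MILLIS(session_start) AS session_start_ts FROM events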
Step 3: The Testing Gap
Scio pipelines often included unit tests, but BigQuery Runner and dbt repositories rarely did. This took away Honk's key feedback loop of verifying its own work and adjusting, so the team had to rely on downstream owners to manually test automated PRs before merging.
Despite this, Honk and Fleetshift successfully rolled out 240 automated migration PRs. The combination of Backstage's overview UI and Fleetshift's monitoring made it easy to track progress, troubleshoot, and communicate with owning teams.
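For teams facing a similar testing gap, even a lightweight structural check can restore part of the feedback loop. The sketch below is our illustration, not Spotify tooling; it uses the open-source sqlglot parser to confirm that a migrated query still parses and produces the expected columns.
# Sketch of a minimal self-check an agent could run on migrated SQL.
import sqlglot

def check_migrated_query(sql: str, expected_columns: set[str]) -> None:
    parsed = sqlglot.parse_one(sql, read="bigquery")  # raises on invalid SQL
    missing = expected_columns - set(parsed.named_selects)
    assert not missing, f"migrated query is missing columns: {missing}"

check_migrated_query(
    "SELECT CAST(user_id AS STRING) AS user_identifier, "
    "TIMESTAMP_MILLIS(session_start) AS session_start_ts FROM events",
    {"user_identifier", "session_start_ts"},
)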
For a real-world example of AI-driven platform architecture in healthcare, see our analysis of Artera's scalable AI diagnostics platform on AWS.

Key Lessons and Limitations
What Worked Well
- Backstage + Codesearch: Rapidly identified all affected repositories.
- Explicit mapping tables: Removed ambiguity for the agent.
- Fleetshift UI: Simplified monitoring and PR management.
What Didn't Work
- Scio migrations: Abandoned. Scio's flexibility meant a single comprehensive upfront prompt was impractical, and Honk had no ability to gather the missing context on its own.
- No automated testing: Honk couldn't verify its own work, reducing confidence in PRs.
Critical Limitations & Warnings
- Context engineering is expensive: Writing a bulletproof context file took more time than expected. If your data landscape isn't standardized, this cost multiplies.
- Agent autonomy is bounded: Without access to external tools (e.g., reading schemas, JIRA tickets), the agent is only as good as your prompt.
- Testing infrastructure is non-negotiable: Agents that can't run tests can't self-correct. Invest in CI/CD and unit tests before scaling agent usage.
- Not all frameworks are equal: Highly flexible frameworks (like Scio) are harder to automate. Standardization is a prerequisite for agent success.
Next Steps for Learning
- Explore Claude Code Skills and MCP (Model Context Protocol) to give agents self-context capabilities (see the sketch after this list).
- Implement mandatory unit testing across all pipeline repositories.
- Study Spotify's upcoming Honk features: agent-driven context gathering from JIRA and documentation.
- Read the full Honk series: Part 1, Part 2, Part 3.
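To make the first point concrete, here is a minimal sketch of what agent self-context could look like: an MCP server exposing a schema-lookup tool, built with the official MCP Python SDK. The get_schema tool and its data are hypothetical, not part of Honk.
# Hypothetical sketch: an MCP server that would let an agent look up
# dataset schemas itself instead of relying on an exhaustive prompt.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("dataset-schemas")

SCHEMAS = {  # illustrative stand-in for a real schema registry
    "user-profile-v2": {"user_identifier": "STRING", "session_start_ts": "TIMESTAMP"},
}

@mcp.tool()
def get_schema(dataset: str) -> dict[str, str]:
    """Return the column-to-type mapping for a dataset, if known."""
    return SCHEMAS.get(dataset, {})

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio for an MCP-capable agent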

Conclusion: The Future of Autonomous Code Agents at Scale
Spotify's Honk experiment proves that background coding agents can deliver massive time savings—240 automated PRs, 10 engineering weeks saved—but only when the foundation is right. Standardization of frameworks, mandatory testing, and deep integration with developer portals like Backstage are not optional; they are prerequisites.
The roadmap is promising: future Honk versions will gather their own context from JIRA and documentation, reducing the burden of writing exhaustive prompts. As Claude Code agents improve, the ceiling for what agents can achieve will only rise.
For engineering leaders: Start standardizing your data pipelines and testing practices today. The agents are coming, and they'll only be as effective as the soil you plant them in.