Once you have a background coding agent capable of editing code, running builds, and opening pull requests automatically, a critical challenge emerges: How do you effectively tell it what to do? Moving beyond simply running the agent, producing reliable and mergeable Pull Requests (PRs) across real-world codebases requires meticulous Context Engineering. This post shares hard-won insights from Spotify's experience automating thousands of PRs across hundreds of repositories, delving into effective prompt crafting and agent tool design. You can find the detailed source material here.

Prompt Pitfalls and Hard-Earned Lessons
While experimenting with early open-source agents and their own agentic loop, the Spotify team identified two major anti-patterns.
- The Overly Generic Prompt: Expects the agent to telepathically guess intent and desired outcome. An instruction like "make this code better" provides no verifiable goal, leading to agent confusion.
- The Overly Specific Prompt: Attempts to cover every possible case but falls apart upon encountering the unexpected. Overly rigid step-by-step instructions stifle agent flexibility and quickly exhaust the context window during complex, multi-step changes.
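To make the contrast concrete, here is a hypothetical trio of prompts for the same migration task. The wording, file names, and version numbers are invented for illustration; they are not Spotify's actual prompts.

```python
# Three prompts for the same task, sketching the two anti-patterns
# above and a goal-oriented alternative. All wording is illustrative.

TOO_GENERIC = "Make this code better."  # no intent, no verifiable goal

TOO_SPECIFIC = """\
1. Open pom.xml and change the junit version on line 42 to 5.10.0.
2. In src/test/java/AppTest.java, replace the import on line 3...
"""  # brittle: breaks on the first repo that differs, and bloats context

GOAL_ORIENTED = """\
Migrate this repository's tests from JUnit 4 to JUnit 5.
Do not take action if the repository uses a language level below Java 11.
The change is complete when the tests pass and no JUnit 4 imports remain.
"""  # describes the end state and leaves the path to the agent
```

The goal-oriented prompt gives the agent a precondition, a verifiable end state, and the freedom to choose its own steps.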
The transition to Claude Code yielded key principles:
- Tailor Prompts to the Agent: Claude Code performs better with prompts that describe the desired end state and leave room for it to figure out the path.
- State Preconditions: Include instructions that preemptively block impossible tasks, e.g., "Do not take action if the repository uses a language level below Java 11."
- Use Examples: A handful of concrete code examples heavily influence the outcome.
- Define a Verifiable Goal: Ideally, define the desired end state in the form of tests. The agent needs a benchmark to iteratively improve its solution.
- Do One Change at a Time: Bundling several related changes into one elaborate prompt may seem convenient, but it sharply increases the risk of context window exhaustion or partial results.
- Ask the Agent for Prompt Feedback: After a session, the agent itself is surprisingly adept at identifying what was missing in the prompt. Use this to refine future prompts.
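The principles above can be folded into a reusable prompt template. The following is a minimal sketch; the field names, section layout, and example task are assumptions, not Spotify's actual format.

```python
# Sketch of a prompt template applying the principles above:
# end-state goal, preconditions, concrete examples, a verifiable
# finish line, and an explicit one-change-at-a-time constraint.
# Structure and field names are assumptions for illustration.

PROMPT_TEMPLATE = """\
Goal (end state): {goal}

Preconditions (take no action if any of these fail):
{preconditions}

Examples of the desired change:
{examples}

Verification: the task is done when {verification}.
Make only this one change; do not bundle unrelated fixes.
"""

def build_prompt(goal, preconditions, examples, verification):
    """Render one single-change prompt for the background agent."""
    return PROMPT_TEMPLATE.format(
        goal=goal,
        preconditions="\n".join(f"- {p}" for p in preconditions),
        examples="\n".join(examples),
        verification=verification,
    )
```

A hypothetical call might look like `build_prompt("Migrate logging from log4j to slf4j", ["Repository language level is Java 11 or above"], ["before: new Log4jLogger(); after: LoggerFactory.getLogger(App.class)"], "the verify tool passes and no log4j imports remain")`.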

The Art of Tool Limitation for Predictability
While you could connect an agent to numerous tools (like MCP tools) to fetch context dynamically for complex tasks, doing so reduces testability and predictability: every additional tool adds another dimension of unpredictability.
Spotify keeps its background coding agent's tools and hooks very limited, designed to focus on generating the correct code change from a prompt.
Current Core Toolset:
- Verify tool: Runs formatters, linters, and tests. It abstracts the invocation of in-house build systems across thousands of disparate repos and summarizes logs into an agent-digestible format.
- Git tool: Provides limited, standardized Git access. Dangerous commands like `push` or changing `origin` are not exposed, while actions like setting the committer and using standard commit message formats are standardized.
- Bash tool: A strict allowlist Bash tool with access to a few commands like `ripgrep`.
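The allowlist idea behind the Bash tool can be sketched in a few lines. This assumes the agent hands the tool a single command line; the set of allowed commands here is illustrative, not Spotify's actual list.

```python
# Minimal sketch of a strict-allowlist Bash tool: any command whose
# executable is not explicitly allowlisted is rejected before execution.
# The allowed set is an illustrative assumption.
import shlex

ALLOWED_COMMANDS = {"rg", "ls", "cat", "wc"}  # e.g. ripgrep for code search

def run_allowlisted(command_line: str) -> list[str]:
    """Tokenize a command line and reject non-allowlisted executables."""
    argv = shlex.split(command_line)
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        raise PermissionError(f"command not allowlisted: {argv[:1]}")
    return argv  # in a real tool, argv would be passed to subprocess.run
```

Keeping the check on the executable name (rather than pattern-matching the whole line) makes the tool's behavior easy to test and reason about, which is the point of limiting tools in the first place.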
Notably, code search or documentation tools are not currently exposed to the agent. Instead, users are asked to condense relevant context into the prompt upfront, or use separate workflow agents to generate prompts for the coding agent from various sources.

Limitations and Critical Considerations
The current state of agent and prompt engineering still relies heavily on intuition and trial-and-error. There's a lack of structured ways to evaluate which prompts or models perform best, and feedback loops to verify if a merged PR actually solved the original problem are still nascent. The long-term maintainability impact of agent-generated code also remains an open question.
The Path Forward: Next Steps
True maturity of this technology hinges on building measurable feedback loops. Beyond PR merge rates, we must explore ways to quantitatively assess the quality of generated code, the reduction in review time, and the ultimate impact on system stability. Furthermore, cultivating a culture where prompts are version-controlled and tested like code is crucial for sustainable automation.