Once you have a background coding agent that can edit code, run builds, and open pull requests automatically, a critical challenge emerges: how do you effectively tell it what to do? Producing reliable, mergeable pull requests (PRs) across real-world codebases takes more than simply running the agent; it requires meticulous context engineering. This post shares hard-won insights from Spotify's experience automating thousands of PRs across hundreds of repositories, covering effective prompt crafting and agent tool design.


Prompt Pitfalls and Hard-Earned Lessons

While experimenting with early open-source agents and their own agentic loop, the Spotify team identified two major anti-patterns.

  1. The Overly Generic Prompt: Expects the agent to telepathically guess intent and desired outcome. An instruction like "make this code better" provides no verifiable goal, leading to agent confusion.
  2. The Overly Specific Prompt: Attempts to cover every possible case but falls apart upon encountering the unexpected. Overly rigid step-by-step instructions stifle agent flexibility and quickly exhaust the context window during complex, multi-step changes.

The transition to Claude Code yielded key principles:

  • Tailor Prompts to the Agent: Claude Code performs better with prompts that describe the desired end state and leave room for it to figure out the path.
  • State Preconditions: Include instructions that preemptively block impossible tasks, e.g., "Do not take action if the repository uses a language level below Java 11."
  • Use Examples: A handful of concrete code examples heavily influence the outcome.
  • Define a Verifiable Goal: Ideally, define the desired end state in the form of tests. The agent needs a benchmark to iteratively improve its solution.
  • Do One Change at a Time: Bundling several related changes into one elaborate prompt may seem convenient, but it sharply increases the risk of context-window exhaustion or partial results.
  • Ask the Agent for Prompt Feedback: After a session, the agent itself is surprisingly adept at identifying what was missing in the prompt. Use this to refine future prompts.
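The principles above can be combined into a single prompt. The template below is a hypothetical illustration (not Spotify's actual prompt format): it describes a desired end state, states a blocking precondition, includes a concrete example, and names a verifiable goal, while asking for only one change.

```python
# Hypothetical prompt template embodying the principles above.
# The migration task, section names, and helper are illustrative.

MIGRATION_PROMPT = """\
Goal: Replace all uses of the legacy `Logger` API with SLF4J.

Precondition: Do not take action if the repository uses a language
level below Java 11. If so, stop and report why.

Example of the desired change:
    Before: Logger.log(Level.INFO, "started");
    After:  log.info("started");

Verification: The change is complete when `verify` passes, i.e. all
existing tests still succeed and no references to the legacy
`Logger` class remain.

Make only this one change; do not refactor unrelated code.
"""

def build_prompt(template: str, repo_name: str) -> str:
    """Prepend minimal repo context that the caller has condensed upfront."""
    return f"Repository: {repo_name}\n\n{template}"
```

Note what the template does not do: it never prescribes the step-by-step path, leaving the agent room to figure that out itself.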


The Art of Tool Limitation for Predictability

While you could connect an agent to numerous tools (like MCP tools) to fetch context dynamically for complex tasks, this reduces testability and predictability. More tools introduce more dimensions of unpredictability.

Spotify keeps its background coding agent's tools and hooks very limited, designed to focus on generating the correct code change from a prompt.

Current Core Toolset:

  • verify tool: Runs formatters, linters, and tests. It abstracts the invocation of in-house build systems across thousands of disparate repos and summarizes logs into an agent-digestible format.
  • Git tool: Provides limited, standardized Git access. Dangerous commands such as push or changing the origin are not exposed, while committer identity and commit message formats are standardized.
  • Bash tool: Restricted to a strict allowlist of a few commands, such as ripgrep.
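A strict-allowlist Bash tool can be sketched in a few lines. This is an illustrative assumption, not Spotify's implementation: `run_bash`, the allowlist contents, and the timeout are all hypothetical.

```python
import shlex
import subprocess

# Sketch of a strict-allowlist Bash tool, assuming the agent framework
# invokes run_bash(command) for each shell request. Allowlist contents
# are illustrative.
ALLOWED_COMMANDS = {"rg", "ls", "cat", "wc"}

class CommandNotAllowed(Exception):
    pass

def run_bash(command: str) -> str:
    """Execute a shell command only if its program is on the allowlist."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        # Refuse anything outside the allowlist. Because argv is passed
        # without a shell, pipes and subshells are also impossible.
        raise CommandNotAllowed(f"blocked: {command!r}")
    result = subprocess.run(argv, capture_output=True, text=True, timeout=60)
    return result.stdout + result.stderr
```

Passing `argv` directly (no `shell=True`) is the key design choice: it removes the shell's own escape hatches, so the allowlist is the only way to reach a program.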

Notably, code search or documentation tools are not currently exposed to the agent. Instead, users are asked to condense relevant context into the prompt upfront, or use separate workflow agents to generate prompts for the coding agent from various sources.
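The log condensation performed by the verify tool can be sketched as follows. The failure markers and line budget below are illustrative assumptions; the point is that the agent sees the likely cause plus the final status, not thousands of raw build lines.

```python
# Sketch of the log-summarization idea behind a verify-style tool:
# condense a long build log into an agent-digestible summary.
# Marker strings and the line budget are illustrative assumptions.
FAILURE_MARKERS = ("ERROR", "FAILED", "error:", "FAIL")

def summarize_log(log: str, max_lines: int = 20) -> str:
    lines = log.splitlines()
    # Keep lines that look like failures, plus the tail of the log,
    # so the agent sees both the cause and the final status.
    failures = [line for line in lines if any(m in line for m in FAILURE_MARKERS)]
    tail = lines[-5:]
    kept = list(dict.fromkeys(failures + tail))  # dedupe, preserve order
    return "\n".join(kept[:max_lines])
```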


Limitations and Critical Considerations

The current state of agent and prompt engineering still relies heavily on intuition and trial-and-error. There's a lack of structured ways to evaluate which prompts or models perform best, and feedback loops to verify if a merged PR actually solved the original problem are still nascent. The long-term maintainability impact of agent-generated code also remains an open question.

The Path Forward: Next Steps

True maturity of this technology hinges on building measurable feedback loops. Beyond PR merge rates, we must explore ways to quantitatively assess the quality of generated code, the reduction in review time, and the ultimate impact on system stability. Furthermore, cultivating a culture where prompts are version-controlled and tested like code is crucial for sustainable automation.
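Treating prompts like code can start small. As a hypothetical example (the section names and lint rules below are assumptions, not an established practice), a CI check could lint every prompt in a version-controlled library for the sections that make it verifiable:

```python
# Hypothetical lint for a version-controlled prompt library: every
# stored prompt must state a precondition and a verifiable goal.
REQUIRED_SECTIONS = ("Precondition:", "Verification:")

def lint_prompt(prompt: str) -> list[str]:
    """Return the required sections missing from a prompt."""
    return [section for section in REQUIRED_SECTIONS if section not in prompt]

# Usage: CI fails the build when any section is missing.
missing = lint_prompt(
    "Goal: upgrade Guava\n"
    "Precondition: Java 11+\n"
    "Verification: `verify` passes with no test failures"
)
assert not missing
```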