Agentic Driven Development (ADD): AGENTS.md, Skills, and the Full Workflow

I’ve been spending a lot of time over the past year integrating AI agents into my development workflow across several projects. It started with .github/copilot-instructions.md, which was GitHub Copilot’s first attempt at letting you give the AI project context. That was a good start, but it was a single monolithic file tied to one tool. Then AGENTS.md came along as an agent-agnostic standard, and more recently agent skills have taken things even further with domain-specific instruction files that get loaded on demand. The progression from “one big instruction file for one tool” to “a curated set of skills that any agent can use” has been really interesting to watch and build on.

I wanted to share what I’ve learned across that journey, what’s working, and where I think this needs to go.

What Are AGENTS.md and Agent Skills?

AGENTS.md is an open format for giving coding agents project-specific context. It emerged from collaboration across the AI coding ecosystem and is now stewarded by the Agentic AI Foundation under the Linux Foundation. It’s a markdown file you place at the root of your repository that tells agents how to build, test, and navigate the codebase, what conventions to follow, what architectural decisions matter, and what pitfalls to avoid. Think of it as onboarding documentation, but for AI.

It’s already been adopted by over 60,000 open source projects, and the list of tools that support it keeps growing: VS Code (v1.105+), GitHub Copilot, Cursor, Windsurf, Gemini CLI, Zed, Warp, Aider, and RooCode. Claude Code is one notable exception here. It uses CLAUDE.md instead of AGENTS.md, which I’ll come back to later.

Agent skills take this further. They’re focused instruction files (SKILL.md) organized into directories like .agents/skills/, .github/skills/, or .claude/skills/ depending on the tool, and they give the AI deep domain knowledge about specific areas of your project. When the agent is working on frontend code, it loads the frontend skill. When it’s writing tests, it loads the testing skill.

What I really like about this is that a lot of this structure works across multiple agents. Support and invocation behavior still vary by tool, but you can increasingly write these instructions once and benefit across more than one agent, which is a big step away from vendor lock-in.

One thing to know: skills don’t always get invoked when you’d expect. Sometimes the agent will just announce that a skill exists instead of actually using it, or it’ll ignore a relevant skill entirely. In VS Code, I have had better results with the experimental chat.experimental.useSkillAdherencePrompt setting. It uses a stronger adherence prompt that pushes the model to invoke relevant skills instead of just mentioning them. I’ve found this makes a noticeable difference, but since it’s experimental, I’d still treat it as something to test in your own workflow rather than assume across every setup.

Skip the Hierarchy, Keep It Flat

Some tools support hierarchical AGENTS.md files: parent directories, subdirectories, nested overrides. VS Code even has a chat.useNestedAgentsMdFiles setting for it. In theory this lets you share common conventions at the top and specialize deeper in the tree. In practice, I’ve had bad luck with it. Agents don’t always pick up the right files in the right order, rules from different levels can conflict in confusing ways, and debugging which instructions actually got loaded is a pain.

My recommendation: use a single AGENTS.md at the repo root and put everything else in skills. The root file stays concise: build commands, project structure, high-level conventions, and a table pointing to your skills. Domain-specific knowledge goes in skill files that get loaded on demand. This is simpler to reason about, easier to maintain, and more predictable across different agents.

How We Use This in Practice

Foundatio

Foundatio is an open source library providing pluggable building blocks for distributed .NET applications (caching, queues, locks, messaging, jobs, file storage). The AGENTS.md file covers:

Repository overview, what each abstraction does and the design principles (interface-first, testable, swappable implementations)
Quick start commands, dotnet build, dotnet test, dotnet format
Project structure, mapping out src/, tests/, samples/, benchmarks/
Coding standards, from style rules to architecture patterns like async suffixes, cancellation token conventions, and when to use ValueTask<T>
Common patterns, extension methods, logging, exception types, single responsibility guidance

We also use the AGENTS.md to keep the documentation up to date. When the agent makes changes to Foundatio, the instructions tell it to update the relevant docs as part of the same task. This means the README and API docs stay in sync with the code without us having to remember to do it manually.

The key insight here is the “Craftsmanship Mindset” preamble: “Every line of code should be intentional, readable, and maintainable. Write code you’d be proud to have reviewed by senior engineers.” Setting the tone matters. The agent will mirror the standards you set.

This has been a really important pattern in my own prompting: the opening of your instructions sets the tone for everything that follows. If you lead with “write clean, production-ready code,” the model approaches the task differently than if you lead with a list of file paths. I wouldn’t treat this as a universal law for every model and harness, but in practice I’ve found that putting your standards right at the top of AGENTS.md noticeably improves the quality bar.

Some official model documentation suggests this matters less than it used to. I’ve seen it firsthand though, and in my workflow it makes a massive difference when working with system prompts. If you want to go deeper on prompting techniques, I’d recommend reading the Anthropic prompting guide, the OpenAI prompt engineering guide, the Codex prompting guide, and the Google Gemini prompting strategies. Each provider has slightly different recommendations, but they all agree that clarity and structure matter.

It’s also worth noting that some models and harnesses actually work better with less in the AGENTS.md. Codex is a good example of this, where a leaner, more focused context file can outperform a comprehensive one. The arxiv study on evaluating AGENTS.md found exactly this: unnecessary requirements made tasks harder for agents, not easier. So know your tools and test what works. More isn’t always better.

Exceptionless

Exceptionless is a real-time error monitoring platform (ASP.NET Core 10 + Svelte 5) and this is where we went much deeper with skills. The AGENTS.md is relatively concise, pointing agents to load specific skills from .agents/skills/<name>/SKILL.md based on the domain they’re working in:

Domain	Skills
Backend	dotnet-conventions, backend-architecture, dotnet-cli, backend-testing, foundatio
Frontend	svelte-components, tanstack-form, tanstack-query, shadcn-svelte, typescript-conventions, frontend-architecture, storybook, accessibility, frontend-design
Testing	frontend-testing, e2e-testing
Cross-cutting	security-principles, releasenotes
Billing	stripe-best-practices, upgrade-stripe
Agents	agent-browser, dogfood

That’s 22 skills total, including 6 third-party ones (browser automation, dogfood, frontend design, release notes, Stripe) that we manage separately. When an agent is working on a Svelte component, it knows to use shadcn-svelte patterns, TanStack Query for data fetching, and our specific component conventions. When it’s working on the backend, it understands our repository patterns, Foundatio abstractions, and how we use FluentValidation.

Third-party skills are managed with the Vercel Skills CLI. You add them with npx skills add and keep them up to date with npx skills update, which uses a lock file to track versions. It’s really clean. Companies like Stripe and Anthropic publish official skills for their products, so instead of writing your own Stripe integration guide, you just pull theirs. There’s also skills.sh, a directory for discovering community and official skills across platforms.

Vercel has done a really great job with the npx skills CLI. It’s the best tool out there right now for managing third-party skills. That said, it feels a bit odd running npx and pulling in Node.js as a dependency for a .NET-only project that doesn’t otherwise have a package.json. You end up with a Node toolchain just to manage your agent skill files, which is a weird ask for a team that’s all-in on dotnet. I’d really love to see a dotnet skills tool or even a dotnet tool install skills that does the same thing natively. If anyone’s looking for an open source project to start, there’s your idea!

Security Risks of Third-Party Skills

One thing I want to be really clear about: third-party skills are code that directly influences what your AI agent does. They’re instructions that get injected into the agent’s context, and a malicious skill could tell the agent to do things you really don’t want, like exfiltrating secrets, introducing backdoors, or modifying files outside the scope of the task. This is a real supply chain risk, similar to pulling in an untrusted npm package or NuGet dependency.

Here’s how I handle it:

Always commit skills to your repo. Don’t pull them at build time or on the fly. They should be checked into source control so you have a clear record of what’s in your project.
Review skill updates like code changes. When you run npx skills add to update a third-party skill, treat the diff the same way you’d treat a dependency update. Read what changed. Make sure nothing sketchy got introduced. Don’t just blindly accept updates.
Run independent security scans. Tools like CodeQL and Copilot code review can catch some of this, but you should also be scanning your skill files as part of your security pipeline. If a skill file suddenly tells the agent to ignore security conventions or access environment variables, that’s a red flag. skills.sh does a nice job here by having three different vendors scan each skill in their registry. That’s a good start, but you still need to review every single change yourself. At the end of the day, skills are instruction-level supply chain risk. Treat them with the same seriousness you would any other dependency that can influence execution.
Prefer official skills from known sources. Skills published by Stripe, Anthropic, Vercel, and Microsoft carry a lot more trust than random community skills. That doesn’t mean community skills are bad, but you should be more careful vetting them.

This is still a pretty new ecosystem, so the tooling around skill verification and trust is still evolving. But the principles are the same as any other supply chain security: know what you’re pulling in, review it before you trust it, and keep a record of changes.

This Blog

Even this Astro blog has skills. There’s a project-structure skill documenting the site layout, a brand-guidelines skill for visual identity and voice, and an seo-schema-markup skill for JSON-LD structured data. When I ask an agent to write a blog post, it knows the frontmatter schema, the content directory structure, and the tech stack (Astro 5, Tailwind CSS v4, Pagefind).

Self-Learning: The Promise and the Trap

Exceptionless has a “Continuous Improvement” section in its AGENTS.md:

Each time you complete a task or learn important information about the project, you must update the AGENTS.md, README.md, or relevant skill files.

This is the self-learning pattern. The agent documents what it discovers as it works. In theory, the knowledge base grows organically over time.

In practice, this needs guardrails. Without them, you end up with bloated instruction files full of edge cases, one-off fixes, and stale information. It becomes like a FAQ that nobody maintains, every entry technically true but not really helpful because the useful stuff gets buried.

What I think works better is treating skills more like curated knowledge than an append-only log. A few principles:

Be selective about what gets documented. Not every discovery is worth persisting. A quirky workaround for a dependency bug isn’t a “skill”, it’s a footnote that’ll be irrelevant after the next update.
New features should trigger pruning. When you add a new capability or refactor a system, that’s your cue to review the relevant skill files. Remove rules that reference old patterns. Update conventions that have evolved. Skills should reflect the codebase as it is, not as it was.
Guard against tech debt in your instructions. Ironically, the files meant to prevent tech debt can become tech debt themselves. If a skill file has contradictory rules, outdated examples, or references to deleted code, the agent gets confused and produces worse output. Maintain your instruction files with the same discipline you maintain your code.
Think FAQ, not encyclopedia. Document the patterns and decisions that come up repeatedly. If you’ve explained the same architectural choice three times in code reviews, that’s a skill entry. If it’s a one-off edge case, let it go.

There’s actually academic research that backs up the “less is more” approach. A recent study, “Evaluating AGENTS.md”, found that context files can actually reduce agent success rates compared to providing no context at all, while increasing inference costs by over 20%. The key finding is that unnecessary requirements make tasks harder for agents. LLM-generated context files performed worst, while human-written files with minimal requirements did slightly better.

Interestingly, Vercel’s own evals tell a different story. They found that embedding documentation directly in AGENTS.md dramatically outperformed skills, hitting 100% pass rates on build, lint, and test metrics while skills with default settings scored the same as having no docs at all (53%). The key difference is that AGENTS.md content is always in the system prompt, so the agent doesn’t have to decide whether to look something up. Skills require the agent to make a decision to invoke them, and that decision point is where things break down. Even with explicit instructions to use skills, they only got to 79%.

So how do you reconcile those two findings? I think the arxiv paper is really about bloated, low-quality context hurting more than helping. And Vercel’s results are about the right context, compressed aggressively (they went from 40KB to 8KB while keeping 100% accuracy), delivered passively. Both point to the same conclusion: less is more, and the best context is focused, curated, and always available.

It’s also worth mentioning Context7 here. Context7 came before the skills ecosystem and is still really relevant. It’s an MCP server that fetches up-to-date, version-specific library documentation directly into your prompt. So instead of baking framework docs into your AGENTS.md or skills (where they’ll go stale), Context7 pulls the latest docs on demand. I use it alongside skills, where skills handle project-specific conventions and Context7 handles third-party library documentation. They complement each other well.

The takeaway from all of this is: keep your AGENTS.md focused on what actually matters, compress aggressively, and use tools like Context7 for documentation that changes frequently. Don’t dump everything you know into one file. The agents in the arxiv study did follow the instructions and explore more thoroughly, but all that extra exploration didn’t translate to better results when the instructions were bloated.

MCP Servers: Powerful but Painful

I have to talk about Model Context Protocol (MCP) servers here because they’re becoming a really important part of the agent ecosystem, but the experience of using them is still pretty rough.

MCP servers let you give agents access to external tools and data sources: databases, APIs, browser automation, file systems, documentation fetchers like Context7, and more. The idea is great. Instead of the agent only being able to read and write files, it can interact with the outside world through a standardized protocol. In practice though, there are a few real pain points.

Context bloat just from being enabled. Every MCP server you have enabled adds its tool definitions to the agent’s context, even if you never use it in that session. If you’ve got ten MCP servers configured, that’s a lot of token budget being spent on tool descriptions before you’ve even asked a question. This ties back to the “less is more” theme. Only enable the MCP servers you actually need for the project you’re working on, and disable the rest.

Configuration is different everywhere. Every tool has its own way of setting up MCP servers. Claude Code uses a claude_desktop_config.json or .mcp.json, VS Code has its own config format, Cursor does it differently, and so on. If you’re using multiple tools (and you should be, as I talked about with the cross-model review workflow), you end up maintaining the same MCP server configuration in three or four different places. There’s no shared standard for this yet.

Discovery is inconsistent across tools. Some tools are getting better at this. Cursor has an MCP directory built right in, and VS Code has one too. But not all tools have caught up, and there’s no universal npx skills add equivalent for MCPs. Depending on which tool you’re using, you might have a nice browsable directory or you might be searching GitHub and hoping for the best. Compare that to the skills ecosystem where skills.sh gives you a searchable directory and the Vercel CLI handles installation regardless of which agent you’re using.

Security is a huge concern. This is the one that really worries me. MCP servers can execute arbitrary code, make network requests, access your file system, and interact with external services. A malicious or compromised MCP server has way more attack surface than a malicious skill file. A skill file can only influence what the agent thinks it should do. An MCP server can actually do things on your machine. What’s worse, some editors like VS Code ship MCP tools that haven’t been through a thorough security review and are enabled by default. That’s a lot of trust to place in tooling that most developers haven’t audited.

Here’s what I recommend:

Only use MCP servers from trusted sources. This is even more important than with skills. Check the source code, check who maintains it, and check how actively it’s maintained.
Audit the permissions. What can this server access? Does it need network access? File system access? Does it run with your user’s full permissions? The less access it needs, the better.
Pin versions. Don’t pull latest for MCP servers. Pin to a specific commit or version so you know exactly what you’re running. Review changes before updating.
Watch for prompt injection. MCP servers return data that gets fed into the agent’s context. If an MCP server returns data from an untrusted source (like user input from a database or content from the web), that data could contain prompt injection attacks that hijack the agent’s behavior.

The MCP ecosystem is moving fast and the protocol itself is really well designed. But the tooling around discovery, cross-tool configuration, and security still has a long way to go. I’m hopeful this gets better quickly because MCPs are genuinely powerful when they work.

Planning Mode: Think Before You Code

One thing I’ve changed in my workflow is that I almost always run in planning mode. In Claude Code you can hit Shift+Tab to toggle into plan mode, or just type /plan. This puts the agent into a read-only or planning-first exploration mode where it analyzes your codebase and proposes a plan before touching any files.

Why does this matter? Without planning mode, agents tend to just start writing code immediately. They’ll make assumptions about your architecture, pick the wrong file to modify, or go down a path that doesn’t fit your patterns. With planning mode, the agent explores first, reads the relevant files, understands the structure, asks clarifying questions, and then presents a plan for your approval before it writes a single line.

This is especially valuable when you’re working with skills. The agent loads the relevant skills during the planning phase and you can verify it’s actually applying them before it starts coding. I’ve caught plenty of cases where the agent’s initial approach would have conflicted with our conventions, and a quick redirect during planning saved a lot of back-and-forth.

I have to say, Cursor has the best planning experience I’ve used. I always get really great results from its planning mode. It’s worth trying if you haven’t already. The Claude CLI and GitHub Copilot CLI are also solid options for terminal-based planning and coding. One thing to note: Claude’s tooling centers CLAUDE.md rather than AGENTS.md, so keep that in mind if you’re working across multiple tools.

Steering with voice is something I’ve been really enjoying. VS Code 1.109 added message steering and queueing as an experimental feature, and with the VS Code Speech extension you can use voice to kick off a task, bounce ideas around, or learn new concepts by talking through them with an agent. It’s a really natural interaction model and it’s how I like to start my planning sessions now.

I also really like how Google Antigravity handles planning. It generates a structured plan and then lets you leave comments directly on it, similar to commenting on a Google Doc. The agent incorporates your feedback without stopping its execution flow. That’s a really nice interaction model. I wish all the tools would adopt something like that. And honestly, all of them would be better if you could print or share the plan more easily.

Plans as Architectural Documents

One thing I’ve started thinking about differently is that plans aren’t just throwaway artifacts. They’re architectural documents. A good plan captures the “why” behind a set of changes: what problem you’re solving, what approach you chose, what alternatives you considered. That’s incredibly valuable context that usually gets lost once the code is written.

I’ve started saving plans alongside the code, either in a docs/ directory or in the PR description. Six months from now when someone asks “why did we do it this way?” that plan is the answer. You can also keep plans outside the codebase entirely, in a wiki or shared docs system, especially for bigger architectural decisions that span multiple repos. The point is: don’t let that context disappear after the agent finishes coding.

This idea of “docs as the source of truth” extends beyond plans. Ralph is a really interesting project that takes PRDs (Product Requirements Documents) and turns them into the driving force of an autonomous agent loop. You write a PRD as a markdown doc, convert it into structured user stories, and Ralph picks up incomplete stories one at a time, implements them, runs quality checks, commits, and moves on. The PRD becomes the living spec that both humans and agents work from. I really like this pattern because it keeps the requirements front and center instead of buried in a ticket system. The PRD is the documentation, the acceptance criteria, and the progress tracker all in one.

OpenCode and Codex

I want to mention OpenCode here because it’s a really solid alternative in the terminal agent space. It’s open source, supports multiple LLM providers (75+ through Models.dev), and the TUI is really nice with multi-session support, LSP integration, and shareable session links.

I also really like Codex. For me it’s one of the best experiences for agentic development right now. You kick off a task, iterate quickly, and focus on the end result rather than staring at every line of code. Some people call this “vibing,” and honestly that’s a pretty good description. The key is that you’re not just blindly trusting the output. You’re pairing that fast iteration with the multi-agent review workflow I described earlier, so you get the speed of vibing with the code quality of a proper review process. That combination is really powerful.

The reason I bring up all these tools is that it validates the agent-agnostic approach. If you invest time in good AGENTS.md and skill files, you’re not tied to any one tool. You can switch between Claude Code, Codex, Copilot, OpenCode, and whatever comes next depending on the task, and they all benefit from the same project knowledge.

The Full Agent Workflow: PRs, Reviews, and Feedback Loops

Where this really clicks is when you wire it all together into a continuous feedback loop. Here’s the workflow I’ve settled into:

1. Plan and Implement

Planning is the most important step. I use Claude or Codex for planning since they’re really strong at architectural reasoning. Feed it all the context it needs: the ticket, acceptance criteria, relevant docs, architectural constraints. Let the agent explore the codebase, understand the architecture, and propose a plan. Then actually read the plan. Break it down step by step. Walk through the approach and iterate on it. Ask questions, push back on things that don’t feel right, and have the agent revise until you’re confident in the direction. This is where you catch bad approaches early, and the time you invest here saves you way more time later.

For implementation, I’ll use Claude, Codex, or Gemini depending on the task. Gemini has been really solid for frontend work in particular. Then for code review, I always use a different model than whatever did the implementation. If Claude wrote it, Codex or Copilot reviews it. If Codex wrote it, Claude reviews it. The whole point is getting a different perspective.

2. Cross-Model Review Before Committing

Once the agent has an implementation, have sub-agents on different models review the changes. This is the multi-agent review I talked about earlier. Check the current branch, make sure everything looks right, and iterate until you’re happy with the quality. The bottleneck in any agentic workflow is going to be at review time. This is where you need the most powerful (and expensive) models to catch what the cheaper ones miss.

It’s really important not to let tech debt sneak in during this process. I always look at the diff while the agent is working, before every push, and always after. Don’t just trust the agent and move on. Review the code as it’s going in. If something looks off, stop and fix it now because it’s way easier to address in the moment than after it’s merged.

If I’m not sure about the implementation at this point, I always tell the agent to commit but don’t push. This lets me iterate locally without anything going out to the team. I can review the diff, ask the agent to change things, commit again, and keep going until I’m confident. Only when I’m happy do I move to the next step.

3. Run Tests and E2E

Before you branch and push anything, have the agent run the full test suite. Unit tests, integration tests, whatever you’ve got. If anything fails, the agent fixes it and runs them again. Don’t push broken code.

For frontend work, this is where the agent-browser skill is one of my favorites. It lets the agent do end-to-end browser testing, actually navigating your app, filling out forms, clicking buttons, and verifying the UI works as expected. It’s a really nice way to catch issues that unit tests miss before anything goes out for review.

For .NET projects, the .NET Aspire MCP is a game changer here. It automatically launches all your resources (databases, caches, message brokers, everything), provides full logs for the entire stack, and makes running tests and agent code a breeze. Instead of manually spinning up dependencies or mocking everything, Aspire handles the orchestration so the agent can just run tests against a real environment.

4. Branch, Commit, and PR via GitHub CLI

I let the agent handle the entire git workflow using the GitHub CLI (gh). As defined in our AGENTS.md, the agent follows Git Flow, creates a feature or bug branch, commits with a structured message format like [feat] #123 Add retry logic to queue processor or [fix] #456 Handle null reference in queue, and opens a PR filling out the PR template. The agent knows how to do all of this. You don’t need to hand it specific commands.

The agent writes better commit messages than I do half the time because it has the full diff context and the skill files telling it our conventions. No more “fix stuff” commits.

5. PR Review: Copilot, CodeQL, and Team

Once the PR is up, kick off the review process. GitHub Copilot automatically reviews it using your skill files and copilot-setup-steps.yml workflow. CodeQL runs static analysis for security and quality issues. And your team gets notified to review as well.

6. Address Review Comments

This is the part most people miss. Just tell the agent to go review the PR comments and CodeQL findings. It already knows how to use the GitHub CLI to pull that information. You don’t need to hand it specific commands.

Ask the agent to review all the PR comments to determine which ones are valid. For valid issues, have it do a proper root cause analysis and fix the underlying problem, not just patch the symptom. Make sure it adds or updates test coverage for any fixes and that the overall code quality is solid. Then push the update, and have the agent reply to and resolve the comments on the PR.

7. The Review Loop

Now repeat. Copilot and CodeQL review the new changes. Your team looks at the updates. Pull any new comments back into the agent. Fix, push, reply, resolve. Keep running this loop until the feedback is clean.

For every PR, have the agent self-review the entire diff before the next round of human review. Sometimes stepping back and looking at the full picture catches things that incremental reviews miss. This should be a standard part of the loop, not just something you do for big changes.

8. The Compound Effect

This creates a real feedback loop:

Planning mode with a strong model ensures it thinks before coding
Cross-model review catches issues before you even push
Tests and E2E verify everything works before branching
GitHub CLI automates the branch, commit, and PR workflow
Copilot + CodeQL + team review against your conventions
Review comments feed back into the agent for RCA and fixes
Agent replies and resolves comments on the PR directly
Patterns that keep coming up get added to skills

The loop tightens over time. As your skills get better, the initial code gets better, the reviews find fewer issues, and the whole cycle speeds up.

You might ask, how is this process actually working? We’ve shipped some really technical features to the Foundatio codebase and apps with a very high quality bar to production using this workflow. Right now we’re iterating on a massive rewrite of Foundatio.Parsers and Foundatio.Repositories for Elasticsearch 9, and all tests are currently passing. This isn’t a toy workflow for side projects. It’s how we’re shipping real, complex infrastructure code.

Just remember: there’s a bad actor out there who also has agents, and they never sleep. Put security processes into your workflow. Even if it’s a one-liner that audits all changes for security issues and OWASP vulnerabilities, that’s better than nothing. The agents working against you are getting better at the same rate as the ones working for you.

But the automated PR review pipeline is only half the story. The other half is what happens before you even push.

Where Developers Fit: Multi-Agent Review and Iteration

So if agents are writing more and more of the code, what are developers actually doing? I’ve been thinking about this a lot lately.

From everything I’ve seen, we’re moving up the stack. Less time writing boilerplate, more time on the stuff that really matters: breaking down problems, reviewing architecture, defining acceptance criteria, and making sure the output is correct and secure. It’s less “person who writes all the code” and more “person who orchestrates and validates the work.” And honestly? I think that’s a pretty exciting shift.

But I want to be clear: I’m not just kicking things off and walking away. I still look at the architecture and implementation closely. I question edge cases and flag problems I see. I still influence the design a ton, pushing back on complexity, asking about security implications, challenging whether an approach is really the right one. That human feedback loop is still really important. The agents handle a lot of the execution, but the critical thinking, the “wait, what about this scenario?” questions, that’s still very much on you as the developer. At the end of the day, this is still code that you are responsible for and it has to ship at a very high quality bar.

Multi-Agent Review Before Human Review

Here’s what I’ve been really excited about: instead of one agent producing code and immediately throwing it at a human reviewer, you have multiple agents review each other’s work first. Think of it like having several reviewers with different perspectives do a first pass before the final review.

As I mentioned in the workflow above, I always use a different model for review than whatever did the implementation. They have different strengths and blind spots. A review from a different model catches things the original one missed, just like two developers with different backgrounds looking at the same code. Feed that feedback back to the implementing agent, let it address everything, then review again with the second model. Back and forth until it’s clean. After a few rounds of this ping-pong, the code quality is really solid before a human ever touches it.

I’ve found that saying “review all changes in this branch” or “review this PR” helps a ton. Tell the agent to make sure every change was absolutely needed. Check for security issues, edge cases, readability, and performance problems. Being explicit about what you want reviewed makes the output way more useful. Tell the review agent to review as if the code needs world-class integration and is being deployed to production. Sounds simple, but it puts the review on a whole other level. The agent stops giving you surface-level “looks good” feedback and starts really digging into edge cases, error handling, and whether the code is actually robust enough. The bar you set in the prompt directly affects the quality of the review.

There are community projects formalizing this same concept. Anvil is a Copilot CLI plugin that does adversarial review across up to three different AI models (GPT, Gemini, Claude) that each try to find bugs, security holes, and logic errors. I haven’t used it yet, but it’s doing exactly what I’ve been doing manually and it’s something I’m actively planning to look into. The disagreement between models is the feature. If one model thinks the code is fine but another flags a problem, that’s exactly what you want caught before a human reviews it.

You can take this even further with specialized review passes: one agent focused on security auditing, another checking architectural patterns, another on code quality and naming conventions. Each one catches different things. By the time a human sits down to review, the obvious issues are already resolved. The human reviewer can focus on the high-level stuff: does this approach make sense? Does it fit the broader architecture? Are there edge cases the agents missed?

Breaking Down Acceptance Criteria Into Your Flow

One of the most underrated things you can do is feed your acceptance criteria directly into the agent workflow. Take your ticket or spec, break it into concrete acceptance criteria, and let the agent work through them one by one. The agent can even help you break down vague requirements into testable criteria, which is super useful when a ticket is a little hand-wavy.

Then each criterion becomes a checkpoint. Did the agent’s implementation actually satisfy it? You can have a review agent validate each criterion against the code changes. This gives you a structured way to verify completeness instead of just eyeballing a diff and hoping nothing got missed.

Testing and Human QA

After the multi-agent review cycle, you hand off to testing agents. These can generate test cases from your acceptance criteria, write unit and integration tests, and even run them. If the tests don’t pass, the agent loops back and fixes the issues before a human ever sees them.

If you don’t have acceptance criteria yet, you can generate them quickly. Just describe what the feature should do and have the agent break it down into testable criteria. It’s a really fast way to get structured coverage even when the spec is thin.

But here’s the thing: you still need human QA. Agents are really good at verifying specific behaviors, but they don’t have the intuition for “this feels wrong” or “this flow is confusing.” Human QA catches the stuff that’s technically correct but practically bad. The sweet spot is letting agents handle the mechanical verification so your QA team can focus on exploratory testing and user experience.

There are skills like Impeccable that are trying to help agents evaluate UI quality. It’s not quite there yet in terms of “taste,” but it’s getting closer every day. I think this is one of the areas where we’ll see a lot of improvement soon.

The next logical step here is agent skills specifically for QA. Imagine a skill that takes your acceptance criteria, builds automated tests from them (I really wish these lived in the repo alongside the code), and verifies regression scenarios. You could give higher priority to testing areas with changing requirements so the feedback cycle is even faster. This would let QA teams adapt to shifting specs without manually rewriting test plans every sprint.

One thing that doesn’t get talked about enough is using agents as a learning tool. When I’m working on an unfamiliar part of the codebase, I’ll ask the agent to explain how something works, walk me through the data flow, or diagram the architecture. It’s like having a senior dev who’s already read every file sitting next to you.

This is super useful for understanding the impact of a bug. Ask the agent to trace through the code and show you everywhere a change could have downstream effects. Have it do a proper root cause analysis and explain not just what went wrong but why. I’ve found that agents are really good at this kind of investigation because they can hold the entire codebase in context and follow the threads way faster than I can by reading files one at a time.

The knowledge sharing side is where it gets even more interesting. Ask the agent to generate mermaid diagrams of the architecture, data flows, or system interactions. These are incredibly useful for onboarding new team members, documenting complex systems, or explaining a bug fix in a PR description. Instead of spending an hour drawing a diagram by hand, the agent generates one in seconds and you can iterate on it.

I’ve started making this a regular part of my workflow. After finishing a complex feature or fixing a tricky bug, I’ll ask the agent to write up a summary with diagrams explaining what changed and why. That documentation lives in the PR, in the docs, or in the skill files. It compounds over time and makes the whole team faster.

The Massive Leap in LLM Capabilities

I really cannot overstate how much these models have improved in just the past three to four months. The jump in code understanding, multi-file reasoning, and the ability to hold complex architectural context has been unlike anything I’ve seen before. Tasks that agents would consistently botch six months ago? They handle them reliably now. The multi-agent review pattern I described above wouldn’t have been practical even a few months ago because the individual agents weren’t reliable enough to trust their reviews.

I want to be honest though: none of this is perfect. Agents will hallucinate. They will lie to you with confidence. Code reviews from agents will miss things or flag false positives. But for everything they get wrong, they get a whole lot right. I’ve found some really impactful bugs, security issues, and architectural problems by following this process that I might have missed on my own. And it’s only going to get better with time.

SDL and Code Scanning Are More Important Than Ever

This is something I want to really emphasize: with agents writing more code, your Security Development Lifecycle (SDL) and code scanning practices are more important than ever, not less. It’s really easy for an LLM to sneak things in. Not maliciously, but because it’s optimizing for “make the code work” and doesn’t always think about security implications the way a human would. It might introduce a SQL injection vulnerability, hardcode a credential pattern, add an overly permissive CORS policy, or pull in a dependency with known CVEs, all while confidently telling you the code is correct.

This is why automated scanning needs to be a non-negotiable part of your pipeline:

Static analysis (SAST) with tools like CodeQL, Semgrep, or SonarQube should run on every PR. These catch the patterns that agents commonly introduce: injection vulnerabilities, insecure deserialization, hardcoded secrets, and more.
Dependency scanning with Dependabot, npm audit, or dotnet list package --vulnerable catches vulnerable packages the agent might add.
Secret scanning should be enabled at the repo level. Agents sometimes generate placeholder secrets or copy patterns from training data that look like real credentials.
Copilot code review adds another layer by reviewing against your skill files and conventions automatically on every PR.

The multi-agent review workflow I described helps here too. Having a second model review specifically for security issues catches a lot. But automated scanning is your safety net. Don’t rely on any single layer. The whole point of defense in depth is that each layer catches what the others miss.

So I think now is the time to invest in your agent infrastructure. The AGENTS.md files, the skills, the review workflows, all of it gets more valuable as the models improve. You’re not just building for today’s capabilities. You’re building the scaffolding that will compound as these models keep getting better. And based on the trajectory I’ve seen recently, they’re going to keep getting better fast.

Hooks: The Next Big Development

I think hooks are where the next big leap in agent workflows is going to come from. Right now, tools like Claude Code, VS Code (preview), and Cursor support hooks that fire shell commands in response to events like tool calls, file edits, or commits. That’s useful, but we’re just scratching the surface.

Imagine hooks that automatically run code formatting after every file edit. Or hooks that pause the session to ask you a clarifying question before the agent goes down a path you didn’t intend. Or hooks that kick off a sub-agent review, like Anvil’s adversarial multi-model review, every time the agent finishes an implementation task. All of this while keeping the session alive so you don’t lose context.

The hooks themselves already support all of this today. You can build any of these right now. The missing piece is that there’s no prebuilt ecosystem for hooks like skills.sh is for skills. You have to write them yourself. Here are some examples of what you can do:

Auto-formatting on save. The agent writes code, a hook runs your formatter (Prettier, dotnet format, whatever) before the agent moves on. No more style nits in code review.
Security scanning. A hook that runs a security scanner after every file edit. Catch secrets being committed, detect known vulnerable patterns, or flag changes to authentication and authorization code before the agent moves on.
Dependency auditing. When the agent adds a new package, a hook runs npm audit or dotnet list package --vulnerable to check for known vulnerabilities before it gets committed.
Interactive checkpoints. Hooks that prompt the user for input at key decision points. “The agent wants to add a new dependency, do you approve?” This keeps you in the loop without you having to watch every step.
Automated sub-agent review. After the agent finishes a task, a hook automatically sends the changes to a review agent (or multiple review agents) before proceeding. This is basically Anvil’s pattern built into the workflow as a hook.
Test-on-commit hooks. Before the agent commits, run the test suite. If tests fail, feed the failures back to the agent automatically. No human intervention needed for the fix cycle.
License compliance. A hook that checks any new dependencies against your approved license list. Block the agent from adding GPL code to your MIT project, for example.
Skill and docs updates. After a refactor, a hook could prompt the agent to review and update relevant skill files and documentation.

I’d love to see a community hooks registry similar to skills.sh emerge. Prebuilt, vetted hooks for common patterns like security scanning, formatting, and compliance would save everyone a lot of time.

GitHub is already moving in this direction with gh-aw (GitHub Agentic Workflows), which lets you write agentic workflows in natural language markdown and run them in GitHub Actions. It’s read-only by default with sandboxed execution, input sanitization, and team-gating for critical operations. This is the kind of thing I’m talking about, taking the agent workflow and making it a first-class part of your CI/CD pipeline.

I also think we’re going to see a lot more investment in sandboxing for agents. Right now, most agents run with your full user permissions, which is a pretty big trust surface. As agents get more autonomous and hooks let them do more without human intervention, proper sandboxing becomes critical. Containers, restricted file system access, network isolation, scoped API tokens. The tools that figure out how to give agents enough access to be useful while limiting the blast radius of mistakes (or compromises) are going to win.

The tools that get hooks and sandboxing right are going to have a huge advantage. Hooks are what turn an agent from a tool you interact with into a workflow you orchestrate. I’m really excited to see where this goes.

The CLAUDE.md Problem

If you’re using Claude Code, you need to know that Anthropic has chosen not to support the AGENTS.md standard. Claude Code uses its own CLAUDE.md format instead. As of March 4, 2026, there’s an open issue with nearly 3,000 upvotes requesting AGENTS.md support, but no official response yet. It’s frustrating because a growing set of major tools support the standard while Claude still requires its own parallel file.

So if you want your agent instructions to work across Claude Code and everything else, you need to maintain both files. The easiest workaround is to just copy your AGENTS.md content into a CLAUDE.md file and keep them in sync. It’s annoying, but it works. Some people use hard links (a way to make two filenames point to the same file on disk), but git treats them as separate files anyway, so you’re still managing two copies in version control. I’m really hoping Anthropic just adds AGENTS.md support so we can stop dealing with this.

Migrating from Copilot Instructions

If you’re still using .github/copilot-instructions.md, I’d recommend migrating to AGENTS.md and skills. Use the most advanced model you can (like Opus), feed it the documentation for AGENTS.md and agent skills, find existing skills or an AGENTS.md that’s really close to what you’re thinking or is specific to the tech stack you’re using as a base, tweak it in plan mode, and go from there. Tell it to be very surgical. The migration is straightforward since it’s all just markdown. The big advantage is that you’re no longer locked into one vendor’s instruction format. Copilot and a growing set of other tools support AGENTS.md, even though Claude Code still requires CLAUDE.md. And the skill-based approach scales much better than a single monolithic instruction file.

The Future: Orchestration by Powerful Models

Looking ahead, I think the biggest shift is going to be orchestration by a powerful language model. Instead of you manually kicking off agents, reviewing, sending to another agent, and managing the loop yourself, a top-tier model acts as the conductor. It breaks down the task, delegates to specialized sub-agents (planning, implementation, code review, testing, security), coordinates their work in parallel, and synthesizes the results. You describe what you want, and the orchestrator figures out how to get there.

We’re already seeing this take shape. VS Code 1.109 introduced agent orchestration as a first-class concept, where multiple specialized agents collaborate on complex tasks. Each sub-agent operates in its own context window (solving the context overflow problem), can use different models optimized for its role, and independent tasks run in parallel. The community has already built orchestration systems on top of this like Copilot Orchestra and GitHub Copilot Atlas.

Distribution has always been the hardest part of any computing problem, and agent orchestration is no different. You’re starting to see Google push this forward with Gemini, where the model itself handles the coordination across multiple agents and tools. I think this is the natural evolution of everything I’ve described in this post. The workflow I’m doing manually today (plan, implement, cross-model review, test, PR, review loop) is going to be something you describe once and an orchestrator runs end to end.

The key ingredient is that the orchestrating model needs to be really powerful. It needs to understand your codebase, your conventions (through AGENTS.md and skills), the strengths of different sub-agents, and when to escalate to a human. This is why investing in your agent infrastructure now matters so much. The orchestrators are coming, and the teams with the best AGENTS.md files, skills, and review patterns are going to get the most out of them.

The Cost Rug Pull

I do think there’s going to be a rug pull on costs at some point. Right now, models are really cheap with incredible value. Codex, Copilot, and Gemini subscriptions let you do a ton for very little money. But be careful with the “newer is cheaper” narrative. I’ve read breakdowns of both Gemini and OpenAI where a newer model claims lower per-token pricing, but the model emits way more tokens to accomplish the same task, so the actual cost ends up being higher. Always look at total cost for a task, not just the per-token rate.

This is why open source models are so important. You can run models locally today, and the latest developments around running models via Apple’s Neural Engine are really exciting. Projects like ANE have potential to bring real speed, efficiency, and cost savings for local inference. Having the option to run locally means you’re not entirely dependent on subscription pricing that could change at any time.

What’s Coming Fast

I think we’re going to see some really exciting iteration speeds on things that used to be painful. Localization, accessibility, documentation. Every major product is going to ship with an MCP server or skill. We’re already seeing it with Stripe, Vercel, and shadcn. That trend is only going to accelerate.

Relatedly, website traffic is also declining across the board. Look at what happened to Tailwind CSS and other documentation-heavy sites. A lot of this is because AI is being used for search now, and those models are being trained on your content. It’s worth asking: does your website have an llms.txt? If not, you might want to think about how your content is being consumed in an AI-first world.

Just remember, it’s okay not to be the first. Things are moving so fast right now. Just be aware of what’s out there and implement when there are clear winners.

Future Tech Debt (or the Lack of It)

This one is a little further out, but I think we’re going to get to a place where if you have enough tests, requirements, and acceptance criteria, you can just regenerate your application. Point it at your existing data stores, your API contracts, your test suites, and let the agent rebuild the implementation from scratch. New tech debt appears until it gets regenerated again. It sounds wild, but it’s the logical conclusion of everything I’ve described in this post.

That said, it’s still going to be really important for a person to put tools and processes in place (with agent assistance) to make sure APIs aren’t broken and users are happy. But will apps even be apps as we know them? Or will they be more like personal assistants feeding you data through your glasses or via voice? I think this is where OpenAI is going with their voice-driven Jarvis product. Just remember: your privacy and security should always be paramount.

Getting Better Over Time

The real value compounds. Here’s how I think about the improvement cycle:

Start small. An AGENTS.md with build commands, project structure, and your top 5 coding conventions.
Add skills as pain points emerge. When you notice the agent consistently getting something wrong, that’s a skill entry. Don’t preemptively document everything.
Review skills during refactors. Changing architecture? Update the skills. Dropping a dependency? Remove references to it. Skills are living documents.
Use code review feedback as input. If the same comment keeps coming up in PRs (from humans or agents), codify it in a skill.
Prune ruthlessly. A shorter, accurate skill file outperforms a longer, stale one every time. The agent has a context window. Respect it.

The goal isn’t to document everything. It’s to document the right things, the patterns and decisions that make your codebase yours, so that AI agents can be productive contributors rather than generic code generators.

One more thing worth keeping in mind: prompt placement and repetition matter more than people expect. I first saw this framed well in this tweet, and it stuck with me because it matches what I’ve seen in practice. This has real implications for how you structure your AGENTS.md. Your most critical conventions should be stated clearly and prominently, not buried in a wall of text. If the agent has to wade through dozens of rules to find the important ones, the important ones lose their impact.

If you’re not using AGENTS.md yet, I’d really encourage you to try it. Start with build commands and your most important conventions. You’ll be surprised how much better the AI output gets with even a little bit of context. Then try the cross-model review workflow: have one agent write, another review, and iterate until it’s clean. The combination of good skills and multi-agent review has been a pretty huge improvement for the quality of code coming out of my projects.

Prompting is a skill worth investing in. A great prompt has three things: context (what you’re working on, the tech stack, the constraints), intent (what you want the agent to do and why), and a definition of done (what a good result actually looks like). Be specific. “Refactor this function” is okay. “Refactor this function to use the repository pattern from our Foundatio skill, keep backward compatibility, and add unit tests” is way better. And if you’re not getting good results, don’t just ask the same thing louder. Come at it from a different angle. Try “tell me everything wrong with approach X” and then solve for it. Sometimes changing the prompt is like going for a walk and coming back with fresh eyes. A different framing can completely change the output.

And if you’re already doing all of this, I’d love to hear how you’re handling the maintenance and pruning side of things. That’s where I think the community still has a lot to figure out.

For those just entering the field: there is a whole lot more to learn than just prompting, especially around security. You don’t always have to be the first to do something, just execute well. Learn to prompt effectively, but also learn how these models actually work (it’s next-word prediction, all the way down). Use AI as a tool. Trust that it will get you most of the way there, but know that it’s always that last 10-20% that’s the hardest. That’s where your skills, your judgment, and your experience make all the difference.

One last thing: this blog was converted from Hugo to Astro in a few hours via OpenAI Codex. I didn’t write a single line of code, just reviewed it and iterated. This blog post was written by Claude Code (Opus 4.6) with a skill built from my writing voice across all my prior blog posts, reviewed by the Codex app, and proofread and agentic-looped by yours truly. That’s where we’re at right now. Never stop learning, trust your gut, and ship it.

Agentic Driven Development (ADD): AGENTS.md, Skills, and the Full Workflow

How-to get notifications when your mailbox is opened