the end of writing code and the beginning of directing it
a shift nobody prepared for
Nine months ago I wrote a post about how Anthropic’s pricing changes hurt indie builders. I was frustrated, so I spent most of that post complaining about rate limits, weekly caps, and the $500-plus monthly paywalls. Since then I have spent time away from paid models and focused on building with open-source and open-weight alternatives. In that time, I learned a great deal about the gaps between retail tools and their open counterparts.
What I didn’t fully appreciate was that tooling is changing underneath us all in a more fundamental way than pricing. Tools are getting more expensive, but they are also getting more capable, shifting the nature of what it means to “build software” from writing code to directing agents that write code for you. I also read The Scaling Era: An Oral History of AI, 2019-2025 by Dwarkesh Patel, which argues that improvements to LLMs, and to the tools built on them, are tied directly to increases in performance, compute, and cost.
I shipped a lot last year using Claude, ChatGPT, and whatever else was available; you can read more about what I built in my YC post if you are curious. At the time, I thought the hard part of building with AI tools was managing rate limits, context windows, and token costs, but as I continued to build and develop more mature skills with these tools, I came to realize the hard part is learning to think like an orchestrator. As anyone who has been building over the last year or two can attest, the term “developer” now means something entirely different than it did before. The actual work of “building” looks more like product management, platform engineering, and irritating quality assurance. Before AI tools, development was mostly designing and writing functions, debugging stack traces, and closing GitHub issues with PRs. Now, basically everyone is an engineering manager dealing with interns.
This post is about what changes AI tools have made to development workflows in practice, the gap between retail tools and open-source, and what the future probably holds, at least from the perspective of a former developer.
claude code in 2026
Claude Code has matured significantly since the rate limit fiasco of mid-2025 and is now probably the most capable agentic coding tool available for retail users. The best way to take advantage of Claude Code is to treat it less like an autocomplete engine and more like a junior engineer who needs clear direction, periodic review, and structured workflows, as I alluded to earlier.
Let’s talk about git tree workflows. For those not familiar, trees are one of the core objects in how Git actually stores your code. Every time you make a commit, Git creates a tree object that is a snapshot of your entire project directory. The tree contains references to blobs, which are the actual file contents, and to other trees, which are subdirectories. It is recursive: a tree can contain other trees, all the way down until you hit individual files. When Claude Code references a git tree, it is reading how your entire project fits together: which files import from which, what changed between branches, and where things might be inconsistent.
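You can poke at this object model yourself with git cat-file. Here is a minimal sketch using a throwaway repo; it assumes git is on your PATH, and the file names and identity settings are purely illustrative:

```python
# Inspect the tree object behind a commit in a throwaway repo.
import subprocess, tempfile, os

def run(args, cwd):
    return subprocess.run(args, cwd=cwd, capture_output=True,
                          text=True, check=True).stdout

repo = tempfile.mkdtemp()
run(["git", "init", "-q"], repo)
os.makedirs(os.path.join(repo, "src"))
with open(os.path.join(repo, "src", "main.py"), "w") as f:
    f.write("print('hello')\n")
run(["git", "add", "."], repo)
run(["git", "-c", "user.name=demo", "-c", "user.email=demo@example.com",
     "commit", "-q", "-m", "initial"], repo)

# The commit's top-level tree holds a subtree for src/, which holds the blob.
top = run(["git", "cat-file", "-p", "HEAD^{tree}"], repo)
print(top)
```

The output lists one entry per directory item, each tagged as a blob or a tree, which is exactly the structure Claude Code walks when it reasons about your project.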
In practice, starting a Claude Code session looks like opening a terminal in your project root and typing claude. Because it can already see your git tree, you can immediately ask Claude to look at the diff between main and feature/auth-refactor and tell you if you missed anything. Claude will then traverse the tree objects for both branches, compare blobs, and give you a file-by-file breakdown of what changed, what looks wrong, and what might break downstream. It catches things like a renamed utility function in one file that is still being called by its old name in three others. The same approach works for onboarding into unfamiliar projects. Clone a repo you have never seen, ask Claude Code to read the tree and explain the project structure, and it will walk through the directories, identify the framework, note the config files, and summarize how things are organized. Useful for open-source contributions where you want to make a targeted change without reading every file first.
The workflow I rely on most involves using git trees to work on multiple tasks in parallel. Say you have three things on your plate: a bug fix in authentication, a new user profiles feature, and a database query refactor. You create three branches, open a Claude Code session on each one sequentially, and work through them. Because Claude Code reads the git tree fresh each time you switch branches, it picks up the correct state of the codebase for that branch without carrying over context or assumptions from the previous task so each branch gets a clean mental model. Where this gets really useful is at the end when you need to merge. You ask Claude Code to compare the trees across all three branches and flag any files that were touched by more than one. If two branches both modified src/db/connection.ts, it will show you exactly what each changed and suggest how to reconcile the differences before you run git merge. While building status.health, I would run this at the end of every work session, asking Claude Code to audit my branches against main and give me a merge order that minimized conflicts. It would suggest merging infrastructure config first since the other branches depended on it, then the API branch, then the frontend. That kind of sequencing advice from a tool that can read your project’s dependency graph through the tree is worth more than most people realize until they have spent an evening untangling a bad merge.
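The overlap check at the end of that workflow is easy to script yourself. Here is a sketch that builds a tiny demo repo with two branches touching the same file, then flags files changed by more than one branch relative to main; the branch and file names are invented, and git is assumed to be on your PATH:

```python
# Flag files touched by more than one branch before merging.
import subprocess, tempfile, os
from collections import defaultdict

def git(repo, *args):
    return subprocess.run(
        ["git", "-c", "user.name=demo", "-c", "user.email=demo@example.com", *args],
        cwd=repo, capture_output=True, text=True, check=True).stdout

def commit_file(repo, path, text, msg):
    with open(os.path.join(repo, path), "w") as f:
        f.write(text)
    git(repo, "add", path)
    git(repo, "commit", "-q", "-m", msg)

repo = tempfile.mkdtemp()
git(repo, "init", "-q", "-b", "main")  # -b needs git >= 2.28
commit_file(repo, "shared.txt", "v1\n", "base")

for branch, text in [("fix-auth", "auth fix\n"), ("refactor-db", "db refactor\n")]:
    git(repo, "checkout", "-q", "-b", branch, "main")
    commit_file(repo, "shared.txt", text, branch)

# Which files did each branch touch since its merge base with main?
touched = defaultdict(list)
for branch in ["fix-auth", "refactor-db"]:
    for path in git(repo, "diff", "--name-only", f"main...{branch}").splitlines():
        touched[path].append(branch)

conflicts = {p: b for p, b in touched.items() if len(b) > 1}
print(conflicts)
```

A script like this gives you the raw overlap list; the part Claude Code adds on top is judgment about merge order and how to reconcile the competing changes.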
For really heavy workflows, Claude Code supports orchestration patterns where you can chain multiple tasks together. You can ask it to scaffold a new module, write tests for that module, run the tests, and iterate on failures, all in a single session if your context window holds. The key limitation here is still context. Once a session gets long enough, Claude Code starts losing track of earlier instructions, which is why breaking complex work into discrete, well-scoped tasks produces better results than trying to do everything in one conversation. Claude can and will “compact” conversations mid-session instead of forcing a new one, and you can compact preemptively as well, but state is still lost either way, so separate sessions are often optimal unless most of the work is already done by the time compacting comes around.
Pairing Claude Code with VS Code is important for version control and visibility. You can use Claude Code for the heavy generation work, then switch to VS Code to review diffs, stage or revert changes, and commit with intention. I find this workflow essential because Claude Code does not always get things right on the first pass, and reviewing its output in a proper editor with syntax highlighting and git integration makes it much easier to catch errors, hallucinations, or misunderstandings before they hit a branch. Think of Claude Code as the drafter and VS Code as the editing room. Often, I use Copilot with the latest Claude model in VS Code when I need to make small or medium changes, since it achieves almost the same result as a Claude Code session. One caveat: avoid editing files in VS Code while Claude Code is mid-task, because concurrent edits create race conditions, unless you are isolating the work with git tree and branching workflows.
Compared to OpenAI’s Codex, Claude Code is stronger at maintaining context across large codebases and following nuanced instructions. Codex is faster for short, self-contained tasks and has better integration with GitHub through OpenAI’s partnerships. If you are doing quick prototyping or need something disposable, Codex is fine but if you are building something that needs to be maintained, Claude Code’s deeper context handling makes it worth the cost. Google’s Gemini Code Assist sits somewhere in between, strong on Google Cloud integrations but still catching up in terms of agentic autonomy.
Where Claude Code really shines is in building a product end-to-end. I have used it to go from a product requirements document (PRD) to a working MVP in a weekend. The general workflow I follow looks like this: write a PRD in plain English describing what the product does, who it is for, and what the core features are. Feed the PRD to Claude and ask it to generate a project structure. Iterate on the structure until it matches your mental model. Then go feature by feature, having Claude Code implement each one while you review and test. For website design, Claude Code can generate full frontend layouts using frameworks like Next.js or Astro, and you can deploy those directly to services like Vercel or GitHub Pages. Domain registration is still manual because DNS providers have not built agentic APIs yet, but everything from code to deployment can be orchestrated through a single tool. I used this exact workflow to build attest.ink, this blog, and the early versions of status.health. I still find working on each feature individually, or each paragraph one by one, more effective than having an agent do everything in one go, mostly because quality slips as the remaining context shrinks. You will get better results building piece by piece, especially as your project grows or as you onboard others.
prompt strategy and development
One of the most important skills you’ll need when working with AI tools is prompt construction. Specifically, understanding when to be detailed, when to be sparse, and how to structure your instructions so that a model produces useful output on the first or second pass instead of the fifth.
There are roughly three categories of prompts that I use depending on the task.
The first is a specification prompt. This is for when you are starting something new and the model has no prior context. Specification prompts need to be detailed. You should include the programming language, framework, file structure, naming conventions, and any constraints like “do not use any external dependencies” or “this needs to run in a serverless environment.” The more specific you are upfront, the less time you spend correcting output. A specification prompt for a new API endpoint might be three or four paragraphs long, and that is fine. The cost of being explicit for this prompt type is far lower than the cost of debugging implicit assumptions.
The second is an iteration prompt. This is for when a model has already generated something and you need it to change or improve. Iteration prompts should be short and surgical. Point to the exact file, function, and line if possible, and describe what is wrong, clarifying what a better result looks like. Try to avoid restating the entire specification. LLMs typically have the necessary context from project files or previous state for these prompt types. Restating everything wastes tokens and can actually confuse the model. A clear example of an iteration prompt is something like: “In auth.ts, the validateToken function is not checking for token expiration. Add an expiration check using the exp claim from the JWT payload.”
The third is a diagnostic prompt. This is for when something is broken and you do not know why. Diagnostic prompts benefit from including error messages, relevant code, and a description of what you expected to happen versus what did. The mistake most people make with diagnostic prompts is including too little context. If you just paste an error message and say “fix this,” the model will guess at the cause and often guess wrong. If you include the error, the function that produced it, and the input that triggered it, the model can reason about the problem with much higher accuracy.
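Because diagnostic prompts live or die on included context, I find it useful to assemble them mechanically so nothing gets dropped. Here is a sketch of that idea; the error text, function, and input below are invented for illustration:

```python
# Assemble a diagnostic prompt with all three pieces of context:
# the error, the code that produced it, and the triggering input.
def diagnostic_prompt(error: str, code: str, triggering_input: str, expected: str) -> str:
    return (
        "Something is broken and I need help diagnosing it.\n\n"
        f"Error message:\n{error}\n\n"
        f"Function that produced it:\n{code}\n\n"
        f"Input that triggered it:\n{triggering_input}\n\n"
        f"Expected behavior: {expected}\n"
    )

prompt = diagnostic_prompt(
    error="TypeError: 'NoneType' object is not subscriptable",
    code="def get_user(id):\n    return db.find(id)['name']",
    triggering_input="get_user(42)  # id not present in db",
    expected="Return None, or raise a clear error, when the user is missing.",
)
print(prompt)
```

The template is the point: if any of the four slots is empty, you know before sending that the model will have to guess.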
The meta-skill underneath all three of these categories is knowing what a model is good at versus what it is bad at. Models are excellent at boilerplate, at translating well-defined specifications into code, at refactoring, and at explaining existing code. Models are bad at architectural decisions, understanding business logic, maintaining consistency across very large codebases, and creativity. I wrote about this in my hello world post: AI tools are generally intelligent but not expert at anything. That remains true. Knowing where the boundary is saves you from the frustration of expecting expert output from a general tool. I typically avoid asking models to produce long-form writing, design digital assets, or handle creative ideation.
One more thing worth noting. The best prompts I have written are the ones where I have paused to think about what I am trying to achieve. Our instinct is to begin prompting immediately, to throw a problem at a model and see what comes back. Resist that instinct. Think about what you actually want. Write it down for yourself first, ideally on a whiteboard or paper. You’ll notice issues more often when writing prompts out by hand than when typing them. There is a reason software engineering interviews always had a whiteboarding component: something about using your hand and a marker, pen, or pencil highlights gaps in your approach and engenders clarity of thought. Remember, LLMs are a mirror. If your instructions are vague, the outputs will be too. Garbage in, garbage out is not a new concept, but it has never been more directly applicable than with AI tools.
open-source alternatives to claude code
The open-source ecosystem for AI-assisted development has improved dramatically since I last wrote about it. In my Lumo post, I mentioned that running a local LLM was still out of reach for most people. That is still mostly true, but the gap has narrowed.
For self-hosted model inference, the two tools worth knowing about are LM Studio and Ollama. LM Studio provides a desktop application with a clean interface for downloading, running, and chatting with open-weight models locally. It supports quantized models, which means you can run surprisingly capable models on consumer hardware with 16-32GB of RAM. Ollama takes a more developer-friendly approach, running as a local server you can interact with via API, which makes it easier to integrate into scripts and automated workflows. Both are free and both run entirely on your machine, meaning your prompts and code never leave your device. For anyone building in a privacy-sensitive domain, as I am with status.health, this matters.
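Because Ollama runs as a local server, scripting against it is a few lines of standard library code. The sketch below assumes Ollama’s default port and its /api/generate endpoint; the model name is just an example, and query_ollama is only defined, not called, since it needs a running server:

```python
# Sketch of talking to a local Ollama server (default port 11434).
import json
from urllib import request

def build_request(model: str, prompt: str) -> dict:
    # stream=False asks for one complete JSON response instead of chunks.
    return {"model": model, "prompt": prompt, "stream": False}

def query_ollama(model: str, prompt: str,
                 host: str = "http://localhost:11434") -> str:
    body = json.dumps(build_request(model, prompt)).encode()
    req = request.Request(f"{host}/api/generate", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

payload = build_request("llama3.1:8b", "Explain this repo layout: src/, tests/, docs/")
print(payload)

# Example usage, assuming `ollama serve` is running and the model is pulled:
# print(query_ollama("llama3.1:8b", "Summarize src/main.py in two sentences."))
```

Nothing in that round trip leaves your machine, which is the whole appeal for privacy-sensitive work.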
For agentic orchestration on top of these local models, tools like Open Code are emerging as open-source alternatives to Claude Code. Open Code supports multiple model backends including Ollama and LM Studio, and provides a terminal-based interface for agentic coding workflows like file editing, command execution, and multi-step task completion. It also integrates with VS Code. Open Code is not as polished as Claude Code or Copilot and model quality is noticeably lower for complex tasks, but for straightforward development work it is genuinely usable and entirely free with the right GPU.
The downside of self-hosted models is hardware. Running a 70B parameter model with acceptable speed requires a GPU with at least 48GB of VRAM, which means an NVIDIA A6000 or better, and those cards start at $4,000. You can run smaller quantized models on consumer GPUs like the RTX 4090 (24GB VRAM), but you sacrifice quality and context length. For most indie developers, the cost of hardware that can match Claude or GPT quality exceeds the cost of just paying a subscription or per token for several years. The math isn’t mathing, at least not for most people.
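To make that math concrete, here is a back-of-the-envelope break-even using the $4,000 card price above and a $200/month top-tier subscription; electricity, depreciation, and the multi-GPU setups frontier quality actually demands are all ignored, so the real break-even is longer:

```python
# Rough break-even: one $4,000 GPU vs a $200/month subscription.
gpu_cost = 4000          # entry price for a 48GB card like the A6000
subscription = 200       # top-tier monthly plan

months_to_break_even = gpu_cost / subscription
print(months_to_break_even)  # 20.0 months, before power and upgrades
```

And that is for a single card that still runs quantized 70B models, not frontier quality, which is why the subscription usually wins for indie builders.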
The best-performing open-source models right now depend on the task. For development workflows, DeepSeek-V3 is exceptionally strong at code generation and refactoring, outperforming gpt-oss on several benchmarks after supervised fine-tuning, as I referenced in my Protocol Agent post. The Qwen3 series from Alibaba has been surprisingly competitive, especially the 30B variant which showed a 73.3% improvement after fine-tuning in the Protocol Agent benchmark. For general-purpose tasks, gpt-oss-120b from OpenAI remains solid out of the box but does not improve as dramatically with fine-tuning as the alternatives. Llama from Meta continues to be the most versatile option for self-hosting because of its broad community support and extensive fine-tuned variants. For rankings, I recommend OpenRouter.
The future is open-source and open-weight models. I believe this strongly, and I wrote about it in my Lumo post and in my Protocol Agent notes. Prioritizing open-source alternatives to paid variations supports work like Protocol Agent and ERC-8004 and democratizes access for everyone. That said, the future will also be better hardware. Inference chips are getting cheaper and more efficient every quarter, and consumer-grade hardware capable of running frontier-quality models locally is probably two to three years away. Right now, for most builders, it makes more sense to pay for the frontier through Claude or ChatGPT and supplement with local models for privacy-sensitive or high-volume tasks where the cost savings justify the quality trade-off.
the case for using AI tools agnostically
Claude Code, Codex, and web UI/app interfaces for top paid AI models provide great usability but also limit model variety. As such, my preference is to use frontier models through agnostic service providers because they offer the ability to test different models in the same interface. DuckDuckGo’s DuckAI and GitHub Copilot are my current favorites but Poe, Perplexity, and a number of others are gaining traction. The primary advantage of this approach is that it makes models feel less like products and more like interchangeable workers. When one model is being stubborn, or when I am not sure whether the output is actually good, I retry the exact same prompt with a different model and compare results. Doing so is useful for quality control, but also improves prompting because it forces me to clarify which parts of my instructions are underspecified. If Claude and GPT both misunderstand me in the same way, the problem is almost always my prompt. If they diverge, I usually learn something about what each model is optimizing for, especially for specific tasks like image generation or long-form text.
Using multiple models is also a practical debugging technique and somewhat like peer review. Sometimes you are staring at a piece of generated code and it seems plausible but you cannot tell if it is correct without investing another hour in testing and reading docs. Running the same question through a second model often surfaces a different interpretation, a missing edge case, or a cleaner implementation. One model is not always right and just like different engineers collaborate to produce a better outcome, using different models produces better results. If you want to send a single prompt to multiple models at once, you can use something like MultiAI or build your own solution through OpenRouter, like I have.
Using multiple models also trains your intuition for their limitations, which matters because each one has weird strengths that are non-obvious from benchmarks or ranking boards. Claude is unusually good at producing SVGs and simple logos in a way that is consistent and reusable, and it has a knack for generating clean code layouts that make sense in a repo. At the same time, modifying and generating images in the broader sense is still work I hand off to ChatGPT because it is better at image-oriented iteration and tends to be more reliable when you want to transform something visual rather than emit a new asset from scratch. Once you start using tools agnostically, you tend to worry less about which single model is best, and choose models by workflow or task, like tools in a toolbox.
reducing the cost of AI tools and other strategies
Managing the financial overhead of AI tools is a requirement for any indie builder operating without a budget that can absorb random pricing rug pulls. The first cost hack is obvious but important. If you have a paid subscription with one of the frontier providers and you hit your cap for the day or month, you can often still use the same model through an agnostic provider without upgrading to a higher tier or switching to API keys that charge per token. I have used this on a number of occasions when I was at the end of a project or stuck on a specific feature and the last thing I wanted was to either stop for the day or pay for a higher tier plan.
The second strategy is to stop treating frontier models as the default for everything. For simpler tasks, I will often do the work myself or use a self-hosted model via Ollama or LM Studio. Since self-hosting costs nothing per token, it does not really matter if the output is imperfect: take the useful parts and feed them into a stronger model to improve quality, or lower your expectations to something like getting a feature from 10% to 40%. Local models are also great for brute-force iteration where you want to try five variations of an approach, see what shakes out, and only then spend tokens on the version that looks most promising.
A very underrated aspect of AI tool cost is the total amount of time it takes to achieve the result you are looking for. Often getting to 80% of what you want on the first try is easy but the remaining 20% is not. Sometimes you get to 20% of your goal but cannot improve beyond that no matter how many iteration prompts you try. In either situation it is usually a great time to pause, come up for air, and eat a snack. Your project will be waiting for you when you return. LLM outputs are usually a mirror of your own clarity as mentioned before, so if you are tired or annoyed, your prompts degrade, and the model reflects that back at you. Remember to take breaks, stand, and breathe! When you come back to your work, changing state can sometimes help, and so can changing models. I have found self-hosted models are the best in these situations because they give you a better read on the quality of your prompts since outputs are lower quality and less forgiving. If a local model cannot follow your instructions, it is a sign your prompt is sloppy. If it can, then the issue could be model bias, limitations, or oversimplification.
the value of toy applications
I learned early on that LLMs tend to produce outputs that look similar. If you ask Claude to create a logo, it is going to create an SVG more times than not. If you ask Claude to generate a boilerplate website, it will reach for the same UI patterns and familiar styling choices because it has learned which defaults tend to be accepted. This is not inherently bad, and it is part of why these tools feel productive. The problem is that sameness becomes friction when you want something specific, because you end up burning time fighting the model’s defaults instead of moving forward.
Toy applications are helpful in dealing with homogeneous or undesirable model outputs. Toy apps are small tools with a utility purpose that unblock or enhance specific outputs. When I noticed how often I ended up with SVG assets that eventually needed to be PNGs, I had Claude build a static converter to turn SVGs into PNGs or JPGs.
The same idea applies to data. Claude and ChatGPT both love to hardcode content directly into files because it streamlines outputs and consumes fewer tokens. I have found that using JSON objects for structured data streamlines future iterations, so I bake JSON objects into most prompts. I do this so often that I built a JSON reclassification tool that takes in JSON objects, lets the user reclassify content in bulk, and exports the updated JSON cleanly. The reclassifier exists purely because I got tired of editing hardcoded lists buried in project files.
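The core transform inside a tool like that is small. Here is a sketch of the reclassification idea, stripped of any UI; the records and category names are invented for illustration:

```python
# Bulk-update one field across a list of JSON records.
import json

def reclassify(records: list[dict], field: str, mapping: dict) -> list[dict]:
    # Records whose value is not in the mapping pass through unchanged.
    return [{**r, field: mapping.get(r.get(field), r.get(field))} for r in records]

posts = [
    {"title": "pricing rant", "category": "misc"},
    {"title": "claude code notes", "category": "ai"},
    {"title": "merge strategy", "category": "misc"},
]
updated = reclassify(posts, "category", {"misc": "essays"})
print(json.dumps(updated, indent=2))
```

Wrapping ten lines like this in a static page means you never again hand-edit a hardcoded list across a project.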
The most leveraged utility application I have built is a prompt improvement tool. It takes a prompt, rewrites it based on the task at hand, and returns a stronger version that is easier for models to follow. It works through a UI or as an API so it can be queried agentically. Using a consistently high-quality prompt reduces cost by improving what you get per token, cuts follow-up requests, and makes outputs slightly more deterministic. When you are using AI tools every day, small improvements in clarity compound, so a prompt tool that turns vague requests into sharper ones saves time, money, and the emotional tax of sitting there arguing with a model that is only confused and producing poor results because you were unclear.
Everyone can create utility apps, since an LLM can build them for you. The most effective approach is to create them as static web pages, because those are incredibly easy to host for free on GitHub Pages and provide instant results without the overhead of server management. For more complex cases where an API makes sense, such as the prompt improver I use to standardize instructions, I expose the functionality through a Vercel serverless function. This allows the tool to be queried programmatically by other agents or scripts, effectively creating a private API for my development needs. Anyone can do this to improve their personal workflows and the quality of their LLM outputs. By turning recurring friction into a collection of hosted micro-utilities, you can drastically reduce manual data transformation time. These small, targeted applications are my secret to maintaining speed as a solo builder and reducing per-token fees.
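As a sketch of the serverless shape, here is what a minimal prompt-improver endpoint could look like, assuming Vercel’s Python runtime convention of a handler class in a file like api/improve.py; the improve function is a toy stand-in for whatever rewriting logic you actually use:

```python
# Minimal Vercel-style Python serverless function wrapping a utility.
import json
from http.server import BaseHTTPRequestHandler

def improve(prompt: str) -> str:
    # Toy heuristic: force an explicit task/constraints structure.
    prompt = prompt.strip()
    if not prompt.lower().startswith("task:"):
        prompt = (f"Task: {prompt}\n"
                  "Constraints: name exact files and call out edge cases.")
    return prompt

class handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        result = improve(body.get("prompt", ""))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps({"prompt": result}).encode())

print(improve("add token expiration checks to auth.ts"))
```

Keeping the logic in a pure function like improve means the same code can back the web UI, the API, and local scripts without changes.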
what comes next
Nobody really knows what the future of software development looks like, other than that what it means to create “software” and to be a “developer” has permanently changed. Generally, these changes will be for the better, because anybody who wants to write software will be able to, but that is also exactly what makes them dangerous.
Engineers are taught the importance of not exposing API keys, testing regularly, and in some ways “red teaming” their own software early in their careers, usually as students. A “hacker” is just a category of software engineer who takes advantage of vulnerabilities exposed through code: forgotten API keys, unvalidated inputs, poor encryption. If anyone can create software but not everyone is trained on security best practices, are more people at risk of being hacked than ever before?
Will the best AI tools stay closed and costly, locked behind $200/month paywalls and enterprise contracts? Or will open-source alternatives catch up fast enough to keep the frontier accessible to non-developers and the “not-yet-funded” crowd I wrote about in my plea to Anthropic?
Will the next generation of builders need to learn programming at all, or will prompt engineering and system design replace syntax and data structures as the foundational skills? And if so, what does computer science education look like in five years or even one year?
I do not have answers to these questions. Nobody does. What I do know is that survival in the modern software development era looks like learning to work with AI tools regularly, iterating on existing workflows, adapting to new constraints, and constantly learning better strategies. The key differentiator between good and great also seems to be developing taste, improving judgment, and building a muscle for knowing what is worth creating. It is awesome that any idea is possible with a prompt, but it also means good ideas are what will be most valuable.
Writing code has changed fundamentally, and I think that is generally a good thing. I also think we should be careful, as developers and as a society, to modulate AI use and keep it serving us rather than the other way around.
share your thoughts
Have feedback on this post? I'd love to hear from you.
As always, 'twas nice to write for you, dear reader. Until next time.