Why 'review my code' gets you a list of things you don't care about
I've reviewed a lot of pull requests, and I've taught a fair number of engineers how to review them too. So let me start with the most common mistake people make when they ask an AI to do it: they paste a function and type 'review my code.' That's the whole prompt. What comes back? A wall of generic feedback. 'Consider adding comments.' 'You might want to use more descriptive variable names.' 'This function could be split into smaller functions.' All technically true, all useless, none of it telling you whether the thing has a bug that will page you at 2am. The model gave you a review because you asked for a review. It just had no idea which kind of review you wanted, so it gave you the safest, blandest one — the equivalent of a reviewer who skims the diff and writes 'LGTM, maybe add tests.' The fix is not a better model. The fix is telling the model what 'good' means for this specific piece of code. A code review prompt generator's job is to ask you the questions a senior reviewer asks themselves before they read a single line: what is this code supposed to do, what would 'broken' look like, and what do you not want me to waste your time on? Answer those three and the review changes character completely. I think this is the single biggest lever most developers are leaving unpulled — they treat the AI like a linter when they could treat it like a reviewer.
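To make those three questions concrete, here is a minimal Python sketch of how the answers might be folded into a prompt. The function name `build_review_prompt`, its parameters, and the sample answers are all hypothetical, invented for illustration; this is not a description of how any particular generator is implemented.

```python
from typing import List


def build_review_prompt(purpose: str, broken_looks_like: str,
                        ignore: List[str], code: str) -> str:
    """Fold the three reviewer questions into one prompt (hypothetical helper)."""
    ignore_line = ", ".join(ignore) if ignore else "nothing in particular"
    return "\n".join([
        "You are reviewing the code below.",
        f"What it is supposed to do: {purpose}",
        f"What 'broken' would look like: {broken_looks_like}",
        f"Do not spend time on: {ignore_line}",
        "",
        "Code:",
        code,
    ])


# Sample answers are made up for the example.
prompt = build_review_prompt(
    purpose="Parse a page number from a query string and return that page of results.",
    broken_looks_like="A negative or non-numeric page crashes the handler.",
    ignore=["comments", "variable naming"],
    code="def get_page(params): ...",
)
print(prompt)
```

Even in this toy form, the prompt now carries the context a bare 'review my code' leaves out.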
The 3-layer review framework I teach (and where it breaks down)
When I walk someone through reviewing code — human or AI-assisted — I give them three layers to read in order, because reading them out of order is how you spend an hour on naming while a SQL injection sails through. Layer 1 is correctness: does it do what it's supposed to do, and what happens on the inputs nobody tested? Off-by-one, null handling, the empty-array case, the 'what if this is called twice' case. Layer 2 is safety: can this hurt someone? Injection, auth bypass, a secret in a log line, an unbounded loop that a hostile input could trigger, a resource that never gets closed. Layer 3 is craft: readability, naming, structure, the stuff that matters for the next person but won't take down production tonight. The generator builds your prompt around these layers and tells the model to report them separately and in that priority order. Here's the honest caveat, though, and I tell every junior the same thing: this framework won't work if you can't describe what 'correct' means. If you don't actually know what the code is supposed to do — if you inherited it and you're hoping the AI will tell you — the framework can't save you. The model will happily review the craft of code whose correctness nobody can vouch for. A review prompt is a force multiplier on understanding you already have, not a substitute for understanding you don't.
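One way to picture the 'separately and in that priority order' instruction is the sketch below. The `LAYERS` list and `layered_instructions` function are hypothetical names, and the wording of each layer is only a paraphrase of the description above, so treat it as an illustration, not the generator's actual output.

```python
# Hypothetical layer definitions, paraphrased from the framework described above.
LAYERS = [
    ("Correctness", "off-by-one errors, null handling, empty inputs, being called twice"),
    ("Safety", "injection, auth bypass, secrets in logs, unbounded loops, unclosed resources"),
    ("Craft", "readability, naming, structure"),
]


def layered_instructions(layers=LAYERS) -> str:
    """Render the layers as numbered sections the model should report in order."""
    lines = ["Report findings in separate sections, in this priority order:"]
    for number, (name, checks) in enumerate(layers, start=1):
        lines.append(f"{number}. {name}: {checks}")
    lines.append("Finish each section before starting the next.")
    return "\n".join(lines)


print(layered_instructions())
```

Keeping the layers in an ordered list means the priority lives in one place: reorder the list and the prompt follows.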
The common mistake: asking for everything at once
Here's a callout I find myself repeating in code review training: do not ask the AI to 'find all the issues' in a 400-line file. It can't, and the attempt makes the output worse. When you ask one model, in a single prompt, to check correctness, security, performance, style, test coverage, and documentation across a big file, its attention gets spread thin. The model produces a little of each and goes deep on none. You get the illusion of a thorough review (six categories, looks comprehensive) while the actual race condition on line 230 goes unmentioned because the model was busy suggesting a docstring on line 12. The better pattern, and the one the generator nudges you toward, is to scope each review. One prompt that does nothing but hunt for security issues in the auth path. A separate prompt that does nothing but check the concurrency. You run two or three focused prompts instead of one bloated one. Is that more work? Slightly. But in my experience a focused prompt that says 'you are reviewing this ONLY for SQL injection and auth bypass; ignore everything else' reliably out-finds a kitchen-sink prompt, because you've told the model where to point its attention. Narrow the question and the answer gets sharper.
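A rough sketch of scoping in code: one narrow prompt per concern, each sent as its own review run. The `scoped_prompts` function and the example scopes are hypothetical, chosen to mirror the auth and concurrency cases above.

```python
from typing import Dict, List


def scoped_prompts(code: str, scopes: Dict[str, List[str]]) -> List[str]:
    """Build one narrow prompt per scope instead of one kitchen-sink prompt."""
    prompts = []
    for area, concerns in scopes.items():
        targets = " and ".join(concerns)
        prompts.append(
            f"You are reviewing the {area} in this code ONLY for {targets}; "
            f"ignore everything else.\n\n{code}"
        )
    return prompts


# Hypothetical scopes: each entry becomes its own review run.
scopes = {
    "auth path": ["SQL injection", "auth bypass"],
    "worker pool": ["race conditions", "deadlocks"],
}
for prompt in scoped_prompts("<paste code here>", scopes):
    print(prompt.splitlines()[0])
```

Two scopes produce two prompts, and neither one mentions the other's concerns.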
ChatGPT vs Claude vs Copilot for reviewing code
People ask me which model gives the best reviews, and the honest answer is that it depends on what you're reviewing — and a good prompt matters more than the choice. Claude tends to give the most thoughtful correctness reviews on larger files; it holds the whole function in its head and notices when something on line 10 contradicts an assumption on line 90. It will also tell you when it's unsure, which I value in a reviewer more than confidence. ChatGPT is quicker and more decisive, which is great for a fast security pass and occasionally annoying when it states a 'bug' that isn't one with total conviction. I've learned to add 'flag your confidence level for each issue' to prompts aimed at it. Copilot's review lives in your editor and pull request, which is the right place for line-level comments, but it sees less surrounding context than a chat model you've pasted the full file into. For a deep review I copy the code into a chat; for inline nits while I work, Copilot is closer at hand. The generator doesn't lock you to one — you copy the prompt anywhere. What it does is shape the prompt to ask for severity rankings and confidence flags, which is exactly the structure that keeps any of these three from drowning you in low-value comments.
Tell the model what to ignore — it's half the prompt
This is the part new reviewers never think to include, and it's quietly the most important: tell the model what NOT to comment on. If your repo runs Prettier and ESLint on commit, you do not need an AI relitigating tabs versus spaces or whether you used a semicolon. Those are solved problems handled by tools that don't hallucinate. Every comment the model spends on formatting is attention it didn't spend on the logic error you actually wanted caught. So the prompts the generator produces include an explicit ignore list: 'Formatting, naming conventions, and import order are handled by our linters — do not comment on them. Focus your review on correctness, security, and edge cases.' That one paragraph changes the signal-to-noise ratio dramatically. I started doing this after a frustrating week in March 2024 where every AI review I ran buried two real bugs under fifteen style suggestions I'd already configured my tooling to fix. Telling the model to skip what your tools already cover isn't laziness — it's pointing a finite amount of attention at the things only a reviewer can catch.
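If you want the ignore list to track your tooling instead of being retyped, something like the sketch below would do it. The `ignore_clause` function and the tool-to-topic mapping are assumptions made for the example; what Prettier or ESLint actually covers in your repo depends on your own configuration.

```python
from typing import Dict, List

FOCUS = "Focus your review on correctness, security, and edge cases."


def ignore_clause(tool_coverage: Dict[str, List[str]]) -> str:
    """Turn a map of tool -> topics it already handles into an ignore paragraph."""
    topics: List[str] = []
    for covered in tool_coverage.values():
        for topic in covered:
            if topic not in topics:  # keep first-seen order, drop duplicates
                topics.append(topic)
    if not topics:
        return FOCUS  # nothing is covered by tools, so there is nothing to ignore
    tools = " and ".join(tool_coverage)
    return f"{', '.join(topics)} are handled by {tools}; do not comment on them. {FOCUS}"


# Hypothetical mapping; adjust to match what your tools are configured to enforce.
coverage = {
    "Prettier": ["Formatting", "semicolons"],
    "ESLint": ["import order", "unused variables", "semicolons"],
}
print(ignore_clause(coverage))
```

The point of deriving the clause from a mapping is that the prompt stays honest: it only tells the model to skip what a tool really does check.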
Make the model show its reasoning, not just its verdict
A habit that changed how much I trust AI reviews: I make the model explain why something is a problem and how to reproduce it, not just assert that it is one. 'This could cause a memory leak' is a claim. 'This event listener is added on every render but never removed in a cleanup function, so each re-render adds another listener and they accumulate — you'd see memory climb if you re-render this component in a loop' is a reviewable claim. The second one I can verify in thirty seconds. The first one I have to investigate from scratch, which defeats the purpose. The generator's prompts ask for three things per issue: what the problem is, why it matters (with a concrete failure scenario), and a suggested fix. This does double duty. It makes the model's reasoning auditable, so you can catch it when it's wrong — and models are wrong about code more often than their tone suggests. And it teaches you something, because a good explanation of why a pattern is dangerous sticks better than a flag that just says 'dangerous.' A review you can't verify is just a vibe with line numbers.
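The three-part shape can be written down as a small data structure, which also gives you a cheap way to reject bare assertions. This is a sketch only: the `Finding` class and its field names are hypothetical, and the two sample findings reuse the memory-leak example above.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    problem: str           # what is wrong
    failure_scenario: str  # why it matters, as a concrete way to see it fail
    suggested_fix: str     # how to address it

    def is_verifiable(self) -> bool:
        """A finding counts as reviewable only if all three parts are filled in."""
        parts = (self.problem, self.failure_scenario, self.suggested_fix)
        return all(part.strip() for part in parts)


bare_claim = Finding("This could cause a memory leak", "", "")
reviewable = Finding(
    problem="Event listener is added on every render but never removed",
    failure_scenario="Re-render the component in a loop and watch memory climb",
    suggested_fix="Remove the listener in a cleanup function",
)
print(bare_claim.is_verifiable(), reviewable.is_verifiable())  # False True
```

A check like this only confirms the parts are present. Whether the failure scenario is actually true is still yours to verify.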
A small experiment: vague review prompt vs structured one
I ran this on a real file back in November 2024 because a teammate and I disagreed about whether the prompt structure actually mattered or whether I was just being fussy. We took one moderately gnarly file, an Express route handler of about 180 lines with auth, a database call, and some input parsing, and I planted three bugs in it: a missing await that created a race, an unvalidated query parameter that flowed into a database call, and an error path that returned a 200. The vague prompt was 'review this code for problems.' Across five runs it caught the missing await twice, never flagged the injection risk as a real issue (it mentioned 'consider validating input' generically once), and never noticed the wrong status code. It spent most of its words on naming and structure. The structured prompt, generated with the three-layer framework, an ignore list for style, and a request for severity and reproduction steps, caught all three bugs in four of five runs, ranked the injection risk as critical, and gave a one-line repro for the race. Same model, same file, same afternoon. The only difference was that the prompt told the model what kind of review this was and what mattered. That experiment ended the disagreement, and it's why I bother generating these prompts instead of typing 'review this' like I used to.
Worked example: messy PR comment → structured review prompt
Let me show the whole loop with something close to what I actually paste, because the gap between input and output is where the generator earns its place. The input a tired developer gives at the end of the day: 'can you look at this, opening a PR tomorrow, mostly worried about the database stuff.' Then a chunk of code. That's it. There's a signal buried in there ('worried about the database stuff'), but a vague prompt would treat it as noise and review everything equally. The code review prompt generator pulls that signal forward. In this case it built a prompt that opened with the context (pre-PR review, author's stated concern is the data layer), put the database interaction at the top of the priority list, applied the correctness-then-safety ordering to it specifically (N+1 queries, transaction boundaries, what happens if the connection drops mid-write), and explicitly deprioritized the parts the author wasn't asking about. It also added the line I now add to everything: 'rank each finding Critical / High / Medium and skip anything our linter handles.' The model came back with a tight list: one Critical (a write that wasn't wrapped in a transaction, so a partial failure would leave the row half-updated) and two Highs, and it stayed off the formatting entirely. The developer reads a short list of real issues instead of scrolling past forty mixed ones. That's the whole point: not that the AI is smarter, but that you asked it a sharper question, and a sharper question is something you can produce on demand instead of hoping you're in the mood to write one well.
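If the model returns its findings in a structured form, the 'rank and skip what the linter handles' instruction is easy to enforce on your side too. The sketch below assumes each finding arrives as a dict with `severity`, `topic`, and `summary` keys. That shape, the `triage` function, and the sample findings are all made up for illustration and are not the actual output from the example above.

```python
SEVERITY_ORDER = {"Critical": 0, "High": 1, "Medium": 2}


def triage(findings, linter_topics):
    """Drop findings the linter already covers, then sort most severe first."""
    kept = [f for f in findings if f["topic"] not in linter_topics]
    return sorted(kept, key=lambda f: SEVERITY_ORDER[f["severity"]])


# Made-up sample data in an assumed shape.
findings = [
    {"severity": "Medium", "topic": "formatting", "summary": "Inconsistent indentation"},
    {"severity": "High", "topic": "queries", "summary": "Query runs once per row in a loop"},
    {"severity": "Critical", "topic": "transactions", "summary": "Write is not wrapped in a transaction"},
]

for finding in triage(findings, linter_topics={"formatting"}):
    print(f'{finding["severity"]}: {finding["summary"]}')
```

Python's `sorted` is stable, so findings with the same severity keep the order the model reported them in.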