How I Built the Prompt Grader
The honest build journal. The shape I chose, the things I rejected, the system prompt, the wrong turns, what running the grader against the site's own 390-prompt library found, and what it all cost.
The simplest description of this project is one sentence. There is now a tool on this site where you paste a prompt and get back a sharp critique plus a stronger version. The interesting story is in what shape it took, what didn't work first time, and what happened when I pointed the grader at the 390 prompts already living on this site and let it tell me how good they were.
What I wanted
Two readers in mind. The everyday reader who uses AI casually and isn't always sure why their prompts sometimes work brilliantly and sometimes feel flat. And the slightly more invested reader who wants to understand the discipline of prompt-writing as a skill, the same way you might want to understand how to write a good email or a good brief.
So the public tool needs to do four things in a single response. Tell the user honestly whether the prompt is going to work. Quote the actual weak words back to them so they can see where the problem is. Provide a stronger version they can copy and paste straight into ChatGPT, Claude, or Gemini. And explain the principle behind each change so the user learns the move, not just takes the rewrite.
There is also a recursive twist. Once the grader exists, the second job is to point it at the 390 prompts already in this site's Prompt Library and use it to make those prompts sharper. The library is the right target because every entry there is a prompt readers actually paste into ChatGPT, Claude, or Gemini, and a sharper library translates directly into better AI output for them.
The shape I chose, and the things I rejected
A handful of design calls were locked before I wrote a line of code, with alternatives rejected for specific reasons.
Both a live "Grade it now" button AND a copy-paste prompt. The live button is the magic for a first-time visitor, because the answer arrives without leaving the page. The copy-paste alternative is for power users and for anyone who's already burnt through the daily quota; it produces a markdown prompt the user takes to whichever AI they normally use. The cost difference between live-only and both is small (a button and a generator function); the user-experience difference is meaningful.
Sonnet 4.6 over Haiku 4.5 for the live model. Critique quality matters here. Haiku is fast and cheap, and I use it for the chat widget and Persona Explorer where the answer is conversational. For prompt critique the model needs to spot subtleties: a contradiction between two instructions, a phrase that says nothing, a missing piece of context that would make the answer useful. Sonnet is sharper. The cost is roughly $0.005 per grade; with a daily limit of ten per IP, the monthly bill stays small.
A single API call, not two. I considered a "first critique, then rewrite" two-pass design where the rewrite is informed by the critique. It would have been marginally smarter and would have doubled both the cost and the latency. The system prompt does both jobs in one structured response, which is the correct trade.
Lambda, not Edge Function. Netlify's edge proxy times out non-streaming responses at 30 seconds, and Netlify's stream wrapper from `@netlify/functions` caps at 10 seconds. For prompt grading at max_tokens: 2500, Sonnet finishes comfortably inside 30 seconds, so a non-streaming Lambda is the correct shape. Edge Functions (Deno-based, fifty-second budget) would have been overkill for the duration we need.
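For the technically curious, the whole live path is one Messages API call. A minimal sketch of its shape, written in Python for readability (the real function is JavaScript; the model ID and the system-prompt stub are placeholder assumptions, not the site's verbatim values):

```python
import anthropic

# Stand-in only; the real text is the constant in netlify/functions/grade-prompt.js.
GRADER_SYSTEM_PROMPT = "..."

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def grade(user_prompt: str) -> str:
    """One call does both jobs: critique and rewrite in a single structured response."""
    response = client.messages.create(
        model="claude-sonnet-4-6",   # assumption: the API ID for Sonnet 4.6
        max_tokens=2500,             # the token budget quoted above
        system=GRADER_SYSTEM_PROMPT,
        messages=[{"role": "user", "content": user_prompt}],
    )
    return response.content[0].text
```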
The system prompt that drives the grader
The system prompt is the text the function sends to Sonnet on every grade, and the transparency layer is the point. The full version, with all examples and constraints, lives as a constant in netlify/functions/grade-prompt.js, and the live tool's "How this works" panel renders it verbatim. Anyone can lift it from either place and paste it into ChatGPT or Claude to run the grader without going through this site at all.
The wrong turns
This section was written live as the build progressed. Memory of wrong turns fades fast, so the file was open in a markdown editor the whole time, growing as new things broke. Five entries now.
1. The "what were you trying to do?" field was nonsensical 2026-05-09
The first version of the page had two textareas: one for "Your prompt" and a smaller one for "What were you trying to get the AI to do? — context helps the grader". The thinking was that the AI could use the context to judge whether the prompt was fit for purpose.
It surfaced as a bug first. I loaded an example prompt about housing policy, then deleted that prompt and typed my own about something completely different. The context box still held the housing-policy text from the example click, so every grade looped back to housing.
The deeper problem was that the second box should never have existed. A good prompt is self-contained. If the goal is in the prompt, the second box is redundant. If the goal isn't in the prompt, the second box lets the user smuggle in clarity that won't actually be there when they paste the prompt into ChatGPT or Claude. Either way the grader should grade what's in the prompt, not what the user wished they'd written.
Removing the field also sharpened what the grader does with vague prompts. With a context box the grader had a tendency to use that context as the answer. Without one, the grader has to do the harder, more useful job: name the vagueness, point at the missing pieces, and write a rewrite that includes bracketed placeholders the user fills in. That's the lesson the user actually wants.
2. Personas were the wrong recursive target; the library was the right one (2026-05-09)
The original brief had me pointing the grader at the 105 personas in the Persona Explorer. The recursion was clean and the headline ("AI tool grades the AI prompts driving other AI tools") was satisfying. But once the public grader was working and we sat with the question "what would actually make this site better for readers?", the answer was different.
Personas are deliberate characters; running a critic over them produces "fixes" that flatten what makes them interesting. The Prompt Library, by contrast, is 390 prompts that readers paste straight into ChatGPT or Gemini. Sharpening those is direct user benefit, not a meta-flex about recursion. Library prompts are tools; personas are characters. The grader belongs on the tools.
3. Tool versions matter; --no can fail moderation; multi-tool prompts need dual output (2026-05-09)
The library has 390 prompts, of which 133 target image, video, music, or design tools whose prompt syntax is distinctive. The recursive grader couldn't be naive about these without producing nicely-worded gibberish from the tool's point of view.
Three lessons came out in close succession.
The first: tool versions move fast. A research pass on Midjourney came back with mostly V7 information; the actual current version is V8.1, released 30 April 2026, with different stylize ranges and HD images native via --hd. A grader trained on V6 conventions would have produced subtly outdated rewrites. The fix is a _tool-guides/ directory holding short, dated reference notes that the grader pulls in when grading any prompt whose tools.recommended array includes a tool with a guide here. When a new version drops, one file updates and every future grade picks up the change. A scheduled scripts/check-tool-guides.py task flags any guide whose "Last verified:" date is older than 30 days, so guides can't quietly rot.
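A sketch of that lookup, assuming one markdown guide per tool and guessing at the file naming (the tools.recommended array is from the library's data; the rest approximates what the real pipeline does):

```python
from pathlib import Path

GUIDE_DIR = Path("_tool-guides")

def guides_for(entry: dict) -> str:
    """Collect dated reference notes for every recommended tool that has a
    guide on disk, ready for injection into the grading call's context."""
    notes = []
    for tool in entry.get("tools", {}).get("recommended", []):
        guide = GUIDE_DIR / f"{tool.lower().replace(' ', '-')}.md"  # assumed naming
        if guide.exists():
            notes.append(guide.read_text())
    return "\n\n".join(notes)
```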
The second: --no, Midjourney's negative-prompt flag, can trip the moderation system in surprising ways. Midjourney parses each word in --no independently, so --no text, watermark, logo is read as no text plus no watermark plus no logo, and any single word that triggers a content filter rejects the whole prompt. The pattern is now replaced with plain-prose negatives in the descriptive part of the prompt.
The third: the library categories targeting Midjourney also target ChatGPT and Gemini, which take prompts in natural language. A single rewrite couldn't serve both. The solution borrowed an existing pattern from the site's Suno music prompts: a mode toggle. The Library page now shows "Output for: ChatGPT or Gemini / Midjourney" buttons for any entry with a promptTextMidjourney field; the copy box switches between two genuinely different prompts depending on the user's choice.
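In data terms the toggle is small. A sketch of the selection logic, assuming the general-LLM field is called promptText (only promptTextMidjourney is confirmed above):

```python
def prompt_for_mode(entry: dict, mode: str) -> str:
    """Return the copy-box text for the chosen output mode. Entries without
    a promptTextMidjourney field never show the toggle at all."""
    if mode == "midjourney" and "promptTextMidjourney" in entry:
        return entry["promptTextMidjourney"]
    return entry["promptText"]  # assumption: the name of the general-LLM field
```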
4. The workflow buried the diagnostic in a file nobody told the user to open (2026-05-09)
While running the bulk pass across 390 library prompts, the workflow looked sensible on paper. The grader produced a four-section critique per prompt and wrote it to a markdown review file. The user marked Decision lines [x] accept, ran the apply script, and the rewrites landed in prompts-data.json. Twenty entries reviewed in a sitting; thirteen categories closed in a couple of days. The instructions in the chat said something like "mark each entry as accept or reject, then run apply".
The instructions did not say "before you mark anything, open the file and read the verdicts because some of them flag real problems". They did not summarise any flagged issues in the chat at all. The agent (me) had access to every verdict, ran a script that wrote them all to disk, and then handed the user a one-line sed command that ticked every box without prompting them to read what was being accepted.
The cost surfaced when a tester pasted a generated Midjourney prompt and got a parser error from Midjourney. Looking at that specific entry's verdict afterwards, the grader had written: "One fixable issue exists in the Midjourney fence: the aspect ratio placeholder includes parenthetical labels inside the parameter values, which will cause Midjourney to parse the label text as prompt tokens and may produce errors or unexpected output."
The grader had diagnosed the bug in the verdict text. The user had no reason to open the file because the workflow hadn't told them to. The bug went live.
This is the more honest framing. The cliché is "the grader told us and we didn't read". The truer version is that the workflow's design didn't push the diagnostic forward. When an automation produces a diagnostic, the design has to surface it actively in the channel the user actually reads, not park it in a file the user has no reason to open. A summary in chat that said "of the twenty graded prompts, three of them have verdicts flagging tool-syntax concerns; here are the snippets, please read these specifically before accepting" would have made the bug visible the moment it was written.
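That summary step is only a few lines of code. A sketch of the kind of pass that would have surfaced the bug, with the flag phrases as illustrative guesses rather than the site's actual list:

```python
import re

# Phrases that suggest a verdict is flagging a tool-syntax concern.
FLAG_PATTERNS = [r"pars(e|ing)", r"may produce errors", r"unexpected output", r"reject"]

def summarise_flags(verdicts: dict[str, str]) -> str:
    """Build the chat summary the workflow was missing: flagged verdict
    snippets, surfaced in the channel the user actually reads."""
    flagged = {
        entry_id: text
        for entry_id, text in verdicts.items()
        if any(re.search(p, text, re.IGNORECASE) for p in FLAG_PATTERNS)
    }
    lines = [f"{len(flagged)} of {len(verdicts)} graded prompts have flagged verdicts:"]
    lines += [f"  {entry_id}: {text[:160]}..." for entry_id, text in flagged.items()]
    return "\n".join(lines)
```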
Four changes followed. The Midjourney tool guide was tightened to forbid the inline-annotation pattern at source, with worked examples, so the grader stops generating the pattern at all. The output structure was changed to two separate fenced code blocks per multi-tool prompt. The apply script grew a Midjourney syntax linter that scans the rewrite for known parser-breaking patterns and refuses to write a Midjourney rewrite that fails the lint, with a clear error in the apply log. And the workflow itself grew a verdict-summary step: any future bulk pass that this site runs will surface flagged issues in the chat or the email summary, not bury them in the file. A user shouldn't need to know to open a file to see that something is wrong with the work that was just done on their behalf.
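The linter itself is small. A sketch covering the two parser-breaking patterns named in this post (the real check in scripts/apply-grader-suggestions.py may cover more):

```python
import re

# Known parser-breaking Midjourney patterns, drawn from the wrong turns above.
MIDJOURNEY_LINTS = [
    # Annotation inside a parameter value, e.g. "--ar 4:3 (landscape)":
    # Midjourney reads the label as prompt tokens.
    (re.compile(r"--\w+\s+\S+\s*\("), "parenthetical label inside a parameter value"),
    # Comma-separated --no list, e.g. "--no text, watermark, logo": each word
    # is moderated independently, so one bad word rejects the whole prompt.
    (re.compile(r"--no\s+\w+\s*,"), "comma-separated --no list"),
]

def lint_midjourney(rewrite: str) -> list[str]:
    """Return the reasons this rewrite must not be written to prompts-data.json."""
    return [reason for pattern, reason in MIDJOURNEY_LINTS if pattern.search(rewrite)]
```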
5. Duplicate placeholder fields and token-mismatched substitutions (2026-05-10)
After the dual-output toggle landed, a tester opened a multi-tool image prompt and saw two SUBJECT fields, two SETTING fields, two LIGHT fields, side by side. Worse, filling some of them changed nothing: the placeholder text still appeared, raw, in the assembled prompt.
Two bugs compounding. First, the grader was emitting one placeholder list per fence (chat fence and Midjourney fence), and the apply script was concatenating them, producing duplicates. Second, the grader was using subtly different bracket text in the two fences for the same conceptual placeholder: chat said [LIGHT — e.g., "golden hour"] while Midjourney said [LIGHTING CONDITIONS — e.g., "golden hour"]. The metadata listed both, the user's value substituted into one, and the other appeared as raw bracket text in the output.
The fix has three layers. The tool guide now states explicitly that both fences must use identical bracketed token strings, so the user fills each placeholder once and the value substitutes into both versions. The apply script gained a metadata-rebuild step that always extracts placeholders directly from the rewrite text, using the grader's own metadata only as a hint for labels and examples; this guarantees the placeholders metadata exactly matches the tokens in the rewrites, regardless of any drift in the grader's output. And a one-off cleanup script reconciled all 140 entries that had drifted from earlier runs.
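The rebuild step is essentially one regex plus the identical-token rule. A sketch, assuming placeholders always take the bracketed label-plus-example form shown throughout this post:

```python
import re

# Bracketed tokens like [SUBJECT — e.g., 'red fox']. Group 1 is the label,
# group 2 the optional example; the full token string is the substitution key.
TOKEN = re.compile(r"\[([^\]—]+?)(?:\s*—\s*e\.g\.,\s*([^\]]+))?\]")

def rebuild_placeholders(chat_rewrite: str, midjourney_rewrite: str = "") -> list[dict]:
    """Extract placeholder metadata directly from the rewrite text, so the
    metadata can never drift from the tokens actually present."""
    if midjourney_rewrite:
        chat_tokens = {m.group(0) for m in TOKEN.finditer(chat_rewrite)}
        mj_tokens = {m.group(0) for m in TOKEN.finditer(midjourney_rewrite)}
        if chat_tokens != mj_tokens:  # the identical-token rule from the tool guide
            raise ValueError(f"fence token mismatch: {chat_tokens ^ mj_tokens}")
    placeholders: dict[str, dict] = {}
    for text in (chat_rewrite, midjourney_rewrite):
        for match in TOKEN.finditer(text):
            placeholders.setdefault(match.group(0), {
                "token": match.group(0),
                "label": match.group(1).strip(),
                "example": (match.group(2) or "").strip("'\" "),
            })
    return list(placeholders.values())
```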
The test on my own prompts
The recursive-use story is the headline of this build. The grader was pointed at the site's own 390 library prompts, in stages: thirteen general-LLM categories first (the prompts that work in ChatGPT, Claude, or Gemini), then the eighteen tool-specific categories (Midjourney, Suno, Veo, and the rest). Three categories of finding stand out.
Cases where the grader caught a real structural bug
Email Drafter — From Rough Notes (W-01)
The original prompt had placeholders for recipient name and tone, then the instruction "Before writing, tell me: who is the recipient? What tone do you want?". The grader spotted the contradiction immediately: the prompt asks for inputs the user has already specified, so the AI sits in a stalled "tell me first" loop instead of just writing the email. The net effect was to make a perfectly good prompt waste a turn before getting started.
The rewrite stripped the redundant question loop, defined the three tone options concretely (instead of leaving "professional" and "casual" undefined), and tightened the length cap. Real improvement.
Formal Letter Writer (W-03)
The original asked for "a response within a reasonable timeframe", a vague phrase in an otherwise commendably specific prompt. The grader pointed out that "reasonable" varies between 7, 14, and 28 days depending on context, and that the AI was being asked to guess on the user's behalf. Rewrite added an explicit deadline placeholder. It also separated recipient name and title from the organisation field (so letters could open "Dear Mr Patel" instead of "Dear Sir/Madam"), and added a structured guide for the key-facts field.
Cases where the grader honestly said the prompt was already strong
Email Editor — Fix Tone and Clarity (W-02)
The grader's verdict opened: "This is a genuinely strong prompt: it is specific, well-structured, and gives the AI clear constraints on scope, format, and behaviour. Minor refinements are possible but nothing here is broken." That was the calibration we wanted. The system prompt is explicit that the grader should not invent weaknesses to fill bullets when the original is already strong; this verdict proved the calibration worked. The eventual rewrite added a small context block for purpose / audience / intended tone (a real improvement, but framed as a sharpening, not a rescue).
Cases where the grader needed tool-aware help
Wildlife in the Garden (IMG-P-03)
The tool-aware version of this prompt has two distinct outputs. For ChatGPT or Gemini: "Please generate a photograph of [SUBJECT — e.g., 'red fox'] in [SETTING — e.g., 'a British garden at dusk']. Style it as a 400mm telephoto wildlife photograph: shallow depth of field, soft natural bokeh, lock sharp focus on [FOCUS POINT — e.g., 'fur texture on face and whiskers']."
For Midjourney, the same placeholders inside a different prose voice and ending in proper flag syntax: "[SUBJECT — e.g., 'red fox'] in [SETTING — e.g., 'a British garden at dusk'], 400mm telephoto wildlife photograph, f/4, shallow depth of field, soft natural bokeh, sharp focus on [FOCUS POINT — e.g., 'fur texture on face and whiskers'], [LIGHT — e.g., 'golden hour'], --ar 3:2 --v 8.1 --style raw --s 200"
Different prompts, identical placeholders, one form. The user picks Midjourney or ChatGPT/Gemini on the page, the matching prompt is what gets copied. Midjourney users get parameter flags; ChatGPT users get conversational instruction. Neither is a compromise.
The effect nobody designed for: the rewrite is also a teacher
This was discovered by accident while testing. A user with no electronics background pasted into the grader: "How does a keyboard work?" Five words. A vague prompt by any normal definition.
The grader's critique was the expected shape: no audience signal, no scope signal, no format signal, no length constraint. Routine. The interesting thing was what came out the other end as the stronger version:
Explain how a computer keyboard works, aimed at someone with no technical background. Cover these specific areas, in this order:
1. What happens physically when a key is pressed (focus on membrane or mechanical switches, whichever is more common in everyday keyboards).
2. How the keyboard detects which key was pressed (matrix scanning).
3. How that information is sent to the computer (USB HID protocol, briefly).
Format: three short sections with a heading for each. Each section should be 3 to 5 sentences. Avoid jargon; if a technical term is necessary, define it in plain language immediately after using it.
Read that for a second. The user could not have written this stronger version. They did not know that "membrane vs mechanical switches" was the right way to slice the physical question. They did not know that "matrix scanning" was the term for the detection layer, or that "USB HID protocol" was the right level of detail for the computer-side handoff. The grader supplied the domain map that the user did not have.
This is the part the original brief did not anticipate. We designed the grader to be a critic of prompt structure: spot vague phrases, name missing constraints, write a rewrite that adds what was missing. But to write a rewrite for "How does a keyboard work?", the model has to think about the SUBJECT before it thinks about the PROMPT. It has to ask "what would a good answer to this question even look like?" and then build a prompt that requests that shape of answer.
So the grader is doing two jobs at once. It is polishing your wording. And it is drawing the map of the question you didn't know how to draw yourself. The first job is the one we built; the second is a free side effect of using a model that knows things, not just a model that critiques structure.
A good rewrite doesn't just fix your prompt. It tells you what you didn't know to want.
The proof was in pasting the rewrite straight into ChatGPT and seeing what came back. Three sections, headings, 3-5 sentences each, jargon defined in plain language. Exactly the structure the rewrite asked for, applied honestly to a question the user couldn't have framed in this much detail themselves:
What ChatGPT returned, with the rewritten prompt verbatim
1. What happens physically when a key is pressed
In many everyday keyboards, pressing a key pushes down on a soft rubber dome or membrane layer underneath the keycap. That pressure brings two thin electrical contact points together, a bit like closing a tiny switch. When you let go, the rubber dome springs back and separates the contacts again. Mechanical keyboards work in a similar basic way, but each key has its own individual spring-loaded switch.
2. How the keyboard detects which key was pressed
The keyboard does not usually have a separate wire for every key. Instead, the keys are arranged in a grid called a matrix, which just means rows and columns of electrical lines. The keyboard's small built-in controller rapidly checks these rows and columns to see where a connection has been made. If row 2 and column 5 connect, for example, the controller knows which key sits at that position.
3. How that information is sent to the computer
Once the keyboard knows which key was pressed, it sends a small message to the computer through USB. Most keyboards use USB HID, which means "Human Interface Device", a standard way for things like keyboards, mice, and game controllers to talk to computers. The message does not usually say "the letter A"; it sends a key code that means "this physical key was pressed." The computer then decides what character to show, depending on your keyboard layout.
Notice what just happened. The user wrote five words. They got back a clear, three-section, jargon-free explanation of switch actuation, matrix scanning, and the USB HID protocol — concepts they did not know to ask for. The grader did not write the answer. ChatGPT did. But the grader supplied the structural shape that made the answer this clean, by asking ChatGPT for exactly the right kind of slicing and exactly the right depth of vocabulary. Without the grader's rewrite, the same five-word prompt to ChatGPT would have produced a different (almost certainly worse) answer.
The implication is genuinely useful. Even if you never run the rewritten prompt, even if you copy nothing, reading the rewrite teaches you the structural shape of a good answer to the question you asked. For non-experts approaching a topic, the grader functions as a quiet thinking partner. You arrive with "how does X work?" and you leave with "ah, X has a physical layer, a detection layer, and a transport layer; that's how the field is sliced". And if you do run the rewrite, you get the answer too.
The conversation does not have to stop there either. Now that you have the domain map, the next round of questions can be sharper, and you can ask them directly in the same chat where you ran the rewrite. "Tell me more about matrix scanning, with a small example of how the controller decides between two keys pressed at the same time." Or "What's the practical difference between membrane and mechanical switches in terms of typing feel and longevity?" Or "Why does the keyboard send a key code rather than the letter, and where does the layout actually live?" Each of those follow-ups would have been impossible from your original five-word starting point because you didn't have the vocabulary. The first round gave you the vocabulary; the next rounds let you use it. The lesson for any non-developer using the grader: don't treat the rewrite as the end of the conversation. Treat it as the beginning of a better one.
This was always going to be true of any prompt grader running on a capable foundation model. It was not built into the design. We did not write a system-prompt instruction that said "also surface the domain shape of the question". The model brought that to the work because that is what models do when they are asked to write better prompts: they think about the answer first, and the prompt second. The lesson is to value this side effect as much as the headline feature when telling a non-developer reader why they would use a tool like this.
How the grader stays current
The wrong turns above clustered around one root cause: AI tools change syntax every few months, and a grader calibrated against last year's conventions produces subtly outdated rewrites that no one notices until the live tool errors out. Midjourney went V6 → V7 → V8.1 in less than a year. Suno went v4.5 → v5 → v5.5 in the same window. The grader can't be a one-off build; it has to assume drift and have a system that catches it.
The defence sits in three layers, each a notch more proactive than the last.
Layer one: tool guides with a dated header. Each tool the grader knows about has a short reference note in _tool-guides/. The first two header lines name the model version the guide is calibrated against and the date of the last verification. Updating one file updates every future grade automatically. There are guides for Midjourney V8.1, Suno v5.5, and Veo 3.1 to start, and the structure is open for the next tool that lands.
Layer two: a freshness-check script. A small script, scripts/check-tool-guides.py, walks the guide directory and flags any guide whose Last verified: date is older than a threshold (default 30 days). It exits non-zero when anything is stale, so it can be wired into a CI pipeline if you want hard failure. Run it on demand with python3 scripts/check-tool-guides.py.
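A sketch of the whole script, assuming the header line is literally "Last verified: YYYY-MM-DD" (the real version may parse more leniently):

```python
#!/usr/bin/env python3
"""Flag tool guides whose 'Last verified:' date is older than a threshold."""
import re
import sys
from datetime import date, timedelta
from pathlib import Path

THRESHOLD_DAYS = 30
VERIFIED = re.compile(r"Last verified:\s*(\d{4})-(\d{2})-(\d{2})")

def main() -> int:
    stale = []
    for guide in sorted(Path("_tool-guides").glob("*.md")):
        match = VERIFIED.search(guide.read_text())
        if not match:
            stale.append((guide.name, "no 'Last verified:' header"))
        elif date.today() - date(*map(int, match.groups())) > timedelta(days=THRESHOLD_DAYS):
            stale.append((guide.name, f"last verified {'-'.join(match.groups())}"))
    for name, reason in stale:
        print(f"STALE: {name} ({reason})")
    return 1 if stale else 0  # non-zero exit so CI can fail hard on staleness

if __name__ == "__main__":
    sys.exit(main())
```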
Layer three: a scheduled refresh task. Every fortnight, a Cowork scheduled task fires, walks every guide, and for each one that's older than 14 days it does the actual research: web-searches the tool's current state, has Sonnet read the guide content alongside the search results, and reports one of three verdicts. "Still current" auto-bumps the verified date in place. "Minor update" generates a markdown diff for the human reviewer, surfaced in a report at _review/tool-guide-refresh-{date}.md AND summarised in chat or email when the task completes. "Major update" halts and flags the guide for full rewrite.
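The decision logic of that task, sketched with hypothetical names (the real task is a natural-language prompt that Cowork interprets, in reference/tool-guide-refresh-task-prompt.md, not a script):

```python
import re
from datetime import date
from pathlib import Path

def apply_refresh_verdict(guide: Path, verdict: str, diff_md: str, summary: list[str]) -> None:
    """Route one guide's research verdict: bump the date, park a diff, or flag."""
    if verdict == "still current":
        text = re.sub(r"Last verified:\s*\d{4}-\d{2}-\d{2}",
                      f"Last verified: {date.today():%Y-%m-%d}", guide.read_text())
        guide.write_text(text)  # auto-bump the verified date in place
        summary.append(f"{guide.name}: still current, date bumped")
    elif verdict == "minor update":
        report = Path(f"_review/tool-guide-refresh-{date.today():%Y-%m-%d}.md")
        report.parent.mkdir(exist_ok=True)
        with report.open("a") as f:  # park the diff for human review...
            f.write(f"## {guide.name}\n\n{diff_md}\n\n")
        summary.append(f"{guide.name}: minor change, diff written to {report}")
    else:  # "major update"
        summary.append(f"{guide.name}: NEEDS ATTENTION, halted for full rewrite")
    # ...and the summary list becomes the chat or email body, so the diagnostic
    # is pushed to the human rather than left on disk.
```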
The point of layer three is that the diagnostic doesn't wait for someone to think to check. The cost is ~$7 a year in API spend and the benefit is that the grader can't quietly rot.
The third layer is the direct lesson learnt from wrong turn #4 above. It is not enough to have a script that produces a report. The report has to be surfaced to the human in the channel they actually read, not buried in a file they have no reason to open. The scheduled task ends with a chat or email summary that says "guide X is still current, guide Y had this minor change applied, guide Z needs your attention", so the active diagnostic is in front of you, not waiting for you to go looking.
What it cost
Honestly modest. Sonnet 4.6 grading at the parameters I use is roughly $0.005 AUD per call. The full library pass plus all the iteration runs for the wrong-turn fixes came to:
| Item | Calls | Cost (AUD) |
|---|---|---|
| Initial test prompts during build | ~40 | $0.20 |
| Library pass: 13 general-LLM categories (one round) | 257 | $1.30 |
| Library pass: 18 tool-specific categories (one round) | 133 | $0.65 |
| Re-grades after tool-guide / dual-output fixes | ~280 | $1.40 |
| Tool-guide refresh task (per fortnight, ongoing) | ~13 web-searches + Sonnet review | $0.30 / fortnight ($7 / year) |
| Total to ship v1 | ~710 | $3.55 |
The dominant cost in this kind of project is your time, not the model's. The thirty hours of iteration dwarf the four dollars to ship and the seven dollars a year to keep current. Treating the model bill as the headline number obscures where the work actually is.
What you'd change for your own version
If you wanted to clone this for a different site, here is what I'd suggest, in priority order.
Spend half a day on the system prompt. The four-section structure (verdict, weaknesses, rewrite, what changed) is the most important design decision in the whole project. A generic "improve this prompt" system prompt produces generic critique. Anchoring the model on quoting actual words from the original, naming the principle behind each change, and being honest when the prompt is already strong is what makes the output sharp instead of bureaucratic.
Build the linter before you need it. The Midjourney --ar 4:3 (landscape) bug only surfaced after I'd applied 60-odd rewrites with the broken pattern. A trivial regex check at the apply step would have caught it on the first one. The same applies to Suno (Style block over 200 chars), Veo (more than two camera moves in one prompt), and any other tool with rigid syntax. Each lint pattern is a guardrail you don't have to remember.
Use the grader's critique as quality control, not just the rewrite. The most expensive mistake I made was running auto-accept across whole categories without reading the verdicts. The grader was actually flagging real issues that I should have caught and addressed before applying. An AI tool that grades AI output gives you two pieces of information: the rewrite and the critique. If you only read the rewrite, you are using half the tool.
Tool guides need verification dates and a stale-check. Tool versions change every few months. Midjourney V7 → V8 → V8.1 happened across 2025-2026. Suno v4.5 → v5 → v5.5 happened in the same window. A guide written against V6 conventions produces subtly outdated rewrites. The simplest defence is a one-line **Last verified:** YYYY-MM-DD header in each guide, and a script that flags any guide older than 30 days. We have that wired up here; it runs on a schedule and writes a markdown report.
Consider whether your prompts target multiple tools. If yes, decide early whether the prompt template should be tool-tuned (different prose for ChatGPT vs Midjourney), tool-flexible (same prose, different parameter formatting), or hard-fork (one prompt per tool). The answer changes the page architecture: a single prompt-text field works for the second; the first and third need a mode toggle and a second prompt field. The first is what this site chose, and it costs the most effort on the rewrites, because every multi-tool prompt is effectively written twice.
Files for the technically-curious reader
If you want to read the actual implementation, the relevant files in the repository are:
- reference/prompt-grader.html: the public tool's page, including the system-prompt display.
- netlify/functions/grade-prompt.js: the Lambda function with the system prompt as a constant.
- scripts/grade-site-prompts.py: the recursive-grader pipeline (library + persona + function-prompts modes, parallel workers, direct API mode).
- scripts/apply-grader-suggestions.py: applies accepted rewrites, with the Midjourney / Suno / Veo lint and the placeholder rebuild logic.
- scripts/check-tool-guides.py: the freshness check that runs against the tool-guide directory.
- _tool-guides/: short reference notes for Midjourney V8.1, Suno v5.5, and Veo 3.1, each with a verified-on date.
- reference/tool-guide-refresh-task-prompt.md: the canonical prompt for the fortnightly refresh task. Cowork runs this on a schedule; the task walks each guide, web-searches for current state, and surfaces any drift in a chat or email summary.