- Why not track LLM eval results in a spreadsheet?
- Because a spreadsheet can't answer 'which test cases regressed between v3 and v4' or 'pass rate per model this month' without manual pivoting, and it gets harder to maintain as runs pile up. Those questions are GROUP BY and trend queries. Log each scored case as a row and nlqdb runs the aggregation in Postgres, showing the SQL it ran.
- How do the eval results get into the database?
- Write one row per scored case (prompt version, test case id, model, score, pass/fail, timestamp) with the deterministic `nlqdb_remember` MCP tool or a parameterised INSERT through `POST /v1/run` (`GLOBAL-015`). The row shape is a trust boundary: it is built server-side, not guessed by an LLM. Then ask the trend questions in English over the same table.
- Does nlqdb run the evals or score the outputs itself?
- No. Your eval harness (promptfoo, Braintrust, LangSmith, or a custom LLM-as-judge) runs the cases and produces the scores. nlqdb is the database half: you log the scored results and get a SQL query planner over them for 'per version / over time' questions. The two compose; nlqdb doesn't grade your outputs.
- Can I see the SQL behind a regression number?
- Always. Every answer returns the result rows plus the compiled SQL under a trace toggle (`SK-WEB-005`), so you can check the grain (per case vs per run) before trusting a pass-rate figure. nlqdb never hides the SQL behind the answer.
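
As a rough illustration of the kind of queries the answers above describe, here is a minimal sketch using Python's built-in `sqlite3` as a stand-in for Postgres. The table name `eval_results`, its column names, and the sample rows are all assumptions made up for this example; they are not nlqdb's actual schema, and the SQL nlqdb compiles for a given question may look different.

```python
import sqlite3

# Stand-in for Postgres so the sketch runs anywhere.
# Table and column names are hypothetical, not nlqdb's real row shape.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE eval_results (
        prompt_version TEXT    NOT NULL,
        case_id        TEXT    NOT NULL,
        model          TEXT    NOT NULL,
        score          REAL    NOT NULL,
        passed         INTEGER NOT NULL,  -- 1 = pass, 0 = fail
        scored_at      TEXT    NOT NULL   -- ISO 8601 timestamp
    )
    """
)

# One row per scored case, written with a parameterised INSERT.
# The values are invented sample data.
rows = [
    ("v3", "case-1", "model-a", 0.90, 1, "2024-01-01T00:00:00Z"),
    ("v3", "case-2", "model-a", 0.80, 1, "2024-01-01T00:00:00Z"),
    ("v3", "case-3", "model-b", 0.20, 0, "2024-01-01T00:00:00Z"),
    ("v4", "case-1", "model-a", 0.95, 1, "2024-01-08T00:00:00Z"),
    ("v4", "case-2", "model-a", 0.30, 0, "2024-01-08T00:00:00Z"),
    ("v4", "case-3", "model-b", 0.70, 1, "2024-01-08T00:00:00Z"),
]
conn.executemany("INSERT INTO eval_results VALUES (?, ?, ?, ?, ?, ?)", rows)

# "Pass rate per model": a GROUP BY aggregation at per-case grain.
pass_rate_by_model = conn.execute(
    """
    SELECT model, AVG(passed) AS pass_rate, COUNT(*) AS n_cases
    FROM eval_results
    GROUP BY model
    ORDER BY model
    """
).fetchall()

# "Which test cases regressed between v3 and v4": a self-join that keeps
# cases which passed under the old version and failed under the new one.
regressed = conn.execute(
    """
    SELECT old.case_id
    FROM eval_results AS old
    JOIN eval_results AS new
      ON new.case_id = old.case_id AND new.model = old.model
    WHERE old.prompt_version = ? AND new.prompt_version = ?
      AND old.passed = 1 AND new.passed = 0
    ORDER BY old.case_id
    """,
    ("v3", "v4"),
).fetchall()

print(pass_rate_by_model)  # [('model-a', 0.75, 4), ('model-b', 0.5, 2)]
print(regressed)           # [('case-2',)]
```

The `n_cases` column is there for the grain check mentioned above: a pass rate computed over 4 case rows means something different from one computed over 2 runs. In Postgres, a boolean `passed` column would need a cast such as `AVG(passed::int)` before averaging.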