Solve · Agent builders

How do I log my AI agent's tool calls and query which tool fails most?

If your agent calls tools and you need to know which tool fails most and how slow each one is, log every tool call as a row and ask in English. nlqdb provisions Postgres from your first goal and runs the GROUP BY in SQL, so 'error rate per tool' is a real query, not a grep over traces.

Agents fail across steps: a tool returns the wrong shape, a call times out, a retrieval step finds the wrong doc. The questions that matter are aggregations: which tool fails most, p95 latency per tool, calls per session, success rate this week. The data behind those answers usually sits in flat trace logs or a JSON column, where you grep instead of GROUP BY. Counting failures by hand (or asking the LLM to tally a log) doesn't scale; these are queries, and queries want a planner.
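To make 'p95 latency per tool' concrete, here is a minimal sketch of that aggregation as SQL, run from Python against an in-memory SQLite database standing in for Postgres. The `tool_calls` table and its columns are assumptions for illustration, not a schema nlqdb prescribes. The query uses the nearest-rank definition of p95 (the smallest latency whose cumulative share of calls reaches 95%), which matches what Postgres's `percentile_disc(0.95)` computes; the window-function form appears here only because the SQLite bundled with Python typically lacks that aggregate.

```python
import sqlite3

# Hypothetical table; SQLite in memory stands in for Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tool_calls (tool_name TEXT NOT NULL, latency_ms INTEGER NOT NULL)"
)

# Ten evenly spread 'search' latencies, and a 'calculator' tool with one slow outlier.
search = [("search", ms) for ms in range(100, 1100, 100)]
calculator = [("calculator", ms) for ms in (5, 6, 7, 50)]
conn.executemany("INSERT INTO tool_calls VALUES (?, ?)", search + calculator)

# Nearest-rank p95: the smallest latency whose cumulative share is >= 0.95.
P95_SQL = """
    SELECT tool_name,
           MIN(latency_ms) AS p95_latency_ms
    FROM (
        SELECT tool_name,
               latency_ms,
               CUME_DIST() OVER (
                   PARTITION BY tool_name ORDER BY latency_ms
               ) AS share
        FROM tool_calls
    ) AS ranked
    WHERE share >= 0.95
    GROUP BY tool_name
    ORDER BY p95_latency_ms DESC
"""
p95_by_tool = conn.execute(P95_SQL).fetchall()

# Plain averages for contrast: they hide the slow tail that p95 exposes.
avg_by_tool = dict(
    conn.execute(
        "SELECT tool_name, AVG(latency_ms) FROM tool_calls GROUP BY tool_name"
    )
)

print(p95_by_tool)
print(avg_by_tool)
```

In this toy data the calculator averages 17 ms but its p95 is 50 ms, which is the kind of gap a grep over a trace log is unlikely to surface.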

The snippet that solves it.

> error rate and average latency grouped by tool name this week, worst first
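As a rough illustration of what a goal like this asks the database to do, the sketch below runs an equivalent GROUP BY from Python. Everything in it is an assumption for illustration: the `tool_calls` table, its column names, the `'ok'`/`'error'` status values, and the use of in-memory SQLite in place of Postgres. The SQL nlqdb actually compiles may differ, and "this week" is approximated with an explicit start-of-week parameter.

```python
import sqlite3

# Hypothetical schema; SQLite in memory stands in for Postgres.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tool_calls (
        tool_name  TEXT    NOT NULL,
        session_id TEXT    NOT NULL,
        status     TEXT    NOT NULL,  -- assumed values: 'ok' or 'error'
        latency_ms INTEGER NOT NULL,
        called_at  TEXT    NOT NULL   -- ISO-8601 timestamp
    )
    """
)
rows = [
    ("search", "s1", "ok", 120, "2024-05-06T10:00:00"),
    ("search", "s1", "error", 900, "2024-05-06T10:01:00"),
    ("search", "s2", "ok", 180, "2024-05-07T09:00:00"),
    ("fetch_url", "s1", "error", 3000, "2024-05-06T10:02:00"),
    ("fetch_url", "s2", "error", 2000, "2024-05-07T09:01:00"),
    ("calculator", "s2", "ok", 10, "2024-05-07T09:02:00"),
]
conn.executemany("INSERT INTO tool_calls VALUES (?, ?, ?, ?, ?)", rows)

# Error rate and average latency per tool since the start of the week, worst first.
ERROR_RATE_SQL = """
    SELECT tool_name,
           COUNT(*) AS calls,
           AVG(CASE WHEN status = 'error' THEN 1.0 ELSE 0.0 END) AS error_rate,
           AVG(latency_ms) AS avg_latency_ms
    FROM tool_calls
    WHERE called_at >= ?
    GROUP BY tool_name
    ORDER BY error_rate DESC, avg_latency_ms DESC
"""
report = conn.execute(ERROR_RATE_SQL, ("2024-05-06T00:00:00",)).fetchall()

for tool_name, calls, error_rate, avg_latency_ms in report:
    print(f"{tool_name}: {calls} calls, {error_rate:.0%} errors, {avg_latency_ms:.0f} ms avg")
```

The grain here is one row per call, so `error_rate` is the share of failed calls, not the share of failed sessions.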

What nlqdb does for this

Log each tool call as a typed row (tool name, session id, status, latency_ms, timestamp) so that error rate per tool and p95 latency run as real SQL GROUP BY queries.
Ask the reliability question in English via `<nlq-data>`, the `@nlqdb/sdk`, or MCP `nlqdb_query`; every answer returns rows plus the compiled SQL.
Write call records with the deterministic `nlqdb_remember` tool or a `POST /v1/run` parameterised INSERT, then report over the same database — no separate analytics store.
Plans are content-addressed on `(goal-fingerprint, schema-hash)` (`GLOBAL-006`), so a repeated weekly reliability rollup hits the cache and returns in single-digit ms.
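The "typed row" and "parameterised INSERT" ideas from the list above can be sketched as follows. This does not show the `nlqdb_remember` tool or the `POST /v1/run` request format, neither of which is specified here; it only illustrates the underlying pattern with Python's `sqlite3` as a stand-in for Postgres. The table name, column names, allowed status values, and the `record_tool_call` helper are all hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

ALLOWED_STATUSES = {"ok", "error"}  # assumed status vocabulary


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a typed, constrained tool_calls table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE tool_calls (
            tool_name  TEXT    NOT NULL,
            session_id TEXT    NOT NULL,
            status     TEXT    NOT NULL CHECK (status IN ('ok', 'error')),
            latency_ms INTEGER NOT NULL CHECK (latency_ms >= 0),
            called_at  TEXT    NOT NULL
        )
        """
    )
    return conn


def record_tool_call(conn, tool_name, session_id, status, latency_ms):
    """Validate the row shape in code, then insert with bound parameters."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    if latency_ms < 0:
        raise ValueError("latency_ms must be non-negative")
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO tool_calls (tool_name, session_id, status, latency_ms, called_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (tool_name, session_id, status, latency_ms,
             datetime.now(timezone.utc).isoformat()),
        )


conn = open_db()
record_tool_call(conn, "search", "s1", "ok", 120)
# Values are bound, never spliced into the SQL text, so this is stored as plain data.
record_tool_call(conn, "x'); DROP TABLE tool_calls; --", "s1", "error", 900)

stored = conn.execute(
    "SELECT tool_name, status, latency_ms FROM tool_calls ORDER BY rowid"
).fetchall()
print(stored)
```

Keeping the statement text fixed and passing only values is what lets the row shape act as a boundary: whatever produces the values cannot change which columns are written or what SQL runs.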

Drop into any HTML page

<nlq-data goal="error rate and average latency grouped by tool name this week, worst first"></nlq-data>

The first reliability question an agent team asks — which tool is failing and how slow — is one English goal here, not a hand-written GROUP BY over a trace log.

What this replaces

What nlqdb doesn't try to do here

No automatic tracing — nlqdb stores and aggregates the call rows you write; capturing each tool invocation, its status, and its latency is your agent framework's job (or an OTel/tracing SDK's).
No nested trace-tree UI — nlqdb answers tabular 'per tool / per session' aggregations, not the multi-step span waterfall a dedicated agent-observability tool draws.
No connecting to your existing log store — nlqdb provisions and owns the Postgres it queries; bring-your-own-Postgres is roadmap, not shipped.

Questions buyers ask

Why not just count tool failures in my application code?

Because the questions are aggregations (failures per tool, p95 latency, calls per session), and pulling rows back to tally them in app code is fragile and gets slower as call volume grows. Asking the LLM to count a log is worse: models make arithmetic errors over long lists. nlqdb runs the GROUP BY in Postgres and shows you the SQL it ran, so you can trust the grain.

How do the tool-call records get into the database?

Write one row per tool call (tool name, session id, status, latency, timestamp) with the deterministic `nlqdb_remember` MCP tool or a parameterised INSERT through `POST /v1/run` (`GLOBAL-015`). The row shape stays a trust boundary: it is built server-side, not guessed by the LLM. Then ask the reliability questions in English over the same table.

Can I see the SQL behind the error rates?

Always. Every answer returns the result rows plus the compiled SQL under a trace toggle (`SK-WEB-005`), so you can check the grain (per call vs per session) before trusting a failure rate. nlqdb never hides the SQL behind the answer.

Is this a replacement for an agent-observability tool like Langfuse or AgentOps?

No. Those tools instrument your agent and capture every span automatically, with nested trace-tree UIs built for debugging one run. nlqdb is the database half: you decide what to log, and you get a SQL query planner over it for ad-hoc 'per tool / per week' questions without a per-seat dashboard. The two compose; nlqdb doesn't trace your runs.
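Why the grain (per call vs per session) matters can be shown with a small worked example. The sketch below computes a failure rate both ways over the same rows; the `tool_calls` table, its columns, and the sample data are assumptions for illustration, and in-memory SQLite stands in for Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tool_calls (session_id TEXT NOT NULL, status TEXT NOT NULL)"
)

# s1: 4 calls, 1 error. s2: 1 call, 1 error. s3: 5 calls, no errors.
calls = (
    [("s1", "ok")] * 3
    + [("s1", "error")]
    + [("s2", "error")]
    + [("s3", "ok")] * 5
)
conn.executemany("INSERT INTO tool_calls VALUES (?, ?)", calls)

# Grain = one row per call: share of calls that failed.
per_call = conn.execute(
    "SELECT AVG(CASE WHEN status = 'error' THEN 1.0 ELSE 0.0 END) FROM tool_calls"
).fetchone()[0]

# Grain = one row per session: share of sessions with at least one failed call.
per_session = conn.execute(
    """
    SELECT AVG(had_error)
    FROM (
        SELECT session_id,
               MAX(CASE WHEN status = 'error' THEN 1.0 ELSE 0.0 END) AS had_error
        FROM tool_calls
        GROUP BY session_id
    ) AS s
    """
).fetchone()[0]

print(f"per call: {per_call:.0%}, per session: {per_session:.0%}")
```

The same ten rows give 20% at call grain and about 67% at session grain, which is why reading the compiled SQL before quoting a failure rate is worth the extra step.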

Where this pain shows up in public

Enduring discussion hubs where you can verify the theme without taking our word for it. We don't quote individual posts; we cite search-result and subreddit URLs that stay live as new threads land.

Try nlqdb in 30 seconds

No sign-in. The anonymous database lasts 72 hours; adopt it with one click if you want to keep it.

Start with a goal →

