Valmar with Evals
Turn LLM-as-a-judge failures into knowledge requests so the next eval run has the answer.
Evals catch assistant failures before they ship. Most "wrong" answers your judge flags are not really wrong — they are missing knowledge. Valmar closes the loop: when the judge fails a case, hand the failure to a real person via a knowledge request. By the next eval run, the answer is in your knowledge store and the same case passes.
This pattern works with any eval framework that produces a pass/fail plus a judge's reasoning. The example below uses DeepEval's `GEval`, but the shape is identical for Promptfoo `llm-rubric`, OpenAI Evals model-graded templates, Braintrust `autoevals`, Ragas, or a hand-rolled judge.
The shape
- Run your eval suite as usual.
- For every case the LLM judge fails, collect the input, the assistant's output, and the judge's reasoning.
- Call `client.context.gather(...)` to open a knowledge request — the question goes to a human who can produce the missing answer (see the sketch below).
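In code, that loop is framework agnostic. The sketch below assumes a hypothetical `run_judge` helper that returns a pass/fail plus a reason and a `my_assistant` function standing in for the system under test; only the `valmar.context.gather(...)` call is the Valmar API, with the same parameters used in the full example further down.

```python
# Sketch only: run_judge and my_assistant are hypothetical stand-ins for your
# own judge and assistant. Only valmar.context.gather(...) is Valmar API.
for question, expected in eval_dataset:
    answer = my_assistant(question)
    verdict = run_judge(question, answer, expected)  # assumed to expose .passed and .reason

    if verdict.passed:
        continue

    # Judge failed: open a knowledge request so a human can supply the answer.
    valmar.context.gather(
        question=question,
        background_context=f"Eval judge marked this answer wrong.\n\nJudge reasoning: {verdict.reason}",
        already_tried=f"Assistant produced: {answer}",
        requesting_application="evals",
    )
```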
Why this works
The judge already tells you what is missing (its reasoning) and
what was tried (the assistant's output). Those map cleanly onto
`background_context` and `already_tried` on `context.gather`.
DeepEval example
Wrap your eval loop so any judge failure escalates to Valmar.
```python
import os

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from valmar import ValmarClient

valmar = ValmarClient(
    api_key=os.environ["VALMAR_API_KEY"],
    organization_id=os.environ["VALMAR_ORGANIZATION_ID"],
    project_id=os.environ["VALMAR_PROJECT_ID"],
    base_url=os.environ["VALMAR_BASE_URL"],
)

correctness = GEval(
    name="Correctness",
    criteria=(
        "Determine if the actual output correctly and completely answers "
        "the input using only verifiable company knowledge."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.7,
)

# my_assistant is your assistant under test; swap in your own call.
test_cases = [
    LLMTestCase(
        input="What's the policy on issuing refunds above $500?",
        actual_output=my_assistant("What's the policy on issuing refunds above $500?"),
        expected_output="Refunds above $500 require manager approval.",
    ),
    # ...
]

for case in test_cases:
    correctness.measure(case)
    if correctness.score >= correctness.threshold:
        continue

    # Judge failed this case — escalate to a human.
    handle = valmar.context.gather(
        question=case.input,
        background_context=(
            f"Eval judge marked this answer wrong.\n\n"
            f"Judge reasoning: {correctness.reason}"
        ),
        already_tried=f"Assistant produced: {case.actual_output}",
        requesting_application="evals",
    )
    print(
        f"Knowledge request opened for failed eval "
        f"(id={handle.context_request_id}, status={handle.status})"
    )
```

See Create Knowledge Requests for the full parameter reference.
Closing the loop
Once a teammate answers the request, the result lands in your
project-scoped knowledge store. On the next eval run, your assistant
retrieves it through `context.search` (or whatever
RAG path it already uses) and the same case passes — without anyone
editing the eval suite.
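What retrieval looks like depends on your stack. The sketch below assumes a `query` keyword argument on `context.search`, a `content` field on each result, and an assistant that accepts extra grounding text; all three are assumptions for illustration rather than documented signatures.

```python
# Sketch only: the query parameter, the result shape, and the assistant's
# extra `context` argument are assumptions, not documented API.
question = "What's the policy on issuing refunds above $500?"

results = valmar.context.search(query=question)
knowledge = "\n".join(item.content for item in results)

# The previously missing answer is now available as grounding context,
# so the same eval case passes on this run.
answer = my_assistant(question, context=knowledge)
```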
This is a third way to use Valmar alongside the proactive and reactive modes: evals are proactive discovery, but driven by your existing test suite instead of a gap-analysis run.
Works with any judge
The same wrap applies anywhere a judge emits a pass/fail and a reason:
- Promptfoo — wrap the `llm-rubric` assertion's failure callback.
- OpenAI Evals — call `context.gather` from a custom grader.
- Braintrust `autoevals` — escalate when `Factuality`/`Battle` scores drop below threshold.
- Ragas — escalate when `faithfulness` or `answer_correctness` fails on a case.
- Your own judge — anywhere you have `(input, output, reason)`, you have everything `context.gather` needs (see the helper sketch below).
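For that last case, the wrapper is only a few lines. `escalate_failed_case` is an illustrative name, not part of the Valmar SDK; it reuses the `valmar` client and the same `context.gather` parameters from the DeepEval example above.

```python
# Illustrative helper (not part of the Valmar SDK): turn any judge failure
# into a knowledge request, given the (input, output, reason) triple.
def escalate_failed_case(question: str, output: str, reason: str) -> str:
    handle = valmar.context.gather(
        question=question,
        background_context=f"Eval judge marked this answer wrong.\n\nJudge reasoning: {reason}",
        already_tried=f"Assistant produced: {output}",
        requesting_application="evals",
    )
    return handle.context_request_id
```

Call it from whichever hook your framework exposes for a failed assertion or a below-threshold score.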