Valmar with Evals
Turn LLM-as-a-judge failures into knowledge requests so the next eval run has the answer.
Evals catch assistant failures before they ship. Most "wrong" answers your judge flags are not really wrong — they are missing knowledge. Valmar closes the loop: when the judge fails a case, hand the failure to a real person via a knowledge request. By the next eval run, the answer is in your knowledge store and the same case passes.
This pattern works with any eval framework that produces a pass/fail plus a judge's reasoning. The example below uses DeepEval's `GEval`, but the shape is identical for Promptfoo `llm-rubric`, OpenAI Evals model-graded templates, Braintrust `autoevals`, Ragas, or a hand-rolled judge.
The shape
- Run your eval suite as usual.
- For every case the LLM judge fails, collect the input, the assistant's output, and the judge's reasoning.
- Call `client.context.gather(...)` to open a knowledge request — the question goes to a human who can produce the missing answer (see the sketch below).
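In code, that loop is framework agnostic. The sketch below assumes a hypothetical `run_judge` helper that returns a pass/fail plus a reason and a `my_assistant` function standing in for the system under test; only the `valmar.context.gather(...)` call is the Valmar API, with the same parameters used in the full example further down.

```python
# Sketch only: run_judge and my_assistant are hypothetical stand-ins for your
# own judge and assistant. Only valmar.context.gather(...) is Valmar API.
for question, expected in eval_dataset:
    answer = my_assistant(question)
    verdict = run_judge(question, answer, expected)  # assumed to expose .passed and .reason

    if verdict.passed:
        continue

    # Judge failed: open a knowledge request so a human can supply the answer.
    valmar.context.gather(
        question=question,
        background_context=f"Eval judge marked this answer wrong.\n\nJudge reasoning: {verdict.reason}",
        already_tried=f"Assistant produced: {answer}",
        requesting_application="evals",
    )
```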
Why this works
The judge already tells you what is missing (its reasoning) and
what was tried (the assistant's output). Those map cleanly onto
`background_context` and `already_tried` on `context.gather`.
DeepEval example
Wrap your eval loop so any judge failure escalates to Valmar.
```python
import os

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from valmar import ValmarClient

valmar = ValmarClient(
    api_key=os.environ["VALMAR_API_KEY"],
    organization_id=os.environ["VALMAR_ORGANIZATION_ID"],
    project_id=os.environ["VALMAR_PROJECT_ID"],
    base_url=os.environ["VALMAR_BASE_URL"],
)

correctness = GEval(
    name="Correctness",
    criteria=(
        "Determine if the actual output correctly and completely answers "
        "the input using only verifiable company knowledge."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.7,
)

# my_assistant is your assistant under test; swap in your own call.
test_cases = [
    LLMTestCase(
        input="What's the policy on issuing refunds above $500?",
        actual_output=my_assistant("What's the policy on issuing refunds above $500?"),
        expected_output="Refunds above $500 require manager approval.",
    ),
    # ...
]

for case in test_cases:
    correctness.measure(case)
    if correctness.score >= correctness.threshold:
        continue

    # Judge failed this case — escalate to a human.
    handle = valmar.context.gather(
        question=case.input,
        background_context=(
            f"Eval judge marked this answer wrong.\n\n"
            f"Judge reasoning: {correctness.reason}"
        ),
        already_tried=f"Assistant produced: {case.actual_output}",
        requesting_application="evals",
    )
    print(
        f"Knowledge request opened for failed eval "
        f"(id={handle.context_request_id}, status={handle.status})"
    )
```

See Create Knowledge Requests for the full parameter reference.
Closing the loop
Once a teammate answers the request, the result lands in your
project-scoped knowledge store. On the next eval run, your assistant
retrieves it through `context.search` (or whatever
RAG path it already uses) and the same case passes — without anyone
editing the eval suite.
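What retrieval looks like depends on your stack. The sketch below assumes a `query` keyword argument on `context.search`, a `content` field on each result, and an assistant that accepts extra grounding text; all three are assumptions for illustration rather than documented signatures.

```python
# Sketch only: the query parameter, the result shape, and the assistant's
# extra `context` argument are assumptions, not documented API.
question = "What's the policy on issuing refunds above $500?"

results = valmar.context.search(query=question)
knowledge = "\n".join(item.content for item in results)

# The previously missing answer is now available as grounding context,
# so the same eval case passes on this run.
answer = my_assistant(question, context=knowledge)
```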
This is a third way to use Valmar alongside the proactive and reactive modes: evals are proactive discovery, but driven by your existing test suite instead of a gap-analysis run.
Works with any judge
The same wrap applies anywhere a judge emits a pass/fail and a reason:
- Promptfoo — wrap the `llm-rubric` assertion's failure callback.
- OpenAI Evals — call `context.gather` from a custom grader.
- Braintrust `autoevals` — escalate when `Factuality`/`Battle` scores drop below threshold.
- Ragas — escalate when `faithfulness` or `answer_correctness` fails on a case.
- Your own judge — anywhere you have `(input, output, reason)`, you have everything `context.gather` needs (see the helper sketch below).
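For that last case, the wrapper is only a few lines. `escalate_failed_case` is an illustrative name, not part of the Valmar SDK; it reuses the `valmar` client and the same `context.gather` parameters from the DeepEval example above.

```python
# Illustrative helper (not part of the Valmar SDK): turn any judge failure
# into a knowledge request, given the (input, output, reason) triple.
def escalate_failed_case(question: str, output: str, reason: str) -> str:
    handle = valmar.context.gather(
        question=question,
        background_context=f"Eval judge marked this answer wrong.\n\nJudge reasoning: {reason}",
        already_tried=f"Assistant produced: {output}",
        requesting_application="evals",
    )
    return handle.context_request_id
```

Call it from whichever hook your framework exposes for a failed assertion or a below-threshold score.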