# Context Bundle Eval

## Purpose

Context Bundle Eval provides continuous quality checks for `/v1/context/bundle`.
It combines:
- question-set execution
- rule-based scoring
- run-to-run diff
- optional LLM judge
## Structure

- `eval/questions.yaml` (20 sample questions)
- `scripts/eval/run_bundle_eval.ts`
- `scripts/eval/score_bundle.ts`
- `scripts/eval/diff_bundle.ts`
- `scripts/eval/render_diff_html.ts`
- `scripts/eval/token_count.ts`
- `scripts/eval/helpers.ts`

Run outputs:

- `eval/runs/<timestamp>/bundle.jsonl`
- `eval/runs/<timestamp>/scores.json`
- `eval/runs/<timestamp>/report.md`
- `eval/runs/<timestamp>/diff.md` (when diffing)
- `eval/runs/<timestamp>/diff.html` (when diffing)
## Run Eval

```bash
pnpm eval:bundle
```

Common options:

```bash
pnpm eval:bundle -- --base-url http://localhost:8080
pnpm eval:bundle -- --limit 10
pnpm eval:bundle -- --debug true
pnpm eval:bundle -- --mask true
pnpm eval:bundle -- --out-dir eval/runs/manual-01
```
Notes:

- `--debug true` stores the debug bundle for each case in its JSONL entry.
- `--mask true` masks sensitive token/key fields before persisting outputs.
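The masking behavior can be sketched as a recursive walk over the bundle before it is written out. This is a minimal sketch: `maskSecrets` and the exact field list are assumptions for illustration, not the real helper in `scripts/eval/helpers.ts`.

```typescript
// Hypothetical field list; the real eval helpers may mask more or fewer keys.
const SENSITIVE_KEYS = ["api_key", "token", "authorization"];

function maskSecrets(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(maskSecrets);
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [k, v] of Object.entries(value as Record<string, unknown>)) {
      // Replace sensitive leaf values; recurse into everything else.
      out[k] = SENSITIVE_KEYS.includes(k.toLowerCase()) ? "***" : maskSecrets(v);
    }
    return out;
  }
  return value;
}

// Nested token fields are replaced; workspace_key stays (allowed in reports).
const masked = maskSecrets({ case: "q01", auth: { token: "secret", workspace_key: "ws-1" } });
console.log(JSON.stringify(masked));
// → {"case":"q01","auth":{"token":"***","workspace_key":"ws-1"}}
```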
## Rule-Based Scoring

`score_bundle.ts` evaluates each case against its expected rules:

- `must_include_types`
- `must_not_include_types`
- `should_include_keywords`
- `must_include_fields`
- token budget penalty when over budget

Outputs:

- `scores.json` (case scores, totals, reasons)
- `report.md` (summary + top failing cases)
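The rule checks can be sketched as follows. This is a hypothetical shape covering a subset of the rules (`must_include_fields` is omitted); the real schema in `eval/questions.yaml` and the weights in `score_bundle.ts` may differ.

```typescript
// Assumed shapes for a case's expectations and the bundle it produced.
interface Expected {
  must_include_types?: string[];
  must_not_include_types?: string[];
  should_include_keywords?: string[];
  token_budget?: number;
}

interface CaseResult { types: string[]; text: string; tokens: number }

function scoreCase(exp: Expected, res: CaseResult): { score: number; reasons: string[] } {
  const reasons: string[] = [];
  let score = 1;
  // Hard rules: any violation zeroes the score.
  for (const t of exp.must_include_types ?? []) {
    if (!res.types.includes(t)) { score = 0; reasons.push(`missing type: ${t}`); }
  }
  for (const t of exp.must_not_include_types ?? []) {
    if (res.types.includes(t)) { score = 0; reasons.push(`forbidden type: ${t}`); }
  }
  // Soft rule: missing keywords apply a small penalty (weight is an assumption).
  for (const k of exp.should_include_keywords ?? []) {
    if (!res.text.includes(k)) { score -= 0.1; reasons.push(`keyword not found: ${k}`); }
  }
  // Token budget penalty when over budget.
  if (exp.token_budget !== undefined && res.tokens > exp.token_budget) {
    score -= 0.2;
    reasons.push(`over budget: ${res.tokens}/${exp.token_budget}`);
  }
  return { score: Math.max(0, score), reasons };
}
```

Every reason string is retained so `scores.json` can explain each deduction, which is what `report.md` summarizes.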
## Diff Two Runs

```bash
pnpm eval:diff -- --a eval/runs/<runA> --b eval/runs/<runB>
```
Compared dimensions:

- global rule selected IDs
- `snapshot.top_decisions` (id:title)
- `snapshot.active_work` titles
- retrieval IDs + score breakdown
- token usage breakdown

Outputs (in run B's directory by default):

- `diff.json`
- `diff.md`
- `diff.html`
Color conventions in HTML:
- added: green
- removed: red
- changed: yellow
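The added/removed classification behind those colors can be sketched as a set diff over the IDs from each run. This is an assumption about how `diff_bundle.ts` works internally, not its actual code:

```typescript
// Classify IDs relative to two runs: present only in B = added (green),
// present only in A = removed (red). "Changed" (yellow) would additionally
// compare per-ID payloads, which is omitted here.
function diffIds(a: string[], b: string[]): { added: string[]; removed: string[] } {
  const setA = new Set(a);
  const setB = new Set(b);
  return {
    added: b.filter((id) => !setA.has(id)),
    removed: a.filter((id) => !setB.has(id)),
  };
}

console.log(diffIds(["r1", "r2"], ["r2", "r3"]));
// → { added: [ 'r3' ], removed: [ 'r1' ] }
```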
## Optional LLM Judge

The LLM judge is optional and disabled by default.

Enable it:

```bash
EVAL_JUDGE_PROVIDER=openai \
EVAL_JUDGE_API_KEY=*** \
pnpm eval:bundle -- --judge true
```

Supported providers:

- `openai`
- `claude`
- `gemini`
Judge returns:
- score (1..5)
- reasons (up to 3 bullets)
- suggestions (up to 3 bullets)
When these env vars are missing, the judge is skipped and scoring stays rule-based only.
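The skip condition amounts to a small gate. The env var names come from this document; the function itself is a hypothetical sketch of the check, not the runner's real code:

```typescript
// The judge runs only when the flag is on AND both env vars are non-empty;
// otherwise scoring silently stays rule-based.
function judgeEnabled(
  env: Record<string, string | undefined>,
  judgeFlag: boolean,
): boolean {
  return judgeFlag
    && Boolean(env.EVAL_JUDGE_PROVIDER)
    && Boolean(env.EVAL_JUDGE_API_KEY);
}
```

In practice this means a misspelled env var degrades quietly rather than failing the run, which is why "Judge is skipped unexpectedly" appears under Troubleshooting.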
## Report Helpers

Show the latest report:

```bash
pnpm eval:report
```
## Security Notes

- Do not print or store API keys in eval outputs.
- `workspace_key`/`project_key` are allowed in reports.
- Use environment variables for auth: `MEMORY_CORE_API_KEY`
## CI Guidance (Optional)

Recommended CI pattern:

- Run `pnpm eval:bundle` against the staging memory-core.
- Store `report.md` + `scores.json` as artifacts.
- Compare with the previous baseline using `pnpm eval:diff`.
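A baseline gate for the comparison step might look like the following. The totals shape read from `scores.json` is an assumption for illustration:

```typescript
// Assumed aggregate shape from scores.json; the real file may differ.
interface Totals { total: number; failing: number }

// Flag a regression when the total score drops beyond a tolerance,
// or when the failing-case count grows at all.
function regressed(baseline: Totals, current: Totals, tolerance = 0): boolean {
  return current.total < baseline.total - tolerance
    || current.failing > baseline.failing;
}
```

A CI job could call this after `pnpm eval:diff` and exit non-zero on `true`, turning the eval into a merge gate rather than a report-only step.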
## PR Comment Integration

Workflow: `.github/workflows/eval-comment.yml`

Behavior on `pull_request` (opened, synchronize, reopened):

- Runs eval on the PR HEAD (`eval/runs/pr-head`)
- Tries an optional base eval on `origin/<base_ref>` (`eval/runs/pr-base`)
- Generates a diff (`diff.md`, `diff.html`) when the base run exists
- Posts/updates a sticky PR comment (header-based update)

Sticky marker in the comment body:

```
<!-- CLAUSTRUM_EVAL_COMMENT -->
```
Comment includes:
- total score / failing case count
- top failures (up to 5)
- budget overrun cases
- MCP tool schema snapshot guard status
- diff summary
- link to workflow artifacts
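The header-based update can be sketched as find-or-create keyed on the sticky marker: scan existing PR comments for the marker, update the match, otherwise create a new comment. The `Comment` shape is simplified; the workflow's real implementation may differ.

```typescript
// Marker from this document; it must appear in the body so the next
// workflow run can locate and update the same comment.
const MARKER = "<!-- CLAUSTRUM_EVAL_COMMENT -->";

interface Comment { id: number; body: string }

// Returns the existing sticky comment, or undefined if none was posted yet.
function pickSticky(comments: Comment[]): Comment | undefined {
  return comments.find((c) => c.body.includes(MARKER));
}

// Marker leads the body; as an HTML comment it is invisible when rendered.
function buildBody(summary: string): string {
  return `${MARKER}\n${summary}`;
}
```

The caller would PATCH the comment returned by `pickSticky` when it exists and POST a new one otherwise, keeping exactly one eval comment per PR.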
Schema guard details:

- Runs `apps/mcp-adapter/src/tools-schema.snapshot.test.ts`
- Writes `schema-snapshot.json` and `schema-snapshot.log` to `eval/runs/pr-head/`
- Fails the workflow at the end if snapshot drift is detected (after the comment and artifacts are published)
## Troubleshooting

- `scores.json` not found
  - verify `pnpm eval:bundle` completed
  - verify the output folder exists under `eval/runs/<id>`
- Diff not generated
  - the base eval may have failed (network/time/resource limits)
  - check that `eval/runs/pr-base/bundle.jsonl` exists
- All HTTP checks fail
  - confirm the `memory-core` health endpoint is reachable
  - confirm `MEMORY_CORE_API_KEY` is set for the eval runner
- Judge is skipped unexpectedly
  - confirm `--judge true` was passed
  - confirm both `EVAL_JUDGE_PROVIDER` and `EVAL_JUDGE_API_KEY` are set