BENCHMARK_PLAN.md

Benchmark Plan

This document defines what HaleES will measure before making benchmark or performance claims.

Important

This is a benchmark plan, not benchmark results. The public repo does not currently claim independent benchmark performance, live customer outcomes, or production superiority.

Why This Exists

A public architecture specification should separate three things:

Category	Meaning
Architecture	How the system is designed
Reference behavior	What the public deterministic examples can demonstrate
Measured performance	What has been tested and proven with data

HaleES-56 currently publishes architecture and reference behavior. Measured performance belongs here only after evidence exists.

Candidate Metrics

Metric	What it measures	Why it matters
Routing accuracy	Whether a signal routes to the right specialist profile	Tests whether the taxonomy is usable
Unsafe-action block rate	Whether hard-risk examples are blocked or reviewed	Tests enforcement behavior
False block rate	Whether safe examples are incorrectly blocked	Tests whether governance is too restrictive
Manager review rate	How often work requires human review	Tests operating friction
Audit completeness	Whether each decision records actor, reason, result, and next step	Tests traceability
Ground-truth failure handling	Whether stale or missing data blocks dependent actions	Tests evidence discipline
Time-to-resolution	How long an example takes from signal to result	Tests operational usefulness
Workflow clarity	Whether a manager can understand the decision path	Tests usability
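
As a rough illustration of how a few of these metrics might be computed once labeled results exist, the sketch below tallies four of them from a list of decision records. The `Decision` record, its field names, the `hard_risk` label, and the exact rate definitions are assumptions made for this example and are not part of the published spec.

```python
from dataclasses import dataclass

# Hypothetical record of one evaluated example; all field names are assumptions.
@dataclass
class Decision:
    hard_risk: bool      # labeled in advance as a hard-risk example
    outcome: str         # "pass", "review", or "block"
    actor: str = ""
    reason: str = ""
    result: str = ""
    next_step: str = ""

# The four audit fields named in the "Audit completeness" row.
AUDIT_FIELDS = ("actor", "reason", "result", "next_step")

def rate(hits, total):
    """Return hits / total, or None when there is nothing to measure."""
    return hits / total if total else None

def summarize(decisions):
    risky = [d for d in decisions if d.hard_risk]
    safe = [d for d in decisions if not d.hard_risk]
    return {
        # Hard-risk examples that were blocked or sent to review.
        "unsafe_action_block_rate": rate(
            sum(d.outcome in ("block", "review") for d in risky), len(risky)),
        # Safe examples that were blocked anyway.
        "false_block_rate": rate(
            sum(d.outcome == "block" for d in safe), len(safe)),
        # Share of all examples that required a human reviewer.
        "manager_review_rate": rate(
            sum(d.outcome == "review" for d in decisions), len(decisions)),
        # Share of decisions with every audit field filled in.
        "audit_completeness": rate(
            sum(all(getattr(d, f) for f in AUDIT_FIELDS) for d in decisions),
            len(decisions)),
    }
```

Returning `None` for an empty denominator, rather than zero, keeps an untested category from being read as a measured result.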

Public Reference Test Set

Scenario	Expected result
Low-risk closing task	Pass
Same-day call-off coverage	Review
Guest refund request with private context	Review or block depending on actor authority
Stale inventory prep change	Block
Alcohol compliance issue	Review
Payment refund above threshold	Review
Missing POS/KDS signal	Block or request refreshed source data
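
One possible way to make the table above machine-checkable is to store, for each scenario, the set of outcomes the table accepts. The scenario identifiers and the `request_refresh` outcome label are assumptions chosen for this sketch; the source only names the scenarios and their expected results in prose.

```python
# Each scenario maps to the set of outcomes the table accepts.
# Keys and the "request_refresh" label are hypothetical names for this sketch.
REFERENCE_SCENARIOS = {
    "low_risk_closing_task": {"pass"},
    "same_day_call_off_coverage": {"review"},
    # The table allows either result, depending on actor authority.
    "guest_refund_private_context": {"review", "block"},
    "stale_inventory_prep_change": {"block"},
    "alcohol_compliance_issue": {"review"},
    "payment_refund_above_threshold": {"review"},
    # The table allows a block or a request for refreshed source data.
    "missing_pos_kds_signal": {"block", "request_refresh"},
}

def matches_expectation(scenario, actual):
    """True when the actual outcome is one the table allows for the scenario."""
    return actual in REFERENCE_SCENARIOS[scenario]
```

Accepting a set of outcomes is a simplification. A stricter version would split the guest refund scenario by actor authority so that each case has exactly one expected result.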

Measurement Method

1. Define a public-safe input scenario.
2. Route it through the reference router.
3. Evaluate it through the policy gate.
4. Record the expected result.
5. Compare actual result to expected result.
6. Log the decision reason.
7. Report failures without modifying expectations to hide misses.
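
The steps above can be sketched as a small evaluation loop. The `route` and `policy_gate` functions here are placeholders invented for this example, not the reference router or policy gate, and the signal keys (`kind`, `risk`, `data_stale`) are likewise assumptions. The expected outcomes are fixed in the scenario definition before anything runs, and every row is kept in the report whether or not it matched.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    signal: dict     # public-safe input; the keys used below are assumptions
    expected: set    # outcomes counted as correct, fixed before the run

def route(signal):
    """Placeholder router: returns a specialist profile name."""
    return signal.get("kind", "general")

def policy_gate(profile, signal):
    """Placeholder policy gate: returns (outcome, reason)."""
    if signal.get("data_stale"):
        return "block", "source data is stale"
    if signal.get("risk") == "high":
        return "review", "high-risk action needs manager review"
    return "pass", "no policy rule triggered"

def run(scenarios):
    """Evaluate each scenario and keep every row, including misses."""
    report = []
    for s in scenarios:
        profile = route(s.signal)
        actual, reason = policy_gate(profile, s.signal)
        report.append({
            "scenario": s.name,
            "profile": profile,
            "expected": sorted(s.expected),
            "actual": actual,
            "reason": reason,
            "ok": actual in s.expected,
        })
    return report
```

A caller would build a list of `Scenario` objects, pass it to `run`, and publish the full report. Rows where `ok` is false stay in the output alongside the logged reason, so a miss is visible rather than quietly dropped.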

What Would Count As Real Evidence Later

Evidence type	Requirement
Unit tests	Public tests pass consistently in CI
Scenario evaluation	Every scenario in the public example set produces its expected pass, review, or block result
Human review	Hospitality operators confirm whether the workflows are understandable
Field notes	Real operational observations are documented without private data
Independent review	External reviewers can inspect the public spec and reproduce the reference behavior

What Not To Claim Yet

Do not claim:

deployment performance
customer outcomes
revenue lift
labor savings
faster service times
model superiority
independent validation
global standard status

Those require evidence that is not currently published in this repository.
