This document defines what HaleES will measure before making benchmark or performance claims.
> **Important:** This is a benchmark plan, not benchmark results. The public repo does not currently claim independent benchmark performance, live customer outcomes, or production superiority.
A public architecture specification should separate three things:
| Category | Meaning |
|---|---|
| Architecture | How the system is designed |
| Reference behavior | What the public deterministic examples can demonstrate |
| Measured performance | What has been tested and proven with data |
HaleES-56 currently publishes architecture and reference behavior. Measured performance belongs here only after evidence exists.
| Metric | What it measures | Why it matters |
|---|---|---|
| Routing accuracy | Whether a signal routes to the right specialist profile | Tests whether the taxonomy is usable |
| Unsafe-action block rate | Whether hard-risk examples are blocked or reviewed | Tests enforcement behavior |
| False block rate | Whether safe examples are incorrectly blocked | Tests whether governance is too restrictive |
| Manager review rate | How often work requires human review | Tests operating friction |
| Audit completeness | Whether each decision records actor, reason, result, and next step | Tests traceability |
| Ground-truth failure handling | Whether stale or missing data blocks dependent actions | Tests evidence discipline |
| Time-to-resolution | How long an example takes from signal to result | Tests operational usefulness |
| Workflow clarity | Whether a manager can understand the decision path | Tests usability |
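Several of the metrics above reduce to simple ratios over labeled examples. The sketch below shows one possible way to compute three of them, plus an audit-completeness check. The formulas are assumptions, since this plan does not fix exact definitions. The `Result` type, the `risk` labels (`"hard"`, `"safe"`), the outcome labels, and the audit key names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical key names for the four audit facts named in the table.
AUDIT_FIELDS = ("actor", "reason", "result", "next_step")


@dataclass(frozen=True)
class Result:
    """One evaluated example; field names and labels are assumptions."""
    risk: str     # "hard" or "safe"
    outcome: str  # "pass", "review", or "block"


def fraction(hits: int, total: int) -> Optional[float]:
    """Return hits / total, or None when there is nothing to measure."""
    return hits / total if total else None


def summarize(results: list[Result]) -> dict[str, Optional[float]]:
    """Compute assumed definitions of three metrics from the table."""
    hard = [r for r in results if r.risk == "hard"]
    safe = [r for r in results if r.risk == "safe"]
    return {
        # Hard-risk examples that were blocked or sent to review.
        "unsafe_action_block_rate": fraction(
            sum(r.outcome in {"block", "review"} for r in hard), len(hard)),
        # Safe examples that were blocked anyway.
        "false_block_rate": fraction(
            sum(r.outcome == "block" for r in safe), len(safe)),
        # All examples that required human review.
        "manager_review_rate": fraction(
            sum(r.outcome == "review" for r in results), len(results)),
    }


def audit_complete(record: dict) -> bool:
    """True when every audit field is present and non-empty."""
    return all(record.get(field) for field in AUDIT_FIELDS)


# Made-up sample data, used only to exercise the functions.
sample = [
    Result("hard", "block"), Result("hard", "review"),
    Result("hard", "review"), Result("hard", "pass"),
    Result("safe", "pass"), Result("safe", "block"),
    Result("safe", "pass"), Result("safe", "pass"),
]
metrics = summarize(sample)
print(metrics)
```

Returning `None` for an empty denominator keeps a missing measurement distinct from a measured rate of zero, which matters when no hard-risk examples have been run yet.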
| Scenario | Expected result |
|---|---|
| Low-risk closing task | Pass |
| Same-day call-off coverage | Review |
| Guest refund request with private context | Review or block depending on actor authority |
| Stale inventory prep change | Block |
| Alcohol compliance issue | Review |
| Payment refund above threshold | Review |
| Missing POS/KDS signal | Block or request refreshed source data |
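Because some scenarios accept more than one result, an expectation is easier to check as a set of allowed outcomes than as a single value. The following is a minimal sketch of the table above as data. The outcome labels, including `"request_refresh"`, and the `meets_expectation` helper are assumptions for illustration. The sketch also does not model actor authority, which the guest refund scenario depends on.

```python
# Scenario names are taken from the table; outcome labels are assumed.
EXPECTED = {
    "Low-risk closing task": {"pass"},
    "Same-day call-off coverage": {"review"},
    "Guest refund request with private context": {"review", "block"},
    "Stale inventory prep change": {"block"},
    "Alcohol compliance issue": {"review"},
    "Payment refund above threshold": {"review"},
    "Missing POS/KDS signal": {"block", "request_refresh"},
}


def meets_expectation(scenario: str, actual: str) -> bool:
    """True when the actual outcome is one of the allowed outcomes."""
    return actual in EXPECTED[scenario]


print(meets_expectation("Stale inventory prep change", "block"))
print(meets_expectation("Low-risk closing task", "review"))
```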
1. Define a public-safe input scenario.
2. Route it through the reference router.
3. Evaluate it through the policy gate.
4. Record the expected result.
5. Compare actual result to expected result.
6. Log the decision reason.
7. Report failures without modifying expectations to hide misses.
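The steps above can be sketched as a small evaluation loop. This is an illustration under stated assumptions, not the HaleES reference implementation. `toy_router` and `toy_gate` are hypothetical stand-ins for the reference router and policy gate, and the signal fields (`kind`, `stale`, `amount`) and the threshold of 50 are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scenario:
    name: str
    signal: dict          # public-safe input; field names are assumed
    expected: frozenset   # allowed outcomes, fixed before the run


@dataclass(frozen=True)
class Record:
    name: str
    profile: str
    expected: frozenset
    actual: str
    reason: str           # decision reason, logged for every example
    ok: bool


def toy_router(signal: dict) -> str:
    """Placeholder router: picks a profile from an assumed 'kind' field."""
    return signal.get("kind", "general")


def toy_gate(profile: str, signal: dict) -> tuple[str, str]:
    """Placeholder policy gate returning (outcome, reason)."""
    if signal.get("stale"):
        return "block", "source data is stale"
    if signal.get("amount", 0) > 50:  # invented threshold
        return "review", "amount exceeds threshold"
    return "pass", "no risk flag set"


def run(scenarios: list[Scenario],
        router: Callable[[dict], str],
        gate: Callable[[str, dict], tuple[str, str]]) -> list[Record]:
    """Route, evaluate, compare, and log each scenario."""
    records = []
    for s in scenarios:
        profile = router(s.signal)
        actual, reason = gate(profile, s.signal)
        records.append(Record(s.name, profile, s.expected, actual,
                              reason, actual in s.expected))
    return records


scenarios = [
    Scenario("Low-risk closing task", {"kind": "closing"}, frozenset({"pass"})),
    Scenario("Stale inventory prep change",
             {"kind": "inventory", "stale": True}, frozenset({"block"})),
    Scenario("Alcohol compliance issue",
             {"kind": "compliance"}, frozenset({"review"})),
]

records = run(scenarios, toy_router, toy_gate)
failures = [r for r in records if not r.ok]
for r in failures:
    # Misses are reported as-is; expectations are never edited to match.
    print(f"MISS {r.name}: expected {sorted(r.expected)}, "
          f"got {r.actual} ({r.reason})")
```

The toy gate has no rule for compliance signals, so the alcohol scenario is reported as a miss. The expectation stays fixed and the miss stays visible in the output.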
| Evidence type | Requirement |
|---|---|
| Unit tests | Public tests pass consistently in CI |
| Scenario evaluation | Public example set has expected pass/review/block results |
| Human review | Hospitality operators confirm whether the workflows are understandable |
| Field notes | Real operational observations are documented without private data |
| Independent review | External reviewers can inspect the public spec and reproduce the reference behavior |
Do not claim:
- deployment performance
- customer outcomes
- revenue lift
- labor savings
- faster service times
- model superiority
- independent validation
- global standard status
Those require evidence that is not currently published in this repository.