Log pumpingStation architectural decisions; bump submodule pointers

Four decisions recorded under .agents/decisions/ per project convention
(DECISION-YYYYMMDD-slug.md) to close the loop on today's pumpingStation
refactor + eval + docs work:

- wiki-in-code-repo: why docs+diagrams+code now live in one package
- 5-threshold-naming: old/new field mapping + breaking-change rationale
- mode-tier-template: Tier 1/2/3 classification for mode pages
- eval-harness: why eval/ exists alongside test/

Also bumps nodes/pumpingStation to 66fd3fe (eval harness + Tier 2/3
template pages).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 16:50:00 +02:00
parent b885f291d4
commit 79afe11da8
5 changed files with 195 additions and 1 deletions
--- a/.agents/decisions/DECISION-20260422-pumpingstation-eval-harness.md
+++ b/.agents/decisions/DECISION-20260422-pumpingstation-eval-harness.md
@@ -0,0 +1,53 @@
+# DECISION-20260422-pumpingstation-eval-harness
+
+## Context
+- Task/request: Provide a way to fluctuate inputs to the pumpingStation and observe the system's response over time, in a readable form suitable for post-hoc analysis (operator review, Grafana, or ad-hoc debugging).
+- Impacted files/contracts: `nodes/pumpingStation/eval/*`, `test/basic/*`.
+- Why a decision is required now: Unit tests (`node --test`) verify individual functions in isolation. They can't ergonomically show "what does the level look like over 20 minutes of storm surge". That's a different artefact.
+
+## Options
+1. Extend unit tests to cover scenarios
+- Benefits: Single testing surface.
+- Risks: Unit tests are assertion-heavy and slow to read; scenario output (tables, events) gets lost in TAP.
+
+2. Separate `eval/` folder with a scenario runner (selected)
+- Benefits: Scenarios read as narratives ("steady state", "storm surge", "safety dry-run"); output is human-friendly (ASCII table + events + expectation checks); JSONL per-tick log enables Grafana streaming or offline analysis.
+- Risks: Second test surface to maintain.
+
+3. Real-time Node-RED deployment + observe
+- Benefits: Closest to production.
+- Risks: Slow, requires infrastructure, irreproducible.
+
+## Decision
+- Selected option: Option 2.
+- Decision owner: User
+- Date: 2026-04-22
+- Rationale: Unit tests answer "is this function correct?"; evals answer "how does the system behave under this input profile?". Two distinct questions call for two distinct tools. The split also matches the 3-tier convention in `.claude/rules/testing.md` (basic/integration/edge), which covers asserted behaviours, not scenario replay.
+
+## Architecture
+
+```
+test/
+  basic/ integration/ edge/   — node:test + assertions
+eval/
+  run.js                      — scenario driver
+  scenarios/*.js              — each exports { name, config, setup, inputs(t,ps), expectations }
+  formatters/table.js         — ASCII summary
+  logs/*.jsonl                — one-line-per-tick output
+  README.md                   — usage + how to pipe into Grafana
+```
+
+The driver monkey-patches `Date.now()` so the volume integrator sees exactly 1 second per tick regardless of wall-clock time. Every tick records a state snapshot (level, volume, direction, netFlow, flowSource, demand, mode, safetyActive) as one JSONL line, suitable for streaming.
+
+## Consequences
+- Compatibility impact: None.
+- Safety/security impact: None — read-only simulation.
+- Data/operations impact: Running `node eval/run.js --all` produces artefacts that CI can compare between runs to catch regressions (e.g. "did the storm scenario's max level rise compared to last release?"). The JSONL format is friendly to InfluxDB/Grafana for interactive review.
+
+## Implementation Notes
+- Required code/doc updates: Driver + three starter scenarios (`levelbased-steady`, `levelbased-storm`, `safety-dry-run-trip`) + README in `eval/`.
+- Validation evidence required: `node eval/run.js --all` exits 0; manual inspection of JSONL confirms per-tick records make physical sense.
+
+## Rollback / Migration
+- Rollback strategy: Delete `eval/`. Unit tests continue to work.
+- Migration/deprecation plan: N/A.
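
The clock patching described under Architecture can be illustrated with a minimal sketch. It is not the code in `eval/run.js`: `withFakeClock`, `makeIntegrator`, and `runTicks` are hypothetical names, the integrator is a toy stand-in that only assumes the real one measures elapsed time via `Date.now()`, and the per-tick record carries just three of the snapshot fields.

```javascript
// Illustrative sketch only; all function names here are hypothetical.

// Swap Date.now() for a simulated clock while fn runs, then restore it.
function withFakeClock(startMs, fn) {
  const realNow = Date.now;
  let nowMs = startMs;
  Date.now = () => nowMs;
  const clock = { advance: (ms) => { nowMs += ms; } };
  try {
    return fn(clock);
  } finally {
    Date.now = realNow; // always restore, even if fn throws
  }
}

// Toy volume integrator (assumption: the real one also derives dt from Date.now()).
function makeIntegrator(initialVolume) {
  let volume = initialVolume;
  let last = Date.now();
  return {
    update(netFlow) {
      const now = Date.now();
      const dtSeconds = (now - last) / 1000;
      last = now;
      volume += netFlow * dtSeconds;
      return volume;
    },
  };
}

// Run n ticks of one simulated second each; return one JSON line per tick.
function runTicks(n, netFlowAt) {
  return withFakeClock(0, (clock) => {
    const integrator = makeIntegrator(10);
    const out = [];
    for (let t = 1; t <= n; t++) {
      clock.advance(1000); // exactly 1 s of simulated time, independent of wall-clock
      const netFlow = netFlowAt(t);
      const volume = integrator.update(netFlow);
      out.push(JSON.stringify({ t, volume, netFlow }));
    }
    return out;
  });
}

const lines = runTicks(3, () => 0.5);
console.log(lines.join('\n'));
```

Because the integrator only ever sees the simulated clock, the three ticks complete instantly yet integrate as three full seconds, and restoring `Date.now` in `finally` keeps the patch from leaking into other code in the same process.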