---
title: EVOLV Architecture Review
created: 2026-03-01
updated: 2026-04-07
status: evolving
tags: [architecture, stack, review]
---

# EVOLV Architecture Review

## Purpose

This document captures:

- the architecture implemented in this repository today
- the broader edge/site/central architecture shown in the drawings under `temp/`
- the key strengths and weaknesses of that direction
- the currently preferred target stack based on owner decisions from this review

It is the local staging document for a later wiki update.

## Evidence Used

Implemented stack evidence:

- `docker-compose.yml`
- `docker/settings.js`
- `docker/grafana/provisioning/datasources/influxdb.yaml`
- `package.json`
- `nodes/*`

Target-state evidence:

- `temp/fullStack.pdf`
- `temp/edge.pdf`
- `temp/CoreSync.drawio.pdf`
- `temp/cloud.yml`

Owner decisions from this review:

- local InfluxDB is required for operational resilience
- central acts as the advisory/intelligence and API-entry layer, not as a direct field caller
- intended configuration authority is the database-backed `tagcodering` model
- architecture wiki pages should be visual, not text-only

## 1. What Exists Today

### 1.1 Product/runtime layer

The codebase is currently a modular Node-RED package for wastewater/process automation:

- EVOLV ships custom Node-RED nodes for plant assets and process logic
- nodes emit both process/control messages and telemetry-oriented outputs
- shared helper logic lives in `nodes/generalFunctions/`
- Grafana-facing integration exists through `dashboardAPI` and Influx-oriented outputs

### 1.2 Implemented development stack

The concrete development stack in this repository is:

- Node-RED
- InfluxDB 2.x
- Grafana

That gives a clear local flow:

1. EVOLV logic runs in Node-RED.
2. Telemetry is emitted in a time-series-oriented shape.
3. InfluxDB stores the telemetry.
4. Grafana renders operational dashboards.
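
Step 2 of the local flow, telemetry in a time-series-oriented shape, can be illustrated with a minimal sketch that serializes one point to InfluxDB line protocol. The point layout, the helper names, and the measurement, tag, and field names are all hypothetical and are not taken from the EVOLV nodes.

```javascript
// Hypothetical helper: serialize one telemetry point to InfluxDB line protocol.
// The point layout ({ measurement, tags, fields, timestampNs }) is an assumption
// for illustration, not the shape the EVOLV nodes actually emit.

// Escape commas, equals signs, and spaces in tag keys, tag values, and field keys.
const escapeKey = (s) => String(s).replace(/([,= ])/g, "\\$1");

// Measurement names only need commas and spaces escaped.
const escapeMeasurement = (s) => String(s).replace(/([, ])/g, "\\$1");

// Format a field value: strings are double-quoted, booleans are bare,
// numbers are written as floats (no "i" integer suffix).
function formatFieldValue(v) {
  if (typeof v === "string") {
    return `"${v.replace(/(["\\])/g, "\\$1")}"`;
  }
  if (typeof v === "boolean") return v ? "true" : "false";
  if (typeof v === "number" && Number.isFinite(v)) return String(v);
  throw new TypeError(`Unsupported field value: ${v}`);
}

function toLineProtocol({ measurement, tags = {}, fields, timestampNs }) {
  const tagPart = Object.keys(tags)
    .sort() // sorted tag keys keep the output deterministic
    .map((k) => `,${escapeKey(k)}=${escapeKey(tags[k])}`)
    .join("");
  const fieldPart = Object.entries(fields)
    .map(([k, v]) => `${escapeKey(k)}=${formatFieldValue(v)}`)
    .join(",");
  if (fieldPart === "") throw new Error("At least one field is required");
  // The timestamp is optional; when omitted, the server assigns one.
  const tsPart = timestampNs === undefined ? "" : ` ${timestampNs}`;
  return `${escapeMeasurement(measurement)}${tagPart} ${fieldPart}${tsPart}`;
}

// Example point with made-up names.
const line = toLineProtocol({
  measurement: "pump_telemetry",
  tags: { site: "rwzi-demo", asset: "pump 1" },
  fields: { flow: 12.5, running: true },
  timestampNs: "1700000000000000000",
});
console.log(line);
```

The timestamp is passed as a string because nanosecond epoch values exceed the safe integer range of a JavaScript number. In a real flow, an InfluxDB client library or an existing Node-RED Influx node would normally handle this serialization.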
### 1.3 Existing runtime pattern in the nodes

A recurring EVOLV pattern is:

- output 0: process/control message
- output 1: Influx/telemetry message
- output 2: registration/control plumbing where relevant

So even in its current implemented form, EVOLV is not only a Node-RED project. It is already a control-plus-observability platform, with Node-RED as orchestration/runtime and InfluxDB/Grafana as telemetry and visualization services.

## 2. What The Drawings Describe

Across `temp/fullStack.pdf` and `temp/CoreSync.drawio.pdf`, the intended platform is broader and layered.

### 2.1 Edge / OT layer

The drawings consistently place these capabilities at the edge:

- PLC / OPC UA connectivity
- Node-RED container as protocol translator and logic runtime
- local broker in some variants
- local InfluxDB / Prometheus style storage in some variants
- local Grafana/SCADA in some variants

This is the plant-side operational layer.

### 2.2 Site / local server layer

The CoreSync drawings also show a site aggregation layer:

- RWZI-local server
- Node-RED / CoreSync services
- site-local broker
- site-local database
- upward API-based synchronization

This layer decouples field assets from central services and absorbs plant-specific complexity.

### 2.3 Central / cloud layer

The broader stack drawings and `temp/cloud.yml` show a central platform layer with:

- Gitea
- Jenkins
- reverse proxy / ingress
- Grafana
- InfluxDB
- Node-RED
- RabbitMQ / messaging
- VPN / tunnel concepts
- Keycloak in the drawing
- Portainer in the drawing

This is a platform-services layer, not just an application runtime.

## 3. Architecture Decisions From This Review

These decisions now shape the preferred EVOLV target architecture.

### 3.1 Local telemetry is mandatory for resilience

Local InfluxDB is not optional.
It is required so that:

- operations continue when central SCADA or central services are down
- local dashboards and advanced digital-twin workflows can still consume recent and relevant process history
- local edge/site layers can make smarter decisions without depending on round-trips to central

### 3.2 Multi-level InfluxDB is part of the architecture

InfluxDB should exist on multiple levels where it adds operational value:

- edge/local for resilience and near-real-time replay
- site for plant-level history, diagnostics, and resilience
- central for fleet-wide analytics, benchmarking, and advisory intelligence

This is not just copy-paste storage at each level. The design intent is event-driven and selective.

### 3.3 Storage should be smart, not only deadband-driven

The target is not simple "store every point" or only a fixed deadband rule such as 1%.

The desired storage approach is:

- observe signal slope and change behavior
- preserve points where state is changing materially
- store fewer points where the signal can be reconstructed downstream with sufficient fidelity
- carry enough metadata or conventions so reconstruction quality is auditable

This implies EVOLV should evolve toward smart storage and signal-aware retention rather than naive event dumping.

### 3.4 Central is the intelligence and API-entry layer

Central may advise and coordinate edge/site layers, but external API requests should not hit field-edge systems directly.

The intended pattern is:

- external and enterprise integrations terminate centrally
- central evaluates, aggregates, authorizes, and advises
- site/edge layers receive mediated requests, policies, or setpoints
- field-edge remains protected behind an intermediate layer

This aligns with the stated security direction.
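
The mediation pattern in 3.4 can be sketched as a small function at the central layer that authorizes an external request, bounds it, and emits an advisory addressed to the site layer rather than to the edge. This is only an illustration under stated assumptions: the policy table, the parameter name, the role name, and the envelope fields are hypothetical, and clamping out-of-range values (instead of rejecting them) is one possible design choice, not a decision recorded in this review.

```javascript
// Hypothetical sketch of central mediation. An external request is never
// forwarded to the edge as-is: central authorizes it, bounds it, and emits an
// advisory envelope addressed to the site layer. All names are assumptions.

// Illustrative policy table: which roles may advise which parameter, and in what range.
const policies = {
  "aeration.do_setpoint": { roles: ["process-engineer"], min: 0.5, max: 3.0 },
};

function mediateRequest(request, policyTable) {
  const policy = policyTable[request.parameter];
  if (!policy) {
    return { accepted: false, reason: "unknown parameter" };
  }
  if (!policy.roles.includes(request.role)) {
    return { accepted: false, reason: "role not authorized" };
  }
  if (typeof request.value !== "number" || Number.isNaN(request.value)) {
    return { accepted: false, reason: "value must be a number" };
  }
  // Clamp instead of rejecting, so the site always receives bounded advice.
  const bounded = Math.min(policy.max, Math.max(policy.min, request.value));
  return {
    accepted: true,
    advisory: {
      target: { layer: "site", siteId: request.siteId }, // never "edge" or "plc"
      parameter: request.parameter,
      advisedValue: bounded,
      clamped: bounded !== request.value,
      kind: "advice", // site/edge keep control authority and may decline
    },
  };
}

// Example: a request above the allowed range is accepted but clamped.
const result = mediateRequest(
  {
    siteId: "rwzi-demo",
    role: "process-engineer",
    parameter: "aeration.do_setpoint",
    value: 4.2,
  },
  policies
);
console.log(JSON.stringify(result, null, 2));
```

Marking the envelope as `kind: "advice"` and recording whether the value was clamped keeps the downward path auditable and leaves the final decision at the site or edge layer.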
### 3.5 Configuration source of truth should be database-backed

The intended configuration authority is the database-backed `tagcodering` model, which already exists but is not yet complete enough to serve as the fully realized source of truth.

That means the architecture should assume:

- asset and machine metadata belong in `tagcodering`
- Node-RED flows should consume configuration rather than silently becoming the only configuration store
- more work is still needed before this behaves as the intended central configuration backbone

## 4. Visual Model

### 4.1 Platform topology

```mermaid
flowchart LR
    subgraph OT["OT / Field"]
        PLC["PLC / IO"]
        DEV["Sensors / Machines"]
    end

    subgraph EDGE["Edge Layer"]
        ENR["Edge Node-RED"]
        EDB["Local InfluxDB"]
        EUI["Local Grafana / Local Monitoring"]
        EBR["Optional Local Broker"]
    end

    subgraph SITE["Site Layer"]
        SNR["Site Node-RED / CoreSync"]
        SDB["Site InfluxDB"]
        SUI["Site Grafana / SCADA Support"]
        SBR["Site Broker"]
    end

    subgraph CENTRAL["Central Layer"]
        API["API / Integration Gateway"]
        INTEL["Overview Intelligence / Advisory Logic"]
        CDB["Central InfluxDB"]
        CGR["Central Grafana"]
        CFG["Tagcodering Config Model"]
        GIT["Gitea"]
        CI["CI/CD"]
        IAM["IAM / Keycloak"]
    end

    DEV --> PLC
    PLC --> ENR
    ENR --> EDB
    ENR --> EUI
    ENR --> EBR
    ENR <--> SNR
    EDB <--> SDB
    SNR --> SDB
    SNR --> SUI
    SNR --> SBR
    SNR <--> API
    API --> INTEL
    API <--> CFG
    SDB <--> CDB
    INTEL --> SNR
    CGR --> CDB
    CI --> GIT
    IAM --> API
    IAM --> CGR
```

### 4.2 Command and access boundary

```mermaid
flowchart TD
    EXT["External APIs / Enterprise Requests"] --> API["Central API Gateway"]
    API --> AUTH["AuthN/AuthZ / Policy Checks"]
    AUTH --> INTEL["Central Advisory / Decision Support"]
    INTEL --> SITE["Site Integration Layer"]
    SITE --> EDGE["Edge Runtime"]
    EDGE --> PLC["PLC / Field Assets"]

    EXT -. no direct access .-> EDGE
    EXT -. no direct access .-> PLC
```

### 4.3 Smart telemetry flow

```mermaid
flowchart LR
    RAW["Raw Signal"] --> EDGELOGIC["Edge Signal Evaluation"]
    EDGELOGIC --> KEEP["Keep Critical Change Points"]
    EDGELOGIC --> SKIP["Skip Reconstructable Flat Points"]
    EDGELOGIC --> LOCAL["Local InfluxDB"]
    LOCAL --> SITE["Site InfluxDB"]
    SITE --> CENTRAL["Central InfluxDB"]
    KEEP --> LOCAL
    SKIP -. reconstruction assumptions / metadata .-> SITE
    CENTRAL --> DASH["Fleet Dashboards / Analytics"]
```

## 5. Upsides Of This Direction

### 5.1 Strong separation between control and observability

Node-RED for runtime/orchestration and InfluxDB/Grafana for telemetry is still the right structural split:

- control stays close to the process
- telemetry storage/querying stays in time-series-native tooling
- dashboards do not need to overload Node-RED itself

### 5.2 Edge-first matches operational reality

For wastewater/process systems, edge-first remains correct:

- lower latency
- better degraded-mode behavior
- less dependence on WAN or central platform uptime
- clearer OT trust boundary

### 5.3 Site mediation improves safety and security

Using central as the enterprise/API entry point and site as the mediator improves posture:

- field systems are less exposed
- policy decisions can be centralized
- external integrations do not probe the edge directly
- site can continue operating even when upstream is degraded

### 5.4 Multi-level storage enables better analytics

Multiple Influx layers can support:

- local resilience
- site diagnostics
- fleet benchmarking
- smarter retention and reconstruction strategies

That is substantially more capable than a single central historian model.

### 5.5 `tagcodering` is the right long-term direction

A database-backed configuration authority is stronger than embedding configuration only in flows because it supports:

- machine metadata management
- controlled rollout of configuration changes
- clearer versioning and provenance
- future API-driven configuration services

## 6. Downsides And Risks

### 6.1 Smart storage raises algorithmic and governance complexity

Signal-aware storage and reconstruction is promising, but it creates architectural obligations:

- reconstruction rules must be explicit
- acceptable reconstruction error must be defined per signal type
- operators must know whether they see raw or reconstructed history
- compliance-relevant data may need stricter retention than operational convenience data

Without those rules, smart storage can become opaque and hard to trust.

### 6.2 Multi-level databases can create ownership confusion

If edge, site, and central all store telemetry, you must define:

- which layer is authoritative for which time horizon
- when backfill is allowed
- when data is summarized vs copied
- how duplicates or gaps are detected

Otherwise operations will argue over which trend is "the real one."

### 6.3 Central intelligence must remain advisory-first

Central guidance can become valuable, but direct closed-loop dependency on central would be risky.

The architecture should therefore preserve:

- local control authority at edge/site
- bounded and explicit central advice
- safe behavior if central recommendations stop arriving

### 6.4 `tagcodering` is not yet complete enough to lean on blindly

It is the right target, but its current partial state means there is still architecture debt:

- incomplete config workflows
- likely mismatch between desired and implemented schema behavior
- temporary duplication between flows, node config, and database-held metadata

This should be treated as a core platform workstream, not a side issue.

### 6.5 Broker responsibilities are still not crisp enough

The materials still reference MQTT/AMQP/RabbitMQ/brokers without one stable responsibility split. That needs to be resolved before large-scale deployment.

Questions still open:

- command bus or event bus?
- site-only or cross-site?
- telemetry transport or only synchronization/eventing?
- durability expectations and replay behavior?

## 7. Security And Regulatory Positioning

### 7.1 Purdue-style layering is a good fit

EVOLV's preferred structure aligns well with a Purdue-style OT/IT layering approach:

- PLCs and field assets stay at the operational edge
- edge runtimes stay close to the process
- site systems mediate between OT and broader enterprise concerns
- central services host APIs, identity, analytics, and engineering workflows

That is important because it supports segmented trust boundaries instead of direct enterprise-to-field reach-through.

### 7.2 NIS2 alignment

Directive (EU) 2022/2555 (NIS2) requires cybersecurity risk-management measures, incident handling, and stronger governance for covered entities.

This architecture supports that by:

- limiting direct exposure of field systems
- separating operational layers
- enabling central policy and oversight
- preserving local operation during upstream failure

### 7.3 CER alignment

Directive (EU) 2022/2557 (Critical Entities Resilience Directive) focuses on resilience of essential services.

The edge-plus-site approach supports that direction because:

- local/site layers can continue during central disruption
- essential service continuity does not depend on one central runtime
- degraded-mode behavior can be explicitly designed per layer

### 7.4 Cyber Resilience Act alignment

Regulation (EU) 2024/2847 (Cyber Resilience Act) creates cybersecurity requirements for products with digital elements.

For EVOLV, that means the platform should keep strengthening:

- secure configuration handling
- vulnerability and update management
- release traceability
- lifecycle ownership of components and dependencies

### 7.5 GDPR alignment where personal data is present

Regulation (EU) 2016/679 (GDPR) applies whenever EVOLV processes personal data.
The architecture helps by:

- centralizing ingress
- reducing unnecessary propagation of data to field layers
- making access, retention, and audit boundaries easier to define

### 7.6 What can and cannot be claimed

The defensible claim is that EVOLV can be deployed in a way that supports compliance with strict European cybersecurity and resilience expectations.

The non-defensible claim is that EVOLV is automatically compliant purely because of the architecture diagram.

Actual compliance still depends on implementation and operations, including:

- access control
- patch and vulnerability management
- incident response
- logging and audit evidence
- retention policy
- data classification

## 8. Recommended Ideal Stack

The ideal EVOLV stack should be layered around operational boundaries, not around tools.

### 8.1 Layer A: Edge execution

Purpose:

- connect to PLCs and field assets
- execute time-sensitive local logic
- preserve operation during WAN/central loss
- provide local telemetry access for resilience and digital-twin use cases

Recommended components:

- Node-RED runtime for EVOLV edge flows
- OPC UA and protocol adapters
- local InfluxDB
- optional local Grafana for local engineering/monitoring
- optional local broker only when multiple participants need decoupling

Principle:

- edge remains safe and useful when disconnected

### 8.2 Layer B: Site integration

Purpose:

- aggregate multiple edge systems at plant/site level
- host plant-local dashboards and diagnostics
- mediate between raw OT detail and central standardization
- serve as the protected step between field systems and central requests

Recommended components:

- site Node-RED / CoreSync services
- site InfluxDB
- site Grafana / SCADA-supporting dashboards
- site broker where asynchronous eventing is justified

Principle:

- site absorbs plant complexity and protects field assets

### 8.3 Layer C: Central platform

Purpose:

- fleet-wide analytics
- shared dashboards
- engineering lifecycle
- enterprise/API entry point
- overview intelligence and advisory logic

Recommended components:

- Gitea
- CI/CD
- central InfluxDB
- central Grafana
- API/integration gateway
- IAM
- VPN/private connectivity
- `tagcodering`-backed configuration services

Principle:

- central coordinates, advises, and governs; it is not the direct field caller

### 8.4 Cross-cutting platform services

These should be explicit architecture elements:

- secrets management
- certificate management
- backup/restore
- audit logging
- monitoring/alerting of the platform itself
- versioned configuration and schema management
- rollout/rollback strategy

## 9. Recommended Opinionated Choices

### 9.1 Keep Node-RED as the orchestration layer, not the whole platform

Node-RED should own:

- process orchestration
- protocol mediation
- edge/site logic
- KPI production

It should not become the sole owner of:

- identity
- long-term configuration authority
- secret management
- compliance/audit authority

### 9.2 Use InfluxDB by function and horizon

Recommended split:

- edge: resilience, local replay, digital-twin input
- site: plant diagnostics and local continuity
- central: fleet analytics, advisory intelligence, benchmarking, and long-term cross-site views

### 9.3 Prefer smart telemetry retention over naive point dumping

Recommended rule:

- keep information-rich points
- reduce information-poor flat spans
- document reconstruction assumptions
- define signal-class-specific fidelity expectations

This needs design discipline, but it is a real differentiator if executed well.
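
One way to make the retention rule above concrete is a linear-interpolation error bound: a point is dropped only if it can be reconstructed, within a per-signal tolerance, by a straight line between its retained neighbours. The sketch below is an illustrative stand-in, not the algorithm EVOLV has chosen; the function names, the `{ t, v }` point shape, and the tolerance value are assumptions, and timestamps are assumed to be strictly increasing.

```javascript
// Hypothetical sketch of reconstruction-aware retention using a
// linear-interpolation error bound. Points are { t, v } with strictly
// increasing t. This is NOT a confirmed EVOLV algorithm.

// True if every point strictly between `from` and `to` lies within
// `tolerance` of the straight line connecting points[from] and points[to].
function reconstructable(points, from, to, tolerance) {
  const a = points[from];
  const b = points[to];
  for (let j = from + 1; j < to; j++) {
    const ratio = (points[j].t - a.t) / (b.t - a.t);
    const estimate = a.v + ratio * (b.v - a.v);
    if (Math.abs(estimate - points[j].v) > tolerance) return false;
  }
  return true;
}

// Keep the first and last point, plus every point where extending the
// current straight segment would exceed the tolerance.
function compress(points, tolerance) {
  if (points.length <= 2) return points.slice();
  const kept = [points[0]];
  let anchor = 0; // index of the most recently kept point
  for (let i = 2; i < points.length; i++) {
    if (!reconstructable(points, anchor, i, tolerance)) {
      kept.push(points[i - 1]);
      anchor = i - 1;
    }
  }
  kept.push(points[points.length - 1]);
  return kept;
}

// Rebuild a value at time t from the kept points (linear interpolation).
function interpolate(kept, t) {
  for (let k = 1; k < kept.length; k++) {
    if (t <= kept[k].t) {
      const a = kept[k - 1];
      const b = kept[k];
      return a.v + ((t - a.t) / (b.t - a.t)) * (b.v - a.v);
    }
  }
  return kept[kept.length - 1].v;
}

// Audit helper: worst-case reconstruction error over the original points.
function maxReconstructionError(original, kept) {
  return Math.max(...original.map((p) => Math.abs(interpolate(kept, p.t) - p.v)));
}

// Example: a flat span, a ramp, then flat again.
const raw = [10, 10, 10, 10, 12, 14, 16, 16].map((v, t) => ({ t, v }));
const tolerance = 0.1; // assumed per-signal fidelity expectation
const kept = compress(raw, tolerance);
console.log(kept.map((p) => p.t), maxReconstructionError(raw, kept));
```

In this example only the points where the signal changes direction survive, and `maxReconstructionError` gives a number that can be stored alongside the data so reconstruction quality stays auditable per signal class.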
### 9.4 Put enterprise/API ingress at central, not at edge

This should become a hard architectural rule:

- external requests land centrally
- central authenticates and authorizes
- central or site mediates downward
- edge never becomes the exposed public integration surface

### 9.5 Make `tagcodering` the target configuration backbone

The architecture should be designed so that `tagcodering` can mature into:

- machine and asset registry
- configuration source of truth
- site/central configuration exchange point
- API-served configuration source for runtime layers

## 10. Suggested Phasing

### Phase 1: Stabilize contracts

- define topic and payload contracts
- define telemetry classes and reconstruction policy
- define asset, machine, and site identity model
- define `tagcodering` scope and schema ownership

### Phase 2: Harden local/site resilience

- formalize edge and site runtime patterns
- define local telemetry retention and replay behavior
- define central-loss behavior
- define dashboard behavior during isolation

### Phase 3: Harden central platform

- IAM
- API gateway
- central observability
- CI/CD
- backup and disaster recovery
- config services over `tagcodering`

### Phase 4: Introduce selective synchronization and intelligence

- event-driven telemetry propagation rules
- smart-storage promotion/backfill policies
- advisory services from central
- auditability of downward recommendations and configuration changes

## 11. Immediate Open Questions Before Wiki Finalization

1. Which signals are allowed to use reconstruction-aware smart storage, and which must remain raw or near-raw for audit/compliance reasons?
2. How should `tagcodering` be exposed to runtime layers: direct database access, a dedicated API, or both?
3. What exact responsibility split should EVOLV use between API synchronization and broker-based eventing?

## 12. Recommended Wiki Structure

The wiki should not be one long page. It should be split into:

1. platform overview with the main topology diagram
2. edge-site-central runtime model
3. telemetry and smart storage model
4. security and access-boundary model
5. configuration architecture centered on `tagcodering`

## 13. Next Step

Use this document as the architecture baseline. The companion markdown page in `architecture/` can then be shaped into a wiki-ready visual overview page with Mermaid diagrams and shorter human-readable sections.