EVOLV/architecture/stack-architecture-review.md
2026-03-23 11:23:24 +01:00

EVOLV Architecture Review

Purpose

This document captures:

  • the architecture implemented in this repository today
  • the broader edge/site/central architecture shown in the drawings under temp/
  • the key strengths and weaknesses of that direction
  • the currently preferred target stack based on owner decisions from this review

It is the local staging document for a later wiki update.

Evidence Used

Implemented stack evidence:

  • docker-compose.yml
  • docker/settings.js
  • docker/grafana/provisioning/datasources/influxdb.yaml
  • package.json
  • nodes/*

Target-state evidence:

  • temp/fullStack.pdf
  • temp/edge.pdf
  • temp/CoreSync.drawio.pdf
  • temp/cloud.yml

Owner decisions from this review:

  • local InfluxDB is required for operational resilience
  • central acts as the advisory/intelligence and API-entry layer, not as a direct field caller
  • intended configuration authority is the database-backed tagcodering model
  • architecture wiki pages should be visual, not text-only

1. What Exists Today

1.1 Product/runtime layer

The codebase is currently a modular Node-RED package for wastewater/process automation:

  • EVOLV ships custom Node-RED nodes for plant assets and process logic
  • nodes emit both process/control messages and telemetry-oriented outputs
  • shared helper logic lives in nodes/generalFunctions/
  • Grafana-facing integration exists through dashboardAPI and Influx-oriented outputs

1.2 Implemented development stack

The concrete development stack in this repository is:

  • Node-RED
  • InfluxDB 2.x
  • Grafana

That gives a clear local flow:

  1. EVOLV logic runs in Node-RED.
  2. Telemetry is emitted in a time-series-oriented shape.
  3. InfluxDB stores the telemetry.
  4. Grafana renders operational dashboards.
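Steps 2 and 3 of this flow hinge on shaping Node-RED messages into something InfluxDB can ingest. As a hedged sketch (the `measurement`/`tags`/`fields` message shape and function name are illustrative assumptions, not the actual EVOLV message contract), a telemetry message could be serialized to InfluxDB line protocol like this:

```javascript
// Sketch only: convert a telemetry-shaped Node-RED message into
// InfluxDB line protocol (measurement,tag=v field=v timestamp).
// The message shape here is an assumption for illustration.
function toLineProtocol(msg) {
  const tags = Object.entries(msg.tags || {})
    .map(([k, v]) => `,${k}=${v}`)
    .join('');
  const fields = Object.entries(msg.fields)
    .map(([k, v]) => `${k}=${v}`)
    .join(',');
  return `${msg.measurement}${tags} ${fields} ${msg.timestamp}`;
}
```

For example, `toLineProtocol({ measurement: 'flow', tags: { asset: 'P-101' }, fields: { value: 12.3 }, timestamp: 1700000000000 })` yields a single line-protocol record ready for an InfluxDB write endpoint.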

1.3 Existing runtime pattern in the nodes

A recurring EVOLV pattern is:

  • output 0: process/control message
  • output 1: Influx/telemetry message
  • output 2: registration/control plumbing where relevant

Even in its currently implemented form, EVOLV is more than a Node-RED project: it is already a control-plus-observability platform, with Node-RED as the orchestration runtime and InfluxDB/Grafana as the telemetry and visualization services.
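The three-output pattern above can be sketched as follows. This is an illustrative reconstruction, not actual EVOLV node code; `buildOutputs`, the asset id, and the message shapes are assumptions:

```javascript
// Pure helper: fan one input message out into the three output slots
// described above (process, telemetry, registration). Illustrative only.
function buildOutputs(msg, assetId) {
  const now = Date.now();
  const processMsg = { ...msg, topic: `process/${assetId}` };        // output 0
  const telemetryMsg = {                                             // output 1
    measurement: assetId,
    fields: { value: msg.payload },
    timestamp: now,
  };
  const registrationMsg = msg.register                               // output 2
    ? { topic: `register/${assetId}`, payload: { assetId, at: now } }
    : null;
  return [processMsg, telemetryMsg, registrationMsg];
}

// Inside a Node-RED node this would be wired roughly as:
// module.exports = function (RED) {
//   function EvolvAsset(config) {
//     RED.nodes.createNode(this, config);
//     this.on('input', (msg, send, done) => {
//       send(buildOutputs(msg, config.assetId));
//       done();
//     });
//   }
//   RED.nodes.registerType('evolv-asset', EvolvAsset);
// };
```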

2. What The Drawings Describe

Across temp/fullStack.pdf and temp/CoreSync.drawio.pdf, the intended platform is broader and layered.

2.1 Edge / OT layer

The drawings consistently place these capabilities at the edge:

  • PLC / OPC UA connectivity
  • Node-RED container as protocol translator and logic runtime
  • local broker in some variants
  • local InfluxDB / Prometheus style storage in some variants
  • local Grafana/SCADA in some variants

This is the plant-side operational layer.

2.2 Site / local server layer

The CoreSync drawings also show a site aggregation layer:

  • RWZI-local server
  • Node-RED / CoreSync services
  • site-local broker
  • site-local database
  • upward API-based synchronization

This layer decouples field assets from central services and absorbs plant-specific complexity.

2.3 Central / cloud layer

The broader stack drawings and temp/cloud.yml show a central platform layer with:

  • Gitea
  • Jenkins
  • reverse proxy / ingress
  • Grafana
  • InfluxDB
  • Node-RED
  • RabbitMQ / messaging
  • VPN / tunnel concepts
  • Keycloak in the drawing
  • Portainer in the drawing

This is a platform-services layer, not just an application runtime.

3. Architecture Decisions From This Review

These decisions now shape the preferred EVOLV target architecture.

3.1 Local telemetry is mandatory for resilience

Local InfluxDB is not optional. It is required so that:

  • operations continue when central SCADA or central services are down
  • local dashboards and advanced digital-twin workflows can still consume recent and relevant process history
  • local edge/site layers can make smarter decisions without depending on round-trips to central

3.2 Multi-level InfluxDB is part of the architecture

InfluxDB should exist on multiple levels where it adds operational value:

  • edge/local for resilience and near-real-time replay
  • site for plant-level history, diagnostics, and resilience
  • central for fleet-wide analytics, benchmarking, and advisory intelligence

This is not just copy-paste storage at each level. The design intent is event-driven and selective.

3.3 Storage should be smart, not only deadband-driven

The target is neither a naive "store every point" policy nor a fixed deadband rule such as 1%.

The desired storage approach is:

  • observe signal slope and change behavior
  • preserve points where state is changing materially
  • store fewer points where the signal can be reconstructed downstream with sufficient fidelity
  • carry enough metadata or conventions so reconstruction quality is auditable

This implies EVOLV should evolve toward smart storage and signal-aware retention rather than naive event dumping.
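The retention approach above can be sketched as a slope-aware retainer, in the spirit of a simplified swinging-door compressor. This is a sketch under assumptions, not EVOLV code; the `tolerance` parameter and linear-extrapolation rule are illustrative:

```javascript
// Illustrative slope-aware retention: keep a point when a linear
// extrapolation from the last kept point misses the new value by more
// than `tolerance`; otherwise the point is reconstructable downstream.
function makeRetainer(tolerance) {
  let last = null; // last kept point { t, v }
  let slope = 0;   // slope carried from the last kept point
  return function shouldStore(t, v) {
    if (last === null) {        // always keep the first point
      last = { t, v };
      return true;
    }
    const predicted = last.v + slope * (t - last.t);
    if (Math.abs(v - predicted) > tolerance) {
      slope = (v - last.v) / (t - last.t); // state is changing materially
      last = { t, v };
      return true;                         // keep
    }
    return false;                          // flat/predictable: skip
  };
}
```

A flat signal then stores almost nothing, while a step or ramp change is preserved; the `tolerance` per signal class is exactly the kind of metadata section 3.3 says must be carried for auditable reconstruction.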

3.4 Central is the intelligence and API-entry layer

Central may advise and coordinate edge/site layers, but external API requests should not hit field-edge systems directly.

The intended pattern is:

  • external and enterprise integrations terminate centrally
  • central evaluates, aggregates, authorizes, and advises
  • site/edge layers receive mediated requests, policies, or setpoints
  • field-edge remains protected behind an intermediate layer

This aligns with the stated security direction.
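The mediation rule above can be illustrated with a minimal policy gate at central. Everything here is a hypothetical sketch (policy table, `routeRequest`, topic naming); the point is only that external requests are evaluated centrally and only bounded, vetted instructions are forwarded to a site layer, never to field-edge:

```javascript
// Hypothetical central policy gate. Names and limits are illustrative.
const policies = {
  setpoint: { allowedRoles: ['operator', 'engineer'], maxValue: 100 },
};

function routeRequest(req) {
  const policy = policies[req.type];
  if (!policy) return { forwarded: false, reason: 'unknown request type' };
  if (!policy.allowedRoles.includes(req.role)) {
    return { forwarded: false, reason: 'not authorized' };
  }
  if (req.value > policy.maxValue) {
    return { forwarded: false, reason: 'out of bounds' };
  }
  // Forward a mediated message toward the site layer, never to field-edge.
  return {
    forwarded: true,
    target: `site/${req.site}/commands`,
    payload: { type: req.type, value: req.value },
  };
}
```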

3.5 Configuration source of truth should be database-backed

The intended configuration authority is the database-backed tagcodering model, which already exists but is not yet complete enough to serve as the fully realized source of truth.

That means the architecture should assume:

  • asset and machine metadata belong in tagcodering
  • Node-RED flows should consume configuration rather than silently becoming the only configuration store
  • more work is still needed before this behaves as the intended central configuration backbone
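One way flows could consume tagcodering rather than hard-coding metadata is a cache-backed lookup in front of whatever access path tagcodering eventually exposes. This is a sketch under assumptions: `fetchTagConfig` stands in for the (not yet finalized) database or API access, and the TTL-based cache is one possible staleness policy:

```javascript
// Sketch: cache-backed config lookup so runtime flows consume the
// tagcodering source of truth instead of embedding configuration.
// `fetchTagConfig` is a placeholder for the eventual DB/API access.
function makeConfigCache(fetchTagConfig, ttlMs) {
  const cache = new Map();
  return async function getConfig(tag) {
    const hit = cache.get(tag);
    if (hit && Date.now() - hit.at < ttlMs) return hit.cfg; // fresh enough
    const cfg = await fetchTagConfig(tag);                  // authoritative source
    cache.set(tag, { cfg, at: Date.now() });
    return cfg;
  };
}
```

The design intent is that Node-RED nodes call `getConfig(tag)` at runtime, so a configuration change in tagcodering propagates without redeploying flows.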

4. Visual Model

4.1 Platform topology

```mermaid
flowchart LR
    subgraph OT["OT / Field"]
        PLC["PLC / IO"]
        DEV["Sensors / Machines"]
    end

    subgraph EDGE["Edge Layer"]
        ENR["Edge Node-RED"]
        EDB["Local InfluxDB"]
        EUI["Local Grafana / Local Monitoring"]
        EBR["Optional Local Broker"]
    end

    subgraph SITE["Site Layer"]
        SNR["Site Node-RED / CoreSync"]
        SDB["Site InfluxDB"]
        SUI["Site Grafana / SCADA Support"]
        SBR["Site Broker"]
    end

    subgraph CENTRAL["Central Layer"]
        API["API / Integration Gateway"]
        INTEL["Overview Intelligence / Advisory Logic"]
        CDB["Central InfluxDB"]
        CGR["Central Grafana"]
        CFG["Tagcodering Config Model"]
        GIT["Gitea"]
        CI["CI/CD"]
        IAM["IAM / Keycloak"]
    end

    DEV --> PLC
    PLC --> ENR
    ENR --> EDB
    ENR --> EUI
    ENR --> EBR
    ENR <--> SNR
    EDB <--> SDB
    SNR --> SDB
    SNR --> SUI
    SNR --> SBR
    SNR <--> API
    API --> INTEL
    API <--> CFG
    SDB <--> CDB
    INTEL --> SNR
    CGR --> CDB
    CI --> GIT
    IAM --> API
    IAM --> CGR
```
4.2 Command and access boundary

```mermaid
flowchart TD
    EXT["External APIs / Enterprise Requests"] --> API["Central API Gateway"]
    API --> AUTH["AuthN/AuthZ / Policy Checks"]
    AUTH --> INTEL["Central Advisory / Decision Support"]
    INTEL --> SITE["Site Integration Layer"]
    SITE --> EDGE["Edge Runtime"]
    EDGE --> PLC["PLC / Field Assets"]

    EXT -. no direct access .-> EDGE
    EXT -. no direct access .-> PLC
```

4.3 Smart telemetry flow

```mermaid
flowchart LR
    RAW["Raw Signal"] --> EDGELOGIC["Edge Signal Evaluation"]
    EDGELOGIC --> KEEP["Keep Critical Change Points"]
    EDGELOGIC --> SKIP["Skip Reconstructable Flat Points"]
    EDGELOGIC --> LOCAL["Local InfluxDB"]
    LOCAL --> SITE["Site InfluxDB"]
    SITE --> CENTRAL["Central InfluxDB"]
    KEEP --> LOCAL
    SKIP -. reconstruction assumptions / metadata .-> SITE
    CENTRAL --> DASH["Fleet Dashboards / Analytics"]
```
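The "skip reconstructable points" branch of this flow implies a downstream reconstruction step. A minimal sketch, assuming the documented interpolation rule is linear (the function name and point shape are illustrative):

```javascript
// Sketch: reconstruct the value at time t from the kept points,
// assuming skipped spans are declared linearly interpolable.
function reconstruct(keptPoints, t) {
  // keptPoints: [{ t, v }, ...] sorted ascending by t
  for (let i = 1; i < keptPoints.length; i++) {
    const a = keptPoints[i - 1];
    const b = keptPoints[i];
    if (t >= a.t && t <= b.t) {
      return a.v + ((b.v - a.v) * (t - a.t)) / (b.t - a.t); // linear interpolation
    }
  }
  return null; // outside the covered range: not reconstructable
}
```

This is the counterpart of the retention logic at the edge: whatever rule decides to skip points must be the same rule dashboards and analytics use to fill them back in, which is why section 3.3 insists the assumptions travel as metadata.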

5. Upsides Of This Direction

5.1 Strong separation between control and observability

Node-RED for runtime/orchestration and InfluxDB/Grafana for telemetry is still the right structural split:

  • control stays close to the process
  • telemetry storage/querying stays in time-series-native tooling
  • dashboards do not need to overload Node-RED itself

5.2 Edge-first matches operational reality

For wastewater/process systems, edge-first remains correct:

  • lower latency
  • better degraded-mode behavior
  • less dependence on WAN or central platform uptime
  • clearer OT trust boundary

5.3 Site mediation improves safety and security

Using central as the enterprise/API entry point and site as the mediator improves posture:

  • field systems are less exposed
  • policy decisions can be centralized
  • external integrations do not probe the edge directly
  • site can continue operating even when upstream is degraded

5.4 Multi-level storage enables better analytics

Multiple Influx layers can support:

  • local resilience
  • site diagnostics
  • fleet benchmarking
  • smarter retention and reconstruction strategies

That is substantially more capable than a single central historian model.

5.5 tagcodering is the right long-term direction

A database-backed configuration authority is stronger than embedding configuration only in flows because it supports:

  • machine metadata management
  • controlled rollout of configuration changes
  • clearer versioning and provenance
  • future API-driven configuration services

6. Downsides And Risks

6.1 Smart storage raises algorithmic and governance complexity

Signal-aware storage and reconstruction are promising, but they create architectural obligations:

  • reconstruction rules must be explicit
  • acceptable reconstruction error must be defined per signal type
  • operators must know whether they see raw or reconstructed history
  • compliance-relevant data may need stricter retention than operational convenience data

Without those rules, smart storage can become opaque and hard to trust.

6.2 Multi-level databases can create ownership confusion

If edge, site, and central all store telemetry, you must define:

  • which layer is authoritative for which time horizon
  • when backfill is allowed
  • when data is summarized vs copied
  • how duplicates or gaps are detected

Otherwise operations will argue over which trend is "the real one."
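Duplicate and gap detection, at least, can be mechanical once each signal class declares an expected interval. A hedged sketch (sorted-timestamp input and the "gap = more than twice the expected interval" threshold are assumptions for illustration):

```javascript
// Illustrative series audit: flag duplicate timestamps and gaps,
// assuming timestamps are sorted ascending (ms) and the expected
// interval comes from the signal's declared class.
function auditSeries(timestamps, expectedIntervalMs) {
  const duplicates = [];
  const gaps = [];
  for (let i = 1; i < timestamps.length; i++) {
    const dt = timestamps[i] - timestamps[i - 1];
    if (dt === 0) {
      duplicates.push(timestamps[i]);
    } else if (dt > 2 * expectedIntervalMs) {
      gaps.push({ from: timestamps[i - 1], to: timestamps[i] });
    }
  }
  return { duplicates, gaps };
}
```

Running this per layer and comparing results is one concrete way to settle "which trend is the real one" with data instead of argument.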

6.3 Central intelligence must remain advisory-first

Central guidance can become valuable, but direct closed-loop dependency on central would be risky.

The architecture should therefore preserve:

  • local control authority at edge/site
  • bounded and explicit central advice
  • safe behavior if central recommendations stop arriving

6.4 tagcodering is not yet complete enough to lean on blindly

It is the right target, but its current partial state means there is still architecture debt:

  • incomplete config workflows
  • likely mismatch between desired and implemented schema behavior
  • temporary duplication between flows, node config, and database-held metadata

This should be treated as a core platform workstream, not a side issue.

6.5 Broker responsibilities are still not crisp enough

The materials still reference MQTT/AMQP/RabbitMQ/brokers without one stable responsibility split. That needs to be resolved before large-scale deployment.

Questions still open:

  • command bus or event bus?
  • site-only or cross-site?
  • telemetry transport or only synchronization/eventing?
  • durability expectations and replay behavior?

7. Ideal EVOLV Target Stack

The ideal EVOLV stack should be layered around operational boundaries, not around tools.

7.1 Layer A: Edge execution

Purpose:

  • connect to PLCs and field assets
  • execute time-sensitive local logic
  • preserve operation during WAN/central loss
  • provide local telemetry access for resilience and digital-twin use cases

Recommended components:

  • Node-RED runtime for EVOLV edge flows
  • OPC UA and protocol adapters
  • local InfluxDB
  • optional local Grafana for local engineering/monitoring
  • optional local broker only when multiple participants need decoupling

Principle:

  • edge remains safe and useful when disconnected

7.2 Layer B: Site integration

Purpose:

  • aggregate multiple edge systems at plant/site level
  • host plant-local dashboards and diagnostics
  • mediate between raw OT detail and central standardization
  • serve as the protected step between field systems and central requests

Recommended components:

  • site Node-RED / CoreSync services
  • site InfluxDB
  • site Grafana / SCADA-supporting dashboards
  • site broker where asynchronous eventing is justified

Principle:

  • site absorbs plant complexity and protects field assets

7.3 Layer C: Central platform

Purpose:

  • fleet-wide analytics
  • shared dashboards
  • engineering lifecycle
  • enterprise/API entry point
  • overview intelligence and advisory logic

Recommended components:

  • Gitea
  • CI/CD
  • central InfluxDB
  • central Grafana
  • API/integration gateway
  • IAM
  • VPN/private connectivity
  • tagcodering-backed configuration services

Principle:

  • central coordinates, advises, and governs; it is not the direct field caller

7.4 Cross-cutting platform services

These should be explicit architecture elements:

  • secrets management
  • certificate management
  • backup/restore
  • audit logging
  • monitoring/alerting of the platform itself
  • versioned configuration and schema management
  • rollout/rollback strategy

8. Key Architecture Recommendations

8.1 Keep Node-RED as the orchestration layer, not the whole platform

Node-RED should own:

  • process orchestration
  • protocol mediation
  • edge/site logic
  • KPI production

It should not become the sole owner of:

  • identity
  • long-term configuration authority
  • secret management
  • compliance/audit authority

8.2 Use InfluxDB by function and horizon

Recommended split:

  • edge: resilience, local replay, digital-twin input
  • site: plant diagnostics and local continuity
  • central: fleet analytics, advisory intelligence, benchmarking, and long-term cross-site views

8.3 Prefer smart telemetry retention over naive point dumping

Recommended rule:

  • keep information-rich points
  • reduce information-poor flat spans
  • document reconstruction assumptions
  • define signal-class-specific fidelity expectations

This needs design discipline, but it is a real differentiator if executed well.

8.4 Put enterprise/API ingress at central, not at edge

This should become a hard architectural rule:

  • external requests land centrally
  • central authenticates and authorizes
  • central or site mediates downward
  • edge never becomes the exposed public integration surface

8.5 Make tagcodering the target configuration backbone

The architecture should be designed so that tagcodering can mature into:

  • machine and asset registry
  • configuration source of truth
  • site/central configuration exchange point
  • API-served configuration source for runtime layers

9. Suggested Phasing

Phase 1: Stabilize contracts

  • define topic and payload contracts
  • define telemetry classes and reconstruction policy
  • define asset, machine, and site identity model
  • define tagcodering scope and schema ownership
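Pinning down Phase 1 payload contracts could look like a small declarative contract plus a validator; the field names and type table here are hypothetical placeholders, not the agreed contract:

```javascript
// Hypothetical Phase 1 telemetry contract and a minimal validator.
// Field names are placeholders pending the real contract definition.
const telemetryContract = {
  required: ['site', 'asset', 'signal', 'value', 'ts'],
  types: { site: 'string', asset: 'string', signal: 'string', value: 'number', ts: 'number' },
};

function validatePayload(payload, contract) {
  const errors = [];
  for (const key of contract.required) {
    if (!(key in payload)) {
      errors.push(`missing: ${key}`);
      continue;
    }
    if (typeof payload[key] !== contract.types[key]) {
      errors.push(`bad type: ${key}`);
    }
  }
  return errors; // empty array means the payload conforms
}
```

Agreeing on even this minimal shape early keeps edge, site, and central from each inventing an incompatible message format.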

Phase 2: Harden local/site resilience

  • formalize edge and site runtime patterns
  • define local telemetry retention and replay behavior
  • define central-loss behavior
  • define dashboard behavior during isolation

Phase 3: Harden central platform

  • IAM
  • API gateway
  • central observability
  • CI/CD
  • backup and disaster recovery
  • config services over tagcodering

Phase 4: Introduce selective synchronization and intelligence

  • event-driven telemetry propagation rules
  • smart-storage promotion/backfill policies
  • advisory services from central
  • auditability of downward recommendations and configuration changes

10. Immediate Open Questions Before Wiki Finalization

  1. Which signals are allowed to use reconstruction-aware smart storage, and which must remain raw or near-raw for audit/compliance reasons?
  2. How should tagcodering be exposed to runtime layers: direct database access, a dedicated API, or both?
  3. What exact responsibility split should EVOLV use between API synchronization and broker-based eventing?

11. Wiki Page Structure

The wiki should not be one long page. It should be split into:

  1. platform overview with the main topology diagram
  2. edge-site-central runtime model
  3. telemetry and smart storage model
  4. security and access-boundary model
  5. configuration architecture centered on tagcodering

12. Next Step

Use this document as the architecture baseline. The companion markdown page in architecture/ can then be shaped into a wiki-ready visual overview page with Mermaid diagrams and shorter human-readable sections.