EVOLV/architecture/stack-architecture-review.md
2026-03-23 11:23:24 +01:00

EVOLV Architecture Review

Purpose

This document captures:

  • the architecture implemented in this repository today
  • the broader edge/site/central architecture shown in the drawings under temp/
  • the key strengths and weaknesses of that direction
  • the currently preferred target stack based on owner decisions from this review

It is the local staging document for a later wiki update.

Evidence Used

Implemented stack evidence:

  • docker-compose.yml
  • docker/settings.js
  • docker/grafana/provisioning/datasources/influxdb.yaml
  • package.json
  • nodes/*

Target-state evidence:

  • temp/fullStack.pdf
  • temp/edge.pdf
  • temp/CoreSync.drawio.pdf
  • temp/cloud.yml

Owner decisions from this review:

  • local InfluxDB is required for operational resilience
  • central acts as the advisory/intelligence and API-entry layer, not as a direct field caller
  • intended configuration authority is the database-backed tagcodering model
  • architecture wiki pages should be visual, not text-only

1. What Exists Today

1.1 Product/runtime layer

The codebase is currently a modular Node-RED package for wastewater/process automation:

  • EVOLV ships custom Node-RED nodes for plant assets and process logic
  • nodes emit both process/control messages and telemetry-oriented outputs
  • shared helper logic lives in nodes/generalFunctions/
  • Grafana-facing integration exists through dashboardAPI and Influx-oriented outputs

1.2 Implemented development stack

The concrete development stack in this repository is:

  • Node-RED
  • InfluxDB 2.x
  • Grafana

That gives a clear local flow:

  1. EVOLV logic runs in Node-RED.
  2. Telemetry is emitted in a time-series-oriented shape.
  3. InfluxDB stores the telemetry.
  4. Grafana renders operational dashboards.
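Steps 2 and 3 of this flow hinge on shaping Node-RED messages into something InfluxDB can ingest. As a hedged sketch (the `measurement`/`tags`/`fields` message shape and function name are illustrative assumptions, not the actual EVOLV message contract), a telemetry message could be serialized to InfluxDB line protocol like this:

```javascript
// Sketch only: convert a telemetry-shaped Node-RED message into
// InfluxDB line protocol (measurement,tag=v field=v timestamp).
// The message shape here is an assumption for illustration.
function toLineProtocol(msg) {
  const tags = Object.entries(msg.tags || {})
    .map(([k, v]) => `,${k}=${v}`)
    .join('');
  const fields = Object.entries(msg.fields)
    .map(([k, v]) => `${k}=${v}`)
    .join(',');
  return `${msg.measurement}${tags} ${fields} ${msg.timestamp}`;
}
```

For example, `toLineProtocol({ measurement: 'flow', tags: { asset: 'P-101' }, fields: { value: 12.3 }, timestamp: 1700000000000 })` yields a single line-protocol record ready for an InfluxDB write endpoint.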

1.3 Existing runtime pattern in the nodes

A recurring EVOLV pattern is:

  • output 0: process/control message
  • output 1: Influx/telemetry message
  • output 2: registration/control plumbing where relevant

Even in its currently implemented form, EVOLV is more than a Node-RED project: it is already a control-plus-observability platform, with Node-RED as the orchestration runtime and InfluxDB/Grafana as the telemetry and visualization services.
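The three-output pattern above can be sketched as follows. This is an illustrative reconstruction, not actual EVOLV node code; `buildOutputs`, the asset id, and the message shapes are assumptions:

```javascript
// Pure helper: fan one input message out into the three output slots
// described above (process, telemetry, registration). Illustrative only.
function buildOutputs(msg, assetId) {
  const now = Date.now();
  const processMsg = { ...msg, topic: `process/${assetId}` };        // output 0
  const telemetryMsg = {                                             // output 1
    measurement: assetId,
    fields: { value: msg.payload },
    timestamp: now,
  };
  const registrationMsg = msg.register                               // output 2
    ? { topic: `register/${assetId}`, payload: { assetId, at: now } }
    : null;
  return [processMsg, telemetryMsg, registrationMsg];
}

// Inside a Node-RED node this would be wired roughly as:
// module.exports = function (RED) {
//   function EvolvAsset(config) {
//     RED.nodes.createNode(this, config);
//     this.on('input', (msg, send, done) => {
//       send(buildOutputs(msg, config.assetId));
//       done();
//     });
//   }
//   RED.nodes.registerType('evolv-asset', EvolvAsset);
// };
```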

2. What The Drawings Describe

Across temp/fullStack.pdf and temp/CoreSync.drawio.pdf, the intended platform is broader and layered.

2.1 Edge / OT layer

The drawings consistently place these capabilities at the edge:

  • PLC / OPC UA connectivity
  • Node-RED container as protocol translator and logic runtime
  • local broker in some variants
  • local InfluxDB / Prometheus style storage in some variants
  • local Grafana/SCADA in some variants

This is the plant-side operational layer.

2.2 Site / local server layer

The CoreSync drawings also show a site aggregation layer:

  • RWZI-local server
  • Node-RED / CoreSync services
  • site-local broker
  • site-local database
  • upward API-based synchronization

This layer decouples field assets from central services and absorbs plant-specific complexity.

2.3 Central / cloud layer

The broader stack drawings and temp/cloud.yml show a central platform layer with:

  • Gitea
  • Jenkins
  • reverse proxy / ingress
  • Grafana
  • InfluxDB
  • Node-RED
  • RabbitMQ / messaging
  • VPN / tunnel concepts
  • Keycloak in the drawing
  • Portainer in the drawing

This is a platform-services layer, not just an application runtime.

3. Architecture Decisions From This Review

These decisions now shape the preferred EVOLV target architecture.

3.1 Local telemetry is mandatory for resilience

Local InfluxDB is not optional. It is required so that:

  • operations continue when central SCADA or central services are down
  • local dashboards and advanced digital-twin workflows can still consume recent and relevant process history
  • local edge/site layers can make smarter decisions without depending on round-trips to central

3.2 Multi-level InfluxDB is part of the architecture

InfluxDB should exist on multiple levels where it adds operational value:

  • edge/local for resilience and near-real-time replay
  • site for plant-level history, diagnostics, and resilience
  • central for fleet-wide analytics, benchmarking, and advisory intelligence

This is not just copy-paste storage at each level. The design intent is event-driven and selective.

3.3 Storage should be smart, not only deadband-driven

The target is neither a naive "store every point" policy nor a fixed deadband rule such as 1%.

The desired storage approach is:

  • observe signal slope and change behavior
  • preserve points where state is changing materially
  • store fewer points where the signal can be reconstructed downstream with sufficient fidelity
  • carry enough metadata or conventions so reconstruction quality is auditable

This implies EVOLV should evolve toward smart storage and signal-aware retention rather than naive event dumping.
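The retention approach above can be sketched as a slope-aware retainer, in the spirit of a simplified swinging-door compressor. This is a sketch under assumptions, not EVOLV code; the `tolerance` parameter and linear-extrapolation rule are illustrative:

```javascript
// Illustrative slope-aware retention: keep a point when a linear
// extrapolation from the last kept point misses the new value by more
// than `tolerance`; otherwise the point is reconstructable downstream.
function makeRetainer(tolerance) {
  let last = null; // last kept point { t, v }
  let slope = 0;   // slope carried from the last kept point
  return function shouldStore(t, v) {
    if (last === null) {        // always keep the first point
      last = { t, v };
      return true;
    }
    const predicted = last.v + slope * (t - last.t);
    if (Math.abs(v - predicted) > tolerance) {
      slope = (v - last.v) / (t - last.t); // state is changing materially
      last = { t, v };
      return true;                         // keep
    }
    return false;                          // flat/predictable: skip
  };
}
```

A flat signal then stores almost nothing, while a step or ramp change is preserved; the `tolerance` per signal class is exactly the kind of metadata section 3.3 says must be carried for auditable reconstruction.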

3.4 Central is the intelligence and API-entry layer

Central may advise and coordinate edge/site layers, but external API requests should not hit field-edge systems directly.

The intended pattern is:

  • external and enterprise integrations terminate centrally
  • central evaluates, aggregates, authorizes, and advises
  • site/edge layers receive mediated requests, policies, or setpoints
  • field-edge remains protected behind an intermediate layer

This aligns with the stated security direction.
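The mediation rule above can be illustrated with a minimal policy gate at central. Everything here is a hypothetical sketch (policy table, `routeRequest`, topic naming); the point is only that external requests are evaluated centrally and only bounded, vetted instructions are forwarded to a site layer, never to field-edge:

```javascript
// Hypothetical central policy gate. Names and limits are illustrative.
const policies = {
  setpoint: { allowedRoles: ['operator', 'engineer'], maxValue: 100 },
};

function routeRequest(req) {
  const policy = policies[req.type];
  if (!policy) return { forwarded: false, reason: 'unknown request type' };
  if (!policy.allowedRoles.includes(req.role)) {
    return { forwarded: false, reason: 'not authorized' };
  }
  if (req.value > policy.maxValue) {
    return { forwarded: false, reason: 'out of bounds' };
  }
  // Forward a mediated message toward the site layer, never to field-edge.
  return {
    forwarded: true,
    target: `site/${req.site}/commands`,
    payload: { type: req.type, value: req.value },
  };
}
```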

3.5 Configuration source of truth should be database-backed

The intended configuration authority is the database-backed tagcodering model, which already exists but is not yet complete enough to serve as the fully realized source of truth.

That means the architecture should assume:

  • asset and machine metadata belong in tagcodering
  • Node-RED flows should consume configuration rather than silently becoming the only configuration store
  • more work is still needed before this behaves as the intended central configuration backbone
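One way flows could consume tagcodering rather than hard-coding metadata is a cache-backed lookup in front of whatever access path tagcodering eventually exposes. This is a sketch under assumptions: `fetchTagConfig` stands in for the (not yet finalized) database or API access, and the TTL-based cache is one possible staleness policy:

```javascript
// Sketch: cache-backed config lookup so runtime flows consume the
// tagcodering source of truth instead of embedding configuration.
// `fetchTagConfig` is a placeholder for the eventual DB/API access.
function makeConfigCache(fetchTagConfig, ttlMs) {
  const cache = new Map();
  return async function getConfig(tag) {
    const hit = cache.get(tag);
    if (hit && Date.now() - hit.at < ttlMs) return hit.cfg; // fresh enough
    const cfg = await fetchTagConfig(tag);                  // authoritative source
    cache.set(tag, { cfg, at: Date.now() });
    return cfg;
  };
}
```

The design intent is that Node-RED nodes call `getConfig(tag)` at runtime, so a configuration change in tagcodering propagates without redeploying flows.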

4. Visual Model

4.1 Platform topology

```mermaid
flowchart LR
    subgraph OT["OT / Field"]
        PLC["PLC / IO"]
        DEV["Sensors / Machines"]
    end

    subgraph EDGE["Edge Layer"]
        ENR["Edge Node-RED"]
        EDB["Local InfluxDB"]
        EUI["Local Grafana / Local Monitoring"]
        EBR["Optional Local Broker"]
    end

    subgraph SITE["Site Layer"]
        SNR["Site Node-RED / CoreSync"]
        SDB["Site InfluxDB"]
        SUI["Site Grafana / SCADA Support"]
        SBR["Site Broker"]
    end

    subgraph CENTRAL["Central Layer"]
        API["API / Integration Gateway"]
        INTEL["Overview Intelligence / Advisory Logic"]
        CDB["Central InfluxDB"]
        CGR["Central Grafana"]
        CFG["Tagcodering Config Model"]
        GIT["Gitea"]
        CI["CI/CD"]
        IAM["IAM / Keycloak"]
    end

    DEV --> PLC
    PLC --> ENR
    ENR --> EDB
    ENR --> EUI
    ENR --> EBR
    ENR <--> SNR
    EDB <--> SDB
    SNR --> SDB
    SNR --> SUI
    SNR --> SBR
    SNR <--> API
    API --> INTEL
    API <--> CFG
    SDB <--> CDB
    INTEL --> SNR
    CGR --> CDB
    CI --> GIT
    IAM --> API
    IAM --> CGR
```
4.2 Command and access boundary

```mermaid
flowchart TD
    EXT["External APIs / Enterprise Requests"] --> API["Central API Gateway"]
    API --> AUTH["AuthN/AuthZ / Policy Checks"]
    AUTH --> INTEL["Central Advisory / Decision Support"]
    INTEL --> SITE["Site Integration Layer"]
    SITE --> EDGE["Edge Runtime"]
    EDGE --> PLC["PLC / Field Assets"]

    EXT -. no direct access .-> EDGE
    EXT -. no direct access .-> PLC
```

4.3 Smart telemetry flow

```mermaid
flowchart LR
    RAW["Raw Signal"] --> EDGELOGIC["Edge Signal Evaluation"]
    EDGELOGIC --> KEEP["Keep Critical Change Points"]
    EDGELOGIC --> SKIP["Skip Reconstructable Flat Points"]
    EDGELOGIC --> LOCAL["Local InfluxDB"]
    LOCAL --> SITE["Site InfluxDB"]
    SITE --> CENTRAL["Central InfluxDB"]
    KEEP --> LOCAL
    SKIP -. reconstruction assumptions / metadata .-> SITE
    CENTRAL --> DASH["Fleet Dashboards / Analytics"]
```
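The "skip reconstructable points" branch of this flow implies a downstream reconstruction step. A minimal sketch, assuming the documented interpolation rule is linear (the function name and point shape are illustrative):

```javascript
// Sketch: reconstruct the value at time t from the kept points,
// assuming skipped spans are declared linearly interpolable.
function reconstruct(keptPoints, t) {
  // keptPoints: [{ t, v }, ...] sorted ascending by t
  for (let i = 1; i < keptPoints.length; i++) {
    const a = keptPoints[i - 1];
    const b = keptPoints[i];
    if (t >= a.t && t <= b.t) {
      return a.v + ((b.v - a.v) * (t - a.t)) / (b.t - a.t); // linear interpolation
    }
  }
  return null; // outside the covered range: not reconstructable
}
```

This is the counterpart of the retention logic at the edge: whatever rule decides to skip points must be the same rule dashboards and analytics use to fill them back in, which is why section 3.3 insists the assumptions travel as metadata.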

5. Upsides Of This Direction

5.1 Strong separation between control and observability

Node-RED for runtime/orchestration and InfluxDB/Grafana for telemetry is still the right structural split:

  • control stays close to the process
  • telemetry storage/querying stays in time-series-native tooling
  • dashboards do not need to overload Node-RED itself

5.2 Edge-first matches operational reality

For wastewater/process systems, edge-first remains correct:

  • lower latency
  • better degraded-mode behavior
  • less dependence on WAN or central platform uptime
  • clearer OT trust boundary

5.3 Site mediation improves safety and security

Using central as the enterprise/API entry point and site as the mediator improves posture:

  • field systems are less exposed
  • policy decisions can be centralized
  • external integrations do not probe the edge directly
  • site can continue operating even when upstream is degraded

5.4 Multi-level storage enables better analytics

Multiple Influx layers can support:

  • local resilience
  • site diagnostics
  • fleet benchmarking
  • smarter retention and reconstruction strategies

That is substantially more capable than a single central historian model.

5.5 tagcodering is the right long-term direction

A database-backed configuration authority is stronger than embedding configuration only in flows because it supports:

  • machine metadata management
  • controlled rollout of configuration changes
  • clearer versioning and provenance
  • future API-driven configuration services

6. Downsides And Risks

6.1 Smart storage raises algorithmic and governance complexity

Signal-aware storage and reconstruction are promising, but they create architectural obligations:

  • reconstruction rules must be explicit
  • acceptable reconstruction error must be defined per signal type
  • operators must know whether they see raw or reconstructed history
  • compliance-relevant data may need stricter retention than operational convenience data

Without those rules, smart storage can become opaque and hard to trust.

6.2 Multi-level databases can create ownership confusion

If edge, site, and central all store telemetry, you must define:

  • which layer is authoritative for which time horizon
  • when backfill is allowed
  • when data is summarized vs copied
  • how duplicates or gaps are detected

Otherwise operations will argue over which trend is "the real one."
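Duplicate and gap detection, at least, can be mechanical once each signal class declares an expected interval. A hedged sketch (sorted-timestamp input and the "gap = more than twice the expected interval" threshold are assumptions for illustration):

```javascript
// Illustrative series audit: flag duplicate timestamps and gaps,
// assuming timestamps are sorted ascending (ms) and the expected
// interval comes from the signal's declared class.
function auditSeries(timestamps, expectedIntervalMs) {
  const duplicates = [];
  const gaps = [];
  for (let i = 1; i < timestamps.length; i++) {
    const dt = timestamps[i] - timestamps[i - 1];
    if (dt === 0) {
      duplicates.push(timestamps[i]);
    } else if (dt > 2 * expectedIntervalMs) {
      gaps.push({ from: timestamps[i - 1], to: timestamps[i] });
    }
  }
  return { duplicates, gaps };
}
```

Running this per layer and comparing results is one concrete way to settle "which trend is the real one" with data instead of argument.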

6.3 Central intelligence must remain advisory-first

Central guidance can become valuable, but direct closed-loop dependency on central would be risky.

The architecture should therefore preserve:

  • local control authority at edge/site
  • bounded and explicit central advice
  • safe behavior if central recommendations stop arriving

6.4 tagcodering is not yet complete enough to lean on blindly

It is the right target, but its current partial state means there is still architecture debt:

  • incomplete config workflows
  • likely mismatch between desired and implemented schema behavior
  • temporary duplication between flows, node config, and database-held metadata

This should be treated as a core platform workstream, not a side issue.

6.5 Broker responsibilities are still not crisp enough

The materials still reference MQTT/AMQP/RabbitMQ/brokers without one stable responsibility split. That needs to be resolved before large-scale deployment.

Questions still open:

  • command bus or event bus?
  • site-only or cross-site?
  • telemetry transport or only synchronization/eventing?
  • durability expectations and replay behavior?

7. Ideal EVOLV Target Stack

The ideal EVOLV stack should be layered around operational boundaries, not around tools.

7.1 Layer A: Edge execution

Purpose:

  • connect to PLCs and field assets
  • execute time-sensitive local logic
  • preserve operation during WAN/central loss
  • provide local telemetry access for resilience and digital-twin use cases

Recommended components:

  • Node-RED runtime for EVOLV edge flows
  • OPC UA and protocol adapters
  • local InfluxDB
  • optional local Grafana for local engineering/monitoring
  • optional local broker only when multiple participants need decoupling

Principle:

  • edge remains safe and useful when disconnected

7.2 Layer B: Site integration

Purpose:

  • aggregate multiple edge systems at plant/site level
  • host plant-local dashboards and diagnostics
  • mediate between raw OT detail and central standardization
  • serve as the protected step between field systems and central requests

Recommended components:

  • site Node-RED / CoreSync services
  • site InfluxDB
  • site Grafana / SCADA-supporting dashboards
  • site broker where asynchronous eventing is justified

Principle:

  • site absorbs plant complexity and protects field assets

7.3 Layer C: Central platform

Purpose:

  • fleet-wide analytics
  • shared dashboards
  • engineering lifecycle
  • enterprise/API entry point
  • overview intelligence and advisory logic

Recommended components:

  • Gitea
  • CI/CD
  • central InfluxDB
  • central Grafana
  • API/integration gateway
  • IAM
  • VPN/private connectivity
  • tagcodering-backed configuration services

Principle:

  • central coordinates, advises, and governs; it is not the direct field caller

7.4 Cross-cutting platform services

These should be explicit architecture elements:

  • secrets management
  • certificate management
  • backup/restore
  • audit logging
  • monitoring/alerting of the platform itself
  • versioned configuration and schema management
  • rollout/rollback strategy

8. Key Architecture Recommendations

8.1 Keep Node-RED as the orchestration layer, not the whole platform

Node-RED should own:

  • process orchestration
  • protocol mediation
  • edge/site logic
  • KPI production

It should not become the sole owner of:

  • identity
  • long-term configuration authority
  • secret management
  • compliance/audit authority

8.2 Use InfluxDB by function and horizon

Recommended split:

  • edge: resilience, local replay, digital-twin input
  • site: plant diagnostics and local continuity
  • central: fleet analytics, advisory intelligence, benchmarking, and long-term cross-site views

8.3 Prefer smart telemetry retention over naive point dumping

Recommended rule:

  • keep information-rich points
  • reduce information-poor flat spans
  • document reconstruction assumptions
  • define signal-class-specific fidelity expectations

This needs design discipline, but it is a real differentiator if executed well.

8.4 Put enterprise/API ingress at central, not at edge

This should become a hard architectural rule:

  • external requests land centrally
  • central authenticates and authorizes
  • central or site mediates downward
  • edge never becomes the exposed public integration surface

8.5 Make tagcodering the target configuration backbone

The architecture should be designed so that tagcodering can mature into:

  • machine and asset registry
  • configuration source of truth
  • site/central configuration exchange point
  • API-served configuration source for runtime layers

9. Suggested Phasing

Phase 1: Stabilize contracts

  • define topic and payload contracts
  • define telemetry classes and reconstruction policy
  • define asset, machine, and site identity model
  • define tagcodering scope and schema ownership
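Pinning down Phase 1 payload contracts could look like a small declarative contract plus a validator; the field names and type table here are hypothetical placeholders, not the agreed contract:

```javascript
// Hypothetical Phase 1 telemetry contract and a minimal validator.
// Field names are placeholders pending the real contract definition.
const telemetryContract = {
  required: ['site', 'asset', 'signal', 'value', 'ts'],
  types: { site: 'string', asset: 'string', signal: 'string', value: 'number', ts: 'number' },
};

function validatePayload(payload, contract) {
  const errors = [];
  for (const key of contract.required) {
    if (!(key in payload)) {
      errors.push(`missing: ${key}`);
      continue;
    }
    if (typeof payload[key] !== contract.types[key]) {
      errors.push(`bad type: ${key}`);
    }
  }
  return errors; // empty array means the payload conforms
}
```

Agreeing on even this minimal shape early keeps edge, site, and central from each inventing an incompatible message format.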

Phase 2: Harden local/site resilience

  • formalize edge and site runtime patterns
  • define local telemetry retention and replay behavior
  • define central-loss behavior
  • define dashboard behavior during isolation

Phase 3: Harden central platform

  • IAM
  • API gateway
  • central observability
  • CI/CD
  • backup and disaster recovery
  • config services over tagcodering

Phase 4: Introduce selective synchronization and intelligence

  • event-driven telemetry propagation rules
  • smart-storage promotion/backfill policies
  • advisory services from central
  • auditability of downward recommendations and configuration changes

10. Immediate Open Questions Before Wiki Finalization

  1. Which signals are allowed to use reconstruction-aware smart storage, and which must remain raw or near-raw for audit/compliance reasons?
  2. How should tagcodering be exposed to runtime layers: direct database access, a dedicated API, or both?
  3. What exact responsibility split should EVOLV use between API synchronization and broker-based eventing?

11. Wiki Page Structure

The wiki should not be one long page. It should be split into:

  1. platform overview with the main topology diagram
  2. edge-site-central runtime model
  3. telemetry and smart storage model
  4. security and access-boundary model
  5. configuration architecture centered on tagcodering

12. Next Step

Use this document as the architecture baseline. The companion markdown page in architecture/ can then be shaped into a wiki-ready visual overview page with Mermaid diagrams and shorter human-readable sections.