# EVOLV Architecture Review
## Purpose

This document captures:

- the architecture implemented in this repository today
- the broader edge/site/central architecture shown in the drawings under `temp/`
- the key strengths and weaknesses of that direction
- the currently preferred target stack based on owner decisions from this review

It is the local staging document for a later wiki update.
## Evidence Used

Implemented stack evidence:

- `docker-compose.yml`
- `docker/settings.js`
- `docker/grafana/provisioning/datasources/influxdb.yaml`
- `package.json`
- `nodes/*`

Target-state evidence:

- `temp/fullStack.pdf`
- `temp/edge.pdf`
- `temp/CoreSync.drawio.pdf`
- `temp/cloud.yml`

Owner decisions from this review:

- local InfluxDB is required for operational resilience
- central acts as the advisory/intelligence and API-entry layer, not as a direct field caller
- the intended configuration authority is the database-backed `tagcodering` model
- architecture wiki pages should be visual, not text-only
## 1. What Exists Today

### 1.1 Product/runtime layer

The codebase is currently a modular Node-RED package for wastewater/process automation:

- EVOLV ships custom Node-RED nodes for plant assets and process logic
- nodes emit both process/control messages and telemetry-oriented outputs
- shared helper logic lives in `nodes/generalFunctions/`
- Grafana-facing integration exists through `dashboardAPI` and Influx-oriented outputs

### 1.2 Implemented development stack

The concrete development stack in this repository is:

- Node-RED
- InfluxDB 2.x
- Grafana

That gives a clear local flow:

1. EVOLV logic runs in Node-RED.
2. Telemetry is emitted in a time-series-oriented shape.
3. InfluxDB stores the telemetry.
4. Grafana renders operational dashboards.
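The hand-off from step 2 to step 3 can be sketched as line-protocol construction, which is what InfluxDB 2.x ingests. This is a minimal JavaScript sketch; the measurement, tag, and field names are illustrative assumptions, not the repository's actual telemetry schema.

```javascript
// Sketch: turn an EVOLV-style telemetry message into InfluxDB line protocol.
// Measurement/tag/field names are illustrative, not the repo's actual schema.
function toLineProtocol(msg) {
  const tags = Object.entries(msg.tags)
    .map(([k, v]) => `${k}=${String(v).replace(/[ ,=]/g, "\\$&")}`) // escape specials
    .join(",");
  const fields = Object.entries(msg.fields)
    .map(([k, v]) => (typeof v === "number" ? `${k}=${v}` : `${k}="${v}"`))
    .join(",");
  // InfluxDB 2.x defaults to nanosecond-precision timestamps.
  return `${msg.measurement},${tags} ${fields} ${msg.timestampNs}`;
}

const line = toLineProtocol({
  measurement: "pump_telemetry",
  tags: { site: "rwzi-demo", asset: "P101" },
  fields: { flow_m3h: 42.5, running: 1 },
  timestampNs: "1700000000000000000",
});
// line === "pump_telemetry,site=rwzi-demo,asset=P101 flow_m3h=42.5,running=1 1700000000000000000"
```

In the real stack this string would be posted to the InfluxDB write API; the sketch only shows the record shape Grafana ultimately queries.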

### 1.3 Existing runtime pattern in the nodes

A recurring EVOLV pattern is:

- output 0: process/control message
- output 1: Influx/telemetry message
- output 2: registration/control plumbing where relevant

So even in its current implemented form, EVOLV is not only a Node-RED project. It is already a control-plus-observability platform, with Node-RED as orchestration/runtime and InfluxDB/Grafana as telemetry and visualization services.
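In Node-RED, this fan-out pattern is expressed by returning an array of messages, one slot per output. A minimal sketch of the three-output pattern, with hypothetical payload shapes (the real nodes define their own schemas):

```javascript
// Sketch of the three-output EVOLV node pattern as a Node-RED function body.
// Payload shapes are illustrative; the actual nodes define their own schemas.
function route(msg) {
  const controlMsg = { payload: { cmd: "setpoint", value: msg.payload.value } }; // output 0
  const telemetryMsg = {
    payload: { measurement: "asset_state", value: msg.payload.value, ts: msg.payload.ts },
  }; // output 1
  const registrationMsg = msg.payload.register
    ? { payload: { asset: msg.payload.asset } } // output 2, only where relevant
    : null; // Node-RED sends nothing on a null output slot
  return [controlMsg, telemetryMsg, registrationMsg];
}

const [ctrl, tele, reg] = route({
  payload: { asset: "P101", value: 7.2, ts: 1700000000, register: false },
});
// ctrl goes to process control, tele to Influx, reg is suppressed here
```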

## 2. What The Drawings Describe

Across `temp/fullStack.pdf` and `temp/CoreSync.drawio.pdf`, the intended platform is broader and layered.

### 2.1 Edge / OT layer

The drawings consistently place these capabilities at the edge:

- PLC / OPC UA connectivity
- a Node-RED container as protocol translator and logic runtime
- a local broker in some variants
- local InfluxDB / Prometheus-style storage in some variants
- local Grafana/SCADA in some variants

This is the plant-side operational layer.

### 2.2 Site / local server layer

The CoreSync drawings also show a site aggregation layer:

- an RWZI-local server
- Node-RED / CoreSync services
- a site-local broker
- a site-local database
- upward API-based synchronization

This layer decouples field assets from central services and absorbs plant-specific complexity.

### 2.3 Central / cloud layer

The broader stack drawings and `temp/cloud.yml` show a central platform layer with:

- Gitea
- Jenkins
- reverse proxy / ingress
- Grafana
- InfluxDB
- Node-RED
- RabbitMQ / messaging
- VPN / tunnel concepts
- Keycloak (in the drawing)
- Portainer (in the drawing)

This is a platform-services layer, not just an application runtime.
## 3. Architecture Decisions From This Review

These decisions now shape the preferred EVOLV target architecture.

### 3.1 Local telemetry is mandatory for resilience

Local InfluxDB is not optional. It is required so that:

- operations continue when central SCADA or central services are down
- local dashboards and advanced digital-twin workflows can still consume recent and relevant process history
- local edge/site layers can make smarter decisions without depending on round-trips to central

### 3.2 Multi-level InfluxDB is part of the architecture

InfluxDB should exist at multiple levels where it adds operational value:

- edge/local for resilience and near-real-time replay
- site for plant-level history, diagnostics, and resilience
- central for fleet-wide analytics, benchmarking, and advisory intelligence

This is not copy-paste storage at each level. The design intent is event-driven and selective.

### 3.3 Storage should be smart, not only deadband-driven

The target is neither naive "store every point" nor a fixed deadband rule such as 1%.

The desired storage approach is:

- observe signal slope and change behavior
- preserve points where state is changing materially
- store fewer points where the signal can be reconstructed downstream with sufficient fidelity
- carry enough metadata or conventions so reconstruction quality is auditable

This implies EVOLV should evolve toward smart storage and signal-aware retention rather than naive event dumping.
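One way to approximate slope- and change-aware retention is a compressor that keeps a point only when it drifts materially from the last stored point or the slope changes direction. This is a sketch under illustrative threshold assumptions, not a committed EVOLV algorithm:

```javascript
// Sketch: signal-aware downsampling that keeps points where the signal is
// changing materially and drops flat, reconstructable spans.
// The 2% drift threshold and slope-reversal rule are illustrative policy choices.
function compress(points, threshold = 0.02) {
  if (points.length === 0) return [];
  const kept = [points[0]];
  let lastSlope = 0;
  for (let i = 1; i < points.length; i++) {
    const slope = points[i].v - points[i - 1].v;
    const base = Math.abs(kept[kept.length - 1].v) || 1;
    const drifted = Math.abs(points[i].v - kept[kept.length - 1].v) > threshold * base;
    const turned = lastSlope !== 0 && slope !== 0 && Math.sign(slope) !== Math.sign(lastSlope);
    if (drifted || turned || i === points.length - 1) kept.push(points[i]); // always keep the last point
    lastSlope = slope;
  }
  return kept;
}

const raw = [
  { t: 0, v: 10.0 }, { t: 1, v: 10.0 }, { t: 2, v: 10.01 },
  { t: 3, v: 10.5 }, { t: 4, v: 11.0 }, { t: 5, v: 10.8 },
];
const stored = compress(raw);
// flat points at t=1,2 are dropped; change points at t=3,4 and the reversal at t=5 survive
```

The auditability requirement above would additionally attach the threshold and rule version to the stored series as metadata, so downstream reconstruction quality can be checked.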

### 3.4 Central is the intelligence and API-entry layer

Central may advise and coordinate edge/site layers, but external API requests should not hit field-edge systems directly.

The intended pattern is:

- external and enterprise integrations terminate centrally
- central evaluates, aggregates, authorizes, and advises
- site/edge layers receive mediated requests, policies, or setpoints
- field-edge remains protected behind an intermediate layer

This aligns with the stated security direction.
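The mediation pattern can be sketched as a central gateway that authorizes and bounds an external request before forwarding a site-level instruction. The role names, limits, and message shape below are assumptions for illustration:

```javascript
// Sketch of central-side mediation: authorize an external request, then emit a
// bounded instruction for the site layer instead of touching the edge directly.
// Role names, safety limits, and the message shape are illustrative assumptions.
function mediate(request, policy) {
  if (!policy.allowedRoles.includes(request.role)) {
    return { accepted: false, reason: "unauthorized" };
  }
  // Clamp the requested setpoint into the per-asset safe envelope.
  const bounded = Math.min(Math.max(request.setpoint, policy.min), policy.max);
  return {
    accepted: true,
    forward: { layer: "site", asset: request.asset, setpoint: bounded, source: "central-api" },
  };
}

const decision = mediate(
  { role: "planner", asset: "P101", setpoint: 120 },
  { allowedRoles: ["planner"], min: 0, max: 100 },
);
// decision.forward.setpoint is clamped to 100; the site layer, not the caller,
// decides how and whether to apply it at the edge
```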

### 3.5 Configuration source of truth should be database-backed

The intended configuration authority is the database-backed `tagcodering` model, which already exists but is not yet complete enough to serve as the fully realized source of truth.

That means the architecture should assume:

- asset and machine metadata belong in `tagcodering`
- Node-RED flows should consume configuration rather than silently becoming the only configuration store
- more work is still needed before this behaves as the intended central configuration backbone
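The consumption side can be sketched as a resolver that prefers the database-backed registry and only falls back to flow-local defaults during the transition. The registry shape and field names here are hypothetical, not the actual `tagcodering` schema:

```javascript
// Sketch: a runtime layer resolving asset configuration from a database-backed
// registry, falling back to flow-local defaults only when the registry lacks
// the entry. Registry shape and field names are hypothetical, not the real
// tagcodering schema.
function resolveConfig(assetId, registry, flowDefaults) {
  const entry = registry.get(assetId);
  if (entry) return { ...entry, source: "tagcodering" };
  // Transitional fallback: the source field flags that flow-local config was used,
  // which makes remaining duplication visible instead of silent.
  return { ...flowDefaults, source: "flow-default" };
}

const registry = new Map([["P101", { maxFlow: 80, unit: "m3/h" }]]);
const cfg = resolveConfig("P101", registry, { maxFlow: 50, unit: "m3/h" });
// cfg.source === "tagcodering", cfg.maxFlow === 80
```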

## 4. Visual Model

### 4.1 Platform topology

```mermaid
flowchart LR
  subgraph OT["OT / Field"]
    PLC["PLC / IO"]
    DEV["Sensors / Machines"]
  end

  subgraph EDGE["Edge Layer"]
    ENR["Edge Node-RED"]
    EDB["Local InfluxDB"]
    EUI["Local Grafana / Local Monitoring"]
    EBR["Optional Local Broker"]
  end

  subgraph SITE["Site Layer"]
    SNR["Site Node-RED / CoreSync"]
    SDB["Site InfluxDB"]
    SUI["Site Grafana / SCADA Support"]
    SBR["Site Broker"]
  end

  subgraph CENTRAL["Central Layer"]
    API["API / Integration Gateway"]
    INTEL["Overview Intelligence / Advisory Logic"]
    CDB["Central InfluxDB"]
    CGR["Central Grafana"]
    CFG["Tagcodering Config Model"]
    GIT["Gitea"]
    CI["CI/CD"]
    IAM["IAM / Keycloak"]
  end

  DEV --> PLC
  PLC --> ENR
  ENR --> EDB
  ENR --> EUI
  ENR --> EBR
  ENR <--> SNR
  EDB <--> SDB
  SNR --> SDB
  SNR --> SUI
  SNR --> SBR
  SNR <--> API
  API --> INTEL
  API <--> CFG
  SDB <--> CDB
  INTEL --> SNR
  CGR --> CDB
  CI --> GIT
  IAM --> API
  IAM --> CGR
```

### 4.2 Command and access boundary

```mermaid
flowchart TD
  EXT["External APIs / Enterprise Requests"] --> API["Central API Gateway"]
  API --> AUTH["AuthN/AuthZ / Policy Checks"]
  AUTH --> INTEL["Central Advisory / Decision Support"]
  INTEL --> SITE["Site Integration Layer"]
  SITE --> EDGE["Edge Runtime"]
  EDGE --> PLC["PLC / Field Assets"]

  EXT -. no direct access .-> EDGE
  EXT -. no direct access .-> PLC
```

### 4.3 Smart telemetry flow

```mermaid
flowchart LR
  RAW["Raw Signal"] --> EDGELOGIC["Edge Signal Evaluation"]
  EDGELOGIC --> KEEP["Keep Critical Change Points"]
  EDGELOGIC --> SKIP["Skip Reconstructable Flat Points"]
  EDGELOGIC --> LOCAL["Local InfluxDB"]
  LOCAL --> SITE["Site InfluxDB"]
  SITE --> CENTRAL["Central InfluxDB"]
  KEEP --> LOCAL
  SKIP -. reconstruction assumptions / metadata .-> SITE
  CENTRAL --> DASH["Fleet Dashboards / Analytics"]
```

## 5. Upsides Of This Direction

### 5.1 Strong separation between control and observability

Node-RED for runtime/orchestration and InfluxDB/Grafana for telemetry is still the right structural split:

- control stays close to the process
- telemetry storage and querying stay in time-series-native tooling
- dashboards do not overload Node-RED itself

### 5.2 Edge-first matches operational reality

For wastewater/process systems, edge-first remains correct:

- lower latency
- better degraded-mode behavior
- less dependence on WAN or central platform uptime
- a clearer OT trust boundary

### 5.3 Site mediation improves safety and security

Using central as the enterprise/API entry point and site as the mediator improves the security posture:

- field systems are less exposed
- policy decisions can be centralized
- external integrations do not probe the edge directly
- site can continue operating even when upstream is degraded

### 5.4 Multi-level storage enables better analytics

Multiple Influx layers can support:

- local resilience
- site diagnostics
- fleet benchmarking
- smarter retention and reconstruction strategies

That is substantially more capable than a single central historian model.

### 5.5 `tagcodering` is the right long-term direction

A database-backed configuration authority is stronger than embedding configuration only in flows because it supports:

- machine metadata management
- controlled rollout of configuration changes
- clearer versioning and provenance
- future API-driven configuration services
## 6. Downsides And Risks

### 6.1 Smart storage raises algorithmic and governance complexity

Signal-aware storage and reconstruction is promising, but it creates architectural obligations:

- reconstruction rules must be explicit
- acceptable reconstruction error must be defined per signal type
- operators must know whether they are looking at raw or reconstructed history
- compliance-relevant data may need stricter retention than operational-convenience data

Without those rules, smart storage becomes opaque and hard to trust.

### 6.2 Multi-level databases can create ownership confusion

If edge, site, and central all store telemetry, the design must define:

- which layer is authoritative for which time horizon
- when backfill is allowed
- when data is summarized versus copied
- how duplicates or gaps are detected

Otherwise operations will argue over which trend is "the real one."
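The last point, duplicate and gap detection, can be sketched as a reconciliation pass over a series' timestamps. The expected-interval value is an illustrative assumption; real signals would carry their own cadence metadata:

```javascript
// Sketch: detect duplicates and gaps when reconciling the same series across
// storage layers. The 60 s expected interval is an illustrative assumption.
function audit(timestamps, expectedIntervalSec = 60) {
  const sorted = [...timestamps].sort((a, b) => a - b);
  const duplicates = [];
  const gaps = [];
  for (let i = 1; i < sorted.length; i++) {
    const delta = sorted[i] - sorted[i - 1];
    if (delta === 0) duplicates.push(sorted[i]); // same timestamp written twice
    else if (delta > expectedIntervalSec) gaps.push({ from: sorted[i - 1], to: sorted[i] });
  }
  return { duplicates, gaps };
}

const report = audit([0, 60, 60, 120, 300]);
// one duplicate at t=60, one gap between t=120 and t=300
```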

### 6.3 Central intelligence must remain advisory-first

Central guidance can become valuable, but a direct closed-loop dependency on central would be risky.

The architecture should therefore preserve:

- local control authority at edge/site
- bounded and explicit central advice
- safe behavior if central recommendations stop arriving

### 6.4 `tagcodering` is not yet complete enough to lean on blindly

It is the right target, but its current partial state means there is still architecture debt:

- incomplete configuration workflows
- a likely mismatch between desired and implemented schema behavior
- temporary duplication between flows, node config, and database-held metadata

This should be treated as a core platform workstream, not a side issue.

### 6.5 Broker responsibilities are still not crisp enough

The materials still reference MQTT/AMQP/RabbitMQ brokers without one stable responsibility split. That needs to be resolved before large-scale deployment.

Questions still open:

- command bus or event bus?
- site-only or cross-site?
- telemetry transport, or only synchronization and eventing?
- what are the durability expectations and replay behavior?
## 7. Security And Regulatory Positioning

### 7.1 Purdue-style layering is a good fit

EVOLV's preferred structure aligns well with a Purdue-style OT/IT layering approach:

- PLCs and field assets stay at the operational edge
- edge runtimes stay close to the process
- site systems mediate between OT and broader enterprise concerns
- central services host APIs, identity, analytics, and engineering workflows

That matters because it supports segmented trust boundaries instead of direct enterprise-to-field reach-through.

### 7.2 NIS2 alignment

Directive (EU) 2022/2555 (NIS2) requires cybersecurity risk-management measures, incident handling, and stronger governance for covered entities.

This architecture supports that by:

- limiting direct exposure of field systems
- separating operational layers
- enabling central policy and oversight
- preserving local operation during upstream failure

### 7.3 CER alignment

Directive (EU) 2022/2557 (the Critical Entities Resilience Directive) focuses on the resilience of essential services.

The edge-plus-site approach supports that direction because:

- local/site layers can continue during central disruption
- essential service continuity does not depend on one central runtime
- degraded-mode behavior can be explicitly designed per layer

### 7.4 Cyber Resilience Act alignment

Regulation (EU) 2024/2847 (the Cyber Resilience Act) creates cybersecurity requirements for products with digital elements.

For EVOLV, that means the platform should keep strengthening:

- secure configuration handling
- vulnerability and update management
- release traceability
- lifecycle ownership of components and dependencies

### 7.5 GDPR alignment where personal data is present

Regulation (EU) 2016/679 (GDPR) applies whenever EVOLV processes personal data.

The architecture helps by:

- centralizing ingress
- reducing unnecessary propagation of data to field layers
- making access, retention, and audit boundaries easier to define

### 7.6 What can and cannot be claimed

The defensible claim is that EVOLV can be deployed in a way that supports compliance with strict European cybersecurity and resilience expectations. The non-defensible claim is that EVOLV is automatically compliant purely because of the architecture diagram.

Actual compliance still depends on implementation and operations, including:

- access control
- patch and vulnerability management
- incident response
- logging and audit evidence
- retention policy
- data classification
## 8. Recommended Ideal Stack

The ideal EVOLV stack should be layered around operational boundaries, not around tools.

### 8.1 Layer A: Edge execution

Purpose:

- connect to PLCs and field assets
- execute time-sensitive local logic
- preserve operation during WAN/central loss
- provide local telemetry access for resilience and digital-twin use cases

Recommended components:

- Node-RED runtime for EVOLV edge flows
- OPC UA and protocol adapters
- local InfluxDB
- optional local Grafana for local engineering/monitoring
- optional local broker, only when multiple participants need decoupling

Principle: the edge remains safe and useful when disconnected.

### 8.2 Layer B: Site integration

Purpose:

- aggregate multiple edge systems at plant/site level
- host plant-local dashboards and diagnostics
- mediate between raw OT detail and central standardization
- serve as the protected step between field systems and central requests

Recommended components:

- site Node-RED / CoreSync services
- site InfluxDB
- site Grafana / SCADA-supporting dashboards
- site broker where asynchronous eventing is justified

Principle: the site absorbs plant complexity and protects field assets.

### 8.3 Layer C: Central platform

Purpose:

- fleet-wide analytics
- shared dashboards
- engineering lifecycle
- enterprise/API entry point
- overview intelligence and advisory logic

Recommended components:

- Gitea
- CI/CD
- central InfluxDB
- central Grafana
- API/integration gateway
- IAM
- VPN/private connectivity
- `tagcodering`-backed configuration services

Principle: central coordinates, advises, and governs; it is not the direct field caller.

### 8.4 Cross-cutting platform services

These should be explicit architecture elements:

- secrets management
- certificate management
- backup/restore
- audit logging
- monitoring/alerting of the platform itself
- versioned configuration and schema management
- rollout/rollback strategy
## 9. Recommended Opinionated Choices

### 9.1 Keep Node-RED as the orchestration layer, not the whole platform

Node-RED should own:

- process orchestration
- protocol mediation
- edge/site logic
- KPI production

It should not become the sole owner of:

- identity
- long-term configuration authority
- secret management
- compliance/audit authority

### 9.2 Use InfluxDB by function and horizon

Recommended split:

- edge: resilience, local replay, digital-twin input
- site: plant diagnostics and local continuity
- central: fleet analytics, advisory intelligence, benchmarking, and long-term cross-site views
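This function-and-horizon split implies a routing decision per telemetry point. A sketch with hypothetical signal classes; the class names and promotion rules are illustrative policy, not a fixed EVOLV contract:

```javascript
// Sketch: decide which storage layers receive a telemetry point, based on its
// class. Class names and promotion rules are illustrative policy assumptions.
function targetLayers(point) {
  const layers = ["edge"]; // everything lands locally first, for resilience
  if (point.class === "diagnostic" || point.class === "kpi") layers.push("site");
  if (point.class === "kpi") layers.push("central"); // only fleet-relevant data is promoted
  return layers;
}

const kpiLayers = targetLayers({ class: "kpi", value: 0.93 });
// ["edge", "site", "central"]
```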

### 9.3 Prefer smart telemetry retention over naive point dumping

Recommended rule:

- keep information-rich points
- reduce information-poor flat spans
- document reconstruction assumptions
- define signal-class-specific fidelity expectations

This needs design discipline, but it is a real differentiator if executed well.

### 9.4 Put enterprise/API ingress at central, not at edge

This should become a hard architectural rule:

- external requests land centrally
- central authenticates and authorizes
- central or site mediates downward
- the edge never becomes the exposed public integration surface

### 9.5 Make `tagcodering` the target configuration backbone

The architecture should be designed so that `tagcodering` can mature into:

- a machine and asset registry
- the configuration source of truth
- the site/central configuration exchange point
- an API-served configuration source for runtime layers
## 10. Suggested Phasing

### Phase 1: Stabilize contracts

- define topic and payload contracts
- define telemetry classes and reconstruction policy
- define asset, machine, and site identity model
- define `tagcodering` scope and schema ownership

### Phase 2: Harden local/site resilience

- formalize edge and site runtime patterns
- define local telemetry retention and replay behavior
- define central-loss behavior
- define dashboard behavior during isolation

### Phase 3: Harden central platform

- IAM
- API gateway
- central observability
- CI/CD
- backup and disaster recovery
- config services over `tagcodering`

### Phase 4: Introduce selective synchronization and intelligence

- event-driven telemetry propagation rules
- smart-storage promotion/backfill policies
- advisory services from central
- auditability of downward recommendations and configuration changes
## 11. Immediate Open Questions Before Wiki Finalization

1. Which signals are allowed to use reconstruction-aware smart storage, and which must remain raw or near-raw for audit/compliance reasons?
2. How should `tagcodering` be exposed to runtime layers: direct database access, a dedicated API, or both?
3. What exact responsibility split should EVOLV use between API synchronization and broker-based eventing?
## 12. Recommended Wiki Structure

The wiki should not be one long page. It should be split into:

1. platform overview with the main topology diagram
2. edge-site-central runtime model
3. telemetry and smart storage model
4. security and access-boundary model
5. configuration architecture centered on `tagcodering`

## 13. Next Step

Use this document as the architecture baseline. The companion markdown page in `architecture/` can then be shaped into a wiki-ready visual overview page with Mermaid diagrams and shorter human-readable sections.