<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="https://blogs.agntcy.org/drafts/docs-mvp-ai-agents-multicluster/feed.xml" rel="self" type="application/atom+xml" /><link href="https://blogs.agntcy.org/drafts/docs-mvp-ai-agents-multicluster/" rel="alternate" type="text/html" /><updated>2026-04-02T20:43:27+00:00</updated><id>https://blogs.agntcy.org/drafts/docs-mvp-ai-agents-multicluster/feed.xml</id><title type="html">AGNTCY Blogs</title><subtitle>Building infrastructure for the Internet of Agents</subtitle><entry><title type="html">SLIM MVP: Enterprise AI Agent Fleet — Multicluster Secure Communications</title><link href="https://blogs.agntcy.org/drafts/docs-mvp-ai-agents-multicluster/technical/2026/04/02/mvp-ai-agents-multicluster.html" rel="alternate" type="text/html" title="SLIM MVP: Enterprise AI Agent Fleet — Multicluster Secure Communications" /><published>2026-04-02T07:00:00+00:00</published><updated>2026-04-02T07:00:00+00:00</updated><id>https://blogs.agntcy.org/drafts/docs-mvp-ai-agents-multicluster/technical/2026/04/02/mvp-ai-agents-multicluster</id><content type="html" xml:base="https://blogs.agntcy.org/drafts/docs-mvp-ai-agents-multicluster/technical/2026/04/02/mvp-ai-agents-multicluster.html"><![CDATA[<blockquote>
  <p><strong>Related issue:</strong> <a href="https://github.com/agntcy/slim/issues/1372">#1372 — Epic: SLIM multicluster autoconfig installation for server fleets</a></p>
</blockquote>

<hr />

<h2 id="table-of-contents">Table of Contents</h2>

<ol>
  <li><a href="#1-executive-summary">Executive Summary</a></li>
  <li><a href="#2-business-problem">Business Problem</a></li>
  <li><a href="#3-solution-slim-as-the-communication-backbone">Solution: SLIM as the Communication Backbone</a></li>
  <li><a href="#4-architecture">Architecture</a></li>
  <li><a href="#5-use-case-enterprise-cluster-monitoring-agent-fleet">Use Case: Enterprise Cluster-Monitoring Agent Fleet</a></li>
  <li><a href="#6-communication-flows">Communication Flows</a></li>
  <li><a href="#7-security-model">Security Model</a></li>
  <li><a href="#8-demo-scenario">Demo Scenario</a></li>
  <li><a href="#9-key-value-propositions">Key Value Propositions</a></li>
  <li><a href="#10-implementation-notes">Implementation Notes</a></li>
</ol>

<hr />

<h2 id="1-executive-summary">1. Executive Summary</h2>

<p>This MVP demonstrates how <strong>SLIM (Secure Low-Latency Interactive Messaging)</strong> enables AI
agents deployed across multiple Kubernetes clusters — including clusters hidden behind
corporate VPNs and firewalls — to communicate securely without exposing any agent as a
network server.</p>

<p>Agents subscribe to <strong>named channels</strong> (topics) matching the problem they need to solve.
When a cluster event fires, the relevant agents wake up, collaborate through SLIM, and
escalate to a human operator if needed. All of this works across network boundaries that
would otherwise require complex firewall rules, VPN tunnels, or service mesh configuration
per agent.</p>

<p><strong>Core value proposition:</strong> SLIM turns the hardest part of multi-cluster agent
communication — the network — into a non-problem, while making the system <em>more</em> secure, not
less.</p>

<hr />

<h2 id="2-business-problem">2. Business Problem</h2>

<h3 id="21-the-enterprise-fleet-reality">2.1 The Enterprise Fleet Reality</h3>

<p>Large enterprises operate AI agent fleets across tens or hundreds of Kubernetes clusters.
These clusters span different environments:</p>

<table>
  <thead>
    <tr>
      <th>Environment</th>
      <th>Network constraint</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>On-prem data centers</td>
      <td>Corporate firewall / private network</td>
    </tr>
    <tr>
      <td>Branch offices</td>
      <td>VPN-only access</td>
    </tr>
    <tr>
      <td>Cloud regions</td>
      <td>Transit VPC, private endpoints</td>
    </tr>
    <tr>
      <td>Edge / OT networks</td>
      <td>Strictly air-gapped or NAT-only outbound</td>
    </tr>
  </tbody>
</table>

<p>Each cluster runs <strong>specialized AI agents</strong>:</p>

<ul>
  <li><strong>Performance agents</strong> — track CPU, memory, latency, and SLO compliance</li>
  <li><strong>Security agents</strong> — detect anomalies, policy violations, CVEs in running images</li>
  <li><strong>Remediation agents</strong> — attempt automated fixes (node drain, pod restart, config rollback)</li>
  <li><strong>Escalation agents</strong> — page a human operator when automated resolution is insufficient</li>
</ul>

<h3 id="22-what-makes-agent-communication-hard">2.2 What Makes Agent Communication Hard</h3>

<p>Traditional approaches require agents to expose themselves as servers:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Agent A ──► opens TCP port ──► registered in service registry ──► firewall rule
Agent B ──► DNS lookup ──► TLS handshake ──► mTLS cert rotation ──► calls Agent A
</code></pre></div></div>

<p>This creates a dense web of operational complexity:</p>

<ul>
  <li><strong>Every cluster behind a VPN requires explicit firewall/NAT rules</strong> per agent pair</li>
  <li><strong>Every agent needs a stable DNS name and TLS certificate</strong>, even for ephemeral workloads</li>
  <li><strong>Human operators</strong> wanting to observe or intervene must be given network-level access
to the relevant cluster</li>
  <li><strong>Agent-to-agent topology changes</strong> (scale-up, migration) leave service
registry entries and routing rules stale</li>
</ul>

<h3 id="23-requirements-for-the-mvp">2.3 Requirements for the MVP</h3>

<table>
  <thead>
    <tr>
      <th>Requirement</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Zero exposed ports per agent</strong></td>
      <td>Agents must not listen on any TCP/UDP port</td>
    </tr>
    <tr>
      <td><strong>Cross-cluster messaging</strong></td>
      <td>Agents on Cluster A and Cluster B communicate transparently</td>
    </tr>
    <tr>
      <td><strong>Works behind VPN / NAT</strong></td>
      <td>Outbound-only connectivity from cluster to SLIM overlay is sufficient</td>
    </tr>
    <tr>
      <td><strong>Event-driven wake-up</strong></td>
      <td>Agents are idle until a relevant event arrives on their channel</td>
    </tr>
    <tr>
      <td><strong>Multi-channel membership</strong></td>
      <td>A single agent can participate in multiple topic channels simultaneously</td>
    </tr>
    <tr>
      <td><strong>Human-in-the-loop</strong></td>
      <td>Operators can observe, validate, and act via the same channel mechanism</td>
    </tr>
    <tr>
      <td><strong>End-to-end encryption</strong></td>
      <td>Agent messages are encrypted with MLS; SLIM nodes cannot read payloads</td>
    </tr>
    <tr>
      <td><strong>Workload identity</strong></td>
      <td>SPIRE provides SPIFFE certificates — no shared secrets, no static API keys</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="3-solution-slim-as-the-communication-backbone">3. Solution: SLIM as the Communication Backbone</h2>

<h3 id="31-what-slim-provides">3.1 What SLIM Provides</h3>

<p>SLIM is a <strong>publish-subscribe message router</strong> with a session layer. Its architecture has
three components that cleanly separate concerns:</p>

<pre><code class="language-mermaid">flowchart TB
  subgraph APP["Agent / Application"]
    RPC["SLIMRPC (request/response, streaming)"]
    SL["Session Layer (MLS encryption, group mgmt)"]
    DPC["Data Plane Client (routing, transport)"]
    RPC --&gt; SL --&gt; DPC
  end
  DPC --&gt;|"outbound connection only"| NL["SLIM Routing Node\n(cluster-local)"]
  NL --&gt;|"mTLS (SPIRE-issued certs)"| NR["SLIM Routing Node\n(remote cluster)"]
</code></pre>

<h3 id="32-the-channel-model">3.2 The Channel Model</h3>

<p>Every agent is identified by a <strong>hierarchical name</strong>: <code class="language-plaintext highlighter-rouge">&lt;org&gt;/&lt;namespace&gt;/&lt;agent-type&gt;[#id]</code>.</p>

<p>Agents <strong>subscribe</strong> to a named channel topic. The SLIM controller sees these subscriptions
and automatically creates routes between the cluster-local SLIM node and any remote nodes
that have subscribers for the same topic.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>channel: acme/monitoring/security-incident
          │       │            │
          │       │            └── topic (problem domain)
          │       └── namespace (cluster or team scope)
          └── org (enterprise identifier)
</code></pre></div></div>

<p>Key properties:</p>
<ul>
  <li>An agent subscribes with <strong>no knowledge of the remote agent’s address</strong> — only the channel name</li>
  <li>The SLIM router handles delivery; the agent never opens a listening port</li>
  <li>Multiple agents can subscribe to the same channel → <strong>multicast delivery</strong></li>
  <li>An agent can subscribe to <strong>multiple channels</strong> simultaneously</li>
</ul>
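<p>The naming scheme above can be sketched as a small parser. This is a minimal illustration that assumes names always have exactly three slash-separated parts plus an optional <code>#id</code> suffix; <code>SlimName</code> and <code>parse_name</code> are hypothetical helpers, not part of any SLIM SDK.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SlimName:
    """Hypothetical holder for the three name components plus an optional instance id."""
    org: str
    namespace: str
    kind: str                       # agent type or channel topic
    instance: Optional[str] = None


def parse_name(raw: str) -> SlimName:
    """Split 'org/namespace/kind[#id]' into its parts; raise ValueError if malformed."""
    path, sep, instance = raw.partition("#")
    parts = path.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected org/namespace/name, got {raw!r}")
    if sep and not instance:
        raise ValueError(f"empty instance id in {raw!r}")
    return SlimName(parts[0], parts[1], parts[2], instance or None)


channel = parse_name("acme/monitoring/security-incident")
agent = parse_name("acme/eu-west/security-monitor#0")
print(channel)
print(agent.instance)
```

<p>Because the channel and the agent share one naming shape, the same parser covers both; only the meaning of the last component differs.</p>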

<h3 id="33-why-slim-solves-the-network-problem">3.3 Why SLIM Solves the Network Problem</h3>

<table>
  <thead>
    <tr>
      <th>Traditional approach</th>
      <th>SLIM approach</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Agent opens a port and registers in a service registry</td>
      <td>Agent connects <strong>outbound</strong> to the local SLIM node — no inbound port</td>
    </tr>
    <tr>
      <td>Firewall rules required for every agent pair</td>
      <td>Only the SLIM node itself needs one exposed endpoint per cluster</td>
    </tr>
    <tr>
      <td>DNS + TLS cert per agent</td>
      <td>Workload identity via SPIRE SVID — automatic, no manual cert management</td>
    </tr>
    <tr>
      <td>Service mesh or VPN tunnel between every cluster pair</td>
      <td>SLIM nodes federate; agents are completely unaware of cluster topology</td>
    </tr>
    <tr>
      <td>Agent scale-out changes routing config</td>
      <td>Agents with the same name auto-form a group; SLIM load-balances</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="4-architecture">4. Architecture</h2>

<h3 id="41-high-level-system-architecture">4.1 High-Level System Architecture</h3>

<pre><code class="language-mermaid">---
config:
  layout: elk
---
flowchart TB

  subgraph CLOUD["Cloud / Admin Cluster (admin.example)"]
    direction TB
    CTRL["SLIM Controller\n(route management)"]
    SPIRE_ADMIN["SPIRE Server (root)\ntrust domain: acme.example\nspire-root.admin.example:8081"]
    LB_CTRL["LoadBalancer\nslim-control.admin.example:50052"]
    LB_CTRL --&gt; CTRL
  end

  subgraph CLUSTER_A["Cluster A — On-prem / VPN"]
    direction TB
    SPIRE_A["SPIRE Server (nested)\nno external endpoint"]
    subgraph SLIM_A_NS["slim namespace"]
      SLIM_A["SLIM Node (StatefulSet)\nslim.cluster-a.example:46357"]
    end
    subgraph AGENTS_A["default namespace"]
      PERF_A["Performance\nAgent"]
      SEC_A["Security\nAgent"]
      REMED_A["Remediation\nAgent"]
    end
    PERF_A --&gt;|"outbound only"| SLIM_A
    SEC_A  --&gt;|"outbound only"| SLIM_A
    REMED_A--&gt;|"outbound only"| SLIM_A
  end

  subgraph CLUSTER_B["Cluster B — Cloud region"]
    direction TB
    SPIRE_B["SPIRE Server (nested)\nno external endpoint"]
    subgraph SLIM_B_NS["slim namespace"]
      SLIM_B["SLIM Node (StatefulSet)\nslim.cluster-b.example:46357"]
    end
    subgraph AGENTS_B["default namespace"]
      PERF_B["Performance\nAgent"]
      SEC_B["Security\nAgent"]
      REMED_B["Remediation\nAgent"]
    end
    PERF_B --&gt;|"outbound only"| SLIM_B
    SEC_B  --&gt;|"outbound only"| SLIM_B
    REMED_B--&gt;|"outbound only"| SLIM_B
  end

  subgraph OPERATOR["Operator Workstation / Jump Host"]
    HUMAN["👤 Human Operator\n(observer / approver)"]
    HUMAN --&gt;|"outbound to any SLIM node"| SLIM_A
  end

  SLIM_A &lt;--&gt;|"mTLS (SPIRE-issued certs)\noutbound from each side"| SLIM_B
  SLIM_A --&gt;|"gRPC controller connection"| CTRL
  SLIM_B --&gt;|"gRPC controller connection"| CTRL
  SPIRE_A --&gt;|"nested upstream (outbound)"| SPIRE_ADMIN
  SPIRE_B --&gt;|"nested upstream (outbound)"| SPIRE_ADMIN
</code></pre>

<blockquote>
  <p><strong>SPIRE nested deployment:</strong> Workload-cluster SPIRE servers connect outbound to the
admin root SPIRE server as nested (downstream) servers. Only the admin SPIRE server
requires an external endpoint — workload SPIRE servers never need to be reachable
from outside their own cluster. See <a href="https://spiffe.io/docs/latest/architecture/nested/readme/">SPIRE nested architecture</a>.</p>
</blockquote>

<h3 id="42-network-topology--what-crosses-the-firewall">4.2 Network Topology — What Crosses the Firewall</h3>

<p>The diagram below shows <strong>exactly which connections must be allowed</strong> through firewalls or
VPN gateways. Only SLIM nodes and SPIRE servers open connections that cross a cluster boundary; agents
connect only to their cluster-local SLIM node, and workload SPIRE servers never listen on externally reachable ports.</p>

<pre><code class="language-mermaid">flowchart LR
  subgraph FW_A["Firewall — Cluster A"]
    direction TB
    note_a["Allow outbound to:\nslim-control.admin.example:50052\nslim.cluster-b.example:46357\nspire-root.admin.example:8081"]
  end
  subgraph FW_B["Firewall — Cluster B"]
    direction TB
    note_b["Allow outbound to:\nslim-control.admin.example:50052\nslim.cluster-a.example:46357\nspire-root.admin.example:8081"]
  end

  AGENT_A["Agent (Cluster A)\nno inbound port"] --&gt;|outbound| SLIM_NODE_A["SLIM Node\n(cluster-a)"]
  SLIM_NODE_A --&gt;|"mTLS outbound"| FW_A
  FW_A --&gt; SLIM_NODE_B["SLIM Node\n(cluster-b)"]
  SLIM_NODE_B --&gt; AGENT_B["Agent (Cluster B)\nno inbound port"]

  AGENT_B --&gt;|outbound| SLIM_NODE_B
  SLIM_NODE_B --&gt;|"mTLS outbound"| FW_B
  FW_B --&gt; SLIM_NODE_A

  SPIRE_A["SPIRE Server\n(nested, cluster-a)"] --&gt;|"nested upstream\noutbound"| FW_A
  SPIRE_B["SPIRE Server\n(nested, cluster-b)"] --&gt;|"nested upstream\noutbound"| FW_B
</code></pre>

<p><strong>Firewall rule summary:</strong></p>

<table>
  <thead>
    <tr>
      <th>Source</th>
      <th>Destination</th>
      <th>Port</th>
      <th>Protocol</th>
      <th>Purpose</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Cluster A SLIM node</td>
      <td><code class="language-plaintext highlighter-rouge">slim.cluster-b.example</code></td>
      <td>46357</td>
      <td>TCP/mTLS</td>
      <td>Data plane inter-cluster</td>
    </tr>
    <tr>
      <td>Cluster B SLIM node</td>
      <td><code class="language-plaintext highlighter-rouge">slim.cluster-a.example</code></td>
      <td>46357</td>
      <td>TCP/mTLS</td>
      <td>Data plane inter-cluster</td>
    </tr>
    <tr>
      <td>Cluster A SLIM node</td>
      <td><code class="language-plaintext highlighter-rouge">slim-control.admin.example</code></td>
      <td>50052</td>
      <td>TCP/gRPC+mTLS</td>
      <td>Controller</td>
    </tr>
    <tr>
      <td>Cluster B SLIM node</td>
      <td><code class="language-plaintext highlighter-rouge">slim-control.admin.example</code></td>
      <td>50052</td>
      <td>TCP/gRPC+mTLS</td>
      <td>Controller</td>
    </tr>
    <tr>
      <td>Cluster A SPIRE Server (nested)</td>
      <td><code class="language-plaintext highlighter-rouge">spire-root.admin.example</code></td>
      <td>8081</td>
      <td>TCP/mTLS</td>
      <td>SPIRE nested upstream</td>
    </tr>
    <tr>
      <td>Cluster B SPIRE Server (nested)</td>
      <td><code class="language-plaintext highlighter-rouge">spire-root.admin.example</code></td>
      <td>8081</td>
      <td>TCP/mTLS</td>
      <td>SPIRE nested upstream</td>
    </tr>
    <tr>
      <td>Agents (all clusters)</td>
      <td>local SLIM node</td>
      <td>46357</td>
      <td>TCP (local)</td>
      <td>Agent ↔ SLIM (cluster-internal)</td>
    </tr>
  </tbody>
</table>

<blockquote>
  <p><strong>No agent-to-agent rules, no agent-to-internet rules, and no inbound rules for workload SPIRE
servers are required.</strong></p>
</blockquote>
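<p>The outbound allow-list in the table can be derived mechanically from the cluster list. The sketch below is only an illustration: the hostnames and ports are taken from the table, while the data layout, the <code>outbound_rules</code> helper, and the assumption that every SLIM node peers with every other one are ours.</p>

```python
# Endpoints copied from the firewall rule summary; the data layout is an assumption.
SLIM_PORT = 46357
SHARED_ENDPOINTS = [
    ("slim-control.admin.example", 50052, "controller"),
    ("spire-root.admin.example", 8081, "SPIRE nested upstream"),
]
CLUSTERS = ["cluster-a", "cluster-b"]


def outbound_rules(cluster, all_clusters):
    """Return the (host, port, purpose) tuples one cluster must be allowed to reach."""
    peer_rules = [
        (f"slim.{peer}.example", SLIM_PORT, "data plane inter-cluster")
        for peer in all_clusters
        if peer != cluster
    ]
    return peer_rules + SHARED_ENDPOINTS


for name in CLUSTERS:
    for host, port, purpose in outbound_rules(name, CLUSTERS):
        print(f"{name}: allow outbound tcp to {host}:{port} ({purpose})")
```

<p>Under these assumptions the rule count per cluster depends on the number of clusters, not on the number of agents running inside them.</p>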

<h3 id="43-channel-subscription-model">4.3 Channel Subscription Model</h3>

<pre><code class="language-mermaid">sequenceDiagram
  participant CA as Security Agent (Cluster A)
  participant SA as SLIM Node A
  participant CTRL as Controller (Admin)
  participant SB as SLIM Node B
  participant CB as Security Agent (Cluster B)
  participant OP as Operator

  CA-&gt;&gt;SA: subscribe(acme/monitoring/security-incident)
  SA-&gt;&gt;CTRL: subscription update
  CTRL-&gt;&gt;SB: configure route to Cluster A for channel
  CB-&gt;&gt;SB: subscribe(acme/monitoring/security-incident)
  SB-&gt;&gt;CTRL: subscription update
  CTRL-&gt;&gt;SA: configure route to Cluster B for channel
  OP-&gt;&gt;SA: subscribe(acme/monitoring/security-incident)
  Note over SA,SB: All three parties now receive messages on this channel
</code></pre>
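<p>The route bookkeeping in this sequence can be mimicked with a toy in-memory controller. This is a simplified sketch, not the SLIM controller API: <code>ToyController</code> and its method names are hypothetical, and it creates a route only once both nodes have a subscriber for the channel, which follows the prose in section 3.2 rather than the exact message order in the diagram.</p>

```python
from collections import defaultdict


class ToyController:
    """In-memory stand-in for the controller; not the real SLIM controller API."""

    def __init__(self):
        self.subscribers = defaultdict(set)   # channel -> node names with local subscribers
        self.routes = defaultdict(set)        # node name -> set of (channel, peer node)

    def subscription_update(self, node, channel):
        """Record that `node` has a subscriber and wire routes to and from every peer."""
        for peer in self.subscribers[channel] - {node}:
            self.routes[node].add((channel, peer))
            self.routes[peer].add((channel, node))
        self.subscribers[channel].add(node)


ctrl = ToyController()
channel = "acme/monitoring/security-incident"
ctrl.subscription_update("slim-node-a", channel)   # Security Agent on Cluster A
ctrl.subscription_update("slim-node-b", channel)   # Security Agent on Cluster B
ctrl.subscription_update("slim-node-a", channel)   # Operator, also attached to node A
print(sorted(ctrl.routes["slim-node-a"]))
print(sorted(ctrl.routes["slim-node-b"]))
```

<p>Storing routes in sets makes the third update a no-op: the operator joining through node A adds a subscriber but no new inter-cluster route.</p>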

<h3 id="44-multi-channel-agent-participation">4.4 Multi-Channel Agent Participation</h3>

<p>A single agent can join multiple channels, each scoped to a different problem domain:</p>

<pre><code class="language-mermaid">flowchart TB
  AGENT["Security Agent (Cluster A)"]
  AGENT --&gt; CH1["acme/monitoring/security-incident\n← security topics"]
  AGENT --&gt; CH2["acme/remediation/cve-patch\n← remediation coordination"]
  AGENT --&gt; CH3["acme/escalation/human-review\n← when it needs a human"]
</code></pre>

<p>The agent does this by creating multiple SLIM sessions, one per channel
name. The SLIM session layer manages group membership independently per channel.</p>
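<p>One way to picture per-channel sessions is a separate inbox and worker per channel inside a single agent process. The sketch below uses plain <code>asyncio</code> queues as stand-ins for SLIM sessions; it does not call any SLIM SDK, and <code>channel_worker</code> is a hypothetical name.</p>

```python
import asyncio


async def channel_worker(name, inbox, handled):
    """Drain one channel's inbox until a None sentinel arrives."""
    while True:
        message = await inbox.get()
        if message is None:
            break
        handled.append((name, message))


async def main():
    channels = [
        "acme/monitoring/security-incident",
        "acme/remediation/cve-patch",
        "acme/escalation/human-review",
    ]
    # One queue per channel stands in for one SLIM session per channel.
    inboxes = {name: asyncio.Queue() for name in channels}
    handled = []
    workers = [
        asyncio.create_task(channel_worker(name, inbox, handled))
        for name, inbox in inboxes.items()
    ]
    inboxes[channels[0]].put_nowait("CVE detected in nginx:1.21")
    inboxes[channels[2]].put_nowait("approval requested")
    for inbox in inboxes.values():
        inbox.put_nowait(None)      # stop every worker
    await asyncio.gather(*workers)
    return handled


handled = asyncio.run(main())
print(handled)
```

<p>Each worker sees only its own channel's traffic, which mirrors the statement that membership is managed independently per channel.</p>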

<hr />

<h2 id="5-use-case-enterprise-cluster-monitoring-agent-fleet">5. Use Case: Enterprise Cluster-Monitoring Agent Fleet</h2>

<h3 id="51-scenario-description">5.1 Scenario Description</h3>

<p><strong>Company:</strong> ACME Corp<br />
<strong>Fleet:</strong> 20 Kubernetes clusters across 3 regions (US-East, EU-West, APAC)<br />
<strong>Problem:</strong> Security incident detected on a node in EU-West Cluster 7</p>

<ul>
  <li>Clusters in EU-West sit behind a corporate VPN; no inbound ports are permitted</li>
  <li>Cluster 7 runs a <strong>Security Monitoring Agent</strong> and a <strong>Remediation Agent</strong></li>
  <li>The cloud (US-East admin cluster) hosts the SLIM Controller and an <strong>Escalation Handler</strong></li>
  <li>A human operator is on call, working from a laptop connected to the corporate VPN</li>
</ul>

<h3 id="52-agent-roster">5.2 Agent Roster</h3>

<table>
  <thead>
    <tr>
      <th>Agent</th>
      <th>Channel(s)</th>
      <th>Location</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">acme/eu-west/security-monitor</code></td>
      <td><code class="language-plaintext highlighter-rouge">security-incident</code>, <code class="language-plaintext highlighter-rouge">escalation</code></td>
      <td>Cluster 7 (EU)</td>
      <td>Detects anomalies, CVEs</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">acme/eu-west/remediation</code></td>
      <td><code class="language-plaintext highlighter-rouge">security-incident</code>, <code class="language-plaintext highlighter-rouge">cve-patch</code></td>
      <td>Cluster 7 (EU)</td>
      <td>Automated fixes</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">acme/us-east/security-monitor</code></td>
      <td><code class="language-plaintext highlighter-rouge">security-incident</code></td>
      <td>Cluster 1 (US)</td>
      <td>Cross-region correlation</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">acme/admin/escalation-handler</code></td>
      <td><code class="language-plaintext highlighter-rouge">escalation</code></td>
      <td>Admin cluster</td>
      <td>LLM-backed escalation agent</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">acme/admin/human-operator</code></td>
      <td><code class="language-plaintext highlighter-rouge">escalation</code>, <code class="language-plaintext highlighter-rouge">security-incident</code></td>
      <td>Operator laptop</td>
      <td>Human observer / approver</td>
    </tr>
  </tbody>
</table>

<h3 id="53-event-types">5.3 Event Types</h3>

<table>
  <thead>
    <tr>
      <th>Event</th>
      <th>Trigger</th>
      <th>Severity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Anomalous pod CPU spike</td>
      <td>Metrics threshold</td>
      <td>Low</td>
    </tr>
    <tr>
      <td>CVE detected in running image</td>
      <td>Image scan</td>
      <td>Medium</td>
    </tr>
    <tr>
      <td>Node compromise indicators</td>
      <td>Audit log analysis</td>
      <td>High</td>
    </tr>
    <tr>
      <td>Policy violation cascade</td>
      <td>OPA/Gatekeeper alerts</td>
      <td>High</td>
    </tr>
    <tr>
      <td>Agent escalation</td>
      <td>Agent decision</td>
      <td>Critical</td>
    </tr>
  </tbody>
</table>
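<p>The event table above maps naturally onto an ordered enumeration. The following sketch copies the five rows as data; the <code>Severity</code> enum, the numeric ordering, and the <code>events_at_or_above</code> filter are illustrative assumptions rather than anything the MVP defines.</p>

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Rows copied from the event table above.
EVENT_SEVERITY = {
    "Anomalous pod CPU spike": Severity.LOW,
    "CVE detected in running image": Severity.MEDIUM,
    "Node compromise indicators": Severity.HIGH,
    "Policy violation cascade": Severity.HIGH,
    "Agent escalation": Severity.CRITICAL,
}


def events_at_or_above(threshold):
    """Return event names whose severity meets the threshold, most severe first."""
    matching = [(sev, name) for name, sev in EVENT_SEVERITY.items() if sev >= threshold]
    return [name for sev, name in sorted(matching, reverse=True)]


print(events_at_or_above(Severity.HIGH))
```

<p>An agent could use a filter like this to decide which events are worth waking up for, though the threshold itself would be a deployment choice.</p>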

<hr />

<h2 id="6-communication-flows">6. Communication Flows</h2>

<h3 id="61-flow-a-automated-security-incident-resolution">6.1 Flow A: Automated Security Incident Resolution</h3>

<pre><code class="language-mermaid">sequenceDiagram
  participant K8S as K8s Events (EU-West)
  participant SEC as Security Agent (EU)
  participant SA as SLIM Node A
  participant SB as SLIM Node B
  participant CORS as Security Agent (US-East)
  participant REM as Remediation Agent

  K8S-&gt;&gt;SEC: CVE detected in nginx:1.21 — Critical
  SEC-&gt;&gt;SA: publish(security-incident, cve=CVE-2024-XXXX image=nginx:1.21)
  SA-&gt;&gt;SB: route message (mTLS)
  SB-&gt;&gt;CORS: deliver
  SA-&gt;&gt;REM: deliver (same cluster)
  CORS-&gt;&gt;SB: reply(also_affected=[us-east-3])
  SB-&gt;&gt;SA: route reply
  SA-&gt;&gt;SEC: deliver reply
  REM-&gt;&gt;SA: publish(cve-patch, action=rolling-update target=nginx:1.25)
  Note over SEC,REM: Remediation proceeds autonomously — no human intervention needed
</code></pre>

<h3 id="62-flow-b-human-in-the-loop-escalation">6.2 Flow B: Human-in-the-Loop Escalation</h3>

<pre><code class="language-mermaid">sequenceDiagram
  participant REM as Remediation Agent
  participant SA as SLIM Node A
  participant SC as SLIM Node C (Admin)
  participant ESC as Escalation Handler
  participant OP as Human Operator

  REM-&gt;&gt;SA: publish(escalation, reason=cannot-auto-patch severity=CRITICAL)
  SA-&gt;&gt;SC: route to admin cluster
  SC-&gt;&gt;ESC: deliver to escalation handler
  SC-&gt;&gt;OP: deliver to operator (subscribed to channel)
  ESC-&gt;&gt;SC: publish(paging operator, preparing runbook)
  OP-&gt;&gt;SC: publish(decision=APPROVE_DRAIN node=eu-west-7-node-3)
  SC-&gt;&gt;SA: route approval back
  SA-&gt;&gt;REM: deliver operator approval
  REM-&gt;&gt;SA: publish(security-incident, status=RESOLVED action=node-drain)
  Note over OP,SA: Operator never needed direct access to Cluster 7 — only to SLIM
</code></pre>
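<p>The approval gate in Flow B can be sketched with a toy synchronous pub/sub bus. Everything here is a stand-in: <code>ToyBus</code> and <code>RemediationAgent</code> are hypothetical classes, messages are plain dictionaries, and no SLIM transport, MLS encryption, or identity check is modelled.</p>

```python
from collections import defaultdict


class ToyBus:
    """Synchronous in-memory pub/sub; a stand-in for SLIM, not its API."""

    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, channel, handler):
        self.handlers[channel].append(handler)

    def publish(self, channel, message):
        for handler in list(self.handlers[channel]):
            handler(message)


class RemediationAgent:
    def __init__(self, bus):
        self.bus = bus
        self.log = []
        self.pending_node = None
        bus.subscribe("escalation", self.on_escalation)

    def escalate(self, node):
        self.pending_node = node
        self.bus.publish("escalation", {"reason": "cannot-auto-patch", "node": node})

    def on_escalation(self, message):
        # Act only on an operator decision that names the node we escalated.
        approved = message.get("decision") == "APPROVE_DRAIN"
        if approved and message.get("node") == self.pending_node:
            self.log.append(f"drain {self.pending_node}")
            self.bus.publish("security-incident", {"status": "RESOLVED", "action": "node-drain"})


bus = ToyBus()
incident_log = []
bus.subscribe("security-incident", incident_log.append)
agent = RemediationAgent(bus)
agent.escalate("eu-west-7-node-3")
# The operator publishes on the same channel; no access to Cluster 7 is involved.
bus.publish("escalation", {"decision": "APPROVE_DRAIN", "node": "eu-west-7-node-3"})
print(agent.log, incident_log)
```

<p>The agent ignores its own escalation message and any decision for a different node, so the drain runs only after a matching approval arrives on the channel.</p>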

<h3 id="63-flow-c-agent-joining-multiple-channels">6.3 Flow C: Agent Joining Multiple Channels</h3>

<pre><code class="language-mermaid">flowchart LR
  subgraph CLUSTER_7["Cluster 7 (EU-West)"]
    SEC_7["Security Agent\nacme/eu-west/security-monitor"]
    SLIM7["SLIM Node"]
    SEC_7 --&gt;|"subscribe × 3"| SLIM7
  end

  CH1["Channel:\nacme/monitoring/security-incident"]
  CH2["Channel:\nacme/remediation/cve-patch"]
  CH3["Channel:\nacme/escalation/human-review"]

  SLIM7 --- CH1
  SLIM7 --- CH2
  SLIM7 --- CH3

  CH1 --- SEC_US["US-East\nSecurity Agent"]
  CH2 --- REM_7["Remediation Agent\n(same cluster)"]
  CH3 --- ESC_ADMIN["Escalation Handler\n(Admin cluster)"]
  CH3 --- OP["👤 Operator"]
</code></pre>

<h3 id="64-message-flow-timeline-full-incident-lifecycle">6.4 Message Flow Timeline (Full Incident Lifecycle)</h3>

<pre><code class="language-mermaid">gantt
  title Security Incident Lifecycle — SLIM-enabled
  dateFormat  mm:ss
  axisFormat  %M:%S

  section Detection
  CVE scan triggers event        : 00:00, 5s
  Security Agent wakes up        : 00:05, 3s

  section Agent Collaboration
  Publish to security-incident   : 00:08, 2s
  Cross-cluster agents respond   : 00:10, 8s
  Remediation Agent attempts fix : 00:18, 15s

  section Escalation
  Remediation fails, escalate    : 00:33, 3s
  Escalation handler notified    : 00:36, 5s
  Operator receives page         : 00:41, 10s

  section Resolution
  Operator approves action       : 00:51, 5s
  Remediation executes drain     : 00:56, 20s
  Incident closed, channels idle : 01:16, 5s
</code></pre>
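<p>As a quick arithmetic check on the chart, the steps can be added up to confirm that each one starts where the previous one ends. The start times and durations below are copied from the Gantt definition; the helper names are ours.</p>

```python
# (label, start "mm:ss", duration in seconds), copied from the Gantt chart above.
TIMELINE = [
    ("CVE scan triggers event", "00:00", 5),
    ("Security Agent wakes up", "00:05", 3),
    ("Publish to security-incident", "00:08", 2),
    ("Cross-cluster agents respond", "00:10", 8),
    ("Remediation Agent attempts fix", "00:18", 15),
    ("Remediation fails, escalate", "00:33", 3),
    ("Escalation handler notified", "00:36", 5),
    ("Operator receives page", "00:41", 10),
    ("Operator approves action", "00:51", 5),
    ("Remediation executes drain", "00:56", 20),
    ("Incident closed, channels idle", "01:16", 5),
]


def to_seconds(stamp):
    """Convert an 'mm:ss' string to a number of seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def total_elapsed(timeline):
    """Return end-to-end seconds, raising if a step does not start where the last one ended."""
    origin = to_seconds(timeline[0][1])
    clock = origin
    for label, start, duration in timeline:
        if to_seconds(start) != clock:
            raise ValueError(f"gap or overlap before {label!r}")
        clock += duration
    return clock - origin


total = total_elapsed(TIMELINE)
print(f"{total} s end to end")
```

<p>The steps are contiguous and sum to 81 seconds, so the illustrative lifecycle ends at 01:21.</p>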

<hr />

<h2 id="7-security-model">7. Security Model</h2>

<h3 id="71-layered-security-architecture">7.1 Layered Security Architecture</h3>

<pre><code class="language-mermaid">flowchart TB
  L4["Layer 4: Application-level authorization\nAgents validate sender SPIFFE SVID in message metadata"]
  L3["Layer 3: End-to-end MLS encryption (RFC 9420)\nSLIM routers cannot read agent payloads"]
  L2["Layer 2: Transport mTLS between SLIM nodes\nSPIRE-issued SVID certificates, auto-rotated"]
  L1["Layer 1: Workload identity — SPIFFE/SPIRE\nZero-trust · no static secrets · no shared API keys"]
  L4 --&gt; L3 --&gt; L2 --&gt; L1
</code></pre>

<h3 id="72-spire-nested-deployment-across-clusters">7.2 SPIRE Nested Deployment Across Clusters</h3>

<p>SPIFFE peer <strong>federation</strong> requires every SPIRE server to expose an endpoint reachable
from the other clusters — not viable in VPN-restricted or air-gapped environments.</p>

<p><strong>SPIRE nested deployment</strong> solves this: workload-cluster SPIRE servers act as nested
(downstream) servers and connect <em>outbound</em> to the admin root SPIRE server. Only the
admin SPIRE server requires an external endpoint.</p>

<pre><code class="language-mermaid">flowchart TB
  subgraph ADMIN["Admin Cluster — root SPIRE server\nspire-root.admin.example:8081 (public)"]
    SPIRE_ROOT["SPIRE Server (root)\ntrust domain: acme.example"]
    CTRL_SVID["SVID: spiffe://acme.example/slim/controller"]
    SPIRE_ROOT --- CTRL_SVID
  end

  subgraph CLUSTER_A["Cluster A — nested SPIRE server\n(no external endpoint required)"]
    SPIRE_NESTED_A["SPIRE Server (nested)\ntrust domain: acme.example"]
    NODE_A_SVID["SVID: spiffe://acme.example/cluster-a/slim/node-0"]
    AGENT_A_SVID["SVID: spiffe://acme.example/cluster-a/agent/security-monitor"]
    SPIRE_NESTED_A --- NODE_A_SVID
    SPIRE_NESTED_A --- AGENT_A_SVID
  end

  subgraph CLUSTER_B["Cluster B — nested SPIRE server\n(no external endpoint required)"]
    SPIRE_NESTED_B["SPIRE Server (nested)\ntrust domain: acme.example"]
    NODE_B_SVID["SVID: spiffe://acme.example/cluster-b/slim/node-0"]
    AGENT_B_SVID["SVID: spiffe://acme.example/cluster-b/agent/security-monitor"]
    SPIRE_NESTED_B --- NODE_B_SVID
    SPIRE_NESTED_B --- AGENT_B_SVID
  end

  SPIRE_NESTED_A --&gt;|"outbound upstream connection"| SPIRE_ROOT
  SPIRE_NESTED_B --&gt;|"outbound upstream connection"| SPIRE_ROOT
</code></pre>

<p>All SVIDs are issued under the <strong>single shared trust domain</strong> (<code class="language-plaintext highlighter-rouge">acme.example</code>).
The root SPIRE server is the chain-of-trust anchor; it delegates issuance to the nested servers,
which sign SVIDs for their local workloads. SLIM nodes on different clusters can mutually authenticate because
they share the same trust domain and their certificates chain to the same root.</p>
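<p>The application-level check named in Layer 4 might look like the sketch below. It assumes the SVID has already been cryptographically verified by the transport and only inspects the SPIFFE ID string; <code>is_allowed_sender</code> and its path-suffix policy are hypothetical, not a SLIM or SPIRE API.</p>

```python
from urllib.parse import urlsplit

TRUST_DOMAIN = "acme.example"


def parse_spiffe_id(spiffe_id):
    """Return (trust_domain, path) for a 'spiffe://domain/path' ID, or raise ValueError."""
    parts = urlsplit(spiffe_id)
    if parts.scheme != "spiffe" or not parts.netloc or parts.query or parts.fragment:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    return parts.netloc, parts.path


def is_allowed_sender(spiffe_id, allowed_path_suffix):
    """Hypothetical policy: same trust domain and a path ending in the expected role."""
    try:
        domain, path = parse_spiffe_id(spiffe_id)
    except ValueError:
        return False
    return domain == TRUST_DOMAIN and path.endswith(allowed_path_suffix)


role = "/agent/security-monitor"
print(is_allowed_sender("spiffe://acme.example/cluster-a/agent/security-monitor", role))
print(is_allowed_sender("spiffe://evil.example/cluster-a/agent/security-monitor", role))
```

<p>A string comparison like this is only an authorization step layered on top of certificate validation; it is not a substitute for verifying the SVID's signature chain.</p>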

<h3 id="73-security-properties">7.3 Security Properties</h3>

<table>
  <thead>
    <tr>
      <th>Property</th>
      <th>Mechanism</th>
      <th>Benefit</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>No exposed agent ports</td>
      <td>SLIM outbound-only model</td>
      <td>Eliminates an entire class of attack surface</td>
    </tr>
    <tr>
      <td>Workload identity</td>
      <td>SPIRE SVID (X.509 + JWT)</td>
      <td>No static credentials — identity is cryptographic</td>
    </tr>
    <tr>
      <td>Inter-node transport</td>
      <td>mTLS with SPIRE-issued certs</td>
      <td>Auto-rotated, zero-touch cert management</td>
    </tr>
    <tr>
      <td>Agent payload privacy</td>
      <td>MLS group encryption</td>
      <td>SLIM routing nodes cannot read payloads</td>
    </tr>
    <tr>
      <td>Operator access control</td>
      <td>Channel subscription + SVID</td>
      <td>Operator only joins channels; no cluster-level access needed</td>
    </tr>
    <tr>
      <td>Audit trail</td>
      <td>SLIM controller events + SPIRE attestation</td>
      <td>Full provenance of who joined which channel</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="8-demo-scenario">8. Demo Scenario</h2>

<h3 id="81-environment-setup">8.1 Environment Setup</h3>

<p>The demo uses three kind clusters to simulate the production environment:</p>

<table>
  <thead>
    <tr>
      <th>Cluster</th>
      <th>Role</th>
      <th>Simulates</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">kind-admin.example</code></td>
      <td>SLIM Controller + SPIRE Server</td>
      <td>Cloud management plane</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">kind-cluster-a.example</code></td>
      <td>SLIM nodes + Security &amp; Remediation Agents</td>
      <td>On-prem cluster (VPN-restricted)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">kind-cluster-b.example</code></td>
      <td>SLIM nodes + Security Agent</td>
      <td>Cloud cluster (remote region)</td>
    </tr>
  </tbody>
</table>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># 1. Start clusters and install SPIRE (nested deployment)</span>
<span class="nb">sudo </span>task multi-cluster:up

<span class="c"># 2. Deploy Controller on admin cluster</span>
task controller:deploy

<span class="c"># 3. Deploy SLIM on workload clusters</span>
task slim:deploy

<span class="c"># 4. Deploy agents</span>
task demo:agents:deploy
</code></pre></div></div>

<h3 id="82-demo-script--step-by-step">8.2 Demo Script — Step by Step</h3>

<h4 id="step-1-show-baseline--agents-running-channels-empty">Step 1: Show baseline — agents running, channels empty</h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Show agents are running but NOT listening on any port</span>
kubectl get pods <span class="nt">-n</span> default  <span class="c"># agents running</span>
kubectl <span class="nb">exec</span> <span class="nt">-it</span> &lt;security-agent-pod&gt; <span class="nt">--</span> ss <span class="nt">-tlnp</span>  <span class="c"># expect no listening TCP sockets</span>
</code></pre></div></div>

<h4 id="step-2-inject-a-simulated-cve-event">Step 2: Inject a simulated CVE event</h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Inject a CVE detection event into Security Agent on Cluster A</span>
kubectl <span class="nb">exec</span> <span class="nt">-it</span> &lt;security-agent-pod&gt; <span class="nt">--</span> <span class="se">\</span>
  python3 inject_event.py <span class="nt">--type</span> cve <span class="nt">--severity</span> critical <span class="nt">--image</span> nginx:1.21
</code></pre></div></div>

<h4 id="step-3-watch-agents-collaborate-on-the-security-incident-channel">Step 3: Watch agents collaborate on the security-incident channel</h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Follow the channel in real time (operator view)</span>
slimctl channel subscribe acme/monitoring/security-incident <span class="nt">--cluster</span> admin.example
</code></pre></div></div>

<p>Expected output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[00:00] eu-west/security-monitor  → CVE-2024-XXXX detected in nginx:1.21
[00:02] us-east/security-monitor  → Confirmed affected: us-east-3 also running nginx:1.21
[00:04] eu-west/remediation       → Attempting rolling update to nginx:1.25
[00:19] eu-west/remediation       → Node tainted, cannot auto-patch — escalating
</code></pre></div></div>
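<p>For scripted demos it can help to turn these log lines into structured records, for example to measure the time from detection to escalation. The sketch below assumes the <code class="language-plaintext highlighter-rouge">[MM:SS] sender → message</code> layout of the sample output; the names <code class="language-plaintext highlighter-rouge">ChannelLine</code> and <code class="language-plaintext highlighter-rouge">parse_line</code> are illustrative.</p>

```python
import re
from typing import NamedTuple, Optional


class ChannelLine(NamedTuple):
    seconds: int  # offset from the start of the demo
    sender: str   # e.g. "eu-west/security-monitor"
    text: str


# "[MM:SS] sender  → message", matching the sample output layout.
LINE_RE = re.compile(r"^\[(\d{2}):(\d{2})\]\s+(\S+)\s+→\s+(.*)$")


def parse_line(line: str) -> Optional[ChannelLine]:
    """Parse one log line; return None for lines in any other shape."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    minutes, seconds, sender, text = match.groups()
    return ChannelLine(int(minutes) * 60 + int(seconds), sender, text)


first = parse_line("[00:00] eu-west/security-monitor  → CVE-2024-XXXX detected in nginx:1.21")
last = parse_line("[00:19] eu-west/remediation       → Node tainted, cannot auto-patch")
print(last.seconds - first.seconds)
```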

<h4 id="step-4-observe-escalation-to-human-operator">Step 4: Observe escalation to human operator</h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Operator joins escalation channel from their laptop</span>
slimctl channel subscribe acme/escalation/human-review
</code></pre></div></div>

<p>Expected output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[00:22] eu-west/security-monitor  → Escalation requested: node-drain required
[00:24] admin/escalation-handler  → Runbook loaded, paging operator
[APPROVAL REQUIRED]
</code></pre></div></div>

<h4 id="step-5-operator-approves-and-watches-resolution">Step 5: Operator approves and watches resolution</h4>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Operator sends approval</span>
slimctl channel publish acme/escalation/human-review <span class="se">\</span>
  <span class="s1">'{"decision": "APPROVE_DRAIN", "node": "eu-west-7-node-3"}'</span>
</code></pre></div></div>
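<p>Because the approval triggers a node drain, the receiving agent should validate it before acting. A minimal sketch of such a check follows, assuming the payload is the JSON object shown above. <code class="language-plaintext highlighter-rouge">APPROVE_DRAIN</code> is the only decision value that appears in this post, and the function name is hypothetical.</p>

```python
import json

# Only decision value shown in the demo; any other value is rejected here.
ALLOWED_DECISIONS = {"APPROVE_DRAIN"}


def parse_decision(raw: str) -> tuple[str, str]:
    """Return (decision, node), or raise ValueError for a malformed approval."""
    try:
        message = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"approval is not valid JSON: {exc}") from exc
    if not isinstance(message, dict):
        raise ValueError("approval must be a JSON object")
    decision, node = message.get("decision"), message.get("node")
    if decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    if not isinstance(node, str) or not node:
        raise ValueError("approval must name a target node")
    return decision, node


print(parse_decision('{"decision": "APPROVE_DRAIN", "node": "eu-west-7-node-3"}'))
```

<p>Payload validation complements, and does not replace, checking the sender's identity before a destructive action is carried out.</p>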

<p>Expected output:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[00:31] eu-west/remediation       → Node drain initiated
[00:51] eu-west/remediation       → Drain complete, workloads rescheduled
[00:52] eu-west/security-monitor  → Incident resolved — CVE remediated
</code></pre></div></div>

<h3 id="83-demo-talking-points">8.3 Demo Talking Points</h3>

<ol>
  <li>
    <p><strong>“Notice that no agent opened a port”</strong> — <code class="language-plaintext highlighter-rouge">ss -tlnp</code> lists no listening TCP
sockets, yet cross-cluster messaging worked transparently.</p>
  </li>
  <li>
    <p><strong>“The only firewall rules needed”</strong> — point to the outbound rules for the SLIM node
endpoints, the controller, and the root SPIRE server (section 4.2), not the dozens of agent-pair rules a
traditional approach would require.</p>
  </li>
  <li>
    <p><strong>“The operator joined from a laptop”</strong> — connected to corporate VPN, authenticated via
SPIRE, and joined the channel. No jump host, no kubectl proxy, no cluster-level access.</p>
  </li>
  <li>
    <p><strong>“MLS means SLIM nodes are zero-knowledge”</strong> — the SLIM routers forwarded every byte
without being able to read a single word of the agent messages.</p>
  </li>
  <li>
    <p><strong>“Adding a new cluster is trivial”</strong> — register it with SPIRE, deploy SLIM with the
correct group name, point it at the controller. No routing rule changes in other
clusters.</p>
  </li>
</ol>

<hr />

<h2 id="9-key-value-propositions">9. Key Value Propositions</h2>

<h3 id="91-simplified-operations">9.1 Simplified Operations</h3>

<table>
  <thead>
    <tr>
      <th>Before SLIM</th>
      <th>With SLIM</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Firewall rules: O(n²) agent pairs</td>
      <td>Firewall rules: O(n) SLIM nodes</td>
    </tr>
    <tr>
      <td>DNS + cert per agent</td>
      <td>Zero per-agent network config</td>
    </tr>
    <tr>
      <td>Service registry maintenance</td>
      <td>Channel names are self-describing, DNS-free</td>
    </tr>
    <tr>
      <td>VPN access required per cluster</td>
      <td>One SLIM endpoint per cluster suffices</td>
    </tr>
  </tbody>
</table>
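<p>The first row can be made concrete with a little arithmetic. The sketch below compares one rule per agent pair with the outbound rules listed in section 4.2, assuming every cluster's SLIM node peers with every other cluster (a full mesh) and needs one rule each for the controller and the root SPIRE server. The function names and the full-mesh assumption are illustrative.</p>

```python
def pairwise_rules(agents: int) -> int:
    """Rules needed if every agent pair gets its own firewall rule: n(n-1)/2."""
    return agents * (agents - 1) // 2


def slim_outbound_rules(clusters: int) -> int:
    """Outbound rules with SLIM, assuming a full mesh of SLIM nodes.

    Each cluster allows traffic to every peer SLIM node (clusters - 1),
    to the controller (1) and to the root SPIRE server (1).
    """
    return clusters * (clusters - 1 + 2)


# Two clusters reproduce the six inter-cluster rows of the rule summary in section 4.2.
print(slim_outbound_rules(2))
# 20 clusters with three agents each (an illustrative fleet size).
print(pairwise_rules(60), slim_outbound_rules(20))
```

<p>For the illustrative 20-cluster fleet this gives 1770 agent-pair rules against 420 outbound rules, and the SLIM figure does not grow when more agents are added to a cluster.</p>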

<h3 id="92-enhanced-security">9.2 Enhanced Security</h3>

<ul>
  <li><strong>Zero server exposure per agent</strong> — drastically reduces attack surface</li>
  <li><strong>End-to-end MLS encryption</strong> — infrastructure operators cannot read agent payloads</li>
  <li><strong>Cryptographic workload identity</strong> — SPIRE eliminates static credential sprawl</li>
  <li><strong>Principle of least privilege</strong> — operators observe only what they subscribe to</li>
</ul>
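<p>Application-level authorization (layer 4 in section 7.1) typically ends in a check on the sender's SPIFFE ID. The following is a minimal sketch of that last step, assuming the SVID has already been cryptographically verified elsewhere; string matching alone is not authentication. The helper name is hypothetical, while the trust domain and ID layout follow the examples in section 7.2.</p>

```python
from urllib.parse import urlparse

TRUST_DOMAIN = "acme.example"  # trust domain used throughout this post


def spiffe_path(spiffe_id: str, trust_domain: str = TRUST_DOMAIN) -> str:
    """Return the workload path of a SPIFFE ID, or raise ValueError if it is not ours."""
    parsed = urlparse(spiffe_id)
    if parsed.scheme != "spiffe":
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    if parsed.netloc != trust_domain:
        raise ValueError(f"foreign trust domain: {parsed.netloc!r}")
    if not parsed.path.strip("/"):
        raise ValueError("SPIFFE ID has no workload path")
    return parsed.path


print(spiffe_path("spiffe://acme.example/cluster-a/agent/security-monitor"))
```

<p>An agent could then compare the returned path against an allow-list per channel, so that only remediation agents, for instance, are accepted as senders of patch actions.</p>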

<h3 id="93-developer-experience">9.3 Developer Experience</h3>

<ul>
  <li>Agents use high-level SLIMRPC or pub/sub APIs — no networking code</li>
  <li>Language support: <strong>Python, Go, Java, .NET (C#), Kotlin, Rust</strong></li>
  <li>Protocol support: <strong>A2A, MCP, custom protobuf</strong></li>
  <li>Session types: <strong>point-to-point, multicast (group), streaming</strong></li>
</ul>

<h3 id="94-operational-scalability">9.4 Operational Scalability</h3>

<ul>
  <li><strong>Dynamic topology</strong> — agents join/leave channels without reconfiguration</li>
  <li><strong>Auto-routing</strong> — SLIM Controller creates inter-cluster routes on first subscription</li>
  <li><strong>Fleet-scale</strong> — tested with large StatefulSet deployments; a single control plane serves the whole fleet</li>
  <li><strong>Multi-language agents</strong> — Python ML agent ↔ Go infrastructure agent ↔ Java app agent,
all on the same channel</li>
</ul>
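<p>The auto-routing behaviour can be pictured with a toy model: the controller records which clusters subscribe to which channel and connects every pair of clusters that share one. This is a simplified sketch of the idea only; the class and method names are invented for illustration and do not reflect the SLIM Controller's implementation.</p>

```python
from collections import defaultdict
from itertools import combinations


class ToyController:
    """Tracks which clusters subscribe to which channel and derives routes."""

    def __init__(self) -> None:
        self._subscribers: dict[str, set[str]] = defaultdict(set)

    def subscribe(self, cluster: str, channel: str) -> None:
        self._subscribers[channel].add(cluster)

    def routes(self, channel: str) -> set[tuple[str, str]]:
        """One undirected route per pair of clusters that share the channel."""
        clusters = sorted(self._subscribers.get(channel, set()))
        return set(combinations(clusters, 2))


ctrl = ToyController()
channel = "acme/monitoring/security-incident"
ctrl.subscribe("cluster-a", channel)
print(ctrl.routes(channel))  # a single subscriber needs no inter-cluster route
ctrl.subscribe("cluster-b", channel)
print(ctrl.routes(channel))
```

<p>In this model a route only appears once a second cluster subscribes, which mirrors the subscription sequence in section 4.3.</p>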

<hr />

<h2 id="10-implementation-notes">10. Implementation Notes</h2>

<h3 id="101-tech-stack">10.1 Tech Stack</h3>

<table>
  <thead>
    <tr>
      <th>Component</th>
      <th>Technology</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>SLIM Node (data plane)</td>
      <td>Rust — high-performance message router</td>
    </tr>
    <tr>
      <td>SLIM Session Layer</td>
      <td>Rust — MLS (RFC 9420) encryption + group management</td>
    </tr>
    <tr>
      <td>SLIMRPC</td>
      <td>Protobuf + generated stubs (all languages)</td>
    </tr>
    <tr>
      <td>SLIM Controller</td>
      <td>Go (Kubernetes controller pattern)</td>
    </tr>
    <tr>
      <td>Workload identity</td>
      <td>SPIRE / SPIFFE</td>
    </tr>
    <tr>
      <td>Demo agents</td>
      <td>Python (ADK or custom SLIM Python bindings)</td>
    </tr>
    <tr>
      <td>Deployment</td>
      <td>Kubernetes (Helm charts)</td>
    </tr>
    <tr>
      <td>Observability</td>
      <td>OpenTelemetry (SLIM has native OTEL support)</td>
    </tr>
  </tbody>
</table>

<h3 id="102-key-apis-used-in-demo-agents">10.2 Key APIs Used in Demo Agents</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">slim_bindings</span>

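<span class="c1"># Agent startup: outbound connection only, no listening port.
</span><span class="c1"># slim_node_addr, agent_name, shared_secret, payload_bytes and handle_event are
</span><span class="c1"># placeholders; the awaits below must run inside an async function.
</span><span class="c1"># Note: the MVP requirements (section 2.3) call for SPIRE-issued identity rather than a shared secret.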
</span><span class="n">svc</span> <span class="o">=</span> <span class="n">slim_bindings</span><span class="p">.</span><span class="n">Service</span><span class="p">.</span><span class="n">new</span><span class="p">(</span><span class="n">slim_node_addr</span><span class="p">)</span>
<span class="n">app</span> <span class="o">=</span> <span class="n">svc</span><span class="p">.</span><span class="n">create_app_with_secret</span><span class="p">(</span><span class="n">agent_name</span><span class="p">,</span> <span class="n">shared_secret</span><span class="p">)</span>
<span class="n">conn_id</span> <span class="o">=</span> <span class="n">svc</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="n">slim_bindings</span><span class="p">.</span><span class="n">ClientConfig</span><span class="p">.</span><span class="n">new_with_tls</span><span class="p">(</span><span class="n">slim_node_addr</span><span class="p">))</span>

<span class="c1"># Subscribe to the agent's own name on this connection
</span><span class="n">app</span><span class="p">.</span><span class="n">subscribe</span><span class="p">(</span><span class="n">app</span><span class="p">.</span><span class="n">name</span><span class="p">(),</span> <span class="n">conn_id</span><span class="p">)</span>

<span class="c1"># Join the security-incident group channel
</span><span class="n">session_cfg</span> <span class="o">=</span> <span class="n">slim_bindings</span><span class="p">.</span><span class="n">SessionConfig</span><span class="p">(</span>
    <span class="n">session_type</span><span class="o">=</span><span class="n">slim_bindings</span><span class="p">.</span><span class="n">SessionType</span><span class="p">.</span><span class="n">MULTICAST</span><span class="p">,</span>
    <span class="n">mls_enabled</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">session</span><span class="p">,</span> <span class="n">done</span> <span class="o">=</span> <span class="k">await</span> <span class="n">app</span><span class="p">.</span><span class="n">create_session</span><span class="p">(</span><span class="n">session_cfg</span><span class="p">,</span> <span class="s">"acme/monitoring/security-incident"</span><span class="p">)</span>
<span class="k">await</span> <span class="n">done</span>  <span class="c1"># wait for group to form
</span>
<span class="c1"># Publish an event
</span><span class="k">await</span> <span class="n">app</span><span class="p">.</span><span class="n">publish</span><span class="p">(</span><span class="s">"acme/monitoring/security-incident"</span><span class="p">,</span> <span class="n">payload_bytes</span><span class="p">)</span>

<span class="c1"># Receive messages (event-driven)
</span><span class="k">async</span> <span class="k">for</span> <span class="n">msg</span> <span class="ow">in</span> <span class="n">app</span><span class="p">.</span><span class="n">receive</span><span class="p">():</span>
    <span class="k">await</span> <span class="n">handle_event</span><span class="p">(</span><span class="n">msg</span><span class="p">)</span>
</code></pre></div></div>
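<p>The snippet above leaves <code class="language-plaintext highlighter-rouge">handle_event</code> undefined. One possible shape is a small dispatcher keyed on the event type, sketched below. It assumes the handler receives the raw payload bytes and that payloads are JSON objects with a <code class="language-plaintext highlighter-rouge">type</code> field; both are assumptions, and the real message object in the SLIM bindings may differ.</p>

```python
import asyncio
import json

HANDLERS = {}  # event type -> coroutine function


def on(event_type):
    """Register a coroutine as the handler for one event type."""
    def register(func):
        HANDLERS[event_type] = func
        return func
    return register


@on("cve")
async def handle_cve(event: dict) -> str:
    return f"checking local workloads for {event['image']}"


async def handle_event(payload: bytes) -> str:
    """Decode a JSON payload and dispatch on its "type" field."""
    event = json.loads(payload.decode("utf-8"))
    handler = HANDLERS.get(event.get("type"))
    if handler is None:
        return "ignored"  # no handler registered for this event type
    return await handler(event)


print(asyncio.run(handle_event(b'{"type": "cve", "image": "nginx:1.21"}')))
```

<p>Keeping handlers in a registry lets an agent that joins several channels add behaviour per event type without touching the receive loop.</p>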

<h3 id="103-helm-deployment-snippet">10.3 Helm Deployment Snippet</h3>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># values-cluster-a.yaml</span>
<span class="na">slim</span><span class="pi">:</span>
  <span class="na">config</span><span class="pi">:</span>
    <span class="na">services</span><span class="pi">:</span>
      <span class="na">slim/0</span><span class="pi">:</span>
        <span class="na">node_id</span><span class="pi">:</span> <span class="s2">"</span><span class="s">${env:SLIM_SVC_ID}"</span>
        <span class="na">group_name</span><span class="pi">:</span> <span class="s2">"</span><span class="s">cluster-a.example"</span>
        <span class="na">dataplane</span><span class="pi">:</span>
          <span class="na">servers</span><span class="pi">:</span>
            <span class="pi">-</span> <span class="na">endpoint</span><span class="pi">:</span> <span class="s2">"</span><span class="s">0.0.0.0:46357"</span>
              <span class="na">metadata</span><span class="pi">:</span>
                <span class="na">local_endpoint</span><span class="pi">:</span> <span class="s2">"</span><span class="s">${env:MY_POD_IP}"</span>
                <span class="na">external_endpoint</span><span class="pi">:</span> <span class="s2">"</span><span class="s">slim.cluster-a.example:46357"</span>
              <span class="na">tls</span><span class="pi">:</span>
                <span class="na">source</span><span class="pi">:</span>
                  <span class="na">type</span><span class="pi">:</span> <span class="s">spire</span>
        <span class="na">controller</span><span class="pi">:</span>
          <span class="na">clients</span><span class="pi">:</span>
            <span class="pi">-</span> <span class="na">endpoint</span><span class="pi">:</span> <span class="s2">"</span><span class="s">https://slim-control.admin.example:50052"</span>
              <span class="na">tls</span><span class="pi">:</span>
                <span class="na">source</span><span class="pi">:</span>
                  <span class="na">type</span><span class="pi">:</span> <span class="s">spire</span>
</code></pre></div></div>
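<p>The <code class="language-plaintext highlighter-rouge">${env:NAME}</code> placeholders are filled from the pod environment at start-up. As a rough illustration of that substitution, the sketch below expands them from a mapping. It is a stand-in written for this post, not SLIM's own resolver, and the sample IP address is made up.</p>

```python
import re
from typing import Mapping

# Matches ${env:NAME} where NAME is a conventional environment variable name.
ENV_REF = re.compile(r"\$\{env:([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: str, env: Mapping[str, str]) -> str:
    """Replace each ${env:NAME} in value; raise KeyError if NAME is unset."""
    return ENV_REF.sub(lambda match: env[match.group(1)], value)


pod_env = {"MY_POD_IP": "10.0.0.7"}  # illustrative value
print(expand_env("${env:MY_POD_IP}", pod_env))
print(expand_env("slim.cluster-a.example:46357", pod_env))  # no placeholder, unchanged
```

<p>Failing loudly on an unset variable is a deliberate choice in this sketch: a node that starts with an empty <code class="language-plaintext highlighter-rouge">local_endpoint</code> would be harder to diagnose than one that refuses to start.</p>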

<h3 id="104-prerequisites-for-demo-environment">10.4 Prerequisites for Demo Environment</h3>

<ul>
  <li><code class="language-plaintext highlighter-rouge">kind</code> v0.20+</li>
  <li><code class="language-plaintext highlighter-rouge">kubectl</code> v1.28+</li>
  <li><code class="language-plaintext highlighter-rouge">helm</code> v3.12+</li>
  <li><code class="language-plaintext highlighter-rouge">task</code> (Taskfile runner)</li>
  <li><code class="language-plaintext highlighter-rouge">spire-server</code> / <code class="language-plaintext highlighter-rouge">spire-agent</code> (deployed via Helm chart in clusters)</li>
  <li>SLIM Helm charts (<code class="language-plaintext highlighter-rouge">charts/slim</code>, <code class="language-plaintext highlighter-rouge">charts/slim-control-plane</code>)</li>
  <li>Docker for building demo agent images</li>
</ul>

<hr />

<h2 id="appendix-a-glossary">Appendix A: Glossary</h2>

<table>
  <thead>
    <tr>
      <th>Term</th>
      <th>Definition</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>SLIM</strong></td>
      <td>Secure Low-Latency Interactive Messaging — the transport framework</td>
    </tr>
    <tr>
      <td><strong>SLIMRPC</strong></td>
      <td>SLIM’s request/response RPC layer built on top of the session layer</td>
    </tr>
    <tr>
      <td><strong>MLS</strong></td>
      <td>Messaging Layer Security (RFC 9420) — E2E encryption for groups</td>
    </tr>
    <tr>
      <td><strong>SPIRE</strong></td>
      <td>SPIFFE Runtime Environment — issues SVID certificates to workloads</td>
    </tr>
    <tr>
      <td><strong>SPIFFE</strong></td>
      <td>Secure Production Identity Framework for Everyone</td>
    </tr>
    <tr>
      <td><strong>SVID</strong></td>
      <td>SPIFFE Verifiable Identity Document (X.509 cert or JWT)</td>
    </tr>
    <tr>
      <td><strong>Channel</strong></td>
      <td>A named pub/sub topic in SLIM (hierarchical: org/ns/topic)</td>
    </tr>
    <tr>
      <td><strong>Group session</strong></td>
      <td>A SLIM multicast session with multiple subscribers</td>
    </tr>
    <tr>
      <td><strong>Data plane</strong></td>
      <td>SLIM message routing layer (pure forwarding, zero payload inspection)</td>
    </tr>
    <tr>
      <td><strong>Control plane</strong></td>
      <td>SLIM management layer (route configuration, monitoring)</td>
    </tr>
    <tr>
      <td><strong>Session layer</strong></td>
      <td>SLIM encryption + group membership layer (sits above data plane)</td>
    </tr>
  </tbody>
</table>
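<p>The hierarchical channel naming in the glossary (<code class="language-plaintext highlighter-rouge">org/ns/topic</code>) is easy to validate in agent code. A minimal sketch, assuming exactly three non-empty segments separated by slashes; the <code class="language-plaintext highlighter-rouge">Channel</code> type and <code class="language-plaintext highlighter-rouge">parse_channel</code> helper are illustrative names.</p>

```python
from typing import NamedTuple


class Channel(NamedTuple):
    org: str
    namespace: str
    topic: str

    def __str__(self) -> str:
        return "/".join(self)


def parse_channel(name: str) -> Channel:
    """Split "org/namespace/topic" into its parts, rejecting other shapes."""
    parts = name.split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected org/namespace/topic, got {name!r}")
    return Channel(*parts)


channel = parse_channel("acme/monitoring/security-incident")
print(channel.topic, str(channel))
```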

<h2 id="appendix-b-references">Appendix B: References</h2>

<ul>
  <li><a href="https://github.com/agntcy/slim">SLIM Repository</a></li>
  <li><a href="https://docs.agntcy.org/slim/overview/">SLIM Documentation</a></li>
  <li><a href="../../deployments/multicluster/multi_cluster_strategy.md">Multi-Cluster Deployment Strategy</a></li>
  <li><a href="../../data-plane/README.md">SLIM Data Plane README</a></li>
  <li><a href="../../control-plane/control-plane/README.md">SLIM Control Plane README</a></li>
  <li><a href="https://spiffe.io/">SPIFFE/SPIRE</a></li>
  <li><a href="https://www.rfc-editor.org/rfc/rfc9420">MLS RFC 9420</a></li>
  <li><a href="https://a2a.ai">A2A Protocol</a></li>
  <li><a href="https://modelcontextprotocol.io">MCP Protocol</a></li>
  <li><a href="https://github.com/agntcy/slim/issues/1372">Issue #1372 — Multicluster Epic</a></li>
</ul>]]></content><author><name>Luca Muscariello</name></author><category term="technical" /><category term="slim" /><category term="multi-agent" /><category term="agents" /><category term="multicluster" /><summary type="html"><![CDATA[Related issue: #1372 — Epic: SLIM multicluster autoconfig installation for server fleets Table of Contents Executive Summary Business Problem Solution: SLIM as the Communication Backbone Architecture Use Case: Enterprise Cluster-Monitoring Agent Fleet Communication Flows Security Model Demo Scenario Key Value Propositions Implementation Notes 1. Executive Summary This MVP demonstrates how SLIM (Secure Low-Latency Interactive Messaging) enables AI agents deployed across multiple Kubernetes clusters — including clusters hidden behind corporate VPNs and firewalls — to communicate securely without exposing any agent as a network server. Agents subscribe to named channels (topics) matching the problem they need to solve. When a cluster event fires, the relevant agents wake up, collaborate through SLIM, and escalate to a human operator if needed. All of this works across network boundaries that would otherwise require complex firewall rules, VPN tunnels, or service mesh configuration per agent. Core value proposition: SLIM turns the hardest part of multi-cluster agent communication — the network — into a non-problem, while making the system more secure, not less. 2. Business Problem 2.1 The Enterprise Fleet Reality Large enterprises operate AI agent fleets across tens or hundreds of Kubernetes clusters. 
These clusters span different environments: Environment Network constraint On-prem data centers Corporate firewall / private network Branch offices VPN-only access Cloud regions Transit VPC, private endpoints Edge / OT networks Strictly air-gapped or NAT-only outbound Each cluster runs specialized AI agents: Performance agents — track CPU, memory, latency, and SLO compliance Security agents — detect anomalies, policy violations, CVEs in running images Remediation agents — attempt automated fixes (node drain, pod restart, config rollback) Escalation agents — page a human operator when automated resolution is insufficient 2.2 What Makes Agent Communication Hard Traditional approaches require agents to expose themselves as servers: Agent A ──► opens TCP port ──► registered in service registry ──► firewall rule Agent B ──► DNS lookup ──► TLS handshake ──► mTLS cert rotation ──► calls Agent A This creates a dense web of operational complexity: Every cluster behind a VPN requires explicit firewall/NAT rules per agent pair Every agent needs a stable DNS name and TLS certificate, even for ephemeral workloads Human operators wanting to observe or intervene must be given network-level access to the relevant cluster Agent-to-agent topology changes (scale-up, migration) invalidate stale service registry entries and routing rules 2.3 Requirements for the MVP Requirement Description Zero exposed ports per agent Agents must not listen on any TCP/UDP port Cross-cluster messaging Agents on Cluster A and Cluster B communicate transparently Works behind VPN / NAT Outbound-only connectivity from cluster to SLIM overlay is sufficient Event-driven wake-up Agents are idle until a relevant event arrives on their channel Multi-channel membership A single agent can participate in multiple topic channels simultaneously Human-in-the-loop Operators can observe, validate, and act via the same channel mechanism End-to-end encryption Agent messages are encrypted with MLS; SLIM nodes cannot read 
payloads Workload identity SPIRE provides SPIFFE certificates — no shared secrets, no static API keys 3. Solution: SLIM as the Communication Backbone 3.1 What SLIM Provides SLIM is a publish-subscribe message router with a session layer. Its architecture has three components that cleanly separate concerns: flowchart TB subgraph APP["Agent / Application"] RPC["SLIMRPC (request/response, streaming)"] SL["Session Layer (MLS encryption, group mgmt)"] DPC["Data Plane Client (routing, transport)"] RPC --&gt; SL --&gt; DPC end DPC --&gt;|"outbound connection only"| NL["SLIM Routing Node\n(cluster-local)"] NL --&gt;|"mTLS (SPIRE-issued certs)"| NR["SLIM Routing Node\n(remote cluster)"] 3.2 The Channel Model Every agent is identified by a hierarchical name: &lt;org&gt;/&lt;namespace&gt;/&lt;agent-type&gt;[#id]. Agents subscribe to a named channel topic. The SLIM controller sees these subscriptions and automatically creates routes between the cluster-local SLIM node and any remote nodes that have subscribers for the same topic. 
channel: acme/monitoring/security-incident │ │ │ │ │ └── topic (problem domain) │ └── namespace (cluster or team scope) └── org (enterprise identifier) Key properties: An agent subscribes with no knowledge of the remote agent’s address — only the channel name The SLIM router handles delivery; the agent never opens a listening port Multiple agents can subscribe to the same channel → multicast delivery An agent can subscribe to multiple channels simultaneously 3.3 Why SLIM Solves the Network Problem Traditional approach SLIM approach Agent opens a port and registers in a service registry Agent connects outbound to the local SLIM node — no inbound port Firewall rules required for every agent pair Only the SLIM node itself needs one exposed endpoint per cluster DNS + TLS cert per agent Workload identity via SPIRE SVID — automatic, no manual cert management Service mesh or VPN tunnel between every cluster pair SLIM nodes federate; agents are completely unaware of cluster topology Agent scale-out changes routing config Agents with the same name auto-form a group; SLIM load-balances 4. 
Architecture 4.1 High-Level System Architecture --- config: layout: elk --- flowchart TB subgraph CLOUD["Cloud / Admin Cluster (admin.example)"] direction TB CTRL["SLIM Controller\n(route management)"] SPIRE_ADMIN["SPIRE Server (root)\ntrust domain: acme.example\nspire-root.admin.example:8081"] LB_CTRL["LoadBalancer\nslim-control.admin.example:50052"] LB_CTRL --&gt; CTRL end subgraph CLUSTER_A["Cluster A — On-prem / VPN"] direction TB SPIRE_A["SPIRE Server (nested)\nno external endpoint"] subgraph SLIM_A_NS["slim namespace"] SLIM_A["SLIM Node (StatefulSet)\nslim.cluster-a.example:46357"] end subgraph AGENTS_A["default namespace"] PERF_A["Performance\nAgent"] SEC_A["Security\nAgent"] REMED_A["Remediation\nAgent"] end PERF_A --&gt;|"outbound only"| SLIM_A SEC_A --&gt;|"outbound only"| SLIM_A REMED_A--&gt;|"outbound only"| SLIM_A end subgraph CLUSTER_B["Cluster B — Cloud region"] direction TB SPIRE_B["SPIRE Server (nested)\nno external endpoint"] subgraph SLIM_B_NS["slim namespace"] SLIM_B["SLIM Node (StatefulSet)\nslim.cluster-b.example:46357"] end subgraph AGENTS_B["default namespace"] PERF_B["Performance\nAgent"] SEC_B["Security\nAgent"] REMED_B["Remediation\nAgent"] end PERF_B --&gt;|"outbound only"| SLIM_B SEC_B --&gt;|"outbound only"| SLIM_B REMED_B--&gt;|"outbound only"| SLIM_B end subgraph OPERATOR["Operator Workstation / Jump Host"] HUMAN["👤 Human Operator\n(observer / approver)"] HUMAN --&gt;|"outbound to any SLIM node"| SLIM_A end SLIM_A &lt;--&gt;|"mTLS (SPIRE-issued certs)\noutbound from each side"| SLIM_B SLIM_A --&gt;|"gRPC controller connection"| CTRL SLIM_B --&gt;|"gRPC controller connection"| CTRL SPIRE_A --&gt;|"nested upstream (outbound)"| SPIRE_ADMIN SPIRE_B --&gt;|"nested upstream (outbound)"| SPIRE_ADMIN SPIRE nested deployment: Workload-cluster SPIRE servers connect outbound to the admin root SPIRE server as nested (downstream) servers. 
Only the admin SPIRE server requires an external endpoint — workload SPIRE servers never need to be reachable from outside their own cluster. See SPIRE nested architecture. 4.2 Network Topology — What Crosses the Firewall The diagram below shows exactly which connections must be allowed through firewalls or VPN gateways. Only SLIM nodes and SPIRE servers open outbound connections — agents and workload SPIRE servers never listen on externally reachable ports. flowchart LR subgraph FW_A["Firewall — Cluster A"] direction TB note_a["Allow outbound to:\nslim-control.admin.example:50052\nslim.cluster-b.example:46357\nspire-root.admin.example:8081"] end subgraph FW_B["Firewall — Cluster B"] direction TB note_b["Allow outbound to:\nslim-control.admin.example:50052\nslim.cluster-a.example:46357\nspire-root.admin.example:8081"] end AGENT_A["Agent (Cluster A)\nno inbound port"] --&gt;|outbound| SLIM_NODE_A["SLIM Node\n(cluster-a)"] SLIM_NODE_A --&gt;|"mTLS outbound"| FW_A FW_A --&gt; SLIM_NODE_B["SLIM Node\n(cluster-b)"] SLIM_NODE_B --&gt; AGENT_B["Agent (Cluster B)\nno inbound port"] AGENT_B --&gt;|outbound| SLIM_NODE_B SLIM_NODE_B --&gt;|"mTLS outbound"| FW_B FW_B --&gt; SLIM_NODE_A SPIRE_A["SPIRE Server\n(nested, cluster-a)"] --&gt;|"nested upstream\noutbound"| FW_A SPIRE_B["SPIRE Server\n(nested, cluster-b)"] --&gt;|"nested upstream\noutbound"| FW_B Firewall rule summary: Source Destination Port Protocol Purpose Cluster A SLIM node slim.cluster-b.example 46357 TCP/mTLS Data plane inter-cluster Cluster B SLIM node slim.cluster-a.example 46357 TCP/mTLS Data plane inter-cluster Cluster A SLIM node slim-control.admin.example 50052 TCP/gRPC+mTLS Controller Cluster B SLIM node slim-control.admin.example 50052 TCP/gRPC+mTLS Controller Cluster A SPIRE Server (nested) spire-root.admin.example 8081 TCP/mTLS SPIRE nested upstream Cluster B SPIRE Server (nested) spire-root.admin.example 8081 TCP/mTLS SPIRE nested upstream Agents (all clusters) local SLIM node 46357 TCP (local) Agent 
↔ SLIM (cluster-internal) No agent-to-agent, no agent-to-internet, and no inbound rules for workload SPIRE servers are required. 4.3 Channel Subscription Model sequenceDiagram participant CA as Security Agent (Cluster A) participant SA as SLIM Node A participant CTRL as Controller (Admin) participant SB as SLIM Node B participant CB as Security Agent (Cluster B) participant OP as Operator CA-&gt;&gt;SA: subscribe(acme/monitoring/security-incident) SA-&gt;&gt;CTRL: subscription update CTRL-&gt;&gt;SB: configure route to Cluster A for channel CB-&gt;&gt;SB: subscribe(acme/monitoring/security-incident) SB-&gt;&gt;CTRL: subscription update CTRL-&gt;&gt;SA: configure route to Cluster B for channel OP-&gt;&gt;SA: subscribe(acme/monitoring/security-incident) Note over SA,SB: All three parties now receive messages on this channel 4.4 Multi-Channel Agent Participation A single agent can join multiple channels, each scoped to a different problem domain: flowchart TB AGENT["Security Agent (Cluster A)"] AGENT --&gt; CH1["acme/monitoring/security-incident\n← security topics"] AGENT --&gt; CH2["acme/remediation/cve-patch\n← remediation coordination"] AGENT --&gt; CH3["acme/escalation/human-review\n← when it needs a human"] This is achieved by the agent creating multiple SLIM sessions, each with its own channel name. The SLIM session layer manages group membership independently per channel. 5. 
Use Case: Enterprise Cluster-Monitoring Agent Fleet 5.1 Scenario Description Company: ACME Corp Fleet: 20 Kubernetes clusters across 3 regions (US-East, EU-West, APAC) Problem: Security incident detected on a node in EU-West Cluster 7 Clusters in EU-West sit behind a corporate VPN; no inbound ports are permitted Cluster 7 runs a Security Monitoring Agent and a Remediation Agent The cloud (US-East admin cluster) hosts the SLIM Controller and an Escalation Handler A human operator is on-call via their laptop connected to corporate VPN 5.2 Agent Roster Agent Channel(s) Location Description acme/eu-west/security-monitor security-incident, escalation Cluster 7 (EU) Detects anomalies, CVEs acme/eu-west/remediation security-incident, cve-patch Cluster 7 (EU) Automated fixes acme/us-east/security-monitor security-incident Cluster 1 (US) Cross-region correlation acme/admin/escalation-handler escalation Admin cluster LLM-backed escalation agent acme/admin/human-operator escalation, security-incident Operator laptop Human observer / approver 5.3 Event Types Event Trigger Severity Anomalous pod CPU spike Metrics threshold Low CVE detected in running image Image scan Medium Node compromise indicators Audit log analysis High Policy violation cascade OPA/Gatekeeper alerts High Agent escalation Agent decision Critical 6. 
Communication Flows 6.1 Flow A: Automated Security Incident Resolution sequenceDiagram participant K8S as K8s Events (EU-West) participant SEC as Security Agent (EU) participant SA as SLIM Node A participant SB as SLIM Node B participant CORS as Security Agent (US-East) participant REM as Remediation Agent K8S-&gt;&gt;SEC: CVE detected in nginx:1.21 — Critical SEC-&gt;&gt;SA: publish(security-incident, cve=CVE-2024-XXXX image=nginx:1.21) SA-&gt;&gt;SB: route message (mTLS) SB-&gt;&gt;CORS: deliver SA-&gt;&gt;REM: deliver (same cluster) CORS-&gt;&gt;SB: reply(also_affected=[us-east-3]) SB-&gt;&gt;SA: route reply SA-&gt;&gt;SEC: deliver reply REM-&gt;&gt;SA: publish(cve-patch, action=rolling-update target=nginx:1.25) Note over SEC,REM: Remediation proceeds autonomously — no human intervention needed 6.2 Flow B: Human-in-the-Loop Escalation sequenceDiagram participant REM as Remediation Agent participant SA as SLIM Node A participant SC as SLIM Node C (Admin) participant ESC as Escalation Handler participant OP as Human Operator REM-&gt;&gt;SA: publish(escalation, reason=cannot-auto-patch severity=CRITICAL) SA-&gt;&gt;SC: route to admin cluster SC-&gt;&gt;ESC: deliver to escalation handler SC-&gt;&gt;OP: deliver to operator (subscribed to channel) ESC-&gt;&gt;SC: publish(paging operator, preparing runbook) OP-&gt;&gt;SC: publish(decision=APPROVE_DRAIN node=eu-west-7-node-3) SC-&gt;&gt;SA: route approval back SA-&gt;&gt;REM: deliver operator approval REM-&gt;&gt;SA: publish(security-incident, status=RESOLVED action=node-drain) Note over OP,SA: Operator never needed direct access to Cluster 7 — only to SLIM 6.3 Flow C: Agent Joining Multiple Channels flowchart LR subgraph CLUSTER_7["Cluster 7 (EU-West)"] SEC_7["Security Agent\nacme/eu-west/security-monitor"] SLIM7["SLIM Node"] SEC_7 --&gt;|"subscribe × 3"| SLIM7 end CH1["Channel:\nacme/monitoring/security-incident"] CH2["Channel:\nacme/remediation/cve-patch"] CH3["Channel:\nacme/escalation/human-review"] SLIM7 --- CH1 
SLIM7 --- CH2 SLIM7 --- CH3 CH1 --- SEC_US["US-East\nSecurity Agent"] CH2 --- REM_7["Remediation Agent\n(same cluster)"] CH3 --- ESC_ADMIN["Escalation Handler\n(Admin cluster)"] CH3 --- OP["👤 Operator"] 6.4 Message Flow Timeline (Full Incident Lifecycle) gantt title Security Incident Lifecycle — SLIM-enabled dateFormat mm:ss axisFormat %M:%S section Detection CVE scan triggers event : 00:00, 5s Security Agent wakes up : 00:05, 3s section Agent Collaboration Publish to security-incident : 00:08, 2s Cross-cluster agents respond : 00:10, 8s Remediation Agent attempts fix : 00:18, 15s section Escalation Remediation fails, escalate : 00:33, 3s Escalation handler notified : 00:36, 5s Operator receives page : 00:41, 10s section Resolution Operator approves action : 00:51, 5s Remediation executes drain : 00:56, 20s Incident closed, channels idle : 01:16, 5s 7. Security Model 7.1 Layered Security Architecture flowchart TB L4["Layer 4: Application-level authorization\nAgents validate sender SPIFFE SVID in message metadata"] L3["Layer 3: End-to-end MLS encryption (RFC 9420)\nSLIM routers cannot read agent payloads"] L2["Layer 2: Transport mTLS between SLIM nodes\nSPIRE-issued SVID certificates, auto-rotated"] L1["Layer 1: Workload identity — SPIFFE/SPIRE\nZero-trust · no static secrets · no shared API keys"] L4 --&gt; L3 --&gt; L2 --&gt; L1 7.2 SPIRE Nested Deployment Across Clusters SPIFFE peer federation requires every SPIRE server to expose an endpoint reachable from the other clusters — not viable in VPN-restricted or air-gapped environments. SPIRE nested deployment solves this: workload-cluster SPIRE servers act as nested (downstream) servers and connect outbound to the admin root SPIRE server. Only the admin SPIRE server requires an external endpoint. 
flowchart TB subgraph ADMIN["Admin Cluster — root SPIRE server\nspire-root.admin.example:8081 (public)"] SPIRE_ROOT["SPIRE Server (root)\ntrust domain: acme.example"] CTRL_SVID["SVID: spiffe://acme.example/slim/controller"] SPIRE_ROOT --- CTRL_SVID end subgraph CLUSTER_A["Cluster A — nested SPIRE server\n(no external endpoint required)"] SPIRE_NESTED_A["SPIRE Server (nested)\ntrust domain: acme.example"] NODE_A_SVID["SVID: spiffe://acme.example/cluster-a/slim/node-0"] AGENT_A_SVID["SVID: spiffe://acme.example/cluster-a/agent/security-monitor"] SPIRE_NESTED_A --- NODE_A_SVID SPIRE_NESTED_A --- AGENT_A_SVID end subgraph CLUSTER_B["Cluster B — nested SPIRE server\n(no external endpoint required)"] SPIRE_NESTED_B["SPIRE Server (nested)\ntrust domain: acme.example"] NODE_B_SVID["SVID: spiffe://acme.example/cluster-b/slim/node-0"] AGENT_B_SVID["SVID: spiffe://acme.example/cluster-b/agent/security-monitor"] SPIRE_NESTED_B --- NODE_B_SVID SPIRE_NESTED_B --- AGENT_B_SVID end SPIRE_NESTED_A --&gt;|"outbound upstream connection"| SPIRE_ROOT SPIRE_NESTED_B --&gt;|"outbound upstream connection"| SPIRE_ROOT All SVIDs are issued under the single shared trust domain (acme.example). The root SPIRE server is the chain-of-trust anchor; nested servers delegate issuance to their local workloads. SLIM nodes on different clusters can mutually authenticate because they share the same trust domain and their certificates chain to the same root. 
7.3 Security Properties

Property | Mechanism | Benefit
No exposed agent ports | SLIM outbound-only model | Eliminates entire attack surface class
Workload identity | SPIRE SVID (X.509 + JWT) | No static credentials — identity is cryptographic
Inter-node transport | mTLS with SPIRE-issued certs | Auto-rotated, zero-touch cert management
Agent payload privacy | MLS group encryption | SLIM routing nodes are zero-knowledge to payloads
Operator access control | Channel subscription + SVID | Operator only subscribes; no cluster-level access needed
Audit trail | SLIM controller events + SPIRE attestation | Full provenance of who joined which channel

8. Demo Scenario

8.1 Environment Setup

The demo uses three kind clusters to simulate the production environment:

Cluster | Role | Simulates
kind-admin.example | SLIM Controller + SPIRE Server | Cloud management plane
kind-cluster-a.example | SLIM nodes + Security &amp; Remediation Agents | On-prem cluster (VPN-restricted)
kind-cluster-b.example | SLIM nodes + Security Agent | Cloud cluster (remote region)

# 1. Start clusters and install SPIRE (nested deployment)
sudo task multi-cluster:up
# 2. Deploy Controller on admin cluster
task controller:deploy
# 3. Deploy SLIM on workload clusters
task slim:deploy
# 4. Deploy agents
task demo:agents:deploy

8.2 Demo Script — Step by Step

Step 1: Show baseline — agents running, channels empty

# Show agents are running but NOT listening on any port
kubectl get pods -n default # agents running
kubectl exec -it &lt;security-agent-pod&gt; -- ss -tlnp # no open ports

Step 2: Inject a simulated CVE event

# Inject a CVE detection event into Security Agent on Cluster A
kubectl exec -it &lt;security-agent-pod&gt; -- \
  python3 inject_event.py --type cve --severity critical --image nginx:1.21

Step 3: Watch agents collaborate on the security-incident channel

# Follow the channel in real time (operator view)
slimctl channel subscribe acme/monitoring/security-incident --cluster admin.example

Expected output:
[00:00] eu-west/security-monitor → CVE-2024-XXXX detected in nginx:1.21
[00:02] us-east/security-monitor → Confirmed affected: us-east-3 also running nginx:1.21
[00:04] eu-west/remediation → Attempting rolling update to nginx:1.25
[00:19] eu-west/remediation → Node tainted, cannot auto-patch — escalating

Step 4: Observe escalation to human operator

# Operator joins escalation channel from their laptop
slimctl channel subscribe acme/escalation/human-review

Expected output:
[00:22] eu-west/security-monitor → Escalation requested: node-drain required
[00:24] admin/escalation-handler → Runbook loaded, paging operator [APPROVAL REQUIRED]

Step 5: Operator approves and watches resolution

# Operator sends approval
slimctl channel publish acme/escalation/human-review \
  '{"decision": "APPROVE_DRAIN", "node": "eu-west-7-node-3"}'

Expected output:
[00:31] eu-west/remediation → Node drain initiated
[00:51] eu-west/remediation → Drain complete, workloads rescheduled
[00:52] eu-west/security-monitor → Incident resolved — CVE remediated

8.3 Demo Talking Points

“Notice that no agent opened a port” — ss -tlnp shows nothing. Yet cross-cluster messaging worked transparently.
“The only firewall rules needed” — point to the two SLIM node endpoints, not the dozens of agent-pair rules a traditional approach would require.
“The operator joined from a laptop” — connected to corporate VPN, authenticated via SPIRE, and joined the channel. No jump host, no kubectl proxy, no cluster-level access.
“MLS means SLIM nodes are zero-knowledge” — the SLIM routers forwarded every byte without being able to read a single word of the agent messages.
“Adding a new cluster is trivial” — register it with SPIRE, deploy SLIM with the correct group name, point it at the controller. No routing rule changes in other clusters.

9. Key Value Propositions

9.1 Simplified Operations

Before SLIM | With SLIM
Firewall rules: O(n²) agent pairs | Firewall rules: O(n) SLIM nodes
DNS + cert per agent | Zero per-agent network config
Service registry maintenance | Channel names are self-describing, DNS-free
VPN access required per cluster | One SLIM endpoint per cluster suffices

9.2 Enhanced Security

Zero server exposure per agent — drastically reduces attack surface
End-to-end MLS encryption — infrastructure operators cannot read agent payloads
Cryptographic workload identity — SPIRE eliminates static credential sprawl
Principle of least privilege — operators observe only what they subscribe to

9.3 Developer Experience

Agents use high-level SLIMRPC or pub/sub APIs — no networking code
Language support: Python, Go, Java, .NET (C#), Kotlin, Rust
Protocol support: A2A, MCP, custom protobuf
Session types: point-to-point, multicast (group), streaming

9.4 Operational Scalability

Dynamic topology — agents join/leave channels without reconfiguration
Auto-routing — SLIM Controller creates inter-cluster routes on first subscription
Fleet-scale — tested with large StatefulSet deployments; single control plane for all clusters
Multi-language agents — Python ML agent ↔ Go infrastructure agent ↔ Java app agent, all on the same channel

10. Implementation Notes

10.1 Tech Stack

Component | Technology
SLIM Node (data plane) | Rust — high-performance message router
SLIM Session Layer | Rust — MLS (RFC 9420) encryption + group management
SLIMRPC | Protobuf + generated stubs (all languages)
SLIM Controller | Go (Kubernetes controller pattern)
Workload identity | SPIRE / SPIFFE
Demo agents | Python (ADK or custom SLIM Python bindings)
Deployment | Kubernetes (Helm charts)
Observability | OpenTelemetry (SLIM has native OTEL support)

10.2 Key APIs Used in Demo Agents

import slim_bindings

# Agent startup — outbound connection only, no listening port
svc = slim_bindings.Service.new(slim_node_addr)
app = svc.create_app_with_secret(agent_name, shared_secret)
conn_id = svc.connect(slim_bindings.ClientConfig.new_with_tls(slim_node_addr))

# Subscribe to a channel
app.subscribe(app.name(), conn_id)

# Join the security-incident group channel
session_cfg = slim_bindings.SessionConfig(
    session_type=slim_bindings.SessionType.MULTICAST,
    mls_enabled=True,
)
session, done = await app.create_session(session_cfg, "acme/monitoring/security-incident")
await done # wait for group to form

# Publish an event
await app.publish("acme/monitoring/security-incident", payload_bytes)

# Receive messages (event-driven)
async for msg in app.receive():
    await handle_event(msg)

10.3 Helm Deployment Snippet

# values-cluster-a.yaml
slim:
  config:
    services:
      slim/0:
        node_id: "${env:SLIM_SVC_ID}"
        group_name: "cluster-a.example"
        dataplane:
          servers:
            - endpoint: "0.0.0.0:46357"
              metadata:
                local_endpoint: "${env:MY_POD_IP}"
                external_endpoint: "slim.cluster-a.example:46357"
              tls:
                source:
                  type: spire
        controller:
          clients:
            - endpoint: "https://slim-control.admin.example:50052"
              tls:
                source:
                  type: spire

10.4 Prerequisites for Demo Environment

kind v0.20+
kubectl v1.28+
helm v3.12+
task (Taskfile runner)
spire-server / spire-agent (deployed via Helm chart in clusters)
SLIM Helm charts (charts/slim, charts/slim-control-plane)
Docker for building demo agent images

Appendix A: Glossary

Term | Definition
SLIM | Secure Low-Latency Interactive Messaging — the transport framework
SLIMRPC | SLIM’s request/response RPC layer built on top of the session layer
MLS | Messaging Layer Security (RFC 9420) — E2E encryption for groups
SPIRE | SPIFFE Runtime Environment — issues SVID certificates to workloads
SPIFFE | Secure Production Identity Framework for Everyone
SVID | SPIFFE Verifiable Identity Document (X.509 cert or JWT)
Channel | A named pub/sub topic in SLIM (hierarchical: org/ns/topic)
Group session | A SLIM multicast session with multiple subscribers
Data plane | SLIM message routing layer (pure forwarding, zero payload inspection)
Control plane | SLIM management layer (route configuration, monitoring)
Session layer | SLIM encryption + group membership layer (sits above data plane)

Appendix B: References

SLIM Repository
SLIM Documentation
Multi-Cluster Deployment Strategy
SLIM Data Plane README
SLIM Control Plane README
SPIFFE/SPIRE
MLS RFC 9420
A2A Protocol
MCP Protocol
Issue #1372 — Multicluster Epic]]></summary></entry></feed>