SLIM for Observability and Remediation: Beyond Agentic AI
SLIM (Secure Low-Latency Interactive Messaging) was designed as the transport layer for agentic AI protocols like A2A (Agent-to-Agent). While SLIM was built to enable AI agents to communicate securely across network boundaries, its core capabilities (network traversal, end-to-end encryption, and dynamic channel management) also apply to standard distributed applications. This post explores how SLIM applies to observability and remediation.
Modern distributed systems face critical challenges when transporting telemetry data. Applications run behind firewalls, in different cloud regions, at the edge, or across organizational boundaries. Traditional observability requires exposed collector endpoints and complex firewall management. Telemetry contains sensitive information, yet TLS frequently terminates at load balancers, leaving data exposed as it passes through intermediate infrastructure. Furthermore, hardcoded endpoint configurations break as systems scale elastically across regions.
SLIM addresses these challenges by enabling remote components to communicate without requiring exposed endpoints or complex NAT/firewall configurations. Message Layer Security (MLS) protects telemetry from source to destination, and dynamic channel management allows you to reconfigure topologies at runtime without restarting applications.
The Observability and Remediation Use Case
Consider a modern distributed application running across multiple environments—cloud, edge, and on-premises—often spanning separate Kubernetes clusters in different regions or cloud providers. When incidents occur, you need real-time monitoring with continuous telemetry streaming to dashboards and storage. Different stakeholders—operations teams in one cluster, SREs in another, executives from headquarters—need visibility into the same data from different network segments. AI agents should monitor telemetry streams for anomalies, and when problems arise, specialized diagnostic agents running in separate clusters should join dynamically to investigate root causes while engineers observe live agent analysis and approve remediation actions.
Traditional observability architectures struggle with this scenario. Hardcoded collector addresses make it difficult to add consumers dynamically across network segments. Sharing telemetry between clusters requires VPNs or complex tunneling. Broadcasting to multiple consumers requires intermediate message queues. Most critically, standard telemetry pipelines weren’t designed for AI agents to join channels dynamically and collaborate.
Why Use SLIM for Observability?
SLIM provides several capabilities that address common challenges in distributed observability.
Network Traversal Without Exposed Endpoints
When working across Kubernetes clusters—whether in different AWS regions, separate GCP projects, or hybrid cloud/on-premises environments—each cluster boundary becomes a network boundary requiring careful firewall configuration, load balancer setup, and security group management.
SLIM simplifies these requirements. Applications and collectors connect outbound to SLIM nodes, meaning only the SLIM routing infrastructure needs exposed addresses. Components behind firewalls and NATs can communicate without VPN tunnels or complex ingress configurations. A workload in your AWS EKS cluster can send telemetry to a collector in your on-premises Kubernetes cluster without port forwarding or firewall exceptions.
End-to-End Encryption for Sensitive Telemetry
While standard OTLP over TLS protects point-to-point connections, TLS frequently terminates at load balancers or proxies. This means your telemetry—containing API keys, business metrics, and internal system details—flows in plaintext through infrastructure you may not fully control.
SLIM uses Message Layer Security (MLS) to provide true end-to-end encryption from application to collector. Messages remain encrypted even when traversing SLIM nodes, load balancers, or proxies. Only authorized channel participants possess the keys to decrypt telemetry, ensuring confidentiality across untrusted infrastructure.
Dynamic Channel Management
Traditional observability architectures are largely static. Routing telemetry to a new collector typically requires application restarts. Adding a monitoring agent to investigate an incident requires reconfiguration and redeployment. Pre-configured topologies can’t easily adapt at runtime.
SLIM’s Channel Manager API enables runtime topology changes without restarts or downtime. When a monitor agent detects elevated latency, it can create an incident-specific channel and invite specialized diagnostic agents to join the telemetry stream. Once the investigation completes, it removes participants dynamically via API calls.
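The invite, analyze, remove lifecycle described above can be sketched with a small in-memory model. This is only an illustration: `Channel`, `invite`, `remove`, and `handle_incident` are hypothetical names chosen for this sketch and are not the SLIM Channel Manager API.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    """Hypothetical in-memory stand-in for a channel (not the real SLIM API)."""
    name: str
    participants: set[str] = field(default_factory=set)

    def invite(self, participant: str) -> None:
        # Membership changes at runtime; no participant is restarted.
        self.participants.add(participant)

    def remove(self, participant: str) -> None:
        # discard() is a no-op if the participant has already left.
        self.participants.discard(participant)


def handle_incident(channel: Channel, agent: str, diagnose) -> str:
    """Invite a diagnostic agent, run its analysis, then remove it again."""
    channel.invite(agent)
    try:
        return diagnose(channel)
    finally:
        # The agent is removed even if the analysis raises.
        channel.remove(agent)
```

Wrapping the removal in `finally` is a design choice of the sketch: the channel returns to its baseline membership whether or not the diagnosis succeeds.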
Broadcast Telemetry to Multiple Consumers
Point-to-point OTLP forces an awkward choice: either implement multiple exporters in your applications (adding complexity and resource consumption) or introduce intermediate message queues (adding latency and operational complexity). Neither option is ideal when you want collectors, dashboards, and AI agents all receiving the same telemetry simultaneously.
SLIM’s broadcast channels support multiple simultaneous consumers. A single application publishes to a channel, and all participants—traditional collectors storing to Prometheus, real-time Grafana dashboards, AI agents performing anomaly detection, and specialized diagnostic agents analyzing patterns—receive the same data stream.
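To make the fan-out idea concrete, here is a minimal in-process sketch of one publisher delivering the same message to several consumers. `BroadcastChannel` and its methods are hypothetical and ignore networking and encryption entirely. Skipping the sender on delivery is an assumption of the sketch, not a documented SLIM behavior.

```python
from typing import Callable

Handler = Callable[[dict], None]


class BroadcastChannel:
    """Hypothetical in-memory fan-out; real SLIM channels are networked and encrypted."""

    def __init__(self) -> None:
        self._handlers: dict[str, Handler] = {}

    def join(self, name: str, handler: Handler) -> None:
        self._handlers[name] = handler

    def leave(self, name: str) -> None:
        self._handlers.pop(name, None)

    def publish(self, sender: str, message: dict) -> int:
        """Deliver the message to every participant except the sender.

        Returns the number of deliveries made.
        """
        delivered = 0
        # Copy the items so handlers may join or leave during delivery.
        for name, handler in list(self._handlers.items()):
            if name != sender:
                handler(message)
                delivered += 1
        return delivered
```

The publisher calls `publish` once, and adding a dashboard or an agent is a `join` call on the channel rather than a new exporter in the application.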
AI Agent Integration
Standard telemetry pipelines were designed for collectors and storage systems, not AI agents. Agents that need to analyze telemetry typically must poll storage backends (introducing latency) and have limited ways to coordinate with other agents.
SLIM treats AI agents as channel participants. Agents receive live telemetry streams alongside collectors, eliminating polling latency. They can use the A2A protocol on separate SLIM channels to coordinate their analysis and share findings. Agents join and leave channels dynamically based on incident needs, with both telemetry and agent communication using the same MLS security model.
Demo Scenario: Intelligent Incident Response
The following demo shows SLIM’s capabilities for dynamic observability and AI-powered incident response.
The monitor application starts up and creates a SLIM channel for telemetry. It invites two participants: the monitored application (which generates metrics using the SLIM OpenTelemetry SDK) and an OpenTelemetry Collector (configured with a SLIM receiver to export metrics to Prometheus and Grafana). The monitor app itself also joins the channel to watch the telemetry stream for anomalies.
During normal operation, the monitored app continuously sends metrics—active connections and service latency. The collector stores these metrics for dashboards while the monitor app observes in the background.
When the application enters a high-load period, the number of active connections increases and processing latency rises. Once latency exceeds 200ms (the threshold used by the monitoring application), the monitor app triggers an alert.
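The alerting rule can be sketched as a small streak counter. The 200 ms threshold comes from the demo. The requirement of three consecutive samples is an assumption made for illustration (the demo flow only says "consecutive samples"), and `LatencyAlert` is a hypothetical name, not code from the repository.

```python
class LatencyAlert:
    """Fire once when latency stays above a threshold for N consecutive samples."""

    def __init__(self, threshold_ms: float = 200.0, required_consecutive: int = 3) -> None:
        # required_consecutive=3 is an assumed value, not taken from the demo.
        self.threshold_ms = threshold_ms
        self.required_consecutive = required_consecutive
        self._streak = 0

    def observe(self, latency_ms: float) -> bool:
        """Record one sample; return True only when the streak first reaches N."""
        if latency_ms > self.threshold_ms:
            self._streak += 1
        else:
            # Any sample at or below the threshold resets the streak.
            self._streak = 0
        return self._streak == self.required_consecutive
```

Requiring a streak rather than a single sample avoids alerting on one-off spikes, and returning `True` only at the moment the streak is reached keeps the monitor from inviting the agent repeatedly during one incident.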
The monitor app then invites a specialized diagnostic agent to join the telemetry channel. This agent collects metrics for 10 seconds, performs statistical analysis, and queries a GPT-4 family model on Azure OpenAI (the deployment defaults to gpt-4o) to diagnose the root cause. After completing the analysis, the agent prints the findings and signals completion.
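As a rough illustration of the statistical step, the sketch below reduces a window of latency samples to a few summary values and formats them as prompt text. The choice of statistics, the prompt wording, and the function names are assumptions for this example. The demo agent's actual analysis may differ, and no LLM call is made here.

```python
import math
import statistics


def summarize(latencies_ms: list[float]) -> dict[str, float]:
    """Reduce a window of latency samples (ms) to a few summary statistics."""
    if not latencies_ms:
        raise ValueError("no samples collected in the window")
    ordered = sorted(latencies_ms)
    # Nearest-rank 95th percentile: smallest sample with >= 95% of samples at or below it.
    rank = math.ceil(0.95 * len(ordered))
    return {
        "count": len(ordered),
        "mean": statistics.fmean(ordered),
        "max": ordered[-1],
        "p95": ordered[rank - 1],
        "stdev": statistics.pstdev(ordered),
    }


def build_prompt(summary: dict[str, float], threshold_ms: float = 200.0) -> str:
    """Format the summary as text that could be sent to an LLM for diagnosis."""
    return (
        f"Service latency exceeded the {threshold_ms:.0f} ms alert threshold. "
        f"Over {summary['count']} samples: mean={summary['mean']:.1f} ms, "
        f"p95={summary['p95']:.1f} ms, max={summary['max']:.1f} ms, "
        f"stdev={summary['stdev']:.1f} ms. Suggest likely root causes."
    )
```

Sending a compact summary instead of raw samples keeps the prompt small and gives the model the aggregate picture it needs to reason about the incident.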
The monitor app receives the completion notification, removes the special agent from the channel, and waits for the next incident. In the demo, this flow repeats in a cycle, simulating periodic issues. While the demo runs locally for simplicity, the same architecture works across cluster boundaries.
Here’s the architecture:
graph TB
APP[Monitored App<br/>Metrics: connections, latency]
subgraph "SLIM Node"
CHANNEL[SLIM Broadcast Channel<br/>MLS Encrypted]
end
subgraph "Monitoring & Storage"
COLLECTOR[OTel Collector<br/>SLIM Receiver]
PROM[Prometheus]
GRAF[Grafana Dashboard]
end
MONITOR[Monitor App<br/>Creates Channel & Detects Anomalies]
SPECIAL[Special Agent<br/>AI-Powered Diagnosis]
MONITOR -->|0. Create channel & invite| CHANNEL
APP -->|1. Publish metrics| CHANNEL
CHANNEL -->|Broadcast metrics| COLLECTOR
CHANNEL -->|Broadcast metrics| MONITOR
CHANNEL -.->|Broadcast metrics - after invite| SPECIAL
COLLECTOR --> PROM
PROM --> GRAF
MONITOR -->|2. Detect high latency| MONITOR
MONITOR -.->|3. Invite special agent| SPECIAL
SPECIAL -->|4. Analyze with GPT-4| SPECIAL
style APP fill:#4a90e2,color:#fff
style CHANNEL fill:#f39c12,color:#fff
style COLLECTOR fill:#27ae60,color:#fff
style MONITOR fill:#9b59b6,color:#fff
style SPECIAL fill:#e74c3c,color:#fff
style PROM fill:#95a5a6,color:#fff
style GRAF fill:#95a5a6,color:#fff
The architecture shows: a single source (the monitored app) broadcasts telemetry to multiple consumers (collector and monitor app). The monitor app creates the channel and manages participant lifecycles. The special agent joins when needed, with no pre-configuration. LLM-powered diagnostics operate on live telemetry streams. Standard OpenTelemetry metrics flow through SLIM to Prometheus/Grafana. All communication is encrypted with MLS.
The building blocks for observability over SLIM, including the SLIM receiver and exporter for collectors and the SDK exporter for applications, are available in the slim-otel repository. The complete working demo is available in the agentic-apps repository, in the observability_app folder.
Running the Demo
This section provides step-by-step instructions to run the incident response demo.
Step 1: Clone and Build the Collector
Clone the repository and build the custom OpenTelemetry Collector with SLIM components:
git clone https://github.com/agntcy/agentic-apps.git
cd agentic-apps/observability_app
# Build the collector with SLIM receiver/exporter as Docker image
task collector:docker:build
This will install the OpenTelemetry Collector Builder (OCB), generate the collector code with SLIM components using builder-config.yaml, and create a Docker image.
Step 2: Start Infrastructure
Start all infrastructure services (SLIM node, OTel Collector, Prometheus, and Grafana):
task infra:start
Verify all services are running:
task infra:status
Step 3: Configure Grafana Dashboard
Import the pre-configured dashboard:
- Open http://localhost:3000 and log in with admin/admin
- Navigate to Dashboards → Import
- Upload the grafana-dashboard.json file available in the repo
Step 4: Run the Applications
Open three separate terminal windows and run each application:
Terminal 1 - Monitored Application (generates metrics):
task monitored-application:run
Terminal 2 - Monitor Application (creates channel, detects anomalies):
task monitor-application:run
This will invite both the monitored-application and the OTel Collector to the same SLIM channel. This channel is used to distribute telemetry.
Terminal 3 - Special Agent (AI-powered diagnostics):
First, set your Azure OpenAI credentials:
export AZURE_OPENAI_API_KEY="your-api-key"
export AZURE_OPENAI_ENDPOINT="https://your-endpoint.openai.azure.com/"
export AZURE_OPENAI_DEPLOYMENT="gpt-4o" # Optional, defaults to gpt-4o
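A quick preflight check can catch a missing credential before the agent starts. The sketch below reads the three variables shown above and applies the documented gpt-4o default. `AzureOpenAIConfig` and `load_config` are hypothetical helpers written for this post, not how the demo agent loads its settings.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AzureOpenAIConfig:
    api_key: str
    endpoint: str
    deployment: str


def load_config(env=None) -> AzureOpenAIConfig:
    """Read Azure OpenAI settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    required = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT")
    # Treat unset and empty variables the same way.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing required variables: " + ", ".join(missing))
    return AzureOpenAIConfig(
        api_key=env["AZURE_OPENAI_API_KEY"],
        endpoint=env["AZURE_OPENAI_ENDPOINT"],
        # The deployment is optional and defaults to gpt-4o.
        deployment=env.get("AZURE_OPENAI_DEPLOYMENT") or "gpt-4o",
    )
```

Calling `load_config()` with no arguments reads the current process environment. Passing a plain dict makes the function easy to exercise without touching real credentials.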
Then run the agent:
task special-agent:run
The special-agent will start and wait to be invited by the monitor-application.
Step 5: Observe the Demo Flow
Watch the terminal outputs to see the incident response cycle:
Low Load Period (20 seconds)
- Monitored app sends metrics (connections: ~50, latency: ~50ms)
- Monitor app observes quietly
- Metrics flow to Collector → Prometheus → Grafana
High Load Period (20 seconds) - Incident Detected
- Connections increase and service latency spikes
- Monitor app detects consecutive samples above the 200ms threshold and triggers the alert
- Monitor app invites the special agent
AI Analysis (10 seconds)
- Special agent joins the channel and collects telemetry
- GPT-4 analyzes the metrics and produces diagnostic insights
- Special agent reports findings
Cleanup and Reset
- Monitor app receives completion notification
- Monitor app removes the special agent from the channel
- Cycle repeats as app alternates between low and high load
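The alternating load pattern can be modeled as a simple function of elapsed time. The 20-second phase lengths and the low-load values (about 50 connections and 50 ms) come from the flow above. The high-load values are assumptions picked only so that latency exceeds the 200 ms threshold, and both function names are hypothetical.

```python
def phase_at(elapsed_s: float, low_s: float = 20.0, high_s: float = 20.0) -> str:
    """Return 'low' or 'high' for a point in the repeating load cycle."""
    position = elapsed_s % (low_s + high_s)
    return "low" if position < low_s else "high"


def sample_at(elapsed_s: float) -> dict[str, float]:
    """Return a synthetic metric sample for the given elapsed time."""
    if phase_at(elapsed_s) == "low":
        return {"connections": 50, "latency_ms": 50.0}
    # Assumed high-load values; the demo only states that latency exceeds 200 ms.
    return {"connections": 200, "latency_ms": 300.0}
```

Feeding `sample_at(t)` for increasing `t` into an alerting rule reproduces the cycle: quiet for 20 seconds, an incident for 20 seconds, then back to normal.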
Step 6: View Dashboards
View real-time metrics in Grafana:
- Navigate to http://localhost:3000
- Open the imported dashboard to view the metrics
Step 7: Clean Up
Stop all applications with Ctrl+C in each terminal.
Stop infrastructure:
task infra:stop
This stops and removes all Docker containers (SLIM node, collector, Prometheus, Grafana).
Conclusion
SLIM extends beyond agentic AI to support observability and remediation in distributed applications. The demo shows how AI agents can join telemetry channels dynamically to diagnose incidents using LLMs, collaborating with traditional collectors over MLS-encrypted channels that work across cluster boundaries.
The SLIM-OTel code and observability demo are open source. You can try it in your environment, adapt it for your setup, or contribute to the codebase.
SLIM and SLIM OpenTelemetry components are developed by AGNTCY Contributors and released under the Apache 2.0 License.