AI-Powered Performance Analysis Platform for OpenShift/Kubernetes Clusters
OCP Performance Analyzer MCP is an AI-driven performance analysis and monitoring platform for OpenShift/Kubernetes clusters. It combines the Model Context Protocol (MCP) with LangGraph agents to provide deep performance insights, automated root cause analysis, and actionable optimization recommendations.
Architecture overview, from top to bottom:

- Client Layer: Web UI (interactive UI), CLI tools, REST API (RESTful API). All three clients connect to the AI Agents Layer.
- AI Agents Layer (Port 8080): LangGraph AI agents (Chat · Report · Storage)
  - Streaming response support
  - Tool orchestration
  - Conversation context memory
  - OpenAI/compatible LLM integration
- MCP protocol communication links the AI Agents Layer to the MCP Server Layer.
- MCP Server Layer (Port 8000): ETCD Server (15+ tools), Network Server (10+ tools), OVNK Server (8+ tools). Each server feeds the same pipeline:
  1. Tools/Collectors (metrics collector)
  2. Analysis Modules
  3. ELT Pipeline (data transform)
  4. Shared Storage Layer (DuckDB): time-series storage, SQL interface, history
- OpenShift/Kubernetes Cluster Infrastructure: ETCD cluster, Prometheus/Thanos, Kubernetes API, master nodes, OVN-Kubernetes, network components
FastMCP-based server that exposes analysis tools for AI agent invocation. Supports SSE streaming and HTTP REST API.
Dedicated Prometheus metrics collector that retrieves cluster performance data via PromQL queries.
Performance analysis engine implementing bottleneck detection, threshold comparison, and root cause analysis.
Data transformation pipeline converting raw JSON data into structured tables and HTML visualizations.
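As a rough illustration of this kind of transformation, the sketch below flattens a Prometheus-style JSON query result into row dictionaries and renders them as an HTML table. The function names (`flatten_result`, `render_html_table`) and the sample payload are hypothetical and are not taken from the project; only the response shape follows the Prometheus HTTP API instant-query ("vector") format.

```python
import html
import json

# Sample payload in the Prometheus instant-query ("vector") response shape.
# The values are made up for illustration only.
RAW = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"pod": "etcd-0"}, "value": [1700000000, "0.004"]},
      {"metric": {"pod": "etcd-1"}, "value": [1700000000, "0.012"]}
    ]
  }
}
""")


def flatten_result(payload):
    """Turn each vector sample into a flat row: labels + timestamp + float value."""
    rows = []
    for sample in payload["data"]["result"]:
        timestamp, value = sample["value"]
        row = dict(sample["metric"])
        row["timestamp"] = timestamp
        row["value"] = float(value)
        rows.append(row)
    return rows


def render_html_table(rows, columns):
    """Render rows as a minimal HTML table, escaping every cell."""
    head = "".join(f"<th>{html.escape(c)}</th>" for c in columns)
    body = "".join(
        "<tr>"
        + "".join(f"<td>{html.escape(str(r.get(c, '')))}</td>" for c in columns)
        + "</tr>"
        for r in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


rows = flatten_result(RAW)
table_html = render_html_table(rows, ["pod", "timestamp", "value"])
```

Escaping every cell keeps label values that contain markup characters from breaking the generated report.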
DuckDB-based time-series data persistence supporting historical data analysis and trend prediction.
Intelligent agent layer providing conversational analysis, automatic report generation, and data collection services.
Real-time collection of 200+ performance metrics via Prometheus/Thanos, covering ETCD, network, nodes, pods and more. Supports custom time ranges and sampling frequencies.
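To make the time range and sampling frequency concrete, here is a minimal sketch of how a range query could be assembled for the Prometheus HTTP API (`/api/v1/query_range`, which takes `query`, `start`, `end`, and `step` parameters). The helper name `build_range_query_url`, the endpoint URL, and the choice of PromQL expression are assumptions for illustration; the project's own collector may build its queries differently.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# P99 WAL fsync latency from the etcd histogram metric (illustrative expression).
WAL_FSYNC_P99 = (
    "histogram_quantile(0.99, "
    "sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le))"
)


def build_range_query_url(base_url, promql, end, duration, step):
    """Build a /api/v1/query_range URL covering [end - duration, end]."""
    start = end - duration
    params = {
        "query": promql,
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": f"{int(step.total_seconds())}s",
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


# Hypothetical Thanos/Prometheus endpoint; replace with the cluster's real route.
end = datetime(2026, 4, 12, 15, 0, 0, tzinfo=timezone.utc)
url = build_range_query_url(
    "https://thanos-querier.example.com",
    WAL_FSYNC_P99,
    end=end,
    duration=timedelta(hours=1),
    step=timedelta(seconds=30),
)
```

The sketch only builds the URL. Sending the request would also need authentication (for example a bearer token), which is left out here.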
Integrates LangGraph and OpenAI for conversational performance analysis. AI agents understand natural language queries, automatically invoke relevant tools, and generate professional analysis reports.
DuckDB-based time-series data storage supporting long-term performance trend analysis, baseline comparison, and anomaly detection.
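One simple way to compare new samples with a historical baseline is a z-score test. The sketch below is an assumption about how such a check could look, not the platform's actual algorithm; the function name `find_anomalies`, the 3-sigma threshold, and the sample latencies are all illustrative.

```python
from statistics import mean, stdev


def find_anomalies(baseline, current, z_threshold=3.0):
    """Return (index, value, z_score) for current samples far from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A flat baseline gives no scale for deviations; flag nothing.
        return []
    anomalies = []
    for i, value in enumerate(current):
        z = (value - mu) / sigma
        if abs(z) > z_threshold:
            anomalies.append((i, value, round(z, 2)))
    return anomalies


# Made-up WAL fsync P99 samples in milliseconds.
baseline_ms = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0, 5.0]
current_ms = [5.0, 5.5, 12.0, 4.5]

anomalies = find_anomalies(baseline_ms, current_ms)
```

With these sample values only the 12.0 ms reading is flagged. In practice the baseline would come from historical rows in the time-series store rather than from a literal list.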
Multi-dimensional bottleneck analysis engine that automatically identifies performance bottlenecks in CPU, memory, disk I/O, and network, and provides optimization recommendations.
Automatically generates HTML-formatted performance reports including executive summaries, detailed metric tables, trend charts, and optimization recommendations.
Configuration-file-based metric management supporting rapid addition of new monitoring metrics and analysis tools without modifying core code.
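The idea can be sketched as a small registry that is populated from configuration instead of code. The JSON schema, the field names (`name`, `category`, `expr`, `unit`), and the `MetricRegistry` class are hypothetical and are not taken from the project's configuration files; the expressions use standard etcd metric names.

```python
import json
from dataclasses import dataclass

# Hypothetical metric configuration; a real deployment would load this from a file.
CONFIG_TEXT = """
{
  "metrics": [
    {"name": "wal_fsync_p99", "category": "disk",
     "expr": "histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (le))",
     "unit": "seconds"},
    {"name": "db_size", "category": "database",
     "expr": "etcd_mvcc_db_total_size_in_bytes", "unit": "bytes"}
  ]
}
"""


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    category: str
    expr: str
    unit: str


class MetricRegistry:
    """Holds metric definitions keyed by name; new metrics only need a config entry."""

    def __init__(self):
        self._metrics = {}

    def load(self, config_text):
        for entry in json.loads(config_text)["metrics"]:
            definition = MetricDefinition(**entry)
            if definition.name in self._metrics:
                raise ValueError(f"duplicate metric: {definition.name}")
            self._metrics[definition.name] = definition

    def by_category(self, category):
        return [m for m in self._metrics.values() if m.category == category]

    def get(self, name):
        return self._metrics[name]


registry = MetricRegistry()
registry.load(CONFIG_TEXT)
```

Adding a metric then means appending one entry to the configuration; the registry code stays unchanged.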
| Metric Category | Metric Name | Threshold | Importance |
|---|---|---|---|
| Disk Performance | WAL Fsync P99 Latency | < 10ms (Excellent) | 🔴 Critical |
| Disk Performance | Backend Commit P99 Latency | < 25ms (Excellent) | 🔴 Critical |
| Resource Usage | etcd Pod CPU Usage | < 70% (Warning), < 85% (Critical) | 🟡 Important |
| Resource Usage | etcd Pod Memory Usage | < 70% (Warning), < 85% (Critical) | 🟡 Important |
| Database | Database Space Utilization | < 90% | 🟡 Important |
| Network | Peer Latency | < 50ms (Warning), < 100ms (Critical) | 🟡 Important |
| Cluster Health | Proposal Failure Rate | 0% | 🔴 Critical |
| Compaction/Defrag | Compaction/Defrag Count | Periodic Monitoring | 🔵 Normal |
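The thresholds in the table can be read as a simple classification rule. The sketch below is one possible encoding. It assumes that the two-level rows mean "warning at or above the first value, critical at or above the second" and that the single-value rows mark the upper bound of the healthy range. The `Threshold` type and the `classify` function are illustrative names, not part of the tool set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    warning: float                    # a value at or above this is a warning
    critical: Optional[float] = None  # a value at or above this is critical (if set)


# Values copied from the table; units are ms for latency and % for usage.
THRESHOLDS = {
    "wal_fsync_p99_ms": Threshold(warning=10),
    "backend_commit_p99_ms": Threshold(warning=25),
    "etcd_pod_cpu_pct": Threshold(warning=70, critical=85),
    "etcd_pod_memory_pct": Threshold(warning=70, critical=85),
    "db_space_pct": Threshold(warning=90),
    "peer_latency_ms": Threshold(warning=50, critical=100),
}


def classify(metric, value):
    """Return 'ok', 'warning', or 'critical' for a metric value."""
    threshold = THRESHOLDS[metric]
    if threshold.critical is not None and value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "ok"


# Made-up readings to show the three outcomes.
statuses = {
    "wal_fsync_p99_ms": classify("wal_fsync_p99_ms", 4.2),
    "etcd_pod_cpu_pct": classify("etcd_pod_cpu_pct", 78.0),
    "peer_latency_ms": classify("peer_latency_ms", 120.0),
}
```

Keeping the thresholds in a plain dictionary makes it straightforward to move them into a configuration file later.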
| Analysis Layer | Tool | Key Metrics |
|---|---|---|
| L1 Physical Layer | query_network_l1_metrics | Received/Sent bytes, packets, errors, drops |
| Network I/O | query_network_io_metrics | Throughput, IOPS, bandwidth utilization |
| TCP Socket | query_network_socket_tcp_metrics | Active connections, TIME_WAIT, retransmissions |
| UDP Socket | query_network_socket_udp_metrics | Received/Sent packets, errors |
| IP Socket | query_network_socket_ip_metrics | IP layer statistics, fragmentation, routing |
| Memory Statistics | query_network_socket_mem_metrics | Socket buffer usage, memory pressure |
| Softnet Statistics | query_network_socket_softnet_metrics | Softirq processing, drops, squeeze |
| TCP Netstat | query_network_netstat_tcp_metrics | TCP state distribution, connection tracking |
| UDP Netstat | query_network_netstat_udp_metrics | UDP statistics, error rate |
| Component | Tool | Monitoring Content |
|---|---|---|
| OVN Pod | query_ovnk_pod_metrics | CPU, Memory, Restart Count |
| Multus CNI | query_multus_pod_metrics | Multus Component Health Status |
| OVN Container | query_ovnk_container_metrics | Container-level Resource Usage |
| OVN Sync | query_ovnk_sync_metrics | NB/SB Database Sync Latency |
| OVS Daemon | query_ovnk_ovs_metrics | OVS Flow Table, Connection Tracking |
| Network Latency | query_ovnk_latency_metrics | Inter-Pod Latency, CNI Operation Latency |
| Kubernetes API | query_kube_api_metrics | API Request Latency, Error Rate |
# Scenario: Analyze ETCD cluster performance over the past 1 hour
User Query: "Analyze etcd performance over the past 1 hour, focusing on WAL fsync and backend commit latency"

AI Agents Execution Flow:
1. Call get_etcd_performance_deep_drive(duration="1h")
2. Collect subsystem data:
   - General info (CPU, memory, database size)
   - WAL Fsync Performance (P50/P90/P99 latency)
   - Backend Commit Performance (P50/P90/P99 latency)
   - Disk I/O (IOPS, throughput, latency)
   - Network I/O (bandwidth, packet rate)
   - Node Resource Usage (CPU, memory, cgroup)
   - Compaction/Defrag Statistics
3. Execute bottleneck detection algorithm
4. Generate analysis report
Time Range: 2026-04-12 14:00:00 UTC ~ 2026-04-12 15:00:00 UTC
Cluster Status: Healthy ✅
# Scenario: Troubleshoot cluster network latency issues
User Query: "Why is my inter-Pod communication slow? Help me analyze network performance"

AI Agents Execution Flow:
1. Call query_network_l1_metrics(duration="30m") - Check physical layer
2. Call query_network_io_metrics(duration="30m") - Check throughput
3. Call query_network_socket_tcp_metrics(duration="30m") - Check TCP connections
4. Call query_network_socket_softnet_metrics(duration="30m") - Check softirq
5. Analyze data, identify anomalous patterns
6. Generate diagnostic report
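The layered order of the calls in this flow can be sketched as a small pipeline that walks the checks from the physical layer upward and collects the findings. The tool names come from the table of network tools; the `run_diagnosis` helper, the `abnormal` flag, and the stubbed tool functions are hypothetical stand-ins so that the example runs without a cluster.

```python
from typing import Callable, Dict, List, Tuple

# Ordered checks, lowest network layer first (tool names from the tool table).
CHECK_ORDER: List[Tuple[str, str]] = [
    ("query_network_l1_metrics", "physical layer"),
    ("query_network_io_metrics", "throughput"),
    ("query_network_socket_tcp_metrics", "TCP connections"),
    ("query_network_socket_softnet_metrics", "softirq"),
]


def run_diagnosis(tools: Dict[str, Callable[..., dict]], duration: str = "30m") -> List[dict]:
    """Call each tool in order and keep the results that report a problem."""
    findings = []
    for tool_name, layer in CHECK_ORDER:
        result = tools[tool_name](duration=duration)
        if result.get("abnormal"):
            findings.append({"tool": tool_name, "layer": layer, **result})
    return findings


# Stub tools returning made-up data; real tools would query Prometheus.
def make_stub(abnormal: bool, detail: str) -> Callable[..., dict]:
    def stub(duration: str) -> dict:
        return {"abnormal": abnormal, "detail": detail, "duration": duration}
    return stub


stub_tools = {
    "query_network_l1_metrics": make_stub(True, "packet loss"),
    "query_network_io_metrics": make_stub(False, "within limits"),
    "query_network_socket_tcp_metrics": make_stub(False, "within limits"),
    "query_network_socket_softnet_metrics": make_stub(True, "softnet drops"),
}

findings = run_diagnosis(stub_tools)
```

In the real platform the agent chooses and sequences the tool calls itself; a fixed order is used here only to keep the sketch deterministic.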
| Network Layer | Metric | Current Value | Status |
|---|---|---|---|
| L1 Physical Layer | Packet Loss Rate | 2.3% | 🔴 Abnormal |
| L1 Physical Layer | Error Packets | 1,234 pps | 🔴 Abnormal |
| TCP | Retransmission Rate | 1.8% | 🟡 Warning |
| TCP | TIME_WAIT Connections | 8,523 | 🟢 Normal |
| Softnet | Dropped Packets | 5,678 | 🔴 Abnormal |
| Network I/O | Bandwidth Utilization | 78% | 🟡 Near Saturation |
Main Issue: Physical-layer packet loss and softnet dropped packets point to network interface or driver issues.
ethtool -S eth0
dmesg | grep -i eth
sysctl -w net.core.netdev_max_backlog=5000

# Scenario: Analyze OVN-Kubernetes network latency
User Query: "What is the OVN Database sync latency? Are there any performance issues?"

AI Agents Execution:
1. Call query_ovnk_sync_metrics(duration="1h")
2. Call query_ovnk_latency_metrics(duration="1h")
3. Call query_ovnk_pod_metrics(duration="1h")
4. Call query_ovnk_ovs_metrics(duration="1h")
5. Comprehensive OVN performance analysis
User: "Analyze etcd performance for the past 24 hours"
AI: Analyzing ETCD performance... [Calling tool: get_etcd_performance_deep_drive]
User: "Is the WAL fsync latency exceeding thresholds?"
AI: Let me check WAL fsync detailed data... [Calling tool: get_etcd_disk_wal_fsync]
User: "Generate a complete performance report for me"
AI: Generating report... [Calling tool: generate_etcd_performance_report]
Report generated and saved to exports/etcd_performance_report_20260412.html