🚀 OCP Performance Analyzer MCP

AI-Powered Performance Analysis Platform for OpenShift/Kubernetes Clusters

Python 3.8+ · MCP Protocol · LangGraph AI · FastAPI · DuckDB

📊 Project Overview

OCP Performance Analyzer MCP is a comprehensive, AI-driven performance analysis and monitoring platform for OpenShift/Kubernetes clusters. Through the Model Context Protocol (MCP), combined with LangGraph intelligent agents, it provides deep performance insights, automated root cause analysis, and actionable optimization recommendations.

  • Analysis Components: 3 (ETCD · Network · OVN-K)
  • Performance Metrics: 200+ (across 11 config files)
  • Analysis Tools: 35+ (MCP server tools)
  • AI Agents: 3 (Chat · Report · Storage)

🎯 Main Analysis Areas

🔷 ETCD Analyzer

  • 15+Analysis Tools
  • Cluster health monitoring
  • WAL Fsync performance (P99 <10ms)
  • Backend Commit latency (P99 <25ms)
  • Disk I/O performance analysis
  • Network I/O monitoring
  • Deep performance profiling
  • Automatic bottleneck detection

๐ŸŒ Network Analyzer

  • 10+ network tools
  • L1 physical layer statistics
  • Socket statistics (TCP/UDP/IP)
  • Memory statistics (Socket Buffer)
  • Softnet statistics
  • Netstat metrics (TCP/UDP)
  • Network I/O throughput
  • 95+ network metrics coverage

🔗 OVN-Kubernetes Analyzer

  • 8+ dedicated tools
  • OVN database monitoring
  • Kubelet CNI performance
  • Network latency analysis
  • OVS usage
  • Pod metrics monitoring
  • API statistics
  • NB/SB database size

๐Ÿ—๏ธ System Architecture

High-Level Architecture Design

┌─────────────────────────────────────────────────────────────────┐
│                          Client Layer                           │
│   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐        │
│   │    Web UI    │   │  CLI Tools   │   │   REST API   │        │
│   └──────┬───────┘   └──────┬───────┘   └──────┬───────┘        │
└──────────┼──────────────────┼──────────────────┼────────────────┘
           └──────────────────┼──────────────────┘
                              │
┌─────────────────────────────┼───────────────────────────────────┐
│                   AI Agents Layer (Port 8080)                   │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  LangGraph AI Agents: Chat · Report · Storage             │  │
│  │  • Streaming response support                             │  │
│  │  • Tool orchestration                                     │  │
│  │  • Conversation context memory                            │  │
│  │  • OpenAI/compatible LLM integration                      │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────┬───────────────────────────────────┘
                              │ MCP Protocol Communication
┌─────────────────────────────┼───────────────────────────────────┐
│                  MCP Server Layer (Port 8000)                   │
│   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐        │
│   │ ETCD Server  │   │Network Server│   │ OVNK Server  │        │
│   │  15+ Tools   │   │  10+ Tools   │   │   8+ Tools   │        │
│   └──────┬───────┘   └──────┬───────┘   └──────┬───────┘        │
└──────────┼──────────────────┼──────────────────┼────────────────┘
           │                  │                  │
┌──────────┼──────────────────┼──────────────────┼────────────────┐
│          │                  │                  │                │
│   ┌──────▼───────┐   ┌──────▼───────┐   ┌──────▼───────┐        │
│   │    Tools/    │   │    Tools/    │   │    Tools/    │        │
│   │  Collectors  │   │  Collectors  │   │  Collectors  │        │
│   └──────┬───────┘   └──────┬───────┘   └──────┬───────┘        │
│          │                  │                  │                │
│   ┌──────▼───────┐   ┌──────▼───────┐   ┌──────▼───────┐        │
│   │   Analysis   │   │   Analysis   │   │   Analysis   │        │
│   │   Modules    │   │   Modules    │   │   Modules    │        │
│   └──────┬───────┘   └──────┬───────┘   └──────┬───────┘        │
│          │                  │                  │                │
│   ┌──────▼───────┐   ┌──────▼───────┐   ┌──────▼───────┐        │
│   │     ELT      │   │     ELT      │   │     ELT      │        │
│   │   Pipeline   │   │   Pipeline   │   │   Pipeline   │        │
│   │     Data     │   │     Data     │   │     Data     │        │
│   │  Transform   │   │  Transform   │   │  Transform   │        │
│   └──────┬───────┘   └──────┬───────┘   └──────┬───────┘        │
│          │                  │                  │                │
│   ┌──────▼──────────────────▼──────────────────▼──────┐         │
│   │              Storage Layer (DuckDB)               │         │
│   │  • Time-series storage • SQL interface • History  │         │
│   └─────────────────────────┬─────────────────────────┘         │
└─────────────────────────────┼───────────────────────────────────┘
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│          OpenShift/Kubernetes Cluster Infrastructure            │
│  • ETCD Cluster        • Prometheus/Thanos   • Kubernetes API   │
│  • Master Nodes        • OVN-Kubernetes     • Network Components│
└─────────────────────────────────────────────────────────────────┘

Component Architecture Pattern

1๏ธโƒฃ MCP Server (FastMCP)

FastMCP-based server that exposes analysis tools for AI agent invocation. Supports SSE streaming and HTTP REST API.

2๏ธโƒฃ Tools/Collectors

Dedicated Prometheus metrics collector that retrieves cluster performance data via PromQL queries.
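
As a hedged sketch of what such a collector does (endpoint, metric, and function names here are assumptions, using only the standard Prometheus `query_range` HTTP API):

```python
# Sketch of a PromQL range-query collector using only the standard library.
# The project presumably layers Thanos auth and its config-file metric
# definitions on top of this.
import json
import urllib.parse


def build_range_query_url(base_url: str, promql: str,
                          start: int, end: int, step: str = "15s") -> str:
    """Build a Prometheus /api/v1/query_range URL for the given PromQL."""
    params = urllib.parse.urlencode(
        {"query": promql, "start": start, "end": end, "step": step})
    return f"{base_url}/api/v1/query_range?{params}"


def parse_range_response(body: str) -> dict:
    """Extract {series_labels_json: [(ts, value), ...]} from a reply."""
    data = json.loads(body)
    out = {}
    for series in data["data"]["result"]:
        key = json.dumps(series["metric"], sort_keys=True)
        out[key] = [(float(ts), float(v)) for ts, v in series["values"]]
    return out


url = build_range_query_url(
    "http://localhost:9090",
    'histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))',
    start=1700000000, end=1700003600)
```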

3๏ธโƒฃ Analysis Modules

Performance analysis engine implementing bottleneck detection, threshold comparison, and root cause analysis.

4๏ธโƒฃ ELT Pipeline (Extract-Load-Transform)

Data transformation pipeline converting raw JSON data into structured tables and HTML visualizations.
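
A minimal sketch of the transform step, assuming a simple per-pod JSON shape (the field names and HTML layout are illustrative, not the project's actual schema):

```python
# Flatten raw JSON metric samples into table rows, then render a simple
# HTML table. Field names are hypothetical.
import json


def transform(raw_json: str) -> list:
    """Flatten {"pod": {metric: value}} samples into (pod, metric, value) rows."""
    raw = json.loads(raw_json)
    return [(pod, metric, value)
            for pod, metrics in raw.items()
            for metric, value in metrics.items()]


def to_html(rows: list) -> str:
    """Render rows as a bare HTML table for report embedding."""
    body = "".join(f"<tr><td>{p}</td><td>{m}</td><td>{v}</td></tr>"
                   for p, m, v in rows)
    return f"<table>{body}</table>"


rows = transform('{"etcd-0": {"wal_fsync_p99_ms": 8.2}}')
html = to_html(rows)
```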

5๏ธโƒฃ Storage Layer

DuckDB-based time-series data persistence supporting historical data analysis and trend prediction.

6๏ธโƒฃ AI Agents (LangGraph)

Intelligent agent layer providing conversational analysis, automatic report generation, and data collection services.

⚡ Core Features

๐Ÿ” Real-time Performance Monitoring

Real-time collection of 200+ performance metrics via Prometheus/Thanos, covering ETCD, network, nodes, pods and more. Supports custom time ranges and sampling frequencies.

🤖 AI-Driven Intelligent Analysis

Integrates LangGraph and OpenAI for conversational performance analysis. AI agents understand natural language queries, automatically invoke relevant tools, and generate professional analysis reports.

📈 Historical Trend Analysis

DuckDB-based time-series data storage supporting long-term performance trend analysis, baseline comparison, and anomaly detection.

🎯 Automatic Bottleneck Detection

Multi-dimensional bottleneck analysis engine that automatically identifies performance bottlenecks across CPU, memory, disk I/O, and network, and provides optimization recommendations.
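
As an illustrative sketch of a multi-dimensional check (the dimension names and limits below mirror examples in this document but are not the project's exact rules):

```python
# Hypothetical bottleneck detector: compare sampled utilization figures
# against per-dimension limits and return every dimension that exceeds its
# limit. Thresholds are illustrative, not the project's actual values.
THRESHOLDS = {
    "cpu_percent": 85.0,
    "memory_percent": 85.0,
    "disk_io_wait_percent": 10.0,
    "network_bandwidth_percent": 80.0,
}


def detect_bottlenecks(samples: dict) -> list:
    """Return (dimension, value, limit) for every dimension over its limit."""
    return [(dim, val, THRESHOLDS[dim])
            for dim, val in samples.items()
            if dim in THRESHOLDS and val > THRESHOLDS[dim]]


findings = detect_bottlenecks({
    "cpu_percent": 45.0,
    "disk_io_wait_percent": 15.0,  # the demo's "Bottleneck!" case
})
```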

📊 Visual Report Generation

Automatically generates HTML-formatted performance reports including executive summaries, detailed metric tables, trend charts, and optimization recommendations.

🔌 Extensible Plugin Architecture

Configuration-file-based metric management supporting rapid addition of new monitoring metrics and analysis tools without modifying core code.
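
A sketch of what config-driven metric registration can look like; the field names follow the metrics-etcd.yml snippet later in this document, while the loader itself is hypothetical:

```python
# Hypothetical loader for metric definitions shaped like the
# config/metrics-etcd.yml snippet shown later in this README. New metrics
# are added purely by extending the config, with no analyzer-code changes.
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricDef:
    name: str
    expr: str            # PromQL expression
    unit: str
    category: str
    threshold: Optional[float] = None


def load_metrics(config: dict) -> dict:
    """Index metric definitions by name."""
    return {m["name"]: MetricDef(
                name=m["name"], expr=m["expr"], unit=m["unit"],
                category=m["category"], threshold=m.get("threshold"))
            for m in config["metrics"]}


registry = load_metrics({"metrics": [
    {"name": "etcd_disk_wal_fsync_p99",
     "expr": ("histogram_quantile(0.99, "
              "rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m])) * 1000"),
     "unit": "ms", "category": "wal_fsync", "threshold": 10.0},
]})
```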

🧩 Components

ETCD Analyzer - Key Performance Metrics

| Metric Category    | Metric Name                 | Threshold                            | Importance   |
|--------------------|-----------------------------|--------------------------------------|--------------|
| Disk Performance   | WAL Fsync P99 Latency       | < 10ms (Excellent)                   | 🔴 Critical  |
| Disk Performance   | Backend Commit P99 Latency  | < 25ms (Excellent)                   | 🔴 Critical  |
| Resource Usage     | etcd Pod CPU Usage          | < 70% (Warning), < 85% (Critical)    | 🟡 Important |
| Resource Usage     | etcd Pod Memory Usage       | < 70% (Warning), < 85% (Critical)    | 🟡 Important |
| Database           | Database Space Utilization  | < 90%                                | 🟡 Important |
| Network            | Peer Latency                | < 50ms (Warning), < 100ms (Critical) | 🟡 Important |
| Cluster Health     | Proposal Failure Rate       | 0%                                   | 🔴 Critical  |
| Compaction/Defrag  | Compaction/Defrag Count     | Periodic monitoring                  | 🔵 Normal    |
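
The warning/critical thresholds in the table above can be mapped to a status with a small helper; this is an illustrative sketch, not the project's evaluation code:

```python
# Map a measured value onto the table's status levels, e.g. etcd pod CPU:
# warning at 70%, critical at 85%. Function name is hypothetical.
def classify(value: float, warning: float, critical: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a measured value."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


status = classify(72.0, warning=70.0, critical=85.0)  # etcd pod CPU at 72%
```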

ETCD Analyzer Tool List

  • get_server_health - Server Health Check
  • get_etcd_cluster_status - Cluster Status
  • get_ocp_cluster_info - Cluster Info
  • get_etcd_general_info - General Metrics
  • get_etcd_node_usage - Node Usage
  • get_etcd_disk_wal_fsync - WAL Fsync Performance
  • get_etcd_disk_backend_commit - Backend Commit Performance
  • get_node_disk_io - Disk I/O
  • get_etcd_disk_compact_defrag - Compression/Defrag
  • get_etcd_network_io - Network I/O
  • get_etcd_performance_deep_drive - Deep Analysis
  • get_etcd_bottleneck_analysis - Bottleneck Detection
  • generate_etcd_performance_report - Performance Report

Network Analyzer - 95+ Network Metrics

| Analysis Layer     | Tool                                 | Key Metrics                                    |
|--------------------|--------------------------------------|------------------------------------------------|
| L1 Physical Layer  | query_network_l1_metrics             | Received/sent bytes, packets, errors, drops    |
| Network I/O        | query_network_io_metrics             | Throughput, IOPS, bandwidth utilization        |
| TCP Socket         | query_network_socket_tcp_metrics     | Active connections, TIME_WAIT, retransmissions |
| UDP Socket         | query_network_socket_udp_metrics     | Received/sent packets, errors                  |
| IP Socket          | query_network_socket_ip_metrics      | IP-layer statistics, fragmentation, routing    |
| Memory Statistics  | query_network_socket_mem_metrics     | Socket buffer usage, memory pressure           |
| Softnet Statistics | query_network_socket_softnet_metrics | Softirq processing, drops, squeeze             |
| TCP Netstat        | query_network_netstat_tcp_metrics    | TCP state distribution, connection tracking    |
| UDP Netstat        | query_network_netstat_udp_metrics    | UDP statistics, error rate                     |

OVN-Kubernetes Analyzer

| Component       | Tool                         | Monitoring Content                       |
|-----------------|------------------------------|------------------------------------------|
| OVN Pod         | query_ovnk_pod_metrics       | CPU, memory, restart count               |
| Multus CNI      | query_multus_pod_metrics     | Multus component health status           |
| OVN Container   | query_ovnk_container_metrics | Container-level resource usage           |
| OVN Sync        | query_ovnk_sync_metrics      | NB/SB database sync latency              |
| OVS Daemon      | query_ovnk_ovs_metrics       | OVS flow table, connection tracking      |
| Network Latency | query_ovnk_latency_metrics   | Inter-pod latency, CNI operation latency |
| Kubernetes API  | query_kube_api_metrics       | API request latency, error rate          |

🎬 Performance Demo

Example 1: ETCD Deep Performance Analysis

# Scenario: Analyze ETCD cluster performance over the past 1 hour

User Query: "Analyze etcd performance over the past 1 hour, focusing on WAL fsync and backend commit latency"

AI Agents Execution Flow:
1. Call get_etcd_performance_deep_drive(duration="1h")
2. Collect subsystem data:
   - General info (CPU, memory, database size)
   - WAL Fsync Performance (P50/P90/P99 latency)
   - Backend Commit Performance (P50/P90/P99 latency)
   - Disk I/O (IOPS, throughput, latency)
   - Network I/O (bandwidth, packet rate)
   - Node Resource Usage (CPU, memory, cgroup)
   - Compaction/Defrag Statistics
3. Execute bottleneck detection algorithm
4. Generate analysis report
                

📋 Analysis Results Example

Time Range: 2026-04-12 14:00:00 UTC ~ 2026-04-12 15:00:00 UTC

Cluster Status: Healthy ✓

Critical Findings:
  • WAL Fsync P99 Latency: 8.2ms - Excellent (Target: <10ms)
  • Backend Commit P99 Latency: 32.5ms - Needs Attention (Target: <25ms)
  • etcd Pod Average CPU Usage: 45% - Normal
  • Database Space Usage: 62% - Healthy
  • Disk I/O Wait: 15% - Bottleneck!
  • Proposal Failure Rate: 0% - Normal

💡 Optimization Recommendations

  1. Backend Commit Optimization:
     • Check disk performance; consider faster SSDs (NVMe)
     • Confirm the disk I/O scheduler settings (noop or deadline recommended)
     • Check whether other processes are competing for disk I/O
  2. Disk I/O Wait Optimization:
     • The current 15% I/O wait indicates a disk bottleneck
     • Recommend dedicated high-performance storage devices
     • Consider enabling disk caching or adding memory
  3. Continuous Monitoring:
     • Regularly check compaction/defrag execution
     • Monitor database growth rate and plan capacity in advance
     • Set alert thresholds to detect performance degradation early

Example 2: Network Performance Analysis

# Scenario: Troubleshoot cluster network latency issues

User Query: "Why is my inter-Pod communication slow? Help me analyze network performance"

AI Agents Execution Flow:
1. Call query_network_l1_metrics(duration="30m") - Check physical layer
2. Call query_network_io_metrics(duration="30m") - Check throughput
3. Call query_network_socket_tcp_metrics(duration="30m") - Check TCP connections
4. Call query_network_socket_softnet_metrics(duration="30m") - Check softirq
5. Analyze data, identify anomalous patterns
6. Generate diagnostic report
                

📋 Network Diagnosis Results

| Network Layer     | Metric                | Current Value | Status             |
|-------------------|-----------------------|---------------|--------------------|
| L1 Physical Layer | Packet Loss Rate      | 2.3%          | 🔴 Abnormal        |
| L1 Physical Layer | Error Packets         | 1,234 pps     | 🔴 Abnormal        |
| TCP               | Retransmission Rate   | 1.8%          | 🟡 Warning         |
| TCP               | TIME_WAIT Connections | 8,523         | 🟢 Normal          |
| Softnet           | Dropped Packets       | 5,678         | 🔴 Abnormal        |
| Network I/O       | Bandwidth Utilization | 78%           | 🟡 Near Saturation |

🔧 Network Issue Root Cause Analysis

Main Issue: Physical layer packet loss and Softnet Dropped packets indicate network interface or driver issues

Troubleshooting Steps:
  1. Check Network Hardware:
     • Check NIC status: ethtool -S eth0
     • Check NIC errors: dmesg | grep -i eth
     • Confirm NIC driver and firmware versions
  2. Optimize Softnet Parameters:
     • Increase netdev_max_backlog: sysctl -w net.core.netdev_max_backlog=5000
     • Adjust RPS/RFS to balance CPU load
     • Consider increasing the interrupt coalescing interval
  3. Bandwidth Management:
     • The current 78% bandwidth utilization is near saturation
     • Consider upgrading network bandwidth or implementing QoS policies
     • Analyze traffic patterns and identify abnormal traffic sources

Example 3: OVN-Kubernetes Latency Analysis

# Scenario: Analyze OVN-Kubernetes network latency

User Query: "What is the OVN Database sync latency? Are there any performance issues?"

AI Agents Execution:
1. Call query_ovnk_sync_metrics(duration="1h")
2. Call query_ovnk_latency_metrics(duration="1h")
3. Call query_ovnk_pod_metrics(duration="1h")
4. Call query_ovnk_ovs_metrics(duration="1h")
5. Comprehensive OVN performance analysis
                

📋 OVN-Kubernetes Performance Report

| Metric                   | Value  | Status          |
|--------------------------|--------|-----------------|
| NB DB Sync Latency (P99) | 125ms  | 🟡 Slight Delay |
| SB DB Sync Latency (P99) | 89ms   | 🟢 Normal       |
| OVN Pod CPU Usage        | 52%    | 🟢 Healthy      |
| OVS Flow Table Entries   | 12,345 | 🟢 Normal       |

📖 Usage Examples

Quick Start

# 1. Clone repository
git clone git@github.com:liqcui/ocp-performance-analyzer-mcp.git
cd ocp-performance-analyzer-mcp

# 2. Create virtual environment
python3 -m venv venv
source venv/bin/activate

# 3. Install dependencies
pip install -e .

# 4. Configure environment variables
export KUBECONFIG=/path/to/kubeconfig
export OPENAI_API_KEY=your-api-key

# 5. Start ETCD Analyzer
cd mcp/etcd
./etcd_analyzer_command.sh start

# 6. Access Web UI
# Open browser: http://localhost:8080/ui

CLI Tools Usage

# Start MCP server
python etcd_analyzer_mcp_server.py

# Start AI chat client
python etcd_analyzer_client_chat.py

# Generate performance report
python etcd_analyzer_mcp_agent_report.py

# Collect data to database
python etcd_analyzer_mcp_agent_stor2db.py

Conversational Query Examples

💬 Example Conversation

User: "Analyze etcd performance for the past 24 hours"

AI: Analyzing ETCD performance... [Calling tool: get_etcd_performance_deep_drive]

User: "Is the WAL fsync latency exceeding thresholds?"

AI: Let me check WAL fsync detailed data... [Calling tool: get_etcd_disk_wal_fsync]

User: "Generate a complete performance report for me"

AI: Generating report... [Calling tool: generate_etcd_performance_report]
Report generated and saved to exports/etcd_performance_report_20260412.html

API Call Examples

# Query ETCD performance using the REST API
curl -X POST http://localhost:8000/tools/get_etcd_disk_wal_fsync \
  -H "Content-Type: application/json" \
  -d '{"duration": "1h"}'

# Query network I/O metrics
curl -X POST http://localhost:8000/tools/query_network_io_metrics \
  -H "Content-Type: application/json" \
  -d '{"duration": "30m"}'

# Health check
curl http://localhost:8000/health

# Get tool list
curl http://localhost:8080/api/tools

Configuration File Example

# config/metrics-etcd.yml snippet
metrics:
  - name: etcd_pods_cpu_usage
    title: "etcd_pods_cpu_usage"
    expr: 'sum(irate(container_cpu_usage_seconds_total{namespace="openshift-etcd", container="etcd", pod=~".*"}[2m])) by (pod) * 100'
    unit: "percent"
    category: "general_info"
    description: "CPU usage percentage for etcd containers"
  - name: etcd_disk_wal_fsync_p99
    title: "WAL Fsync P99 Latency"
    expr: 'histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m])) * 1000'
    unit: "ms"
    category: "wal_fsync"
    threshold: 10.0