🚀 OCP Performance Analyzer MCP

AI-Powered Performance Analysis Platform for OpenShift/Kubernetes Clusters

Python 3.8+ · MCP Protocol · LangGraph AI · FastAPI · DuckDB

📊 Project Overview

OCP Performance Analyzer MCP is a comprehensive, AI-driven performance analysis and monitoring platform for OpenShift/Kubernetes clusters. Through the Model Context Protocol (MCP), combined with LangGraph intelligent agents, it provides deep performance insights, automated root cause analysis, and actionable optimization recommendations.

  • Analysis Components: 3 (ETCD · Network · OVN-K)
  • Performance Metrics: 200+ (across 11 config files)
  • Analysis Tools: 35+ (MCP server tools)
  • AI Agents: 3 (Chat · Report · Storage)

🎯 Main Analysis Areas

🔷 ETCD Analyzer

  • 15+Analysis Tools
  • Cluster health monitoring
  • WAL Fsync performance (P99 <10ms)
  • Backend Commit latency (P99 <25ms)
  • Disk I/O performance analysis
  • Network I/O monitoring
  • Deep performance profiling
  • Automatic bottleneck detection

๐ŸŒ Network Analyzer

  • 10+ network tools
  • L1 physical layer statistics
  • Socket statistics (TCP/UDP/IP)
  • Memory statistics (Socket Buffer)
  • Softnet statistics
  • Netstat metrics (TCP/UDP)
  • Network I/O throughput
  • 95+ network metrics coverage

🔗 OVN-Kubernetes Analyzer

  • 8+ dedicated tools
  • OVN database monitoring
  • Kubelet CNI performance
  • Network latency analysis
  • OVS usage
  • Pod metrics monitoring
  • API statistics
  • NB/SB database size

๐Ÿ—๏ธ System Architecture

High-Level Architecture Design

┌─────────────────────────────────────────────────────────────────┐
│                          Client Layer                           │
│   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐        │
│   │    Web UI    │   │  CLI Tools   │   │   REST API   │        │
│   └──────┬───────┘   └──────┬───────┘   └──────┬───────┘        │
└──────────┼──────────────────┼──────────────────┼────────────────┘
           └──────────────────┼──────────────────┘
                              │
┌─────────────────────────────┼───────────────────────────────────┐
│                   AI Agents Layer (Port 8080)                   │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  LangGraph AI Agents: Chat · Report · Storage             │  │
│  │  • Streaming response support                             │  │
│  │  • Tool orchestration                                     │  │
│  │  • Conversation context memory                            │  │
│  │  • OpenAI/compatible LLM integration                      │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────┬───────────────────────────────────┘
                              │ MCP Protocol Communication
┌─────────────────────────────┼───────────────────────────────────┐
│                  MCP Server Layer (Port 8000)                   │
│   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐        │
│   │ ETCD Server  │   │Network Server│   │ OVNK Server  │        │
│   │  15+ Tools   │   │  10+ Tools   │   │   8+ Tools   │        │
│   └──────┬───────┘   └──────┬───────┘   └──────┬───────┘        │
└──────────┼──────────────────┼──────────────────┼────────────────┘
           │                  │                  │
┌──────────┼──────────────────┼──────────────────┼────────────────┐
│          │                  │                  │                │
│   ┌──────▼───────┐   ┌──────▼───────┐   ┌──────▼───────┐        │
│   │    Tools/    │   │    Tools/    │   │    Tools/    │        │
│   │  Collectors  │   │  Collectors  │   │  Collectors  │        │
│   └──────┬───────┘   └──────┬───────┘   └──────┬───────┘        │
│          │                  │                  │                │
│   ┌──────▼───────┐   ┌──────▼───────┐   ┌──────▼───────┐        │
│   │   Analysis   │   │   Analysis   │   │   Analysis   │        │
│   │   Modules    │   │   Modules    │   │   Modules    │        │
│   └──────┬───────┘   └──────┬───────┘   └──────┬───────┘        │
│          │                  │                  │                │
│   ┌──────▼───────┐   ┌──────▼───────┐   ┌──────▼───────┐        │
│   │     ELT      │   │     ELT      │   │     ELT      │        │
│   │   Pipeline   │   │   Pipeline   │   │   Pipeline   │        │
│   │     Data     │   │     Data     │   │     Data     │        │
│   │  Transform   │   │  Transform   │   │  Transform   │        │
│   └──────┬───────┘   └──────┬───────┘   └──────┬───────┘        │
│          │                  │                  │                │
│   ┌──────▼──────────────────▼──────────────────▼──────┐         │
│   │              Storage Layer (DuckDB)               │         │
│   │  • Time-series storage • SQL interface • History  │         │
│   └─────────────────────────┬─────────────────────────┘         │
└─────────────────────────────┼───────────────────────────────────┘
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│          OpenShift/Kubernetes Cluster Infrastructure            │
│  • ETCD Cluster        • Prometheus/Thanos   • Kubernetes API   │
│  • Master Nodes        • OVN-Kubernetes     • Network Components│
└─────────────────────────────────────────────────────────────────┘

Component Architecture Pattern

1๏ธโƒฃ MCP Server (FastMCP)

FastMCP-based server that exposes analysis tools for AI agent invocation. Supports SSE streaming and HTTP REST API.

2๏ธโƒฃ Tools/Collectors

Dedicated Prometheus metrics collector that retrieves cluster performance data via PromQL queries.
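
As a hedged sketch of what such a collector does (endpoint, metric, and function names here are assumptions, using only the standard Prometheus `query_range` HTTP API):

```python
# Sketch of a PromQL range-query collector using only the standard library.
# The project presumably layers Thanos auth and its config-file metric
# definitions on top of this.
import json
import urllib.parse


def build_range_query_url(base_url: str, promql: str,
                          start: int, end: int, step: str = "15s") -> str:
    """Build a Prometheus /api/v1/query_range URL for the given PromQL."""
    params = urllib.parse.urlencode(
        {"query": promql, "start": start, "end": end, "step": step})
    return f"{base_url}/api/v1/query_range?{params}"


def parse_range_response(body: str) -> dict:
    """Extract {series_labels_json: [(ts, value), ...]} from a reply."""
    data = json.loads(body)
    out = {}
    for series in data["data"]["result"]:
        key = json.dumps(series["metric"], sort_keys=True)
        out[key] = [(float(ts), float(v)) for ts, v in series["values"]]
    return out


url = build_range_query_url(
    "http://localhost:9090",
    'histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))',
    start=1700000000, end=1700003600)
```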

3๏ธโƒฃ Analysis Modules

Performance analysis engine implementing bottleneck detection, threshold comparison, and root cause analysis.

4๏ธโƒฃ ELT Pipeline (Extract-Load-Transform)

Data transformation pipeline converting raw JSON data into structured tables and HTML visualizations.
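
A minimal sketch of the transform step, assuming a simple per-pod JSON shape (the field names and HTML layout are illustrative, not the project's actual schema):

```python
# Flatten raw JSON metric samples into table rows, then render a simple
# HTML table. Field names are hypothetical.
import json


def transform(raw_json: str) -> list:
    """Flatten {"pod": {metric: value}} samples into (pod, metric, value) rows."""
    raw = json.loads(raw_json)
    return [(pod, metric, value)
            for pod, metrics in raw.items()
            for metric, value in metrics.items()]


def to_html(rows: list) -> str:
    """Render rows as a bare HTML table for report embedding."""
    body = "".join(f"<tr><td>{p}</td><td>{m}</td><td>{v}</td></tr>"
                   for p, m, v in rows)
    return f"<table>{body}</table>"


rows = transform('{"etcd-0": {"wal_fsync_p99_ms": 8.2}}')
html = to_html(rows)
```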

5๏ธโƒฃ Storage Layer

DuckDB-based time-series data persistence supporting historical data analysis and trend prediction.

6๏ธโƒฃ AI Agents (LangGraph)

Intelligent agent layer providing conversational analysis, automatic report generation, and data collection services.

⚡ Core Features

๐Ÿ” Real-time Performance Monitoring

Real-time collection of 200+ performance metrics via Prometheus/Thanos, covering ETCD, network, nodes, pods and more. Supports custom time ranges and sampling frequencies.

🤖 AI-Driven Intelligent Analysis

Integrates LangGraph and OpenAI for conversational performance analysis. AI agents understand natural language queries, automatically invoke relevant tools, and generate professional analysis reports.

📈 Historical Trend Analysis

DuckDB-based time-series data storage supporting long-term performance trend analysis, baseline comparison, and anomaly detection.

🎯 Automatic Bottleneck Detection

Multi-dimensional bottleneck analysis engine that automatically identifies performance bottlenecks across CPU, memory, disk I/O, and network, and provides optimization recommendations.
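
As an illustrative sketch of a multi-dimensional check (the dimension names and limits below mirror examples in this document but are not the project's exact rules):

```python
# Hypothetical bottleneck detector: compare sampled utilization figures
# against per-dimension limits and return every dimension that exceeds its
# limit. Thresholds are illustrative, not the project's actual values.
THRESHOLDS = {
    "cpu_percent": 85.0,
    "memory_percent": 85.0,
    "disk_io_wait_percent": 10.0,
    "network_bandwidth_percent": 80.0,
}


def detect_bottlenecks(samples: dict) -> list:
    """Return (dimension, value, limit) for every dimension over its limit."""
    return [(dim, val, THRESHOLDS[dim])
            for dim, val in samples.items()
            if dim in THRESHOLDS and val > THRESHOLDS[dim]]


findings = detect_bottlenecks({
    "cpu_percent": 45.0,
    "disk_io_wait_percent": 15.0,  # the demo's "Bottleneck!" case
})
```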

📊 Visual Report Generation

Automatically generates HTML-formatted performance reports including executive summaries, detailed metric tables, trend charts, and optimization recommendations.

🔌 Extensible Plugin Architecture

Configuration-file-based metric management supporting rapid addition of new monitoring metrics and analysis tools without modifying core code.
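
A sketch of what config-driven metric registration can look like; the field names follow the metrics-etcd.yml snippet later in this document, while the loader itself is hypothetical:

```python
# Hypothetical loader for metric definitions shaped like the
# config/metrics-etcd.yml snippet shown later in this README. New metrics
# are added purely by extending the config, with no analyzer-code changes.
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricDef:
    name: str
    expr: str            # PromQL expression
    unit: str
    category: str
    threshold: Optional[float] = None


def load_metrics(config: dict) -> dict:
    """Index metric definitions by name."""
    return {m["name"]: MetricDef(
                name=m["name"], expr=m["expr"], unit=m["unit"],
                category=m["category"], threshold=m.get("threshold"))
            for m in config["metrics"]}


registry = load_metrics({"metrics": [
    {"name": "etcd_disk_wal_fsync_p99",
     "expr": ("histogram_quantile(0.99, "
              "rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m])) * 1000"),
     "unit": "ms", "category": "wal_fsync", "threshold": 10.0},
]})
```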

🧩 Components

ETCD Analyzer - Key Performance Metrics

| Metric Category    | Metric Name                 | Threshold                            | Importance   |
|--------------------|-----------------------------|--------------------------------------|--------------|
| Disk Performance   | WAL Fsync P99 Latency       | < 10ms (Excellent)                   | 🔴 Critical  |
| Disk Performance   | Backend Commit P99 Latency  | < 25ms (Excellent)                   | 🔴 Critical  |
| Resource Usage     | etcd Pod CPU Usage          | < 70% (Warning), < 85% (Critical)    | 🟡 Important |
| Resource Usage     | etcd Pod Memory Usage       | < 70% (Warning), < 85% (Critical)    | 🟡 Important |
| Database           | Database Space Utilization  | < 90%                                | 🟡 Important |
| Network            | Peer Latency                | < 50ms (Warning), < 100ms (Critical) | 🟡 Important |
| Cluster Health     | Proposal Failure Rate       | 0%                                   | 🔴 Critical  |
| Compaction/Defrag  | Compaction/Defrag Count     | Periodic monitoring                  | 🔵 Normal    |
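
The warning/critical thresholds in the table above can be mapped to a status with a small helper; this is an illustrative sketch, not the project's evaluation code:

```python
# Map a measured value onto the table's status levels, e.g. etcd pod CPU:
# warning at 70%, critical at 85%. Function name is hypothetical.
def classify(value: float, warning: float, critical: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a measured value."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


status = classify(72.0, warning=70.0, critical=85.0)  # etcd pod CPU at 72%
```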

ETCD Analyzer Tool List

  • get_server_health - Server Health Check
  • get_etcd_cluster_status - Cluster Status
  • get_ocp_cluster_info - Cluster Info
  • get_etcd_general_info - General Metrics
  • get_etcd_node_usage - Node Usage
  • get_etcd_disk_wal_fsync - WAL Fsync Performance
  • get_etcd_disk_backend_commit - Backend Commit Performance
  • get_node_disk_io - Disk I/O
  • get_etcd_disk_compact_defrag - Compression/Defrag
  • get_etcd_network_io - Network I/O
  • get_etcd_performance_deep_drive - Deep Analysis
  • get_etcd_bottleneck_analysis - Bottleneck Detection
  • generate_etcd_performance_report - Performance Report

Network Analyzer - 95+ Network Metrics

| Analysis Layer     | Tool                                 | Key Metrics                                    |
|--------------------|--------------------------------------|------------------------------------------------|
| L1 Physical Layer  | query_network_l1_metrics             | Received/sent bytes, packets, errors, drops    |
| Network I/O        | query_network_io_metrics             | Throughput, IOPS, bandwidth utilization        |
| TCP Socket         | query_network_socket_tcp_metrics     | Active connections, TIME_WAIT, retransmissions |
| UDP Socket         | query_network_socket_udp_metrics     | Received/sent packets, errors                  |
| IP Socket          | query_network_socket_ip_metrics      | IP-layer statistics, fragmentation, routing    |
| Memory Statistics  | query_network_socket_mem_metrics     | Socket buffer usage, memory pressure           |
| Softnet Statistics | query_network_socket_softnet_metrics | Softirq processing, drops, squeeze             |
| TCP Netstat        | query_network_netstat_tcp_metrics    | TCP state distribution, connection tracking    |
| UDP Netstat        | query_network_netstat_udp_metrics    | UDP statistics, error rate                     |

OVN-Kubernetes Analyzer

| Component       | Tool                         | Monitoring Content                       |
|-----------------|------------------------------|------------------------------------------|
| OVN Pod         | query_ovnk_pod_metrics       | CPU, memory, restart count               |
| Multus CNI      | query_multus_pod_metrics     | Multus component health status           |
| OVN Container   | query_ovnk_container_metrics | Container-level resource usage           |
| OVN Sync        | query_ovnk_sync_metrics      | NB/SB database sync latency              |
| OVS Daemon      | query_ovnk_ovs_metrics       | OVS flow table, connection tracking      |
| Network Latency | query_ovnk_latency_metrics   | Inter-pod latency, CNI operation latency |
| Kubernetes API  | query_kube_api_metrics       | API request latency, error rate          |

🎬 Performance Demo

Example 1: ETCD Deep Performance Analysis

# Scenario: Analyze ETCD cluster performance over the past 1 hour

User Query: "Analyze etcd performance over the past 1 hour, focusing on WAL fsync and backend commit latency"

AI Agents Execution Flow:
1. Call get_etcd_performance_deep_drive(duration="1h")
2. Collect subsystem data:
   - General info (CPU, memory, database size)
   - WAL Fsync Performance (P50/P90/P99 latency)
   - Backend Commit Performance (P50/P90/P99 latency)
   - Disk I/O (IOPS, throughput, latency)
   - Network I/O (bandwidth, packet rate)
   - Node Resource Usage (CPU, memory, cgroup)
   - Compaction/Defrag Statistics
3. Execute bottleneck detection algorithm
4. Generate analysis report
                

📋 Analysis Results Example

Time Range: 2026-04-12 14:00:00 UTC ~ 2026-04-12 15:00:00 UTC

Cluster Status: Healthy ✓

Critical Findings:
  • WAL Fsync P99 Latency: 8.2ms - Excellent (Target: <10ms)
  • Backend Commit P99 Latency: 32.5ms - Needs Attention (Target: <25ms)
  • etcd Pod Average CPU Usage: 45% - Normal
  • Database Space Usage: 62% - Healthy
  • Disk I/O Wait: 15% - Bottleneck!
  • Proposal Failure Rate: 0% - Normal

💡 Optimization Recommendations

  1. Backend Commit Optimization:
     • Check disk performance; consider faster SSDs (NVMe)
     • Confirm the disk I/O scheduler settings (noop or deadline recommended)
     • Check whether other processes are competing for disk I/O
  2. Disk I/O Wait Optimization:
     • The current 15% I/O wait indicates a disk bottleneck
     • Recommend dedicated high-performance storage devices
     • Consider enabling disk caching or adding memory
  3. Continuous Monitoring:
     • Regularly check compaction/defrag execution
     • Monitor database growth rate and plan capacity in advance
     • Set alert thresholds to detect performance degradation early

Example 2: Network Performance Analysis

# Scenario: Troubleshoot cluster network latency issues

User Query: "Why is my inter-Pod communication slow? Help me analyze network performance"

AI Agents Execution Flow:
1. Call query_network_l1_metrics(duration="30m") - Check physical layer
2. Call query_network_io_metrics(duration="30m") - Check throughput
3. Call query_network_socket_tcp_metrics(duration="30m") - Check TCP connections
4. Call query_network_socket_softnet_metrics(duration="30m") - Check softirq
5. Analyze data, identify anomalous patterns
6. Generate diagnostic report
                

📋 Network Diagnosis Results

| Network Layer     | Metric                | Current Value | Status             |
|-------------------|-----------------------|---------------|--------------------|
| L1 Physical Layer | Packet Loss Rate      | 2.3%          | 🔴 Abnormal        |
| L1 Physical Layer | Error Packets         | 1,234 pps     | 🔴 Abnormal        |
| TCP               | Retransmission Rate   | 1.8%          | 🟡 Warning         |
| TCP               | TIME_WAIT Connections | 8,523         | 🟢 Normal          |
| Softnet           | Dropped Packets       | 5,678         | 🔴 Abnormal        |
| Network I/O       | Bandwidth Utilization | 78%           | 🟡 Near Saturation |

🔧 Network Issue Root Cause Analysis

Main Issue: Physical layer packet loss and Softnet Dropped packets indicate network interface or driver issues

Troubleshooting Steps:
  1. Check Network Hardware:
     • Check NIC status: ethtool -S eth0
     • Check NIC errors: dmesg | grep -i eth
     • Confirm NIC driver and firmware versions
  2. Optimize Softnet Parameters:
     • Increase netdev_max_backlog: sysctl -w net.core.netdev_max_backlog=5000
     • Adjust RPS/RFS to balance CPU load
     • Consider increasing the interrupt coalescing interval
  3. Bandwidth Management:
     • The current 78% bandwidth utilization is near saturation
     • Consider upgrading network bandwidth or implementing QoS policies
     • Analyze traffic patterns and identify abnormal traffic sources

Example 3: OVN-Kubernetes Latency Analysis

# Scenario: Analyze OVN-Kubernetes network latency

User Query: "What is the OVN Database sync latency? Are there any performance issues?"

AI Agents Execution:
1. Call query_ovnk_sync_metrics(duration="1h")
2. Call query_ovnk_latency_metrics(duration="1h")
3. Call query_ovnk_pod_metrics(duration="1h")
4. Call query_ovnk_ovs_metrics(duration="1h")
5. Comprehensive OVN performance analysis
                

📋 OVN-Kubernetes Performance Report

| Metric                   | Value  | Status          |
|--------------------------|--------|-----------------|
| NB DB Sync Latency (P99) | 125ms  | 🟡 Slight Delay |
| SB DB Sync Latency (P99) | 89ms   | 🟢 Normal       |
| OVN Pod CPU Usage        | 52%    | 🟢 Healthy      |
| OVS Flow Table Entries   | 12,345 | 🟢 Normal       |

📖 Usage Examples

Quick Start

# 1. Clone repository
git clone git@github.com:liqcui/ocp-performance-analyzer-mcp.git
cd ocp-performance-analyzer-mcp

# 2. Create virtual environment
python3 -m venv venv
source venv/bin/activate

# 3. Install dependencies
pip install -e .

# 4. Configure environment variables
export KUBECONFIG=/path/to/kubeconfig
export OPENAI_API_KEY=your-api-key

# 5. Start ETCD Analyzer
cd mcp/etcd
./etcd_analyzer_command.sh start

# 6. Access Web UI
# Open browser: http://localhost:8080/ui

CLI Tools Usage

# Start MCP server
python etcd_analyzer_mcp_server.py

# Start AI chat client
python etcd_analyzer_client_chat.py

# Generate performance report
python etcd_analyzer_mcp_agent_report.py

# Collect data to database
python etcd_analyzer_mcp_agent_stor2db.py

Conversational Query Examples

💬 Example Conversation

User: "Analyze etcd performance for the past 24 hours"

AI: Analyzing ETCD performance... [Calling tool: get_etcd_performance_deep_drive]

User: "Is the WAL fsync latency exceeding thresholds?"

AI: Let me check WAL fsync detailed data... [Calling tool: get_etcd_disk_wal_fsync]

User: "Generate a complete performance report for me"

AI: Generating report... [Calling tool: generate_etcd_performance_report]
Report generated and saved to exports/etcd_performance_report_20260412.html

API Call Examples

# Query ETCD performance using the REST API
curl -X POST http://localhost:8000/tools/get_etcd_disk_wal_fsync \
  -H "Content-Type: application/json" \
  -d '{"duration": "1h"}'

# Query network I/O metrics
curl -X POST http://localhost:8000/tools/query_network_io_metrics \
  -H "Content-Type: application/json" \
  -d '{"duration": "30m"}'

# Health check
curl http://localhost:8000/health

# Get tool list
curl http://localhost:8080/api/tools

Configuration File Example

# config/metrics-etcd.yml snippet
metrics:
  - name: etcd_pods_cpu_usage
    title: "etcd_pods_cpu_usage"
    expr: 'sum(irate(container_cpu_usage_seconds_total{namespace="openshift-etcd", container="etcd", pod=~".*"}[2m])) by (pod) * 100'
    unit: "percent"
    category: "general_info"
    description: "CPU usage percentage for etcd containers"
  - name: etcd_disk_wal_fsync_p99
    title: "WAL Fsync P99 Latency"
    expr: 'histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m])) * 1000'
    unit: "ms"
    category: "wal_fsync"
    threshold: 10.0