Resource Monitoring

Resource Performance Monitoring

Overview

ZStack ZSphere provides visual charts that display resource monitoring data over a period of time. The charts cover multiple key performance metrics and give you an intuitive view of resource performance.

Monitoring Chart Types

Chart Type	Description
Bar Chart	Displays resource capacity load data as proportional bars, giving an intuitive view of resource capacity.
Line Chart	Displays various resource load data as line charts, giving an intuitive view of resource health.

Monitoring Data Collection Methods

ZStack ZSphere provides two methods for collecting virtual machine monitoring data. For memory data, Advanced Monitoring is generally more accurate than Basic Monitoring, so Advanced Monitoring is recommended when you monitor memory.

Basic Monitoring: Monitoring data is obtained from the host via Libvirt.
Advanced Monitoring: Monitoring data is collected inside the virtual machine by an advanced monitoring agent. This method requires VMTools to be installed on the virtual machine in advance.

Monitoring Data Collection Intervals

ZStack ZSphere monitors resources in real time. By default, resource monitoring charts refresh their data every 10 seconds.

Capacity Monitoring

ZStack ZSphere provides information on the usage and allocation of compute and storage resources for virtual machines, hosts, clusters, data storage, data centers, and the root node (management node). This helps you understand the platform's resource usage at both a detailed and an overall level.

Capacity Monitoring Metrics

To view a resource's capacity, open its overview details page and check the Capacity Information card. The following table lists the capacity monitoring metrics for each resource type.

Object	Monitoring Metrics and Description
Root Node	CPU: Total physical CPU GHz and average utilization rate across all data centers. Memory: Total physical memory, average utilization rate, and remaining available capacity across all data centers. Storage: Total physical storage, average utilization rate, and remaining available capacity across all data centers.
Data Center	CPU: Total physical CPU GHz and average utilization rate within the data center. Memory: Total physical memory, average utilization rate, and remaining available capacity within the data center. Storage: Total physical storage, average utilization rate, and remaining available capacity within the data center.
Data Storage	Storage Utilization: Total storage resources, utilization rate, and remaining available capacity. Storage Allocation Ratio: Allocation status of storage resources. Storage Distribution: Distribution of storage resources, including total capacity after overcommitment, reserved capacity, allocated capacity (such as snapshot capacity, image cache, migration storage, and virtual machine disk capacity), and remaining allocatable capacity.
Cluster	Resource Utilization: Total physical CPU and memory resources, utilization rate, and remaining available capacity in the cluster. Resource Allocation Ratio: Allocation status of physical CPU and memory resources in the cluster. Resource Distribution: Distribution of CPU and memory resources in the cluster after overcommitment, including total capacity after overcommitment, reserved capacity, allocated capacity, and remaining allocatable capacity.
Host	Resource Utilization: Total physical CPU, memory, and storage resources on the host, utilization rate, and remaining available capacity. Resource Allocation Ratio: Allocation status of CPU, memory, and storage resources on the host. Resource Distribution: Distribution of CPU, memory, and storage resources on the host after overcommitment, including total capacity after overcommitment, reserved capacity, allocated capacity, and remaining allocatable capacity.
Virtual Machine	CPU: Number of CPU cores and utilization rate for the virtual machine. Memory: Total memory capacity, used capacity, and remaining available capacity for the virtual machine. Storage: Total storage capacity, used capacity, and remaining available capacity for the virtual machine.

Capacity Calculation Rules

Category	Calculation Rules
Resource Utilization Rate	Total CPU = Physical Cores × Single-Core GHz
Resource Allocation Ratio	Allocation Ratio = Allocated : Total Overcommit Capacity
	Total Overcommit Capacity = Physical Total − Reserved Physical Capacity
	Total Allocatable = Total Overcommit Capacity × Overcommit Ratio
	Free to Allocate = Total Allocatable − Allocated
Resource Distribution	CPU: CPU Overcommitted Total = Physical CPU Total × Overcommit Ratio
	Memory: Memory Overcommitted Total = Reserved Memory + Total Allocatable Memory Capacity
	Memory: Total Allocatable Memory = (Physical Memory Total − Reserved Memory) × Overcommit Ratio
	Storage: Storage Overcommitted Total = Reserved Capacity + Total Allocatable Storage Capacity
	Storage: Total Allocatable Storage = (Physical Storage Total − Reserved Capacity) × Overcommit Ratio
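The calculation rules can be checked with a little arithmetic. The Python sketch below applies them literally to one hypothetical host; the input figures (256 GiB physical memory, 16 GiB reserved, 200 GiB allocated, memory overcommit ratio 1.5, and 32 cores at 2.5 GHz with a CPU overcommit ratio of 4) are made up for illustration, the function names are not part of any ZStack ZSphere API, and treating Physical CPU Total as the Total CPU figure in GHz is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ResourcePool:
    """Memory or storage figures for one resource; all values in the same unit (e.g. GiB)."""
    physical_total: float    # Physical Total
    reserved: float          # Reserved Physical Capacity
    allocated: float         # Allocated
    overcommit_ratio: float  # Overcommit Ratio


def allocation_figures(pool: ResourcePool) -> dict:
    """Apply the allocation and memory/storage distribution rules from the table."""
    # Total Overcommit Capacity = Physical Total - Reserved Physical Capacity
    total_overcommit_capacity = pool.physical_total - pool.reserved
    # Total Allocatable = Total Overcommit Capacity x Overcommit Ratio
    total_allocatable = total_overcommit_capacity * pool.overcommit_ratio
    return {
        "total_overcommit_capacity": total_overcommit_capacity,
        "total_allocatable": total_allocatable,
        # Free to Allocate = Total Allocatable - Allocated
        "free_to_allocate": total_allocatable - pool.allocated,
        # Allocation Ratio = Allocated : Total Overcommit Capacity, expressed as a fraction
        "allocation_ratio": pool.allocated / total_overcommit_capacity,
        # Memory/Storage Overcommitted Total = Reserved + Total Allocatable
        "overcommitted_total": pool.reserved + total_allocatable,
    }


def total_cpu_ghz(physical_cores: int, single_core_ghz: float) -> float:
    # Total CPU = Physical Cores x Single-Core GHz
    return physical_cores * single_core_ghz


def cpu_overcommitted_total(physical_cpu_total: float, overcommit_ratio: float) -> float:
    # CPU Overcommitted Total = Physical CPU Total x Overcommit Ratio
    return physical_cpu_total * overcommit_ratio


if __name__ == "__main__":
    # Hypothetical host memory: 256 GiB total, 16 GiB reserved, 200 GiB allocated, ratio 1.5
    memory = allocation_figures(ResourcePool(256.0, 16.0, 200.0, 1.5))
    for name, value in memory.items():
        print(f"{name}: {value:.2f}")
    # Hypothetical CPU: 32 cores at 2.5 GHz, overcommit ratio 4
    # (assumes Physical CPU Total is the Total CPU figure in GHz)
    print(cpu_overcommitted_total(total_cpu_ghz(32, 2.5), 4.0))
```

With these inputs, Total Overcommit Capacity is 240 GiB, Total Allocatable is 360 GiB, Free to Allocate is 160 GiB, the allocation ratio is about 0.83, and Memory Overcommitted Total is 376 GiB. The CPU side gives 80 GHz of Total CPU and 320 GHz after overcommitment.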

View Monitoring Charts

ZStack ZSphere can display load monitoring data for various resources as line charts. These charts help you quickly review the compute, storage, and network resources of a resource object and give you an intuitive view of its health.

Procedure

In the navigation pane, choose Inventory.
Select a resource object that supports monitoring, such as a virtual machine, host, cluster, image storage, data storage, or distributed port group.
In the right-side pane, click Monitoring.
(Optional) Select the monitoring items you want to display.
(Optional) Choose or customize the time range.
(Optional) Select one or multiple monitoring objects.

Customize Monitoring Charts

You can customize monitoring charts to view more monitoring data.

Details: Hover the mouse over the chart to display detailed information about the relevant data points.
Custom Time Span: By default, charts display monitoring data for the past 15 minutes. Available time spans are 15 minutes, 1 hour, 6 hours, 1 day, 1 week, 1 month, 1 year, and a custom range.
Custom Monitoring Items: Flexibly select the monitoring metrics you want to focus on based on your business needs.
Custom Monitoring Objects: Display data for all or specified monitoring objects.
Custom Chart Position: Freely drag and rearrange the position of monitoring charts.

Appendix of Monitoring Items

Object	Metric	Item and Description
Cluster	CPU	CPU Utilization Sum
	Memory	Memory Usage Percentage
	Disk	Disk IOPS Sum
	NIC	NIC Data Transfer Rate Sum
Host	CPU	CPU Utilization: The proportion of time the CPU is in a non-idle state. CPU Idle Rate: The proportion of time the CPU is in an idle state. CPU Occupancy Rate (System Process): The proportion of time the CPU spends in kernel space, performing typical operations such as memory allocation, I/O operations, and creating child processes. CPU Occupancy Rate (User Process): The proportion of time the CPU spends in user space, running typical user-space programs such as shells, databases, and web servers. CPU Occupancy Rate Average (Waiting): The proportion of time the CPU spends waiting for disk read or write operations to complete.
	Memory	Memory Usage: The amount of used and free resource memory.
	Disk	Disk Speed: The read and write speed of the resource disk. Disk IOPS: The read and write IOPS of the resource disk. Disk Latency: The latency of the resource disk. Total Disk Usage Ratio: The percentage of used capacity across all host disks. Total Disk Usage: The amount of used capacity across all host disks. Disk Usage Ratio of Platform System Files: The percentage of disk capacity occupied by the platform system files. Disk Usage of Platform System Files: The amount of disk capacity occupied by the platform system files.
	NIC	NIC Data Transfer Rate: The current send and receive rate of the resource's NIC. NIC Packet Rate: The current send and receive packet rate of the resource's NIC. NIC Packet Discard Rate: The current packet drop rate for outgoing and incoming packets on the resource's NIC.
Virtual Machine	CPU	CPU Utilization: The proportion of time the CPU is in a non-idle state. CPU Idle Rate: The proportion of time the CPU is in an idle state. CPU Occupancy Rate (System Process): The proportion of time the CPU spends in kernel space, performing typical operations such as memory allocation, I/O operations, and creating child processes. CPU Occupancy Rate (User Process): The proportion of time the CPU spends in user space, running typical user-space programs such as shells, databases, and web servers. CPU Occupancy Rate Average (Waiting): The proportion of time the CPU spends waiting for disk read or write operations to complete.
	Memory	Memory Usage: The amount of used and free resource memory. Available Memory Capacity: The available amount of resource memory that can be used. Free Memory Capacity: The amount of free resource memory. Total Memory Capacity: The total amount of resource memory. Memory Idle Rate: The percentage of resource memory currently in an idle state. Memory Utilization: The percentage of resource memory that is currently in use.
	Disk	Disk Speed: The read and write speed of the resource disk. Disk IOPS: The read and write IOPS of the resource disk. Disk Utilization: The percentage of used capacity on the resource disk. Disk Idle Rate: The percentage of idle capacity on the resource disk. Disk Usage Capacity: The amount of used capacity on the resource disk. Disk Idle Capacity: The amount of free capacity on the resource disk.
	NIC	NIC Data Transfer Rate: The current send and receive rate of the resource's NIC. NIC Packet Rate: The current send and receive packet rate of the resource's NIC. NIC Packet Discard Rate: The current packet drop rate for outgoing and incoming packets on the resource's NIC.
Data Storage	Capacity	Capacity Percent Used: The percentage of capacity currently used by the resource.
Image Storage - Standalone Image Storage/Distributed Image Storage	Capacity	Capacity Percent Used: The percentage of capacity currently used by the resource.
Image Storage - Standalone Image Storage	CPU	CPU Utilization: The proportion of time the CPU is in a non-idle state. CPU Idle Rate: The proportion of time the CPU is in an idle state. CPU Occupancy Rate (System Process): The proportion of time the CPU spends in kernel space, performing typical operations such as memory allocation, I/O operations, and creating child processes. CPU Occupancy Rate (User Process): The proportion of time the CPU spends in user space, running typical user-space programs such as shells, databases, and web servers. CPU Occupancy Rate Average (Waiting): The proportion of time the CPU spends waiting for disk read or write operations to complete.
	Disk	Disk Speed: The read and write speed of the resource disk. Disk IOPS: The read and write IOPS of the resource disk.
	Memory	Memory Usage: The amount of used and free resource memory.
	NIC	NIC Data Transfer Rate: The current send and receive rate of the resource's NIC. NIC Packet Rate: The current send and receive packet rate of the resource's NIC. NIC Packet Discard Rate: The current packet drop rate for outgoing and incoming packets on the resource's NIC.
Distributed Port Group	IP	Used IP Percentage (IPv4): The percentage of IPv4 addresses currently used by the resource. Available IP Percentage (IPv4): The percentage of remaining available IPv4 addresses on the resource.
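The CPU items in the table are proportions of time. The Python sketch below shows one common way to derive such proportions: sample cumulative per-state CPU time counters twice (similar to the user, system, idle, and iowait fields of Linux `/proc/stat`) and divide each state's increase by the total increase. The counter names and the choice to count iowait as non-idle time are assumptions; the exact collector and formula ZStack ZSphere uses are not documented here.

```python
from dataclasses import dataclass

FIELDS = ("user", "system", "idle", "iowait", "other")


@dataclass
class CpuTimes:
    """Cumulative CPU time per state (e.g. clock ticks); field names are assumed."""
    user: int
    system: int
    idle: int
    iowait: int
    other: int = 0  # nice, irq, softirq, steal, and so on, lumped together


def cpu_percentages(prev: CpuTimes, curr: CpuTimes) -> dict:
    """Convert two samples of cumulative counters into percentages for the interval."""
    delta = {f: getattr(curr, f) - getattr(prev, f) for f in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("the second sample must be taken after the first")
    pct = {f: 100.0 * value / total for f, value in delta.items()}
    return {
        # Only the idle state counts as idle here; iowait is treated as non-idle (assumption)
        "cpu_utilization": 100.0 - pct["idle"],
        "cpu_idle_rate": pct["idle"],
        "cpu_system": pct["system"],  # kernel-space share
        "cpu_user": pct["user"],      # user-space share
        "cpu_wait": pct["iowait"],    # waiting for disk I/O
    }
```

For example, if the counters advance by 300 (user), 100 (system), 550 (idle), and 50 (iowait) ticks between samples, the sketch reports 45% utilization, 55% idle, 30% user, 10% system, and 5% waiting.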

Dashboard Monitoring

The ZStack ZSphere dashboard uses cards to display platform resource status statistics, platform load trends, platform usage statistics, top resource rankings, user information, and statistics on unread alarms from the past seven days.

Each time you enter or refresh the dashboard, the latest data is fetched and displayed in real-time. Additionally, chart-based modules automatically refresh data every 30 seconds by default.
By default, the dashboard shows resource data for the current data center. To display another data center's resource data, click the switch button in the upper-left corner of the page.
Status charts use a standardized color scheme: green indicates normal status, red indicates an abnormal status, and gray indicates other statuses.
Percentage progress bars are color-coded as blue (less than 60%), yellow (greater than or equal to 60% but less than 80%), and red (greater than or equal to 80%) to visually represent the current resource usage state.
For resource status cards and some load trend and usage statistics cards, you can click a resource name or a statistic to go to the corresponding resource page.
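The color rules above amount to two small lookups. The Python sketch below restates them, for example for an external report that should match the dashboard; the function names and the status strings `normal` and `abnormal` are hypothetical and not taken from ZStack ZSphere.

```python
def progress_bar_color(percent: float) -> str:
    """Map a usage percentage to the dashboard progress-bar color."""
    if percent < 60:
        return "blue"    # less than 60%
    if percent < 80:
        return "yellow"  # at least 60% but less than 80%
    return "red"         # 80% or more


def status_color(status: str) -> str:
    """Map a resource status to the standard status-chart color (status names assumed)."""
    if status == "normal":
        return "green"
    if status == "abnormal":
        return "red"
    return "gray"  # any other status
```

Note that the boundaries follow the text exactly: 60% is already yellow and 80% is already red.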

Dual Management Node Monitoring

If your environment has two management nodes, go to the Reliability > MN Monitoring page to view management node monitoring data.

Before checking management node monitoring data, note the following:

This page uses three colors: green, red, and gray. Green indicates normal status, while other colors indicate abnormal status.
The dual management node setup uses an active-standby model with only one active management node. The node that holds the VIP is the active management node, and the node without the VIP is the standby management node.
If the standby management node is abnormal, the active management node cannot fail over to it, and the management service goes down if the active node fails. Therefore, resolve management node issues promptly.
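The role rules above can be restated as a quick consistency check. This Python sketch is a hypothetical helper, not a ZStack ZSphere API: it assumes you have already read each node's IP, whether it shows the VIP, and its status color from the MN Monitoring page, and it flags the conditions described above.

```python
from dataclasses import dataclass


@dataclass
class ManagementNode:
    ip: str
    has_vip: bool  # True for the node that currently shows the VIP
    color: str     # status color on the MN Monitoring page: "green", "red", or "gray"


def assess_pair(nodes: list[ManagementNode]) -> list[str]:
    """Identify active/standby nodes by VIP and return warnings for abnormal states."""
    warnings = []
    active = [n for n in nodes if n.has_vip]
    standby = [n for n in nodes if not n.has_vip]
    if len(active) != 1:
        warnings.append(f"expected exactly one node with the VIP, found {len(active)}")
    for node in active:
        if node.color != "green":  # any color other than green means abnormal
            warnings.append(f"active node {node.ip} is abnormal")
    for node in standby:
        if node.color != "green":
            warnings.append(f"standby node {node.ip} is abnormal; failover will not succeed")
    return warnings
```

For a healthy pair the function returns an empty list; a red or gray standby node produces one warning naming its IP.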

For each management node, the monitoring page displays the node IP, node status, VIP, and management service status. The main monitored services are as follows:

Arbiter Gateway Reachable:
Monitors whether the arbitration IP used by the active and standby management nodes is reachable. If it is unreachable, high availability of the management nodes may fail.
Peer MN Reachable:
Monitors whether the standby management node is reachable. If it is unreachable, the active management node cannot communicate with it.
VIP Reachable:
Monitors whether the VIP is reachable. If the VIP is unreachable, the UI of the active management node cannot be accessed through the VIP.
Database Status:
Monitors the status of the database. If the database is abnormal, data may be lost. Resolve the fault promptly.
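When forwarding these checks elsewhere, for example to an on-call channel, it helps to keep each check paired with its consequence. The Python sketch below assumes you already have a pass/fail result for each service; how to obtain those results is not covered here, and the dictionary and function names are hypothetical.

```python
# Impact text paraphrases the service descriptions on the MN Monitoring page.
CHECK_IMPACTS = {
    "Arbiter Gateway Reachable": "management node high availability may fail",
    "Peer MN Reachable": "the standby management node cannot be reached",
    "VIP Reachable": "the UI cannot be accessed through the VIP",
    "Database Status": "data may be lost; resolve the database fault promptly",
}


def describe_failures(results: dict[str, bool]) -> list[str]:
    """Return an impact message for every monitored service whose check failed."""
    messages = []
    for name, passed in results.items():
        if not passed:
            impact = CHECK_IMPACTS.get(name, "unknown check; see the MN Monitoring page")
            messages.append(f"{name}: {impact}")
    return messages
```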

Host Hardware Monitoring

ZStack ZSphere supports monitoring the status of host hardware components such as CPUs, memory, sensors, and PCIe devices.

The hardware components that can be monitored on the host include:

CPU
Memory
Physical Disks
Physical Network Cards
GPU Devices
Block Devices
USB Devices
Sensors (Voltage, Current, Fans, Temperature)
Power Supply
PCIe Devices
