Resource Monitoring

Resource Performance Monitoring

Overview

ZStack ZSphere provides visual charts that display various monitoring data for resources over a period of time. These charts include multiple key performance monitoring metrics, helping you gain an intuitive understanding of resource performance conditions.

Monitoring Chart Types

Chart Type: Description
  • Bar Chart: Displays monitoring data of resource capacity load in the form of proportional bars, providing an intuitive understanding of resource capacity information.
  • Line Chart: Displays monitoring data of various loads on resources in the form of a line chart, offering an intuitive understanding of resource health status.

Monitoring Data Collection Methods

ZStack ZSphere provides two monitoring methods for virtual machines. Generally speaking, for memory data, Advanced Monitoring offers better accuracy than Basic Monitoring. It is recommended to use Advanced Monitoring when monitoring memory data.
  • Basic Monitoring: Monitoring data is obtained from the host via Libvirt.
  • Advanced Monitoring: Monitoring data is obtained from the virtual machine by an advanced monitoring agent. VMTools must be pre-installed on the virtual machine for this method.

Monitoring Data Collection Intervals

ZStack ZSphere uses real-time monitoring, with resource monitoring charts refreshing data every 10 seconds by default.

Capacity Monitoring

ZStack ZSphere provides information on the usage and allocation of various computing and storage resources, including virtual machines, hosts, clusters, data storage, data centers, and root nodes (management nodes). This allows you to comprehensively understand the platform's resource usage from both micro and macro perspectives.

Capacity Monitoring Metrics

You can go to the overview details page of the corresponding resource to understand the platform's resource usage from the Capacity Information card. The following table lists the detailed monitoring metrics for various resources.

Object Monitoring Metrics and Description
Root Node
  • CPU: Total physical CPU GHz and average utilization rate across all data centers.
  • Memory: Total physical memory, average utilization rate, and remaining available capacity across all data centers.
  • Storage: Total physical storage, average utilization rate, and remaining available capacity across all data centers.
Data Center
  • CPU: Total physical CPU GHz and average utilization rate within the data center.
  • Memory: Total physical memory, average utilization rate, and remaining available capacity within the data center.
  • Storage: Total physical storage, average utilization rate, and remaining available capacity within the data center.
Data Storage
  • Storage Utilization: Total storage resources, utilization rate, and remaining available capacity.
  • Storage Allocation Ratio: Allocation status of storage resources.
  • Storage Distribution: Distribution of storage resources, including: total capacity after overcommitment, reserved capacity, allocated capacity (such as snapshot capacity, image cache, migration storage, and virtual machine disk capacity), and remaining allocatable capacity.
Cluster
  • Resource Utilization: Total physical CPU and memory resources, utilization rate, and remaining available capacity in the cluster.
  • Resource Allocation Ratio: Allocation status of physical CPU and memory resources in the cluster.
  • Resource Distribution: Distribution of CPU and memory resources after overcommitment in the cluster, including: total capacity after overcommitment, reserved capacity, allocated capacity, and remaining allocatable capacity.
Host
  • Resource Utilization: Total physical CPU, memory, and storage resources on the host, utilization rate, and remaining available capacity.
  • Resource Allocation Ratio: Allocation status of CPU, memory, and storage resources on the host.
  • Resource Distribution: Distribution of CPU, memory, and storage resources after overcommitment on the host, including: total capacity after overcommitment, reserved capacity, allocated capacity, and remaining allocatable capacity.
Virtual Machine
  • CPU: Number of CPU cores and utilization rate for the virtual machine.
  • Memory: Total memory capacity, used capacity, and remaining available capacity for the virtual machine.
  • Storage: Total storage capacity, used capacity, and remaining available capacity for the virtual machine.

Capacity Calculation Rules

Category Calculation Rules
Resource Utilization Rate
  • Total CPU = Physical Cores × Single-Core GHz
Resource Allocation Ratio
  • Allocation Ratio = Allocated : Total Overcommit Capacity
  • Total Overcommit Capacity = Physical Total − Reserved Physical Capacity
  • Total Allocatable = Total Overcommit Capacity × Overcommit Ratio
  • Free to Allocate = Total Allocatable − Allocated
Resource Distribution (CPU)
  • CPU Overcommitted Total = Physical CPU Total × Overcommit Ratio
Resource Distribution (Memory)
  • Memory Overcommitted Total = Reserved Memory + Total Allocatable Memory
  • Total Allocatable Memory = (Physical Memory Total − Reserved Memory) × Overcommit Ratio
Resource Distribution (Storage)
  • Storage Overcommitted Total = Reserved Capacity + Total Allocatable Storage
  • Total Allocatable Storage = (Physical Storage Total − Reserved Capacity) × Overcommit Ratio
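
The resource distribution rules above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not ZSphere code: the function names and the sample figures (a 10-core host, 100 GB of memory with 10 GB reserved, a 2:1 overcommit ratio) are assumptions chosen for the example.

```python
def cpu_overcommitted_total(physical_cpu_total: float, overcommit_ratio: float) -> float:
    """CPU Overcommitted Total = Physical CPU Total x Overcommit Ratio."""
    return physical_cpu_total * overcommit_ratio


def total_allocatable(physical_total: float, reserved: float, overcommit_ratio: float) -> float:
    """Total Allocatable = (Physical Total - Reserved) x Overcommit Ratio."""
    return (physical_total - reserved) * overcommit_ratio


def overcommitted_total(physical_total: float, reserved: float, overcommit_ratio: float) -> float:
    """Memory/Storage Overcommitted Total = Reserved + Total Allocatable."""
    return reserved + total_allocatable(physical_total, reserved, overcommit_ratio)


# Hypothetical host: 10 physical cores, 100 GB memory, 10 GB reserved, ratio 2:1.
print(cpu_overcommitted_total(10, 2))   # 20
print(total_allocatable(100, 10, 2))    # 180
print(overcommitted_total(100, 10, 2))  # 190
```

Note that, per the rules, reserved capacity is excluded before the memory or storage ratio is applied and then added back to the total, whereas the CPU rule multiplies the physical total directly.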
The meanings of overcommitment and allocation are as follows:
  • CPU Overcommitment: This indicates that a single physical CPU core can be virtually divided into N logical CPU cores for allocation to virtual machines.

    For example, if the CPU overcommitment ratio is 2:1, then one physical CPU core can be virtually divided into 2 logical CPU cores. Therefore, if a host has 10 physical CPU cores, it can be virtually divided into 20 logical CPU cores for allocation to virtual machines.

  • Memory/Storage Overcommitment: This indicates that a unit of memory/storage capacity can be virtually expanded into N units of memory/storage capacity for allocation to virtual machines.

    For example, if the memory/storage overcommitment ratio is 2:1, then 1 GB of memory/storage capacity can be virtually expanded into 2 GB. Therefore, if a host has 100 GB of memory/storage, it can be virtually expanded into 200 GB for allocation to virtual machines.

  • CPU Allocation: This indicates that a physical CPU core is actually virtually divided into N logical CPU cores for use by virtual machines. Therefore, the CPU allocation ratio ≤ CPU overcommitment ratio.

    For example, if the CPU allocation ratio is 1.5:1, then one physical CPU core is actually virtually divided into 1.5 logical CPU cores. Therefore, if a host has 10 physical CPU cores, they have actually been virtually divided into 15 logical CPU cores for allocation to virtual machines.

  • Memory/Storage Allocation: This indicates that a unit of memory/storage capacity is actually virtually expanded into N units of memory/storage capacity. Therefore, the memory/storage allocation ratio ≤ memory/storage overcommitment ratio.

    For example, if the memory/storage allocation ratio is 1.5:1, then 1 GB of memory/storage capacity is actually virtually expanded into 1.5 GB. Therefore, if a host has 100 GB of memory/storage, it has actually been virtually expanded into 150 GB for allocation to virtual machines.

Using host storage as an example, if the total physical storage capacity is 100 GB, the reserved physical capacity is 10 GB, the overcommitment ratio is 2:1, and the allocated capacity is 150 GB, then:
  • Storage Allocation Ratio = 150 GB : 90 GB = 1.67
  • Total Overcommit Storage = 100 GB - 10 GB = 90 GB
  • Total Allocatable Storage = 90 GB × 2 = 180 GB
  • Remaining Allocatable Storage = 180 GB - 150 GB = 30 GB
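
The same host storage example can be reproduced with a short Python sketch. The function name `allocation_summary` and the dictionary keys are hypothetical names introduced for illustration; only the formulas come from the calculation rules above.

```python
def allocation_summary(physical_total: float, reserved: float,
                       overcommit_ratio: float, allocated: float) -> dict:
    """Apply the Resource Allocation Ratio rules to one resource (values in GB)."""
    total_overcommit_capacity = physical_total - reserved
    total_allocatable = total_overcommit_capacity * overcommit_ratio
    return {
        "allocation_ratio": allocated / total_overcommit_capacity,
        "total_overcommit_capacity": total_overcommit_capacity,
        "total_allocatable": total_allocatable,
        "free_to_allocate": total_allocatable - allocated,
    }


# Host storage example: 100 GB physical, 10 GB reserved, 2:1 ratio, 150 GB allocated.
summary = allocation_summary(100, 10, 2, 150)
print(round(summary["allocation_ratio"], 2))  # 1.67
print(summary["total_overcommit_capacity"])   # 90
print(summary["total_allocatable"])           # 180
print(summary["free_to_allocate"])            # 30

# The allocation ratio never exceeds the overcommitment ratio while free capacity remains.
assert summary["allocation_ratio"] <= 2
```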

View Monitoring Charts

ZStack ZSphere supports visualizing load monitoring data for various resources in the form of line charts. This not only helps you quickly understand the inventory of computing, storage, and network resources for resource objects but also provides an intuitive understanding of resource health conditions.

Procedure

  1. In the navigation pane, choose Inventory.
  2. Select a valid resource object, such as a virtual machine, host, cluster, image storage, data storage, or distributed port group.
  3. In the right-side pane, click Monitoring.
  4. (Optional) Select the monitoring items you want to display.
  5. (Optional) Choose or customize the time range.
  6. (Optional) Select one or multiple monitoring objects.

Customize Monitoring Charts

You can customize monitoring charts to view more monitoring data.
  • Details: Hover the mouse over the chart to display detailed information about the relevant data points.
  • Custom Time Span: By default, it displays monitoring data for the past 15 minutes. Valid values include 15 minutes, 1 hour, 6 hours, 1 day, 1 week, 1 month, 1 year, and custom.
  • Custom Monitoring Items: Flexibly select the monitoring metrics you want to focus on based on your business needs.
  • Custom Monitoring Objects: Display data for all or specified monitoring objects.
  • Custom Chart Position: Freely drag and rearrange the position of monitoring charts.

Appendix of Monitoring Items

Object Metric Item and Description
Cluster
  • CPU: CPU Utilization Sum
  • Memory: Memory Usage Percentage
  • Disk: Disk IOPS Sum
  • NIC: NIC Data Transfer Rate Sum
Host
CPU
  • CPU Utilization: The proportion of time the CPU is in a non-idle state.
  • CPU Idle Rate: The proportion of time the CPU is in an idle state.
  • CPU Occupancy Rate (System Process): The proportion of time the CPU spends in kernel space, performing typical operations such as memory allocation, I/O operations, and creating child processes.
  • CPU Occupancy Rate (User Process): The proportion of time the CPU spends in user space, running typical user-space programs such as shells, databases, and web servers.
  • CPU Occupancy Rate Average (Waiting): The proportion of time the CPU spends waiting for the hard disk drive to load data into memory after initiating a read or write operation.
Memory
  • Memory Usage: The amount of used and free resource memory.
Disk
  • Disk Speed: The read and write speed of the resource disk.
  • Disk IOPS: The read and write IOPS of the resource disk.
  • Disk Latency: The latency of the resource disk.
  • Total Disk Usage Ratio: The percentage of used capacity across all host disks.
  • Total Disk Usage: The amount of used capacity across all host disks.
  • Disk Usage Ratio of Platform System Files: The percentage of disk capacity occupied by the platform system files.
  • Disk Usage of Platform System Files: The amount of disk capacity occupied by the platform system files.
NIC
  • NIC Data Transfer Rate: The current send and receive rate of the resource's NIC.
  • NIC Packet Rate: The current send and receive packet rate of the resource's NIC.
  • NIC Packet Discard Rate: The current packet drop rate for outgoing and incoming packets on the resource's NIC.
Virtual Machine
CPU
  • CPU Utilization: The proportion of time the CPU is in a non-idle state.
  • CPU Idle Rate: The proportion of time the CPU is in an idle state.
  • CPU Occupancy Rate (System Process): The proportion of time the CPU spends in kernel space, performing typical operations such as memory allocation, I/O operations, and creating child processes.
  • CPU Occupancy Rate (User Process): The proportion of time the CPU spends in user space, running typical user-space programs such as shells, databases, and web servers.
  • CPU Occupancy Rate Average (Waiting): The proportion of time the CPU spends waiting for the hard disk drive to load data into memory after initiating a read or write operation.
Memory
  • Memory Usage: The amount of used and free resource memory.
  • Available Memory Capacity: The available amount of resource memory that can be used.
  • Free Memory Capacity: The amount of free resource memory.
  • Total Memory Capacity: The total amount of resource memory.
  • Memory Idle Rate: The percentage of resource memory currently in an idle state.
  • Memory Utilization: The percentage of resource memory that is currently in use.
Disk
  • Disk Speed: The read and write speed of the resource disk.
  • Disk IOPS: The read and write IOPS of the resource disk.
  • Disk Utilization: The percentage of used capacity on the resource disk.
  • Disk Idle Rate: The percentage of idle capacity on the resource disk.
  • Disk Usage Capacity: The amount of used capacity on the resource disk.
  • Disk Idle Capacity: The amount of free capacity on the resource disk.
NIC
  • NIC Data Transfer Rate: The current send and receive rate of the resource's NIC.
  • NIC Packet Rate: The current send and receive packet rate of the resource's NIC.
  • NIC Packet Discard Rate: The current packet drop rate for outgoing and incoming packets on the resource's NIC.
Data Storage
Capacity
  • Capacity Percent Used: The percentage of capacity currently used by the resource.
Image Storage - Standalone Image Storage/Distributed Image Storage
Capacity
  • Capacity Percent Used: The percentage of capacity currently used by the resource.

Image Storage - Standalone Image Storage
CPU
  • CPU Utilization: The proportion of time the CPU is in a non-idle state.
  • CPU Idle Rate: The proportion of time the CPU is in an idle state.
  • CPU Occupancy Rate (System Process): The proportion of time the CPU spends in kernel space, performing typical operations such as memory allocation, I/O operations, and creating child processes.
  • CPU Occupancy Rate (User Process): The proportion of time the CPU spends in user space, running typical user-space programs such as shells, databases, and web servers.
  • CPU Occupancy Rate Average (Waiting): The proportion of time the CPU spends waiting for the hard disk drive to load data into memory after initiating a read or write operation.
Disk
  • Disk Speed: The read and write speed of the resource disk.
  • Disk IOPS: The read and write IOPS of the resource disk.
Memory
  • Memory Usage: The amount of used and free resource memory.
NIC
  • NIC Data Transfer Rate: The current send and receive rate of the resource's NIC.
  • NIC Packet Rate: The current send and receive packet rate of the resource's NIC.
  • NIC Packet Discard Rate: The current packet drop rate for outgoing and incoming packets on the resource's NIC.
Distributed Port Group
IP
  • Used IP Percentage (IPv4): The percentage of IPv4 addresses currently used by the resource.
  • Available IP Percentage (IPv4): The percentage of remaining available IPv4 addresses on the resource.
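
The CPU items in the table above are all proportions of CPU time spent in a given state. As a rough illustration of how such proportions relate to one another, the following Python sketch derives them from two snapshots of cumulative per-state tick counters, similar in spirit to the counters Linux exposes in /proc/stat. The `CpuTicks` type and `cpu_rates` function are hypothetical, and treating waiting time as non-idle is an assumption; the source does not describe how ZSphere computes these values internally.

```python
from dataclasses import dataclass


@dataclass
class CpuTicks:
    """Cumulative CPU time per state, in arbitrary tick units (hypothetical type)."""
    user: int
    system: int
    iowait: int
    idle: int


def cpu_rates(prev: CpuTicks, curr: CpuTicks) -> dict:
    """Return each state's share (in percent) of the time elapsed between two snapshots."""
    d_user = curr.user - prev.user
    d_system = curr.system - prev.system
    d_iowait = curr.iowait - prev.iowait
    d_idle = curr.idle - prev.idle
    total = d_user + d_system + d_iowait + d_idle
    if total <= 0:
        raise ValueError("snapshots must cover a positive time interval")
    return {
        # Assumption: everything that is not idle counts as utilization.
        "utilization": 100 * (total - d_idle) / total,
        "idle": 100 * d_idle / total,
        "system": 100 * d_system / total,
        "user": 100 * d_user / total,
        "waiting": 100 * d_iowait / total,
    }


rates = cpu_rates(CpuTicks(100, 50, 10, 840), CpuTicks(160, 70, 20, 950))
print(rates["utilization"], rates["idle"])  # 45.0 55.0
```

Under this model, CPU Utilization and CPU Idle Rate always sum to 100%.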

Dashboard Monitoring

The ZStack ZSphere dashboard displays platform resource status statistics, platform load trends, platform usage statistics, resource top rankings, user information, and unread alarm statistics for the past seven days in a card format.

  • Each time you enter or refresh the dashboard, the latest data is fetched and displayed in real-time. Additionally, chart-based modules automatically refresh data every 30 seconds by default.
  • The dashboard by default shows the resource data for the current data center. You can click the switch button in the top left corner of the page to specify which data center’s resource data to display.
  • Status charts use a standardized color scheme: green indicates normal status, red indicates an abnormal status, and gray indicates other statuses.
  • Percentage progress bars are color-coded as blue (less than 60%), yellow (greater than or equal to 60% but less than 80%), and red (greater than or equal to 80%) to visually represent the current resource usage state.
  • For resource status cards and some load trend and usage statistics cards, you can click on the resource name or statistical numbers to navigate to the corresponding resource page.
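
The progress bar color thresholds described above amount to a simple range check. A minimal Python sketch, using the hypothetical function name `progress_bar_color`:

```python
def progress_bar_color(percent: float) -> str:
    """Map a usage percentage to the dashboard progress bar color."""
    if percent < 60:
        return "blue"    # less than 60%
    if percent < 80:
        return "yellow"  # at least 60% but less than 80%
    return "red"         # 80% or more


for value in (35, 60, 79.9, 80):
    print(value, progress_bar_color(value))
# 35 blue
# 60 yellow
# 79.9 yellow
# 80 red
```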

Dual Management Node Monitoring

If your environment consists of two management nodes, navigate to the Reliability > MN Monitoring page to view the management node monitoring data.

Before you check the management node monitoring data, you should be aware of the following information:

  • This page uses three colors: green, red, and gray. Green indicates normal status, while other colors indicate abnormal status.
  • The dual management node setup follows an active-standby model, with only one active management node. The node displaying the VIP is the active management node, and the one without the VIP is the standby management node.
  • If the standby management node is in an abnormal state, failover from the active management node will fail and the management nodes will go down. Therefore, resolve any management node issues promptly.

The management node monitoring displays the management node IPs, node status, VIP, and management service status for multiple management nodes. The main services monitored include the following:

  • Arbiter Gateway Reachable:

    Monitors whether the arbitration IP of the active and standby management nodes is reachable. If it is unreachable, the high availability feature of the management nodes may fail.

  • Peer MN Reachable:

    Monitors whether the standby management node is reachable. If the standby management node is unreachable, communication with the standby node will not be possible.

  • VIP Reachable:

    Monitors whether the VIP is reachable. If the VIP is unreachable, the UI of the active management node cannot be accessed via the VIP.

  • Database Status:

    Monitors the status of the database. If the database is abnormal, there is a risk of data loss. Resolve the fault promptly.

Host Hardware Monitoring

ZStack ZSphere supports monitoring the status of host hardware components, including CPU, memory, sensors, and PCIe devices.

The hardware components that can be monitored on the host include:

  • CPU
  • Memory
  • Physical Disks
  • Physical Network Cards
  • GPU Devices
  • Block Devices
  • USB Devices
  • Sensors (Voltage, Current, Fans, Temperature)
  • Power Supply
  • PCIe Devices