One-Click Inspection

What is One-Click Inspection?

One-Click Inspection comprehensively inspects the health status of key resources and services of the Cloud and scores their healthiness based on the inspection results. It also provides O&M suggestions and inspection reports. One-Click Inspection is applicable to centralized O&M scenarios.

Figure 1. One-Click Inspection


Concepts

  • Inspection Categories and Items:
    One-Click Inspection provides five inspection categories: platform, compute resources, network resources, storage resources, and global settings. You can use the service to inspect key resources and services of the Cloud, such as the management node, hosts, VM instances, image storage, primary storage, physical and virtual NICs and networks, and licenses.
    • Platform: Check the basic services and running status of the Cloud.
    • Compute: Check the usage and running status of physical and virtual compute resources of the Cloud.
    • Network: Check the configurations and status of physical and virtual networks of the Cloud.
    • Storage: Check the usage and running status of physical storage resources of the Cloud.
    • Global Setting: Check the configurations of key resources of the Cloud.

    After you select items from certain inspection categories and launch inspection, related resources or services are inspected and their healthiness is scored. For more information about inspection items, see Inspection Items.

  • Inspection Results:
    One-Click Inspection provides four inspection results: Normal, Warning, Fault, and Failed.
    • Normal: The inspected resources or services are in normal status. This result is marked with a green icon.
    • Warning: The health status of inspected resources or services is compromised, which may to some extent affect their performance and stability. This result is marked with a yellow icon.
    • Fault: The inspected resources or services are in critical condition and may seriously affect business operations. This result is marked with a red icon.
    • Failed: The inspection on related resources or services fails, which may seriously affect business operations. This result is marked with a grey icon.
  • Healthiness Scoring:

    One-Click Inspection provides an in-built healthiness scoring mechanism for Cloud resources and services. It allows you to grasp the overall running status of the Cloud in a visualized way.

    Scoring on inspected resources/services: Scores resources and services based on the inspection results of related resource and service attributes.
    • If all attributes of a resource or service under inspection are in Normal status, the inspection result of the resource or service is Normal. The score is 100 points.
    • If one attribute of a resource or service under inspection is in Warning state and the other attributes are in Normal status, the inspection result of the resource or service is Warning. The score is 50 points.
    • If one attribute of a resource or service under inspection is in Fault or Failed state, the inspection result of the resource or service is Fault or Failed. The score is 0 points.
    Scoring on inspection items: Scores inspection items based on the inspection results of related resources and services.
    • If an inspection item does not belong to the Global Setting category, the inspection item is scored based on the following mechanism:
      • Score of Inspection Item = (Score of Resource 1 + Score of Resource 2 + ... + Score of Resource N) / (N × 100) × 100
      • For example, if an inspection item involves 3 resources, which are in Normal, Warning, and Fault/Failed status respectively, the scores of the three resources are 100, 50, and 0 points respectively. Then the score of the inspection item is (100 + 50 + 0) / (3 × 100) × 100 = 50 points.
    • If an inspection item belongs to the Global Setting category, the inspection item is scored based on the following mechanism:
      • The score of the inspection item is the score of the involved global setting.
      • For example, if the inspection result of the involved global setting is Warning, the score of the global setting is 50 points. Then the score of the inspection item is 50 points.
    Scoring on the Cloud: Scores the Cloud based on the scores of all inspection items.
    • Score of the Cloud = (Score of Inspection Item 1 + Score of Inspection Item 2 + ... + Score of Inspection Item N) / (N × 100) × 100
    • For example, if you select 3 inspection items, which are scored 100 points, 50 points, and 0 points respectively, then the score of the Cloud is (100 + 50 + 0) / (3 × 100) × 100 = 50 points.
  • O&M Suggestions:

    If resources and services are detected in Warning or Fault status, One-Click Inspection analyzes the hidden dangers and their effects on these resources and services, and provides suggestions on O&M. For more information, see Inspection Items.

  • Inspection Reports:

    One-Click Inspection allows you to export PDF-formatted inspection reports. An inspection report summarizes platform configurations, resource status, and inspection results. It also provides details of all abnormal inspection items and corresponding O&M suggestions.
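
The three-layer healthiness scoring described above can be sketched in a few lines of Python. This is only an illustration of the documented formulas, not the product's implementation: the function names are hypothetical, and the sketch assumes that the worst attribute result determines the resource score when several attributes are abnormal.

```python
from typing import Iterable


def resource_score(attribute_results: Iterable[str]) -> int:
    """Score one resource or service from the results of its attributes.

    Assumption: the worst attribute result decides the score.
    """
    results = list(attribute_results)
    if any(r in ("Fault", "Failed") for r in results):
        return 0
    if any(r == "Warning" for r in results):
        return 50
    return 100


def average_score(scores: Iterable[float]) -> float:
    """(S1 + S2 + ... + SN) / (N * 100) * 100."""
    values = list(scores)
    if not values:
        raise ValueError("at least one score is required")
    return sum(values) / (len(values) * 100) * 100


def item_score(resource_scores: Iterable[float], global_setting: bool = False) -> float:
    """Score one inspection item from the scores of its resources."""
    values = list(resource_scores)
    if global_setting:
        # A Global Setting item takes the score of its single global setting.
        (score,) = values
        return float(score)
    return average_score(values)


def cloud_score(item_scores: Iterable[float]) -> float:
    """Score the Cloud from the scores of all selected inspection items."""
    return average_score(item_scores)


# Worked examples from the text: both evaluate to 50 points.
print(item_score([100, 50, 0]))
print(cloud_score([100, 50, 0]))
```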

Benefits

One-Click Inspection has the following benefits:
  • Comprehensive, customized, and efficient inspection capabilities: Provides five inspection categories that cover all key resources and services of the Cloud and allows you to select inspection items based on your business scenarios. After you launch an inspection, the inspection can be completed within a few minutes.
  • Multi-layered scoring mechanism: The in-built three-layer mechanism of scoring on resources/services, inspection items, and the Cloud allows you to grasp the overall picture as well as details of the Cloud running status.
  • Intelligent O&M suggestions: Provides risk analysis of resources and corresponding countermeasures, facilitating efficient O&M.

Manage One-Click Inspection

On the main menu of ZStack Cloud, choose Platform O&M > One-Click Inspection. Then, the One-Click Inspection page is displayed.

One-Click Inspection allows you to perform different operations depending on the inspection status. The following table describes the operations.
Action | Description | Inspection Status
Start Inspection | Launches inspection on selected items. | /
Pause Inspection | Pauses inspection on selected items. | Inspecting
Resume Inspection | Resumes inspection on selected items. | Paused
Cancel Inspection | Cancels inspection on selected items. | Inspecting
Reinspect | Reinspects items that were last inspected. | Inspection Completed
Export Inspection Report | Exports a PDF-formatted inspection report. | Inspection Completed
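
The table above can be read as a lookup from inspection status to the actions available in that status. The following Python sketch illustrates that reading. The names `REQUIRED_STATUS` and `available_actions` are hypothetical, and treating the "/" entry as "no inspection status yet" (represented by `None`) is an assumption, not something the table states.

```python
from typing import List, Optional

# Action -> inspection status in which the table lists the action.
# None stands for the "/" entry (assumption: no inspection status yet).
REQUIRED_STATUS = {
    "Start Inspection": None,
    "Pause Inspection": "Inspecting",
    "Resume Inspection": "Paused",
    "Cancel Inspection": "Inspecting",
    "Reinspect": "Inspection Completed",
    "Export Inspection Report": "Inspection Completed",
}


def available_actions(status: Optional[str]) -> List[str]:
    """Return the actions that the table lists for the given status."""
    return [action for action, required in REQUIRED_STATUS.items()
            if required == status]


print(available_actions("Inspecting"))
print(available_actions("Inspection Completed"))
```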

Check Inspection Results

On the main menu of ZStack Cloud, choose Platform O&M > One-Click Inspection. Then, the One-Click Inspection page is displayed. Select inspection items and click Start Inspection. After the inspection is completed, you can check the inspection results.

Figure 1. Inspection Result


On the inspection result page, an overall score is presented for selected inspection items. These inspection items are categorized based on their status so you can quickly locate abnormal inspection items. In addition, the information and status of resources involved in each inspection item are presented in tables. O&M suggestions are also provided.

Overall Inspection Result

On the upper half of the inspection page, the number of total inspected items and that of abnormal items are presented. A score is also provided based on the in-built scoring mechanism to indicate the health status of Cloud resources and services involved in the selected items. In addition, the time consumed for this round of inspection and the time at which the inspection is completed are also recorded so you can properly arrange subsequent inspections. For more information about the healthiness scoring mechanism, see Healthiness Scoring.

Inspection Items Categorized based on Status

On the lower half of the inspection page, all inspected items are categorized into Normal or Abnormal based on the inspection results of involved resources and services. You can locate inspected items and check their status on the Normal or Abnormal tab. The inspected items are categorized based on the following mechanism:
  • If the inspection results of all resources and services involved in an inspection item are Normal, the inspection item is marked with a green icon and categorized as Normal.
  • If the inspection results of all resources and services involved in an inspection item are Warning, or the inspection results of some resources and services are Warning while others are Normal, the inspection item is marked with a yellow icon and categorized as Abnormal.
  • If the inspection results of all resources and services involved in an inspection item are Fault, or the inspection results of some resources and services are Fault while others are Normal or Warning, the inspection item is marked with a red icon and categorized as Abnormal.
  • If the inspection results of all resources and services involved in an inspection item are Failed, or the inspection results of some resources and services are Failed while others are Normal, Warning, or Fault, the inspection item is marked with a grey icon and categorized as Abnormal.
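
The four rules above amount to a precedence order: Failed outranks Fault, Fault outranks Warning, and Warning outranks Normal. A minimal Python sketch of that reading follows. The function name and return format are hypothetical and only illustrate the categorization rules.

```python
from typing import Iterable, Tuple

# Results ordered from least to most severe, following the rules above.
SEVERITY = ["Normal", "Warning", "Fault", "Failed"]
ICON = {"Normal": "green", "Warning": "yellow", "Fault": "red", "Failed": "grey"}


def categorize_item(resource_results: Iterable[str]) -> Tuple[str, str]:
    """Return (icon colour, tab) for an inspection item."""
    results = list(resource_results)
    if not results:
        raise ValueError("an inspection item needs at least one result")
    # The most severe result among the involved resources decides the icon.
    worst = max(results, key=SEVERITY.index)
    tab = "Normal" if worst == "Normal" else "Abnormal"
    return ICON[worst], tab


print(categorize_item(["Normal", "Warning"]))
print(categorize_item(["Warning", "Fault", "Normal"]))
```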

Inspected Resource Details and O&M Suggestions

Click an inspected item on the lower left of the page to view the basic information of related resources and their inspection results, presented in a table on the right. O&M suggestions are also provided for resources and services that are in Warning or Fault status.
  • Table Information:
    • Basic information: One-Click Inspection presents different information for different inspected resources and services. For example, for the item Image Storage Status, the name, type, and status of the image storage on the Cloud are presented.
    • Inspection results: Each resource or service is given one of four inspection results: Normal, Warning, Fault, or Failed.
      • Normal: The inspected resources or services are in normal status. This result is marked with a green icon.
      • Warning: The health status of inspected resources or services is compromised, which may to some extent affect their performance and stability. This result is marked with a yellow icon.
      • Fault: The inspected resources or services are in critical condition and may seriously affect business operations. This result is marked with a red icon.
      • Failed: The inspection on related resources or services fails, which may seriously affect business operations. This result is marked with a grey icon.
  • O&M Suggestions: If resources and services are detected in Warning or Fault status, One-Click Inspection analyzes the hidden dangers and their effects on these resources and services, and provides suggestions on O&M. For more information, see Inspection Items.

Inspection Items

Category Inspection Item Description O&M Suggestions
Platform License Expiration Check whether the base license and module licenses of the Cloud are soon to expire or have expired. If detected that the base license of the Cloud is soon to expire or has expired, contact related personnel and update the license to ensure that the Cloud works as expected.
Host Time Consistency with Time Server Check whether time sync with the time server is configured for a host and whether the time sync configuration of a host in a cluster is consistent with other hosts in the cluster. If detected that the time server configured for some hosts in a cluster is not consistent with the time server configured for other hosts in the cluster, SSH to related host systems and check the time server configurations.
Storage Space Occupied by Monitoring Data Check the proportion of the storage space taken up by Cloud monitoring data to the MN volume where the data is stored. If detected that the Cloud monitoring data occupies over 50% of the storage space of the MN system volume, go to Global Setting and modify the reserved monitoring data size or the retention period of monitoring data.
System Volume Usage of Management Node Check the storage usage and utilization of the system volume of the management node on the Cloud. If detected that the usage of the system volume of the Cloud management node exceeds 70%, SSH to the MN system and clean up data insensitive to the business.
MN Database Backup Job Check whether remote backup is configured for the MN database and whether the configuration takes effect.

If detected that the remote backup mechanism is not configured for the management node (MN) database, SSH to the MN system and check whether a crontab scheduled job is configured.

If detected that the remote backup mechanism configured for the management node (MN) database does not take effect, SSH to the MN system and check whether password-free login to a specified backup server is allowed.

Management Node HA Status Check whether high availability (HA) service is configured for the Cloud management node and whether the HA service works as expected.

If detected that the HA service is not configured for the Cloud management node, to ensure high availability of the Cloud, we recommend that you configure this service as soon as possible.

If detected that the HA service of the Cloud management node is in abnormal state, check the system status of the management node as soon as possible.

Backup Server Storage Usage Check the storage usage of local and remote backup servers on the Cloud.

If detected that the storage usage of backup servers is no lower than 70% and no higher than 90%, delete expired backups or expand the storage capacity of the backup servers.

If detected that the storage usage of backup servers is no lower than 90%, this may make backup jobs unable to be executed. Delete expired backups or expand the storage capacity of the backup servers.

Compute Host CPU Check the status and temperature of every host CPU on the Cloud.
If detected that the CPU temperature of some hosts has remained at or above 80℃ for 5 minutes, note that sustained high temperature may cause hosts to run unstably, power off or restart automatically, and thus interrupt application workloads running on VM instances. Troubleshooting:
  • Check whether the temperature of the machine room has exceeded the upper threshold required for the stable running of hosts.
  • Check on the out-of-band management interface whether warnings are triggered because of low fan speed or a fault of the CPU or motherboard.

If detected that the CPU of some hosts is offline, this may cause hosts to run unstably, auto power off or restart, and thus interrupt application workloads running on VM instances.

Host Memory Check the memory utilization, SWAP utilization, and ECC warnings of hosts on the Cloud. If detected that the memory of hosts is in warning state, this may cause host OOM errors, deteriorate host performance, and interrupt application workloads running on VM instances. Troubleshooting:
  • Memory Utilization: If the host memory utilization is no less than 90%, check the VM workloads running on related hosts. If the workloads are too heavy, migrate some VM instances to other hosts. If the workloads are at a normal level, go to the host system and check whether a memory leak occurs due to abnormal processes.
  • SWAP Utilization: If host SWAP utilization is no less than 10%, check the status of VM instances running on related hosts, and migrate some VM instances and increase host memory if necessary.
  • ECC Warning: If an ECC warning occurs, check the status of VM instances running on related hosts and migrate some VM instances if necessary. In addition, check whether memory faults have occurred. If so, replace the host memory as soon as possible.
Average CPU Utilization of Host Check the average CPU utilization of hosts on the Cloud.

If detected that the average CPU utilization of some hosts exceeds 70%, log into the host system and check whether there are abnormal processes running on the hosts. If it is not the case, we recommend that you add more hosts to the cluster where the preceding hosts reside.

If detected that the average CPU utilization of some hosts exceeds 90%, log into the host system and check whether there are abnormal processes running on the hosts. If it is not the case, we recommend that you add more hosts to the cluster where the preceding hosts reside as soon as possible.

System Volume Usage of Host Check the storage usage and utilization of the system volume of the hosts on the Cloud. If detected that the storage usage of the system volumes of some hosts exceeds 70% or even 90%, log into the host system and clean up data insensitive to your business.
VM Instances on Host Check the number of running VM instances on the hosts of the Cloud. If detected that the number of running VM instances on a host exceeds 20, check the resource usage of hosts and hot migrate VM instances as needed so that host resource usage is balanced.
Host Status Check whether hosts on the Cloud are disconnected. If detected that hosts on the Cloud are disconnected, check the system status of related hosts as soon as possible.
Host System Password Strength Check whether the root password strength of hosts on the Cloud suits the actual business scenarios. If detected that the root password of hosts is weak, we recommend that you reset a password that is at least 8 characters in length and contains digits, letters and special characters.
Host SWAP Check whether SWAP is disabled for hosts on the Cloud.

If detected that SWAP is not disabled for hosts on the Cloud, this may affect application workloads running on VM instances. We recommend that you log into the system of related hosts and disable SWAP for these hosts.

If detected that distributed storage is added to the Cloud and SWAP is not disabled for hosts, this may seriously affect application workloads running on VM instances. Log into the system of related hosts and disable SWAP for these hosts as soon as possible.

Host Zombie Process Check the number of zombie processes running on the hosts. If detected that zombie processes are running on hosts, it may be because some VM instances or other system processes running on the hosts do not exit as expected. Zombie processes may cause VM launch failures or host disconnection. Check the services provided by the zombie processes. You can implement VM migrations and restart the hosts as needed.
Running State of HA VM Instance Check the running status of HA-enabled VM instances on the Cloud. If detected that VM instances for which HA is enabled are not running, check whether these VM instances work as expected.
Average CPU Utilization of VM Instance Check the average CPU utilization of running VM instances on the Cloud. If detected that the average CPU utilization of some VM instances exceeds 80% or even 95%, log into the system of the VM instances and check whether there are abnormal application workloads. Optimize the workloads or upgrade the instance offering as needed.
Storage Usage of VM System Volume Check the storage usage of VM system volumes (non-thick provisioned system volumes) on the Cloud. If detected that the storage usage of VM system volumes (non-thick provisioned system volumes) exceeds 70% or even 90%, log into the VM system and clean up data insensitive to your business. You can also expand the system volume capacity as needed.
Error Policy-Enabled VM Instance State Check whether errors occur on the VM instances for which the error policy is enabled. If detected that VM instances are in fault state, check whether the system of the VM instances works as expected.
Long Stopped VM Instance Check whether there are VM instances stopped for over 30 days. If detected that VM instances have been stopped for 30 days or even longer, check whether these VM instances run application workloads. If they do not, delete these VM instances to release resources.
Network Host NIC Check the NIC status, connection mode, packet loss rate, speed, and duplex mode of host NICs. If detected that the NICs of hosts are in warning state, this may cause host disconnection and affect data transmissions of business networks. Troubleshooting:
  • Packet Loss Rate: If the NIC packet loss rate is no lower than 0.1%, it may be because there are network fluctuations or network hardware failures. Check whether related host NICs or switches work as expected.
  • NIC Connection Mode: If the negotiated port speed of a host NIC is lower than the rated port speed, it may be because of network hardware failures or lower-than-expected speed of uplink ports on the switch. Check the health status of related network hardware.
  • Full Duplex Mode: If a host NIC is not in full duplex mode, it may be because of configuration errors of the uplink switch port or network hardware failures such as the NIC or network cable. Check whether these network hardware devices are in good health status and whether the configurations of related uplink switch ports are correct. You can also manually set the mode to full duplex as needed.
  • Port Speed: If the port speed is lower than 1 Gbps, network performance may be insufficient. We recommend that you use NICs with a port speed higher than 1 Gbps.
Bonded NIC Port Status of Host Check whether the port state of the host NICs for which bonding is configured is UP. If detected that the port state of the host NICs for which bonding is configured is DOWN, check whether related host NICs are faulty.
Business Network Redundancy Check whether network bonding is configured for the physical NIC ports of business networks on the Cloud. If detected that network bonding is not configured for the physical NIC ports of business networks, network redundancy is insufficient. Check whether network bonding is necessary.
Host Management Network Connectivity Check whether the management network IP addresses of hosts on the Cloud are connected to each other. If detected that the management network IP addresses of some hosts on the Cloud cannot be connected to each other, check whether the system of related hosts is in normal health status.
Packet Loss on Management Network of Host Check whether packet loss occurs during data communications with the management network IP address of hosts on the Cloud.

If detected that the management network IP addresses of hosts are inaccessible and the packet loss rate is 100%, check whether the host system is in normal state.

Packet Loss on Storage Network of Host Check whether packet loss occurs during data communications with the storage network IP address of hosts on the Cloud.

If detected that packet loss occurs during data communications with the storage network IP address of hosts, check whether the physical link layer of the host works as expected and whether the physical NICs are faulted.

If detected that the storage network IP addresses of hosts are inaccessible and the packet loss rate is 100%, check whether the host system is in normal state.

Storage Host HDD Check the health status, IO utilization, and bad sectors of HDDs of hosts on the Cloud.
If detected that the HDDs of hosts are in warning state, this may cause data reads and writes of VM instances on the hosts to get stuck and affect application workloads running on the VM instances. Troubleshooting:
  • Health Status: If the health status of host HDDs is abnormal, check whether there are bad disk sectors, contact failures, or other faults, and change faulted disks if necessary.
    Note: The Cloud does not check the health states of HDDs whose model cannot be recognized and displays the health states of these HDDs as Unknown. You can confirm the HDD health state on the corresponding hardware platform.
  • IO Utilization: If the HDD IO utilization has remained at or above 90% for 5 minutes, check whether the disk IO latency of related hosts is excessively high, whether disk IO performance has deteriorated, and whether there are other exceptions. If disks are faulty, replace the disks as needed.
  • Bad Disk Sector: If there are bad disk sectors, check the disk IO of related hosts and the coverage of the faults and change the hardware as soon as possible.
Host SSD Check the health status, IO utilization, remaining life expectancy, and temperature of SSDs of hosts on the Cloud.
If detected that the SSDs of hosts are in warning state, troubleshooting:
  • Health Status: If the health status of host SSDs is abnormal, check whether there are disk faults and change the faulted SSDs as soon as possible. Abnormal health status of SSDs may cause data reads and writes of VM instances to get stuck or even system hangs.
    Note: The Cloud does not check the health states of SSDs whose model cannot be recognized and displays the health states of these SSDs as Unknown. You can confirm the SSD health state on the corresponding hardware platform.
  • IO Utilization: If the SSD IO utilization has remained at or above 90% for 5 minutes, check whether the disk IO latency is excessively high, whether disk IO performance has deteriorated, and whether there are other exceptions. Sustained high IO utilization may cause application workloads on VM instances to get stuck.
  • Temperature: If the SSD temperature of hosts is equal to or higher than 60℃ and lower than 70℃, check whether there are sustained and intensive data writes on the SSDs. High temperature may cause SSDs to run unstably and affect data reads and writes of VM instances.
  • Remaining Life Expectancy: If the remaining life expectancy of an SSD is no lower than 10% yet no higher than 30%, replace the SSD with an SSD of the same specification. Otherwise, when the remaining life expectancy of the SSD decreases to 0, data reads and writes are terminated.
If SSDs of hosts are in fault state, troubleshooting:
  • Temperature: If the SSD temperature of hosts is no lower than 70℃, check whether the temperature of the machine room is too high and whether there are sustained and intensive data writes on the SSDs. High temperature may cause SSDs to run unstably and affect data reads and writes of VM instances.
  • Remaining Life Expectancy: If the remaining life expectancy of an SSD is lower than 10%, replace the SSD with an SSD of the same specification. Otherwise, the SSD may become faulty and data reads and writes are terminated.
Host RAID Card Check the RAID card status and cache mode of hosts on the Cloud.

If detected that the RAID card is degraded, the data redundancy may be affected. Check the health state of the RAID cards and process this problem as soon as possible.

If detected that the cache mode of host RAID cards is not write-through, this may make storage services unable to be launched and data on the system volume unable to be resumed in case of power outage. Reset the cache mode of related host RAID cards to write-through.

If detected that exceptions occur on host RAID cards, this may be because of RAID card faults or contact failures and may cause host system hangs and VM I/O to get stuck. Check whether related host RAID cards are in good health status and check whether there are warnings on RAID card failures on the out-of-band management interface. If there are RAID card failures, replace the RAID cards as soon as possible.

Volume Snapshots Check the number of snapshots created on a volume. If detected that the number of snapshots of some volumes exceeds 20, this will lower VM performance, increase data security risks, and occupy storage space of primary storage. Clean up snapshots that are insensitive to your business as needed.
Primary Storage Status Check whether primary storages on the Cloud are disconnected. If detected that primary storages on the Cloud are disconnected, check whether related primary storage is in normal status as soon as possible.
Image Storage Status Check whether image storage on the Cloud is disconnected. If detected that image storage is disconnected, check whether the status of the image storage is abnormal.
Used Physical Storage Space of Image Storage Check the storage usage and utilization of image storage on the Cloud.

If detected that the storage usage of the image storage exceeds 70%, we recommend that you expand the storage capacity.

If detected that the storage usage of the image storage exceeds 85%, clean up images insensitive to your business to free up some storage space and expand the storage capacity if necessary.

Used Physical Storage Space of Primary Storage Check the storage usage and utilization of primary storages on the Cloud.

If detected that the storage usage of the primary storages on the Cloud exceeds 70%, we recommend that you expand the storage capacities.

If detected that the storage utilization of primary storages exceeds 85%, to prevent the storage space from being fully occupied, clean up VM instances and volumes insensitive to your business to free up some storage space, and expand the capacity of primary storages.

Distributed Storage Monitor Node State Check whether the monitor node of distributed storage on the Cloud is well connected. If detected that the monitor node of distributed storage on the Cloud is disconnected, check whether the distributed storage is in normal state.
Distributed Storage Status Check whether the distributed storage on the Cloud is in normal health status. If detected that the health status of the distributed storage on the Cloud is abnormal, log into the server system and check the system status of the distributed storage.
Primary Storage Heartbeat Network Check whether heartbeat networks are configured for primary storages on the Cloud. If detected that heartbeat network is not configured for the primary storages of the Cloud, to ensure that you can monitor the health status of the Cloud primary storages, configure a heartbeat network as soon as possible.
Global Setting HA Policy of VM Instance Check whether the global setting VM HA Policy of the Cloud is set to Force. If detected that the high availability (HA) policy of VM instances is set to Permissive, HA is not supported for VM instances. To ensure high availability of business running on VM instances, go to Global Setting and reset the policy to Force.
Host Reserved Memory Check whether the specified value of the global setting Host Reserved Memory well suits the actual business scenarios. If detected that the reserved host memory is too low, note that the system services of the Cloud occupy some memory. To ensure the smooth running of the system services, go to Global Setting and set the host reserved memory to at least 30 GB.
Memory Overcommitment Check whether the specified value of the global setting Host Memory Overcommitment well suits the actual business scenarios. If detected that memory overcommitment is set to above 1, we recommend that you do not overcommit memory in the production environment. Memory overcommitment may cause host OOM errors. Go to Global Setting and set the memory overcommitment to 1.
Primary Storage Overcommitment Check whether the specified value of the global setting Primary Storage Overcommitment well suits the actual business scenarios. If detected that primary storage overcommitment is set to above 1, we recommend that you do not overcommit primary storage in the production environment. Primary storage overcommitment may cause all storage space to be fully occupied. Go to Global Setting and set the primary storage overcommitment to 1.
Primary Storage Usage Threshold Check whether the specified value of the global setting Primary Storage Usage Threshold well suits the actual business scenarios. If detected that the usage threshold set for primary storages on the Cloud is too high, to prevent excessive usage of primary storage space, go to Global Setting and set Primary Storage Usage Threshold to 0.85.
Reserved Storage of Primary Storage Check whether the specified value of the global setting Primary Storage Reserved Capacity well suits the actual business scenarios. If detected that the reserved storage space of primary storage is too small, go to Global Setting and set the reserved storage space to 200 GB.
Reserved Storage Space of Image Storage Check whether the specified value of the global setting Image Storage Reserved Capacity well suits the actual business scenarios. If detected that the storage space reserved for image storage is too small, go to Global Setting and set the reserved storage space to 200 GB.
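
Several items in the table above map a measured value to a warning or fault state through fixed thresholds. As an illustration, the Host SSD thresholds for temperature and remaining life expectancy can be sketched in Python as follows. The function names are hypothetical, and treating values outside the documented warning and fault ranges as Normal is an assumption.

```python
def ssd_temperature_state(celsius: float) -> str:
    """Classify an SSD temperature using the Host SSD thresholds."""
    if celsius >= 70:
        return "Fault"    # no lower than 70℃
    if celsius >= 60:
        return "Warning"  # at least 60℃ and lower than 70℃
    return "Normal"       # assumption: below the warning range


def ssd_life_state(remaining_percent: float) -> str:
    """Classify the remaining life expectancy of an SSD."""
    if remaining_percent < 10:
        return "Fault"    # lower than 10%
    if remaining_percent <= 30:
        return "Warning"  # from 10% to 30%
    return "Normal"       # assumption: above the warning range


print(ssd_temperature_state(65))
print(ssd_life_state(8))
```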