Monitoring a Cluster
There are several ways to monitor your ClickHouse® clusters:
Grafana dashboards
Altinity.Cloud uses Grafana as its default monitoring tool. You can access Grafana from the Monitoring section of a cluster panel:
Figure 1 - The Monitoring section of the cluster panel
Clicking the View in Grafana link displays the following menu:
Figure 2 - The Grafana monitoring menu
We’ll go through each of those menu items next.
The Cluster Metrics view
Selecting Cluster Metrics opens this Grafana dashboard in another browser tab:
Figure 3 - The Cluster Metrics dashboard
Cluster metrics include things like the number of bytes and rows inserted into databases in the ClickHouse cluster, merges, queries, connections, and memory / CPU usage.
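The dashboard draws on counters that ClickHouse itself exposes. If you'd like to spot-check a few of them directly from a SQL client, something along these lines works against the standard system tables (a sketch; the dashboard may aggregate or label these values differently):

```sql
-- Cumulative counters for inserts, merges, and queries since server start.
SELECT event, value
FROM system.events
WHERE event IN ('InsertedRows', 'InsertedBytes', 'Query', 'Merge')
ORDER BY event;

-- Current memory use and open connections as tracked by ClickHouse.
SELECT metric, value
FROM system.metrics
WHERE metric IN ('MemoryTracking', 'TCPConnection', 'HTTPConnection');
```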
The System Metrics view
Selecting System Metrics opens this Grafana dashboard in another browser tab:
Figure 4 - The System Metrics dashboard
System metrics include things like CPU load, OS threads and processes, network traffic for each network connection, and activity on storage devices.
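A few of these OS-level figures can also be cross-checked from inside ClickHouse via the system.asynchronous_metrics table; an illustrative query (metric names vary somewhat between ClickHouse versions):

```sql
-- Load average and OS memory figures as reported by ClickHouse itself.
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric LIKE 'LoadAverage%' OR metric LIKE 'OSMemory%'
ORDER BY metric;
```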
The Queries view
Selecting Queries opens this Grafana dashboard in another browser tab:
Figure 5 - The Queries dashboard
The Queries dashboard includes information about your most common queries, slow queries, failed queries, and the queries that used the most memory.
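When you need more detail than the dashboard shows, the same information can be pulled from system.query_log directly. A sketch, assuming query logging is enabled (it is by default):

```sql
-- The ten slowest completed queries over the last hour.
SELECT
    query_duration_ms,
    read_rows,
    formatReadableSize(memory_usage) AS memory,
    substring(query, 1, 80) AS query_preview
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 10;
```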
The Logs view
Selecting Logs opens this Grafana dashboard in another browser tab:
Figure 6 - The Logs dashboard
The Logs dashboard shows all of the log messages as well as the frequency of messages over time. You can add a query to the Logs visualization to filter the view for particular messages.
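If the text_log table is enabled on your cluster (it may not be in every configuration), a similar filter can be applied in SQL; for example:

```sql
-- Recent warnings and errors from the server log.
SELECT event_time, level, logger_name, message
FROM system.text_log
WHERE level IN ('Error', 'Warning')
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY event_time DESC
LIMIT 50;
```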
Cluster alerts
You can define cluster alerts to notify users when certain events occur. You can access alerts from the button on a cluster panel:
Figure 7 - The ALERTS item in the cluster panel
Clicking on the button displays the Cluster Alerts dialog:
Figure 8 - The Cluster Alerts dialog
Enter one or more comma-separated email addresses for the user(s) who should be alerted when particular events occur. For each event, you can send a popup message in the Altinity Cloud Manager UI, an email, or both.
The different types of alerts are:
- System Alerts: Triggered by a significant system event such as a network outage. See the table below for details of all the system alerts.
- ClickHouse Version Upgrade: Triggered by an update to the version of ClickHouse installed in the cluster.
- Cluster Rescale: Triggered when the cluster is rescaled.
- Cluster Stop: Triggered when the cluster stops running. This could be the result of a problem, a user stopping the cluster manually, or a stop triggered by your cluster uptime settings.
- Cluster Resume: Triggered when a previously stopped cluster is restarted.
A popup alert appears at the top of the ACM UI:
Figure 9 - A popup alert for a resumed cluster
System alerts
Here’s the complete list of system alerts:
Alert | Severity | Description |
---|---|---|
ClickHouse Disk Threshold Crossed | Critical | The free space on a particular disk in a particular ClickHouse cluster has fallen below a certain threshold. |
ClickHouse Server Down | Critical | The ClickHouse server is down. |
ClickHouse Distributed Files To Insert Continuously Growing | Critical | The number of pending files to insert into the underlying MergeTree tables through the Distributed table engine has been growing continuously for four hours; keep an eye on this. |
ClickHouse Rejected Insert | Critical | The ClickHouse cluster rejected a number of INSERT statements due to a high number of active data parts for a partition in a MergeTree table. You should decrease the frequency of INSERT statements. For more information, see the documentation for system.part_log, system.merge_tree_settings, and merge_tree_settings. |
ClickHouse Memory Resident Utilization High | Critical | ClickHouse’s resident memory utilization has been 80% or more for longer than 10 minutes. |
Kube Inodes Persistent Volume Usage | High | A PersistentVolumeClaim is using more than 85% of its inode capacity. |
ClickHouse Disk Usage High | High | The free space on a particular disk in a particular cluster is projected to run out within the next 24 hours. |
ClickHouse Too Many Mutations | High | The ClickHouse cluster has too many active mutations. This likely means something is wrong with ALTER TABLE DELETE / UPDATE queries. For more information, run clickhouse-client -q "SELECT * FROM system.mutations WHERE is_done=0 FORMAT Vertical" and look for mutation errors. Also see the documentation for the KILL MUTATION statement. |
ClickHouse Disk Usage | High | The free space on a particular disk in a particular ClickHouse cluster will run out in the next 24 hours. To avoid switching to a read-only state, you should rescale the storage available in the cluster. The Kubernetes CSI supports resizing Persistent Volumes; you can also add another volume to a Pod and then restart that Pod. For more information, see the following documentation: - Resizing and rescaling storage in the Altinity Cloud Manager - Resizing Persistent Volumes using Kubernetes - Using Persistent Volumes with the Altinity Kubernetes operator for ClickHouse - Using multiple block devices for data storage - Using TTL in ClickHouse |
ClickHouse Background Pool Fetches High | High | The number of threads used for fetching data parts from other replicas for MergeTree engine tables is high. See the ClickHouse documentation on the background_fetches_pool_size and background_pool_size global server settings for more information. |
ClickHouse Move Pool Utilization High | High | The number of threads used for moving data parts in the background to another disk or volume for MergeTree engine tables is high. See the ClickHouse documentation on the background_move_pool_size and background_pool_size global server settings for more information. |
ClickHouse Common Pool Utilization High | High | The utilization of the common background pool used for MergeTree background operations is high. See the ClickHouse documentation on the background_pool_size global server setting for more information. |
ClickHouse Background Pool Merges And Mutations High | High | The pool of background merges and mutations is heavily utilized relative to the number of merges and mutations that can be executed concurrently. See the ClickHouse documentation on the background_merges_mutations_concurrency_ratio and background_pool_size global server settings for more information. |
ClickHouse Distributed Files To Insert High | Warning | ClickHouse has too many pending files to insert into MergeTree tables via the Distributed table engine. When you insert data into a Distributed table, each data block is first written to the local file system and then sent to the remote servers in the background as soon as possible. The sending period is managed by the distributed_directory_monitor_sleep_time_ms and distributed_directory_monitor_max_sleep_time_ms settings. The Distributed engine sends each file with inserted data separately, but you can enable batch sending of files with the distributed_directory_monitor_batch_insert setting. See the ClickHouse documentation for more information on managing distributed tables. |
ClickHouse Max Part Count For Partition | Warning | The ClickHouse server has too many parts in a partition. The ClickHouse MergeTree table engine splits each INSERT query across partitions (based on the PARTITION BY expression) and adds one or more parts per INSERT inside each partition. A background merge process then combines the parts; when there are too many unmerged parts inside a partition, SELECT query performance can degrade significantly, so ClickHouse tries to delay or reject the INSERT (see the example queries after this table for one way to check part counts per partition). |
ClickHouse Too Many Running Queries | Warning | The ClickHouse server has too many running queries. Please analyze your workload: each concurrent SELECT query uses memory for JOINs, uses CPU for aggregation functions, and can read a lot of data from disk (consuming disk I/O) when scanning parts in partitions. Each concurrent INSERT query allocates around 1MB per column in the inserted table and also uses disk I/O. For more information, see the following ClickHouse documentation: - Restrictions on query complexity - Quotas - The max_concurrent_queries global server setting - The system.query_log table |
ClickHouse Replicas Max Absolute Delay | Warning | The ClickHouse server has replication lag. When a replica lags too far behind, it can be skipped from distributed SELECT queries without an error, leading to inaccurate query results. Check the system.replicas table, the system.replication_queue table, free disk space, and the network connection between the ClickHouse pod and ZooKeeper on the monitored clickhouse-server pods. Also see the ClickHouse documentation on system.replicas and system.replication_queue. |
ClickHouse Delayed Insert Throttling | Info | The ClickHouse server has throttled INSERT statements due to a high number of active data parts for a MergeTree partition. Please decrease the INSERT frequency. See the MergeTree documentation for more information. |
ClickHouse Longest Running Query | Info | The ClickHouse server has queries that are running longer than expected. See the ClickHouse Processes documentation for more information. |
ClickHouse Query Preempted | Info | The ClickHouse server has a number of queries that are stopped and waiting due to the priority setting. See the ClickHouse documentation on system.processes. You can also run clickhouse-client -q "SELECT * FROM system.processes FORMAT Vertical". |
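Several of the alerts above point at specific system tables. If you want to investigate while an alert is firing, queries along these lines are a reasonable starting point (standard ClickHouse system tables; the part-count threshold below is illustrative, not the alert's actual trigger value):

```sql
-- Stuck mutations (ClickHouse Too Many Mutations).
SELECT database, table, mutation_id, command, latest_fail_reason
FROM system.mutations
WHERE is_done = 0;

-- Partitions with a high number of active parts (ClickHouse Max Part Count For Partition).
SELECT database, table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
HAVING active_parts > 100   -- illustrative threshold
ORDER BY active_parts DESC;

-- Replication lag (ClickHouse Replicas Max Absolute Delay).
SELECT database, table, is_readonly, absolute_delay, queue_size
FROM system.replicas
ORDER BY absolute_delay DESC;
```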
System alerts sent via email look like this:
Figure 10 - An email alert
Health checks
You can check the health of a cluster or node from the ACM. For clusters, there are two basic checks: whether the nodes in the cluster are online, and whether the cluster itself passes its health checks. For nodes, the checks cover whether the node is online and whether the node's own health checks pass.
Cluster health checks
Cluster health checks appear near the top of a Cluster view. For example, here is the panel view of a cluster with the two health checks:
Figure 11 - A cluster panel with its two health checks
The health check at the top of the panel indicates that 2 of the 2 nodes in the cluster are online:
Clicking on this green bar takes you to the detailed view of the cluster. From there you can see the individual nodes and their status.
The second health check indicates that 6 of the 6 cluster health checks passed:
Clicking on this green bar shows you the health check dialog:
Figure 12 - The Health Checks dialog
The cluster health checks are based on six SELECT statements executed against the cluster and its infrastructure. The six statements look at the following cluster properties (a few illustrative stand-in queries follow the list):
- Access point availability
- Distributed query availability
- Zookeeper availability
- Zookeeper contents
- Readonly replicas
- Delayed inserts
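The exact statements the ACM runs are visible by clicking each check (as described below); the queries here are only illustrative stand-ins for a few of the checks:

```sql
-- ZooKeeper availability / contents: this query fails if (Zoo)Keeper is unreachable.
SELECT count() FROM system.zookeeper WHERE path = '/';

-- Readonly replicas: any replica currently stuck in read-only mode.
SELECT database, table FROM system.replicas WHERE is_readonly;

-- Delayed inserts: the number of INSERTs currently being throttled.
SELECT value FROM system.metrics WHERE metric = 'DelayedInserts';
```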
Clicking any of the checks shows the SQL statement used in the check along with its results:
Figure 13 - Details of a particular cluster health check
Depending on the cluster’s status, you may see other indicators:
Health check | Meaning |
---|---|
*(rescaling indicator)* | The cluster or node is rescaling |
*(terminating indicator)* | The cluster or node is being terminated |
*(stopped indicator)* | The cluster or node is stopped |
Node health checks
The basic “Node is online” check appears next to the node name in the Nodes view of the cluster:
Figure 14 - The Nodes view of a cluster
Opening the Node view shows more details:
Figure 15 - The health checks for a single node in the cluster
The first health check indicates that the node is online:
The second health check indicates that 5 of the 5 node health checks passed:
Clicking on this green bar takes you to a more detailed view of the health checks and their results, similar to Figure 12 above.
Cluster logs
You can look at a variety of logs by clicking the button on a cluster panel:
Figure 16 - The LOGS item in the cluster panel
You’ll see this panel:
Figure 17 - The Logs panel
Notice at the top of the panel that there are several different logs available:
- ClickHouse Logs: messages issued by ClickHouse itself
- Backup Logs: messages related to system backups
- Operator Logs: messages issued by the Altinity Kubernetes operator for ClickHouse
- Audit Logs: messages related to significant system events initiated by a user
The upper right corner of the Logs panel includes the Download Logs button and the Refresh button.
Notifications
You can see your notifications by clicking on your username in the upper right corner of Altinity Cloud Manager:
The Notifications menu item lets you view any notifications you have received:
Figure 18 - The Notification History dialog
Here the history shows a single message. The text of the message, its severity (Info, News, Warning, or Danger), and the time the message was received and acknowledged are displayed. The meanings of the message severities are:
- Info: Updates for general information
- News: Notifications of general news and updates in Altinity.Cloud
- Warning: Notifications of possible issues that are less than critical
- Danger: Critical notifications that can affect your clusters or account