Monitoring a Cluster

How to monitor and manage your ClickHouse® clusters’ performance.

There are several ways to monitor your ClickHouse® clusters:

Grafana dashboards

Altinity.Cloud uses Grafana as its default monitoring tool. You can access Grafana from the Monitoring section of a cluster panel:

Cluster Monitoring View

Figure 1 - The Monitoring section of the cluster panel

Clicking the View in Grafana link displays the following menu:

Cluster Monitoring menu

Figure 2 - The Grafana monitoring menu

We’ll go through those menu items next. If you’d like to jump to a particular Grafana view, click any of these links:

  • The Cluster Metrics view
  • The System Metrics view
  • The Queries view
  • The Logs view

The Cluster Metrics view

Selecting Cluster Metrics opens this Grafana dashboard in another browser tab:

The Cluster Metrics view

Figure 3 - The Cluster Metrics dashboard

Cluster metrics include things like the number of bytes and rows inserted into databases in the ClickHouse cluster, merges, queries, connections, and memory / CPU usage.
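Many of the values on this dashboard come from ClickHouse’s own introspection tables, so you can also retrieve them with plain SQL. A minimal sketch, assuming you can reach the cluster with clickhouse-client (the exact counters Grafana plots may differ):

```sql
-- Cumulative counters since server start: queries run, rows and
-- bytes inserted, and merges performed.
SELECT event, value
FROM system.events
WHERE event IN ('Query', 'InsertQuery', 'InsertedRows', 'InsertedBytes', 'Merge')
ORDER BY event;
```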

The System Metrics view

Selecting System Metrics opens this Grafana dashboard in another browser tab:

The System Metrics view

Figure 4 - The System Metrics dashboard

System metrics include things like CPU load, OS threads and processes, network traffic for each network connection, and activity on storage devices.
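ClickHouse exposes many of these host-level readings in the system.asynchronous_metrics table, so you can spot-check them directly. A sketch (metric names vary across ClickHouse versions, so treat the filter as illustrative):

```sql
-- Point-in-time host metrics that ClickHouse samples in the background.
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric LIKE 'OSMemory%' OR metric LIKE 'LoadAverage%'
ORDER BY metric;
```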

The Queries view

Selecting Queries opens this Grafana dashboard in another browser tab:

The Queries view

Figure 5 - The Queries dashboard

The Queries dashboard includes information about your most common queries, slow queries, failed queries, and the queries that used the most memory.
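The same information lives in the system.query_log table, so you can dig into slow or memory-hungry queries outside Grafana. A sketch (query_log retention depends on your server configuration):

```sql
-- Ten slowest completed queries from the last 24 hours.
SELECT
    query_duration_ms,
    formatReadableSize(memory_usage) AS memory,
    query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 DAY
ORDER BY query_duration_ms DESC
LIMIT 10;
```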

The Logs view

Selecting Logs opens this Grafana dashboard in another browser tab:

The Logs view

Figure 6 - The Logs dashboard

The Logs dashboard shows all of the log messages as well as the frequency of messages over time. You can add a query to the Logs visualization to filter the view for particular messages.
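If the server writes its log to the system.text_log table (this must be enabled in the server configuration), you can filter messages with SQL as well. A sketch:

```sql
-- The twenty most recent error-or-worse log messages.
SELECT event_time, level, logger_name, message
FROM system.text_log
WHERE level IN ('Fatal', 'Critical', 'Error')
ORDER BY event_time DESC
LIMIT 20;
```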

Cluster alerts

You can define cluster alerts to notify users when certain events occur. You can access alerts from the button on a cluster panel:

Cluster Alerts

Figure 7 - The ALERTS item in the cluster panel

Clicking on the button displays the Cluster Alerts dialog:

Cluster Alerts dialog

Figure 8 - The Cluster Alerts dialog

Enter the comma-separated email addresses of the users who should be alerted when particular events occur. For each event type, you can show the users a popup message in the Altinity Cloud Manager (ACM) UI and/or send an email.

The different types of alerts are:

  • System Alerts: Triggered by a significant system event such as a network outage. See the table below for details of all the system alerts.
  • ClickHouse Version Upgrade: Triggered by an update to the version of ClickHouse installed in the cluster.
  • Cluster Rescale: Triggered when the cluster is rescaled.
  • Cluster Stop: Triggered when some event has caused the cluster to stop running. This could be some event that caused a problem, a user stopping the cluster, or a stop caused by your cluster uptime settings.
  • Cluster Resume: Triggered when a previously stopped cluster is restarted.

A popup alert appears at the top of the ACM UI:

Cluster resumed alert

Figure 9 - A popup alert for a resumed cluster

System alerts

Here’s the complete list of system alerts:

ClickHouse Disk Threshold Crossed (Critical): The free space on a disk in the ClickHouse cluster has fallen below a set threshold.
ClickHouse Server Down (Critical): The ClickHouse server is down.
ClickHouse Distributed Files To Insert Continuously Growing (Critical): The number of distributed files waiting to be inserted into MergeTree tables via the Distributed table engine has been growing continuously for four hours. Monitor this backlog closely.
ClickHouse Rejected Insert (Critical): The ClickHouse cluster rejected INSERT statements because a MergeTree partition had too many active data parts. Decrease the frequency of INSERTs. For more information, see the documentation for system.part_log, system.merge_tree_settings, and merge_tree_settings.
ClickHouse Memory Resident Utilization High (Critical): ClickHouse’s resident memory utilization has been at 80% or more for longer than 10 minutes.
Kube Inodes Persistent Volume Usage (High): A PersistentVolumeClaim is using more than 85% of its inode capacity.
ClickHouse Disk Usage High (High): At the current growth rate, a disk in the cluster will run out of free space within the next 24 hours.
ClickHouse Too Many Mutations (High): The ClickHouse cluster has too many active mutations, which usually means something is wrong with ALTER TABLE DELETE / UPDATE queries. To investigate, run clickhouse-client -q "SELECT * FROM system.mutations WHERE is_done=0 FORMAT Vertical" and look for mutation errors. Also see the documentation for the KILL MUTATION statement.
ClickHouse Disk Usage (High): The free space on a disk in the ClickHouse cluster will run out within the next 24 hours. To avoid the cluster switching to a read-only state, rescale the cluster’s storage. The Kubernetes CSI supports resizing Persistent Volumes; you can also add another volume to a Pod and then restart that Pod.

For more information, see the following documentation:
- Resizing and rescaling storage in the Altinity Cloud Manager
- Resizing Persistent Volumes using Kubernetes
- Using Persistent Volumes with the Altinity Kubernetes operator for ClickHouse
- Using multiple block devices for data storage
- Using TTL in ClickHouse
ClickHouse Background Pool Fetches High (High): The fetches pool utilization is high; too many threads are fetching data parts from other replicas for MergeTree engine tables. See the ClickHouse documentation on the background_fetches_pool_size and background_pool_size global server settings for more information.
ClickHouse Move Pool Utilization High (High): The move pool utilization is high; too many threads are moving data parts in the background to another disk or volume for MergeTree engine tables. See the ClickHouse documentation on the background_move_pool_size and background_pool_size global server settings for more information.
ClickHouse Common Pool Utilization High (High): The common background pool utilization is high; too many threads are performing background operations for MergeTree engine tables. See the ClickHouse documentation on the background_pool_size global server setting for more information.
ClickHouse Background Pool Merges And Mutations High (High): The merges-and-mutations pool utilization is high. The background_merges_mutations_concurrency_ratio setting controls the ratio between the number of threads and the number of background merges and mutations that can run concurrently; see the ClickHouse documentation on it and background_pool_size for more information.
ClickHouse Distributed Files To Insert High (Warning): ClickHouse has too many files waiting to be inserted into MergeTree tables via the Distributed table engine. When you insert data into a Distributed table, each data block is first written to the local file system, then sent to the remote servers in the background as soon as possible.

The sending period is managed by the distributed_directory_monitor_sleep_time_ms and distributed_directory_monitor_max_sleep_time_ms settings. The Distributed engine sends each file of inserted data separately, but you can enable batch sending of files with the distributed_directory_monitor_batch_inserts setting. See the ClickHouse documentation for more information on managing distributed tables.
ClickHouse Max Part Count For Partition (Warning): The ClickHouse server has too many parts in a partition. The ClickHouse MergeTree table engine splits each INSERT query across partitions (defined by the PARTITION BY expression) and adds one or more parts per INSERT inside each partition. A background merge process then combines those parts; when too many unmerged parts accumulate inside a partition, SELECT query performance can degrade significantly, so ClickHouse delays or rejects the INSERT.
ClickHouse Too Many Running Queries (Warning): The ClickHouse server has too many running queries; analyze your workload. Each concurrent SELECT query uses memory for JOINs, uses CPU to run aggregation functions, and can generate heavy disk I/O when scanning parts in partitions. Each concurrent INSERT query allocates around 1 MB per column of the inserted table and also uses disk I/O.

For more information, see the following ClickHouse documentation:

- Restrictions on query complexity
- Quotas
- The max_concurrent_queries global server setting
- The system.query_log table
ClickHouse Replicas Max Absolute Delay (Warning): The ClickHouse server has replication lag. When a replica lags too much, it can be silently skipped by distributed SELECT queries, leading to inaccurate results. Check the system.replicas and system.replication_queue tables, free disk space, and the network connection between the ClickHouse pod and Zookeeper on the monitored clickhouse-server pods. Also see the ClickHouse documentation on system.replicas and system.replication_queue.
ClickHouse Delayed Insert Throttling (Info): The ClickHouse server has throttled INSERTs due to a high number of active data parts for a MergeTree partition. Decrease the INSERT frequency. See the MergeTree documentation for more information.
ClickHouse Longest Running Query (Info): The ClickHouse server has queries that are running longer than expected. See the ClickHouse documentation on system.processes for more information.
ClickHouse Query Preempted (Info): The ClickHouse server has queries that are stopped and waiting because of the priority setting. See the ClickHouse documentation on system.processes, and try the command clickhouse-client -q "SELECT * FROM system.processes FORMAT Vertical".
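Several of the alerts above, particularly the part-count and INSERT-throttling ones, come down to partitions accumulating too many active parts. A quick way to find the offenders (a sketch; thresholds such as parts_to_throw_insert live in merge_tree_settings):

```sql
-- Partitions with the most active parts; these are the ones most
-- likely to trigger delayed or rejected INSERTs.
SELECT database, table, partition_id, count() AS parts
FROM system.parts
WHERE active
GROUP BY database, table, partition_id
ORDER BY parts DESC
LIMIT 10;
```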

System alerts sent via email look like this:

An email alert

Figure 10 - An email alert

Health checks

You can check the health of a cluster or node from the ACM. For clusters, there are two basic checks: whether the nodes in the cluster are online and whether the cluster itself is healthy. For a node, the checks are whether the node is online and whether the node itself is healthy.

Cluster health checks

Cluster health checks appear near the top of a Cluster view. For example, here is the panel view of a cluster with the two health checks:

Cluster Alerts

Figure 11 - A cluster panel with its two health checks

The health check at the top of the panel indicates that 2 of the 2 nodes in the cluster are online:

All cluster nodes online

Clicking on this green bar takes you to the detailed view of the cluster. From there you can see the individual nodes and their status.

The second health check indicates that 6 of the 6 cluster health checks passed:

All cluster checks passed

Clicking on this green bar shows you the health check dialog:

Details of the cluster health checks

Figure 12 - The Health Checks dialog

The cluster health checks are based on six SELECT statements executed against the cluster and its infrastructure. The six statements look at the following cluster properties:

  • Access point availability
  • Distributed query availability
  • Zookeeper availability
  • Zookeeper contents
  • Readonly replicas
  • Delayed inserts

Clicking any of the checks shows the SQL statement used in the check along with its results:

Details of the access point check

Figure 13 - Details of a particular cluster health check
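For example, a check in the spirit of the “Readonly replicas” item can be run by hand: any replica stuck in read-only mode points to a replication or Zookeeper problem. (Illustrative only; the exact SQL the ACM executes may differ.)

```sql
-- Replicas currently in read-only mode.
SELECT database, table, is_readonly
FROM system.replicas
WHERE is_readonly;
```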

Depending on the cluster’s status, you may see other indicators:

  • The cluster or node is restarting or rescaling
  • The cluster or node is being terminated
  • The cluster or node is stopped

Node health checks

The basic “Node is online” check appears next to the node name in the Nodes view of the cluster:

The Nodes view

Figure 14 - The Nodes view of a cluster

Opening the Node view shows more details:

Node health

Figure 15 - The health checks for a single node in the cluster

The first health check indicates that the node is online:

Node is online

The second health check indicates that 5 of the 5 node health checks passed:

All node checks passed

Clicking on this green bar takes you to a more detailed view of the health checks and their results, similar to Figure 12 above.

Cluster logs

You can look at a variety of logs by clicking the button on a cluster panel:

Cluster Logs

Figure 16 - The LOGS item in the cluster panel

You’ll see this panel:

Cluster Logs view

Figure 17 - The Logs panel

Notice at the top of the panel that there are four different logs available:

  • ClickHouse Logs: messages issued by ClickHouse itself
  • Backup Logs: messages related to system backups
  • Operator Logs: messages issued by the Altinity Kubernetes operator for ClickHouse
  • Audit Logs: messages related to significant system events initiated by a user

The upper right corner of the Logs panel includes the Download Logs button and the Refresh button.

Notifications

You can see your notifications by clicking on your username in the upper right corner of Altinity Cloud Manager:

Cluster Lock button

The Notifications menu item lets you view any notifications you have received:

The Notification History panel

Figure 18 - The Notification History dialog

Here the history shows a single message. The text of the message, its severity (Info, News, Warning, or Danger), and the time the message was received and acknowledged are displayed. The meanings of the message severities are:

  • Info: updates for general information
  • News: notifications of general news and updates in Altinity.Cloud
  • Warning: notifications of possible issues that are less than critical
  • Danger: critical notifications that can affect your clusters or account