Cluster alerts

Defining alerts and the events that trigger them

You can define cluster alerts to notify users when certain events occur. You can access alerts from the ALERTS button on a cluster's panel:

Figure 1 - The ALERTS item in the cluster panel

Types of cluster alerts

Clicking the ALERTS button displays the Cluster Alerts dialog:

Figure 2 - The Cluster Alerts dialog

Enter a comma-separated list of email addresses for the users who should be alerted when particular events occur. For each event, you can display a popup message in the Altinity Cloud Manager (ACM) UI, send an email, or both.

The different types of alerts are:

  • System Alerts: Triggered by a significant system event such as a network outage. See the table below for details of all the system alerts.
  • ClickHouse Version Upgrade: Triggered by an update to the version of ClickHouse installed in the cluster.
  • Cluster Rescale: Triggered when the cluster is rescaled.
  • Cluster Stop: Triggered when the cluster stops running, whether because some event caused a problem, because a user stopped the cluster, or because of the cluster's activity schedule.
  • Cluster Resume: Triggered when a previously stopped cluster is restarted.

A popup alert appears at the top of the ACM UI:

Figure 3 - A popup alert for a resumed cluster

System alerts sent via email look like this:

Figure 4 - An email alert

System alerts

Here’s the complete list of system alerts:

  • ClickHouse Disk Threshold Crossed (Severity: Critical): The free space on a particular disk in a particular ClickHouse cluster has fallen below a certain threshold.
  • ClickHouse Server Down (Severity: Critical): The ClickHouse server is down.
  • ClickHouse Distributed Files To Insert Continuously Growing (Severity: Critical): The number of distributed files to insert into MergeTree tables that use the Distributed table engine has been growing continuously for four hours. This warrants investigation.
  • ClickHouse Rejected Insert (Severity: Critical): The ClickHouse cluster rejected a number of INSERT statements due to a high number of active data parts for a partition in a MergeTree table. You should decrease the frequency of INSERTs. For more information, see the documentation for the system.part_log and system.merge_tree_settings tables.
  • ClickHouse Memory Resident Utilization High (Severity: Critical): ClickHouse’s resident memory utilization has been at 80% or more for longer than 10 minutes.
  • Kube Inodes Persistent Volume Usage (Severity: High): A PersistentVolumeClaim is using more than 85% of its inode capacity.
  • ClickHouse Disk Usage High (Severity: High): A particular disk in a particular cluster will run out of free space within the next 24 hours.
  • ClickHouse Too Many Mutations (Severity: High): The ClickHouse cluster has too many active mutations. This likely means something is wrong with ALTER TABLE DELETE / UPDATE queries. For more information, run clickhouse-client -q "SELECT * FROM system.mutations WHERE is_done=0 FORMAT Vertical" and look for mutation errors. Also see the documentation for the KILL MUTATION statement.
  • ClickHouse Disk Usage (Severity: High): The free space on a particular disk in a particular ClickHouse cluster will run out in the next 24 hours. To keep the cluster from switching to a read-only state, you should rescale the storage available in the cluster. The Kubernetes CSI supports resizing Persistent Volumes; you can also add another volume to a Pod and then restart that Pod. For more information, see the following documentation:
    - Resizing and rescaling storage in the Altinity Cloud Manager
    - Resizing Persistent Volumes using Kubernetes
    - Using Persistent Volumes with the Altinity Kubernetes operator for ClickHouse
    - Using multiple block devices for data storage
    - Using TTL in ClickHouse
  • ClickHouse Background Pool Fetches High (Severity: High): The number of threads used for fetching data parts from another replica for MergeTree engine tables is high. See the ClickHouse documentation on the background_fetches_pool_size and background_pool_size global server settings for more information.
  • ClickHouse Move Pool Utilization High (Severity: High): The number of threads used for moving data parts in the background to another disk or volume for MergeTree engine tables is high. See the ClickHouse documentation on the background_move_pool_size and background_pool_size global server settings for more information.
  • ClickHouse Common Pool Utilization High (Severity: High): Utilization of the common background pool for MergeTree engine tables is high. See the ClickHouse documentation on the background_pool_size global server setting for more information.
  • ClickHouse Background Pool Merges And Mutations High (Severity: High): The ratio between the number of threads and the number of background merges and mutations that can be executed concurrently is high. See the ClickHouse documentation on the background_merges_mutations_concurrency_ratio and background_pool_size global server settings for more information.
  • ClickHouse Distributed Files To Insert High (Severity: Warning): ClickHouse has too many files to insert into MergeTree tables via the Distributed table engine. When you insert data into a Distributed table, the data is written to the target MergeTree tables asynchronously: the inserted data block is first written to the local file system, then sent to the remote servers in the background as soon as possible. The period for sending data is managed by the distributed_directory_monitor_sleep_time_ms and distributed_directory_monitor_max_sleep_time_ms settings. The Distributed engine sends each file with inserted data separately, but you can enable batch sending of files with the distributed_directory_monitor_batch_insert setting. See the ClickHouse documentation for more information on managing distributed tables.
  • ClickHouse Max Part Count For Partition (Severity: Warning): The ClickHouse server has too many parts in a partition. The ClickHouse MergeTree table engine splits each INSERT query into partitions (PARTITION BY expression) and adds one or more parts per INSERT inside each partition. The background merge process then combines those parts; when there are too many unmerged parts inside a partition, SELECT query performance can degrade significantly, so ClickHouse tries to delay or reject the INSERT.
  • ClickHouse Too Many Running Queries (Severity: Warning): The ClickHouse server has too many running queries. Please analyze your workload. Each concurrent SELECT query uses memory in JOINs, uses CPU to run aggregation functions, and can read a lot of data from disk when scanning parts in partitions, utilizing disk I/O. Each concurrent INSERT query allocates around 1 MB per column in the inserted table and utilizes disk I/O. For more information, see the following ClickHouse documentation:
    - Restrictions on query complexity
    - Quotas
    - The max_concurrent_queries global server setting
    - The system.query_log table
  • ClickHouse Replicas Max Absolute Delay (Severity: Warning): The ClickHouse server has replication lag. When a replica lags too far behind, it can be skipped by distributed SELECT queries without any error, leading to inaccurate query results. Check the system.replicas and system.replication_queue tables, free disk space, and the network connection between the ClickHouse pod and ZooKeeper on the monitored clickhouse-server pods.
  • ClickHouse Delayed Insert Throttling (Severity: Info): The ClickHouse server has throttled INSERTs due to a high number of active data parts for a MergeTree partition. Please decrease the INSERT frequency. See the MergeTree documentation for more information.
  • ClickHouse Longest Running Query (Severity: Info): The ClickHouse server has queries that are running longer than expected. See the ClickHouse documentation on the system.processes table for more information.
  • ClickHouse Query Preempted (Severity: Info): The ClickHouse server has a number of queries that are stopped and waiting due to the priority setting. See the ClickHouse documentation on the system.processes table. Also try the command clickhouse-client -q "SELECT * FROM system.processes FORMAT Vertical".
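Many of these alerts can be investigated directly from clickhouse-client using the system tables cited above. The following is a sketch of some starting-point queries; the tables and columns are standard ClickHouse system tables, and the LIMIT value is arbitrary:

```sql
-- Active part counts per partition; high counts can lead to delayed
-- or rejected INSERTs (Max Part Count For Partition, Rejected Insert).
SELECT database, table, partition_id, count() AS part_count
FROM system.parts
WHERE active
GROUP BY database, table, partition_id
ORDER BY part_count DESC
LIMIT 10;

-- Unfinished mutations and the error, if any, that stalled them
-- (Too Many Mutations).
SELECT database, table, mutation_id, latest_fail_reason
FROM system.mutations
WHERE is_done = 0;

-- Free and total space per disk (Disk Threshold Crossed, Disk Usage).
SELECT name, path, free_space, total_space
FROM system.disks;

-- Replication delay per replicated table (Replicas Max Absolute Delay).
SELECT database, table, absolute_delay
FROM system.replicas
ORDER BY absolute_delay DESC;
```

Each query can be run non-interactively in the same style as the commands shown in the alert descriptions, for example clickhouse-client -q "SELECT name, free_space FROM system.disks".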