Monitoring Considerations

Monitoring Considerations

Monitoring helps to track potential issues in your cluster before they cause a critical error.

External Monitoring

External monitoring collects data from the ClickHouse cluster and uses it for analysis and review. Recommended external monitoring systems include:

ClickHouse can collect the recording of metrics internally by enabling system.metric_log in config.xml.

For dashboard system:

  • Grafana is recommended for graphs, reports, alerts, dashboard, etc.
  • Other options are Nagios or Zabbix.

The following metrics should be collected:

  • For Host Machine:
    • CPU
    • Memory
    • Network (bytes/packets)
    • Storage (iops)
    • Disk Space (free / used)
  • For ClickHouse:
    • Connections (count)
    • RWLocks
    • Read / Write / Return (bytes)
    • Read / Write / Return (rows)
    • Zookeeper operations (count)
    • Absolute delay
    • Query duration (optional)
    • Replication parts and queue (count)
  • For Zookeeper:

The following queries are recommended to be included in monitoring:

  • SELECT * FROM system.replicas
    • For more information, see the ClickHouse guide on System Tables
  • SELECT * FROM system.merges
    • Checks on the speed and progress of currently executed merges.
  • SELECT * FROM system.mutations
    • This is the source of information on the speed and progress of currently executed merges.

Monitor and Alerts

Configure the notifications for events and thresholds based on the following table:

Health Checks

The following health checks should be monitored:

Check Name Shell or SQL command Severity
ClickHouse status $ curl 'http://localhost:8123/'Ok. Critical
Too many simultaneous queries. Maximum: 100 select value from system.metrics where metric='Query' Critical
Replication status $ curl 'http://localhost:8123/replicas_status'Ok. High
Read only replicas (reflected by replicas_status as well) select value from system.metrics where metric='ReadonlyReplica’ High
ReplicaPartialShutdown (not reflected by replicas_status, but seems to correlate with ZooKeeperHardwareExceptions) select value from where event='ReplicaPartialShutdown' HighI turned this one off. It almost always correlates with ZooKeeperHardwareExceptions, and when it’s not, then there is nothing bad happening…
Some replication tasks are stuck select count()from system.replication_queuewhere num_tries > 100 High
ZooKeeper is available select count() from system.zookeeper where path='/' Critical for writes
ZooKeeper exceptions select value from where event='ZooKeeperHardwareExceptions' Medium
Other CH nodes are available $ for node in `echo "select distinct host_address from system.clusters where host_name !='localhost'" curl 'http://localhost:8123/' –silent –data-binary @-`; do curl "http://$node:8123/" –silent ; done
All CH clusters are available (i.e. every configured cluster has enough replicas to serve queries) for cluster in `echo "select distinct cluster from system.clusters where host_name !='localhost'" curl 'http://localhost:8123/' –silent –data-binary @-` ; do clickhouse-client –query="select '$cluster', 'OK' from cluster('$cluster', system, one)" ; done
There are files in 'detached' folders $ find /var/lib/clickhouse/data///detached/* -type d

wc -l;

19.8+select count() from system.detached_parts

Too many parts:

Number of parts is growing;

Inserts are being delayed;

Inserts are being rejected

select value from system.asynchronous_metrics where metric='MaxPartCountForPartition';select value from where event/metric='DelayedInserts';

select value from where event='RejectedInserts'

Dictionaries: exception select concat(name,': ',last_exception) from system.dictionarieswhere last_exception != '' Medium
ClickHouse has been restarted select uptime();select value from system.asynchronous_metrics where metric='Uptime'
DistributedFilesToInsert should not be always increasing select value from system.metrics where metric='DistributedFilesToInsert' Medium
A data part was lost select value from where event='ReplicatedDataLoss' High
Data parts are not the same on different replicas

select value from where event='DataAfterMergeDiffersFromReplica';

select value from where event='DataAfterMutationDiffersFromReplica'


Monitoring References

Last modified 2021.12.15: Updated with Kubernetes scripts.