Monitoring Considerations
Monitoring helps to track potential issues in your cluster before they cause a critical error.
External Monitoring
External monitoring collects data from the ClickHouse cluster and uses it for analysis and review. Recommended external monitoring systems include:
- Prometheus: Use the embedded exporter or clickhouse-exporter.
- Graphite: Use the embedded exporter. See config.xml.
- InfluxDB: Use the embedded exporter, plus Telegraf. For more information, see Graphite protocol support in InfluxDB.
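As a minimal sketch of the embedded Prometheus exporter option listed above, the helper below writes a config.d override that enables the `<prometheus>` server section. The helper name `write_prometheus_config`, the file name `prometheus.xml`, and port 9363 are choices made for illustration. Older ClickHouse releases use `<yandex>` instead of `<clickhouse>` as the root element, so check this against your version.

```shell
#!/bin/sh
# Sketch: enable ClickHouse's embedded Prometheus endpoint through a
# config.d override. The file name and default directory are assumptions.
write_prometheus_config() {
    prom_conf_dir="${1:-/etc/clickhouse-server/config.d}"
    mkdir -p "$prom_conf_dir" || return 1
    # Quoted heredoc delimiter: the XML is written verbatim, no expansion.
    cat > "$prom_conf_dir/prometheus.xml" <<'EOF'
<clickhouse>
    <prometheus>
        <endpoint>/metrics</endpoint>
        <port>9363</port>
        <metrics>true</metrics>
        <events>true</events>
        <asynchronous_metrics>true</asynchronous_metrics>
    </prometheus>
</clickhouse>
EOF
}

# Usage (restart the server afterwards, then scrape the endpoint):
#   write_prometheus_config /etc/clickhouse-server/config.d
#   curl --silent 'http://localhost:9363/metrics' | head
```

Keeping the override in its own config.d file leaves the main config.xml untouched, which tends to make upgrades easier.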
ClickHouse can also record a history of its metrics internally when system.metric_log is enabled in config.xml.
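Once system.metric_log is enabled, its history can be read with ordinary SQL. The sketch below builds such a query. It assumes the table exposes current metrics as columns named `CurrentMetric_<name>`; the helper name `metric_log_query` and the default 10-minute window are illustrative.

```shell
#!/bin/sh
# Sketch: print a query that reads the recent history of one current metric
# from system.metric_log (assumes the table is enabled on the server).
metric_log_query() {
    mlq_metric="$1"           # metric name, e.g. Query or TCPConnection
    mlq_minutes="${2:-10}"    # look-back window in minutes (arbitrary default)
    printf 'SELECT event_time, CurrentMetric_%s FROM system.metric_log WHERE event_time > now() - INTERVAL %s MINUTE ORDER BY event_time' \
        "$mlq_metric" "$mlq_minutes"
}

# Usage:
#   clickhouse-client --query "$(metric_log_query Query 5)"
```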
For dashboards and alerting:
- Grafana is recommended for graphs, reports, alerts, and dashboards.
- Other options are Nagios or Zabbix.
The following metrics should be collected:
- For Host Machine:
  - CPU
  - Memory
  - Network (bytes/packets)
  - Storage (iops)
  - Disk Space (free / used)
- For ClickHouse:
  - Connections (count)
  - RWLocks
  - Read / Write / Return (bytes)
  - Read / Write / Return (rows)
  - ZooKeeper operations (count)
  - Absolute delay
  - Query duration (optional)
  - Replication parts and queue (count)
- For Zookeeper:
The following queries are recommended to be included in monitoring:
- SELECT * FROM system.replicas
  - For more information, see the ClickHouse guide on System Tables.
- SELECT * FROM system.merges
  - Shows the speed and progress of currently executing merges.
- SELECT * FROM system.mutations
  - Shows the speed and progress of currently executing mutations.
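In practice `SELECT *` on these tables returns many columns, so a monitoring job may prefer narrower queries. The following is a sketch only: the helper name `monitoring_query` and the chosen columns and filters are assumptions, not a prescribed set.

```shell
#!/bin/sh
# Sketch: print a narrowed monitoring query for one of the three system
# tables discussed above. Column selection and filters are illustrative.
monitoring_query() {
    case "$1" in
        replicas)
            # Replicas that are read-only, lagging, or have queued tasks.
            echo "SELECT database, table, is_readonly, absolute_delay, queue_size FROM system.replicas WHERE is_readonly OR absolute_delay > 0 OR queue_size > 0" ;;
        merges)
            # Currently executing merges, longest running first.
            echo "SELECT database, table, elapsed, round(progress, 2) AS progress, num_parts FROM system.merges ORDER BY elapsed DESC" ;;
        mutations)
            # Mutations that have not finished yet.
            echo "SELECT database, table, mutation_id, parts_to_do, latest_fail_reason FROM system.mutations WHERE NOT is_done" ;;
        *)
            echo "unknown query name: $1" >&2
            return 1 ;;
    esac
}

# Usage:
#   clickhouse-client --query "$(monitoring_query replicas)"
```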
Monitor and Alerts
Configure the notifications for events and thresholds based on the following table:
Health Checks
The following health checks should be monitored:
Check Name | Shell or SQL command | Severity |
---|---|---|
ClickHouse status | $ curl 'http://localhost:8123/' (expected response: Ok.) | Critical |
Too many simultaneous queries. Maximum: 100 | select value from system.metrics where metric='Query' | Critical |
Replication status | $ curl 'http://localhost:8123/replicas_status' (expected response: Ok.) | High |
Read-only replicas (reflected by replicas_status as well) | select value from system.metrics where metric='ReadonlyReplica' | High |
ReplicaPartialShutdown (not reflected by replicas_status, but seems to correlate with ZooKeeperHardwareExceptions) | select value from system.events where event='ReplicaPartialShutdown' | High (author's note: I turned this one off. It almost always correlates with ZooKeeperHardwareExceptions, and when it does not, nothing bad is happening.) |
Some replication tasks are stuck | select count() from system.replication_queue where num_tries > 100 | High |
ZooKeeper is available | select count() from system.zookeeper where path='/' | Critical for writes |
ZooKeeper exceptions | select value from system.events where event='ZooKeeperHardwareExceptions' | Medium |
Other CH nodes are available | $ for node in $(echo "select distinct host_address from system.clusters where host_name != 'localhost'" \| curl 'http://localhost:8123/' --silent --data-binary @-); do curl "http://$node:8123/" --silent ; done | |
All CH clusters are available (i.e. every configured cluster has enough replicas to serve queries) | $ for cluster in $(echo "select distinct cluster from system.clusters where host_name != 'localhost'" \| curl 'http://localhost:8123/' --silent --data-binary @-); do clickhouse-client --query="select '$cluster', 'OK' from cluster('$cluster', system, one)" ; done | |
There are files in 'detached' folders | $ find /var/lib/clickhouse/data/*/*/detached/* -type d \| wc -l; on ClickHouse 19.8+: select count() from system.detached_parts | |
Too many parts: number of parts is growing; inserts are being delayed; inserts are being rejected | select value from system.asynchronous_metrics where metric='MaxPartCountForPartition'; select value from system.events where event='DelayedInserts'; select value from system.metrics where metric='DelayedInserts'; select value from system.events where event='RejectedInserts' | Critical |
Dictionaries: exception | select concat(name,': ',last_exception) from system.dictionaries where last_exception != '' | Medium |
ClickHouse has been restarted | select uptime(); select value from system.asynchronous_metrics where metric='Uptime' | |
DistributedFilesToInsert should not be always increasing | select value from system.metrics where metric='DistributedFilesToInsert' | Medium |
A data part was lost | select value from system.events where event='ReplicatedDataLoss' | High |
Data parts are not the same on different replicas | select value from system.events where event='DataAfterMergeDiffersFromReplica'; select value from system.events where event='DataAfterMutationDiffersFromReplica' | Medium |
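The numeric checks in the table can be wrapped in a small threshold helper that a cron job or monitoring agent calls. This is a sketch under assumptions: the helper name `check_threshold`, its output format, and the rule "alert when the value reaches the limit" are illustrative, and the limit of 100 comes from the "Maximum: 100" note in the table.

```shell
#!/bin/sh
# Sketch: compare one numeric check result against a limit and report it
# with the severity taken from the table above.
# Arguments: NAME VALUE LIMIT SEVERITY
check_threshold() {
    ct_name="$1"; ct_value="$2"; ct_limit="$3"; ct_severity="$4"
    if [ "$ct_value" -ge "$ct_limit" ]; then
        echo "$ct_severity: $ct_name = $ct_value (limit $ct_limit)"
        return 1
    fi
    echo "OK: $ct_name = $ct_value"
}

# Usage (requires a reachable server; shown as comments only):
#   queries=$(clickhouse-client --query "select value from system.metrics where metric='Query'")
#   check_threshold "simultaneous queries" "$queries" 100 CRITICAL
#
#   stuck=$(clickhouse-client --query "select count() from system.replication_queue where num_tries > 100")
#   check_threshold "stuck replication tasks" "$stuck" 1 HIGH
```

The non-zero return code on a breach lets the caller chain checks or map the result onto whatever exit-code convention its alerting tool expects.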