Zookeeper Recovery
If there are issues with your Zookeeper environment managing your ClickHouse® clusters, the following steps can resolve them. Altinity customers can contact Support if those issues persist.
Fault Diagnosis and Remediation
The following procedures can resolve issues.
- IMPORTANT NOTE: Some procedures shown below may have a degree of risk depending on the underlying problem. For particularly dangerous procedures, we recommend that you contact Altinity Support as your first step.
Restarting a crashed ClickHouse server
ClickHouse servers are managed by `systemd` and normally restart following a crash. If a server does not restart automatically, follow these steps:
- Access the ClickHouse error log for the failed server. The default location is `/var/log/clickhouse-server/clickhouse-server.err.log`.
- Examine the last log entry and look for a stack trace showing the cause of the failure.
- If there is a stack trace:
  - If the problem is obvious, fix the problem and run `systemctl restart clickhouse-server` to restart. Confirm that the server restarts.
  - If the problem is not obvious, contact Altinity Support and provide the error log message.
- If there is no stack trace, ClickHouse may have been terminated by the OOM-killer due to excessive memory usage:
  - Open the most recent syslog file at `/var/log/syslog`.
  - Look for OOM-killer messages.
  - If found, see Handling out-of-memory errors below.
  - If the problem is not obvious, contact Altinity Support and provide a description of the problem.
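The triage above can be sketched as a small set of shell functions. This is only an illustration: the function names are hypothetical, the `Stack trace` marker and the OOM-killer patterns are assumptions about typical ClickHouse and Linux kernel log wording, and the log paths are passed in as arguments rather than hard-coded.

```shell
# Print the number of lines in a log file that look like OOM-killer activity.
# The patterns are assumptions based on common Linux kernel messages.
count_oom_events() {
    grep -Eic 'invoked oom-killer|out of memory' "$1" || true
}

# Succeed if the last N lines (default 200) of an error log mention a stack trace.
# The "Stack trace" marker is an assumption about the log wording.
has_stack_trace() {
    tail -n "${2:-200}" "$1" | grep -q 'Stack trace'
}

# Suggest the next step of the procedure for a given error log and syslog file.
diagnose_crash() {
    err_log=$1
    syslog=$2
    if has_stack_trace "$err_log"; then
        echo "stack trace found: review the end of $err_log"
        return 0
    fi
    oom=$(count_oom_events "$syslog" 2>/dev/null)
    if [ "${oom:-0}" -gt 0 ]; then
        echo "OOM-killer messages found in $syslog"
    else
        echo "no obvious cause: contact support with a description of the problem"
    fi
}
```

A typical call would be `diagnose_crash "$ERR_LOG" /var/log/syslog`, where `ERR_LOG` holds the path of the error log from the first step. The functions only read the logs; restarting the server remains a manual decision.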
Replacing a failed cluster node
- Ensure the old node is truly offline and will not return.
- Create a new node with the same `macros.xml` definitions as the previous node.
- If possible, use the same hostname as the failed node.
- Copy the metadata folder from a healthy replica.
- Set the `force_restore_data` flag so that ClickHouse wipes out existing Zookeeper information for the node and replicates all data: `sudo -u clickhouse touch /var/lib/clickhouse/flags/force_restore_data`
- Start ClickHouse.
- Wait until all tables are replicated. You can check progress using: `SELECT count(*) FROM system.replication_queue`
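The progress check in the last step can be automated with a polling loop. The sketch below is one possible approach rather than an official tool: `queue_size` and `wait_for_replication` are hypothetical names, the attempt count and delay are arbitrary defaults, and it assumes `clickhouse-client` can reach the new node without extra connection options.

```shell
# Print the current number of entries in the replication queue.
# Assumes clickhouse-client connects to the new node with default options.
queue_size() {
    clickhouse-client -q "SELECT count(*) FROM system.replication_queue"
}

# Poll until the replication queue is empty.
# $1: maximum number of attempts (default 60)
# $2: seconds to sleep between attempts (default 10)
# Returns 0 when the queue is empty, 1 on timeout, 2 if the query fails.
wait_for_replication() {
    max_attempts=${1:-60}
    delay=${2:-10}
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        size=$(queue_size) || return 2
        if [ "$size" -eq 0 ]; then
            echo "replication queue is empty"
            return 0
        fi
        echo "attempt $attempt: $size entries remaining"
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 1
}
```

For example, `wait_for_replication 120 30` checks every 30 seconds for up to an hour. An empty queue only shows that no replication tasks are pending, so it is still worth spot-checking row counts against a healthy replica.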
Replacing a failed Zookeeper node
- Configure Zookeeper on a new server.
- Use the same hostname and `myid` as the failed node if possible.
- Start Zookeeper on the new node.
- Verify the new node can connect to the ensemble.
- If the Zookeeper environment does not support dynamic configuration changes:
  - If the new node has a different hostname or `myid`, modify `zoo.cfg` on the other nodes of the ensemble and restart them.
  - ClickHouse's sessions will be interrupted during this process.
- Make changes in ClickHouse configuration files if needed. A restart might be required for the changes to take effect.
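One way to verify that the new node has joined the ensemble is to ask it for its role with ZooKeeper's `srvr` four-letter command. The sketch below assumes `nc` is installed, that the node listens on the default client port 2181, and that `srvr` is permitted by the server's four-letter-word whitelist; the function names are hypothetical.

```shell
# Send the "srvr" four-letter command to a ZooKeeper node and print the reply.
# $1: host name, $2: client port (default 2181)
zk_srvr() {
    printf 'srvr' | nc -w 2 "$1" "${2:-2181}"
}

# Read "srvr" output on stdin and print the node's mode
# (leader, follower, or standalone). Prints nothing if no Mode line is present.
parse_zk_mode() {
    awk -F': ' '$1 == "Mode" { print $2 }'
}
```

For example, `zk_srvr zk-new.example.com | parse_zk_mode` (the host name is a placeholder) should print `follower` or `leader` once the node is part of the ensemble. An empty result suggests that the node is not serving requests or that the command is not whitelisted.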
Recovering from complete Zookeeper loss
Complete loss of Zookeeper is a serious event and should be avoided at all costs through proper Zookeeper management. Contact Altinity Support before starting this procedure. Follow this procedure only if you have lost all data in Zookeeper, because it is time-intensive and makes the affected tables unavailable while it runs.
- **Again, start by contacting Altinity Support.**
- Ensure that Zookeeper is empty and working properly.
- Follow the instructions from Recovering from complete metadata loss in Zookeeper or from the blog post A New Way to Restore ClickHouse After Zookeeper Metadata Is Lost.
- ClickHouse will sync from the healthy table to all other tables.
Read-only tables
Read-only tables occur when ClickHouse cannot access Zookeeper to record inserts on a replicated table.
- Log in with `clickhouse-client`.
- Execute the following query to confirm that ClickHouse can connect to Zookeeper: `clickhouse-client -q "select * from system.zookeeper where path='/'"`
- This query should return one or more ZNode directories.
- Execute the following query to check the state of the table: `SELECT * FROM system.replicas WHERE table='table_name'`
- If there are connectivity problems, check the following:
  - Ensure the `<zookeeper>` tag in the ClickHouse configuration has the correct Zookeeper host names and ports.
  - Ensure that Zookeeper is running.
  - Ensure that Zookeeper is accepting connections. Log in to the Zookeeper host and try to connect using `zkCli.sh`.
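The table-state check above can be narrowed to the columns that matter for this problem. The following sketch relies on the `is_readonly` and `is_session_expired` columns of `system.replicas`; the function names are hypothetical, and the table name is inserted into the query without escaping, so it is only suitable for trusted input.

```shell
# Print a query that reports the read-only state of every replica of one table.
# $1: table name (trusted input only; no escaping is performed)
replica_state_query() {
    printf "SELECT database, table, is_readonly, is_session_expired FROM system.replicas WHERE table = '%s' FORMAT TSV" "$1"
}

# Read TSV rows (database, table, is_readonly, is_session_expired) on stdin
# and print database.table for each replica that is currently read-only.
readonly_replicas() {
    awk -F'\t' '$3 == 1 { print $1 "." $2 }'
}
```

A possible invocation is `clickhouse-client -q "$(replica_state_query table_name)" | readonly_replicas`. If it prints nothing, no replica of that table is read-only on the server you queried.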