Care and Feeding of ZooKeeper with ClickHouse

Installing, configuring, and recovering ZooKeeper

ZooKeeper is required for ClickHouse cluster replication. Keeping ZooKeeper properly maintained and fed provides the best performance and reduces the likelihood that your ZooKeeper nodes will become “sick”.

Elements of this guide can also be found in the ClickHouse on Kubernetes Quick Start guide, which details how to use Kubernetes and ZooKeeper with the clickhouse-operator.

1 - ZooKeeper Installation and Configuration

How to configure ZooKeeper to work best with ClickHouse

Prepare and Start ZooKeeper

Preparation

Before beginning, determine whether ZooKeeper will run in standalone or replicated mode.

  • Standalone mode: One ZooKeeper server to service the entire ClickHouse cluster. Best for evaluation, development, and testing.
    • Should never be used in production environments.
  • Replicated mode: Multiple ZooKeeper servers in a group called an ensemble. Replicated mode is recommended for production systems.
    • A minimum of 3 ZooKeeper servers is required.
    • 3 servers is the optimal setup; with proper tuning it works even for heavily loaded systems.
    • 5 servers is less likely to lose quorum entirely, but also results in longer quorum acquisition times.
    • Additional servers can be added, but the total should always be an odd number.

Precautions

Observe the following precautions:

  • Never deploy even numbers of ZooKeeper servers in an ensemble.
  • Do not install ZooKeeper on ClickHouse nodes.
  • Do not share ZooKeeper with other applications like Kafka.
  • Place the ZooKeeper dataDir and logDir on fast storage that will not be used for anything else.

Applications to Install

Install the following applications on your servers:

  1. zookeeper (3.4.9 or later)
  2. netcat

Configure ZooKeeper

  1. /etc/zookeeper/conf/myid

    The myid file consists of a single line containing only the text of that machine’s id. So myid of server 1 would contain the text “1” and nothing else. The id must be unique within the ensemble and should have a value between 1 and 255.
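
    For example, on server 1 (assuming the Debian-style configuration path used above):

    echo "1" | sudo tee /etc/zookeeper/conf/myid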

  2. /etc/zookeeper/conf/zoo.cfg

    Every machine that is part of the ZooKeeper ensemble should know about every other machine in the ensemble. You accomplish this with a series of lines of the form server.id=host:port:port

    # specify all zookeeper servers
    # The first port is used by followers to connect to the leader
    # The second one is used for leader election
    server.1=zookeeper1:2888:3888
    server.2=zookeeper2:2888:3888
    server.3=zookeeper3:2888:3888
    

    These lines must be the same on every ZooKeeper node.

  3. /etc/zookeeper/conf/zoo.cfg

    This setting MUST be added on every ZooKeeper node:

    # The time interval in hours for which the purge task has to be triggered.
    # Set to a positive integer (1 and above) to enable the auto purging. Defaults to 0.
    autopurge.purgeInterval=1
    autopurge.snapRetainCount=5
    

Install ZooKeeper

Depending on your environment, follow the Apache ZooKeeper Getting Started guide, or the ZooKeeper Administrator's Guide.

Start ZooKeeper

Depending on your installation, start ZooKeeper with the following command:

sudo -u zookeeper /usr/share/zookeeper/bin/zkServer.sh start
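
If ZooKeeper was installed from a distribution package, it may be managed by systemd instead; in that case the equivalent is the following (the service name zookeeper is an assumption that depends on your packaging):

# Service name may differ depending on your distribution's package
sudo systemctl start zookeeper
sudo systemctl enable zookeeper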

Verify ZooKeeper is Running

Use the following commands to verify ZooKeeper is available:

echo ruok | nc localhost 2181
echo mntr | nc localhost 2181
echo stat | nc localhost 2181
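
A healthy server replies to ruok with imok. Note that ZooKeeper 3.5 and later only answer four-letter-word commands that have been explicitly whitelisted; if the commands above return nothing, add the following to zoo.cfg and restart ZooKeeper:

4lw.commands.whitelist=ruok,mntr,stat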

Check the following files and directories to verify ZooKeeper is running and making updates:

  • Logs: /var/log/zookeeper/zookeeper.log
  • Snapshots: /var/lib/zookeeper/version-2/

Connect to ZooKeeper

From the localhost, connect to ZooKeeper with the following command to verify access (replace the IP address with your ZooKeeper server):

bin/zkCli.sh -server 127.0.0.1:2181
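
Once connected, a quick sanity check is to list the root znode; even a fresh installation shows the built-in zookeeper node:

[zk: 127.0.0.1:2181(CONNECTED) 0] ls /
[zookeeper]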

Tune ZooKeeper

The following optional settings can be used depending on your requirements.

Improve Node Communication Reliability

The following settings can be used to improve node communication reliability:

/etc/zookeeper/conf/zoo.cfg
# The number of ticks that the initial synchronization phase can take
initLimit=10
# The number of ticks that can pass between sending a request and getting an acknowledgement
syncLimit=5
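
Both limits are measured in ticks, whose length is set by tickTime in the same file. With the common default shown below, initLimit=10 allows 20 seconds for the initial synchronization phase and syncLimit=5 allows 10 seconds between a request and its acknowledgement:

# The basic time unit in milliseconds used by ZooKeeper (this is the default)
tickTime=2000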

Reduce Snapshots

The following settings control transaction log preallocation and snapshot frequency; the values shown below are the ZooKeeper defaults. Increasing snapCount takes snapshots less often, which may reduce system requirements.

/etc/zookeeper/conf/zoo.cfg
# To avoid seeks ZooKeeper allocates space in the transaction log file in blocks of preAllocSize kilobytes.
# The default block size is 64M. One reason for changing the size of the blocks is to reduce the block size
# if snapshots are taken more often. (Also, see snapCount).
preAllocSize=65536
# ZooKeeper logs transactions to a transaction log. After snapCount transactions are written to a log file a
# snapshot is started and a new transaction log file is started. The default snapCount is 10,000.
snapCount=10000

Configuring ClickHouse to use ZooKeeper

Once ZooKeeper has been installed and configured, ClickHouse can be modified to use ZooKeeper. After the following steps are completed, a restart of ClickHouse will be required.

To configure ClickHouse to use ZooKeeper, follow the steps shown below. The recommended settings are located on ClickHouse.tech zookeeper server settings.

  1. Create a configuration file with the list of ZooKeeper nodes. Best practice is to put the file in /etc/clickhouse-server/config.d/zookeeper.xml.

    <yandex>
        <zookeeper>
            <node>
                <host>example1</host>
                <port>2181</port>
            </node>
            <node>
                <host>example2</host>
                <port>2181</port>
            </node>
            <session_timeout_ms>30000</session_timeout_ms>
            <operation_timeout_ms>10000</operation_timeout_ms>
            <!-- Optional. Chroot suffix. Should exist. -->
            <root>/path/to/zookeeper/node</root>
            <!-- Optional. ZooKeeper digest ACL string. -->
            <identity>user:password</identity>
        </zookeeper>
    </yandex>
    
  2. Check the distributed_ddl parameter in config.xml. This parameter can also be defined in another configuration file, and the path can be changed to any value you like. If several ClickHouse clusters share the same ZooKeeper ensemble, the distributed_ddl path must be unique for each ClickHouse cluster setup.

    <!-- Allow to execute distributed DDL queries (CREATE, DROP, ALTER, RENAME) on cluster. -->
    <!-- Works only if ZooKeeper is enabled. Comment it out if such functionality isn't required. -->
    <distributed_ddl>
        <!-- Path in ZooKeeper to queue with DDL queries -->
        <path>/clickhouse/task_queue/ddl</path>
    
        <!-- Settings from this profile will be used to execute DDL queries -->
        <!-- <profile>default</profile> -->
    </distributed_ddl>
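
    For example, a second ClickHouse cluster sharing the same ZooKeeper ensemble could use a distinct path (cluster2 is a hypothetical name):

    <distributed_ddl>
        <path>/clickhouse/cluster2/task_queue/ddl</path>
    </distributed_ddl>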
    
  3. Check /etc/clickhouse-server/preprocessed/config.xml. You should see your changes there.

  4. Restart ClickHouse. Check ClickHouse connection to ZooKeeper detailed in ZooKeeper Monitoring.

Converting Tables to Replicated Tables

Creating a replicated table

Replicated tables use a replicated table engine, for example ReplicatedMergeTree. The following example shows how to create a simple replicated table.

This example assumes that you have defined appropriate macro values for cluster, shard, and replica in macros.xml to enable cluster replication using ZooKeeper. For details consult the ClickHouse.tech Data Replication guide.

CREATE TABLE test ON CLUSTER '{cluster}'
(
    timestamp DateTime,
    contractid UInt32,
    userid UInt32
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{cluster}/{shard}/default/test', '{replica}')
PARTITION BY toYYYYMM(timestamp)
ORDER BY (contractid, toDate(timestamp), userid)
SAMPLE BY userid;

The ON CLUSTER clause ensures the table will be created on the nodes of {cluster} (a macro value). This example automatically creates a ZooKeeper path for each shard of the table by substituting the macro values, so the path:

/clickhouse/tables/{cluster}/{shard}/default/test

becomes:

/clickhouse/tables/c1/0/default/test

You can see ZooKeeper replication data for this node with the following query (updating the path based on your environment):

SELECT *
  FROM system.zookeeper
  WHERE path = '/clickhouse/tables/c1/0/default/test'
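
To confirm replication end to end, you can insert a row on one replica and read it back on another; a minimal sketch using the test table above:

-- On one replica
INSERT INTO test VALUES (now(), 1, 42);

-- On another replica; the row should arrive within moments
SELECT count() FROM test;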

Removing a replicated table

To remove a replicated table, use DROP TABLE as shown in the following example. The ON CLUSTER clause ensures the table will be deleted on all nodes. Omit it to delete the table on only a single node.

DROP TABLE test ON CLUSTER '{cluster}';

As each table replica is deleted, the node is removed from replication and the information for that replica is cleaned up. When no more replicas exist, all ZooKeeper data for the table will be cleared.

Cleaning up ZooKeeper data for replicated tables

  • IMPORTANT NOTE: Cleaning up ZooKeeper data manually can corrupt replication if you make a mistake. Raise a support ticket and ask for help if you have any doubt concerning the procedure.

Newer ClickHouse versions support SYSTEM DROP REPLICA, which is an easier and safer way to clean up replica metadata.

For example:

SYSTEM DROP REPLICA 'replica_name' FROM ZKPATH '/path/to/table/in/zk';
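
If the table still exists on the other replicas, the replica can also be dropped by table or database name instead of the raw ZooKeeper path (db.table here is a placeholder):

SYSTEM DROP REPLICA 'replica_name' FROM TABLE db.table;
SYSTEM DROP REPLICA 'replica_name' FROM DATABASE db;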

ZooKeeper data for the table might not be cleared fully if there is an error when deleting the table, if the table becomes corrupted, or if the replica is lost. In that case you can clean up ZooKeeper data manually using the ZooKeeper rmr command. Here is the procedure:

  1. Log in to the ZooKeeper server.
  2. Run the zkCli.sh command to connect to the server.
  3. Locate the path to be deleted, e.g.:
    ls /clickhouse/tables/c1/0/default/test
  4. Remove the path recursively, e.g.:
    rmr /clickhouse/tables/c1/0/default/test
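
Note that ZooKeeper 3.5 and later replace the rmr command with deleteall, e.g.:

deleteall /clickhouse/tables/c1/0/default/test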

2 - ZooKeeper Monitoring

Verifying ZooKeeper and ClickHouse are working together.

ZooKeeper Monitoring

For organizations that already have Apache ZooKeeper configured either manually, or with a Kubernetes operator such as the clickhouse-operator for Kubernetes, monitoring your ZooKeeper nodes will help you recover from issues before they happen.

Checking ClickHouse connection to ZooKeeper

To check connectivity between ClickHouse and ZooKeeper:

  1. Confirm that ClickHouse can connect to ZooKeeper. You should be able to query the system.zookeeper table, and see the path for distributed DDL created in ZooKeeper through that table. If something went wrong, check the ClickHouse logs.

    $ clickhouse-client -q "select * from system.zookeeper where path='/clickhouse/task_queue/'"
    ddl 17183334544    17183334544    2019-02-21 21:18:16    2019-02-21 21:18:16    0    8    0    0    0    8    17183370142    /clickhouse/task_queue/
    
  2. Confirm that ZooKeeper accepts connections from ClickHouse. On the ZooKeeper nodes you can also see whether a connection was established and find the IP address of the ClickHouse server in the list of clients:

    $ echo stat | nc localhost 2181
    ZooKeeper version: 3.4.9-3--1, built on Wed, 23 May 2018 22:34:43 +0200
    Clients:
     /10.25.171.52:37384[1](queued=0,recved=1589379,sent=1597897)
     /127.0.0.1:35110[0](queued=0,recved=1,sent=0)
    

ZooKeeper Monitoring Quick List

The following commands are available to verify ZooKeeper availability and highlight potential issues; the severity of each check is shown in parentheses:

  • ZooKeeper is available (Critical for writes):
    select count() from system.zookeeper where path='/'
  • ZooKeeper exceptions (Medium):
    select value from system.events where event='ZooKeeperHardwareExceptions'
  • Read-only tables are unavailable for writes (High):
    select value from system.metrics where metric='ReadonlyReplica'
  • A data part was lost (High):
    select value from system.events where event='ReplicatedDataLoss'
  • Data parts are not the same on different replicas (Medium):
    select value from system.events where event='DataAfterMergeDiffersFromReplica';
    select value from system.events where event='DataAfterMutationDiffersFromReplica'

3 - ZooKeeper Recovery

How to recover when ZooKeeper has issues.

If there are issues with your ZooKeeper environment managing your ClickHouse clusters, the following steps can resolve them. If there are still issues, support is available to current Altinity customers.

Raising a Support Ticket

Altinity accepts support cases from its support partners via the Altinity Zendesk Portal, email, or Slack.

To log a ticket in the support portal:

  1. Log in to https://altinity.zendesk.com using your email address.
  2. Press the “Add +” button to open a case.
  3. Enter the case topic and details. Please include relevant information such as:
    1. ClickHouse version
    2. Error messages
    3. How to recreate the problem, if known

You can also log a ticket by sending the above information to support@altinity.com from your registered email address, or to the shared Slack channel if available.

Fault Diagnosis and Remediation

The following procedures can resolve issues.

  • IMPORTANT NOTE: Some procedures shown below may have a degree of risk depending on the underlying problem. These procedures are marked with Call Support and include raising a support ticket as the first step.

Restarting a crashed ClickHouse server

ClickHouse servers are managed by systemd and normally restart following a crash. If a server does not restart automatically, follow these steps:

  1. Access the ClickHouse error log for the failed server at /var/log/clickhouse-server/clickhouse-server.err.log.
  2. Examine the last log entry and look for a stack trace showing the cause of the failure.
  3. If there is a stack trace:
    1. If the problem is obvious, fix the problem and run systemctl restart clickhouse-server to restart. Confirm that the server restarts.
    2. If the problem is not obvious, open an Altinity Support Ticket and provide the error log message.
  4. If there is no stack trace, ClickHouse may have been terminated by the OOM-killer due to excessive memory usage:
    1. Open the most recent syslog file at /var/log/syslog.
    2. Look for OOM-killer messages.
    3. If found, see Handling out-of-memory errors below.
    4. If the problem is not obvious, raise a support ticket and provide a description of the problem.

Replacing a failed cluster node

  1. Ensure the old node is truly offline and will not return.

  2. Create a new node with the same macros.xml definitions as the previous node.

  3. If possible use the same hostname as the failed node.

  4. Copy the metadata folder from a healthy replica.
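
    A minimal sketch, assuming SSH access and a healthy replica reachable as healthy-node (a hypothetical hostname):

    # Copy table and database definitions from a healthy replica, then fix ownership
    sudo rsync -a healthy-node:/var/lib/clickhouse/metadata/ /var/lib/clickhouse/metadata/
    sudo chown -R clickhouse:clickhouse /var/lib/clickhouse/metadata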

  5. Set the force_restore_data flag so that ClickHouse wipes out existing ZooKeeper information for the node and replicates all data:

    sudo -u clickhouse touch /var/lib/clickhouse/flags/force_restore_data

  6. Start ClickHouse.

  7. Wait until all tables are replicated. You can check progress using:

    SELECT count(*) FROM system.replication_queue
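
    Per-table progress can also be watched through system.replicas, for example:

    SELECT database, table, queue_size, absolute_delay FROM system.replicas;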

Replacing a failed ZooKeeper node

  1. Configure ZooKeeper on a new server.
  2. Use the same hostname and myid as the failed node if possible.
  3. Start ZooKeeper on the new node.
  4. Verify the new node can connect to the ensemble.
  5. If the ZooKeeper environment does not support dynamic configuration changes:
    1. If the new node has a different hostname or myid, modify zoo.cfg on the other nodes of the ensemble and restart them.
    2. ClickHouse sessions will be interrupted during this process.
  6. Make changes in ClickHouse configuration files if needed. A restart might be required for the changes to take effect.

Recovering from complete ZooKeeper loss (Call Support)

Complete loss of ZooKeeper is a serious event and should be avoided at all costs by proper ZooKeeper management. Follow this procedure only if you have lost all data in ZooKeeper as it is time-intensive and will cause affected tables to be unavailable.

  1. Raise a support ticket before taking any steps.
  2. Ensure that ZooKeeper is empty and working properly.
  3. Follow the instructions from Recovering from complete metadata loss in ZooKeeper or from the blog post A New Way to Restore ClickHouse After ZooKeeper Metadata Is Lost.
  4. ClickHouse will sync the data from the healthy replica to all other replicas.

Read-only tables

Read-only tables occur when ClickHouse cannot access ZooKeeper to record inserts on a replicated table.

  1. Log in with clickhouse-client.

  2. Execute the following query to confirm that ClickHouse can connect to ZooKeeper:

    $ clickhouse-client -q "select * from system.zookeeper where path='/'"

  3. This query should return one or more ZNode directories.

  4. Execute the following query to check the state of the table.

    SELECT * from system.replicas where table='table_name'

  5. If there are connectivity problems, check the following:

    1. Ensure the <zookeeper> tag in ClickHouse configuration has the correct ZooKeeper host names and ports.
    2. Ensure that ZooKeeper is running.
    3. Ensure that ZooKeeper is accepting connections. Log in to the ZooKeeper host and try to connect using zkCli.sh.
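
Once connectivity is restored, replicated tables normally leave read-only mode on their own. If a table stays read-only, restarting its replication session can help (db.table_name is a placeholder):

SYSTEM RESTART REPLICA db.table_name;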