Zookeeper Installation and Configuration

How to configure Zookeeper to work best with ClickHouse®

Prepare and Start Zookeeper

Preparation

Before beginning, determine whether Zookeeper will run in standalone or replicated mode.

Standalone mode: One Zookeeper server to service the entire ClickHouse® cluster. Best for evaluation, development, and testing.
- Should never be used for production environments.
Replicated mode: Multiple Zookeeper servers in a group called an ensemble. Replicated mode is recommended for production systems.
- A minimum of 3 Zookeeper servers are required.
- 3 servers is the optimal setup that functions even with heavily loaded systems with proper tuning.
- 5 servers is less likely to lose quorum entirely, but also results in longer quorum acquisition times.
- Additional servers can be added, but should always be an odd number of servers.

Precautions

The following practices should be avoided:

Never deploy even numbers of Zookeeper servers in an ensemble.
Do not install Zookeeper on ClickHouse nodes.
Do not share Zookeeper with other applications like Kafka.
Place the Zookeeper dataDir and logDir on fast storage that will not be used for anything else.

Applications to Install

Install the following applications in your servers:

zookeeper (3.4.9 or later)
netcat

Configure Zookeeper

/etc/zookeeper/conf/myid

The myid file consists of a single line containing only the text of that machine’s id. So myid of server 1 would contain the text “1” and nothing else. The id must be unique within the ensemble and should have a value between 1 and 255.
/etc/zookeeper/conf/zoo.cfg

Every machine that is part of the Zookeeper ensemble should know about every other machine in the ensemble. You accomplish this with a series of lines of the form server.id=host:port:port
```
# specify all zookeeper servers
# The first port is used by followers to connect to the leader
# The second one is used for leader election
server.1=zookeeper1:2888:3888
server.2=zookeeper2:2888:3888
server.3=zookeeper3:2888:3888
```
These lines must be the same on every Zookeeper node

/etc/zookeeper/conf/zoo.cfg

This setting MUST be added on every Zookeeper node:

# The time interval in hours for which the purge task has to be triggered.
# Set to a positive integer (1 and above) to enable the auto purging. Defaults to 0.
autopurge.purgeInterval=1
autopurge.snapRetainCount=5

Install Zookeeper

Depending on your environment, follow the Apache Zookeeper Getting Started guide, or the Zookeeper Administrator's Guide.

Start Zookeeper

Depending on your installation, start Zookeeper with the following command:

sudo -u zookeeper /usr/share/zookeeper/bin/zkServer.sh

Verify Zookeeper is Running

Use the following commands to verify Zookeeper is available:

echo ruok | nc localhost 2181
echo mntr | nc localhost 2181
echo stat | nc localhost 2181

Check the following files and directories to verify Zookeeper is running and making updates:

Logs: /var/log/zookeeper/zookeeper.log
Snapshots: /var/lib/zookeeper/version-2/

Connect to Zookeeper

From the localhost, connect to Zookeeper with the following command to verify access (replace the IP address with your Zookeeper server):

bin/zkCli.sh -server 127.0.0.1:2181

Tune Zookeeper

The following optional settings can be used depending on your requirements.

Improve Node Communication Reliability

The following settings can be used to improve node communication reliability:

/etc/zookeeper/conf/zoo.cfg
# The number of ticks that the initial synchronization phase can take
initLimit=10
# The number of ticks that can pass between sending a request and getting an acknowledgement
syncLimit=5

Reduce Snapshots

The following settings will create fewer snapshots which may reduce system requirements.

/etc/zookeeper/conf/zoo.cfg
# To avoid seeks Zookeeper allocates space in the transaction log file in blocks of preAllocSize kilobytes.
# The default block size is 64M. One reason for changing the size of the blocks is to reduce the block size
# if snapshots are taken more often. (Also, see snapCount).
preAllocSize=65536
# Zookeeper logs transactions to a transaction log. After snapCount transactions are written to a log file a
# snapshot is started and a new transaction log file is started. The default snapCount is 10,000.
snapCount=10000

Documentation

Configuring ClickHouse to use Zookeeper

Once Zookeeper has been installed and configured, ClickHouse can be modified to use Zookeeper. After the following steps are completed, a restart of ClickHouse will be required.

To configure ClickHouse to use Zookeeper, follow the steps shown below. The recommended settings are located on ClickHouse.tech zookeeper server settings.

Create a configuration file with the list of Zookeeper nodes. Best practice is to put the file in /etc/clickhouse-server/config.d/zookeeper.xml.

<yandex>
    <zookeeper>
        <node>
            <host>example1</host>
            <port>2181</port>
        </node>
        <node>
            <host>example2</host>
            <port>2181</port>
        </node>
        <session_timeout_ms>30000</session_timeout_ms>
        <operation_timeout_ms>10000</operation_timeout_ms>
        <!-- Optional. Chroot suffix. Should exist. -->
        <root>/path/to/zookeeper/node</root>
        <!-- Optional. ZooKeeper digest ACL string. -->
        <identity>user:password</identity>
    </zookeeper>
</yandex>

Check the distributed_ddl parameter in config.xml. This parameter can be defined in another configuration file, and can change the path to any value that you like. If you have several ClickHouse clusters using the same Zookeeper, distributed_ddl path should be unique for every ClickHouse cluster setup.

<!-- Allow to execute distributed DDL queries (CREATE, DROP, ALTER, RENAME) on cluster. -->
<!-- Works only if Zookeeper is enabled. Comment it out if such functionality isn't required. -->
<distributed_ddl>
    <!-- Path in Zookeeper to queue with DDL queries -->
    <path>/clickhouse/task_queue/ddl</path>

    <!-- Settings from this profile will be used to execute DDL queries -->
    <!-- <profile>default</profile> -->
</distributed_ddl>

Check /etc/clickhouse-server/preprocessed/config.xml. You should see your changes there.
Restart ClickHouse. Check ClickHouse connection to Zookeeper detailed in Zookeeper Monitoring.

Converting Tables to Replicated Tables

Creating a replicated table

Replicated tables use a replicated table engine, for example ReplicatedMergeTree. The following example shows how to create a simple replicated table.

This example assumes that you have defined appropriate macro values for cluster, shard, and replica in macros.xml to enable cluster replication using zookeeper. For details consult the ClickHouse Data Replication guide.

CREATE TABLE test ON CLUSTER '{cluster}'
(
    timestamp DateTime,
    contractid UInt32,
    userid UInt32
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{cluster}/{shard}/default/test', '{replica}')
PARTITION BY toYYYYMM(timestamp)
ORDER BY (contractid, toDate(timestamp), userid)
SAMPLE BY userid;

The ON CLUSTER clause ensures the table will be created on the nodes of {cluster} (a macro value). This example automatically creates a Zookeeper path for each replica table that looks like the following:

/clickhouse/tables/{cluster}/{replica}/default/test

becomes:

/clickhouse/tables/c1/0/default/test

You can see Zookeeper replication data for this node with the following query (updating the path based on your environment):

SELECT *
  FROM system.zookeeper
  WHERE path = '/clickhouse/tables/c1/0/default/test'

Removing a replicated table

To remove a replicated table, use DROP TABLE as shown in the following example. The ON CLUSTER clause ensures the table will be deleted on all nodes. Omit it to delete the table on only a single node.

DROP TABLE test ON CLUSTER '{cluster}';

As each table is deleted the node is removed from replication and the information for the replica is cleaned up. When no more replicas exist, all Zookeeper data for the table will be cleared.

Cleaning up Zookeeper data for replicated tables

IMPORTANT NOTE: Cleaning up Zookeeper data manually can corrupt replication if you make a mistake. Raise a support ticket and ask for help if you have any doubt concerning the procedure.

New ClickHouse versions now support SYSTEM DROP REPLICA which is an easier command.

For example:

SYSTEM DROP REPLICA 'replica_name' FROM ZKPATH '/path/to/table/in/zk';

Zookeeper data for the table might not be cleared fully if there is an error when deleting the table, or the table becomes corrupted, or the replica is lost. You can clean up Zookeeper data in this case manually using the Zookeeper rmr command. Here is the procedure:

Login to Zookeeper server.
Run zkCli.sh command to connect to the server.
Locate the path to be deleted, e.g.:
ls /clickhouse/tables/c1/0/default/test
Remove the path recursively, e.g.,
rmr /clickhouse/tables/c1/0/default/test