ClickHouse Cluster with Zookeeper Production Configuration Guide

Best practices for a production ClickHouse cluster configuration with Zookeeper.

Moving from a single ClickHouse server to a clustered format provides several benefits:

  • Replication guarantees data integrity.
  • Provides redundancy.
  • Failover: half of the nodes can be restarted without causing downtime.

Moving from an unsharded ClickHouse environment to a sharded cluster requires redesigning the schema and queries. Starting with sharding from the beginning makes it easier to scale the cluster up later.

Setting up a ClickHouse cluster for a production environment requires the following stages:

  • Hardware Requirements
  • Network Configuration
  • Create Host Names
  • Monitoring Considerations
  • Configuration Steps
  • Setting Up Backups
  • Staging Plans
  • Upgrading The Cluster

Hardware Requirements

ClickHouse

ClickHouse will use all the hardware you give it, so the more hardware, the better. As of this publication, the hardware requirements are:

  • Minimum Hardware: 4-core CPU with SSE4.2 support, 16 GB RAM, 1 TB HDD.
    • Recommended for development and staging environments.
    • SSE4.2 is required, and going below 4 GB of RAM is not recommended.
  • Recommended Hardware: >=16 cores, >=64 GB RAM, HDD RAID or SSD.
    • For processing up to hundreds of millions of rows.

Zookeeper

Zookeeper requires separate servers from those used for ClickHouse. Zookeeper has poor performance when installed on the same node as ClickHouse.

Hardware Requirements for Zookeeper:

  • Fast disk (ideally NVMe; 128 GB of capacity should be enough).
  • Any modern CPU.
  • 8 GB of RAM.

The number of Zookeeper instances depends on the environment:

  • Production: 3 Zookeeper instances is the optimal number.
  • Development and Staging: 1 Zookeeper instance is sufficient.

As a reference, see Running Zookeeper in Production.

Hardware Configuration

Configure the servers according to the recommendations in the ClickHouse Usage Recommendations documentation.

Test Your Hardware

Be sure to test the following:

  • RAM speed.
  • Network speed.
  • Storage speed.

It’s better to find any performance issues before installing ClickHouse.

Network Configuration

Networking And Server Room Planning

It is recommended to use a fast network, ideally 10 Gbit. ClickHouse nodes generate a lot of traffic, in addition to the Zookeeper connections and inter-Zookeeper communication.

Low latency is more important than bandwidth.

Keep the replicas isolated at the hardware level so that the cluster can fail over during outages.

  • For Physical Environments: Avoid placing 2 ClickHouse replicas in the same server rack. Ideally, they should use isolated network switches and isolated power supplies.
  • For Cloud Environments: Place ClickHouse replicas in different availability zones.

The same considerations apply to the Zookeeper nodes.

For example:

Rack    Server        Server        Server        Server
Rack 1  CH_SHARD1_R1  CH_SHARD2_R1  CH_SHARD3_R1  ZOO_1
Rack 2  CH_SHARD1_R2  CH_SHARD2_R2  CH_SHARD3_R2  ZOO_2
Rack 3  -             -             -             ZOO_3

Network Ports And Firewall

ClickHouse listens on the following ports:

  • 9000: clickhouse-client, native clients, and other clickhouse-server instances connect here.
  • 8123: HTTP clients.
  • 9009: Other replicas will connect here to download data.

For more information, see CLICKHOUSE NETWORKING, PART 1.

Zookeeper listens on the following ports:

  • 2181: Client connections.
  • 2888: Inter-ensemble connections.
  • 3888: Leader election.

Outbound traffic from ClickHouse connects to the following ports:

  • ZooKeeper: On port 2181.
  • Other ClickHouse nodes in the cluster: On ports 9000 and 9009.
  • Dictionary sources: Depending on what was configured, such as HTTP, MySQL, MongoDB, etc.
  • Kafka or Hadoop: If those integrations were enabled.

SSL

For non-trusted networks, enable SSL/HTTPS. If possible, keep interserver communications unencrypted for performance reasons.

Create Host Names

The best time to start creating a naming schema for the servers is before they’re created and configured.

A few ClickHouse features rely on good server naming:

  • clickhouse-client prompts: Allows a different prompt for clickhouse-client per server hostname.
  • Nearest hostname load balancing: For more information, see Nearest Hostname.

A good option is to use the following:

{datacenter}-{serverroom}-{rack identifier}-{clickhouse cluster identifier}-{shard number or server number}.

Other examples:

  • rxv-olap-ch-master-sh01-r01:

    • rxv - location (rack#15)
    • olap - product name
    • ch = clickhouse
    • master = stage
    • sh01 = shard 1
    • r01 = replica 1
  • hetnzerde1-ch-prod-01.local:

    • hetnzerde1 - location (also replica id)
    • ch = clickhouse
    • prod = stage
    • 01 - server number / shard number in that DC
  • sh01.ch-front.dev.aws-east1a.example.com:

    • sh01 - shard 01
    • ch-front - cluster name
    • dev = stage
    • aws = cloud provider
    • east1a = region and availability zone

Additional Host Name Tips

  • Host names configured on the server should not change. If you do need to change the host name, one reference to use is How to Change Hostname on Ubuntu 18.04.
  • Each server should be accessible to the other servers in the cluster via its hostname. Otherwise, you will need to configure interserver_http_host in your config. A quick check is shown below.
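
As a quick, purely illustrative sanity check, you can compare the hostname ClickHouse itself reports with your naming schema and DNS records:

SELECT hostName();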

Monitoring Considerations

External Monitoring

To make metrics available to an external monitoring system, enable system.metric_log in config.xml.

For the dashboard system:

  • Grafana is recommended for graphs, reports, alerts, dashboards, etc.
  • Other options are Nagios or Zabbix.

The following metrics should be collected:

  • For Host Machine:
    • CPU
    • Memory
    • Network (bytes/packets)
    • Storage (iops)
    • Disk Space (free / used)
  • For ClickHouse:
    • Connections (count)
    • RWLocks
    • Read / Write / Return (bytes)
    • Read / Write / Return (rows)
    • Zookeeper operations (count)
    • Absolute delay
    • Query duration (optional)
    • Replication parts and queue (count)
  • For Zookeeper: follow the monitoring guidance in Running Zookeeper in Production (referenced above).

The following queries are useful for regular checks:

  • SELECT * FROM system.replicas
    • For more information, see the ClickHouse guide on System Tables
  • SELECT * FROM system.merges
    • Checks on the speed and progress of currently executed merges.
  • SELECT * FROM system.mutations
    • This is the source of information on the speed and progress of currently executed mutations.
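
As a minimal sketch of an alerting query built on these tables (the thresholds are illustrative, not prescriptive), the following flags replicas that are read-only or lagging behind:

SELECT database, table, is_readonly, absolute_delay, queue_size
FROM system.replicas
WHERE is_readonly OR absolute_delay > 300 OR queue_size > 100;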

Monitoring and Alerts

Configure notifications for events and thresholds based on the health checks listed below:

Health Checks

The following health checks should be monitored:

  • ClickHouse status (Critical):
    $ curl 'http://localhost:8123/' (should return Ok.)
  • Too many simultaneous queries, maximum 100 (Critical):
    select value from system.metrics where metric='Query'
  • Replication status (High):
    $ curl 'http://localhost:8123/replicas_status' (should return Ok.)
  • Read-only replicas, also reflected by replicas_status (High):
    select value from system.metrics where metric='ReadonlyReplica'
  • ReplicaPartialShutdown, not reflected by replicas_status but seems to correlate with ZooKeeperHardwareExceptions (High):
    select value from system.events where event='ReplicaPartialShutdown'
    Note: this check can be turned off; it almost always correlates with ZooKeeperHardwareExceptions, and when it does not, nothing bad is usually happening.
  • Some replication tasks are stuck (High):
    select count() from system.replication_queue where num_tries > 100
  • ZooKeeper is available (Critical for writes):
    select count() from system.zookeeper where path='/'
  • ZooKeeper exceptions (Medium):
    select value from system.events where event='ZooKeeperHardwareExceptions'
  • Other ClickHouse nodes are available:
    $ for node in `echo "select distinct host_address from system.clusters where host_name !='localhost'" | curl 'http://localhost:8123/' --silent --data-binary @-`; do curl "http://$node:8123/" --silent ; done
  • All ClickHouse clusters are available (i.e. every configured cluster has enough replicas to serve queries):
    $ for cluster in `echo "select distinct cluster from system.clusters where host_name !='localhost'" | curl 'http://localhost:8123/' --silent --data-binary @-` ; do clickhouse-client --query="select '$cluster', 'OK' from cluster('$cluster', system, one)" ; done
  • There are files in 'detached' folders:
    $ find /var/lib/clickhouse/data/*/*/detached/* -type d | wc -l
    On version 19.8+: select count() from system.detached_parts
  • Too many parts: the number of parts is growing, inserts are being delayed or rejected (Critical):
    select value from system.asynchronous_metrics where metric='MaxPartCountForPartition'
    select value from system.events where event='DelayedInserts' (also available as a metric in system.metrics)
    select value from system.events where event='RejectedInserts'
  • Dictionaries: exception (Medium):
    select concat(name, ': ', last_exception) from system.dictionaries where last_exception != ''
  • ClickHouse has been restarted:
    select uptime()
    select value from system.asynchronous_metrics where metric='Uptime'
  • DistributedFilesToInsert should not be always increasing (Medium):
    select value from system.metrics where metric='DistributedFilesToInsert'
  • A data part was lost (High):
    select value from system.events where event='ReplicatedDataLoss'
  • Data parts are not the same on different replicas (Medium):
    select value from system.events where event='DataAfterMergeDiffersFromReplica'
    select value from system.events where event='DataAfterMutationDiffersFromReplica'


Configuration Steps

Use Ansible, Puppet, Salt, or another configuration management system to control the server configuration.

  1. Configure ClickHouse access to Zookeeper by putting the file zookeeper.xml in the /etc/clickhouse-server/config.d/ folder. This file must be placed on all ClickHouse servers.

    <yandex>
        <zookeeper>
            <node>
                <host>zookeeper1</host>
                <port>2181</port>
            </node>
            <node>
                <host>zookeeper2</host>
                <port>2181</port>
            </node>
            <node>
                <host>zookeeper3</host>
                <port>2181</port>
            </node>
        </zookeeper>
    </yandex>
    
  2. On each server, put the file macros.xml in the /etc/clickhouse-server/config.d/ folder. (You can verify the resulting macro values with the query shown after this list.)

    <yandex>
        <!--
            These macros are defined per server,
            and they can be used in DDL to make the DB schema cluster/server neutral.
        -->
        <macros>
            <cluster>prod_cluster</cluster>
            <shard>01</shard>
            <replica>clickhouse-sh1r1</replica> <!-- better: use the same value as the hostname -->
        </macros>
    </yandex>
    
  3. On each server, place the file cluster.xml in the /etc/clickhouse-server/config.d/ folder.

    <yandex>
        <remote_servers>
            <prod_cluster> <!-- give the cluster a name of your choosing -->
                <shard>
                    <internal_replication>true</internal_replication>
                    <replica>
                        <host>clickhouse-sh1r1</host>
                        <port>9000</port>
                    </replica>
                    <replica>
                        <host>clickhouse-sh1r2</host>
                        <port>9000</port>
                    </replica>
                </shard>
                <shard>
                    <internal_replication>true</internal_replication>
                    <replica>
                        <host>clickhouse-sh2r1</host>
                        <port>9000</port>
                    </replica>
                    <replica>
                        <host>clickhouse-sh2r2</host>
                        <port>9000</port>
                    </replica>
                </shard>
                <shard>
                    <internal_replication>true</internal_replication>
                    <replica>
                        <host>clickhouse-sh3r1</host>
                        <port>9000</port>
                    </replica>
                    <replica>
                        <host>clickhouse-sh3r2</host>
                        <port>9000</port>
                    </replica>
                </shard>
            </prod_cluster>
        </remote_servers>
    </yandex>
    
  4. In addition to the main cluster definition (referenced in DDL via the {cluster} macro), create 2 extra cluster configurations in the same remote_servers section:

    1. cluster_all_nodes_as_shards: every node defined as its own shard.
    2. clusters_all_nodes_as_replicas: every node defined as a replica of a single shard.
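
The macros defined in step 2 can be verified on each node with a simple query:

SELECT * FROM system.macros;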

Once this is complete, other queries that span nodes can be performed. For example:

CREATE TABLE test_table ON CLUSTER '{cluster}' (id UInt8)  Engine=ReplicatedMergeTree('/clickhouse/tables/{database}/{shard}/{table}', '{replica}') ORDER BY (id);

That will create a table on all servers in the cluster. You can insert data into this table on any replica, and it will be replicated automatically to the other replicas of the same shard.

To write data to, or read data from, all shards at the same time, create a Distributed table that points to the ReplicatedMergeTree table.
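
A minimal sketch of such a Distributed table, assuming the test_table created above lives in the default database (rand() is just one possible sharding key):

CREATE TABLE test_table_distributed ON CLUSTER prod_cluster AS test_table
ENGINE = Distributed(prod_cluster, default, test_table, rand());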

Users

Disable, or add a password for, the default users default and readonly if your server is accessible from non-trusted networks.

If you add a password to the default user, you will need to adjust the cluster configuration accordingly, since the other servers need the default user's credentials to connect to each other.

If you're inside a trusted network, you can leave the default user's password empty so the ClickHouse nodes can communicate with each other.

Engines & ClickHouse building blocks

For a general explanation of the roles of the different engines, check the post Distributed vs Shard vs Replicated ahhh, help me!!!.

Zookeeper Paths

Use a consistent convention for zookeeper paths; the '/clickhouse/tables/{database}/{shard}/{table}' pattern used in the CREATE TABLE example above is one option. You can inspect the paths directly:

SELECT * FROM system.zookeeper WHERE path='/ ...';
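
You can also check which zookeeper path each replicated table actually uses (assuming replicated tables already exist):

SELECT database, table, zookeeper_path FROM system.replicas;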

Configuration Best Practices

The following are recommended Best Practices when it comes to setting up a ClickHouse Cluster with Zookeeper:

  1. Don’t edit/overwrite default configuration files. Sometimes a newer version of ClickHouse introduces some new settings or changes the defaults in config.xml and users.xml.

    1. Set configurations via extra files in the conf.d directory. For example, to override the listen interface, save the file conf.d/listen.xml with the following:

      <?xml version="1.0"?>
      <yandex>
          <listen_host replace="replace">::</listen_host>
      </yandex>
      
    2. The same is true for users. For example, change the default profile by putting the following in users.d/profile_default.xml:

    <?xml version="1.0"?>
    <yandex>
        <profiles>
            <default replace="replace">
                <max_memory_usage>15000000000</max_memory_usage>
                <max_bytes_before_external_group_by>12000000000</max_bytes_before_external_group_by>
                <max_bytes_before_external_sort>12000000000</max_bytes_before_external_sort>
                <distributed_aggregation_memory_efficient>1</distributed_aggregation_memory_efficient>
                <use_uncompressed_cache>0</use_uncompressed_cache>
                <load_balancing>random</load_balancing>
                <log_queries>1</log_queries>
                <max_execution_time>600</max_execution_time>
            </default>
        </profiles>
    </yandex>
    
    3. Or you can create a user by putting a file users.d/user_xxx.xml:
    <?xml version="1.0"?>
    <yandex>
        <users>
            <xxx>
                <!-- PASSWORD=$(base64 < /dev/urandom | head -c8); echo "$PASSWORD"; echo -n "$PASSWORD" | sha256sum | tr -d '-' -->
                <password_sha256_hex>...</password_sha256_hex>
                <networks incl="networks" />
                <profile>readonly</profile>
                <quota>default</quota>
                <allow_databases incl="allowed_databases" />
            </xxx>
        </users>
    </yandex>
    
    4. Some parts of the configuration will contain repeated elements (like the allowed IPs for all the users). To avoid repetition, use a substitutions file. By default it is /etc/metrika.xml, but you can change it, for example to /etc/clickhouse-server/substitutions.xml (via the <include_from> section of the main config). Put the repeated parts into the substitutions file, like this:

      <?xml version="1.0"?>
      <yandex>
          <networks>
              <ip>::1</ip>
              <ip>127.0.0.1</ip>
              <ip>10.42.0.0/16</ip>
              <ip>192.168.0.0/24</ip>
          </networks>
      
          <clickhouse_remote_servers>
          <!-- cluster definition -->
          </clickhouse_remote_servers>
      
          <zookeeper-servers>
              <node>
                  <host>zookeeper1</host>
                  <port>2181</port>
              </node>
              <node>
                  <host>zookeeper2</host>
                  <port>2181</port>
              </node>
              <node>
                  <host>zookeeper3</host>
                  <port>2181</port>
              </node>
          </zookeeper-servers>
      
          <clickhouse_compression></clickhouse_compression>
      </yandex>
      

These files can be common for all the servers inside the cluster or can be individualized per server. If you choose to use one substitutions file per cluster, not per node, you will also need to generate the file with macros, if macros are used.

This way you have full flexibility; you're not limited to the settings described in the template. You can change any settings per server or data center just by assigning files with some settings to that server or server group. It becomes easy to navigate, edit, and assign files.

Other Configuration Recommendations

Other configurations that should be evaluated:

  • <listen_host> in config.xml: Determines which IP addresses the ClickHouse server listens on for incoming connections (ports are configured separately via tcp_port, http_port, etc.).
  • <max_memory_..> and <max_bytes_before_external_...> in users.xml. These are part of the profile <default>.
  • <max_execution_time>
  • <log_queries>

The following extra debug logs should be considered:

  • part_log
  • text_log

Understanding The Configuration

ClickHouse stores most of its configuration in two files:

  • config.xml: Stores server configuration parameters. They are server-wide, some are hierarchical, and most of them can't be changed at runtime. Only 3 sections are applied without a restart:
    • macros
    • remote_servers
    • logging level
  • users.xml: Configures users and user-level / session-level settings.
    • Each user can change these during their session by:
      • Using a parameter in the HTTP query string
      • Using a command-line parameter for clickhouse-client
      • Sending a query like SET allow_experimental_data_skipping_indices=1.
    • Those settings and their current values are visible in system.settings (see the sketch after this list). You can make some settings global by editing the default profile in users.xml, which does not need a restart.
    • You can forbid users from changing their settings by using readonly=2 for that user, or by using settings constraints.
    • Changes in users.xml are applied without a restart.
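
A minimal sketch of changing a session-level setting and then listing the settings that differ from their defaults (the value is illustrative):

SET max_execution_time = 30;
SELECT name, value, changed FROM system.settings WHERE changed;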

For both config.xml and users.xml, it’s preferable to put adjustments in the config.d and users.d subfolders instead of editing config.xml and users.xml directly.

You can check whether the config file was reread by looking at the ConfigReloader messages in the server log, or at the modification time of the preprocessed configuration files that ClickHouse generates (for example under /var/lib/clickhouse/preprocessed_configs/).

Backups

ClickHouse is currently at the design stage of a universal backup solution. In the meantime, a custom backup strategy along the following lines can be used:

  1. Each shard is backed up separately.

  2. FREEZE the table/partition (a minimal example follows this list). For more information, see Alter Freeze Partition.

    1. This creates hard links in the shadow subdirectory.
  3. rsync that directory to a backup location, then remove that subfolder from shadow.

    1. Cloud users are recommended to use Rclone.
  4. Always add the full contents of the metadata subfolder (which contains the current DB schema) and the ClickHouse configs to your backup.

  5. For a second replica, it’s enough to copy metadata & config.

    The clickhouse-backup tool follows a similar approach.

    We have not used this tool on production systems and can make no recommendations for or against it. As of this time, clickhouse-backup is not a complete backup solution, but it does simplify some parts of the backup process.

  6. Don’t try to compress backups; the data is already compressed in ClickHouse.

  7. One other option is to keep an extra passive replica. Note that this does not protect against user errors (such as accidental deletes or drops), since those are replicated as well.
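
A minimal sketch of the FREEZE step (step 2 above), assuming a hypothetical table db.events; the resulting hard links appear under the shadow subdirectory of the ClickHouse data path:

ALTER TABLE db.events FREEZE;
-- or freeze a single partition, e.g. with toYYYYMM partitioning:
-- ALTER TABLE db.events FREEZE PARTITION 202101;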

Staging

Set up a staging environment.

Version Upgrades

The update itself is simple: update the packages, then restart the clickhouse-server service.

  1. Check if the version you want to upgrade to is stable. We highly recommend the Altinity ClickHouse Stable Releases.
    1. Review the changelog to ensure that no configuration changes are needed.
  2. Update staging and test to verify all systems are working.
  3. Prepare and test downgrade procedures so the server can be returned to the previous version if necessary.
  4. Start with a "canary" update: upgrade a single replica of a single shard to make sure that the procedure works.
  5. Test and verify that everything works properly. Check for any errors in the log files.
  6. If everything is working well, update the rest of the cluster.

For small clusters, the BlueGreenDeployment technique is also a good option.

FAQ

How Do I Restrict Memory Usage?

See max_memory_usage for more information.
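
For example (the value is illustrative), the limit can be set per session or in the user's profile:

SET max_memory_usage = 10000000000; -- limit a single query to ~10 GB of RAM in this session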

ClickHouse died during big query execution

This is usually the Linux OOM killer: the query needed more memory than the server had available, so the kernel killed the clickhouse-server process. Check dmesg or the system log for oom-killer messages, then lower max_memory_usage or add more RAM.

How Do I Limit Memory Usage On Group By?

See max_bytes_before_external_group_by.
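
A common rule of thumb is to set it to roughly half of max_memory_usage, for example (values are illustrative):

SET max_memory_usage = 20000000000;
SET max_bytes_before_external_group_by = 10000000000; -- spill GROUP BY state to disk above ~10 GB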

Data returned in blocks in clickhouse-client

This is caused by the default PrettyCompact format combined with multithreaded, streaming query execution: each block is printed as it arrives.

Either use the PrettyCompactMonoBlock format instead, or make the query finish with a single stream (for example, by adding ORDER BY).
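
For example, the same query rendered as a single block:

SELECT number FROM numbers(100000) FORMAT PrettyCompactMonoBlock;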

I Can’t Connect From Other Hosts

Check the <listen_host> settings in config.xml. Verify that the server can accept connections over both IPv4 and IPv6.

Does clickhouse have stored procedures / UDF?

Not as of this time. A PR with the needed functionality can be submitted.

How do I Store IPv4 and IPv6 Address In One Field?

There is a clean and simple solution for that. Any IPv4 address has its unique IPv6 mapping:

  • IPv4 IP address: 191.239.213.197
  • IPv4-mapped IPv6 address: ::ffff:191.239.213.197
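
A small illustration of the mapping with built-in functions; storing the result in a FixedString(16) (or IPv6) column covers both address families:

SELECT IPv6NumToString(IPv4ToIPv6(IPv4StringToNum('191.239.213.197')));
-- returns ::ffff:191.239.213.197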

Eventual consistency & atomic insert

To get sequentially consistent reads after writes, combine insert_quorum on the insert side (note that quorum inserts currently do not allow concurrent inserts into the same partition) with select_sequential_consistency on the read side.
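
A minimal sketch (the quorum value assumes a shard with at least two replicas):

SET insert_quorum = 2;                  -- the insert is acknowledged only after 2 replicas have the data
SET select_sequential_consistency = 1;  -- SELECTs only use replicas that contain all quorum-acknowledged inserts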

How Do I Simulate Window Functions Using Arrays?

  1. Group with groupArray.
  2. Calculate the needed metrics.
  3. Ungroup back using array join.
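
A hedged sketch of the pattern, computing a running total per key; the events table and its user_id, ts, and amount columns are hypothetical:

SELECT user_id, ts, running_total
FROM
(
    SELECT
        user_id,
        arraySort(x -> x.1, groupArray((ts, amount))) AS pairs,
        arrayMap(x -> x.1, pairs) AS times,
        arrayCumSum(arrayMap(x -> x.2, pairs)) AS totals
    FROM events
    GROUP BY user_id
)
ARRAY JOIN times AS ts, totals AS running_total;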

How to avoid bugs

ClickHouse is quite a young product, but it already has thousands of features, and development proceeds at a very fast pace. All core features are well tested and very stable.

As with any project, new features can be less stable. Usually they are adopted by the community and stabilize quickly. And of course, all possible combinations of features physically can't be tested.

We recommend the following practices:

  1. Check all the changes on the staging first, especially if some new features are used.
  2. Check the latest stable or testing versions of ClickHouse on your staging environment regularly and pass the feedback to us or to the official ClickHouse GitHub.
  3. Ask for a known-issues report before updating production.
  4. For production, prefer versions where the 3rd digit is as high as possible, indicating the most recent bug fix release. For example, for version 20.3.5:
    1. 20 is the year of release.
    2. 3 indicates a feature release. This is an increment where features are delivered.
    3. 5 is the bug fix release ID. New feature releases typically have a 1 in this position and potentially more undiscovered bugs.