Analytic systems are the eyes and ears of data-driven enterprises. It is critical to ensure they continue to work at all times despite failures small and large; otherwise users will be deprived of the ability to analyze and react to changes in the real world. Let’s start by defining two key terms.
High Availability: (HA) includes the mechanisms that allow computer systems to continue operating following the failure of individual components.
Disaster Recovery: (DR) includes the tools and procedures to enable computer systems to resume operation following a major catastrophe that affects many or all parts of a site.
These problems are closely related and depend on a small set of fungible technologies that include off-site backups and data replication.
The High Availability and Disaster Recovery guide provides an overview of the standard HA architecture for ClickHouse and a draft design for DR.
1 - Classes of Failures
The types of failures that can occur.
Failures come in many shapes and sizes. HA and DR focus on protecting against the following:
Loss of data due to human error or deliberate attack.
Example: Deleting a table by accident.
Failure of an individual server.
Example: Host goes down/becomes unavailable due to a power supply failure or loss of network connectivity in the top-of-rack switch.
Large-scale failure extending to an entire site or even a geographic region.
Example: Severe weather or widespread outages of underlying services like Amazon Elastic Block Storage (EBS).
Database systems manage these failures using a relatively small number of procedures that have proven themselves over time. ClickHouse supports these.
Replication: Create live replicas of data on different servers. If one server fails, applications can switch to another replica. ClickHouse supports asynchronous, multi-master replication. It is flexible and works even on networks with high latency.
Backup: Create static snapshots of data that can be restored at will. Deleted tables, for instance, can be recovered from snapshots. ClickHouse has clickhouse-backup, an ecosystem project that handles full and incremental backups. It does not support point-in-time recovery.
Distance: It is important to separate copies of data by distance so that a failure cannot affect all of them. Placing replicas in different geographic regions protects against large-scale failures. Both replication and backups work cross-region.
Regardless of the approach to protection, it is important to recover from failures as quickly as possible with minimum data loss. ClickHouse solutions meet these requirements to varying degrees. ClickHouse replicas are typically immediately accessible and fully up-to-date.
Backups, on the other hand, may run only at intervals such as once a day, which means potential loss of any data written since the last backup. They can also take hours or even days to restore fully.
2 - High Availability Architecture
The best practices to keep ClickHouse available.
The standard approach to ClickHouse high availability combines replication, backup, and astute service placement to maximize protection against failure of single components and accidental deletion of data.
Best Practices for ClickHouse HA
Highly available ClickHouse clusters observe a number of standard practices to ensure the best possible resilience.
Keep at least 3 replicas for each shard
ClickHouse tables should always be replicated to ensure high availability. This means that you should use ReplicatedMergeTree or a similar engine for any table that contains data you want to persist. Cluster definitions for these tables should include at least three hosts per shard.
Having 3 replicas for each ClickHouse shard allows shards to continue processing queries after a replica failure while still maintaining capacity for recovery. When a new replica is attached to the shard it will need to fetch data from the remaining replicas, which adds load.
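As a minimal sketch, a one-shard cluster with three replicas might be defined in remote_servers.xml as follows (hostnames are placeholders):

<yandex>
    <remote_servers>
        <cluster1>
            <shard>
                <!-- Let ReplicatedMergeTree handle replication of Distributed writes -->
                <internal_replication>true</internal_replication>
                <replica>
                    <host>clickhouse-01.example.com</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>clickhouse-02.example.com</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>clickhouse-03.example.com</host>
                    <port>9000</port>
                </replica>
            </shard>
        </cluster1>
    </remote_servers>
</yandex>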
Use 3 replicas for Zookeeper
Zookeeper ensembles must have an odd number of replicas, and production deployments should always have at least three to avoid losing quorum if a Zookeeper server fails. Losing quorum can cause ClickHouse replicated tables to go into read-only mode, so it should be avoided at all costs. Three is the most common number of replicas.
It is possible to use 5 replicas, but any additional availability benefit is typically canceled out by the higher latency of reaching consensus on operations (a quorum of 3 servers instead of 2). We do not recommend this unless there are extenuating circumstances specific to a particular site and the way it manages or uses Zookeeper.
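For reference, a minimal zoo.cfg for a three-node ensemble might look like the following sketch (hostnames and paths are placeholders):

# zoo.cfg, identical on all three Zookeeper servers
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# One entry per ensemble member; each server also needs a matching myid file.
server.1=zookeeper-01.example.com:2888:3888
server.2=zookeeper-02.example.com:2888:3888
server.3=zookeeper-03.example.com:2888:3888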
Disperse replicas over availability zones connected by low-latency networks
ClickHouse replicas used for writes and Zookeeper nodes should run in separate availability zones. These are operating environments with separate power, Internet access, physical premises, and infrastructure services like storage arrays. (Most public clouds offer them as a feature.) Availability zones should be connected by highly reliable networking offering consistent round-trip latency of 20 milliseconds or less between all nodes. Higher values are likely to delay writes.
It is fine to locate read-only replicas at latencies greater than 20ms. It is important not to send writes to these replicas during normal operation, or they may experience performance problems due to the latency to Zookeeper as well as other ClickHouse replicas.
Locate servers on separate physical hardware
Within a single availability zone ClickHouse and Zookeeper servers should be located on separate physical hosts to avoid losing multiple servers from a single failure. Where practical the servers should also avoid sharing other hardware such as rack power supplies, top-of-rack switches, or other resources that might create a single point of failure.
Use clickhouse-backup to guard against data deletion and corruption
ClickHouse supports backup of tables using the clickhouse-backup utility, which can do both full and incremental backups of servers. Storage in S3 buckets is a popular option as it is relatively low cost and has options to replicate files automatically to buckets located in other regions.
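A typical cycle looks like the following sketch. The backup name is arbitrary, and the S3 bucket details live in the clickhouse-backup configuration file rather than on the command line.

# Create a local backup, then push it to the configured remote storage (e.g., S3).
clickhouse-backup create backup-2021-06-01
clickhouse-backup upload backup-2021-06-01

# List available backups.
clickhouse-backup list

# Download and restore a backup after data loss or accidental deletion.
clickhouse-backup download backup-2021-06-01
clickhouse-backup restore backup-2021-06-01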
Test regularly
HA procedures should be tested regularly to ensure they work and that you can perform them efficiently and without disturbing the operating environment. Try them out in a staging environment and automate as much as you can.
References for ClickHouse HA
The following links detail important procedures required to implement ClickHouse HA and to recover from failures.
Zookeeper Cluster Setup – Describes how to set up a multi-node Zookeeper ensemble. Replacing a failed node is relatively straightforward and similar to initial installation.
Clickhouse-backup – Instructions for performing backup and restore operations on ClickHouse.
3 - Disaster Recovery Architecture
How to make ClickHouse more resilient
An optimal approach to disaster recovery builds on the resiliency conferred by the HA best practices detailed in the previous section. To increase the chance of surviving a large-scale event that takes out multiple data centers or an entire region, we increase the distance between replicas for ClickHouse and Zookeeper.
The DR architecture for ClickHouse accomplishes this goal with a primary/warm standby design with two independent clusters. Writes go to a single cluster (the primary site), while the other cluster receives replicated data and at most is used for reads (the warm standby site). The following diagram shows this design.
Best Practices for ClickHouse DR
Point all replicas to main Zookeeper ensemble
ClickHouse replicas on both the primary and warm standby sites should point to the main Zookeeper ensemble. Replicas in both locations can initiate merges; performing merges locally on each site is generally preferable to fetching merged parts across regions, which can be costly and consumes bandwidth.
Send writes only to primary replicas
Applications should write data only to primary replicas, i.e., replicas with a latency of less than 20ms to the main Zookeeper ensemble. This is necessary to ensure good write performance, which is dependent on Zookeeper latency.
Reads may be sent to any replica on any site. This is a good practice to ensure warm standby replicas are in good condition and ready in the event of a failover.
Run Zookeeper observers on the warm standby
Zookeeper observers receive events from the main Zookeeper ensemble but do not participate in quorum decisions. The observers ensure that Zookeeper state is available to create a new ensemble on the warm standby. Meanwhile they do not affect quorum on the primary site. ClickHouse replicas should connect to the main ensemble, not the observers, as this is more performant.
Depending on your appetite for risk, a single observer is sufficient for DR purposes. You can expand it to a full ensemble in the event of a failover.
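A minimal sketch of the Zookeeper configuration (placeholder hostnames): the observer sets peerType=observer, and the server list on every node marks it with the :observer suffix so that it never votes.

# zoo.cfg on the warm standby observer
peerType=observer

# Server list shared by all primary and standby Zookeeper nodes
server.1=zookeeper-chicago-01.example.com:2888:3888
server.2=zookeeper-chicago-02.example.com:2888:3888
server.3=zookeeper-chicago-03.example.com:2888:3888
server.4=zookeeper-dallas-01.example.com:2888:3888:observer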
Use independent cluster definitions for each site
Each site should use independent cluster definitions that share the cluster name and number of shards but use separate hosts from each site. Here is an example of the cluster definitions in remote_servers.xml on separate sites in Chicago and Dallas. First, the Chicago cluster definition, which refers to Chicago hosts only.
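(The definitions below are a sketch with placeholder hostnames.)

<yandex>
    <remote_servers>
        <cluster1>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>clickhouse-chicago-01.example.com</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>clickhouse-chicago-02.example.com</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>clickhouse-chicago-03.example.com</host>
                    <port>9000</port>
                </replica>
            </shard>
        </cluster1>
    </remote_servers>
</yandex>

The Dallas definition is identical except that the <host> entries list the Dallas servers instead.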
These per-site definitions ensure that distributed tables will only refer to tables within a single site. This avoids extra latency if subqueries go across sites. It also means the cluster definitions do not require alteration in the event of a failover.
Use “Umbrella” cluster for DDL operations
For convenience when performing DDL operations against all nodes, you can add one more cluster that includes all ClickHouse nodes from both sites. Run ON CLUSTER commands against this cluster. The following “all” cluster is used for DDL.
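A sketch of such a definition follows (placeholder hostnames). The exact shard/replica nesting matters little here, since the cluster exists only to enumerate every host for DDL distribution.

<yandex>
    <remote_servers>
        <all>
            <shard>
                <replica>
                    <host>clickhouse-chicago-01.example.com</host>
                    <port>9000</port>
                </replica>
                <!-- ...one <replica> entry for every remaining ClickHouse host
                     on both the Chicago and Dallas sites... -->
            </shard>
        </all>
    </remote_servers>
</yandex>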
Assuming the ‘all’ cluster is present on all sites you can now issue DDL commands like the following.
CREATE TABLE IF NOT EXISTS events_local ON CLUSTER all
...
;
Use macros to enable replication across sites
It is a best practice to use macro definitions when defining tables to ensure consistent paths for replicated and distributed tables. Macros are conventionally defined in the file macros.xml. The macro definitions should appear as shown in the following example. (Names may vary, of course.)
<yandex>
<macros>
<!-- Shared across sites -->
<cluster>cluster1</cluster>
<!-- Shared across sites -->
<shard>0</shard>
<!-- Replica names are unique for each node on both sites. -->
<replica>chi-prod-01</replica>
</macros>
</yandex>
The following example illustrates usage of DDL commands with the previous macros. Note that the ON CLUSTER clause uses the ‘all’ umbrella cluster definition to ensure propagation of commands across all sites.
-- Use 'all' cluster for DDL.
CREATE TABLE IF NOT EXISTS events_local ON CLUSTER 'all' (
event_date Date,
event_type Int32,
article_id Int32,
title String
)
ENGINE = ReplicatedMergeTree('/clickhouse/{cluster}/tables/{shard}/{database}/events_local', '{replica}')
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_type, article_id);
CREATE TABLE events ON CLUSTER 'all' AS events_local
ENGINE = Distributed('{cluster}', default, events_local, rand());
Using this pattern, tables will replicate across sites but distributed queries will be restricted to a single site due to the definitions for cluster1 being different on different sites.
Test regularly
It is critical to test that warm standby sites work fully. Here are three recommendations.
Maintain a constant, light application load on the warm standby cluster including a small number of writes and a larger number of reads. This ensures that all components work correctly and will be able to handle transactions if there is a failover.
Partition the network between sites regularly to ensure that the primary continues to work properly when isolated from the standby.
Test failover on a regular basis to ensure that it works well and can be applied efficiently when needed.
Monitoring
Replication Lag
Monitor ClickHouse replication lag/state using HTTP REST commands.
curl http://ch_host:8123/replicas_status
Also, check the absolute_delay value in the system.replicas table.
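For example, a query along the following lines (illustrative) surfaces lagging replicas on a server:

-- Find replicated tables with nonzero lag.
SELECT database, table, absolute_delay, queue_size
FROM system.replicas
WHERE absolute_delay > 0
ORDER BY absolute_delay DESC;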
Zookeeper Status
You can monitor the Zookeeper ensemble and observers using: echo stat | nc <zookeeper ip> 2181
Ensure that servers have the expected leader/follower and observer roles. Further commands can be found in the Zookeeper administrative docs.
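For example, the srvr four-letter command reports each server's role (hostnames below are placeholders); expect leader or follower on the primary ensemble and observer on the warm standby. Recent Zookeeper releases require four-letter commands to be whitelisted via 4lw.commands.whitelist.

echo srvr | nc zookeeper-chicago-01.example.com 2181 | grep Mode
echo srvr | nc zookeeper-dallas-01.example.com 2181 | grep Mode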
Heartbeat Table
Add a service heartbeat table: insert into it at one site and select from it at the other. This gives an additional check of replication status and lag across sites.
create table heart_beat(
    site_id String,
    updated_at DateTime
) Engine = ReplicatedReplacingMergeTree(updated_at)
order by site_id;

-- insert local heartbeat
insert into heart_beat values('chicago', now());

-- check lag for other sites
select site_id,
       now() - max(updated_at) lag_seconds
from heart_beat
where site_id <> 'chicago'
group by site_id;
Disaster Recovery Procedures
The following sections detail the procedures required to fail over to the warm standby site and later recover the primary.
Recovering from a disaster is different from handling an HA failure. Switching to the warm standby site typically requires an explicit decision for the following reasons:
Moving applications to the standby site requires changes across the entire application stack, from public-facing DNS and load balancing downward.
Disasters do not always affect all parts of the stack equally, so switching involves a cost-benefit decision. Depending on the circumstances, it may make sense to limp along on the primary.
It is generally a bad practice to try to automate the decision to switch sites due to the many factors that go into the decision as well as the chance of accidental failover. The actual steps to carry out disaster recovery should of course be automated, thoroughly documented, and tested.
Failover
Failover is the procedure for activating the warm standby site. For the ClickHouse DR architecture, this includes the following steps.
Ensure no writes are currently being processed by standby nodes.
Stop each Zookeeper server on the standby site and change its configuration to make it an active (voting) member of the ensemble.
Restart Zookeeper servers.
Start additional Zookeeper servers as necessary to create a 3-node ensemble.
Configure ClickHouse servers to point to the new Zookeeper ensemble (see the configuration sketch after this list).
Check monitoring to ensure ClickHouse is functioning properly and applications are able to connect.
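As a sketch of the ClickHouse reconfiguration step (placeholder hostnames), the <zookeeper> section on each ClickHouse server is updated to list the newly promoted ensemble on the standby site; depending on the ClickHouse version, a server restart may be needed for the change to take effect.

<yandex>
    <zookeeper>
        <node>
            <host>zookeeper-dallas-01.example.com</host>
            <port>2181</port>
        </node>
        <node>
            <host>zookeeper-dallas-02.example.com</host>
            <port>2181</port>
        </node>
        <node>
            <host>zookeeper-dallas-03.example.com</host>
            <port>2181</port>
        </node>
    </zookeeper>
</yandex>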
Recovering the Primary
Recovery restores the primary site to a condition that allows it to receive load again. The following steps initiate recovery.
Configure the primary site Zookeepers as observers and add server addresses back into the configuration on the warm standby site.
Restart Zookeepers as needed to bring up the observer nodes. Ensure they are healthy and receiving transactions from the ensemble on the warm standby site.
Replace any failed ClickHouse servers and start them using the procedure for recovering a failed ClickHouse node after full data loss.
Check monitoring to ensure ClickHouse is functioning properly and applications are able to connect. Ensure replication is caught up for all tables on all replicas.
Add load (if possible) to test the primary site and ensure it is functioning fully.
Failback
Failback is the procedure for returning processing to the primary site. Use the failover procedure in reverse, followed by recovery on the warm standby to reverse replication.
Cost Considerations for DR
Disasters are rare, so many organizations balance the cost of maintaining a warm standby site against parameters like the time to recover or the capacity of the warm standby site to bear transaction load. Here are suggestions to help cut costs.
Use fewer servers in the ClickHouse cluster
It is possible to reduce the number of replicas per shard to save on operating costs. This of course raises the chance that some part of the DR site may also fail during a failover, resulting in data loss.
Use cheaper hardware
Using less performant hardware is a time-honored way to reduce DR costs. ClickHouse operates perfectly well on slow disk, so storage is a good place to look for savings, followed by the hosts themselves.
Host DR in the public cloud
ClickHouse and Zookeeper both run well on VMs, so hosting DR in the public cloud avoids the costs associated with physical installations (assuming you have them). Another advantage of public clouds is that they have more options to run with low-capacity hardware, e.g., by choosing smaller VMs.
Use network attached storage to enable vertical scaling from minimal base
ClickHouse and Zookeeper run well on network attached storage, for example Amazon EBS. You can run the warm standby site using minimum-sized VMs that are sufficient to handle replication and keep storage updated. Upon failover you can allocate more powerful VMs and re-attach the storage.
Caveat: In a large-scale catastrophe it is quite possible many others will be trying to do the same thing, so the more powerful VMs may not be available when you need them.
Discussion of Alternative DR Architectures
Primary/warm standby is the recommended approach for ClickHouse DR. There are other approaches of course, so here is a short discussion.
Active-active DR architectures make both sites peers of each other, with both running applications under full load. As a practical matter this means that you would extend the Zookeeper ensemble to multiple sites. The catch is that ClickHouse writes intensively to Zookeeper on insert, which would result in high insert latency as Zookeeper commits each write across sites. Products that take this approach put Zookeeper nodes in 3 locations, rather than just two. (See an illustration of this design from the Confluent documentation.)
Another option is storage-only replication. This approach replicates data across sites but does not activate database instances on the warm standby site until a disaster occurs. Unlike Oracle or PostgreSQL, ClickHouse does not have a built-in log that can be shipped to a remote location. It is possible to use file system replication, such as Linux DRBD, but this approach has not been tested. It looks feasible for single ClickHouse instances but does not seem practical for large clusters.