Classes of Failures

The types of failures that can occur.

Failures come in many shapes and sizes. High availability (HA) and disaster recovery (DR) focus on protecting against the following:

  • Loss of data due to human error or deliberate attack.
    • Example: Deleting a table by accident.
  • Failure of an individual server.
    • Example: A host goes down or becomes unavailable because of a power supply failure or a loss of network connectivity at the top-of-rack switch.
  • Large-scale failure extending to an entire site or even a geographic region.
    • Example: Severe weather or widespread outages of underlying services like Amazon Elastic Block Store (EBS).

Database systems manage these failures using a relatively small number of procedures that have proven themselves over time. ClickHouse supports each of them:

  1. Replication: Create live replicas of data on different servers. If one server fails, applications can switch to another replica. ClickHouse supports asynchronous, multi-master replication. It is flexible and works even on networks with high latency.
  2. Backup: Create static snapshots of data that can be restored at will. Deleted tables, for instance, can be recovered from snapshots. ClickHouse has clickhouse-backup, an ecosystem project that handles static and incremental backups. It does not support point-in-time recovery.
  3. Distance: It is important to separate copies of data by distance so that a single failure cannot affect all of them. Placing replicas in different geographic regions protects against large-scale failures. Both replication and backups work cross-region.
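
The "switch to another replica" step under Replication can be sketched from the application side. This is a minimal illustration, not ClickHouse's own failover mechanism: the function query_with_failover, the run_query callable, and the host names ch-replica-1 and ch-replica-2 are all hypothetical, and a real client library would replace the stand-in function.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def query_with_failover(
    replicas: Sequence[str],
    run_query: Callable[[str], T],
) -> T:
    """Try each replica in order and return the first successful result.

    `run_query` is a hypothetical callable that takes a host name, runs the
    query there, and raises ConnectionError if the host is unreachable.
    """
    last_error: Optional[Exception] = None
    for host in replicas:
        try:
            return run_query(host)
        except ConnectionError as err:
            # This replica is down; remember the error and try the next one.
            last_error = err
    raise ConnectionError(f"all replicas failed: {list(replicas)}") from last_error


def fake_run_query(host: str) -> str:
    # Stand-in for a real client call; pretend the first replica is down.
    if host == "ch-replica-1":
        raise ConnectionError(f"{host} unreachable")
    return f"result from {host}"


result = query_with_failover(["ch-replica-1", "ch-replica-2"], fake_run_query)
print(result)  # result from ch-replica-2
```

Because replication is asynchronous, the replica that answers may lag slightly behind the one that failed, so an application using this pattern should tolerate slightly stale reads.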

Regardless of the approach to protection, it is important to recover from failures as quickly as possible with minimum data loss. ClickHouse solutions meet these requirements to varying degrees. ClickHouse replicas are typically immediately accessible and fully up-to-date.

Backups, on the other hand, may run only at intervals such as once a day, which means potential data loss since the last backup. They can also take hours or even days to restore fully.
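
The data loss implied by interval backups is simple arithmetic: everything written between the last backup and the failure is missing after a restore. The sketch below works through one case. The function name data_loss_window and the timestamps are illustrative assumptions, not values from any real deployment.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Return how much recent data a restore from `last_backup` would miss."""
    if failure < last_backup:
        raise ValueError("failure time precedes the last backup")
    return failure - last_backup


# Assumed schedule: a daily backup at 02:00, with a failure at 23:30 the same day.
last_backup = datetime(2024, 1, 1, 2, 0)
failure = datetime(2024, 1, 1, 23, 30)

lost = data_loss_window(last_backup, failure)
print(lost)  # 21:30:00
```

With a daily schedule the worst case approaches a full 24 hours of lost writes, which is why backups complement replication rather than replace it.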