Bonus: Adding a Shard to your ClickHouse® Cluster
As a final topic, we’ll talk a bit about shards. But before we do, here’s something to keep in mind:
Don’t use shards unless you have at least 5 TB of data. If that doesn’t apply to you, feel free to ignore this section and get on with your day.
Shards give you horizontal scaling in ClickHouse®. When your cluster has multiple shards, your data is distributed across them, so queries run in parallel on every shard. But ClickHouse is so powerful that sharding doesn’t pay off until you’ve got a substantial amount of data; below that threshold, the extra overhead just gets in the way.
Well, if you’re still here, we assume you want to add a shard to your cluster. Adding a second shard is easy. Make a copy of manifest03.yaml and name it manifest04.yaml. Now edit the file and change shardsCount to 2:
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: cluster01
spec:
  templates:
    podTemplates:
      . . .
    volumeClaimTemplates:
      . . .
  configuration:
    . . .
    clusters:
      - name: cluster01
        layout:
          shardsCount: 2
          replicasCount: 2
    . . .
(We’re starting with manifest03.yaml because it’s configured to use either ClickHouse Keeper or Zookeeper, and it specifies the right storageClass for your Kubernetes environment.)
Now apply the file to update cluster01:
kubectl apply -f manifest04.yaml -n quick
You’ll see a success message:
clickhouseinstallation.clickhouse.altinity.com/cluster01 configured
Verify that the cluster is running - this may take a few minutes, depending on your system:
kubectl get chi -o wide -n quick
The status may be InProgress for a short while, but at some point the cluster will be up and running again and the status will be Completed:
NAME VERSION CLUSTERS SHARDS HOSTS TASKID STATUS HOSTS-COMPLETED HOSTS-UPDATED HOSTS-ADDED HOSTS-DELETED ENDPOINT AGE SUSPEND
cluster01 0.24.5 1 2 4 aef78108-cb54-466e-8d8d-ded8948a7606 Completed clickhouse-cluster01.quick.svc.cluster.local 46m
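By the way, if you’d rather not re-run that command by hand while the operator works, you can ask kubectl to watch the installation for you (this is standard kubectl, nothing ClickHouse-specific; press Ctrl+C once the status reaches Completed):
kubectl get chi -o wide -n quick --watch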
Notice that the number of hosts in that output is now 4. To see our two shards, list the services running in the quick namespace:
kubectl get service -n quick
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
chi-cluster01-cluster01-0-0 ClusterIP None <none> 9000/TCP,8123/TCP,9009/TCP 45m
chi-cluster01-cluster01-0-1 ClusterIP None <none> 9000/TCP,8123/TCP,9009/TCP 17m
chi-cluster01-cluster01-1-0 ClusterIP None <none> 9000/TCP,8123/TCP,9009/TCP 2m8s
chi-cluster01-cluster01-1-1 ClusterIP None <none> 9000/TCP,8123/TCP,9009/TCP 73s
clickhouse-cluster01 ClusterIP None <none> 8123/TCP,9000/TCP 45m
We should have two new pods as well.
kubectl get pods -n quick
And there they are, chi-cluster01-cluster01-1-0-0 and chi-cluster01-cluster01-1-1-0, the two replicas of our new shard (the names follow the pattern chi-<installation>-<cluster>-<shard>-<replica>, with the pod’s ordinal on the end):
NAME READY STATUS RESTARTS AGE
chi-cluster01-cluster01-0-0-0 1/1 Running 0 30m
chi-cluster01-cluster01-0-1-0 1/1 Running 0 18m
chi-cluster01-cluster01-1-0-0 1/1 Running 0 3m2s
chi-cluster01-cluster01-1-1-0 1/1 Running 0 2m7s
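Under the hood, the operator manages each of those hosts with its own StatefulSet, so another optional sanity check is to list them; you should see four, one per host:
kubectl get statefulset -n quick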
Let’s connect to one of the new pods:
kubectl exec -it chi-cluster01-cluster01-1-0-0 -n quick -- clickhouse-client
Run the following command to see what we have on the second shard:
SHOW DATABASES;
You’ll see something like this:
┌─name───────────────┐
1. │ INFORMATION_SCHEMA │
2. │ analytics │
3. │ default │
4. │ information_schema │
5. │ system │
└────────────────────┘
5 rows in set. Elapsed: 0.002 sec.
Our database is in there, as we’d expect. And the two tables we created earlier should be here:
USE analytics; SHOW TABLES;
There they are:
┌─name───────────────────┐
1. │ page_views_distributed │
2. │ page_views_replicated │
└────────────────────────┘
2 rows in set. Elapsed: 0.002 sec.
Seems legit…let’s take a look at our data:
SELECT *
FROM analytics.page_views_replicated;
We’ve got nothing:
Ok.
0 rows in set. Elapsed: 0.001 sec.
Wait, our data isn’t here? Well, it isn’t supposed to be. Remember, we’re distributing our data across multiple shards, so no data should be in more than one shard. Our data on the first shard is still alive and well, but we’ve got nothing here. The important thing is that our shard has the same databases and tables as our original shard.
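If you’d like to see the topology from ClickHouse’s point of view, you can query the system.clusters table from any host; with two shards of two replicas each, you should see four rows for cluster01 (the hostnames will vary with your environment):
SELECT cluster, shard_num, replica_num, host_name
FROM system.clusters
WHERE cluster = 'cluster01';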
We’ll talk about how to work across all our shards in a second, but first, let’s add some new data into the table on this shard:
INSERT INTO analytics.page_views_replicated (event_time, user_id, page_url, referrer_url, device, country) VALUES
('2025-01-02 15:30:00', 105, '/home', 'direct', 'desktop', 'Germany'),
('2025-01-03 11:45:00', 107, '/contact', 'direct', 'desktop', 'India'),
('2025-01-04 18:00:00', 102, '/products', 'google.com', 'tablet', 'Canada'),
('2025-01-05 21:30:00', 109, '/checkout', 'facebook.com', 'mobile', 'USA'),
('2025-01-09 17:20:00', 110, '/cart', 'direct', 'desktop', 'Australia');
And we can run SELECT to see our new (and different) data:
SELECT *
FROM analytics.page_views_replicated;
Our new data is there:
┌──────────event_time─┬─user_id─┬─page_url──┬─referrer_url─┬─device──┬─country───┐
1. │ 2025-01-02 15:30:00 │ 105 │ /home │ direct │ desktop │ Germany │
2. │ 2025-01-03 11:45:00 │ 107 │ /contact │ direct │ desktop │ India │
3. │ 2025-01-04 18:00:00 │ 102 │ /products │ google.com │ tablet │ Canada │
4. │ 2025-01-05 21:30:00 │ 109 │ /checkout │ facebook.com │ mobile │ USA │
5. │ 2025-01-09 17:20:00 │ 110 │ /cart │ direct │ desktop │ Australia │
└─────────────────────┴─────────┴───────────┴──────────────┴─────────┴───────────┘
5 rows in set. Elapsed: 0.007 sec.
So now the data in page_views_replicated is spread across two shards. That’s what we want for horizontal scaling, but we’d like our queries (and other SQL statements, for that matter) to run across all shards. That’s where the Distributed table we created earlier comes into play. Here’s the same query against page_views_distributed:
SELECT *
FROM analytics.page_views_distributed;
Now we get all the data from all the shards:
┌──────────event_time─┬─user_id─┬─page_url──┬─referrer_url─┬─device──┬─country───┐
1. │ 2025-01-02 15:30:00 │ 105 │ /home │ direct │ desktop │ Germany │
2. │ 2025-01-03 11:45:00 │ 107 │ /contact │ direct │ desktop │ India │
3. │ 2025-01-04 18:00:00 │ 102 │ /products │ google.com │ tablet │ Canada │
4. │ 2025-01-05 21:30:00 │ 109 │ /checkout │ facebook.com │ mobile │ USA │
5. │ 2025-01-09 17:20:00 │ 110 │ /cart │ direct │ desktop │ Australia │
└─────────────────────┴─────────┴───────────┴──────────────┴─────────┴───────────┘
┌──────────event_time─┬─user_id─┬─page_url──┬─referrer_url─┬─device──┬─country───┐
6. │ 2025-01-01 12:00:00 │ 101 │ /home │ google.com │ mobile │ USA │
7. │ 2025-01-01 12:00:00 │ 106 │ /home │ google.com │ mobile │ Nigeria │
8. │ 2025-01-01 12:05:00 │ 102 │ /products │ facebook.com │ desktop │ Canada │
9. │ 2025-01-01 12:10:00 │ 103 │ /cart │ twitter.com │ tablet │ UK │
10. │ 2025-01-02 14:00:00 │ 101 │ /checkout │ google.com │ mobile │ USA │
11. │ 2025-01-06 08:20:00 │ 110 │ /blog │ twitter.com │ desktop │ Australia │
└─────────────────────┴─────────┴───────────┴──────────────┴─────────┴───────────┘
11 rows in set. Elapsed: 0.003 sec.
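That fan-out works because page_views_distributed doesn’t store anything itself; it’s a Distributed-engine table that routes reads and writes to page_views_replicated on every shard. Its definition looks something like the sketch below, though the exact DDL you used earlier may differ (for example, in the sharding key); you can always check the real thing with SHOW CREATE TABLE analytics.page_views_distributed:
-- A sketch of a Distributed table over the replicated table on each shard.
-- rand() as the sharding key is an assumption; use whatever key you created it with.
CREATE TABLE analytics.page_views_distributed ON CLUSTER 'cluster01'
AS analytics.page_views_replicated
ENGINE = Distributed('cluster01', 'analytics', 'page_views_replicated', rand());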
👉 Type exit to end the clickhouse-client session.
So that’s how you work with shards and replication. When new data is inserted through the Distributed table, it goes to a randomly selected shard, and the new data is then copied to that shard’s replica. (We connected to a particular host to make a point, but normally you don’t pick a shard yourself.)
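In other words, for day-to-day writes you’d insert into the Distributed table and let ClickHouse route the row for you. Something like this (the row below is just a made-up example):
INSERT INTO analytics.page_views_distributed (event_time, user_id, page_url, referrer_url, device, country) VALUES
('2025-01-10 09:00:00', 111, '/home', 'direct', 'mobile', 'France');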
Again, you shouldn’t use shards until you’ve got a substantial amount of data, but now you know how to add a shard to your ClickHouse cluster, and how shards change the way you work with your data.